Re: Get position of first occurrence in search result

2014-06-24 Thread Tri Cao

It wouldn't be too hard to write a Solr plugin that takes a param docId together 
with a query and returns the position of that doc within the result list for 
that query. You will still need to deal with the performance, though. For 
example, if the doc ranks at one millionth, the plugin still needs to fetch at 
least 1M docs, and so the underlying collector still needs to sort through 1M 
documents.
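
A rough sketch of what such a plugin could look like, written as a SearchComponent (an editorial illustration against the Solr 4.x APIs, untested; the class name and the docId/maxRank parameters are hypothetical, and it assumes registration after QueryComponent so the parsed query and sort are available):

  import java.io.IOException;
  import org.apache.lucene.index.Term;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;
  import org.apache.solr.search.DocIterator;
  import org.apache.solr.search.DocList;
  import org.apache.solr.search.SolrIndexSearcher;

  public class PositionComponent extends SearchComponent {
    @Override
    public void prepare(ResponseBuilder rb) { /* nothing to prepare */ }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      String docId = rb.req.getParams().get("docId");        // hypothetical param
      if (docId == null) return;
      int maxRank = rb.req.getParams().getInt("maxRank", 1000000);
      SolrIndexSearcher searcher = rb.req.getSearcher();
      // internal Lucene id of the target document
      int target = searcher.getFirstMatch(new Term("id", docId));
      // collect up to maxRank hits using the query's own sort;
      // this is where the "still sorts through 1M docs" cost lives
      DocList docs = searcher.getDocList(rb.getQuery(), rb.getFilters(),
          rb.getSortSpec().getSort(), 0, maxRank, 0);
      int pos = 1;
      for (DocIterator it = docs.iterator(); it.hasNext(); pos++) {
        if (it.nextDoc() == target) { rb.rsp.add("position", pos); return; }
      }
      rb.rsp.add("position", -1);  // not within the first maxRank results
    }

    @Override
    public String getDescription() { return "position of a doc in the result list"; }

    @Override
    public String getSource() { return null; }
  }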

Maybe your business requirement is different, but does it really make a 
difference whether a document ranks at one thousandth or one millionth? For 
example, with Google SEO, either you rank in the first 3 (precision at 3) or 
you don't :) 

--Tri

On Jun 23, 2014, at 10:55 PM, Jorge Luis Betancourt Gonzalez 
jlbetanco...@uci.cu wrote:

Basically this is for analytical purposes: essentially we want to help people 
(whose sites we’ve indexed in our app) to find out for which particular terms 
(in theory related to their domain) they are poorly positioned in our index. 
Initially we’re starting with this basic “position per term”, but the idea is 
to elaborate further in this direction.

Could this position-finding logic be effectively abstracted into a plugin 
inside Solr? I guess it would be more efficient to iterate (or fire the two 
queries) from within Solr itself than in our app (written in PHP, so not so 
fast for some things), speeding things up.

Regards,

On Jun 24, 2014, at 1:42 AM, Aman Tandon amantandon...@gmail.com         
wrote:

        Jorge, I don't think that Solr provides this functionality; you have to
        iterate, and Solr is very fast at this. You can create a script which
        searches for the pattern (term) and pages through the records until it
        gets the record with the desired URL. I don't think 1/3 of a second to
        find it out is too much.

        As search-result analyses show, very few people request the second page
        for their query; most either leave the search or modify the query
        string. So I would rather suggest that if the website has appropriate,
        good data it should come up on the first page, and it is better to get
        onto the first page than to find the position.
        
        With Regards

        Aman Tandon
        
        
        On Tue, Jun 24, 2014 at 10:35 AM, Jorge Luis Betancourt Gonzalez 

        jlbetanco...@uci.cu         wrote:
        
                Yes, but I’m looking for the position of the url field of interest in the
                response of Solr. Solr matches the terms against the collection of
                documents and returns a list sorted by score; what I’m trying to do is get
                the position of a specific id in this sorted response. The response
                could be something like position: 5, or position: 500. To do this manually,
                suppose the response consists of a very large number of documents
                (webpages); in this case I would need to iterate over the complete response
                to find the position, which in the worst-case scenario could be on the last
                page, for instance. For this particular use case I’m not so interested in
                the URL field per se but more in the position a certain url has in the full
                Solr response.
                
                On Jun 24, 2014, at 12:31 AM, Walter Underwood wun...@wunderwood.org      

                wrote:
                
                        Solr is designed to do exactly this very, very fast. So there isn't a
                        faster way to do it. But you only need to fetch the URL field. You can
                        ignore everything else.
                        
                        wunder
                        
                        On Jun 23, 2014, at 9:32 PM, Jorge Luis Betancourt Gonzalez 

                jlbetanco...@uci.cu         wrote:
                        
                                Basically, given a few search terms (the query), the idea is to know
                                in which position your website is located for those specific terms.
                                
                                On Jun 24, 2014, at 12:12 AM, Aman Tandon amantandon...@gmail.com    

                wrote:
                                
                                        What kind of search criteria? Could you please explain.
                                        
                                        With Regards

                                        Aman Tandon
                                        
                                        
                                        On Tue, Jun 24, 2014 at 4:30 AM, Jorge Luis Betancourt Gonzalez 

                                        jlbetanco...@uci.cu         wrote:
                                        
                                                I’m using Solr for an analytic use case, one of the 

solr4.7.2 startup take too long time

2014-06-24 Thread hrdxwandg
Before I upgraded Solr to 4.7.2 I used Solr 3.6. When I started up Tomcat,
Solr started up quickly; the index size was 35G. After I upgraded Solr to
4.7.2, I rebuilt the index completely, and the size of the index is 16G. But
when I restart Tomcat, I find that Solr starts up too slowly, taking almost
10 minutes.

I do not know the reason, and ask for help. Thank you.





Re: solr4.7.2 startup take too long time

2014-06-24 Thread Alexandre Rafalovitch
Do you have any warming queries? Also, how do you measure the speed? What
do the boot log timestamps show for your index as opposed to, say,
an empty example index?

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency




Fwd: solr4.7.2 startup take too long time

2014-06-24 Thread Alexandre Rafalovitch
Forwarding to the mailing list.

-- Forwarded message --
From:  hrdxwa...@gmail.com
Date: Tue, Jun 24, 2014 at 2:15 PM
Subject: Re: solr4.7.2 startup take too long time

Thanks for your reply.
I do not warm any queries; the configuration is the default, as follows:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!--
      <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
      <lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>
    -->
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">static firstSearcher warming in solrconfig.xml</str>
    </lst>
  </arr>
</listener>

The speed measurement is fairly objective: I checked the Tomcat log and its
startup time. When I used Solr 3.6, after restarting Tomcat I could get the
front page in the Chrome browser right away, but with Solr 4.7.2 I must wait
for a long time.
In addition, the index is not empty: my index was 35G in Solr 3.6, and
it is 16G in Solr 4.7.2.

This issue is quite confusing.




Re: Bug in Collapsing QParserPlugin : Sort by 3 or more fields is broken

2014-06-24 Thread Umesh Prasad
Hi Joel,
   I had missed this email due to some issue with my Gmail settings.

The reason CollapsingQParserPlugin is more performant than regular grouping
is because:

1. The QParser refers to global ords for group.field and avoids storing
strings in a set. This has two advantages:
  a) In terms of memory, storing millions of ints instead of strings results
in major savings.
  b) No binary search/lookup is necessary when the segment changes,
resulting in huge computation savings.

2. The cost:
CollapsingFieldValue has to maintain a score/field value for each unique
ord.
   Memory requirement = number of ords * size of one field value.
   The basic types (byte, int, float, long, etc.) consume reasonable memory;
a String/Text value can be stored as an ord and will consume only 4 bytes.

The memory requirement arises because the arrays are dense, and it is per
request. Taking an example:
   Index size = 100 million documents
   Unique ords = 10 million
   Sort fields = 4 (1 int field + 1 long field + 2 string/text fields)
   Memory requirement = 40 MB for the int field + 80 MB for the long field
+ 80 MB for the string ords = 200 MB


I agree 200 MB per request just for collapsing the search results is huge,
but at least it increases linearly with the number of sort fields. For my use
case, I am willing to pay the linear cost, especially since I can't combine
the sort fields intelligently into a sort function. Plus it allows me to
sort by String/Text fields as well, which is a big win.

PS:
1. We can also store long/string fields as byte/short ords. For sort fields
where the number of unique values is small (for example, sort by date or
sales rank), this results in significant memory savings.

On 19 June 2014 19:40, Joel Bernstein joels...@gmail.com wrote:

 Umesh, this is a good summary.

 So, the question is what is the cost (performance and memory) of having the
 CollapsingQParserPlugin choose the group head by using the Solr sort
 criteria?

 Keep in mind that the CollapsingQParserPlugin's main design goal is to
 provide fast performance when collapsing on a high-cardinality field. How
 you choose the group head can have a big impact here, both on memory
 consumption and on performance.

 The function query collapse criteria was added to allow you to come up with
 custom formulas for selecting the group head, with little or no impact on
 performance and memory. Using Solr's recip() function query it seems like
 you could come up with some nice scenarios where two variables could be
 used to select the group head. For example:

 fq={!collapse field=a max='sub(prod(cscore(),1000), recip(field(x),1, 1000,
 1000))'}

 This seems like it would basically give you two sort criteria: cscore(),
 which returns the score, would be the primary criterion. The recip of field
 x would be the secondary criterion.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Thu, Jun 19, 2014 at 2:18 AM, Umesh Prasad umesh.i...@gmail.com
 wrote:

  Continuing the discussion from Jira on the mailing list.
 
  An example:

  id  group  f1    f2
  1   g1      5    10
  2   g1      5  1000
  3   g1      5  1000
  4   g1     10   100
  5   g2      5    10
  6   g2      5  1000
  7   g2      5  1000
  8   g2     10   100

  sort = f1 asc, f2 desc, id desc
 
 
  *Without collapsing, the sort will give:*
  (7,g2), (6,g2), (3,g1), (2,g1), (5,g2), (1,g1), (8,g2), (4,g1)

  *On collapsing by group_s, the expected output is:* (7,g2), (3,g1)

  Solr standard collapsing does give this output with
  group=on&group.field=group_s&group.main=true

  *Collapsing with CollapsingQParserPlugin* (fq={!collapse field=group_s})
  gives: (5,g2), (1,g1)
 
 
 
  *Summarizing the Jira discussion:*
  1. CollapsingQParserPlugin picks the group heads from the matching results
  and passes those further, in essence filtering out some of the matching
  documents so that subsequent collectors never see them. It can also pass
  the score on to subsequent collectors using a dummy scorer.

  2. TopDocCollector comes later in the hierarchy and sorts the collapsed
  set. That works fine.

  The issue is with step 1. Collapsing is done by a single comparator, which
  can take its value from a field or function and defaults to score.
  Function queries do allow us to combine multiple fields/value sources;
  however, it would be difficult to construct a function for given sort
  fields, primarily because:
      a) The range of values for a given sort field is not known in advance.
  It is possible for one sort field to be unbounded while another is bounded
  within a small range.
      b) A sort field can itself hold custom logic.

  Because of (a), the group head selected by CollapsingQParserPlugin will be
  incorrect and subsequent sorting will break.
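
  For reference, both collapse forms under discussion also appear in the test
  log quoted later in this digest (one collapses on score, the other on a
  field/function value):

      fq={!collapse field=group_s}
      fq={!collapse field=group_s nullPolicy=expand min=test_tf}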
 
 
 
  On 14 June 

CollapsingQParserPlugin throws Exception when useFilterForSortedQuery=true

2014-06-24 Thread Umesh Prasad
Hi ,
Found another bug with CollapsingQParserPlugin. Not a critical one.

It throws an exception when used with

<useFilterForSortedQuery>true</useFilterForSortedQuery>

Patch attached (against 4.8.1 but reproducible in other branches also)


518 T11 C0 oasc.SolrCore.execute [collection1] webapp=null path=null
params={q=*%3A*fq=%7B%21collapse+field%3Dgroup_s%7DdefType=edismaxbf=field%28test_ti%29}
hits=2 status=0 QTime=99
4557 T11 C0 oasc.SolrCore.execute [collection1] webapp=null path=null
params={q=*%3A*fq=%7B%21collapse+field%3Dgroup_s+nullPolicy%3Dexpand+min%3Dtest_tf%7DdefType=edismaxbf=field%28test_ti%29sort=}
hits=4 status=0 QTime=15
4587 T11 C0 oasc.SolrException.log ERROR
java.lang.UnsupportedOperationException: Query  does not implement
createWeight
at org.apache.lucene.search.Query.createWeight(Query.java:80)
at
org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:684)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
at
org.apache.solr.search.SolrIndexSearcher.getDocSetScore(SolrIndexSearcher.java:879)
at
org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:902)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1381)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:478)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:461)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1952)
at org.apache.solr.util.TestHarness.query(TestHarness.java:295)
at org.apache.solr.util.TestHarness.query(TestHarness.java:278)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:676)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:669)
at
org.apache.solr.search.TestCollapseQParserPlugin.testCollapseQueries(TestCollapseQParserPlugin.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1618)
at
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:827)
at
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:863)
at
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:877)
at
com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53)
at
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
at
org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51)
at
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
at
com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
at
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
at
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
at
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
at
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:360)
at
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:793)
at
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:453)
at
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:836)
at
com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:738)
at
com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:772)
at
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:783)
at
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at
com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53)
at
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
at

RequestHandler init failure

2014-06-24 Thread atp
Hi,

I'm getting the below error while trying to access SolrCloud.

I tried adding these two lines to the solrconfig.xml file:

  <lib dir="../../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
  <lib dir="../../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

The actual location is /opt/apps/prod/solr/dist; all the required jars below
are available there:


solr-4.8.1.war                           solr-map-reduce-4.8.1.jar
solr-analysis-extras-4.8.1.jar           solr-morphlines-cell-4.8.1.jar
solr-cell-4.8.1.jar                      solr-morphlines-core-4.8.1.jar
solr-clustering-4.8.1.jar                solr-solrj-4.8.1.jar
solr-core-4.8.1.jar                      solr-test-framework-4.8.1.jar
solr-dataimporthandler-4.8.1.jar         solr-uima-4.8.1.jar
solr-dataimporthandler-extras-4.8.1.jar  solr-velocity-4.8.1.jar
solrj-lib                                test-framework
solr-langid-4.8.1.jar




But I am still getting the same issue. Any help, please?



HTTP Status 500 - {msg=SolrCore 'collection1' is not available due to init
failure: RequestHandler init
failure,trace=org.apache.solr.common.SolrException: SolrCore 'collection1'
is not available due to init failure: RequestHandler init failure at
org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:753) at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:347)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040)
at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:314)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at
org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745) Caused by:
org.apache.solr.common.SolrException: RequestHandler init failure at
org.apache.solr.core.SolrCore.init(SolrCore.java:858) at
org.apache.solr.core.SolrCore.init(SolrCore.java:641) at
org.apache.solr.core.CoreContainer.create(CoreContainer.java:556) at
org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:261) at
org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:253) at
java.util.concurrent.FutureTask.run(FutureTask.java:262) at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at
java.util.concurrent.FutureTask.run(FutureTask.java:262) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more Caused by: org.apache.solr.common.SolrException: RequestHandler
init failure at
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:167)
at org.apache.solr.core.SolrCore.init(SolrCore.java:785) ... 10 more
Caused by: org.apache.solr.common.SolrException: Error Instantiating Request
Handler, solr.DataImportHandler failed to instantiate
org.apache.solr.request.SolrRequestHandler at
org.apache.solr.core.SolrCore.createInstance(SolrCore.java:559) at
org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:611) at
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:153)
... 11 more Caused by: java.lang.ClassCastException: class
org.apache.solr.handler.dataimport.DataImportHandler at
java.lang.Class.asSubclass(Class.java:3165) at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:484)
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:421)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:538) ... 13
more ,code=500}





Re: No results for a wildcard query for text_general field in solr 4.1

2014-06-24 Thread Sven Schönfeldt
Hi Erick,

that is what i did, tried that input on analysis page. 

The index analyzer splits the value into two words: „test“ and „or123“.
Checking the query on the analysis page, the word is likewise split
into „test“ and „or123“.

Running the query and looking at the debug output, however, I see that there
is no splitting of words. That is what I expected…

<str name="rawquerystring">searchField_t:test\-or123*</str>
<str name="querystring">searchField_t:test\-or123*</str>
<str name="parsedquery">searchField_t:test-or123*</str>
<str name="parsedquery_toString">searchField_t:test-or123*</str>

Without the wildcard, the word is also split into two parts:

<str name="rawquerystring">searchField_t:test\-or123</str>
<str name="querystring">searchField_t:test\-or123</str>
<str name="parsedquery">searchField_t:test searchField_t:or123</str>
<str name="parsedquery_toString">searchField_t:test searchField_t:or123</str>

Any idea which configuration is responsible for this behavior?

Thanks!


Am 23.06.2014 um 22:55 schrieb Erick Erickson erickerick...@gmail.com:

 Well, you can do more than guess by looking at the admin/analysis page
 and trying your input on the field in question. That'll show you what
 actual transformations are performed.
 
 You're probably right though. Try adding debug=query to your URL to
 see what the actual parsed query looks like and compare with the
 admin/analysis page
 
 But yeah, it's a matter of getting all the parts (query parser and
 analysis chains) to do the right thing.
 
 Best,
 Erick
 
 On Mon, Jun 23, 2014 at 7:30 AM, Sven Schönfeldt
 schoenfe...@subshell.com wrote:
 Hi Solr-Users,
 
 I am trying to do a wildcard query on a dynamic text field (_t), but don’t 
 get the right result.
 The field type is „text_general“, with the default configuration:
 
 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
 
 
 The input for the text field is test-or123, and my query looks like
 test\-or*“.

 It seems that the input is already split into two words: „test“ and
 „or123“, but that's just a guess.

 Can anyone help me, and explain why I don’t find the document and what to
 do to make the query work?
 
 Regards!
 
 
 



Re: solr4.7.2 startup take too long time

2014-06-24 Thread hrdxwandg
I am new to Nabble; I am sorry that I performed the wrong operation.
I will repeat my answer here.

There are no warming queries in my solrconfig.xml, as follows:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">static firstSearcher warming in solrconfig.xml</str>
    </lst>
  </arr>
</listener>

The startup speed measurement is fairly objective: I checked the startup
time in the Tomcat log. When I used Solr 3.6, after restarting Tomcat I
could get the front page quickly using the Chrome browser. But I must wait
for a long time when I use Solr 4.7.2.

In addition, the index is not empty: in Solr 3.6 the index size was 35G,
and in Solr 4.7.2 I rebuilt the index and its size is 16G.







RAMDirectoryFactory setting on replication slave

2014-06-24 Thread Lee Chunki
Hi Guys,

As far as I know, the RAMDirectoryFactory setting does not work with replication.
( 
https://cwiki.apache.org/confluence/display/solr/DataDir+and+DirectoryFactory+in+SolrConfig
 )

By the way, can I use it for replication slave nodes (not the master),
or for SolrCloud?

Thanks,
Chunki.



Re: No results for a wildcard query for text_general field in solr 4.1

2014-06-24 Thread Ahmet Arslan
Hi Sven,

StandardTokenizerFactory splits it into two pieces; you can confirm this on
the analysis page.
If this is something you don't want, let us know, and we can help you
create an analysis chain that suits your needs.
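
For instance (an editorial illustration, not a recommendation from the thread), a chain that keeps hyphenated tokens intact by splitting on whitespace only could look like this:

<fieldType name="text_keep_hyphens" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With such a chain, test-or123 is indexed as the single token test-or123, so a wildcard query like test\-or* can match it.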

Ahmet





DIH on Solr

2014-06-24 Thread atp
Hi experts,

We have a requirement to import data from HBase tables into Solr. We tried
with the help of the DataImportHandler, but we couldn't find the
configuration steps or documentation for the DataImportHandler for HBase.
Can anybody please share the steps to configure it?

We tried a basic configuration, but a full import throws the error below.
Please share docs or links on configuring DIH for an HBase table.

6/24/2014 3:44:00 PM  WARN   ZKPropertiesWriter
Could not read DIH properties from
/configs/collection1/dataimport.properties :class
org.apache.zookeeper.KeeperException$NoNodeException

6/24/2014 3:44:00 PM  ERROR  DataImporter
Full Import failed:java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
load EntityProcessor implementation for entity:msg Processing Document # 1


Thanks in advance







Re: RequestHandler init failure

2014-06-24 Thread Ahmet Arslan
Hi,

It looks like you have jars from a different version than your solr.war?

Ahmet 






Re: DIH on Solr

2014-06-24 Thread Ahmet Arslan
Hi,

There is no DataSource or EntityProcessor for HBase, I think.

Maybe http://www.lilyproject.org/lily/index.html works for you?

Ahmet





TokenFilter not working at index time

2014-06-24 Thread Erlend Garåsen


I'm trying to create a Norwegian lemmatizer based on a dictionary, but 
for some odd reason I don't get any search results, even though the 
Analyzer in the Solr Admin shows that it does the right thing. It works at 
query time if I have reindexed everything based on another stemmer, e.g. 
NorwegianMinimalStemmer.


Here's a screenshot of how it lemmatizes the Norwegian word studenter 
(masculine indefinite noun, plural; English: students). The stem is 
student. So far so good:

http://folk.uio.no/erlendfg/solr/lemmatizer.png

But I get no/few results if I search for studenter compared to 
student. If I switch to solr.NorwegianMinimalStemFilterFactory in 
schema.xml at index time and reindex everything, it works as it should:

<analyzer type="index">
  <filter class="solr.NorwegianMinimalStemFilterFactory" variant="no"/>

What is wrong with my TokenFilter, and/or how can I debug this further? I 
have tried a lot of different things without any luck, for example 
decoding everything explicitly to UTF-8 (the wordlist is in ISO-8859-1, but 
I'm reading it properly by setting the correct character set) and trimming 
all the words, without any help. The byte sequence also seems to be 
correct for the stemmed word: my lemmatizer shows [73 74 75 64 65 6e 
74], exactly the same as when I have configured 
NorwegianMinimalStemFilterFactory in schema.xml.


Here's the source code of my lemmatizer. Please note that it is not 
finished:

http://folk.uio.no/erlendfg/solr/

Here's the line in my wordlist which contains the word studenter:
66235   student studenter   subst mask appell fl ub normert 700 3

The following line returns the stem (input is studenter):
final String[] values = stemmer.stem(termAtt.buffer());

The rest of the code is in NorwegianLemmatizerFilter. If several stems 
are returned, they are all added.
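
One possibility worth ruling out (an editorial observation, not a finding from the thread): CharTermAttribute.buffer() returns the attribute's backing char[], which is usually longer than the current token, so a stemmer that reads the whole array can see stale trailing characters. A hedged sketch of the safer call:

  // pass only the first length() chars of the buffer to the stemmer,
  // since buffer() may contain stale characters beyond the token's length
  final char[] term = java.util.Arrays.copyOf(termAtt.buffer(), termAtt.length());
  final String[] values = stemmer.stem(term);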


Erlend


How to add some more documents to an existing index file

2014-06-24 Thread Pai, Gurunath (GE Corporate, consultant)
I have an index which contains data from a MySQL database; I created this 
index using the DataImportHandler of Solr. My requirement is: suppose I add 
a new row to the database, I want that row updated in my existing Solr 
index. I don't have any idea how to add the new record from the database to 
Solr. Do I need to re-index everything again, or is even a single-record 
update possible?

Thanks & Regards
Gurunath Pai.


Re: Solr alternates returning different versions of the same document

2014-06-24 Thread yann
Hi Erik,

Thanks. If it helps, I eventually fixed the problem by deleting the
documents by id (via an HTTP request), which apparently deleted all the
versions everywhere, then re-creating the documents via the admin interface
(update, CSV). This seems to have left only one version of each document.

Yann





Re: Evaluate function only on subset of documents

2014-06-24 Thread Costi Muraru
Thanks, guys, for your answers.
Sorry for the query syntax errors in my previous queries.

Chris, you've been really helpful. Indeed, point 3 is the one I'm trying to
solve, rather than 2.
You're saying that BooleanScorer will "consult the clauses in order based
on which clause says it can skip the most documents".
I think this might be the culprit for me.

Let's take this query sample:
XXX OR AAA AND {!frange ...}

For my use case:
AAA returns a subset of 100k documents.
frange returns 5k documents, all part of these 100k documents.

Therefore, frange skips the most documents. From what you are saying,
frange is going to be applied to all documents (since it skips the most
documents) and AAA is going to be applied to the subset. This is kind of
what I originally noticed. My goal is to have this in the reverse order,
since frange is much more expensive than AAA.
I was hoping to do so by specifying the cost, saying "Hey, frange has
cost 100 while AAA has cost 1, so run AAA first and then run frange on the
subset." However, this does not seem to be taken into consideration.
Does this make sense / am I getting something wrong? Is there something I
can do to achieve this?

Thanks,
Costi


On Tue, Jun 24, 2014 at 4:23 AM, Chris Hostetter hossman_luc...@fucit.org
wrote:

 : Now, if I want to make a query that also contains some OR, it is
 impossible
 : to do so with this approach. This is because fq with OR operator is not
 : supported (SOLR-1223). As an alternative I've tried these queries:
 :
  : county='New York' AND (location:Maylands OR location:Holliscort OR
  : parking:yes) AND _val_:{!frange u=0 cost=150
  cache=false}mycustomfunction()

 1) most of the examples you've posted have syntax errors in them that are
 probably throwing a wrench into your testing.  in this example county='New
 York' is not valid syntax, presumably you want conty='New Your'

 2) Based on the example you give, what you're trying to do here doesn't
 really depend on using SHOULD (ie: OR) type logic against the frange:
 the only disjunction you have is in a sub-query of a top-level
 conjunction (ie: all required) ... the frange itself is still mandatory.

 so you could still use it as a non-cached postfilter just like in your
 previous example:

 q=+XXX +(YYY ZZZ)&fq={!frange cost=150 cache=false ...}


 3) If that query wasn't exactly what you meant, and your top-level query is
 more complex, containing a mix of MUST, MUST_NOT, and SHOULD clauses, ie:

 q=+XXX YYY ZZZ -AAA +{!frange ...}

 ...then the internal behavior of BooleanQuery will automatically do what
 you want (no need for cache or cost params on the fq) to the best
 of its ability, because of how the evaluation of boolean clauses is
 re-ordered internally based on the next match.

 it's kind of complicated to explain, but the short version is:

 a) BooleanScorer will avoid asking any clause if it matches a document
 which has already been disqualified by another clause
 b) BooleanScorer will consult the clauses in order based on which clause
 says it can skip the most documents

 So you might see your custom function evaluated for some docs that
 ultimately don't match, but if there are more rare mandatory clauses
 in your BQ that tell Lucene it can skip over a large number of docs,
 then your custom function will be skipped.

 This is how BooleanQuery has always worked, but i just committed a test to
 verify it even when wrapping a FunctionRangeQuery...

 https://svn.apache.org/r1604990


 4) the extreme of #3 is that if you need to use the {!frange} as part of
 a full disjunction, ie:

q=XXX OR YYY OR {!frange ...}

 ...then it would be impossible for Solr to only execute the expensive
 function against the subset of documents that match the query -- because
 BooleanScorer won't be able to tell which documents match the query unless
 it evaluates the function (it's a catch-22).   even if every doc does not
 match either XXX or YYY, solr has to evaluate the function against every
 doc to see if that function *makes* the document match the entire query.






 -Hoss
 http://www.lucidworks.com/



Slow QTimes - 5 seconds for Small sized Collections

2014-06-24 Thread RadhaJayalakshmi
I am running Solr 4.5.1. Here is how my setup looks:

I have 2 modest-sized collections:
Collection 1 - 2 shards, 3 replicas (Shard 1: 115 MB, Shard 2: 55 MB)
Collection 2 - 2 shards, 3 replicas (Shard 1: 3.5 GB, Shard 2: 1 GB)
These two collections are distributed across:
6 Tomcat nodes set up on 3 VMs (2 nodes per VM)
Each of the 6 Tomcat nodes has an Xms/Xmx setting of 2 GB
Each of the 3 VMs has 32 GB of physical memory (RAM)

As you can see, my collections are pretty small. This is actually a test
environment (NOT production); however, my users (only a handful of testers)
are complaining of sporadic performance issues with search.

Here are my observations from the application logs:
1) Out of 200 sample searches across both collections - 13 requests are slow
(3 slow responses on Collection 1 and 10 slow responses on Collection 2).

2) When things run fast, they are really fast (QTimes of 25 - 100
milliseconds), but when things are slow, the QTime consistently hovers
around the 5-second (5000 millisecond) mark. I am seeing responses on the
order of 5024, 5094, 5035 ms, as though something just hung for 5 seconds.
I am observing this 5-second delay on both collections, which I feel is
unusual, because they contain very different data sets. I am unable to
figure out what's causing the QTime to be so consistent around the 5-second
mark.

3) I build my index only once. I did try running an optimize on both
Collection 1 and Collection 2 after the users complained. I did notice that
after the optimize the segment count on each of the four shards came down,
but that still didn't resolve the slowness of the searches (I was hoping
it would).

4) I am looking at the Solr Dashboard for more clues. My Tomcat nodes are
definitely NOT running out of memory; the 6 nodes are consuming anywhere
between 500 MB and 1 GB of RAM.

5) The file descriptor counts are under control; I can only see a maximum of
100 file descriptors being used out of a total of 4096.

6) The Solr dashboard is, however, showing that 0.2% (or 9.8 MB) of swap
space is being consumed on one of the 3 VMs. Is this a concern?

7) I also looked at the Plugin/Stats for every core on the Solr Dashboard. I
can't see any evictions happening in any of the caches; it's always ZERO.

Has anyone encountered such an issue? What else should I be looking for to
debug my problem?

Thanks





Does one need to perform an optimize soon after doing a batch indexing using SolrJ ?

2014-06-24 Thread RadhaJayalakshmi
I am using Solr 4.5.1. I have two collections:
Collection 1 - 2 shards, 3 replicas (Shard 1: 115 MB, Shard 2: 55 MB)
Collection 2 - 2 shards, 3 replicas (Shard 1: 3.5 GB, Shard 2: 1 GB)

I have a batch process that performs indexing (a full refresh) once a week
on the same index.

Here is some information on how I index:
a) I use SolrJ's bulk add API for indexing: CloudSolrServer.add(Collection
docs).
b) I have an autoCommit (hard commit) setting for both my collections
(solrconfig.xml):

<autoCommit>
  <maxDocs>10</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

c) I do a programmatic hard commit at the end of the indexing cycle, with
openSearcher=true, so that the documents show up in the search results (see
the SolrJ sketch after this list).
d) I neither soft commit programmatically nor have any autoSoftCommit
settings during the batch indexing process.
e) When I re-index all my data again (the following week) into the same
index, I don't delete the existing docs; I just re-index into the same
collection.
f) I am using the default mergeFactor of 10 in my solrconfig.xml:
<mergeFactor>10</mergeFactor>
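
A sketch of the end-of-cycle calls described in (c), using SolrJ (illustrative; "server" stands for the CloudSolrServer from (a)):

  // after the last batch of server.add(docs):
  server.commit(true, true);  // hard commit; waitSearcher=true makes docs visible
  // only if a single-segment index is required after the batch:
  server.optimize();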

Here is what I am observing:
1) After a batch indexing cycle, the segment counts for each shard/core are
pretty high. The Solr Dashboard reports segment counts between 8 and 30
segments on the various cores.
2) Sometimes the Solr Dashboard shows the status of my core as NOT
OPTIMIZED. I find this unusual, since I have just finished a batch indexing
cycle and would assume that the index should already be optimized. Is this
happening because I don't delete my docs before re-indexing all my data?
3) After I run an optimize on my collections, the segment count does reduce
significantly, to 1 segment.

Am I indexing the right way? Is there a better strategy?

Is it necessary to perform an optimize after every batch indexing cycle?

The outcome I am looking for is an optimized index after every major batch
indexing cycle.

Thanks!!





Re: Slow QTimes - 5 seconds for Small sized Collections

2014-06-24 Thread Toke Eskildsen
On Tue, 2014-06-24 at 14:26 +0200, RadhaJayalakshmi wrote:
 Here are my observations from the application logs:
 1) Out of 200 sample searches across both collections - 13 requests are slow
 (3 slow responses on Collection 1 and 10 slow responses on Collection 2).
 
 2) When things run fast - they are really fast (Qtimes of 25 - 100
 milliseconds) - but when things are slow - I can see that the QTime
 consistently hovers around the 5 second (or 5000 millisecond mark). I am
 seeing responses of the order of 5024, 5094, 5035 ms - as though something
 just hung for 5 seconds.

We have a strange recurring pattern where the first search every hour on
the hour takes about 4 seconds, where standard response time is 400ms.
That is for a single shard Solr server, running in Tomcat.

Can you check if your slow response times are at the start of every full
hour?

 6) The Solr dashboard is however showing that 0.2% (or 9.8MB) of Swap Space
 being consumed on one of the 3 VMs. Is this a concern ?

Swap in itself is of no concern. Swapping out unused memory blocks is a
feature. As long as the machine rarely accesses the swap file, it is
working as intended.

- Toke Eskildsen, State and University Library, Denmark




Re: How to add some more documents to an existing index file

2014-06-24 Thread Erik Hatcher
Single document update is quite possible!   No worries there.

Since you’re using DIH (the Data Import Handler), you can use the delta-import 
command; see 
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-DataImportHandlerCommands

You’ll need some way to determine what a “new” document is. DIH provides the 
last-indexed timestamp, which you can leverage in the delta query configuration 
to pick up documents changed since that time.
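
A minimal data-config sketch of that pattern (illustrative only; the item table and last_modified column are made-up names):

<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE id = '${dih.delta.id}'">
  ...
</entity>

Running /dataimport?command=delta-import then picks up only the rows changed since the last run.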

Erik





Re: How to add some more documents to an existing index file

2014-06-24 Thread gurunath
I have found the answer to the above query: use the delta-import handler.
But if I am going to use the delta import handler of Solr, then I need to
add a last-modified column to the database table. Is it possible to achieve
the same without altering the database table?





Re: solr4.7.2 startup take too long time

2014-06-24 Thread Shawn Heisey
On 6/24/2014 12:51 AM, hrdxwandg wrote:
 Before i upgrade solr to 4.7.2, i use solr3.6.where i startup tomcat, the
 solr is started up quickly,the index size is 35G. After i upgrade solr to
 4.7.2. i rebuild the index totally. and the size of index is 16G. But when i
 restart the tomcat, i found that solr is startedup too slowly, almost take
 about 10 minutes.

When you upgraded, what did you change in the config?  One thing I am
looking for specifically in this situation is the updateLog config.  It
is a good idea to turn this on in the new version, but depending on how
you do your commits, this may make restarts take a very long time.
Here's a wiki page that discusses the possible problem in some detail:

http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup
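
For reference, the usual pattern from that page (values are illustrative): keep the update log enabled, but pair it with a hard autoCommit that has openSearcher=false, so individual transaction logs stay small and there is little to replay at startup:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>

<autoCommit>
  <maxTime>15000</maxTime>           <!-- hard commit every 15 seconds -->
  <openSearcher>false</openSearcher> <!-- commits stay invisible to searchers -->
</autoCommit>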

Thanks,
Shawn



Aggregate functions in Solr entity Queries.

2014-06-24 Thread wenky
Hi,
   I'm new to Solr and would like to index my database. It is working fine
for plain columns, but in Solr I have one query which takes the average
value from the database, and the average is not being saved in Solr.
Below is my sample dataconfig.xml:
<dataSource ... />
<document>
  <entity name="doctor" query="***">
    <!-- fields/columns -->

    <entity name="count" query="select count(*) from table">
      <field column="count" name="countValue" />
    </entity>

    ..
  </entity>
</document>
countValue always returns zero.
Can anyone help me in this regard?






RE: running Post jar from different server

2014-06-24 Thread EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)
Yes, the localhost is replaced with the right Solr URL; the pasted one is a 
test URL.

After debugging, we found the actual problem: the path to the XML files was 
not correct.

Thanks for all the support.

--Ravi

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Monday, June 23, 2014 2:34 PM
To: solr-user@lucene.apache.org
Subject: Re: running Post jar from different server

You said that SQLDB and Solr are on different servers and that you are running 
post.jar from a network drive mapped to your SQLDB. If so, then why are you 
trying to post to localhost? That would resolve to the SQLDB host where Solr is 
not running.

Instead of using localhost in the -Durl part of your command line, use the full 
hostname or IP address of your Solr server.


On Mon, Jun 23, 2014 at 11:04 PM, EXTERNAL Taminidi Ravi (ETI,
Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com wrote:

 Hi, does anyone have any reference for this type of execution?

 -Original Message-
 From: EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions) [mailto:
 external.ravi.tamin...@us.bosch.com]
 Sent: Friday, June 20, 2014 1:46 PM
 To: solr-user@lucene.apache.org
 Subject: RE: running Post jar from different server

 Hi Sameer, thanks for looking at the post. Below are the two variables 
 read from the XML file in my tool:

 <add key="JavaPath" value="%JAVA_HOME%\bin\java.exe" />
 <add key="JavaArgument" value=" -Xms128m -Xmx256m -Durl=http://localhost:8983/solr/{0}/update -jar F:/DataDump/Tools/post.jar" />

 On the command line it is something like:

 C:\DataImport\bin\java.exe -Xms128m -Xmx256m -Durl=http://localhost:8983/solr/DataCollection/update -jar F:/DataDump/Tools/post.jar F:/DatFiles/*.xml

 F:\ is the network drive.

 Thanks
 Ravi

 -Original Message-
 From: Sameer Maggon [mailto:sam...@measuredsearch.com]
 Sent: Thursday, June 19, 2014 10:02 PM
 To: solr-user@lucene.apache.org
 Subject: Re: running Post jar from different server

 Ravi,

 post.jar is a standalone utility that does not have to be on the same 
 server. If you can share the command you are executing, there might be 
 some pointers in there.

 Thanks,
 --
 *Sameer Maggon*
 http://measuredsearch.com


 On Thu, Jun 19, 2014 at 8:54 PM, EXTERNAL Taminidi Ravi (ETI,
 Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com wrote:

  Hi, I have a situation where my SQL job initiates a console
  application that calls post.jar to upload data to Solr. The SQL DB
  and Solr are on two different servers.
 
  I am calling post.jar from my SQL DB server, where the path is mapped
  to a network drive, and I am getting a "file not found" error.
 
  Is the above scenario possible? If anyone has experience with this,
  please share; any direction will be really appreciated.
 
  Thanks
 
  Ravi
 




--
Regards,
Shalin Shekhar Mangar.


Re: Get position of first occurrence in search result

2014-06-24 Thread Walter Underwood
How fast does it need to be?

I've done this sort of thing for relevance evaluation with a driver in Python:
send the query, requesting 10 or 100 hits in JSON and only the URL field (fl
parameter); iterate through them until the URL matches; if it doesn't match,
request more; then print the number.

Try it, I bet you would be surprised at how fast it is. You can run several 
copies of this script in parallel, maybe 10 or 20.

Writing this in Solr seems like hitting a fly with a hammer. It is 
over-engineering. Build it in PHP first, even if you do want to do it in Solr.

wunder
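
For illustration, a minimal SolrJ sketch of the kind of driver Walter describes (his was Python; the core URL, the query, and the "url" field name here are assumptions, not from the thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class RankFinder {

    // Returns the 1-based rank of targetUrl in the result list, or -1 if absent.
    static long findRank(HttpSolrServer server, String queryString, String targetUrl)
            throws SolrServerException {
        final int pageSize = 100;                  // request 100 hits at a time
        SolrQuery q = new SolrQuery(queryString);
        q.setFields("url");                        // fetch only the URL field (fl)
        q.setRows(pageSize);
        long rank = 0;
        long numFound = Long.MAX_VALUE;            // corrected after the first page
        for (int start = 0; start < numFound; start += pageSize) {
            q.setStart(start);
            QueryResponse rsp = server.query(q);
            numFound = rsp.getResults().getNumFound();
            for (SolrDocument doc : rsp.getResults()) {
                rank++;
                if (targetUrl.equals(doc.getFieldValue("url"))) {
                    return rank;                   // found it: this is the position
                }
            }
        }
        return -1;                                 // never matched
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        System.out.println(findRank(server, "content:studenter", "http://example.com/page"));
    }
}

Several copies of this can be run in parallel, as suggested above.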

On Jun 23, 2014, at 11:20 PM, Tri Cao tm...@me.com wrote:

 It wouldn't be too hard to write a Solr Plugin that take a param docId 
 together with a query and return the position of that doc within the result 
 list for that query. You will still need to deal with the performance though. 
 For example, if the doc ranks at one millionth, the plugin still needs to get 
 at least 1M docs, and so the underlying collector still needs to sort through 
 1M documents.
 
 Maybe your business requirement is different, but does it really make a 
 difference if a document ranks at one thousandth or one millionth? For 
 example with Google SEO, either you rank in the first 3 (precision at 3) or 
 you don't :) 
 
 --Tri
 
 On Jun 23, 2014, at 10:55 PM, Jorge Luis Betancourt Gonzalez 
 jlbetanco...@uci.cu wrote:
 
 Basically this is for analytical purposes, essentially we want to help 
 people (which sites we’ve indexed in our app) to find out for which 
 particular terms (in theory related with their domain) they are bad 
 positioned in our index. Initially we’re starting with this basic “position 
 per term” but the idea is to elaborate further in this direction.
 
 This logic por position finding could be abstracted effectively in a plugin 
 inside Solr? I guess it would be more efficient to iterate (or fire the 2 
 queries) from within solr itself than in our app (written in PHP, so not so 
 fast for some things) speeding up things?
 
 Regards,
 
 On Jun 24, 2014, at 1:42 AM, Aman Tandon amantandon...@gmail.com 
 wrote:
 
  Jorge, i don't think that solr provide this functionality, you 
 have to
  iterate and solr is very fast in this, you can create a script for 
 that
  which search for pattern(term) and parse(request) the records 
 until get the
  record of that desired url, i don't thing 1/3 seconds time to find 
 out is
  more.
  
  As per the search result analysis, there are very few people who 
 request
  for the second page for their query, otherwise mostly leave the 
 search or
  modify query string. So i better suggest you that the if the 
 website has
  the appropriate and good data it should come on first page, so its 
 better
  to come on first page rather than finding the position.
  
  With Regards
  Aman Tandon
  
  
  On Tue, Jun 24, 2014 at 10:35 AM, Jorge Luis Betancourt Gonzalez 
  jlbetanco...@uci.cu wrote:
  
  Yes, but I’m looking for the position of the url field of 
 interest in the
  response of solr. Solr matches the terms against the 
 collection of
  documents and returns sorted list by score, what I’m 
 trying to do is get
  the position of the a specific id in this sorted 
 response. The response
  could be something like position: 5, or position 500. To 
 do this manually
  suppose the response consists of a very large amount of 
 documents
  (webpages) in this case I would need to iterate over the 
 complete response
  to find the position, which in a worst case scenario 
 could be in the last
  page for instance. For this particular use case I’m not 
 so interested in
  the URL field per se but more on the position a certain 
 url has in the full
  solr response.
  
  On Jun 24, 2014, at 12:31 AM, Walter Underwood 
 wun...@wunderwood.org  
  wrote:
  
  Solr is designed to do exactly this very, very 
 fast. So there isn't a
  faster way to do it. But you only need to fetch the URL 
 field. You can
  ignore everything else.
  
  wunder
  
  On Jun 23, 2014, at 9:32 PM, Jorge Luis 
 Betancourt Gonzalez 
  jlbetanco...@uci.cu wrote:
  
  Basically given a few search terms 
 (query) the idea is to know given
  one or more terms in which position your website is 
 located for those
  specific terms.
  
   

RE: POST Vs GET

2014-06-24 Thread EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)
I have a few questions before that.
Do you mean that running Jetty in production is good enough? Will all clustering
and load balancing be taken care of?
Can we run Jetty as a service on Windows Server?
Won't security be a problem if we use Jetty?

I am under the impression that Tomcat is more robust at handling all of
this.

Maybe I can think about running Jetty in production.

--Ravi

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Monday, June 23, 2014 2:30 PM
To: solr-user@lucene.apache.org
Subject: Re: POST Vs GET

Why don't you just use the jetty shipped with Solr? It has all the correct 
defaults. In future, we may not even support shipping a war file.
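
As an aside on the original size question: SolrJ can also send a query as a POST without touching the container at all. A minimal sketch (the server URL and query string are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // A 10-12 line query with lots of boosts would go here.
        SolrQuery q = new SolrQuery("some very long query ...");
        // METHOD.POST puts the parameters in the request body instead of the URL,
        // sidestepping URL-length limits.
        QueryResponse rsp = server.query(q, SolrRequest.METHOD.POST);
        System.out.println(rsp.getResults().getNumFound());
    }
}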


On Mon, Jun 23, 2014 at 11:07 PM, EXTERNAL Taminidi Ravi (ETI,
Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com wrote:

 Hi, I am executing a Solr query that runs 10 to 12 lines with all the
 boosting and conditions. I changed the HTTP method from GET to POST,
 as POST doesn't have any restriction on size, but I am getting an
 error. I am using Tomcat 7; is there any place in Tomcat where we
 need to specify that it should accept POST?

 FYI, from my Jetty Solr setup everything works fine.

 Thanks

 Ravi




--
Regards,
Shalin Shekhar Mangar.


Re: Block Join Not Working - what am I doing wrong?

2014-06-24 Thread Vinay B,
Okay, Let me try again.

1. Here is some sample SolrJ code that creates a parent and child document
(I hope)
https://gist.github.com/anonymous/d03747661ef03923de74

2. I tried a block join query which didn't return any results (I tried the
Block Join Parent Query Parser approach described in this link:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers). I expected
to get back the parent doc of a child which has ATTRIBUTES.STATE:TX, which
I did not. That is what I'm trying to figure out.

Thanks

http://localhost:8088/solr/test_core/select?q={!parent
which="content_type:parentDocument"}ATTRIBUTES.STATE:TX&wt=json&indent=true

(
equivalent to
http://localhost:8088/solr/test_core/select?q=%7b!parent+which%3d%22content_type%3aparentDocument%22%7dATTRIBUTES.STATE%3aTX%26wt%3djson%26indent%3dtrue
)

Resulting in:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">
        {!parent which="content_type:parentDocument"}ATTRIBUTES.STATE:TX&wt=json&indent=true
      </str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>




On Mon, Jun 23, 2014 at 4:04 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Well, what  do you mean by not working? You might review:
 http://wiki.apache.org/solr/UsingMailingLists

 Best,
 Erick

 On Mon, Jun 23, 2014 at 12:20 PM, Vinay B, vybe3...@gmail.com wrote:
  Hi,
  I've been trying to experiment with block joins and parent / child docs
 as
  described in this thread (input described in my first post of the thread,
  .. and block join in my second post, as per the suggestions given). What
  else am I missing?
 
  Thanks
 
 
 http://lucene.472066.n3.nabble.com/Why-aren-t-my-nested-documents-nesting-tt4142702.html#none



Re: RAMDirectoryFactory setting on replication slave

2014-06-24 Thread Erick Erickson
Please don't. At least not until you prove that this
is where your bottleneck is. You haven't described
what you're trying to fix by making such a change.

Solr/Lucene already does a _lot_ of work to keep the relevant bits of
the index in memory. Additionally, the defaults use
MMapDirectory, which makes use of the OS cache
to hold yet more of the index in memory, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

This feels like an XY problem, what is the reason you're
interested?

Best,
Erick

On Tue, Jun 24, 2014 at 1:25 AM, Lee Chunki lck7...@coupang.com wrote:
 Hi Guys,

 As I know RAMDirectoryFactory setting does not work with replication.
 ( 
 https://cwiki.apache.org/confluence/display/solr/DataDir+and+DirectoryFactory+in+SolrConfig
  )

  By the way, can I use it for replication slave nodes (not master),
  or for SolrCloud?

 Thanks,
 Chunki.



Re: TokenFilter not working at index time

2014-06-24 Thread Erick Erickson
Hmmm. It would help if you posted a couple of other pieces of
information. BTW, if this is new code, are you considering donating it
back? If so, please open a JIRA so we can track it, see:
http://wiki.apache.org/solr/HowToContribute

But to your question, the first couple of things I'd do:
1) See what the admin/analysis page tells you happens.
2) Attach debug=query to your test case and see what the parsed
   query looks like.
3) Use the admin/schema browser link for the field in question
   to see what actually makes it into the index. (Or use Luke or
   even the TermsComponent.)

My bet is that 2) or 3) will show something unexpected, which may
give you some clues.

Best,
Erick

On Tue, Jun 24, 2014 at 5:00 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

 I'm trying to create a Norwegian lemmatizer based on a dictionary, but for
 some odd reason I don't get any search results even though the Analysis page
 in Solr Admin shows that it does the right thing. It works at query time if I
 have reindexed everything based on another stemmer, e.g.
 NorwegianMinimalStemmer.

 Here's a screenshot of how it lemmatizes the Norwegian word "studenter"
 (masculine indefinite noun, plural; English: "students"). The stem is
 "student". So far so good:
 http://folk.uio.no/erlendfg/solr/lemmatizer.png

 But I get no/few results if I search for "studenter" compared to "student".
 If I switch to solr.NorwegianMinimalStemFilterFactory in schema.xml at index
 time and reindex everything, it works as it should:

 <analyzer type="index">
   <filter class="solr.NorwegianMinimalStemFilterFactory" variant="no"/>

 What is wrong with my TokenFilter and/or how can I debug this further? I
 have tried a lot of different things without any luck, for example decoding
 everything explicitly to UTF-8 (the wordlist is in ISO-8859-1, but I'm
 reading it properly by setting the correct character set) and trimming all
 the words, without any help. The byte sequence also seems to be correct for
 the stemmed word. My lemmatizer shows [73 74 75 64 65 6e 74], exactly the
 same as when I have configured NorwegianMinimalStemFilterFactory in schema.xml.

 Here's the source code of my lemmatizer. Please note that it is not
 finished:
 http://folk.uio.no/erlendfg/solr/

 Here's the line in my wordlist which contains the word studenter:
 66235   student studenter   subst mask appell fl ub normert 700 3

 The following line returns the stem (input is studenter):
 final String[] values = stemmer.stem(termAtt.buffer());

 The rest of the code is in NorwegianLemmatizerFilter. If several stems are
 returned, they are all added.

 Erlend


Re: No results for a wildcard query for text_general field in solr 4.1

2014-06-24 Thread Erick Erickson
Wildcards are a tough thing to get your head around. I
think my first post on the users list was titled
"I just don't get wildcards at all" or something like that...

Right, wildcards aren't tokenized. Because your term gets through
query parsing as a single token, including the hyphen, and because
the analyzer sees that it's a wildcard, it doesn't break on the
hyphen. So it's looking for a single token, and since there is no
single term like "test-or123" in the index, you get no matches.

I'm afraid this is just how it works. You can do something like
replacing the hyphen at the app layer, but I don't think there's
a way to do what you want OOB.

Best,
Erick
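
For what it's worth, a rough sketch of that app-layer workaround (the field name and the split-on-hyphen rule are assumptions; note that it trades the original single-token semantics for an AND over the pieces):

public class WildcardRewrite {

    // Split a hyphenated prefix term roughly the way StandardTokenizer would,
    // require every piece, and keep the wildcard on the last piece.
    // Note: this loses the adjacency/order guarantee of the original term.
    static String rewrite(String field, String term) {
        String[] parts = term.toLowerCase().split("-");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('+').append(field).append(':').append(parts[i]);
            if (i == parts.length - 1) sb.append('*'); // wildcard on the last piece
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: +searchField_t:test +searchField_t:or*
        System.out.println(rewrite("searchField_t", "test-or"));
    }
}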

On Tue, Jun 24, 2014 at 1:55 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:
 Hi Sven,

 StandardTokenizerFactory splits it into two pieces. You can confirm this on
 the analysis page. If this is something you don't want, let us know; we can
 help you create an analysis chain that suits your needs.

 Ahmet


 On Tuesday, June 24, 2014 10:39 AM, Sven Schönfeldt 
 schoenfe...@subshell.com wrote:
 Hi Erick,

 That is what I did; I tried that input on the analysis page.

 At index time the field splits the value into two words: "test" and "or123".
 Checking the query on the analysis page, the word is likewise split into
 "test" and "or123".

 But when I run the query and look at the debug output, I see that there is
 no splitting of words. That's what I expect:

 <str name="rawquerystring">searchField_t:test\-or123*</str>
 <str name="querystring">searchField_t:test\-or123*</str>
 <str name="parsedquery">searchField_t:test-or123*</str>
 <str name="parsedquery_toString">searchField_t:test-or123*</str>

 Without the wildcard, the word is split into two parts:

 <str name="rawquerystring">searchField_t:test\-or123</str>
 <str name="querystring">searchField_t:test\-or123</str>
 <str name="parsedquery">searchField_t:test searchField_t:or123</str>
 <str name="parsedquery_toString">searchField_t:test searchField_t:or123</str>

 Any idea which configuration is responsible for that behavior?

 Thanks!





 On 23.06.2014 at 22:55, Erick Erickson erickerick...@gmail.com wrote:

 Well, you can do more than guess by looking at the admin/analysis page
 and trying your input on the field in question. That'll show you what
 actual transformations are performed.

 You're probably right though. Try adding debug=query to your URL to
 see what the actual parsed query looks like and compare with the
 admin/analysis page

 But yeah, it's a matter of getting all the parts (query parser and
 analysis chains) to do the right thing.

 Best,
 Erick

 On Mon, Jun 23, 2014 at 7:30 AM, Sven Schönfeldt
 schoenfe...@subshell.com wrote:
 Hi Solr-Users,

 i am trying to do a wildcard query on a dynamic textfield (_t), but don’t 
 get the right result.
 The configuration for the field type is „text_general“, the default 
 configuration:

 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>


 Input for the text field is "test-or123" and my query looks like "test\-or*".

 It seems that the input is already split into two words, "test" and
 "or123", but that's just a guess.

 Can anyone help me and explain why I don't find the document, and what
 to do to make the query work?

 Regards!






Re: Solr alternates returning different versions of the same document

2014-06-24 Thread Erick Erickson
Thanks for letting us know.

Erick

On Tue, Jun 24, 2014 at 5:25 AM, yann yannick.lallem...@gmail.com wrote:
 Hi Erick,

 Thanks - if it helps, I eventually fixed the problem by deleting the
 documents by id (via an HTTP request), which apparently deleted all the
 versions everywhere, then re-creating the documents via the admin interface
 (update, CSV). This seems to have left only one version of each document.

 Yann



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-alternates-returning-different-versions-of-the-same-document-tp4143006p4143680.html
 Sent from the Solr - User mailing list archive at Nabble.com.


OOM during indexing nested docs

2014-06-24 Thread adfel70
Hi,

I am getting an OOM while indexing 400 million docs (each with 7-20 nested
children). Memory usage keeps climbing during indexing until it reaches 24g,
and after the OOM, when indexing stops, the memory stays at 24g. *Seems like
a leak.*


*Solr & Collection info:*
Solr 4.8, 6 shards, 1 replica per shard, 24g for the JVM

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/OOM-during-indexing-nested-docs-tp4143722.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: TokenFilter not working at index time

2014-06-24 Thread Ahmet Arslan
Hi Erlend,

After a quick look: I have implemented a similar TokenFilter that injects
several tokens at the same position.

Please see the source code of Zemberek2DeasciifyFilter in
https://github.com/iorixxx/lucene-solr-analysis-turkish

You can insert your line, final String[] values =
stemmer.stem(termAtt.buffer());, into it.

Another note: you can use o.a.l.analysis.util.CharArrayMap<String> instead of
a Map<String,String> wordlist, for efficiency.

Please see TurkishDeasciifyFilter for example usage.

Let us know if that works for you.

Ahmet
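
As an aside, a rough sketch of the CharArrayMap idea (the map size and dictionary contents here are made up); the point is that the lookup can use the term's char[] buffer directly, with no String allocated per token:

import org.apache.lucene.analysis.util.CharArrayMap;
import org.apache.lucene.util.Version;

public class DictLookup {
    public static void main(String[] args) {
        // wordform -> stems; keys are stored as char[], so lookups need no String
        CharArrayMap<String[]> dict =
                new CharArrayMap<String[]>(Version.LUCENE_48, 1024, /*ignoreCase*/ false);
        dict.put("studenter", new String[]{"student"});

        char[] termBuffer = "studenter".toCharArray(); // stand-in for termAtt.buffer()
        String[] stems = dict.get(termBuffer, 0, termBuffer.length);
        System.out.println(stems == null ? "no entry" : stems[0]); // prints "student"
    }
}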


On Tuesday, June 24, 2014 3:00 PM, Erlend Garåsen e.f.gara...@usit.uio.no 
wrote:

I'm trying to create a Norwegian Lemmatizer based on a dictionary, but 
for some odd reason I don't get any search results even thought the 
Analyzer in Solr Admin shows that it does the right thing. It works at 
query time if I have reindexed everything based on another stemmer, e.g. 
NorwegianMinimalStemmer.

Here's a screenshot of how it lemmatizes the Norwegian word studenter 
(masculine indefinite noun, plural - English: students). The stem is 
student. So far so good:
http://folk.uio.no/erlendfg/solr/lemmatizer.png

But I get no/few results if I search for studenter compared to 
student. If I switch to solr.NorwegianMinimalStemFilterFactory in 
schema.xml at index time and reindexes everything, it works as it should:
analyzer type=index
   filter class=solr.NorwegianMinimalStemFilterFactory variant=no/

What is wrong with my TokenFilter and/or how can I debug this further? I 
have tried a lot of different things without any luck, for example 
decode everything explicitly to UTF8 (the wordlist is in iso-8859-1, but 
I'm reading it properly by setting the correct character set) and trim 
all the words without any help. The byte sequence also seems to be 
correct for the stemmed word. My lemmatizer shows [73 74 75 64 65 6e 
74], exactly the same as when I have configured 
NorwegianMinimalStemFilterFactory in schema.xml.

Here's the source code of my lemmatizer. Please note that it is not 
finished:
http://folk.uio.no/erlendfg/solr/

Here's the line in my wordlist which contains the word studenter:
66235    student    studenter    subst mask appell fl ub normert    700    3

The following line returns the stem (input is studenter):
final String[] values = stemmer.stem(termAtt.buffer());

The rest of the code is in NorwegianLemmatizerFilter. If several stems 
are returned, they are all added.

Erlend



Re: Block Join Not Working - what am I doing wrong?

2014-06-24 Thread Mikhail Khludnev
Did you run the underlying query ATTRIBUTES.STATE:TX on its own? Does it
return anything?


On Tue, Jun 24, 2014 at 6:59 PM, Vinay B, vybe3...@gmail.com wrote:

 Okay, Let me try again.

 1. Here is some sample SolrJ code that creates a parent and child document
 (I hope)
 https://gist.github.com/anonymous/d03747661ef03923de74

 2. I tried a block join query which didn't return any results (I tried the
 Block Join Parent Query Parser approach described in this link
 https://cwiki.apache.org/confluence/display/solr/Other+Parsers). I
 expected
 to get back the parent doc of a child which has ATTRIBUTES.STATE:TX, which
 I did not , That is what I'm trying to figure out.

 Thanks

 http://localhost:8088/solr/test_core/select?q={!parent
 which=content_type:parentDocument}ATTRIBUTES.STATE:TXwt=jsonindent=true

 (
 equivalent to

 http://localhost:8088/solr/test_core/select?q=%7b!parent+which%3d%22content_type%3aparentDocument%22%7dATTRIBUTES.STATE%3aTX%26wt%3djson%26indent%3dtrue
 )

 Resulting in
 response
 lst name=responseHeader
 int name=status0/int
 int name=QTime1/int
 lst name=params
 str name=q
 {!parent
 which=content_type:parentDocument}ATTRIBUTES.STATE:TXwt=jsonindent=true
 /str
 /lst
 /lst
 result name=response numFound=0 start=0/
 /response




 On Mon, Jun 23, 2014 at 4:04 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Well, what  do you mean by not working? You might review:
  http://wiki.apache.org/solr/UsingMailingLists
 
  Best,
  Erick
 
  On Mon, Jun 23, 2014 at 12:20 PM, Vinay B, vybe3...@gmail.com wrote:
   Hi,
   I've been trying to experiment with block joins and parent / child docs
  as
   described in this thread (input described in my first post of the
 thread,
   .. and block join in my second post, as per the suggestions given).
 What
   else am I missing?
  
   Thanks
  
  
 
 http://lucene.472066.n3.nabble.com/Why-aren-t-my-nested-documents-nesting-tt4142702.html#none
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Does one need to perform an optimize soon after doing a batch indexing using SolrJ ?

2014-06-24 Thread Erick Erickson
Your indexing process looks fine, there's no reason to
change it.

Optimizing is _probably_ not necessary at all. In fact, in the 4.x
world it was renamed forceMerge to make it seem less
attractive (I mean, who wouldn't want an "optimized" index?).

That said, the batch indexing process has nothing at all to
do with optimization. Nothing in the process of adding docs
to a server will trigger an optimize.

In your case, since your index only changes once a week it
will help your performance a little (but perhaps so little you won't
notice) to optimize after the batch index is done.

In short, your process seems fine. Indexes are never optimized
unless you explicitly do it. After all, how would Solr know that
you are done with your batch indexing?

Best,
Erick
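
To make the "only if you explicitly do it" point concrete, a minimal SolrJ sketch of the weekly cycle (the ZooKeeper hosts and collection name are placeholders); the optimize happens only because it is called:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class WeeklyBatch {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        // ... fill the batch and call server.add(batch) repeatedly;
        // autoCommit (openSearcher=false) handles durability along the way ...
        server.add(batch);

        server.commit(true, true);  // final explicit hard commit; new docs become searchable
        server.optimize();          // entirely optional; Solr never does this on its own
        server.shutdown();
    }
}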

On Tue, Jun 24, 2014 at 5:32 AM, RadhaJayalakshmi
rlakshminaraya...@inautix.co.in wrote:
 I am using Solr 4.5.1. I have two collections:
 Collection 1 - 2 shards, 3 replicas (Size of Shard 1 - 115 MB, Size of
 Shard 2 - 55 MB)
 Collection 2 - 2 shards, 3 replicas (Size of Shard 1 - 3.5 GB, Size of
 Shard 2 - 1 GB)

 I have a batch process that performs indexing (full refresh) - once a week
 on the same index.

 Here is some information on how I index:
 a) I use SolrJ's bulk ADD API for indexing -
 CloudSolrServer.add(Collection<SolrInputDocument> docs).
 b) I have an autoCommit (hard commit) setting for both my Collections
 (solrconfig.xml):
 <autoCommit>
   <maxDocs>10</maxDocs>
   <openSearcher>false</openSearcher>
 </autoCommit>
 c) I do a programmatic hard commit at the end of the indexing cycle - with
 openSearcher=true - so that the documents show up in the search results.
 d) I neither programmatically soft commit nor have any autoSoftCommit
 settings during the batch indexing process.
 e) When I re-index all my data again (the following week) into the same
 index, I don't delete existing docs; rather, I just re-index into the same
 Collection.
 f) I am using the default mergeFactor of 10 in my solrconfig.xml:
 <mergeFactor>10</mergeFactor>

 Here is what I am observing:
 1) After a batch indexing cycle, the segment count for each shard/core
 is pretty high. The Solr Dashboard reports segment counts between 8 and 30
 segments on the various cores.
 2) Sometimes the Solr Dashboard shows the status of my core as NOT
 OPTIMIZED. This I find unusual, since I have just finished a batch indexing
 cycle and would assume that the index should already be optimized. Is
 this happening because I don't delete my docs before re-indexing all my
 data?
 3) After I run an optimize on my Collections, the segment count does reduce
 significantly - to 1 segment.

 Am I doing indexing the right way ? Is there a better strategy ?

 Is it necessary to perform an optimize after every batch indexing cycle ??

 The outcome I am looking for is that I need an optimized index after every
 major Batch Indexing cycle.

 Thanks!!



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Does-one-need-to-perform-an-optimize-soon-after-doing-a-batch-indexing-using-SolrJ-tp4143686.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Does one need to perform an optimize soon after doing a batch indexing using SolrJ ?

2014-06-24 Thread Michael Della Bitta
Hi,

You don't need to optimize just based on segment counts. Solr doesn't
optimize automatically because often it doesn't improve things enough to
justify the computational cost of optimizing. You shouldn't optimize unless
you do a benchmark and discover that optimizing improves performance.

If you're just worried about the segment count, you can tune that in
solrconfig.xml and Solr will merge down your index on the fly as it indexes.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Jun 24, 2014 at 8:32 AM, RadhaJayalakshmi 
rlakshminaraya...@inautix.co.in wrote:

 I am using Solr 4.5.1. I have two collections:
 Collection 1 - 2 shards, 3 replicas (Size of Shard 1 - 115
 MB, Size of Shard 2 - 55 MB)
 Collection 2 - 2 shards, 3 replicas (Size of Shard 2 - 3.5
 GB, Size of Shard 2 - 1 GB)

 I have a batch process that performs indexing (full refresh) - once a week
 on the same index.

 Here is some information on how I index:
 a) I use SolrJ's bulk ADD API for indexing - CloudSolrServer.add(Collection
 docs).
 b) I have an autoCommit (hardcommit) setting of for both my Collections
 (solrConfig.xml):
 autoCommit
 maxDocs10/maxDocs

 openSearcherfalse/openSearcher
 /autoCommit
 c) I do a programatic hardcommit at the end of the indexing cycle - with an
 open searcher of true - so that the documents show up on the Search
 Results.
 d) I neither programatically soft commit (nor have any autoSoftCommit
 seetings) during the batch indexing process
 e) When I re-index all my data again (the following week) into the same
 index - I don't delete existing docs. Rather, I just re-index into the same
 Collection.
 f) I am using the default mergefactor of 10 in my solrconfig.xml
 mergeFactor10/mergeFactor

 Here is what I am observing:
 1) After a batch indexing cycle - the segment counts for each shard / core
 is pretty high. The Solr Dashboard reports segment counts between 8 - 30
 segments on the variousr cores.
 2) Sometimes the Solr Dashboard shows the status of my Core as - NOT
 OPTIMIZED. This I find unusual - since I have just finished a Batch
 indexing
 cycle - and would assume that the Index should already be optimized - Is
 this happening because I don't delete my docs before re-indexing all my
 data
 ?
 3) After I run an optimize on my Collections - the segment count does
 reduce
 to significantly - to 1 segment.

 Am I doing indexing the right way ? Is there a better strategy ?

 Is it necessary to perform an optimize after every batch indexing cycle ??

 The outcome I am looking for is that I need an optimized index after every
 major Batch Indexing cycle.

 Thanks!!



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Does-one-need-to-perform-an-optimize-soon-after-doing-a-batch-indexing-using-SolrJ-tp4143686.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: No results for a wildcard query for text_general field in solr 4.1

2014-06-24 Thread Jack Krupansky
I think I am officially tired of having to explain why Solr doesn't do what 
users expect for this query. I mean, I can accept that low-level Lucene 
should work strictly on the decomposed terms of test-or*, but it is 
very reasonable for users (even EXPERT users) to expect that the Solr query 
parser will generate what the complex phrase query parser generates.


See:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

Having to use a separate query parser for this obvious, common case is... 
absurd.


(What does Elasticsearch do for this case??)

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Tuesday, June 24, 2014 11:38 AM
To: solr-user@lucene.apache.org ; Ahmet Arslan
Subject: Re: No results for a wildcard query for text_general field in solr 
4.1


Wildcards are a tough thing to get your head around. I
think my first post on the users list was titled
I just don't get wildcards at all or something like that...

Right, wildcards aren't tokenized. So by getting your term
through the query parsing as a single token, including the
hyphen, when the analyzer sees that it's a wildcard
it doesn't break on the hyphen. So it's looking for a single
token. And since there is not single term like
test-or123 you get no matches.

I'm afraid this is just how it works. You can do something like
replace the hyphen at the app layer. But I don't think there's
a way to do what you want OOB.

Best,
Erick

On Tue, Jun 24, 2014 at 1:55 AM, Ahmet Arslan iori...@yahoo.com.invalid 
wrote:

Hi Sven,

StandardTokenizerFactory splits it into two pieces. You can confirm this 
at analysis page.

If this is something you don't want, lets us know.
We can help you to create an analysis chain that suits your needs.

Ahmet


On Tuesday, June 24, 2014 10:39 AM, Sven Schönfeldt 
schoenfe...@subshell.com wrote:

Hi Erick,

that is what i did, tried that input on analysis page.

The index field splitting the value into two words: „test“ and „or123
Now checking the query at analysis page, and there are the word ist 
splitting into „test“ and „or123“.


By doing the query and look into the debug result, i see that there is no 
splitting of words. Thats what i expect…


str name=rawquerystringsearchField_t:test\-or123*/str
str name=querystringsearchField_t:test\-or123*/str
str name=parsedquerysearchField_t:test-or123*/str
str name=parsedquery_toStringsearchField_t:test-or123*/str

Without the wildcard, the word is splitting also in two parts:

str name=rawquerystringsearchField_t:test\-or123/str
str name=querystringsearchField_t:test\-or123/str
str name=parsedquerysearchField_t:test searchField_t:or123/str
str name=parsedquery_toStringsearchField_t:test 
searchField_t:or123/str


Any idea which configuration has the responsibility for that behavior?

Thanks!





Am 23.06.2014 um 22:55 schrieb Erick Erickson erickerick...@gmail.com:


Well, you can do more than guess by looking at the admin/analysis page
and trying your input on the field in question. That'll show you what
actual transformations are performed.

You're probably right though. Try adding debug=query to your URL to
see what the actual parsed query looks like and compare with the
admin/analysis page

But yeah, it's a matter of getting all the parts (query parser and
analysis chains) to do the right thing.

Best,
Erick

On Mon, Jun 23, 2014 at 7:30 AM, Sven Schönfeldt
schoenfe...@subshell.com wrote:

Hi Solr-Users,

i am trying to do a wildcard query on a dynamic textfield (_t), but don’t 
get the right result.
The configuration for the field type is „text_general“, the default 
configuration:


fieldType name=text_general class=solr.TextField 
positionIncrementGap=100

 analyzer type=index
   tokenizer class=solr.StandardTokenizerFactory/
   filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords.txt enablePositionIncrements=true /

   filter class=solr.LowerCaseFilterFactory/
 /analyzer
 analyzer type=query
   tokenizer class=solr.StandardTokenizerFactory/
   filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords.txt enablePositionIncrements=true /
   filter class=solr.SynonymFilterFactory synonyms=synonyms.txt 
ignoreCase=true expand=true/

   filter class=solr.LowerCaseFilterFactory/
 /analyzer
   /fieldType


Input for the textfield is test-or123 and my query looks like 
test\-or*“.


It seems that the input is allready split into two words: „test“ and 
„or123“, but that's just a guess.


Anyone who can help me, and know why i don’t find the document and whats 
todo to make the quert working?


Regards!









Re: Slow QTimes - 5 seconds for Small sized Collections

2014-06-24 Thread Erick Erickson
That is strange indeed. The usual culprit is that there is a commit
in there and no autowarming, so you see pauses when the first
query hits after a commit. But you say you only build the index once
which would seem to rule that out.

I'd be interested in what is in your Solr logs around the time
in question. Say 10,000 lines leading up to a slow query (10,000
lines is completely arbitrary, hopefully it's enough to see something
interesting).

Best,
Erick

On Tue, Jun 24, 2014 at 5:26 AM, RadhaJayalakshmi
rlakshminaraya...@inautix.co.in wrote:
 I am running Solr 4.5.1. Here is how my setup looks:

 Have 2 modest-sized Collections.
 Collection 1 - 2 shards, 3 replicas (Size of Shard 1 - 115 MB, Size of
 Shard 2 - 55 MB)
 Collection 2 - 2 shards, 3 replicas (Size of Shard 1 - 3.5 GB, Size of
 Shard 2 - 1 GB)
 These two collections are distributed across:
 6 Tomcat nodes set up on 3 VMs (2 nodes per VM)
 Each of the 6 Tomcat nodes has an Xms/Xmx setting of 2 GB
 Each of the 3 VMs has a physical memory (RAM) of 32 GB

 As you can see, my Collections are pretty small - this is actually a test
 environment (and NOT production). However, my users (only a handful of
 testers) are complaining of sporadic performance issues with search.

 Here are my observations from the application logs:
 1) Out of 200 sample searches across both collections - 13 requests are slow
 (3 slow responses on Collection 1 and 10 slow responses on Collection 2).

 2) When things run fast - they are really fast (Qtimes of 25 - 100
 milliseconds) - but when things are slow - I can see that the QTime
 consistently hovers around the 5 second (or 5000 millisecond mark). I am
 seeing responses of the order of 5024, 5094, 5035 ms - as though something
 just hung for 5 seconds. I am observing this 5 second delay on both
 Collections - which I feel is unusual - because both contain very different
 data sets. I am unable to figure out what's causing the QTime to be so
 consistent around the 5-second mark.

 3) I build my index only once. I did try running an optimize on both
 Collection 1 and Collection 2 after the users complained - I did notice that
 post the optimize the segment count on each of the four shards did come down
 - but that still didn't resolve the slowness on the searches (I was hoping
 it would).

 4) I am looking at the Solr Dashboard for more clues - my Tomcat nodes are
 definitely NOT running out of memory - the 6 nodes are consuming anywhere
 between 500 MB and 1 GB of RAM.

 5) The File Descriptor counts are under control - can only see a maximum of
 100 file descriptors being used of a total of 4096

 6) The Solr dashboard is however showing that 0.2% (or 9.8MB) of Swap Space
 being consumed on one of the 3 VMs. Is this a concern ?

 7) Also looked at the Plugin / Stats for every core on the Solr Dashboard. I
 can't see any evictions happening in any of the caches - Its always ZERO.

 Has anyone encountered such an issue ? What else should I be looking for to
 debug my problem ?

 Thanks



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Slow-QTimes-5-seconds-for-Small-sized-Collections-tp4143681.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: TokenFilter not working at index time

2014-06-24 Thread Dmitry Kan
By quickly looking at it, I think you have unreachable code in the
NorwegianLemmatizerFilter class (certainly, attaching a debugger would be
your best bet):

@Override
public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
        if (!keywordAttr.isKeyword()) {
            final String[] values = stemmer.stem(termAtt.buffer());
            if (values == null || values.length == 0) {
                return false;
            } else {
                termAtt.setEmpty().append(values[0]);
                if (values.length > 1) {
                    for (int i = 1; i < values.length; i++) {
                        terms.add(values[i]);
                    }
                }
                return true;
            }
        }
        return false;
    } else if (!terms.isEmpty()) {
        termAtt.setEmpty().append(terms.poll()); // I don't think this will exhaust the terms
                                                 // queue in full for this token, because on the
                                                 // next call to the incrementToken() method,
                                                 // input.incrementToken() is called
        return true;
    } else {
        return false;
    }
}


Instead I would do something like this:

[code]
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Iterator;

private Iterator<String> iterator;

@Override
public boolean incrementToken() throws IOException {
    String nextStem = next();
    if (nextStem == null)
        return false;
    // chain the stems;
    // if this is undesired, you can put them into the same position by restoring previous state
    termAtt.setEmpty();
    termAtt.append(nextStem);
    termAtt.setLength(nextStem.length());
    return true;
}

public String next() throws IOException {
    if ((iterator == null) || (!iterator.hasNext())) {
        if (!input.incrementToken())
            return null;

        char[] buffer = termAtt.buffer();
        if (buffer == null || buffer.length == 0)
            return null;

        final String tokenTerm = new String(buffer, 0, termAtt.length());
        final String lcTokenTerm = tokenTerm.toLowerCase();

        Collection<String> stems = new ArrayList<String>();
        Collections.addAll(stems, stemmer.stem(lcTokenTerm));

        iterator = stems.iterator();
    }

    if (iterator.hasNext()) {
        String next = iterator.next();
        if (next != null) {
            return next;
        }
    }
    return null;
}
[/code]


On Tue, Jun 24, 2014 at 3:00 PM, Erlend Garåsen e.f.gara...@usit.uio.no
wrote:


 I'm trying to create a Norwegian Lemmatizer based on a dictionary, but for
 some odd reason I don't get any search results even thought the Analyzer in
 Solr Admin shows that it does the right thing. It works at query time if I
 have reindexed everything based on another stemmer, e.g.
 NorwegianMinimalStemmer.

 Here's a screenshot of how it lemmatizes the Norwegian word studenter
 (masculine indefinite noun, plural - English: students). The stem is
 student. So far so good:
 http://folk.uio.no/erlendfg/solr/lemmatizer.png

 But I get no/few results if I search for studenter compared to
 student. If I switch to solr.NorwegianMinimalStemFilterFactory in
 schema.xml at index time and reindexes everything, it works as it should:
 analyzer type=index
   filter class=solr.NorwegianMinimalStemFilterFactory variant=no/

 What is wrong with my TokenFilter and/or how can I debug this further? I
 have tried a lot of different things without any luck, for example decode
 everything explicitly to UTF8 (the wordlist is in iso-8859-1, but I'm
 reading it properly by setting the correct character set) and trim all the
 words without any help. The byte sequence also seems to be correct for the
 stemmed word. My lemmatizer shows [73 74 75 64 65 6e 74], exactly the same
 as when I have configured NorwegianMinimalStemFilterFactory in schema.xml.

 Here's the source code of my lemmatizer. Please note that it is not
 finished:
 http://folk.uio.no/erlendfg/solr/

 Here's the line in my wordlist which contains the word studenter:
 66235   student studenter   subst mask appell fl ub normert 700 3

 The following line returns the stem (input is studenter):
 final String[] values = stemmer.stem(termAtt.buffer());

 The rest of the code is in NorwegianLemmatizerFilter. If several stems are
 returned, they are all added.

 Erlend




-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


limit solr results before join

2014-06-24 Thread Kevin Stone
Is there any way to limit the results of a query on the from index before it 
gets joined?

The SQL analogy might be:

SELECT *
FROM toIndex JOIN
  (SELECT * FROM fromIndex
   WHERE <some query>
   LIMIT 1000
  ) fromIndex ON fromIndex.from = toIndex.to


Example:
_query_:"{!join fromIndex=expressionData from=anatomyID to=anatomyID 
v='(anatomy:\"brain\")'}"

Say I have an index representing data for gene expression (we work with 
genetics), and you query it by anatomy term. So the above would query for all 
data that shows gene expression in brain.

Now I want to get a set of related data for each anatomy term via the join. Is 
there any way to get the related data for only anatomy terms in the first 1000 
expression data documents (fromIndex)? The reason is because there could be 
millions of data documents (fromIndex), and we process them in batches to load 
a visualization of the query results.

Doing the join on all the results for each batch I process is becoming a 
bottleneck for large sets of data.

Thanks,
-Kevin

The information in this email, including attachments, may be confidential and 
is intended solely for the addressee(s). If you believe you received this email 
by mistake, please notify the sender by return email as soon as possible.


Re: Does one need to perform an optimize soon after doing a batch indexing using SolrJ ?

2014-06-24 Thread Jack Krupansky
The one exception that we should always note: if your batch includes 
deletion of existing documents, an optimize can be appropriate, since the 
term frequencies stored by Lucene may be off while the deleted documents 
still count toward term statistics.


Is this exception noted in the Solr ref guide?

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Tuesday, June 24, 2014 11:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Does one need to perform an optimize soon after doing a batch 
indexing using SolrJ ?


Your indexing process looks fine, there's no reason to
change it.

Optimizing is _probably_ unnecessary at all. In fact in the 4.x
world it was changed to forceMerge to make it seem less
attractive (I mean, who wouldn't want an optimized index?)

That said, the batch indexing process has nothing at all to
do with optimization. Nothing in the process of adding docs
to a server will trigger an optimize.

In your case, since your index only changes once a week it
will help your performance a little (but perhaps so little you won't
notice) to optimize after the batch index is done.

In short, your process seems fine. Indexes are never optimized
unless you explicitly do it. After all, how would Solr know that
you are done with your batch indexing?

Best,
Erick

On Tue, Jun 24, 2014 at 5:32 AM, RadhaJayalakshmi
rlakshminaraya...@inautix.co.in wrote:

I am using Solr 4.5.1. I have two collections:
Collection 1 - 2 shards, 3 replicas (Size of Shard 1 - 115
MB, Size of Shard 2 - 55 MB)
Collection 2 - 2 shards, 3 replicas (Size of Shard 2 - 3.5
GB, Size of Shard 2 - 1 GB)

I have a batch process that performs indexing (full refresh) - once a week
on the same index.

Here is some information on how I index:
a) I use SolrJ's bulk ADD API for indexing - 
CloudSolrServer.add(Collection

docs).
b) I have an autoCommit (hardcommit) setting of for both my Collections
(solrConfig.xml):
autoCommit
maxDocs10/maxDocs

openSearcherfalse/openSearcher
/autoCommit
c) I do a programatic hardcommit at the end of the indexing cycle - with 
an

open searcher of true - so that the documents show up on the Search
Results.
d) I neither programatically soft commit (nor have any autoSoftCommit
seetings) during the batch indexing process
e) When I re-index all my data again (the following week) into the same
index - I don't delete existing docs. Rather, I just re-index into the 
same

Collection.
f) I am using the default mergefactor of 10 in my solrconfig.xml
mergeFactor10/mergeFactor

Here is what I am observing:
1) After a batch indexing cycle - the segment counts for each shard / core
is pretty high. The Solr Dashboard reports segment counts between 8 - 30
segments on the variousr cores.
2) Sometimes the Solr Dashboard shows the status of my Core as - NOT
OPTIMIZED. This I find unusual - since I have just finished a Batch 
indexing

cycle - and would assume that the Index should already be optimized - Is
this happening because I don't delete my docs before re-indexing all my 
data

?
3) After I run an optimize on my Collections - the segment count does 
reduce

to significantly - to 1 segment.

Am I doing indexing the right way ? Is there a better strategy ?

Is it necessary to perform an optimize after every batch indexing cycle ??

The outcome I am looking for is that I need an optimized index after every
major Batch Indexing cycle.

Thanks!!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Does-one-need-to-perform-an-optimize-soon-after-doing-a-batch-indexing-using-SolrJ-tp4143686.html
Sent from the Solr - User mailing list archive at Nabble.com. 




Re: Evaluate function only on subset of documents

2014-06-24 Thread Chris Hostetter

: Let's take this query sample:
: XXX OR AAA AND {!frange ...}
: 
: For my use case:
: AAA returns a subset of 100k documents.
: frange returns 5k documents, all part of these 100k documents.
: 
: Therefore, frange skips the most documents. From what you are saying,
: frange is going to be applied on all documents (since it skips the most
: documents) and AAA is going to be applied on the subset. This is kind of
: what I've originally noticed. My goal is to have this in reverse order,

That's not exactly it ... there's no way for the query to know in advance 
how many documents it matches -- what BooleanQuery asks each clause is 
"looking at the index, tell me the (internal) lucene docid of the first doc 
you match."  It then looks at the lowest matching docid of each clause, and 
the Occur property of the clause (MUST, MUST_NOT, SHOULD), to be able to 
tell if/when it can say things like "clause AAA is mandatory but the 
lowest id it matches is doc# 8675 -- so it doesn't matter that clause XXX's 
lowest match is doc# 10 or that clause {!frange}'s lowest match is doc# 
100."

It can then ask XXX and {!frange} to both skip ahead and find the lowest 
docid they each match that is no less than 8675, etc...

From the perspective of {!frange} in particular, this means that on the 
first call it will evaluate itself against docid #0, #1, #2, etc... until 
it finds a match, and on the second call it will evaluate itself against 
docid #8675, #8676, etc... until it finds a match...

: since frange is much more expensive than AAA.
: I was hoping to do so by specifying the cost, saying that Hey, frange has

There is no support for specifying cost on individual clauses inside a 
BooleanQuery.

But i really want to re-iterate that, even with the example you posted 
above, you *still* don't need to nest your {!frange} inside a boolean 
query -- what you have is this:

XXX OR AAA AND {!frange ...}

in which the {!frange ...} clause is completely mandatory -- so my 
previous point #2 still applies... 

:  2) based on the example you give, what you're trying to do here doesn't
:  really depend on using SHOULD (ie: OR) type logic against the frange:
:  the only disjunction you have is in a sub-query of a top level
:  conjunction (e: all required) ... the frange itself is still mandatory.
: 
:  so you could still use it as a non-cached postfilter just like in your
:  previous example:

  q=XXX OR AAA & fq={!frange cost=150 cache=false ...}


-Hoss
http://www.lucidworks.com/
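
In SolrJ terms, that last line would look something like the sketch below (the frange bounds and the function are placeholders, as is the core URL):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostFilterExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("XXX OR AAA");
        // cost >= 100 plus cache=false makes the frange a non-cached post filter:
        // it is only evaluated against documents that already matched q.
        q.addFilterQuery("{!frange cost=150 cache=false l=0 u=10}myfunc(myfield)");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults().getNumFound());
    }
}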


SolrCloud copy the index to another cluster.

2014-06-24 Thread heaven
Hello,

We have a running SolrCloud cluster, a simple setup of 4 nodes (2 shards and
2 replicas) with an index of about 140GB. Now we have to move to another
server and need to somehow copy the existing index without downtime (if
possible).

New config is exactly the same, same 4 nodes, same collections and their own
zookeeper.

What options do we have?

What I was thinking is to add 2 nodes (from the new cluster, the ones that
are supposed to be shards) as replicas for the existing old cluster, and
when the replication is done, simply switch the app to use those new
replicas. Then reconfigure these replicas and run them as shards with their
own ZooKeeper. That way there will be minimal downtime, just a restart of
the new cluster.

My concerns are:
* Will those new replicas be automatically populated with the index from the
old cluster?
* Will I then be able to disconnect them from the old cluster and run them as
primary shards with their own ZooKeeper, and then add their own replicas from
the new cluster?

Thank you,
Alex



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-copy-the-index-to-another-cluster-tp4143759.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud copy the index to another cluster.

2014-06-24 Thread heaven
I've just realized that the old and new clusters use different installations,
configs and lib paths, so the nodes from the new cluster will probably
simply refuse to start using configs from the old ZooKeeper.

Unless there is a way to run them with their own ZooKeeper and then manually
add them as replicas to the old cluster, so that the old and new clusters keep
using their own ZooKeepers?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-copy-the-index-to-another-cluster-tp4143759p4143769.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Block Join Not Working - what am I doing wrong?

2014-06-24 Thread Vinay B,
Hi,
Yes, the query ATTRIBUTES.STATE:TX returns the child doc (see response
below). Is there something else that I'm missing to link the parent and
the child? I followed your advice from my last thread and used a block join
in this attempt, but I still don't see how the parent and child realize
their association. We're using Solr 4.8.1.

Thanks

Query Response:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"true",
      "q":"ATTRIBUTES.STATE:TX",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"1-A",
        "ATTRIBUTES.STATE":["LA",
          "TX"]}]
  }}


Raw doc dump:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "id":"1-A",
        "ATTRIBUTES.STATE":["LA",
          "TX"]},
      {
        "id":"1",
        "content_type":"parentDocument",
        "_version_":1471814208097091584}]
  }}
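
For comparison, a minimal SolrJ sketch that indexes a parent and child as one block (the core URL is a placeholder; field names follow the dumps above). The key point is that the child must be attached via addChildDocument() before the parent is sent:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BlockIndex {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8088/solr/test_core");

        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "1");
        parent.addField("content_type", "parentDocument");

        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "1-A");
        child.addField("ATTRIBUTES.STATE", "LA");
        child.addField("ATTRIBUTES.STATE", "TX");

        // Nest the child inside the parent so both are indexed as one block;
        // only then can {!parent which=...} walk from the child match to its parent.
        parent.addChildDocument(child);
        server.add(parent);
        server.commit();
        server.shutdown();
    }
}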



On Tue, Jun 24, 2014 at 10:45 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 did you run the underneath query ATTRIBUTES.
 STATE:TX. does it return anything?


 On Tue, Jun 24, 2014 at 6:59 PM, Vinay B, vybe3...@gmail.com wrote:

  Okay, Let me try again.
 
  1. Here is some sample SolrJ code that creates a parent and child
 document
  (I hope)
  https://gist.github.com/anonymous/d03747661ef03923de74
 
  2. I tried a block join query which didn't return any results (I tried
 the
  Block Join Parent Query Parser approach described in this link
  https://cwiki.apache.org/confluence/display/solr/Other+Parsers). I
  expected
  to get back the parent doc of a child which has ATTRIBUTES.STATE:TX,
 which
  I did not , That is what I'm trying to figure out.
 
  Thanks
 
  http://localhost:8088/solr/test_core/select?q={!parent
 
 which=content_type:parentDocument}ATTRIBUTES.STATE:TXwt=jsonindent=true
 
  (
  equivalent to
 
 
 http://localhost:8088/solr/test_core/select?q=%7b!parent+which%3d%22content_type%3aparentDocument%22%7dATTRIBUTES.STATE%3aTX%26wt%3djson%26indent%3dtrue
  )
 
  Resulting in
  response
  lst name=responseHeader
  int name=status0/int
  int name=QTime1/int
  lst name=params
  str name=q
  {!parent
 
 which=content_type:parentDocument}ATTRIBUTES.STATE:TXwt=jsonindent=true
  /str
  /lst
  /lst
  result name=response numFound=0 start=0/
  /response
 
 
 
 
  On Mon, Jun 23, 2014 at 4:04 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
   Well, what  do you mean by not working? You might review:
   http://wiki.apache.org/solr/UsingMailingLists
  
   Best,
   Erick
  
   On Mon, Jun 23, 2014 at 12:20 PM, Vinay B, vybe3...@gmail.com wrote:
Hi,
I've been trying to experiment with block joins and parent / child
 docs
   as
described in this thread (input described in my first post of the
  thread,
.. and block join in my second post, as per the suggestions given).
  What
else am I missing?
   
Thanks
   
   
  
 
 http://lucene.472066.n3.nabble.com/Why-aren-t-my-nested-documents-nesting-tt4142702.html#none
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: Slow QTimes - 5 seconds for Small sized Collections

2014-06-24 Thread Dmitry Kan
Two ideas:

1) monitor the GC activity with jvisualvm (comes with Oracle JDK), install
a VisualGC plugin, it is quite helpful. The idea is to try to find the GC
stop-the-world activities. If any found, look at tweaking the GC
parameters. Some insight: http://wiki.apache.org/solr/ShawnHeisey Some more
on tools for GC monitoring:
http://architects.dzone.com/articles/how-monitor-java-garbage

2) monitor the network latency. Any possibility of the network being
periodically congested? Can you plot a graph with number of concurrent (per
second) queries versus their Qtimes ?




On Tue, Jun 24, 2014 at 3:26 PM, RadhaJayalakshmi 
rlakshminaraya...@inautix.co.in wrote:

 I am running Solr 4.5.1. Here is how my setup looks:

 Have 2 modest sized Collections.
 Collection 1 - 2 shards, 3 replicas (Size of Shard 1 - 115
 MB, Size of Shard 2 - 55 MB)
 Collection 2 - 2 shards, 3 replicas (Size of Shard 2 - 3.5
 GB, Size of Shard 2 - 1 GB)
 These two collections are distributed across:
 6 Tomcat Nodes setup on 3 VMs (2 Nodes per VM)
 Each of the 6 Tomcat nodes has a XmS / XmX setting of 2 GB
 Each of the 3 VMs have a Physical Memory (RAM) of 32 GB

 As you can see my Collections are pretty small - This is actually a test
 environment (and NOT Production), However my users (only have a handful of
 testers) are complaining of sporadic performances issues on the Search.

 Here are my observations from the application logs:
 1) Out of 200 sample searches across both collections - 13 requests are
 slow
 (3 slow responses on Collection 1 and 10 slow responses on Collection 2).

 2) When things run fast - they are really fast (Qtimes of 25 - 100
 milliseconds) - but when things are slow - I can see that the QTime
 consistently hovers around the 5 second (or 5000 millisecond mark). I am
 seeing responses of the order of 5024, 5094, 5035 ms - as though something
 just hung for 5 seconds. I am observing this 5 second delay on both
 Collections - which I feel is unusual - because both contain very different
 data sets. I am unable to figure out whats causing the QTime to be so
 consistent around the 5 second mark.

 3) I build my index only once. I did try running an optimize on both
 Collection 1 and Collection 2 after the users complained - I did notice
 that
 post the optimize the segment count on each of the four shards did come
 down
 - but that still didn't resolve the slowness on the searches (I was hoping
 it would).

 4) I am looking at the Solr Dashboard for more clues - My TomCat nodes are
 definitely NOT running out of memory - the 6 nodes are consuming anywhere
 between 500 MB - 1 GB RAM

 5) The File Descriptor counts are under control - can only see a maximum of
 100 file descriptors being used of a total of 4096

 6) The Solr dashboard is however showing that 0.2% (or 9.8MB) of Swap Space
 being consumed on one of the 3 VMs. Is this a concern ?

 7) Also looked at the Plugin / Stats for every core on the Solr Dashboard.
 I
 can't see any evictions happening in any of the caches - Its always ZERO.

 Has anyone encountered such an issue ? What else should I be looking for to
 debug my problem ?

 Thanks



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Slow-QTimes-5-seconds-for-Small-sized-Collections-tp4143681.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Block Join Not Working - what am I doing wrong?

2014-06-24 Thread Mikhail Khludnev
Vinay,
pls upload your index dir somewhere, I can try to check what's wrong with
it.


On Tue, Jun 24, 2014 at 9:43 PM, Vinay B, vybe3...@gmail.com wrote:

 Hi,
 Yes, the query ATTRIBUTES.STATE:TX returns the child doc (see response
 below). Is there something else that I'm missing to link the parent and
 the child? I followed your advice from my last thread and used a block join
 in this attempt, but I still don't see how the parent and child realize
 their association. We're using Solr 4.8.1.

 Thanks

 Query Response

 {
   "responseHeader":{
     "status":0,
     "QTime":0,
     "params":{
       "indent":"true",
       "q":"ATTRIBUTES.STATE:TX",
       "wt":"json"}},
   "response":{"numFound":1,"start":0,"docs":[
     {
       "id":"1-A",
       "ATTRIBUTES.STATE":["LA",
         "TX"]}]
   }}


 Raw doc dump

 {
   "responseHeader":{
     "status":0,
     "QTime":0,
     "params":{
       "indent":"true",
       "q":"*:*",
       "wt":"json"}},
   "response":{"numFound":2,"start":0,"docs":[
     {
       "id":"1-A",
       "ATTRIBUTES.STATE":["LA",
         "TX"]},
     {
       "id":"1",
       "content_type":"parentDocument",
       "_version_":1471814208097091584}]
   }}



 On Tue, Jun 24, 2014 at 10:45 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

  did you run the underneath query ATTRIBUTES.
  STATE:TX. does it return anything?
 
 
  On Tue, Jun 24, 2014 at 6:59 PM, Vinay B, vybe3...@gmail.com wrote:
 
   Okay, Let me try again.
  
   1. Here is some sample SolrJ code that creates a parent and child
  document
   (I hope)
   https://gist.github.com/anonymous/d03747661ef03923de74
  
  2. I tried a block join query which didn't return any results (I tried
  the Block Join Parent Query Parser approach described in this link:
  https://cwiki.apache.org/confluence/display/solr/Other+Parsers). I
  expected to get back the parent doc of a child which has
  ATTRIBUTES.STATE:TX, which I did not. That is what I'm trying to figure
  out.
  
   Thanks
  
    http://localhost:8088/solr/test_core/select?q={!parent which="content_type:parentDocument"}ATTRIBUTES.STATE:TX&wt=json&indent=true

    (equivalent to
    http://localhost:8088/solr/test_core/select?q=%7b!parent+which%3d%22content_type%3aparentDocument%22%7dATTRIBUTES.STATE%3aTX%26wt%3djson%26indent%3dtrue
    )
  
    Resulting in:
    <response>
    <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
    <str name="q">
    {!parent which="content_type:parentDocument"}ATTRIBUTES.STATE:TX&wt=json&indent=true
    </str>
    </lst>
    </lst>
    <result name="response" numFound="0" start="0"/>
    </response>
  
  
  
  
   On Mon, Jun 23, 2014 at 4:04 PM, Erick Erickson 
 erickerick...@gmail.com
  
   wrote:
  
 Well, what do you mean by "not working"? You might review:
http://wiki.apache.org/solr/UsingMailingLists
   
Best,
Erick
   
On Mon, Jun 23, 2014 at 12:20 PM, Vinay B, vybe3...@gmail.com
 wrote:
  Hi,
  I've been trying to experiment with block joins and parent / child docs
  as described in this thread (input described in my first post of the
  thread, and the block join in my second post, as per the suggestions
  given). What else am I missing?

 Thanks


   
  
 
 http://lucene.472066.n3.nabble.com/Why-aren-t-my-nested-documents-nesting-tt4142702.html#none
   
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: limit solr results before join

2014-06-24 Thread Mikhail Khludnev
Hello Kevin,
You can only apply some restriction clauses (with +) to the from side
query.


On Tue, Jun 24, 2014 at 8:09 PM, Kevin Stone kevin.st...@jax.org wrote:

 Is there any way to limit the results of a query on the from index
 before it gets joined?

 The SQL analogy might be...
 SELECT *
 from toIndex join
 (select * from fromIndex
 where some query
 limit 1000
 ) fromIndex on fromIndex.from=toIndex.to


 Example:
 _query_:"{!join fromIndex=expressionData from=anatomyID to=anatomyID
 v='(anatomy:\"brain\")'}"

 Say I have an index representing data for gene expression (we work with
 genetics), and you query it by anatomy term. So the above would query for
 all data that shows gene expression in brain.

 Now I want to get a set of related data for each anatomy term via the
 join. Is there any way to get the related data for only anatomy terms in
 the first 1000 expression data documents (fromIndex)? The reason is because
 there could be millions of data documents (fromIndex), and we process them
 in batches to load a visualization of the query results.

 Doing the join on all the results for each batch I process is becoming a
 bottleneck for large sets of data.

 Thanks,
 -Kevin

 The information in this email, including attachments, may be confidential
 and is intended solely for the addressee(s). If you believe you received
 this email by mistake, please notify the sender by return email as soon as
 possible.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: OOM during indexing nested docs

2014-06-24 Thread Mikhail Khludnev
Enable heap dump on OOME, and build the histogram with jhat.
Did you try to reduce the max RAM buffer or max buffered docs? Or enable
autocommit?
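
(A hedged sketch of that heap-dump setup; the dump path is an assumption,
and jhat ships with the JDK of that era:)

java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/tmp/solr-oom.hprof \
     <your existing Solr flags> -jar start.jar

# after the OOME, browse the class histogram at http://localhost:7000/histo/
jhat -J-mx26g /var/tmp/solr-oom.hprof

(jhat needs roughly as much heap as the dump it reads, hence -J-mx26g for
a 24g heap.)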


On Tue, Jun 24, 2014 at 7:43 PM, adfel70 adfe...@gmail.com wrote:

 Hi,

 I am getting OOM while indexing 400 million docs (nested, 7-20 children).
 The memory usage gets higher while indexing until it reaches 24g.
 Also, after the OOM, when indexing stops, the memory stays at 24g, *seems
 like a leak.*


 *Solr  Collection Info: *
 solr 4.8 , 6 shards, 1 replicas per shard, 24g for jvm

 Thanks



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/OOM-during-indexing-nested-docs-tp4143722.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


RE: limit solr results before join

2014-06-24 Thread Kevin Stone
I don't know what that means. Is that a no?


From: Mikhail Khludnev [mkhlud...@griddynamics.com]
Sent: Tuesday, June 24, 2014 2:18 PM
To: solr-user
Subject: Re: limit solr results before join

Hello Kevin,
You can only apply some restriction clauses (with +) to the from side
query.



--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: Block Join Not Working - what am I doing wrong?

2014-06-24 Thread Vinay B,
Mikhail, try this,

Thanks

https://www.dropbox.com/s/074p0wpjz916d78/test_core.tar.gz



On Tue, Jun 24, 2014 at 1:16 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Vinay,
 pls upload your index dir somewhere, I can try to check what's wrong with
 it.




Re: limit solr results before join

2014-06-24 Thread Mikhail Khludnev
_query_:"{!join fromIndex=expressionData from=anatomyID to=anatomyID
v='(anatomy:\"brain\") +id:[1 TO 1]'}"


On Tue, Jun 24, 2014 at 10:24 PM, Kevin Stone kevin.st...@jax.org wrote:

 I don't know what that means. Is that a no?




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: SolrCloud copy the index to another cluster.

2014-06-24 Thread Michael Della Bitta
I'm currently playing around with Solr Cloud migration strategies, too. I'm
wondering... when you say zero downtime, do you mean zero *read*
downtime, or zero downtime altogether?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Jun 24, 2014 at 1:43 PM, heaven aheave...@gmail.com wrote:

 I've just realized that the old and new clusters use different
 installations, configs and lib paths. So the nodes from the new cluster
 will probably simply refuse to start using configs from the old zookeeper.

 That would only work if there were a way to run them with their own
 zookeeper and then manually add them as replicas to the old cluster, so
 the old and new clusters keep using their own zookeepers.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SolrCloud-copy-the-index-to-another-cluster-tp4143759p4143769.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SolrCloud copy the index to another cluster.

2014-06-24 Thread heaven
Zero read downtime would be enough; we can safely stop index updates for a
while. But we have some API endpoints where read downtime is very
undesirable.

Best,
Alex



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-copy-the-index-to-another-cluster-tp4143759p4143795.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Block Join Not Working - what am I doing wrong?

2014-06-24 Thread Mikhail Khludnev
I wonder what can be wrong there.. it works for me absolutely fine

proofpic
http://postimg.org/image/51qrsm48p/

query
http://localhost:8983/solr/collection1/select?q=%7B!parent+which%3D%22content_type%3AparentDocument%22%7DATTRIBUTES.STATE%3ATX&wt=json&indent=true&debugQuery=true

gives


"response":{"numFound":1,"start":0,"docs":[
  {
    "id":"1",
    "content_type":["parentDocument"],
    "_version_":1471814208097091584}]
  },

I can just wish you good luck, I can't help. Two minor notes:

I declared the field explicitly in schema.xml; it might help you if
you didn't do it yet.. I r'lly dunno:

   <field name="ATTRIBUTES.STATE" type="string" indexed="true"
stored="true" required="false" multiValued="true" />

Just a hint: to debug block joins, use wt=csv, which shows the block
alignment pretty well.
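
For example (the field list here is just an assumption), something like

http://localhost:8983/solr/collection1/select?q=*:*&wt=csv&fl=id,content_type&rows=100

prints one row per document in index order, so the parent/child block
alignment is easy to eyeball.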



On Tue, Jun 24, 2014 at 10:38 PM, Vinay B, vybe3...@gmail.com wrote:

 Mikhail, try this,

 Thanks

 https://www.dropbox.com/s/074p0wpjz916d78/test_core.tar.gz




Re: SolrCloud copy the index to another cluster.

2014-06-24 Thread Michael Della Bitta
So what I'm playing with now is creating a new collection on the target
cluster, turning off the target cluster, wiping the indexes, and manually
just copying the indexes over to the correct directories and starting
again. In the middle, you can run an optimize or use the Lucene index
upgrader tool to bring yourself up to the new version.

Part of this for me is a migration to HDFSDirectory so there's an added
level of complication there.

I would assume that since you only need to preserve reads, you could cut
over once your collections were created on the new cloud?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Jun 24, 2014 at 3:25 PM, heaven aheave...@gmail.com wrote:

 Zero read downtime would be enough; we can safely stop index updates for a
 while. But we have some API endpoints where read downtime is very
 undesirable.

 Best,
 Alex



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SolrCloud-copy-the-index-to-another-cluster-tp4143759p4143795.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Block Join Not Working - what am I doing wrong?

2014-06-24 Thread Vinay B,
Thanks, I figured it out based on your last response. I mistakenly
URL-encoded the wt=json and indent=true parameters when manufacturing the
request:

%26wt%3djson%26indent%3dtrue

Incidentally, this translates to

{!parent which="content_type:parentDocument"}ATTRIBUTES.STATE:TX&wt=json&indent=true

being sent as one big q parameter, and returns a malformed XML response.

For your correct request (expressed as XML), the query looks like this:

{!parent which="content_type:parentDocument"}ATTRIBUTES.STATE:TX


In any case, I'll write up a practical HOW-TO using SolrJ for the benefit
of the community.
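
(In the meantime, a minimal sketch of the working round trip with SolrJ
4.8.x, using the core and field names from this thread; everything else is
an assumption:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BlockJoinSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8088/solr/test_core");

    // Parent and child must be indexed together, as one block.
    SolrInputDocument parent = new SolrInputDocument();
    parent.addField("id", "1");
    parent.addField("content_type", "parentDocument");

    SolrInputDocument child = new SolrInputDocument();
    child.addField("id", "1-A");
    child.addField("ATTRIBUTES.STATE", "TX");
    parent.addChildDocument(child);

    server.add(parent);
    server.commit();

    // The & problem from above never comes up here: SolrJ sends q as its
    // own parameter, so only the block join expression goes into q.
    SolrQuery q = new SolrQuery(
        "{!parent which=\"content_type:parentDocument\"}ATTRIBUTES.STATE:TX");
    System.out.println(server.query(q).getResults());
  }
}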




On Tue, Jun 24, 2014 at 2:29 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 I wonder what can be wrong there.. it works for me absolutely fine

 proofpic
 http://postimg.org/image/51qrsm48p/

 query

 http://localhost:8983/solr/collection1/select?q=%7B!parent+which%3D%22content_type%3AparentDocument%22%7DATTRIBUTES.STATE%3ATX&wt=json&indent=true&debugQuery=true

 gives


 "response":{"numFound":1,"start":0,"docs":[
   {
     "id":"1",
     "content_type":["parentDocument"],
     "_version_":1471814208097091584}]
   },

 I can just wish you good luck, I can't help. Two minor notes:

 I declared the field explicitly in schema.xml; it might help you if
 you didn't do it yet.. I r'lly dunno:

    <field name="ATTRIBUTES.STATE" type="string" indexed="true"
 stored="true" required="false" multiValued="true" />

 Just a hint: to debug block joins, use wt=csv, which shows the block
 alignment pretty well.




Clubbing queries with different criterias together?

2014-06-24 Thread lalitjangra
Hi,

I have a number of documents in a single core getting indexed from
different sources, with common properties but different values.

The problem is that while fetching one set of documents, I need to use raw
query parameters as below.

http://solrserver/solr/collection1/select?q=*%3A*&wt=json&indent=true&_query_=%22AuthenticatedUserName=lalit%22

But for the second set of documents, I need to use filter queries.

http://solrserver/solr/collection1/select?q=*%3A*&fq=alf_acls%3AGROUP_EVERYONE&wt=json&indent=true
 

One way of getting all the documents is to make two different queries and
combine their results, but I want to avoid two queries for performance
reasons, as it would double the load on the system.

Is there any way I can use a single query and get results from both sets
simultaneously?

Thanks for help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Clubbing-queries-with-different-criterias-together-tp4143829.html
Sent from the Solr - User mailing list archive at Nabble.com.


Questions about solr.SuggestComponent

2014-06-24 Thread Sergio R. Charpinel Jr.
Hi,

I'm testing the SuggestComponent and came up with some questions.

1. How can I set the term frequency as the weightField?

2. Why does it only work with stored fields? How can I return the value
resulting from my index-time filter transformations?


Thanks!

-- 
Sergio Roberto Charpinel Jr.


Re: Clubbing queries with different criterias together?

2014-06-24 Thread Ahmet Arslan
Hi Lalit,

_query_ is a magic field name. Please see:
http://searchhub.org/2009/03/31/nested-queries-in-solr/

Why do you use _query_=%22AuthenticatedUserName=lalit%22? It is simply
ignored.
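
(Not from the thread, just a hedged sketch: if AuthenticatedUserName really
is an indexed field, one boolean filter query could cover both sets in a
single request, e.g.

http://solrserver/solr/collection1/select?q=*%3A*&fq=AuthenticatedUserName:lalit%20OR%20alf_acls:GROUP_EVERYONE&wt=json&indent=true
)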

Ahmet



On Tuesday, June 24, 2014 11:34 PM, lalitjangra lalit.j.jan...@gmail.com 
wrote:
Hi,

I have a number of documents in a single core getting indexed from
different sources, with common properties but different values.

The problem is that while fetching one set of documents, I need to use raw
query parameters as below.

http://solrserver/solr/collection1/select?q=*%3A*&wt=json&indent=true&_query_=%22AuthenticatedUserName=lalit%22

But for the second set of documents, I need to use filter queries.

http://solrserver/solr/collection1/select?q=*%3A*&fq=alf_acls%3AGROUP_EVERYONE&wt=json&indent=true

One way of getting all the documents is to make two different queries and
combine their results, but I want to avoid two queries for performance
reasons, as it would double the load on the system.

Is there any way I can use a single query and get results from both sets
simultaneously?

Thanks for help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Clubbing-queries-with-different-criterias-together-tp4143829.html
Sent from the Solr - User mailing list archive at Nabble.com.



How to extend the behavior of a common text field (such as text_general) to recognize regex

2014-06-24 Thread Vinay B,
This is easy if I only redefine a custom field to identify the desired
patterns (numbers, in my case).

For example, I could define a field thus:
<!-- A text field that identifies numerical entities -->
<fieldType name="text_num" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory"
      pattern="\s*[0-9][0-9-]*[0-9]?\s*" group="0"/>
  </analyzer>
</fieldType>

Input:
hello, world bye 123-45 abcd  sdfssdf --- aaa

Output:
123-45 , 

However, I also want to retain the behavio


Re: How to extend the behavior of a common text field (such as text_general) to recognize regex

2014-06-24 Thread Vinay B,
Sorry, previous post got sent prematurely.

Here is the complete post:

This is easy if I only redefine a custom field to identify the desired
patterns (numbers, in my case).

For example, I could define a field thus:
<!-- A text field that identifies numerical entities -->
<fieldType name="text_num" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory"
      pattern="\s*[0-9][0-9-]*[0-9]?\s*" group="0"/>
  </analyzer>
</fieldType>

Input:
hello, world bye 123-45 abcd  sdfssdf --- aaa

Output:
123-45 , 

However, I also want to retain the behavior of the default text_general
field, that is, recognize the usual text tokens (hello, world, bye, etc.).
What is the best way to achieve this?
I've looked at PatternCaptureGroupFilterFactory (
http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html
) but I suspect that it too is subject to the behavior of the prior
tokenizer (which for text_general is StandardTokenizerFactory).

Thanks





Re: Evaluate function only on subset of documents

2014-06-24 Thread Costi Muraru
Hi Chris,

Thanks for your patience, I've now got a better picture of how things work.
I don't believe, however, that the two queries (the one with the post
filter and the one without) are equivalent.

Suppose out of the whole document set:
XXX returns documents 1,2,3.
AAA returns documents  6,7,8.
{!frange}customfunction returns documents 7,8.

Running this query:
XXX OR AAA AND {!frange ...}
Matched documents are:
(1,2,3) OR (6,7,8) AND (7,8) = (1,2,3) OR (7,8) = 1,2,3,7,8

With the post filter:
q=XXX OR AAA  fq={!frange cost=150 cache=false ...}
Matched documents are:
(1,2,3) OR (6,7,8) = (1,2,3,6,7,8) with post filter (7,8) = (7,8)


I was hoping that the evaluation process would short-circuit.
Document set: 1,2,3,4,5,6,7,8

Document id 1:
Does it match XXX? Yes. Document matches query. Skip the second clause (AAA
AND {!frange ...}) and evaluate next doc.
Document id 2:
Does it match XXX? Yes. Document matches query. Skip second clause and
evaluate next doc.
Document id 3:
Does it match XXX? Yes. Document matches query. Skip second clause and
evaluate next doc.

Document id 4:
Does it match XXX? No.
Does it match AAA? No. Document does not match query. Skip frange and
evaluate next doc.

Document id 5:
Does it match XXX? No.
Does it match AAA? No. Document does not match query. Skip frange and
evaluate next doc.

Document id 6:
Does it match XXX? No.
Does it match AAA? Yes.
Does it match frange? No.  Document does not match query. [Only here the
custom function would be evaluated first.]

Document id 7:
Does it match XXX? No.
Does it match AAA? Yes.
Does it match frange? Yes.  Document matches query.

Document id 8:
Does it match XXX? No.
Does it match AAA? Yes.
Does it match frange? Yes.  Document matches query.

Returned documents: 1,2,3,7,8.

So with this logic the custom function would be evaluated on documents
6,7,8 rather than on the whole set to see the smallest doc index, like
you've described in your last email.

I hope I'm not rambling. :-)
Does it make sense?

Costi


On Tue, Jun 24, 2014 at 7:26 PM, Chris Hostetter hossman_luc...@fucit.org
 wrote:


 : Let's take this query sample:
 : XXX OR AAA AND {!frange ...}
 :
 : For my use case:
 : AAA returns a subset of 100k documents.
 : frange returns 5k documents, all part of these 100k documents.
 :
 : Therefore, frange skips the most documents. From what you are saying,
 : frange is going to be applied on all documents (since it skips the most
 : documents) and AAA is going to be applied on the subset. This is kind of
 : what I've originally noticed. My goal is to have this in reverse order,

 That's not exactly it ... there's no way for the query to know in advance
 how many documents it matches -- what BooleanQuery asks each clause is
 looking at the index, tell me the (internal) lucene docid of the first doc
 you match.  it then looks at the lowest matching docid of each clause, and
 the Occur property of the clause (MUST, MUST_NOT, SHOULD) to be able to
 tell if/when it can say things like clause AAA is mandatory but the
 lowest id it matches is doc# 8675 -- so it doesn't matter that clause XXX's
 lowest match is doc# 10 or that clause {!frange}'s lowest match is doc#
 100

 it can then ask XXX and {!frange} to both skip ahead, and find lowest
 docid they each match that is no less then 8675, etc...

 from the perspective of {!frange} in particular, this means that on the
 first call it will evaluate itself against docid #0, #1, #2, etc... until
 it finds a match.  and on the second call it will evaluate itself against
 docid #8675, 8676, etc... until it finds a match...

 : since frange is much more expensive than AAA.
 : I was hoping to do so by specifying the cost, saying that Hey, frange
 has

 There is no support for specifying cost on individual clauses inside of a
 BooleanQuery.

 But i really want to re-iterate that even with the example you posted
 above you *still* don't need to nest your {!frange} inside of a boolean
 query -- what you have is this:

 XXX OR AAA AND {!frange ...}

 in which the {!frange ...} clause is completely mandatory -- so my
 previous point #2 still applies...

 :  2) based on the example you give, what you're trying to do here doesn't
 :  really depend on using SHOULD (ie: OR) type logic against the frange:
 :  the only disjunction you have is in a sub-query of a top level
 :  conjunction (e: all required) ... the frange itself is still mandatory.
 : 
 :  so you could still use it as a non-cached postfilter just like in your
 :  previous example:

   q=XXX OR AAA  fq={!frange cost=150 cache=false ...}


 -Hoss
 http://www.lucidworks.com/



Re: Evaluate function only on subset of documents

2014-06-24 Thread Chris Hostetter

: I don't believe however that the two queries (the one with the post filter
: and the one without one) are equivalent.
: 
: Suppose out of the whole document set:
: XXX returns documents 1,2,3.
: AAA returns documents  6,7,8.
: {!frange}customfunction returns documents 7,8.
: 
: Running this query:
: XXX OR AAA AND {!frange ...}
: Matched documents are:
: (1,2,3) OR (6,7,8) AND (7,8) = (1,2,3) OR (7,8) = 1,2,3,7,8

Did you actually test out that specific example, because those results 
don't make sense to me given how the parser deals with multiple AND and OR 
keywords in a single BooleanQuery (which is why i hate AND and OR and 
advise anyone who will listen to never use them)...

  http://searchhub.org//2011/12/28/why-not-and-or-and-not/

$ curl -sS 'http://localhost:8983/solr/select?q=xxx%20OR%20AaA%20AND%20zZz&debug=query&wt=json&indent=true' | grep xxx 
  "q":"xxx OR AaA AND zZz",
    "rawquerystring":"xxx OR AaA AND zZz",
    "querystring":"xxx OR AaA AND zZz",
    "parsedquery":"text:xxx +text:aaa +text:zzz",
    "parsedquery_toString":"text:xxx +text:aaa +text:zzz",

Based on your walk-through of the logic you'd like to have, 
it seems like the query you meant to write is something like this...

  XXX (+AAA +{!frange ...})
...aka...
  XXX OR (AAA AND {!frange ...})

...in which case i'm afraid i don't have many good suggestions for you on 
how to minimize the number of times the function is called to eliminate 
any doc that already matches XXX (or to force it to check AAA first)

Looking at one of your specific examples...

: Document id 1:
: Does it match XXX? Yes. Document matches query. Skip the second clause (AAA
: AND {!frange ...}) and evaluate next doc.

...this type of skipping fundamentally can't happen with a BooleanQuery 
because of the way scoring works in lucene -- even if it matches the XXX 
clause, the other clauses will still be consulted to determine what the 
total score will be -- all SHOULD and MUST clauses that match contribute 
to the final score.

: I hope I'm not rambling. :-)
: Does it make sense?

You're not rambling -- there's just no general way to force the kind 
of check that this last optimization you're hoping for requires, and 
even if there were, it wouldn't help you as much as you might think because 
of the scoring.


-Hoss
http://www.lucidworks.com/


Re: Trouble with TrieDateFields

2014-06-24 Thread Chris Hostetter

: I am upgrading an index from Solr 3.6 to 4.2.0.

: Everything has been picked up except for the old DateFields.

Just to be crystal clear:

1) 4.2 is already over a year old.  the current release of Solr is 4.8, 
and 4.9 will most likely be available within a day or two

2) Even in 4.9, solr.DateField still exists -- it has been deprecated 
and removed from the example schema, and will not be supported in 5.0, but 
just because you are upgrading to 4.x doesn't mean you have to stop using 
solr.DateField if it currently works for you.

: I read some posts that due to the extra functionality of the 
: TrieDateField you would need to re-index for those fields.

It's not a question of extra functionality -- the internal 
representation of the dates in the index is completely different.

: To avoid re-indexing I was trying to do a Partial Update 
: (http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/),

you can't use partial updates to work around a problem like this -- 
partial updates only work if the stored values in the index can be read by 
solr, then modified by the update command, and then written back out.  but 
upgrading from 3.6 and trying to trick solr by changing a solr.DateField 
to a solr.TrieDateField just fundamentally won't work.  Solr won't be 
able to correctly read the stored date fields to return them, let alone 
modify them and write them back to the index.

If you really can't re-index from scratch, and all of your fields are in 
fact stored in your 3.6 index, and you really want to switch to using 
TrieDateField, then your best option is to fetch every doc from your 3.6 
solr instance (like you were doing with your partial updates approach, but 
pull back every field) and then push each doc to a *new* 4.x instance 
you've set up with the updated schema.xml using TrieDateField.
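
(A hedged SolrJ sketch of that fetch-and-push loop; the URLs, page size,
and the XML parser choice are assumptions, and for a large index you would
want to page by unique key rather than by start offset:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class CopyDocs {
  public static void main(String[] args) throws Exception {
    HttpSolrServer oldSolr = new HttpSolrServer("http://oldhost:8983/solr");
    oldSolr.setParser(new XMLResponseParser()); // safer across versions
    HttpSolrServer newSolr = new HttpSolrServer("http://newhost:8983/solr");

    final int rows = 500;
    for (int start = 0; ; start += rows) {
      SolrDocumentList page = oldSolr.query(
          new SolrQuery("*:*").setStart(start).setRows(rows)).getResults();
      if (page.isEmpty()) break;
      for (SolrDocument doc : page) {
        SolrInputDocument in = new SolrInputDocument();
        for (String f : doc.getFieldNames()) {
          if ("_version_".equals(f)) continue; // let 4.x assign its own
          in.addField(f, doc.getFieldValue(f));
        }
        newSolr.add(in);
      }
    }
    newSolr.commit();
  }
}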


-Hoss
http://www.lucidworks.com/


Re: How to extend the behavior of a common text field (such as text_general) to recognize regex

2014-06-24 Thread Alexandre Rafalovitch
What about copyField'ing the content into a second field where you
apply the alternative processing? Then use eDismax to search both. You
don't have to store the other field, just index it.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
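
(A hedged schema.xml sketch of that suggestion; the content field name is
an assumption, and text_num is the type from Vinay's post:)

   <field name="content" type="text_general" indexed="true" stored="true"/>
   <field name="content_num" type="text_num" indexed="true" stored="false"/>
   <copyField source="content" dest="content_num"/>

and then hit both fields in one query:

   q=123-45&defType=edismax&qf=content content_num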


On Wed, Jun 25, 2014 at 5:55 AM, Vinay B, vybe3...@gmail.com wrote:
 Sorry, previous post got sent prematurely.

 Here is the complete post:

 This is easy if I only redefine a custom field to identify the desired
 patterns (numbers, in my case).

 For example, I could define a field thus:
 <!-- A text field that identifies numerical entities -->
 <fieldType name="text_num" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory"
       pattern="\s*[0-9][0-9-]*[0-9]?\s*" group="0"/>
   </analyzer>
 </fieldType>

 Input:
 hello, world bye 123-45 abcd  sdfssdf --- aaa

 Output:
 123-45 , 

 However, I also want to retain the behavior of the default text_general
 field, that is, recognize the usual text tokens (hello, world, bye, etc.).
 What is the best way to achieve this?
 I've looked at PatternCaptureGroupFilterFactory (
 http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html
 ) but I suspect that it too is subject to the behavior of the prior
 tokenizer (which for text_general is StandardTokenizerFactory).

 Thanks





Re: DIH on Solr

2014-06-24 Thread Wolfgang Hoschek
Check out the HBase Indexer http://ngdata.github.io/hbase-indexer/

Wolfgang.

On Jun 24, 2014, at 3:55 AM, Ahmet Arslan iori...@yahoo.com.INVALID wrote:

 Hi,
 
 There is no DataSource or EntityProcessor for HBase, I think.
 
 May be http://www.lilyproject.org/lily/index.html works for you?
 
 AHmet
 
 
 On Tuesday, June 24, 2014 1:27 PM, atp annamalai...@hcl.com wrote:
 Hi experts,
 
 We have a requirement to import data from HBase tables using Solr. We
 tried with the help of DataImportHandler, but we couldn't find the
 configuration steps or documentation for DataImportHandler for HBase. Can
 anybody please share the steps to configure it?
 
 We tried a basic configuration, but selecting full-import throws an
 error. Please share the docs or links to configure DIH for an HBase table.
 
 6/24/2014 3:44:00 PM
 WARN
 ZKPropertiesWriter
 Could not read DIH properties from
 /configs/collection1/dataimport.properties :class
 org.apache.zookeeper.KeeperException$NoNodeException
 6/24/2014 3:44:00 PM
 ERROR
 DataImporter
 Full Import failed:java.lang.RuntimeException:
 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
 load EntityProcessor implementation for entity:msg Processing Document # 1
 
 
 Thanks  in Advance
 
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/DIH-on-Solr-tp4143669.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: Double cast exception with grouping and sort function

2014-06-24 Thread Chris Hostetter

: I recently tried upgrading our setup from 4.5.1 to 4.7+, and I'm
: seeing an exception when I use (1) a function to sort and (2) result
: grouping.  The same query works fine with either (1) or (2) alone.
: Example below.

Did you modify your schema in any way when upgrading?

Can you provide some sample data that demonstrates the problem? (ideally 
using the 4.x example configs - but if you can't reproduce with that 
then providing your own configs would be helpful)

I was unable to reproduce it doing a quick sanity check using the example 
with a shards param to force a distrib query...

http://localhost:8983/solr/select?q=*:*&shards=localhost:8983/solr&sort=sum%281,1%29%20desc&group=true&group.field=inStock

It's possible that the distributed grouping code has a bug in it related 
to the marshalling of sort values and i'm just not tickling that bug 
with my quick check ... but if i remember correctly work was done to fix 
grouped sorting to correctly deal with this when 
FieldType.marshalSortValue was introduced.


: Example (v4.8.1):
: {
:   "responseHeader": {
:     "status": 500,
:     "QTime": 14,
:     "params": {
:       "sort": "sum(1,1) desc",
:       "indent": "true",
:       "q": "title:solr",
:       "_": "1403586036335",
:       "group.field": "type",
:       "group": "true",
:       "wt": "json"
:     }
:   },
:   "error": {
:     "msg": "java.lang.Double cannot be cast to 
: org.apache.lucene.util.BytesRef",
:     "trace": "org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
: java.lang.Double cannot be cast to org.apache.lucene.util.BytesRef",
:     "code": 500
:   }
: }
: 
: From the log:
: 
: org.apache.solr.common.SolrException;
: null:java.lang.ClassCastException: java.lang.Double cannot be cast to
: org.apache.lucene.util.BytesRef
: at 
org.apache.solr.schema.FieldType.marshalStringSortValue(FieldType.java:981)
: at 
org.apache.solr.schema.TextField.marshalSortValue(TextField.java:176)
: at 
org.apache.solr.search.grouping.distributed.shardresultserializer.SearchGroupsResultTransformer.serializeSearchGroup(SearchGroupsResultTransformer.java:125)
: at 
org.apache.solr.search.grouping.distributed.shardresultserializer.SearchGroupsResultTransformer.transform(SearchGroupsResultTransformer.java:65)
: at 
org.apache.solr.search.grouping.distributed.shardresultserializer.SearchGroupsResultTransformer.transform(SearchGroupsResultTransformer.java:43)
: at 
org.apache.solr.search.grouping.CommandHandler.processResult(CommandHandler.java:193)


-Hoss
http://www.lucidworks.com/


Does updating a child document destroy the parent - child relationship

2014-06-24 Thread Vinay B,
When I edit a child document, a block join query for the parent no longer
returns any hits. I thought I read that this was the way things worked but
needed to know for sure.

If so, is there any other way to achieve this functionality (I can deal
with creating the child doc with the parent, but would like to edit it
separately).

My rough prototype code is at

https://github.com/balamuru/SolrChildDocs

and the code in question is commented out in
https://github.com/balamuru/SolrChildDocs/blob/master/src/main/java/com/vgb/solr/SolrApp.java


Thanks


Re: fq= more then one ?

2014-06-24 Thread rulinma
good.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/fq-more-then-one-tp959849p4143943.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Does updating a child document destroy the parent - child relationship

2014-06-24 Thread Jack Krupansky
Block join is a very specialized feature of Solr - it requires that creation 
and update of the parent and all children be done as a single update 
operation for all of the documents. So... you cannot update a child document 
by itself, but need to update the entire block.


Unfortunately, this limitation does not appear to be documented in the Solr 
ref guide.
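
(So the workaround is to re-send the whole block whenever any child
changes. A hedged SolrJ sketch, reusing the ids from Vinay's earlier
thread; the server URL and everything else are assumptions:)

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReplaceBlock {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");

    // Rebuild the parent plus ALL children, including the edited one, and
    // re-add them as a single block; the old block is replaced wholesale.
    SolrInputDocument parent = new SolrInputDocument();
    parent.addField("id", "1");
    parent.addField("content_type", "parentDocument");

    SolrInputDocument child = new SolrInputDocument();
    child.addField("id", "1-A");
    child.addField("ATTRIBUTES.STATE", "CA"); // the edited child value
    parent.addChildDocument(child);

    server.add(parent);
    server.commit();
  }
}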


-- Jack Krupansky

-Original Message- 
From: Vinay B,

Sent: Tuesday, June 24, 2014 10:40 PM
To: solr-user
Subject: Does updating a child document destroy the parent - child 
relationship


When I edit a child document, a block join query for the parent no longer
returns any hits. I thought I read that this was the way things worked but
needed to know for sure.

If so, is there any other way to achieve this functionality (I can deal
with creating the child doc with the parent, but would like to edit it
separately).

My rough prototype code is at

https://github.com/balamuru/SolrChildDocs

and the code in question is commented out in
https://github.com/balamuru/SolrChildDocs/blob/master/src/main/java/com/vgb/solr/SolrApp.java


Thanks 



Re: Solr on S3FileSystem, Kosmos, GlusterFS, etc….

2014-06-24 Thread Jay Vyas
Hi Solr ! 

I got this working .  Here's how : 

With the example jetty runner, you can extract the tarball and go to the 
examples/ directory, where you can launch an embedded core. Then find the 
solrconfig.xml file and edit it to contain the following xml:
 
<directoryFactory name="DirectoryFactory" 
    class="org.apache.solr.core.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">myhcfs:///solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
</directoryFactory>

the confdir is important: That is where you will have something like a 
core-site.xml that defines all the parameters for your filesystem 
(fs.defaultFS, fs.mycfs.impl… and so on). 
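
(A hedged sketch of such a core-site.xml; the glusterfs scheme and impl 
class come from the glusterfs-hadoop plugin and are assumptions -- 
substitute whatever your HCFS ships with:)

<configuration>
  <!-- the default filesystem HdfsDirectoryFactory resolves paths against -->
  <property>
    <name>fs.defaultFS</name>
    <value>glusterfs:///</value>
  </property>
  <!-- maps the glusterfs:// scheme to the plugin's FileSystem class -->
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
</configuration>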


This tells solr, when launched, to use myhcfs as the underlying file store. 

You also should make sure that the jar for your plugin (in our case glusterfs, 
but hadoop will reference it by looking up the dynamically generated parameters 
that come from the base uri myhcfs…) and its classes are on the class path, and 
that the hadoop-common jar is also there (some HCFS shims need FilterFileSystem 
to run correctly, which is only in hadoop-common.jar).

So - how to modify the running solr core's class path?  

To do so – you can update the solrconfig.xml <lib> directives. There are a bunch 
of regular-expression templates you can modify in the 
examples/.../solrconfig.xml file. You can also copy the jars in at runtime, to 
be really safe.
 
Once your example core with gluster configuration is setup, launch it with the 
following properties:
 
java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs 
-Dsolr.data.dir=glusterfs:///solr -Dsolr.updatelog=glusterfs:///solr 
-Dlog4j.configuration=file:/opt/solr-4.4.0-cdh5.0.2/example/etc/logging.properties
 -jar start.jar

This starts a basic SOLR server on port 8983.
 
If you are running from the simple jetty based examples which I've used to 
describe this above, then you should see the collection1 core up and running, 
and you should see its index sitting inside the /solr directory of your file 
system. 

Hope this helps those interested in expanding the use of SolrCloud outside of a 
single FS. 


On Jun 23, 2014, at 6:16 PM, Jay Vyas jayunit100.apa...@gmail.com wrote:

 Hi folks.  Does anyone deploy solr indices on other HCFS implementations 
 (S3FileSystem, for example) regularly ? If so I'm wondering 
 
 1) Where are the docs for doing this - or examples?  Seems like everything, 
 including parameter names for dfs setup, are based around hdfs.   Maybe I 
 should file a JIRA similar to 
 https://issues.apache.org/jira/browse/FLUME-2410 (to make the generic 
 deployment of SOLR on any file system explicit / obvious).
 
 2) if there are any interesting requirements (i.e. createNonRecursive, Atomic 
 mkdirs, sharing, blocking expectations etc etc) which need to be implemented



Re: Solr on S3FileSystem, Kosmos, GlusterFS, etc….

2014-06-24 Thread Paul Libbrecht
I've always been under the impression that file-system access speed is crucial 
for Lucene-based storage and have always advocated against using NFS for it 
(with which we saw a slowdown of roughly a factor of 5). Has any performance 
measurement been made for such a setting? Has FS caching suddenly gotten so 
much better that it is not a problem?

Also, as far as I know S3 bills by the amount of (giga-)bytes exchanged… this 
gives plenty of room, but if each start needs to exchange a big part of the 
index from the storage to the solr server because of cache filling, it looks 
like it won't be that cheap.

thanks for the experience report.

paul


On 25 juin 2014, at 07:16, Jay Vyas jayunit100.apa...@gmail.com wrote:

 Hi Solr ! 
 
 I got this working .  Here's how : 
 
 With the example jetty runner, you can Extract the tarball, and go to the 
 examples/ directory, where you can launch an embedded core. Then, find the 
 solrconfig.xml file. Edit it to contain the following xml:
 
 <directoryFactory name="DirectoryFactory" 
     class="org.apache.solr.core.HdfsDirectoryFactory">
   <str name="solr.hdfs.home">myhcfs:///solr</str>
   <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
 </directoryFactory>
 
 the confdir is important: That is where you will have something like a 
 core-site.xml that defines all the parameters for your filesystem 
 (fs.defaultFS, fs.mycfs.impl…. and so on). 
 
 
 This tells solr, when launched, to use myhcfs as the underlying file store. 
 
 You also should make sure that the jar for your plugin (in our case 
 glusterfs, but hadoop will reference it by looking up the dynamically 
 generated parameters that come from the base uri myhcfs…) and its classes 
 are on the class path, and that the hadoop-common jar is also there (some 
 HCFS shims need FilterFileSystem to run correctly, which is only in 
 hadoop-common.jar).
 
 So - how to modify the running solr core's class path?  
 
 To do so – you can update the solrconfig.xml <lib> directives. There are a 
 bunch of regular-expression templates you can modify in the 
 examples/.../solrconfig.xml file. You can also copy the jars in at runtime, 
 to be really safe.
 
 Once your example core with gluster configuration is setup, launch it with 
 the following properties:
 
 java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs 
 -Dsolr.data.dir=glusterfs:///solr -Dsolr.updatelog=glusterfs:///solr 
 -Dlog4j.configuration=file:/opt/solr-4.4.0-cdh5.0.2/example/etc/logging.properties
  -jar start.jar
 
 This starts a basic SOLR server on port 8983.
 
 If you are running from the simple jetty based examples which I've used to 
 describe this above, then you should see the collection1 core up and running, 
 and you should see its index sitting inside the /solr directory of your file 
 system. 
 
 Hope this helps those interested in expanding the use of SolrCloud outside of 
 a single FS. 
 
 