Running Solr 4 on Sun vs OpenJDK JVM?

2013-07-23 Thread Cosimo Streppone

Hi,

do you have any advice on operating a Solr 4.0
read-only instance with regards to the underlying
JVM?

In particular I'm wondering about stability and
memory usage, but anything else you might add is welcome,
when it comes to OpenJDK vs Sun/Oracle Hotspot,
v6 vs v7.

What are you running, what would you suggest and why?
I tried searching for some information, and the old(?)
run-on-sun-java6-jre tip is all I get back.

This morning I also read that Oracle JDK v7 is
apparently based on OpenJDK sources... so?

Thanks,

--
Cosimo


Re: adding date column to the index

2013-07-23 Thread Gora Mohanty
On 23 July 2013 11:13, Mysurf Mail stammail...@gmail.com wrote:
 To clarify: I did delete the data in the index and reloaded it (+ commit).
 (As I said, I have seen it loaded in the sb profiler)
[...]

Please share your DIH configuration file, and Solr's
schema.xml. It must be that somehow the column
is not getting indexed.

Regards,
Gora


Refering SOLRcore properties in zookeeper

2013-07-23 Thread sathish_ix
Hi,

I have uploaded the solrconfig.xml, db-data-config.xml, and solrcore.properties
(ABC.properties) files into ZooKeeper.

Below is my solr.xml file:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores defaultCoreName="ABC" adminPath="/admin/cores"
         zkClientTimeout="${zkClientTimeout:15000}" host="${host:}"
         hostPort="${port}" hostContext="${hostContext:}">

    <core loadOnStartup="true" instanceDir="ABC" transient="false"
          name="ABC" properties="$propfile">
      <property name="propfile" value="ABC.properties" />
    </core>

  </cores>
</solr>


While starting Solr, it was not able to recognize ABC.properties. Am I
doing this correctly?

Thanks,
Sathish




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Refering-SOLRcore-properties-in-zookeeper-tp4079639.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: adding date column to the index

2013-07-23 Thread Mysurf Mail
Ahaa
I deleted the data folder and now I get
Invalid Date String:'2010-01-01 00:00:00 +02:00'
I need to convert it to a Solr date. I read it in the schema using

<field name="LastModificationTime" type="date" indexed="false"
stored="true" required="true"/>


On Tue, Jul 23, 2013 at 10:50 AM, Gora Mohanty g...@mimirtech.com wrote:

 On 23 July 2013 11:13, Mysurf Mail stammail...@gmail.com wrote:
  To clarify: I did delete the data in the index and reloaded it (+ commit).
  (As I said, I have seen it loaded in the sb profiler)
 [...]

 Please share your DIH configuration file, and Solr's
 schema.xml. It must be that somehow the column
 is not getting indexed.

 Regards,
 Gora



Re: adding date column to the index

2013-07-23 Thread Mysurf Mail
How do I cast a datetimeoffset(7) to a Solr date?


On Tue, Jul 23, 2013 at 11:11 AM, Mysurf Mail stammail...@gmail.com wrote:

 Ahaa
 I deleted the data folder and now I get
 Invalid Date String:'2010-01-01 00:00:00 +02:00'
 I need to convert it to a Solr date. I read it in the schema using

 <field name="LastModificationTime" type="date" indexed="false"
 stored="true" required="true"/>


 On Tue, Jul 23, 2013 at 10:50 AM, Gora Mohanty g...@mimirtech.com wrote:

 On 23 July 2013 11:13, Mysurf Mail stammail...@gmail.com wrote:
  To clarify: I did delete the data in the index and reloaded it (+ commit).
  (As I said, I have seen it loaded in the sb profiler)
 [...]

 Please share your DIH configuration file, and Solr's
 schema.xml. It must be that somehow the column
 is not getting indexed.

 Regards,
 Gora
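
A note on the error above: Solr's date fields expect ISO-8601 UTC values such
as 2010-01-01T00:00:00Z, so a string like '2010-01-01 00:00:00 +02:00' must be
converted during import. A minimal sketch using DIH's DateFormatTransformer
(the query and the format pattern are assumptions; 'XXX' parses a +02:00 style
offset on Java 7):

<entity name="doc" transformer="DateFormatTransformer"
        query="select LastModificationTime, ... from MyDoc">
  <!-- parse '2010-01-01 00:00:00 +02:00' into a java.util.Date -->
  <field column="LastModificationTime" dateTimeFormat="yyyy-MM-dd HH:mm:ss XXX" />
</entity>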








Indexing Oracle Database in Solr using Data Import Handler

2013-07-23 Thread archit2112
I'm trying to index an Oracle Database 10g XE instance using Solr's Data Import Handler.

My data-config.xml looks like this

<dataConfig>
<dataSource type="JdbcDataSource" driver="oracle.jdbc.OracleDriver"
url="jdbc:oracle:thin:@XXX.XXX.XXX.XXX::xe" user="XX"
password="XX" />
<document name="product_info">
<entity name="product" query="select * from product">
<field column="pid" name="id" />
<field column="pname" name="itemName" />
<field column="initqty" name="itemQuantity" />
<field column="remQty" name="remQuantity" />
<field column="price" name="itemPrice" />
<field column="specification" name="specifications" />
</entity>
</document>
</dataConfig>

My schema.xml looks like this -

<field name="id" type="text_general" indexed="true" stored="true"
required="true" multiValued="false" />
<field name="itemName" type="text_general" indexed="true" stored="true"
multiValued="true" omitNorms="true" termVectors="true" />
<field name="itemQuantity" type="text_general" indexed="true"
stored="true" multiValued="true" omitNorms="true" termVectors="true" />
<field name="remQuantity" type="text_general" indexed="true"
stored="true" multiValued="true" omitNorms="true" termVectors="true" />
<field name="itemPrice" type="text_general" indexed="true" stored="true"
multiValued="true" omitNorms="true" termVectors="true" />
<field name="specifications" type="text_general" indexed="true"
stored="true" multiValued="true" omitNorms="true" termVectors="true" />
<field name="brand" type="text_general" indexed="true" stored="true"
multiValued="true" omitNorms="true" termVectors="true" />
<field name="itemCategory" type="text_general" indexed="true"
stored="true" multiValued="true" omitNorms="true" termVectors="true" />

Now, when I try to index it, Solr is not able to read the columns of the
table and indexing fails. It says that the document is missing the unique
key 'id', which, as you can see, is clearly present in the document. Also,
when such an exception is thrown the log generally shows exactly which
fields were picked up for the document; in this case, no fields are being
read at all.

But if I change my query, everything works perfectly. The modified
data-config.xml:

<dataConfig>
<dataSource name="db1" type="JdbcDataSource"
driver="oracle.jdbc.OracleDriver"
url="jdbc:oracle:thin:@XXX.XXX.XX.XX::xe" user=""
password="X" />
<document name="product_info">
<entity name="products" dataSource="db1" query="select pid as id, pname as
itemName, initqty as itemQuantity, remqty as remQuantity, price as itemPrice,
specification as specifications from product">
<field column="id" name="id" />
<field column="itemName" name="itemName" />
<field column="itemQuantity" name="itemQuantity" />
<field column="remQuantity" name="remQuantity" />
<field column="itemPrice" name="itemPrice" />
<field column="specifications" name="specifications" />
</entity>
</document>
</dataConfig>

Why is this happening? How do I solve it? How does giving an alias affect
the indexing process? Thanks in advance.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Oracle-Database-in-Solr-using-Data-Import-Handler-tp4079649.html
Sent from the Solr - User mailing list archive at Nabble.com.
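
A likely explanation, offered as an educated guess rather than a confirmed
diagnosis: Oracle returns unquoted column names in upper case, and DIH matches
columns to fields case-sensitively, so column="pid" never matches the PID key
in the result set, whereas the aliases in the second config produce keys with
the exact case given. If that is the cause, matching Oracle's upper-cased
names should also work without aliases:

<entity name="product" query="select * from product">
  <!-- assumption: Oracle upper-cases unquoted identifiers in the result set -->
  <field column="PID" name="id" />
  <field column="PNAME" name="itemName" />
  <!-- ... and so on for the remaining columns -->
</entity>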


Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Furkan KAMACI
Hi;

Sometimes a huge part of a document may exist in another document, as in
student plagiarism or quotation of a blog post in another blog post. Do
Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to
detect this?


Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Tommaso Teofili
Hi,

I think you may leverage and/or improve the MLT component [1].

HTH,
Tommaso

[1] : http://wiki.apache.org/solr/MoreLikeThis


2013/7/23 Furkan KAMACI furkankam...@gmail.com

 Hi;

 Sometimes a huge part of a document may exist in another document. As like
 in student plagiarism or quotation of a blog post at another blog post.
 Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class to
 detect it?
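
For reference, a minimal request to the MLT component mentioned above might
look like the following; the core, document id, and field name are
placeholders, not values from this thread:

http://localhost:8983/solr/select?q=id:1&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=1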



RE: Solr 4.1.0 not using solrcore.properties ?

2013-07-23 Thread sathish_ix
Hi ,
Can anyone help on how to refer to the solrcore.properties uploaded into
ZooKeeper?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-1-0-not-using-solrcore-properties-tp4040228p4079654.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Furkan KAMACI
Actually I need a specialized algorithm. I want to use that algorithm to
detect duplicate blog posts.

2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com

 Hi,

  I think you may leverage and/or improve the MLT component [1].

 HTH,
 Tommaso

 [1] : http://wiki.apache.org/solr/MoreLikeThis


 2013/7/23 Furkan KAMACI furkankam...@gmail.com

  Hi;
 
  Sometimes a huge part of a document may exist in another document. As
 like
  in student plagiarism or quotation of a blog post at another blog post.
  Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class to
  detect it?
 



problems about solr replication in 4.3

2013-07-23 Thread xiaoqi

hi all,

I have two Solr instances, one master and one replica; under version 3.5
they worked fine. When I upgraded to 4.3, I found that while the replica
copies the index from the master, it first cleans its current index and then
copies the new version into its own folder. The slave can't search during
this process!

I am new to Solr 4. Is this normal? Any ideas? Thanks!





--
View this message in context: 
http://lucene.472066.n3.nabble.com/problems-about-solr-replication-in-4-3-tp4079665.html
Sent from the Solr - User mailing list archive at Nabble.com.


facet.maxcount ?

2013-07-23 Thread Jérôme Étévé
Hi all happy Solr users!

I was wondering if it's possible to have some sort of facet.maxcount equivalent?

In short, that would exclude from the facet any term (or query) that
matches at least facet.maxcount times.

That facet.maxcount would probably significantly improve the
performance of requests of the type:

I want the facet values with zero result.

Is it a mad idea or does it make some sort of sense?

Cheers,

Jerome.

-- 
Jerome Eteve
+44(0)7738864546
http://www.eteve.net/


RE: facet.maxcount ?

2013-07-23 Thread Markus Jelsma
Hi - No but there are two unresolved issues about this topic:
https://issues.apache.org/jira/browse/SOLR-4411
https://issues.apache.org/jira/browse/SOLR-4411

Cheers
 
-Original message-
 From:Jérôme Étévé jerome.et...@gmail.com
 Sent: Tuesday 23rd July 2013 12:58
 To: solr-user@lucene.apache.org
 Subject: facet.maxcount ?
 
 Hi all happy Solr users!
 
 I was wondering if it's possible to have some sort of facet.maxcount 
 equivalent?
 
 In short, that would exclude from the facet any term (or query) that
 matches at least facet.maxcount times.
 
 That facet.maxcount would probably significantly improve the
 performance of request of the type:
 
 I want the facet values with zero result.
 
 Is it a mad idea or does it make some sort of sense?
 
 Cheers,
 
 Jerome.
 
 -- 
 Jerome Eteve
 +44(0)7738864546
 http://www.eteve.net/
 


RE: facet.maxcount ?

2013-07-23 Thread Markus Jelsma
Eeh, here's the other one: https://issues.apache.org/jira/browse/SOLR-1712
 
 
-Original message-
 From:Markus Jelsma markus.jel...@openindex.io
 Sent: Tuesday 23rd July 2013 13:18
 To: solr-user@lucene.apache.org
 Subject: RE: facet.maxcount ?
 
 Hi - No but there are two unresolved issues about this topic:
 https://issues.apache.org/jira/browse/SOLR-4411
 https://issues.apache.org/jira/browse/SOLR-4411
 
 Cheers
  
 -Original message-
  From:Jérôme Étévé jerome.et...@gmail.com
  Sent: Tuesday 23rd July 2013 12:58
  To: solr-user@lucene.apache.org
  Subject: facet.maxcount ?
  
  Hi all happy Solr users!
  
  I was wondering if it's possible to have some sort of facet.maxcount 
  equivalent?
  
  In short, that would exclude from the facet any term (or query) that
  matches at least facet.maxcount times.
  
  That facet.maxcount would probably significantly improve the
  performance of request of the type:
  
  I want the facet values with zero result.
  
  Is it a mad idea or does it make some sort of sense?
  
  Cheers,
  
  Jerome.
  
  -- 
  Jerome Eteve
  +44(0)7738864546
  http://www.eteve.net/
  
 


Appending *-wildcard suffix on all terms for querying: move logic from client to server side

2013-07-23 Thread Paul Blanchaert
My client has an installation with 3 different clients using the same Solr
index. These clients all append a * wildcard suffix in the query: user
enters abc def while search is performed against (abc* def*).
In order to move away from this way of searching, we'd like to wean the
clients off wildcard search when we implement a new index. However, at that
time, the client apps will still need this wildcard-suffix behaviour, so the
goal is to make the option of appending a * suffix (when not already
present) configurable on the server side.
I thought a tokenizer would do the work, but as the wildcard searches are
detected before analyzers do the work, this is not an option.
Can I enable this without coding? Or should I use a (custom) functionquery
or custom search handler?
Any thought is appreciated.


-
Kind regards,

Paul Blanchaert


Re: highlighting required in document

2013-07-23 Thread Dmitry Kan
You just need to specify the emphasizing tag in hl params by adding
something like this to your query:



hl.fl=content&hl.simple.pre=<b>&hl.simple.post=</b>

Check the query screen on the Solr admin page; it shows the constructed
query, so you don't need to guess!

Regards,

Dmitry


On Mon, Jul 22, 2013 at 10:31 AM, Jamshaid Ashraf jamshaid...@gmail.com wrote:

 Hi,

 I'm using Solr 4.3.0, and the following is the response to a hit-highlighting
 request:

 Request:
 http://localhost:8080/solr/collection2/select?q=content:ps4&hl=true

 Response:

 <doc>
  <arr name="content"><str>This post is regarding ps4 accuracy and quality
 which is smooth and fantastic</str></arr>
 </doc>
 <lst name="highlighting">
 <lst name="1">
  <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and
 quality which is smooth and fantastic</str></arr>
 </lst>

 I wanted the result like this:

 <doc>
  <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and
 quality which is smooth and fantastic</str></arr>
 </doc>
 <lst name="highlighting">
 <lst name="1">
  <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and
 quality which is smooth and fantastic</str></arr>
 </lst>

 Thanks in advance!

 Regards,
 Jamshaid



Re: highlighting required in document

2013-07-23 Thread Dmitry Kan
Ah, I think I misread your question. So your question is actually: how to
make Solr embed the highlighting into the doc response itself. I'm not aware
of such functionality; that's why you have the separate highlighting section
in your response.


On Tue, Jul 23, 2013 at 2:30 PM, Dmitry Kan solrexp...@gmail.com wrote:

 You just need to specify the emphasizing tag in hl params by adding
 something like this to your query:



  hl.fl=content&hl.simple.pre=<b>&hl.simple.post=</b>

 Check the solr admin page, the querying item, it shows the constructed
 query, so you don't need to guess!

 Regards,

 Dmitry



 On Mon, Jul 22, 2013 at 10:31 AM, Jamshaid Ashraf 
  jamshaid...@gmail.com wrote:

 Hi,

  I'm using Solr 4.3.0, and the following is the response to a hit-highlighting
  request:

 Request:
  http://localhost:8080/solr/collection2/select?q=content:ps4&hl=true

 Response:

  <doc>
   <arr name="content"><str>This post is regarding ps4 accuracy and quality
  which is smooth and fantastic</str></arr>
  </doc>
  <lst name="highlighting">
  <lst name="1">
   <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and
  quality which is smooth and fantastic</str></arr>
  </lst>

  I wanted the result like this:

  <doc>
   <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and
  quality which is smooth and fantastic</str></arr>
  </doc>
  <lst name="highlighting">
  <lst name="1">
   <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and
  quality which is smooth and fantastic</str></arr>
  </lst>

 Thanks in advance!

 Regards,
 Jamshaid
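
If the goal is to display the highlighted text in place of the stored field,
the usual approach is for the client to merge the highlighting section back
into the documents by unique key. A rough SolrJ sketch, assuming 'server' and
'query' are already set up and the field names match the schema above:

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// run the query with hl=true, then merge snippets back into each document
QueryResponse rsp = server.query(query);
Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
for (SolrDocument doc : rsp.getResults()) {
    String id = (String) doc.getFieldValue("id");
    if (hl.containsKey(id) && hl.get(id).containsKey("content")) {
        // replace the stored value with the first highlight snippet
        doc.setField("content", hl.get(id).get("content").get(0));
    }
}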





Re: facet.maxcount ?

2013-07-23 Thread Jérôme Étévé
Thanks!


On 23 July 2013 12:19, Markus Jelsma markus.jel...@openindex.io wrote:
 Eeh, here's the other one: https://issues.apache.org/jira/browse/SOLR-1712


 -Original message-
 From:Markus Jelsma markus.jel...@openindex.io
 Sent: Tuesday 23rd July 2013 13:18
 To: solr-user@lucene.apache.org
 Subject: RE: facet.maxcount ?

 Hi - No but there are two unresolved issues about this topic:
 https://issues.apache.org/jira/browse/SOLR-4411
 https://issues.apache.org/jira/browse/SOLR-4411

 Cheers

 -Original message-
  From:Jérôme Étévé jerome.et...@gmail.com
  Sent: Tuesday 23rd July 2013 12:58
  To: solr-user@lucene.apache.org
  Subject: facet.maxcount ?
 
  Hi all happy Solr users!
 
  I was wondering if it's possible to have some sort of facet.maxcount 
  equivalent?
 
  In short, that would exclude from the facet any term (or query) that
  matches at least facet.maxcount times.
 
  That facet.maxcount would probably significantly improve the
  performance of request of the type:
 
  I want the facet values with zero result.
 
  Is it a mad idea or does it make some sort of sense?
 
  Cheers,
 
  Jerome.
 
  --
  Jerome Eteve
  +44(0)7738864546
  http://www.eteve.net/
 




-- 
Jerome Eteve
+44(0)7738864546
http://www.eteve.net/


Re: how to improve (keyword) relevance?

2013-07-23 Thread Erick Erickson
Another thing I've seen people do is something like
text:(test AND pdf)^10 text:(test pdf).

so docs with both terms in the text field get boosted a lot, but docs
with either one will still get found.

But as Jack says, you have to demonstrate a problem before you propose
a solution.

You say "a lot of people are concerned about improving relevance". Just
get them to define a good set of search results. Bet they can't, except
by looking at specific result lists and saying "I like that one more
than this one". You gotta quantify this somehow, do A/B testing,
whatever, or you'll go mad.

Erick

On Mon, Jul 22, 2013 at 12:47 PM, Jack Krupansky
j...@basetechnology.com wrote:
 Again, you haven't indicated what the problem is. I mean, have you actually
 confirmed that a problem exists? Add debugQuery=true to your query and
 examine the explain section if you believe that Solr has improperly
 computed any document scores.

 If you simply want to boost a term in a query, use the ^ operator, which
 applies to the preceding term. a boost of 1.0 means no change, 2.0 means
 double, 0.5 means cut in half.

 But, you don't need to boost. Relevancy is based on the data in the
 documents themselves.

 BTW, q=text%3Atest+pdf does not search for pdf in the text field -
 field-qualification only applies to a single term, but you can use
 parentheses: q=text%3A(test+pdf)


 -- Jack Krupansky

 -Original Message- From: eShard
 Sent: Monday, July 22, 2013 12:34 PM
 To: solr-user@lucene.apache.org
 Subject: Re: how to improve (keyword) relevance?


 Sure, let's say the user types in "test pdf";
 we need the results with all the query words to be near the top of the
 result set.
 the query will look like this: /select?q=text%3Atest+pdf&wt=xml

 How do I ensure that the top resultset contains all of the query words?
 How can I boost the first (or second) term when they are both the same field
 (i.e. text)?

 Does this make sense?

 Please bear with me; I'm still new to the solr query syntax so I don't even
 know if I'm asking the right question.

 Thanks,



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-improve-keyword-relevance-tp4079462p4079502.html
 Sent from the Solr - User mailing list archive at Nabble.com.
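
One way to get "all query words near the top" without hand-building boolean
queries is the edismax parser, sketched here with assumed parameter values:
qf searches the text field, pf boosts documents where the terms appear close
together as a phrase, and mm=1 still admits single-term matches.

/select?defType=edismax&q=test+pdf&qf=text&pf=text^10&mm=1&wt=xml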


Re: IllegalStateException

2013-07-23 Thread Erick Erickson
There has been a _ton_ of work since 4.0, and 4.4 will be out in a day
or two. I suspect the best advice is to try 4.4...

Best
Erick

On Mon, Jul 22, 2013 at 2:54 PM, Michael Long ml...@mlong.us wrote:
 I'm seeing random crashes in solr 4.0 but I don't have anything to go on
 other than IllegalStateException. Other than checking for corrupt index
 and out of memory, what other things should I check?


 org.apache.catalina.core.StandardWrapperValve invoke
 SEVERE: Servlet.service() for servlet default threw exception
 java.lang.IllegalStateException
 at
 org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:407)
 at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:483)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
 at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
 at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
 at
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
 at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
 at java.lang.Thread.run(Thread.java:662)



Re: how to improve (keyword) relevance?

2013-07-23 Thread Otis Gospodnetic
To add to what Erick said, that *quantifying* is hugely important!
How do you measure your search relevance improvements?
How are you currently measuring it?
How will you see, after you apply any changes, whether relevance was
improved and how much?
How will you know whether, beyond the test queries you are using to evaluate
relevance, the end users also see the same sort of improvements, or
whether you improved your test queries but made no difference overall,
or maybe even made things worse?
...

Have a look at:
* http://www.slideshare.net/sematext/tag/analytics
* http://sematext.com/search-analytics/index.html - it's free, and we
regularly use it with our clients with great success

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Jul 23, 2013 at 7:50 AM, Erick Erickson erickerick...@gmail.com wrote:
 Another thing I've seen people do is something like
 text:(test AND pdf)^10 text:(test pdf).

 so docs with both terms in the text field get boosted a lot, but docs
 with either one will still get found.

 But as Jack says, you have to demonstrate a problem before you propose
 a solution.

 You say  a lot people are concerned about improving relevance.. Just
 get them to define a good set of search results. Bet they can't except
 by looking at specific result lists and saying I like that one more
 than this one. You gotta quantify this somehow, do A/B testing,
 whatever or you'll go mad.

 Erick

 On Mon, Jul 22, 2013 at 12:47 PM, Jack Krupansky
 j...@basetechnology.com wrote:
 Again, you haven't indicated what the problem is. I mean, have you actually
 confirmed that a problem exists? Add debugQuery=true to your query and
 examine the explain section if you believe that Solr has improperly
 computed any document scores.

 If you simply want to boost a term in a query, use the ^ operator, which
 applies to the preceding term. a boost of 1.0 means no change, 2.0 means
 double, 0.5 means cut in half.

 But, you don't need to boost. Relevancy is based on the data in the
 documents themselves.

 BTW, q=text%3Atest+pdf does not search for pdf in the text field -
  field-qualification only applies to a single term, but you can use
 parentheses: q=text%3A(test+pdf)


 -- Jack Krupansky

 -Original Message- From: eShard
 Sent: Monday, July 22, 2013 12:34 PM
 To: solr-user@lucene.apache.org
 Subject: Re: how to improve (keyword) relevance?


 Sure, let's say the user types in test pdf;
 we need the results with all the query words to be near the top of the
 result set.
  the query will look like this: /select?q=text%3Atest+pdf&wt=xml

 How do I ensure that the top resultset contains all of the query words?
 How can I boost the first (or second) term when they are both the same field
 (i.e. text)?

 Does this make sense?

 Please bear with me; I'm still new to the solr query syntax so I don't even
 know if I'm asking the right question.

 Thanks,



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-improve-keyword-relevance-tp4079462p4079502.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-23 Thread Erick Erickson
Neil:

Here's a must-read blog about why allocating more memory
to the JVM than Solr requires is a Bad Thing:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

It turns out that you actually do yourself harm by allocating more
memory to the JVM than it really needs. Of course the problem is
figuring out how much it really needs, which is pretty tricky.

Your long GC pauses _might_ be ameliorated by allocating _less_
memory to the JVM, counterintuitive as that seems.

Best
Erick

On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com wrote:
 I just have a little python script which I run with cron (luckily that's
 the granularity we have in Graphite). It reads the same JSON the admin UI
 displays and dumps numeric values into Graphite.

 I can open source it if you like. I just need to make sure I remove any
 hacks/shortcuts that I've taken because I'm working with our cluster!


 On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote:

 Are you feeding Graphite from Solr? If so, how?


 On 07/19/2013 01:02 AM, Neil Prosser wrote:

 That was overnight so I was unable to track exactly what happened (I'm
 going off our Graphite graphs here).





Re: Appending *-wildcard suffix on all terms for querying: move logic from client to server side

2013-07-23 Thread Mikhail Khludnev
It can be done by extending LuceneQParser/SolrQueryParser see
http://wiki.apache.org/solr/SolrPlugins#QParserPlugin
there is newTermQuery(Term) it should be overridden and delegate to
newPrefixQuery() method.
Overall, I suggest you consider to use EdgeNGramTokenFilter in index time,
and then search by plain termqueries.


On Tue, Jul 23, 2013 at 2:05 PM, Paul Blanchaert p...@amosis.eu wrote:

 My client has an installation with 3 different clients using the same Solr
 index. These clients all append a * wildcard suffix in the query: user
 enters abc def while search is performed against (abc* def*).
 In order to move away from this way of searching, we'd like to move the
 clients away from this wildcard search at the moment we implement a new
 index. However, at that time, the client apps will still need to use this
 wildcard suffix search. So the goal is to have the wildcard search option
 to append * suffix when not yet set configurable on server side.
 I thought a tokenizer would do the work, but as the wildcard searches are
 detected before analyzers do the work, this is not an option.
 Can I enable this without coding? Or should I use a (custom) functionquery
 or custom search handler?
 Any thought is appreciated.


 -
 Kind regards,

 Paul Blanchaert




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com
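
For the index-time EdgeNGram route suggested above, a minimal sketch of a
field type; the name and gram sizes are assumptions to adjust for your data:

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every prefix of each token so a plain term query behaves like term* -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>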


Re: Running Solr 4 on Sun vs OpenJDK JVM?

2013-07-23 Thread Otis Gospodnetic
Hi Cosimo,

Very simple: Oracle 1.7 is your best bet.  If you have a large heap
and are seeing STW pauses, try G1 - we've been using it and have been
happy with it.

Ciao,
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Jul 23, 2013 at 3:17 AM, Cosimo Streppone cos...@streppone.it wrote:
 Hi,

 do you have any advice on operating a Solr 4.0
 read-only instance with regards to the underlying
 JVM?

 In particular I'm wondering about stability and
 memory usage, but anything else you might add is welcome,
 when it comes to OpenJDK vs Sun/Oracle Hotspot,
 v6 vs v7.

 What are you running, what would you suggest and why?
 I tried searching for some information, and the old(?)
 run-on-sun-java6-jre tip is all I get back.

 This morning I also read that Oracle JDK v7 is
 apparently based on OpenJDK sources... so?

 Thanks,

 --
 Cosimo
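
For reference, the kind of startup flags involved might look like this; the
heap size is a placeholder to be tuned against your own measurements, not a
recommendation:

java -Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar start.jar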


Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-23 Thread Otis Gospodnetic
Hi,

On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson erickerick...@gmail.com wrote:
 Neil:

 Here's a must-read blog about why allocating more memory
 to the JVM than Solr requires is a Bad Thing:
 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

 It turns out that you actually do yourself harm by allocating more
 memory to the JVM than it really needs. Of course the problem is
  figuring out how much it really needs, which is pretty tricky.

 Your long GC pauses _might_ be ameliorated by allocating _less_
 memory to the JVM, counterintuitive as that seems.

or by using G1 :)

See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm


 On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com wrote:
 I just have a little python script which I run with cron (luckily that's
 the granularity we have in Graphite). It reads the same JSON the admin UI
 displays and dumps numeric values into Graphite.

 I can open source it if you like. I just need to make sure I remove any
 hacks/shortcuts that I've taken because I'm working with our cluster!


 On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote:

 Are you feeding Graphite from Solr? If so, how?


 On 07/19/2013 01:02 AM, Neil Prosser wrote:

 That was overnight so I was unable to track exactly what happened (I'm
 going off our Graphite graphs here).





Re: softCommit doesn't work - ?

2013-07-23 Thread Erick Erickson
First a minor nit. The server.add(doc, time) is a hard commit, not a soft one.

But the rest of it. When you add your 70 docs, do they all have the same id
(i.e. the uniqueKey field). If so, there will be only one document, the last
one since all the earlier ones will be overwritten.

Not quite sure why your first example doesn't work, it should. Although
killing the process before the commit completes will lose documents in the
uncommitted segments.

Best
Erick

On Mon, Jul 22, 2013 at 8:45 PM, tskom tsiedlac...@hotmail.co.uk wrote:
 Hi,

 I use solr 4.3.1.
 I tried to index about 70 documents using softCommit as below:
 
 SolrInputDocument doc = new SolrInputDocument();
 result = fillMetaData(request, doc); // custom one
 int softCommit = 1;
 solrServer.add(doc, softCommit);
 
 The process ran very fast, but there is nothing in the index, neither after 10 sec
 nor after restarting the server application.
 In the solr log I got something like that:
 2013-07-23 01:58:01,543 INFO
 [org.apache.solr.update.processor.LogUpdateProcessor]
 (http-127.0.0.1-8090-5) [collection1] webapp=/solr path=/update
 params={wt=javabin&version=2} {add=[Rep_CA_FairyCakes
 (1441307014244335616)]} 0 3
 2013-07-23 01:58:01,546 INFO  [org.apache.solr.update.UpdateHandler]
 (http-127.0.0.1-8090-5) start rollback{}
 2013-07-23 01:58:01,547 INFO  [org.apache.solr.update.DefaultSolrCoreState]
 (http-127.0.0.1-8090-5) Creating new IndexWriter...
 2013-07-23 01:58:01,547 INFO  [org.apache.solr.update.DefaultSolrCoreState]
 (http-127.0.0.1-8090-5) Waiting until IndexWriter is unused...
 core=collection1
 2013-07-23 01:58:01,547 INFO  [org.apache.solr.update.DefaultSolrCoreState]
 (http-127.0.0.1-8090-5) Rollback old IndexWriter... core=collection1
 2013-07-23 01:58:01,617 INFO  [org.apache.solr.core.SolrCore]
 (http-127.0.0.1-8090-5) SolrDeletionPolicy.onInit: commits:num=1

 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@C:\solr\data\index
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7ed1f882;
 maxCacheMB=48.0
 maxMergeSizeMB=4.0),segFN=segments_ew,generation=536,filenames=[_ah_Lucene41_0.tim,
 _9d.fdt, _a5.fdx, _ag_Lucene41_0.pos, _9l.si, _a7.nvd, _a0_Lucene41_0.pos,
 _ah_Lucene41_0.tip, _9d.fdx, _a5.fdt, _9r.fnm, _97_Lucene41_0.doc,
 _9k_Lucene41_0.tim, _a7.nvm, _ad.fnm, _9k_Lucene41_0.tip, _a9.fnm, _9g.nvm,
 _ao_Lucene41_0.tim, _ao_Lucene41_0.tip, _9i_Lucene41_0.doc, _a2.nvm,
 _az_Lucene41_0.tim, _az_Lucene41_0.tip, _af_Lucene41_0.pos, _9t.nvm,
 _9w.fnm, _9z.si, _a9_Lucene41_0.tim, _9h.fnm, _9g.nvd, _a9_Lucene41_0.tip,
 _9d_Lucene41_0.pos, _9t.nvd, _a3.fdx, _aw.nvm, _9i_Lucene41_0.pos, _98.fnm,
 _a3.fdt, _a8_Lucene41_0.tim, _am.nvd, _aw.nvd, _a8_Lucene41_0.tip, _9f.si,
 _ap.fdt, _ag.fdt, _au.fnm, _aq.nvm, _ap.fdx, _av.fdt, _a0.si,
 _ac_Lucene41_0.doc, _a9_Lucene41_0.doc, _at_Lucene41_0.doc, _9u.fdx,
 _9z.fnm, _9d.si, _af.nvd, _9j_Lucene41_0.doc, _9u.fdt, _ag.fdx, _9b.si,
 _af.nvm, _9q.fnm, _aw_Lucene41_0.tim, _aw_Lucene41_0.tip, _ao.fnm, _9f.fnm,
 _a1.fdt, _9l_Lucene41_0.pos, _ad_Lucene41_0.pos, _a1.fdx,
 _aa_Lucene41_0.tip, _aa_Lucene41_0.tim, _9j_Lucene41_0.pos, _a2.nvd,
 _aj.nvd, _9o.fnm, _am.fnm, _9t_Lucene41_0.doc, _av.fdx, _ab.fdt, _an.nvd,
 _at.nvd, _ao_Lucene41_0.doc, _al.fnm, _9e_Lucene41_0.doc, _ab.fdx, _9x.fnm,
 _aj.nvm, _at.nvm, _ai.fnm, _9a_Lucene41_0.tim, _ak.nvm, _a2_Lucene41_0.doc,
 _an.nvm, _ah.nvd, _aw.fnm, _al_Lucene41_0.doc, _9a_Lucene41_0.tip,
 _9f_Lucene41_0.tim, _aq.fnm, _ah.nvm, _9k.nvd, _9b.nvm, _9c.fnm,
 _9f_Lucene41_0.tip, _9y_Lucene41_0.pos, _ax_Lucene41_0.doc,
 _av_Lucene41_0.tip, _ar_Lucene41_0.tim, _9c.si, _av_Lucene41_0.tim, _9b.nvd,
 _ar_Lucene41_0.tip, _as_Lucene41_0.tip, _as_Lucene41_0.tim,
 _ae_Lucene41_0.pos, _9j.si, _9z.nvd, _9y_Lucene41_0.doc, _a6_Lucene41_0.doc,
 _9d_Lucene41_0.doc, _ao.nvd, _9m.fdx, _ac.fdx, _a6.si, _aa_Lucene41_0.doc,
 _9m.fdt, _ac.fdt, _a3_Lucene41_0.pos, _av_Lucene41_0.doc, _9k.nvm,
 _ay_Lucene41_0.pos, _9z.nvm, _ai_Lucene41_0.tim, _aq.si, _ap_Lucene41_0.pos,
 _ai_Lucene41_0.tip, _96.si, _ab_Lucene41_0.pos, _9e.fnm, _as_Lucene41_0.doc,
 _9h.si, _96.nvm, _96.nvd, _ae.fdt, _9f_Lucene41_0.pos, _a4.fdx, _ae.fdx,
 _a4.fdt, _9j.fnm, _9z_Lucene41_0.doc, _9p.nvm, _aw.si, _a8.nvm, _9p.nvd,
 _9s.fdx, _9v.fnm, _a8.nvd, _9f_Lucene41_0.doc, _9s.fdt, _a2.si, _ai.si,
 _9o_Lucene41_0.tip, _a3.si, _9o_Lucene41_0.tim, _aj_Lucene41_0.tip,
 _aj_Lucene41_0.tim, _99.si, _9k_Lucene41_0.pos, _97.fdt, _9w.fdx, _a5.si,
 _9s_Lucene41_0.pos, _9w.fdt, _aj.fnm, _97.fdx, _9p.fdx, _9t.fnm, _9j.fdx,
 _9j.fdt, _ar_Lucene41_0.pos, _au_Lucene41_0.doc, _9p_Lucene41_0.doc,
 _9a.fdx, _9j_Lucene41_0.tip, _9q.nvd, _at_Lucene41_0.tip, _an.si,
 _9j_Lucene41_0.tim, _at_Lucene41_0.tim, _ad.fdx, _az_Lucene41_0.doc,
 _ad.fdt, _9q.nvm, _9g.fdx, _ax_Lucene41_0.pos, _9r.fdt, _9g.fdt, _9r.fdx,
 _9a.fdt, _a7.si, _98.nvm, _au_Lucene41_0.tim, _ag.nvm, _az.si,
 _au_Lucene41_0.tip, _ag.nvd, _ao.nvm, _9o.fdx, _9q_Lucene41_0.tip, _ax.si,
 _9p_Lucene41_0.pos, 
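
For an explicit soft commit from SolrJ 4.x, a minimal sketch; the URL and
field values are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc1");
server.add(doc);
// arguments: waitFlush, waitSearcher, softCommit
server.commit(true, true, true);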

Re: Question about field boost

2013-07-23 Thread Erick Erickson
this isn't doing what you think.
title^10 content
is actually parsed as

text:title^10 text:content

where text is my default search field.

assuming title is a field. If you look a little
farther up the debug output you'll see that.

You probably want
title:content^100 or some such?

Erick

On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky j...@basetechnology.com wrote:
That means that for that document "china" occurs in the title, whereas
"snowden" is found in the document but not in the title.


 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Tuesday, July 23, 2013 12:52 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Question about field boost


 Is my reading correct that the boost is only applied on china but not
 snowden? How can that be?

 My query is: q=china+snowden&qf=title^10 content


 On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote:

 Thanks for your hint, Jack. Here is the debug results, which I'm having a
 hard deciphering (the two terms are china and snowden)...

 0.26839527 = (MATCH) sum of:
   0.26839527 = (MATCH) sum of:
 0.26757246 = (MATCH) max of:
   7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
 0.019873314 = queryWeight(content:china), product of:
   1.6649085 = idf(docFreq=46832, maxDocs=91058)
   0.01193658 = queryNorm
 0.039825942 = (MATCH) fieldWeight(content:china in 249), product
 of:
   4.8989797 = tf(termFreq(content:china)=24)
   1.6649085 = idf(docFreq=46832, maxDocs=91058)
   0.0048828125 = fieldNorm(field=content, doc=249)
   0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
 0.5836803 = queryWeight(title:china^10.0), product of:
   10.0 = boost
   4.8898454 = idf(docFreq=1861, maxDocs=91058)
   0.01193658 = queryNorm
 0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
   1.0 = tf(termFreq(title:china)=1)
   4.8898454 = idf(docFreq=1861, maxDocs=91058)
   0.09375 = fieldNorm(field=title, doc=249)
 8.2282536E-4 = (MATCH) max of:
   8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
 0.03407834 = queryWeight(content:snowden), product of:
   2.8549502 = idf(docFreq=14246, maxDocs=91058)
   0.01193658 = queryNorm
 0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product
 of:
   1.7320508 = tf(termFreq(content:snowden)=3)
   2.8549502 = idf(docFreq=14246, maxDocs=91058)
   0.0048828125 = fieldNorm(field=content, doc=249)


 On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky
 j...@basetechnology.comwrote:

 Maybe you're not doing anything wrong - other than having an artificial
 expectation of what the true relevance of your data actually is. Many
 factors go into relevance scoring. You need to look at all aspects of
 your
 data.

 Maybe your terms don't occur in your titles the way you think they do.

 Maybe you need a boost of 500 or more...

 Lots of potential maybes.

 Relevancy tuning is an art and craft, hardly a science.

 Step one: Know your data, inside and out.

 Use the debugQuery=true parameter on your queries and see how much of the
 score is dominated by your query terms in the non-title fields.

 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Monday, July 22, 2013 11:06 PM
 To: solr-user@lucene.apache.org
 Subject: Question about field boost


 Dear Solr experts:

 Here is my query:

  defType=dismax&q=term1+term2&qf=title^100 content

 Apparently (at least I thought) my intention is to boost the title field.
 While I'm getting some non-trivial results, I'm surprised that the
 documents with both term1 and term2 in title (I know such docs do exist
 in
 my repository) were not returned (or maybe ranked very low). The
 situation
 does not change even when I use much larger boost factors.

 What am I doing wrong?
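
A hedged suggestion for the requirement that both terms match while the title
is boosted: edismax's mm (minimum-should-match) parameter can require every
clause, and debugQuery=true shows how the query was actually parsed. The
values below are assumptions:

/select?defType=edismax&q=china+snowden&qf=title^10+content&mm=2&debugQuery=true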






solr - Deleting a row from the index, using the configuration files only.

2013-07-23 Thread Mysurf Mail
I am updating my Solr index using the deltaQuery and deltaImportQuery
attributes in data-config.xml.
In my condition I write:

where MyDoc.LastModificationTime > '${dataimporter.last_index_time}'

Then, after I add a row, I trigger an update using data-config.xml.

Now, sometimes I delete a row.
How can I implement this with the configuration files only
(without sending a delete REST command to Solr)?

Let's say my object is not deleted but its status is changed to "deleted".
I don't index that status field, as I want to hold only the live rows
(otherwise I could have just filtered on it).
Is there a way to do it?
Thanks.


Re: dataimporter, custom fields and parsing error

2013-07-23 Thread Andreas Owen
I have tried post.jar and it works when I set the literal.id in solrconfig.xml.
I can't pass the id with post.jar (-Dparams=literal.id=abc) because I get an
error: could not find or load main class .id=abc.
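
One guess, offered without certainty: that error suggests the shell split the
-D value apart, so quoting the whole property may help, e.g.:

java -Durl=http://localhost:8983/solr/update/extract -Dparams="literal.id=abc" -jar post.jar your-file.pdf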


On 20. Jul 2013, at 7:05 PM, Andreas Owen wrote:

 path was set text wasn't, but it doesn't make a difference. my importer says 
 1 row fetched, 0 docs processed, 0 docs skipped. i don't understand how it 
 can have 2 docs indexed with such a output.
 
 
 On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:
 
 Are the path and text fields set to stored in the schema.xml?
 
 
 On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen a...@conx.ch wrote:
 
 they are in my schema, path is typed correctly the others are default
 fields which already exist. all the other fields are populated and i can
 search for them, just path and text aren't.
 
 
 On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:
 
 Dumb question: they are in your schema? Spelled right, in the right
 section, using types also defined? Can you populate them by hand with a
 CSV
 file and post.jar?
 
 Regards,
 Alex.
 
 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
 On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote:
 
 i'm using solr 4.3 which i just downloaded today and am using only jars
 that came with it. i have enabled the dataimporter and it runs without
 error. but the field path (included in schema.xml) and text (file
 content) aren't indexed. what am i doing wrong?
 
 solr-path: C:\ColdFusion10\cfusion\jetty-new
 collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
 pdf-doc-path: C:\web\development\tkb\internet\public
 
 
 data-config.xml:
 
 <dataConfig>
   <dataSource type="BinFileDataSource" name="data"/>
   <dataSource type="BinURLDataSource" name="dataUrl"/>
   <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
   <document>
     <entity name="rec" processor="XPathEntityProcessor"
             url="docImportUrl.xml" forEach="/albums/album"
             dataSource="main"> <!-- transformer="script:GenerateId" -->
       <field column="title" xpath="//title" />
       <field column="id" xpath="//file" />
       <field column="path" xpath="//path" />
       <field column="Author" xpath="//author" />

       <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->

       <entity name="tika" processor="TikaEntityProcessor"
               url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}"
               dataSource="data">
         <field column="text" />
       </entity>
     </entity>
   </document>
 </dataConfig>
 
 
 docImportUrl.xml:
 
 <?xml version="1.0" encoding="utf-8"?>
 <albums>
   <album>
     <author>Peter Z.</author>
     <title>Beratungsseminar kundenbrief</title>
     <description>wie kommuniziert man</description>
     <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
     <path>download/online</path>
   </album>
   <album>
     <author>Marcel X.</author>
     <title>kuchen backen</title>
     <description>torten, kuchen, gebäck ...</description>
     <file>Kundenbrief.pdf</file>
     <path>download/online</path>
   </album>
 </albums>
 
 
 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.



zkHost in solr.xml goes missing after SPLITSHARD using Collections API

2013-07-23 Thread Ali, Saqib
Hello all,

Every time I issue a SPLITSHARD using Collections API, the zkHost attribute
in the solr.xml goes missing. I have to manually edit the solr.xml to add
zkHost after every SPLITSHARD.

Any thoughts on what could be causing this?

Thanks.


Start independent Zookeeper from within Solr install

2013-07-23 Thread Upayavira
Assumptions:

 * you currently have two choices to start Zookeeper: run it embedded
 within Solr, or download it from the ZooKeeper site and start it
 independently.
 * everything you need to run ZooKeeper (embedded or not) is included
 within the Solr distribution

Assuming I've got the above right, then currently starting an embedded
ZooKeeper is easy (-DzkRun), and starting an ensemble is irritatingly
complex.

So, my question is, how hard would it be to start Zookeeper without
Solr, but from within the Solr codebase? -DensembleOnly or some such,
causes Solr not to load, but Zookeeper still starts. I'm assuming that
Jetty would still listen on port 8983, but it wouldn't initialise the
Solr webapp:

java -DzkRun -DzkEnsembleOnly
-DzkHosts=zkhost01:9983,zkhost02:9983,zkhost03:9983 -jar start.jar

Is this possible? If it is, I'm happy to have a go at making it happen.

Upayavira
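
For comparison, the standalone route needs little more than a zoo.cfg per
node. A minimal sketch of an ensemble config; paths, ports, and host names
are placeholders:

tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zkhost01:2888:3888
server.2=zkhost02:2888:3888
server.3=zkhost03:2888:3888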




filter query result by user

2013-07-23 Thread Mysurf Mail
I want to restrict the returned results to only the documents that were
created by the user.
I load the CreatedBy attribute into the index and set it to indexed="false",
stored="true":

<field name="CreatedBy" type="string" indexed="false" stored="true"
required="true"/>

Then I want to filter by CreatedBy, so in the dashboard I check edismax and
add CreatedBy:user1 to the qf field.

The resulting query is

http://
:8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1

Nothing is filtered; all rows are returned.
What am I doing wrong?


Re: filter query result by user

2013-07-23 Thread Raymond Wiker
Simple: the field needs to be indexed in order to search (or filter) on
it.


On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com wrote:

 I want to restrict the returned results to be only the documents that were
 created by the user.
 I then load to the index the createdBy attribute and set it to index
 false,stored=true

  <field name="CreatedBy" type="string" indexed="false" stored="true"
  required="true"/>

 then in the I want to filter by CreatedBy so I use the dashboard, check
 edismax and add
 I check edismax and add CreatedBy:user1 to the qf field.


 the result query is

 http://
  :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1

 Nothing is filtered. all rows returned.
 What was I doing wrong?



Re: filter query result by user

2013-07-23 Thread Otis Gospodnetic
Moreover, you may want to use fq=CreatedBy:user1 for filtering.

Otis
--
Solr  ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Jul 23, 2013 at 9:28 AM, Raymond Wiker rwi...@gmail.com wrote:
 Simple: the field needs to be indexed in order to search (or filter) on
 it.


 On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com wrote:

 I want to restrict the returned results to be only the documents that were
 created by the user.
 I then load to the index the createdBy attribute and set it to index
 false,stored=true

  <field name="CreatedBy" type="string" indexed="false" stored="true"
  required="true"/>

 then in the I want to filter by CreatedBy so I use the dashboard, check
 edismax and add
 I check edismax and add CreatedBy:user1 to the qf field.


 the result query is

 http://
  :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1

 Nothing is filtered. all rows returned.
 What was I doing wrong?



Re: filter query result by user

2013-07-23 Thread Mysurf Mail
But I don't want it to be searched on.

Let's say the user name is giraffe.
I do want the filter to be: where CreatedBy = giraffe,
but when a user searches for the name, I want only documents with the name
Giraffe.
Since it is indexed, wouldn't it return all rows created by him?
Thanks.



On Tue, Jul 23, 2013 at 4:28 PM, Raymond Wiker rwi...@gmail.com wrote:

 Simple: the field needs to be indexed in order to search (or filter) on
 it.


 On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com
 wrote:

  I want to restrict the returned results to be only the documents that
 were
  created by the user.
  I then load to the index the createdBy attribute and set it to index
  false,stored=true
 
   <field name="CreatedBy" type="string" indexed="false" stored="true"
   required="true"/>
 
  then in the I want to filter by CreatedBy so I use the dashboard, check
  edismax and add
  I check edismax and add CreatedBy:user1 to the qf field.
 
 
  the result query is
 
  http://
   :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1
 
  Nothing is filtered. all rows returned.
  What was I doing wrong?
 



Re:

2013-07-23 Thread Gary Young
Can anyone remove this spammer please?


On Tue, Jul 23, 2013 at 4:47 AM, wired...@yahoo.com wrote:


 Hi!   http://mackieprice.org/cbs.com.network.html




Re: filter query result by user

2013-07-23 Thread Mysurf Mail
I am probably using it wrong.
http://
...:8983/solr/vault10k/select?q=*%3A*&defType=edismax&qf=CreatedBy%BLABLA
returns all rows.
It ignores my qf filter.

Should I even use qf for filtering with edismax?
(It doesn't say that in the doc:
http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29)



On Tue, Jul 23, 2013 at 4:32 PM, Mysurf Mail stammail...@gmail.com wrote:

 But I dont want it to be searched.on

 lets say the user name is giraffe
 I do want to filter to be where created by = giraffe

 but when the user searches his name, I will want only documents with name
 Giraffe.
 since it is indexed, wouldn't it return all rows created by him?
 Thanks.



 On Tue, Jul 23, 2013 at 4:28 PM, Raymond Wiker rwi...@gmail.com wrote:

 Simple: the field needs to be indexed in order to search (or filter) on
 it.


 On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com
 wrote:

  I want to restrict the returned results to be only the documents that
 were
  created by the user.
  I then load to the index the createdBy attribute and set it to index
  false,stored=true
 
   <field name="CreatedBy" type="string" indexed="false" stored="true"
   required="true"/>
 
  then in the I want to filter by CreatedBy so I use the dashboard,
 check
  edismax and add
  I check edismax and add CreatedBy:user1 to the qf field.
 
 
  the result query is
 
  http://
   :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1
 
  Nothing is filtered. all rows returned.
  What was I doing wrong?
 





Re: filter query result by user

2013-07-23 Thread Otis Gospodnetic
Hi,

Use fq, not qf.  It needs to be indexed.  Filtering is like searching
without scoring.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Tue, Jul 23, 2013 at 9:39 AM, Mysurf Mail stammail...@gmail.com wrote:
 I am probably using it wrong.
 http://
  ...:8983/solr/vault10k/select?q=*%3A*&defType=edismax&qf=CreatedBy%BLABLA
 returns all rows.
 It neglects my qf filter.

 Should I even use qf for filtrering with edismax?
 (It doesnt say that in the doc
 http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29)



 On Tue, Jul 23, 2013 at 4:32 PM, Mysurf Mail stammail...@gmail.com wrote:

 But I dont want it to be searched.on

 lets say the user name is giraffe
 I do want to filter to be where created by = giraffe

 but when the user searches his name, I will want only documents with name
 Giraffe.
 since it is indexed, wouldn't it return all rows created by him?
 Thanks.



 On Tue, Jul 23, 2013 at 4:28 PM, Raymond Wiker rwi...@gmail.com wrote:

 Simple: the field needs to be indexed in order to search (or filter) on
 it.


 On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com
 wrote:

  I want to restrict the returned results to be only the documents that
 were
  created by the user.
  I then load to the index the createdBy attribute and set it to index
  false,stored=true
 
   <field name="CreatedBy" type="string" indexed="false" stored="true"
   required="true"/>
 
  then in the I want to filter by CreatedBy so I use the dashboard,
 check
  edismax and add
  I check edismax and add CreatedBy:user1 to the qf field.
 
 
  the result query is
 
  http://
   :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1
 
  Nothing is filtered. all rows returned.
  What was I doing wrong?
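
A hedged example of what the corrected request might look like, with the
field re-indexed as indexed="true" and the user value as a placeholder:

http://localhost:8983/solr/vault/select?q=*%3A*&fq=CreatedBy%3Auser1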
 





Re: filter query result by user

2013-07-23 Thread Jack Krupansky
There is no such thing as a "qf filter" - qf is simply a list of names of
fields to search for the terms from the query, q, as well as boost
factors. Filtering is done with filter queries - fq.


-- Jack Krupansky

-Original Message- 
From: Mysurf Mail

Sent: Tuesday, July 23, 2013 9:39 AM
To: solr-user@lucene.apache.org
Subject: Re: filter query result by user

I am probably using it wrong.
http://
...:8983/solr/vault10k/select?q=*%3A*&defType=edismax&qf=CreatedBy%BLABLA
returns all rows.
It neglects my qf filter.

Should I even use qf for filtrering with edismax?
(It doesnt say that in the doc
http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29)



On Tue, Jul 23, 2013 at 4:32 PM, Mysurf Mail stammail...@gmail.com wrote:


But I dont want it to be searched.on

lets say the user name is giraffe
I do want to filter to be where created by = giraffe

but when the user searches his name, I will want only documents with name
Giraffe.
since it is indexed, wouldn't it return all rows created by him?
Thanks.



On Tue, Jul 23, 2013 at 4:28 PM, Raymond Wiker rwi...@gmail.com wrote:


Simple: the field needs to be indexed in order to search (or filter) on
it.


On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com
wrote:

 I want to restrict the returned results to be only the documents that
were
 created by the user.
 I then load to the index the createdBy attribute and set it to index
 false,stored=true

 <field name="CreatedBy" type="string" indexed="false" stored="true"
 required="true"/>

 then in the I want to filter by CreatedBy so I use the dashboard,
check
 edismax and add
 I check edismax and add CreatedBy:user1 to the qf field.


 the result query is

 http://
  :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1

 Nothing is filtered. all rows returned.
 What was I doing wrong?









Re: zkHost in solr.xml goes missing after SPLITSHARD using Collections API

2013-07-23 Thread Alan Woodward
Can you try upgrading to the just-released 4.4?  Solr.xml persistence had all 
kinds of bugs in 4.3, which should have been fixed now.

Alan Woodward
www.flax.co.uk


On 23 Jul 2013, at 13:36, Ali, Saqib wrote:

 Hello all,
 
 Every time I issue a SPLITSHARD using Collections API, the zkHost attribute
 in the solr.xml goes missing. I have to manually edit the solr.xml to add
 zkHost after every SPLITSHARD.
 
 Any thoughts on what could be causing this?
 
 Thanks.



Re: solr - Deleting a row from the index, using the configuration files only.

2013-07-23 Thread Alexandre Rafalovitch
Did you look at:
*) $deleteDocById
*) $deleteDocByQuery
*) deletedPkQuery

Just search for delete on https://wiki.apache.org/solr/DataImportHandler

If you tried all of those, maybe you need to explain your problem in more
specific details.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Jul 23, 2013 at 8:18 AM, Mysurf Mail stammail...@gmail.com wrote:

 I am updating my solr index using deltaQuery and deltaImportQuery
 attributes in data-config.xml.
 In my condition I write

  where MyDoc.LastModificationTime > '${dataimporter.last_index_time}'
 then after I add a row I trigger an update using data-config.xml.

 Now, sometimes I delete a row.
 How can I implement this with configuration files only
 (without sending a delete rest command to solr ).

 Lets say my object is not deleted but its status is changed to deleted.
 I dont index that status field, as I want to hold only the live rows.
 (otherwise I could have just filtered it)
 Is there a way to do it?
 thanks.
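
A minimal sketch of the deletedPkQuery approach for the soft-delete case
described above; the table, column, and status values are assumptions:

<entity name="doc" pk="id"
        query="select * from MyDoc where Status != 'deleted'"
        deltaQuery="select id from MyDoc
                    where LastModificationTime > '${dataimporter.last_index_time}'"
        deletedPkQuery="select id from MyDoc
                        where Status = 'deleted'
                        and LastModificationTime > '${dataimporter.last_index_time}'">
  ...
</entity>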



Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Jack Krupansky
One classic approach is to simply use the full text of the suspect text, as
well as bigrams and trigrams (phrases) from that text, with OR operators.
The top results will be the documents that most closely match the suspect
text; that gives you a set of visibly similar results. You will then have to
apply some heuristic of your own as to how many top results to look at
or what score to cut off at. The use of OR operators assures that similar
documents will be found even if not 100% of the words are used. Yes, OR
guarantees that your total result count will be high, but scoring assures
that the top results will be more relevant.


-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Tuesday, July 23, 2013 6:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene

Actually I need a specialized algorithm. I want to use that algorithm to
detect duplicate blog posts.

2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com


Hi,

 I think you may leverage and/or improve the MLT component [1].

HTH,
Tommaso

[1] : http://wiki.apache.org/solr/MoreLikeThis


2013/7/23 Furkan KAMACI furkankam...@gmail.com

 Hi;

 Sometimes a huge part of a document may exist in another document. As
like
 in student plagiarism or quotation of a blog post at another blog post.
 Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class to
 detect it?
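
A rough sketch of the OR-query idea above in SolrJ terms: build unigram and
bigram clauses from the suspect text and let scoring rank the closest
documents. The field name and whitespace tokenization are simplifying
assumptions:

import org.apache.solr.client.solrj.util.ClientUtils;

String[] tokens = suspectText.toLowerCase().split("\\s+");
StringBuilder q = new StringBuilder();
for (int i = 0; i < tokens.length; i++) {
    // single-word clause
    q.append("content:").append(ClientUtils.escapeQueryChars(tokens[i])).append(' ');
    // bigram phrase clause
    if (i + 1 < tokens.length) {
        q.append("content:\"").append(tokens[i]).append(' ')
         .append(tokens[i + 1]).append("\" ");
    }
}
// with default OR semantics, documents matching more clauses score higher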






Re: how number of indexed fields effect performance

2013-07-23 Thread Alexandre Rafalovitch
Do you need all of the fields loaded every time and are they stored? Maybe
there is a document with gigantic content that you don't actually need but
it gets deserialized anyway. Try lazy loading
setting: enableLazyFieldLoading in solrconfig.xml

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Jul 23, 2013 at 12:36 AM, Jack Krupansky j...@basetechnology.comwrote:

 After restarting Solr and doing a couple of queries to warm the caches,
 are queries already slow/failing, or does it take some time and a number of
 queries before failures start occurring?

 One possibility is that you just need a lot more memory for caches for
 this amount of data. So, maybe the failures are caused by heavy garbage
 collections. So, after restarting Solr, check how much Java heap is
 available, then do some warming queries, then check the Java heap available
 again.

 Add the debugQuery=true parameter to your queries and look at the timings
 to see what phases of query processing are taking the most time. Also check
 whether the reported QTime seems to match actual wall clock time; sometimes
 formatting of the results and network transfer time can dwarf actual query
 time.
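For example (core name and query illustrative):

http://localhost:8983/solr/collection1/select?q=foo&rows=10&debugQuery=true

The "timing" section in the debug output breaks QTime down per search
component (query, facet, highlight, ...), which usually shows where the
time is going.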

 How many fields are you returning on a typical query?


 -- Jack Krupansky


 -Original Message- From: Suryansh Purwar
 Sent: Monday, July 22, 2013 11:06 PM
 To: solr-user@lucene.apache.org ; j...@basetechnology.com

 Subject: how number of indexed fields effect performance

 It was running fine initially when we just had around 100 fields
 indexed. In this case as well it runs fine, but after some time a broken
 pipe exception starts appearing, which results in the shard going down.

 Regards,
 Suryansh



 On Tuesday, July 23, 2013, Jack Krupansky wrote:

  Was all of this running fine previously and only started running slow
 recently, or is this your first measurement?

 Are very simple queries (single keyword, no filters or facets or sorting
 or anything else, and returning only a few fields) working reasonably
 well?

 -- Jack Krupansky

 -Original Message- From: Suryansh Purwar
 Sent: Monday, July 22, 2013 4:07 PM
 To: solr-user@lucene.apache.org
 Subject: how number of indexed fields effect performance

 Hi,

 We have a two shard solrcloud cluster with each shard allocated 3 separate
 machines. We do complex queries involving a number of filter queries
 coupled with group queries and faceting. All of our machines are 64 bit
 with 32 GB RAM. Our index size is around 10 GB with around 800,000
 documents. We have around 1000 indexed fields per document. 6 GB of memory
 is allocated to tomcat under which solr is running  on each of the six
 machines. We have a zookeeper ensemble consisting of 3 zookeeper instances
 running on 3 of the six machines with 4gb memory allocated to each of the
 zookeeper instance. First solr start taking too much time with Broken
 pipe
 exception because of timeout from client side coming again and again,
 then
 after sometime a whole shard goes down with one machine at at time
 followed
 by other machines.  Is having 1000 fields indexed with each document
 resulting in this problem? If it is so, what would be the ideal number of
 indexed fields in such environment.

 Regards,
 Suryansh





Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shawn Heisey
On 7/23/2013 3:33 AM, Furkan KAMACI wrote:
 Sometimes a huge part of a document may exist in another document. As like
 in student plagiarism or quotation of a blog post at another blog post.
 Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class to
 detect it?

Solr is designed for search, not heavy analysis.  It might be possible,
as Tommaso suggested, to take the MoreLikeThis functionality from Solr
and adapt it to this use case, but this isn't really something Solr was
designed to do.

If you did use MoreLikeThis out of the box, the most it could do is show
you similar documents to a specific document, but then you'd have to do
your own actual comparison.  Solr would not be able to tell you whether
it's copied, just that it's similar.  Also, it would not be able to
easily and quickly do a full comparison across a huge number of documents.

You'd be much better off with a tool specifically designed for the
purpose.  Perhaps Solr's MoreLikeThis capability would be something you
could use in creating such a tool, but I couldn't say.

Thanks,
Shawn



Collection not current after insert

2013-07-23 Thread Alistair Young
Hi there,

My Solr is being fed by Fedora GSearch and when uploading a new resource, the 
Collection is optimized but not current so the new resource can't be found. I 
have to go to the Core Admin page and Optimize it from there, in order to make 
the collection current. Is there anything I should look for to see what the 
problem is? This is the comms to solr when inserting:

DEBUG 2013-07-23 13:27:37,023 (OperationsImpl) resultXml =
<solrUpdateIndex indexName="FgsIndex">
<inserted>ltk:13000116</inserted>
<counts insertTotal="1" updateTotal="0" deleteTotal="0" emptyTotal="0" 
docCount="854" warnCount="0"/>
</solrUpdateIndex>

DEBUG 2013-07-23 13:27:37,023 (GTransformer) 
xsltName=fgsconfigFinal/index/FgsIndex/updateIndexToResultPage
DEBUG 2013-07-23 13:27:37,027 (GTransformer) getTransformer 
transformer=org.apache.xalan.transformer.TransformerImpl@6561b973 
uriResolver=null
DEBUG 2013-07-23 13:27:37,028 (GenericOperationsImpl) resultXml=<?xml 
version="1.0" encoding="UTF-8"?>
<resultPage operation="updateIndex" action="fromPid" value="ltk:13000116" 
repositoryName="FgsRepos" indexNames="" resultPageXslt="" dateTime="Tue Jul 23 
13:27:36 UTC 2013">
<updateIndex xmlns:dc="http://purl.org/dc/elements/1.1/" 
xmlns:foxml="info:fedora/fedora-system:def/foxml#" 
xmlns:zs="http://www.loc.gov/zing/srw/" warnCount="0" docCount="854" 
deleteTotal="0" updateTotal="0" insertTotal="1" indexName="FgsIndex"/>
</resultPage>

INFO 2013-07-23 13:27:37,028 (UpdateListener) Index updated by notification 
message, returning:
<?xml version="1.0" encoding="UTF-8"?>
<resultPage operation="updateIndex" action="fromPid" value="ltk:13000116" 
repositoryName="FgsRepos" indexNames="" resultPageXslt="" dateTime="Tue Jul 23 
13:27:36 UTC 2013">
<updateIndex xmlns:dc="http://purl.org/dc/elements/1.1/" 
xmlns:foxml="info:fedora/fedora-system:def/foxml#" 
xmlns:zs="http://www.loc.gov/zing/srw/" warnCount="0" docCount="854" 
deleteTotal="0" updateTotal="0" insertTotal="1" indexName="FgsIndex"/>
</resultPage>

thanks,

Alistair

--
mov eax,1
mov ebx,0
int 80h


Re: deserializing highlighting json result

2013-07-23 Thread Jack Krupansky
The JSON keys within the highlighting object are the document IDs, and 
then the keys within those objects are the highlighted field names.


Again, I repeat my question: Exactly why is it difficult to deserialize? 
Seems simple enough.


-- Jack Krupansky

-Original Message- 
From: Mysurf Mail

Sent: Tuesday, July 23, 2013 1:48 AM
To: solr-user@lucene.apache.org
Subject: Re: deserializing highlighting json result

the guid appears as the attribute name, not as the value of an id
attribute, i.e. there is no

"id":"baf8434a-99a4-4046-8a4d-2f7ec09eafc8"

Trying to create an object that holds this guid will create an attribute
with the name baf8434a-99a4-4046-8a4d-2f7ec09eafc8

On Mon, Jul 22, 2013 at 6:30 PM, Jack Krupansky 
j...@basetechnology.comwrote:



Exactly why is it difficult to deserialize? Seems simple enough.

-- Jack Krupansky

-Original Message- From: Mysurf Mail Sent: Monday, July 22, 2013
11:14 AM To: solr-user@lucene.apache.org Subject: deserializing
highlighting json result
When I request a json result I get the following streucture in the
highlighting

{"highlighting":{
  "394c65f1-dfb1-4b76-9b6c-2f14c9682cc9":{
    "PackageName":["- <em>Testing</em> channel twenty."]},
  "baf8434a-99a4-4046-8a4d-2f7ec09eafc8":{
    "PackageName":["- <em>Testing</em> channel twenty."]},
  "0a699062-cd09-4b2e-a817-330193a352c1":{
    "PackageName":["- <em>Testing</em> channel twenty."]},
  "0b9ec891-5ef8-4085-9de2-38bfa9ea327e":{
    "PackageName":["- <em>Testing</em> channel twenty."]}}}


It is difficult to deserialize this json because the guid is in the
attribute name.
Is that solvable (using C#)?





Re: Appending *-wildcard suffix on all terms for querying: move logic from client to server side

2013-07-23 Thread Paul Blanchaert
Thanks Mikhail,

I'll go for your EdgeNGramTokenFilter suggestion.
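For anyone finding this thread later, a typical shape for that analyzer is
index-time-only edge n-grams (a sketch; gram sizes need tuning):

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With the grams only on the index side, a plain query term matches any
indexed word starting with those letters, so no trailing-* wildcard is
needed at query time.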

-
Kind regards,

Paul


Re: Start independent Zookeeper from within Solr install

2013-07-23 Thread Timothy Potter
Curious what the use case is for this? Zookeeper is not an HTTP
service so loading it in Jetty by itself doesn't really make sense. I
also think this creates more work for the Solr team especially since
setting up a production ensemble shouldn't take more than a few
minutes once you have the nodes provisioned.

On Tue, Jul 23, 2013 at 7:05 AM, Upayavira u...@odoko.co.uk wrote:
 Assumptions:

  * you currently have two choices to start Zookeeper: run it embedded
  within Solr, or download it from the ZooKeeper site and start it
  independently.
  * everything you need to run ZooKeeper (embedded or not) is included
  within the Solr distribution

 Assuming I've got the above right, then currently starting an embedded
 ZooKeeper is easy (-DzkRun), and starting an ensemble is irritatingly
 complex.

 So, my question is, how hard would it be to start Zookeeper without
 Solr, but from within the Solr codebase? -DensembleOnly or some such,
 causes Solr not to load, but Zookeeper still starts. I'm assuming that
 Jetty would still listen on port 8983, but it wouldn't initialise the
 Solr webapp:

 java -DzkRun -DzkEnsembleOnly
 -DzkHosts=zkhost01:9983,zkhost02:9983,zkhost03:9983 -jar start.jar

 Is this possible? If it is, I'm happy to have a go at making it happen.

 Upayavira




Re: zkHost in solr.xml goes missing after SPLITSHARD using Collections API

2013-07-23 Thread Shawn Heisey
On 7/23/2013 7:50 AM, Alan Woodward wrote:
 Can you try upgrading to the just-released 4.4?  Solr.xml persistence had all 
 kinds of bugs in 4.3, which should have been fixed now.

The 4.4.0 release has been finalized and uploaded, but the download link
hasn't been changed yet because the mirror network isn't fully
synchronized yet.  It is available from many mirrors, but until the
website download links get changed, there's not yet a direct way to
access it.

Here's some generic instructions for situations where the new version is
done, but the official announcement isn't out yet:

http://lucene.apache.org/solr/

1) Go to the Solr website (URL above) and click on the latest version
download button, which at this moment is 4.3.1.  Wait for the redirect
to take you to a mirror list.

2) Click on one of the mirrors, the best option is usually the one right
on top that the website chose for you.

3) When the file list comes up, click the Parent Directory link.  If
this isn't showing, it will most likely be labelled with .. instead.

4) If a directory for the new version (in this case 4.4.0) is listed,
click on it and then click the file that you want to download.

If the new version is not listed, click the Back button on your browser
twice, then go back to step 2, but this time choose a different mirror.

One last reminder: This only works right before a release is officially
announced.  These instructions cannot be used while a release is still
in development.

Thanks,
Shawn



Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Tommaso Teofili
if you need a specialized algorithm for detecting blogposts plagiarism /
quotations (which are different tasks IMHO) I think you have 2 options:
1. implement a dedicated one based on your features / metrics / domain
2. try to fine tune an existing algorithm that is flexible enough

If I were to do it with Solr I'd probably do something like:
1. index original blogposts in Solr (possibly using Jack's suggestion
about ngrams / shingles)
2. do MLT queries with candidate blogposts copies text
3. get the first, say, 2-3 hits
4. mark it as quote / plagiarism
5. eventually train a classifier to help you mark other texts as quote /
plagiarism
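Assuming an /mlt request handler is configured and the post text lives in a
"content" field, step 2 could be as simple as (parameter values
illustrative):

http://localhost:8983/solr/collection1/mlt?q=id:candidate-post-1&mlt.fl=content&mlt.mintf=2&mlt.mindf=5&rows=3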

HTH,
Tommaso



2013/7/23 Furkan KAMACI furkankam...@gmail.com

 Actually I need a specialized algorithm. I want to use that algorithm to
 detect duplicate blog posts.

 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com

  Hi,
 
  I think you can leverage and/or improve the MLT component [1].
 
  HTH,
  Tommaso
 
  [1] : http://wiki.apache.org/solr/MoreLikeThis
 
 
  2013/7/23 Furkan KAMACI furkankam...@gmail.com
 
   Hi;
  
   Sometimes a huge part of a document may exist in another document. As
  like
   in student plagiarism or quotation of a blog post at another blog post.
   Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class
 to
   detect it?
  
 



Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Furkan KAMACI
Thanks for your comments.

2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com

 if you need a specialized algorithm for detecting blogposts plagiarism /
 quotations (which are different tasks IMHO) I think you have 2 options:
 1. implement a dedicated one based on your features / metrics / domain
 2. try to fine tune an existing algorithm that is flexible enough

 If I were to do it with Solr I'd probably do something like:
 1. index original blogposts in Solr (possibly using Jack's suggestion
 about ngrams / shingles)
 2. do MLT queries with candidate blogposts copies text
 3. get the first, say, 2-3 hits
 4. mark it as quote / plagiarism
 5. eventually train a classifier to help you mark other texts as quote /
 plagiarism

 HTH,
 Tommaso



 2013/7/23 Furkan KAMACI furkankam...@gmail.com

  Actually I need a specialized algorithm. I want to use that algorithm to
  detect duplicate blog posts.
 
  2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com
 
   Hi,
  
   I think you can leverage and/or improve the MLT component [1].
  
   HTH,
   Tommaso
  
   [1] : http://wiki.apache.org/solr/MoreLikeThis
  
  
   2013/7/23 Furkan KAMACI furkankam...@gmail.com
  
Hi;
   
Sometimes a huge part of a document may exist in another document. As
   like
in student plagiarism or quotation of a blog post at another blog
 post.
Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class
  to
detect it?
   
  
 



WikipediaTokenizer for Removing Unnecesary Parts

2013-07-23 Thread Furkan KAMACI
Hi;

I have indexed Wikipedia data with Solr DIH. However, when I look at the
data that is indexed in Solr I also see things like this:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}

However, I want to remove them before indexing. I know that there is a
WikipediaTokenizer in Lucene, but how can I remove unnecessary parts (such
as links, style, etc.) with Solr?


Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shashi Kant
Here is a paper that I found useful:
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf


On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 Thanks for your comments.

 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com

 if you need a specialized algorithm for detecting blogposts plagiarism /
 quotations (which are different tasks IMHO) I think you have 2 options:
 1. implement a dedicated one based on your features / metrics / domain
 2. try to fine tune an existing algorithm that is flexible enough

 If I were to do it with Solr I'd probably do something like:
 1. index original blogposts in Solr (possibly using Jack's suggestion
 about ngrams / shingles)
 2. do MLT queries with candidate blogposts copies text
 3. get the first, say, 2-3 hits
 4. mark it as quote / plagiarism
 5. eventually train a classifier to help you mark other texts as quote /
 plagiarism

 HTH,
 Tommaso



 2013/7/23 Furkan KAMACI furkankam...@gmail.com

  Actually I need a specialized algorithm. I want to use that algorithm to
  detect duplicate blog posts.
 
  2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com
 
   Hi,
  
   I think you can leverage and/or improve the MLT component [1].
  
   HTH,
   Tommaso
  
   [1] : http://wiki.apache.org/solr/MoreLikeThis
  
  
   2013/7/23 Furkan KAMACI furkankam...@gmail.com
  
Hi;
   
Sometimes a huge part of a document may exist in another document. As
   like
in student plagiarism or quotation of a blog post at another blog
 post.
Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class
  to
detect it?
   
  
 



Re: how number of indexed fields effect performance

2013-07-23 Thread Jack Krupansky
There was also a bug in the lazy loading of multivalued fields at one point 
recently in Solr 4.2


https://issues.apache.org/jira/browse/SOLR-4589
4.x + enableLazyFieldLoading + large multivalued fields + varying fl = 
pathological CPU load & response time


Do you use multivalued fields very heavily?

I'm still not ready to suggest that 1,000 fields is an okay thing to do, but 
there are still plenty of nuances in Solr performance that could explain the 
difficulties, before we even get to the 1,000 field issue itself.


The real bottom line is that as you increase field count, there are lots of 
other aspects of Solr memory and performance degradation that increase as 
well. Some of those factors can be dealt with simply with more memory, more 
and faster CPU cores, or even more sharding, or other tuning, but not 
necessarily all of them.


I think that I am already on the record on other threads as suggesting that 
a couple hundred is about the limit for field count for a "slam dunk" use 
of Solr. That doesn't mean you can't go above a couple hundred fields, just 
that you are in uncharted territory and may need to take extraordinary 
measures to get everything working satisfactorily. There's no magic hard 
limit, just a general sense that smaller numbers of fields are like a 
walk in the park, while higher numbers of fields are like chopping through a 
jungle. We each have our own threshold for... adventure.


We need answers to the previous questions we raised before we can analyze 
this a lot further.


Oh, and make sure there is enough OS system memory available for caching of 
the index pages. Sometimes, it is little things like this that can crush 
Solr performance.


Unfortunately, Solr is not a packaged solution that automatically and 
magically auto-configures everything to work just right. Instead, it is a 
powerful toolkit that lets you do amazing things, but you the 
developer/architect need to supply amazing intelligence, wisdom, foresight, 
and insight to get it (and its hardware and software environment) to do 
those amazing things.


-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Tuesday, July 23, 2013 9:54 AM
To: solr-user@lucene.apache.org
Subject: Re: how number of indexed fields effect performance

Do you need all of the fields loaded every time and are they stored? Maybe
there is a document with gigantic content that you don't actually need but
it gets deserialized anyway. Try lazy loading
setting: enableLazyFieldLoading in solrconfig.xml

Regards,
  Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Jul 23, 2013 at 12:36 AM, Jack Krupansky 
j...@basetechnology.comwrote:



After restarting Solr and doing a couple of queries to warm the caches,
are queries already slow/failing, or does it take some time and a number 
of

queries before failures start occurring?

One possibility is that you just need a lot more memory for caches for
this amount of data. So, maybe the failures are caused by heavy garbage
collections. So, after restarting Solr, check how much Java heap is
available, then do some warming queries, then check the Java heap 
available

again.

Add the debugQuery=true parameter to your queries and look at the timings
to see what phases of query processing are taking the most time. Also 
check
whether the reported QTime seems to match actual wall clock time; 
sometimes

formatting of the results and network transfer time can dwarf actual query
time.

How many fields are you returning on a typical query?


-- Jack Krupansky


-Original Message- From: Suryansh Purwar
Sent: Monday, July 22, 2013 11:06 PM
To: solr-user@lucene.apache.org ; j...@basetechnology.com

Subject: how number of indexed fields effect performance

It was running fine initially when we just had around 100 fields
indexed. In this case as well it runs fine, but after some time a broken
pipe exception starts appearing, which results in the shard going down.

Regards,
Suryansh



On Tuesday, July 23, 2013, Jack Krupansky wrote:

 Was all of this running fine previously and only started running slow

recently, or is this your first measurement?

Are very simple queries (single keyword, no filters or facets or sorting
or anything else, and returning only a few fields) working reasonably
well?

-- Jack Krupansky

-Original Message- From: Suryansh Purwar
Sent: Monday, July 22, 2013 4:07 PM
To: solr-user@lucene.apache.org
Subject: how number of indexed fields effect performance

Hi,

We have a two shard solrcloud cluster with each shard allocated 3 
separate

machines. We do complex queries involving a number of filter queries
coupled with group queries and faceting. All of our machines are 64 bit
with 32 GB RAM. Our index size is around 10 GB with around 800,000

Re: WikipediaTokenizer for Removing Unnecesary Parts

2013-07-23 Thread Robert Muir
If you use wikipediatokenizer it will tag different wiki elements with
different types (you can see it in the admin UI).

so then follow up with TypeTokenFilter to keep only the types you care
about, and I think it will do what you want.
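Something along these lines (a sketch; the types file name is made up, and
useWhitelist, available in recent 4.x releases, keeps the listed types
instead of dropping them):

<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="wiki_keep_types.txt"
            useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

wiki_keep_types.txt would then list only the token types to keep (for
example <ALPHANUM>); the admin UI analysis page shows which types the
tokenizer assigns.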

On Tue, Jul 23, 2013 at 7:53 AM, Furkan KAMACI furkankam...@gmail.comwrote:

 Hi;

 I have indexed Wikipedia data with Solr DIH. However, when I look at the
 data that is indexed in Solr I also see things like this:

 {| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
 |- valign="top"
 | style="width: 50%"|
 :*[[Ubuntu]]
 :*[[Fedora]]
 :*[[Mandriva]]
 :*[[Linux Mint]]
 :*[[Debian]]
 :*[[OpenSUSE]]
 |
 *[[Red Hat]]
 *[[Mageia]]
 *[[Arch Linux]]
 *[[PCLinuxOS]]
 *[[Slackware]]
 |}

 However, I want to remove them before indexing. I know that there is a
 WikipediaTokenizer in Lucene, but how can I remove unnecessary parts (such
 as links, style, etc.) with Solr?



Re: WikipediaTokenizer for Removing Unnecesary Parts

2013-07-23 Thread Jack Krupansky
Are you actually seeing that output from the WikipediaTokenizerFactory?? 
Really? Even if you use the Solr Admin UI analysis page?


You should just see the text tokens plus the URLs for links.

-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Tuesday, July 23, 2013 10:53 AM
To: solr-user@lucene.apache.org
Subject: WikipediaTokenizer for Removing Unnecesary Parts

Hi;

I have indexed Wikipedia data with Solr DIH. However, when I look at the
data that is indexed in Solr I also see things like this:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}

However, I want to remove them before indexing. I know that there is a
WikipediaTokenizer in Lucene, but how can I remove unnecessary parts (such
as links, style, etc.) with Solr?



Re: softCommit doesn't work - ?

2013-07-23 Thread tskom
Thanks for your comment Eric.

When I use *server.add(doc);* everything is fine (but it takes a long time
to hard commit every single doc), so I am sure docs are uniquely indexed.

Maybe I shouldn't call *server.commit();* at all from SolrJ code, so Solr
would use the autoCommit/autoSoftCommit configuration defined in
solrconfig.xml?

Maybe there are some bits missing?
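For reference, a common near-real-time setup in solrconfig.xml looks like
this (times illustrative); with it in place the SolrJ code would not call
commit() at all:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>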

--
View this message in context: 
http://lucene.472066.n3.nabble.com/softCommit-doesn-t-work-tp4079578p4079772.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: XInclude and Document Entity not working on schema.xml

2013-07-23 Thread Elodie Sannier

Hello Chris,

Thank you for your help.

I checked differences between my files and your test files but I didn't
find bugs in my files.

All my files are in the same directory: collection1/conf

= schema.xml content:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE schema [
<!ENTITY commonschema_types SYSTEM "commonschema_types.xml">
<!ENTITY commonschema_others SYSTEM "commonschema_others.xml">
]>
<schema name="searchSolrSchema" version="1.5">

  <types>

    <fieldType name="text_stemmed" class="solr.TextField"
        positionIncrementGap="100" omitNorms="true">
      <!-- FR : french -->
      <!-- least aggressive stemming -->
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="com.kelkoo.search.solr.plugins.stemmer.fr.KelkooFrenchMinimalStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="com.kelkoo.search.solr.plugins.stemmer.fr.KelkooFrenchMinimalStemFilterFactory"/>
      </analyzer>
    </fieldType>

    &commonschema_types;

  </types>

  &commonschema_others;

</schema>

= commonschema_types.xml content:

<fieldType name="string" class="solr.StrField"
    sortMissingLast="true" omitNorms="true"/>
<fieldType name="boolean" class="solr.BoolField"
    sortMissingLast="true" omitNorms="true"/>

<!-- int is for exact ids, works with grouped=true and distrib=true -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0"
    sortMissingLast="true" omitNorms="true" positionIncrementGap="0"/>

<!-- tint is for numbers that need sorting and/or range queries
     (precisionStep=4 has better performance than precisionStep=8)
     and that do *not* need grouping (grouping
     does not work in distrib=true for tint) -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="4"
    sortMissingLast="true" omitNorms="true" positionIncrementGap="0"/>

<fieldType name="long" class="solr.TrieLongField" precisionStep="0"
    positionIncrementGap="0"/>
<fieldType name="byte" class="solr.ByteField" omitNorms="true"/>
<fieldType name="float" class="solr.TrieFloatField"
    sortMissingLast="true" omitNorms="true"/>

<!-- A general text field which tokenizes with StandardTokenizer.
     omitNorms=true means the (index time) lengthNorm will be the
     same whatever the number of tokens. -->
<fieldType name="text_general" class="solr.TextField"
    positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>


The &commonschema_others; include works.

Do you see something wrong ?

Unfortunately I cannot use the 4.3.0 version because I'm using solr.xml
sharedLib which does not work in 4.3.0
(cf.https://issues.apache.org/jira/browse/SOLR-4791).
Where can I find the newly voted 4.4?
I have this bug with the nightly 4.5-2013-07-18_06-04-44 found here
https://builds.apache.org/job/Solr-Artifacts-4.x/lastSuccessfulBuild/artifact/solr/package/
(the 18th of july).

Elodie Sannier


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended solely for 
their addressees. If you are not the intended recipient of this message, 
please delete it and notify the sender.


Re: WikipediaTokenizer for Removing Unnecesary Parts

2013-07-23 Thread Furkan KAMACI
Here is my fieldtype:

<fieldType name="text_tr" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
        synonyms="synonyms_tr.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


My input for indexing at the analysis section of the Solr admin page:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}

and the output:


WT   style | text | align | left | width | 50 | table | layout | fixed |
     border | 0 | valign | top | style | width | 50 | Ubuntu | Fedora |
     Mandriva | Linux | Mint | Debian | OpenSUSE | Red | Hat | Mageia |
     Arch | Linux | PCLinuxOS | Slackware

SF   style | text | align | left | width | 50 | table | layout | fixed |
     border | 0 | valign | top | style | width | 50 | Ubuntu | Fedora |
     Mandriva | Linux | Mint | Debian | OpenSUSE | Red | Hat | Mageia |
     Arch | Linux | PCLinuxOS | Slackware

LCF  style | text | align | left | width | 50 | table | layout | fixed |
     border | 0 | valign | top | style | width | 50 | ubuntu | fedora |
     mandriva | linux | mint | debian | opensuse | red | hat | mageia |
     arch | linux | pclinuxos | slackware



Any ideas?



2013/7/23 Jack Krupansky j...@basetechnology.com

 Are you actually seeing that output from the WikipediaTokenizerFactory??
 Really? Even if you use the Solr Admin UI analysis page?

 You should just see the text tokens plus the URLs for links.

 -- Jack Krupansky

 -Original Message- From: Furkan KAMACI
 Sent: Tuesday, July 23, 2013 10:53 AM
 To: solr-user@lucene.apache.org
 Subject: WikipediaTokenizer for Removing Unnecesary Parts


 Hi;

 I have indexed Wikipedia data with Solr DIH. However, when I look at the
 data that is indexed in Solr I also see things like this:

 {| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
 |- valign="top"
 | style="width: 50%"|
 :*[[Ubuntu]]
 :*[[Fedora]]
 :*[[Mandriva]]
 :*[[Linux Mint]]
 :*[[Debian]]
 :*[[OpenSUSE]]
 |
 *[[Red Hat]]
 *[[Mageia]]
 *[[Arch Linux]]
 *[[PCLinuxOS]]
 *[[Slackware]]
 |}

 However, I want to remove them before indexing. I know that there is a
 WikipediaTokenizer in Lucene, but how can I remove unnecessary parts (such
 as links, style, etc.) with Solr?



Re:

2013-07-23 Thread Chris Hostetter

: Can anyone remove this spammer please?

The recent influx is not confined to a single user, or a single list.  Nor 
is there a clear course of action just yet, since the senders in question 
are all legitimate subscribers who have been active members of the 
community.

There is an open issue to track the recent surge and see if we can address 
this problem holistically before resorting to forcibly unsubscribing 
legitimate list members based on messages spoofed on their behalf...

   https://issues.apache.org/jira/browse/INFRA-6585

...interested parties should watch that issue, and/or post comments there 
with any concrete technical suggestions for addressing the problem.


-Hoss


RE: spellcheck and search in a same solr request

2013-07-23 Thread Dyer, James
Solr doesn't support any kind of short-circuiting the original query and 
returning the results of the corrected query or collation.  You just re-issue 
the query in a second request.  This would be a nice feature to add though.
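In practice it is two round trips, something like (handler and terms
illustrative):

1) /select?q=delll+ultrasharp&spellcheck=true&spellcheck.collate=true
2) /select?q=dell+ultrasharp   (q copied from the collation in response 1)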

James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: smanad [mailto:sma...@gmail.com] 
Sent: Monday, July 22, 2013 6:29 PM
To: solr-user@lucene.apache.org
Subject: spellcheck and search in a same solr request

Hey, 

Is there a way to do spellcheck and search (using suggestions returned from
spellcheck) in a single Solr request?

I am seeing that if my query is spelled correctly, i get results but if
misspelled, I just get suggestions.

Any pointers will be very helpful.
Thanks, 
-Manasi



--
View this message in context: 
http://lucene.472066.n3.nabble.com/spellcheck-and-search-in-a-same-solr-request-tp4079571.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: Use same spell check dictionary across different collections

2013-07-23 Thread Dyer, James
DirectSolrSpellChecker does not prepare any kind of dictionary.  It just uses 
the term dictionary from the indexed field.  So what you are trying to do is 
impossible.

You would think it would be possible with IndexBasedSpellChecker because it 
creates a dictionary as a sidecar lucene index.  But it won't work either 
because it uses the term dictionary of the field the sidecar index was based on 
to get term frequencies.  

So both IndexBased- and DirectSolr- rely on the original field's term 
dictionary to work.  You cannot use either to move one core's field data to 
another as a spellcheck dictionary.

Possibly, you can create a Dictionary file based on the data that is in the 
field you want to use.  Then you can use FileBasedSpellChecker.  See 
http://wiki.apache.org/solr/FileBasedSpellChecker .
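A sketch of such a configuration (the file name is illustrative;
spellings.txt is one term per line, exported from the coll1 field's data):

<lst name="spellchecker">
  <str name="name">file</str>
  <str name="classname">solr.FileBasedSpellChecker</str>
  <str name="sourceLocation">spellings.txt</str>
  <str name="characterEncoding">UTF-8</str>
  <str name="spellcheckIndexDir">./spellcheckerFile</str>
</lst>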

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: smanad [mailto:sma...@gmail.com] 
Sent: Monday, July 22, 2013 5:55 PM
To: solr-user@lucene.apache.org
Subject: Use same spell check dictionary across different collections

I have 2 collections, lets say coll1 and coll2.

I configured solr.DirectSolrSpellChecker in coll1 solrconfig.xml and works
fine. 

Now, I want to configure coll2's solrconfig.xml to use the SAME spell check
dictionary index created above. (I do not want coll2 to prepare its own
dictionary index, but just do spell check against the coll1 spell dictionary
index.)

Is it possible to do it? Tried out with IndexBasedSpellChecker but could not
get it working. 

Any suggestions?
Thanks, 
-Manasi



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Use-same-spell-check-dictionary-across-different-collections-tp4079566.html
Sent from the Solr - User mailing list archive at Nabble.com.




[ANNOUNCE] Apache Solr 4.4 released

2013-07-23 Thread Steve Rowe
July 2013, Apache Solr™ 4.4 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.4

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.4 is available for immediate download at:
  http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Solr 4.4 Release Highlights:

* Solr indexes and transaction logs may be stored in HDFS with full read/write
  capability.

* Schemaless mode: Added support for a mode that requires no up-front schema
  modifications, in which previously unknown fields' types are guessed based
  on the values in added/updated documents, and are then added to the schema
  prior to processing the update.  Note that the below-described features
  are also useful independently from schemaless mode operation.   
  * New Parse{Date,Integer,Long,Float,Double,Boolean}UpdateProcessorFactory
classes parse/guess the field value class for String-valued and unknown
fields.
  * New AddSchemaFieldsUpdateProcessor: Automatically add new fields to the
schema when adding/updating documents with unknown fields. Custom rules
map field value class(es) to schema fieldTypes.
  * A new schemaless mode example configuration, using the above-described 
field-value-class-guessing and unknown-field-schema-addition features,
is provided at solr/example/example-schemaless/.

* Core Discovery mode: A new solr.xml format which does not store core
  information, but instead searches for files named 'core.properties' in
  the filesystem which tell Solr all the details about that core.  The main
  example and the schemaless example both use this new format.

* Schema REST API: Add support for creating copy fields.

* A merged segment warmer may now be plugged into solrconfig.xml. 

* New MaxScoreQParserPlugin: Return max() instead of sum() of terms.

* Binary files are now supported in ZooKeeper.

* SolrJ's SolrPing object has new methods for ping, enable, and disable.

* The Admin UI now supports adding documents to Solr.

* Added a PUT command to the Solr ZkCli tool.

* New deleteshard collections API that unloads all replicas of a given
  shard and then removes it from the cluster state. It will remove only
  those shards which are INACTIVE or have no range.

* The Overseer can now optionally assign generic node names so that
  new addresses can host shards without naming confusion.

* The CSV Update Handler now supports optionally adding the line number/
  row id to a document.

* Added a new system wide info admin handler that exposes the system info
  that could previously only be retrieved using a SolrCore.

Solr 4.4 also includes many other new features as well as numerous
optimizations and bugfixes.

Please report any feedback to the mailing lists 
(http://lucene.apache.org/solr/discussion.html)

In the coming days, we will also be announcing the first official Solr 
Reference Guide available for download.  In the meantime, users are 
encouraged to browse the online version and post comments and suggestions on 
the documentation: 
  https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.



Re: Start independent Zookeeper from within Solr install

2013-07-23 Thread Upayavira
The use case is to prevent the necessity to download something else
(zookeeper) when everything needed to run it is (likely) present in the
Solr distribution already.

Maybe we don't need to start Jetty, maybe we can start Zookeeper with an
extra script in the Solr codebase.

At present, if you are unfamiliar with ZooKeeper, getting it up and
running can be a challenge (I've seen quite a few people fail at it
during training scenarios).

Upayavira

On Tue, Jul 23, 2013, at 03:21 PM, Timothy Potter wrote:
 Curious what the use case is for this? Zookeeper is not an HTTP
 service so loading it in Jetty by itself doesn't really make sense. I
 also think this creates more work for the Solr team especially since
 setting up a production ensemble shouldn't take more than a few
 minutes once you have the nodes provisioned.
 
 On Tue, Jul 23, 2013 at 7:05 AM, Upayavira u...@odoko.co.uk wrote:
  Assumptions:
 
   * you currently have two choices to start Zookeeper: run it embedded
   within Solr, or download it from the ZooKeeper site and start it
   independently.
   * everything you need to run ZooKeeper (embedded or not) is included
   within the Solr distribution
 
  Assuming I've got the above right, then currently starting an embedded
  ZooKeeper is easy (-DzkRun), and starting an ensemble is irritatingly
  complex.
 
  So, my question is, how hard would it be to start Zookeeper without
  Solr, but from within the Solr codebase? -DensembleOnly or some such,
  causes Solr not to load, but Zookeeper still starts. I'm assuming that
  Jetty would still listen on port 8983, but it wouldn't initialise the
  Solr webapp:
 
  java -DzkRun -DzkEnsembleOnly
  -DzkHosts=zkhost01:9983,zkhost02:9983,zkhost03:9983 -jar start.jar
 
  Is this possible? If it is, I'm happy to have a go at making it happen.
 
  Upayavira
 
 


Re: custom field type plugin

2013-07-23 Thread Kevin Stone
What are the dangers of trying to use a range of 10 billion? Simply a
slower index time? Or will I get inaccurate results?
I have tried it on a very small sample of documents, and it seemed to
work. I could spend some time this week trying to get a more robust (and
accurate) dataset loaded to play around with. The reason for the 10
billion is to support being able to query for a region on a chromosome.

A user might want to know what genes overlap a point on a specific
chromosome. Unless I can use 3 dimensional coordinates (which gave an
error when I tried it), I'll need to multiply the coordinates by some
offset for each chromosome to be able to normalise the data (at both index
and query time). The largest chromosome (chr 1) has almost 250,000,000
base pairs. I could probably squeeze the rest a bit smaller, but I'd
rather use one size for all chromosomes, since we have more than just
human data to deal with. It would get quite messy otherwise.
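For what it's worth, a tiny sketch of that offset scheme (constants
illustrative; 25 chromosomes times a 400M slice is where the ~10 billion
range comes from):

public class ChromCoords {
    // one fixed slice per chromosome, larger than the biggest (~250M bases)
    static final long SLICE = 400000000L;

    static long encode(int chromosomeIndex, long basePos) {
        return chromosomeIndex * SLICE + basePos;
    }

    public static void main(String[] args) {
        // position 10234 on chromosome 5 maps to one global coordinate
        System.out.println(encode(5, 10234)); // 2000010234
    }
}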


On 7/22/13 11:50 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:

Like Hoss said, you're going to have to solve this using
http://wiki.apache.org/solr/SpatialForTimeDurations
Using PointType is *not* going to work because your durations are
multi-valued per document.

It would be useful to create a custom field type that wraps the capability
outlined on the wiki to make it easier to use without requiring the user
to
think spatially.

You mentioned that these numeric ranges extend upwards of 10 billion or
so.
Unfortunately, the current prefix tree implementation under the hood for
non-geodetic spatial, the QuadTree, is unlikely to scale to numbers that
big.  I don't know where the boundary is, but I doubt 10B.  You could try
and see what happens.  I'm working (very slowly on very little spare time)
on improving the PrefixTree implementations to scale to such large
numbers;
I hope something will be available this fall.

~ David Smiley


Kevin Stone wrote
 I have a particular use case that I think might require a custom field
 type, however I am having trouble getting the plugin to work.
 My use case has to do with genetics data, and we are running into
several
 situations where we need to be able to query multiple regions of a
 chromosome (or gene, or other object types). All that really boils down
to
 is being able to give a number, e.g. 10234, and return documents that
have
 regions containing the number. So you'd have a document with a list like
 [1:16090,400:8000,40123:43564], and it should come back
because
 10234 falls between 1:16090. If there is a better or easier way to
 do this please speak up. I'd rather not have to use a join on another
 index, because 1) it's more complex to set up, and 2) we might need to
 join against something else and you can only do one join at a time.

 Anyway... I tried creating a field type similar to a PointType just to see
 if I could get one working. I added the following jars to get it to
 compile:

 apache-solr-core-4.0.0, lucene-core-4.0.0, lucene-queries-4.0.0,
 apache-solr-solrj-4.0.0.
 I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib
 folder, and specified it in my solr.xml (I have multiple cores).

 After starting up solr, I got the line that it picked up the jar:
 INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader

 But I get this error about it not being able to find the
 AbstractSubTypeFieldType class.
 Here is the first bit of the trace:

 SEVERE: null:java.lang.NoClassDefFoundError:
 org/apache/solr/schema/AbstractSubTypeFieldType
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
 at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 ...etc.


 Any hints as to what I did wrong? I can provide source code, or a fuller
 stack trace, config settings, etc.

 Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib,
then
 repack. However, when I did that, I get a NoClassDefFoundError for my
 plugin itself.


 Thanks,
 Kevin

 The information in this email, including attachments, may be
confidential
 and is intended solely for the addressee(s). If you believe you received
 this email by mistake, please notify the sender by return email as soon
as
 possible.





-
 Author:
http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context:
http://lucene.472066.n3.nabble.com/custom-field-type-plugin-tp4079086p4079
494.html
Sent from the Solr - User mailing list archive at Nabble.com.


The information in this email, including attachments, may be confidential and 
is intended solely for the addressee(s). If you believe you received this email 
by mistake, please notify the sender by return email as soon as possible.

Re: XInclude and Document Entity not working on schema.xml

2013-07-23 Thread Chris Hostetter

Elodie:  I just tested your configs (as close as I could get, since I don't 
have the com.kelkoo classes) using the current HEAD of the 4x branch and 
had no problems with the entity includes.

What Java version/vendor are you using?

Are you using the provided Jetty or your own servlet container?

My best guess is that some combination of JVM version or servlet container 
implementation is subtly affecting the way the XML files are getting 
parsed, introducing the xml:base attribute in a way that isn't getting 
cleanly ignored by the code added to handle that in the issue you noted 
before ... but w/o being able to reproduce that's just a guess.

: Where can I found the newly voted 4.4 ?

the release announcement just went live on the mailing list, and it's now 
the main download link on the website...

https://lucene.apache.org/solr/downloads.html


-Hoss


Re: Collection not current after insert

2013-07-23 Thread Michael Della Bitta
Hi Alistair,

You probably need a commit, and not an optimize.

Which version of Solr are you running against? The 4.0 releases have more
complications, but generally sending a commit will do. Not sure if GSearch
sends one, only partly because I never was able to make it work. :)


Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Tue, Jul 23, 2013 at 9:57 AM, Alistair Young alistair.yo...@uhi.ac.ukwrote:

 Hi there,

 My Solr is being fed by Fedora GSearch and when uploading a new resource,
 the Collection is optimized but not current so the new resource can't be
 found. I have to go to the Core Admin page and Optimize it from there, in
 order to make the collection current. Is there anything I should look for
 to see what the problem is? This is the comms to solr when inserting:

 DEBUG 2013-07-23 13:27:37,023 (OperationsImpl) resultXml =
 <solrUpdateIndex indexName="FgsIndex">
 <inserted>ltk:13000116</inserted>
 <counts insertTotal="1" updateTotal="0" deleteTotal="0" emptyTotal="0"
 docCount="854" warnCount="0"/>
 </solrUpdateIndex>

 DEBUG 2013-07-23 13:27:37,023 (GTransformer)
 xsltName=fgsconfigFinal/index/FgsIndex/updateIndexToResultPage
 DEBUG 2013-07-23 13:27:37,027 (GTransformer) getTransformer
 transformer=org.apache.xalan.transformer.TransformerImpl@6561b973 uriResolver=null
 DEBUG 2013-07-23 13:27:37,028 (GenericOperationsImpl) resultXml=<?xml
 version="1.0" encoding="UTF-8"?>
 <resultPage operation="updateIndex" action="fromPid" value="ltk:13000116"
 repositoryName="FgsRepos" indexNames="" resultPageXslt="" dateTime="Tue Jul
 23 13:27:36 UTC 2013">
 <updateIndex xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:foxml="info:fedora/fedora-system:def/foxml#"
 xmlns:zs="http://www.loc.gov/zing/srw/" warnCount="0" docCount="854"
 deleteTotal="0" updateTotal="0" insertTotal="1" indexName="FgsIndex"/>
 </resultPage>

 INFO 2013-07-23 13:27:37,028 (UpdateListener) Index updated by
 notification message, returning:
 <?xml version="1.0" encoding="UTF-8"?>
 <resultPage operation="updateIndex" action="fromPid" value="ltk:13000116"
 repositoryName="FgsRepos" indexNames="" resultPageXslt="" dateTime="Tue Jul
 23 13:27:36 UTC 2013">
 <updateIndex xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:foxml="info:fedora/fedora-system:def/foxml#"
 xmlns:zs="http://www.loc.gov/zing/srw/" warnCount="0" docCount="854"
 deleteTotal="0" updateTotal="0" insertTotal="1" indexName="FgsIndex"/>
 </resultPage>

 thanks,

 Alistair

 --
 mov eax,1
 mov ebx,0
 int 80h



Re: custom field type plugin

2013-07-23 Thread David Smiley (@MITRE.org)
Oh cool!  I'm glad it at least seemed to work.  Can you post your
configuration of the field type and report from Solr's logs what the
maxLevels is used for this field, which is logged the first time you use
the field type?

Maybe there isn't a limit under 10B after all.  Some quick'n'dirty
calculations I just did indicate there shouldn't be a problem but real-world
usage will be a better proof.  Indexing probably won't be terribly slow,
queries could get pretty slow if the amount of indexed data is really high. 
I'd love to hear how it works out for you.  Your use-case would benefit a
lot from an improved prefix tree implementation.

I don't gather how a 3rd dimension would play into this.  Support for
multi-dimensional spatial is on the drawing board.

~ David


Kevin Stone wrote
 What are the dangers of trying to use a range of 10 billion? Simply a
 slower index time? Or will I get inaccurate results?
 I have tried it on a very small sample of documents, and it seemed to
 work. I could spend some time this week trying to get a more robust (and
 accurate) dataset loaded to play around with. The reason for the 10
 billion is to support being able to query for a region on a chromosome.
 
 A user might want to know what genes overlap a point on a specific
 chromosome. Unless I can use 3 dimensional coordinates (which gave an
 error when I tried it), I'll need to multiply the coordinates by some
 offset for each chromosome to be able to normalise the data (at both index
 and query time). The largest chromosome (chr 1) has almost 250,000,000
 base pairs. I could probably squeeze the rest a bit smaller, but I'd
 rather use one size for all chromosomes, since we have more than just
 human data to deal with. It would get quite messy otherwise.
 
 
 On 7/22/13 11:50 AM, David Smiley (@MITRE.org) lt;

 DSMILEY@

 gt; wrote:
 
Like Hoss said, you're going to have to solve this using
http://wiki.apache.org/solr/SpatialForTimeDurations
Using PointType is *not* going to work because your durations are
multi-valued per document.

It would be useful to create a custom field type that wraps the capability
outlined on the wiki to make it easier to use without requiring the user
to
think spatially.

You mentioned that these numeric ranges extend upwards of 10 billion or
so.
Unfortunately, the current prefix tree implementation under the hood for
non-geodetic spatial, the QuadTree, is unlikely to scale to numbers that
big.  I don't know where the boundary is, but I doubt 10B.  You could try
and see what happens.  I'm working (very slowly on very little spare time)
on improving the PrefixTree implementations to scale to such large
numbers;
I hope something will be available this fall.

~ David Smiley


Kevin Stone wrote
 I have a particular use case that I think might require a custom field
 type, however I am having trouble getting the plugin to work.
 My use case has to do with genetics data, and we are running into
several
 situations where we need to be able to query multiple regions of a
 chromosome (or gene, or other object types). All that really boils down
to
 is being able to give a number, e.g. 10234, and return documents that
have
 regions containing the number. So you'd have a document with a list like
 [1:16090,400:8000,40123:43564], and it should come back
because
 10234 falls between 1:16090. If there is a better or easier way to
 do this please speak up. I'd rather not have to use a join on another
 index, because 1) it's more complex to set up, and 2) we might need to
 join against something else and you can only do one join at a time.

 Anyway... I tried creating a field type similar to a PointType just to see
 if I could get one working. I added the following jars to get it to
 compile:

apache-solr-core-4.0.0, lucene-core-4.0.0, lucene-queries-4.0.0,
apache-solr-solrj-4.0.0.
 I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib
 folder, and specified it in my solr.xml (I have multiple cores).

 After starting up solr, I got the line that it picked up the jar:
 INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader

 But I get this error about it not being able to find the
 AbstractSubTypeFieldType class.
 Here is the first bit of the trace:

 SEVERE: null:java.lang.NoClassDefFoundError:
 org/apache/solr/schema/AbstractSubTypeFieldType
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
 at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 ...etc.


 Any hints as to what I did wrong? I can provide source code, or a fuller
 stack trace, config settings, etc.

 Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib,
then
 repack. However, 

Re: Calculating Solr document score by ignoring the boost field.

2013-07-23 Thread Chris Hostetter

: Ok thanks, I just wanted to know whether it is possible to ignore the boost value or
: not during score calculation and as you said its not.
: Now I would have to focus on nutch to fix the issue and not to send boost=0
: to Solr.

the index time boosts are encoded in field norms -- if you want to ignore 
them, you could either modify your schema to 'omitNorms=true' on all 
fields *before* indexing, or you could customize the Similarity 
implementation you use to be something custom that does not take into 
account field norms at all.

https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity

https://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
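A minimal sketch of the second option (class and package names made up):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoNormsSimilarity extends DefaultSimilarity {
  // the norm encodes index-time boost * length factor; a constant value
  // makes every document score as if no boost had been applied
  @Override
  public float lengthNorm(FieldInvertState state) {
    return 1.0f;
  }
}

wired up in schema.xml with:

<similarity class="com.example.NoNormsSimilarity"/>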


-Hoss


Re:

2013-07-23 Thread Gora Mohanty
On 23 July 2013 21:52, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Can anyone remove this spammer please?

 The recent influx is not confined to a single user, or a single list.  Nor
 is there a clear course of action just yet, since the senders in question
 are all legitimate subscribers who have been active members of the
 community.

 There is an open issue to track the recent surge and see if we can address
 this problem holisticly before resorting to forcibly unsubscribing
 legitimate list members based on messages spoofed on their behalf...

https://issues.apache.org/jira/browse/INFRA-6585

 ...interested parties should watch that issue, and/or post comments there
 with any concrete technical suggestions for addressing the problem.

Yes, this seems to be an across-the-board attack, and I am seeing
this on other mailing lists, and also in personal mail from known
friends. The modus operandi seems to be hacking (probably weak)
passwords for accounts on sites like Gmail and Yahoo offering email
addresses to people at large.

Have no immediate worthwhile suggestions on how best to deal with
this, but as Hoss says, it is worth bearing in mind that the people in
whose names the mail is being sent are also victims.

Regards,
Gora


Re: Question about field boost

2013-07-23 Thread Joe Zhang
I'm not sure I understand, Erick. I don't have a text field in my schema;
title and content are both legal fields.


On Tue, Jul 23, 2013 at 5:15 AM, Erick Erickson erickerick...@gmail.comwrote:

 this isn't doing what you think.
 title^10 content
 is actually parsed as

 text:title^100 text:content

 where text is my default search field.

 assuming title is a field. If you look a little
 farther up the debug output you'll see that.

 You probably want
 title:content^100 or some such?

 Erick

 On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky j...@basetechnology.com
 wrote:
  That means that for that document china occurs in the title vs.
 snowden
  found in a document but not in the title.
 
 
  -- Jack Krupansky
 
  -Original Message- From: Joe Zhang
  Sent: Tuesday, July 23, 2013 12:52 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Question about field boost
 
 
  Is my reading correct that the boost is only applied on china but not
  snowden? How can that be?
 
  My query is: q=china+snowden&qf=title^10 content
 
 
  On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote:
 
  Thanks for your hint, Jack. Here is the debug results, which I'm having
 a
  hard deciphering (the two terms are china and snowden)...
 
  0.26839527 = (MATCH) sum of:
0.26839527 = (MATCH) sum of:
  0.26757246 = (MATCH) max of:
7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
  0.019873314 = queryWeight(content:china), product of:
1.6649085 = idf(docFreq=46832, maxDocs=91058)
0.01193658 = queryNorm
  0.039825942 = (MATCH) fieldWeight(content:china in 249), product
  of:
4.8989797 = tf(termFreq(content:china)=24)
1.6649085 = idf(docFreq=46832, maxDocs=91058)
0.0048828125 = fieldNorm(field=content, doc=249)
0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
  0.5836803 = queryWeight(title:china^10.0), product of:
10.0 = boost
4.8898454 = idf(docFreq=1861, maxDocs=91058)
0.01193658 = queryNorm
  0.45842302 = (MATCH) fieldWeight(title:china in 249), product
 of:
1.0 = tf(termFreq(title:china)=1)
4.8898454 = idf(docFreq=1861, maxDocs=91058)
0.09375 = fieldNorm(field=title, doc=249)
  8.2282536E-4 = (MATCH) max of:
8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
  0.03407834 = queryWeight(content:snowden), product of:
2.8549502 = idf(docFreq=14246, maxDocs=91058)
0.01193658 = queryNorm
  0.024145111 = (MATCH) fieldWeight(content:snowden in 249),
 product
  of:
1.7320508 = tf(termFreq(content:snowden)=3)
2.8549502 = idf(docFreq=14246, maxDocs=91058)
0.0048828125 = fieldNorm(field=content, doc=249)
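
  Reading that output: the ^10 boost is applied to every title clause the
  query generates, including title:snowden^10.0 -- that clause simply did
  not match doc 249, so only the content clause survives the "max of". For
  china, the boosted title clause checks out arithmetically:

    weight(title:china^10.0) = queryWeight x fieldWeight
                             = (10.0 x 4.8898454 x 0.01193658)
                               x (1.0 x 4.8898454 x 0.09375)
                             = 0.5836803 x 0.45842302
                             ~= 0.26757246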
 
 
  On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky
  j...@basetechnology.comwrote:
 
  Maybe you're not doing anything wrong - other than having an artificial
  expectation of what the true relevance of your data actually is. Many
  factors go into relevance scoring. You need to look at all aspects of
  your
  data.
 
  Maybe your terms don't occur in your titles the way you think they do.
 
  Maybe you need a boost of 500 or more...
 
  Lots of potential maybes.
 
  Relevancy tuning is an art and craft, hardly a science.
 
  Step one: Know your data, inside and out.
 
  Use the debugQuery=true parameter on your queries and see how much of
 the
  score is dominated by your query terms in the non-title fields.
 
  -- Jack Krupansky
 
  -Original Message- From: Joe Zhang
  Sent: Monday, July 22, 2013 11:06 PM
  To: solr-user@lucene.apache.org
  Subject: Question about field boost
 
 
  Dear Solr experts:
 
  Here is my query:
 
  defType=dismax&q=term1+term2&qf=title^100 content
 
  Apparently (at least I thought) my intention is to boost the title
 field.
  While I'm getting some non-trivial results, I'm surprised that the
  documents with both term1 and term2 in title (I know such docs do exist
  in
  my repository) were not returned (or maybe ranked very low). The
  situation
  does not change even when I use much larger boost factors.
 
  What am I doing wrong?
 
 
 
 



Spellcheck field element and collation issues

2013-07-23 Thread Brendan Grainger
Hi All,

I have an IndexBasedSpellChecker component configured as follows (note the
field parameter is set to the spellcheck field):

  searchComponent name=spellcheck class=solr.SpellCheckComponent

str name=queryAnalyzerFieldTypetext_spell/str

lst name=spellchecker
  str name=namedefault/str
  str name=classnamesolr.IndexBasedSpellChecker/str
  !--
  Load tokens from the following field for spell checking,
  analyzer for the field's type as defined in schema.xml are used
  --
*  str name=fieldspellcheck/str*
  str name=spellcheckIndexDir./spellchecker/str
  float name=thresholdTokenFrequency.0001/float
/lst
  /searchComponent

with the corresponding field type for spellcheck:

fieldType name=text_spell class=solr.TextField
positionIncrementGap=100 omitNorms=true
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory
ignoreCase=true
words=lang/stopwords_en.txt
enablePositionIncrements=true
/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.StandardFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.SynonymFilterFactory
synonyms=moto_synonyms.txt ignoreCase=true expand=true/
filter class=solr.StopFilterFactory
ignoreCase=true
words=lang/stopwords_en.txt
enablePositionIncrements=true
/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.StandardFilterFactory/
  /analyzer
/fieldType

and field:

!-- spellcheck field is multivalued because it has the title and markup
  fields copied into it --
field name=spellcheck type=text_spell stored=false
omitTermFreqAndPositions=true multiValued=true/

values from a markup and title field are copied into the spellcheck field.

My /select search component has the following defaults:

lst name=defaults
  str name=echoParamsexplicit/str
  int name=rows10/int
  str name=dfmarkup_texts title_texts/str

  !-- Spell checking defaults --
  str name=spellchecktrue/str
  str name=spellcheck.collateExtendedResultstrue/str
  str name=spellcheck.extendedResultstrue/str
  str name=spellcheck.maxCollations2/str
  str name=spellcheck.maxCollationTries5/str
  str name=spellcheck.count5/str
  str name=spellcheck.collatetrue/str

  str name=spellcheck.maxResultsForSuggest5/str
  str name=spellcheck.alternativeTermCount5/str

 /lst


When I issue a search like this:

http://localhost:8981/solr/articles/select?indent=true&spellcheck.q=markup_texts:(Perfrm%20HVC)&q=Perfrm%20HVC&rows=0

I get collations:

lst name=collation
str name=collationQuerymarkup_texts:(perform hvac)/str
int name=hits4/int
lst name=misspellingsAndCorrections
str name=perfrmperform/str
str name=hvchvac/str
/lst
/lst
lst name=collation
str name=collationQuerymarkup_texts:(performed hvac)/str
int name=hits4/int
lst name=misspellingsAndCorrections
str name=perfrmperformed/str
str name=hvchvac/str
/lst
/lst

However, if I remove the spellcheck.q parameter I do not, i.e. no
collations are returned for the following:

http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0



If I specify the fields being searched over for the q parameter I get
collations:

http://localhost:8981/solr/articles/select?indent=true&q=markup_texts:(Perfrm%20HVC)&rows=0

lst name=collation
str name=collationQuerymarkup_texts:(perform hvac)/str
int name=hits4/int
lst name=misspellingsAndCorrections
str name=perfrmperform/str
str name=hvchvac/str
/lst
/lst
lst name=collation
str name=collationQuerymarkup_texts:(performed hvac)/str
int name=hits4/int
lst name=misspellingsAndCorrections
str name=perfrmperformed/str
str name=hvchvac/str
/lst
/lst


I'm a bit confused as to what the value for field should be in the spellcheck
component definition. In fact, what is its purpose here, just as the input
for building the spellchecking index? If that is so, then why do I need to
even specify the queryAnalyzerFieldType?

Also, why do I need to explicitly specify the field in the query or
spellcheck.q to get collations?

Thanks and sorry for the rather long question.

Brendan



RE: Spellcheck field element and collation issues

2013-07-23 Thread Dyer, James
For this query:

http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0

...do you get anything back in the spellcheck response?  Is it correcting the 
individual words and not giving collations?  Or are you getting no individual 
word suggestions also?

James Dyer
Ingram Content Group
(615) 213-4311




how number of indexed fields effect performance

2013-07-23 Thread Suryansh Purwar
Hi,
Thanks for your suggestions. I'll be able to provide answers to a few of
your questions right now; the rest I'll answer after some time. It takes
around 150k to 200k queries before it goes down again after restarting it.
In a typical query we are returning around 20 fields. Memory utilization
peaks only after some time.


Regard,
Suryansh

On Tuesday, July 23, 2013, Jack Krupansky wrote:

 There was also a bug in the lazy loading of multivalued fields at one
 point recently in Solr 4.2

  https://issues.apache.org/jira/browse/SOLR-4589
  4.x + enableLazyFieldLoading + large multivalued fields + varying fl =
  pathological CPU load & response time

 Do you use multivalued fields very heavily?

 I'm still not ready to suggest that 1,000 fields is an okay thing to do,
 but there are still plenty of nuances in Solr performance that could
 explain the difficulties, before we even get to the 1,000 field issue
 itself.

 The real bottom line is that as you increase field count, there are lots
 of other aspects of Solr memory and performance degradation that increase
 as well. Some of those factors can be dealt with simply with more memory,
 more and faster CPU cores, or even more sharding, or other tuning, but not
 necessarily all of them.

 I think that I am already on the record on other threads as suggesting
 that a couple hundred is about the limit for field count for a slam
 dunk use of Solr. That doesn't mean you can't go above a couple hundred
 fields, just that you are in uncharted territory and may need to take
 extraordinary measures to get everything working satisfactorily. There's no
 magic hard limit, just a general sense that smaller numbers of of field are
 like a walk in a park, while higher numbers of fields are like chopping
 through a jungle. We each have our own threshold for... adventure.

 We need answers to the previous questions we raised before we can analyze
 this a lot further.

 Oh, and make sure there is enough OS system memory available for caching
 of the index pages. Sometimes, it is little things like this that can crush
 Solr performance.

 Unfortunately, Solr is not a packaged solution that automatically and
 magically auto-configures everything to work just right. Instead, it is a
 powerful toolkit that lets you do amazing things, but you the
 developer/architect need to supply amazing intelligence, wisdom, foresight,
 and insight to get it (and its hardware and software environment) to do
 those amazing things.

 -- Jack Krupansky

 -Original Message- From: Alexandre Rafalovitch
 Sent: Tuesday, July 23, 2013 9:54 AM
 To: solr-user@lucene.apache.org
 Subject: Re: how number of indexed fields effect performance

 Do you need all of the fields loaded every time and are they stored? Maybe
 there is a document with gigantic content that you don't actually need but
 it gets deserialized anyway. Try lazy loading
 setting: enableLazyFieldLoading in solrconfig.xml

 Regards,
   Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
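
 As a concrete illustration (a sketch assuming SolrJ 4.x; the URL and field
 names here are assumptions), restricting the field list keeps large stored
 fields from being deserialized needlessly, and debugQuery exposes where the
 time goes:

     import org.apache.solr.client.solrj.SolrQuery;
     import org.apache.solr.client.solrj.impl.HttpSolrServer;
     import org.apache.solr.client.solrj.response.QueryResponse;

     public class FieldListDemo {
       public static void main(String[] args) throws Exception {
         HttpSolrServer server =
             new HttpSolrServer("http://localhost:8983/solr"); // assumed URL
         SolrQuery q = new SolrQuery("some keywords");
         q.setFields("id", "title"); // fl: fetch only the stored fields you need
         q.set("debugQuery", true);  // per-phase timings in the response
         QueryResponse rsp = server.query(q);
         System.out.println(rsp.getResults().getNumFound() + " docs found");
       }
     }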


 On Tue, Jul 23, 2013 at 12:36 AM, Jack Krupansky j...@basetechnology.com
 wrote:

  After restarting Solr and doing a couple of queries to warm the caches,
 are queries already slow/failing, or does it take some time and a number
 of
 queries before failures start occurring?

 One possibility is that you just need a lot more memory for caches for
 this amount of data. So, maybe the failures are caused by heavy garbage
 collections. So, after restarting Solr, check how much Java heap is
 available, then do some warming queries, then check the Java heap
 available
 again.

 Add the debugQuery=true parameter to your queries and look at the timings
 to see what phases of query processing are taking the most time. Also
 check
 whether the reported QTime seems to match actual wall clock time;
 sometimes
 formatting of the results and network transfer time can dwarf actual query
 time.

 How many fields are you returning on a typical query?


 -- Jack Krupansky


 -Original Message- From: Suryansh Purwar
 Sent: Monday, July 22, 2013 11:06 PM
 To: solr-user@lucene.apache.org ; j...@basetechnology.com

 Subject: how number of indexed fields effect performance

  It was running fine initially when we just had around 100 fields
  indexed. In this case as well it runs fine, but after some time a broken pipe
  exception starts appearing, which results in the shard going down.

 Regards,
 Suryansh



 On Tuesday, July 23, 2013, Jack Krupansky wrote:

  Was all of this running fine previously and only started running slow

 recently, or is this your first measurement?

 Are very simple queries (single keyword, no filters or facets or sorting
 or anything else, and 

Re: Spellcheck field element and collation issues

2013-07-23 Thread Brendan Grainger
Hi James,

I get the following response for that query:

response
lst name=responseHeader
int name=status0/int
int name=QTime8/int
lst name=params
str name=indenttrue/str
str name=qPerfrm HVC/str
str name=rows0/str
/lst
/lst
result name=response numFound=0 start=0/result
lst name=spellcheck
lst name=suggestions
lst name=perfrm
int name=numFound3/int
int name=startOffset0/int
int name=endOffset6/int
int name=origFreq0/int
arr name=suggestion
lst
str name=wordperform/str
int name=freq4/int
/lst
lst
str name=wordperformed/str
int name=freq1/int
/lst
lst
str name=wordperformance/str
int name=freq3/int
/lst
/arr
/lst
lst name=hvc
int name=numFound2/int
int name=startOffset7/int
int name=endOffset10/int
int name=origFreq0/int
arr name=suggestion
lst
str name=wordhvac/str
int name=freq4/int
/lst
lst
str name=wordhave/str
int name=freq5/int
/lst
/arr
/lst
bool name=correctlySpelledfalse/bool
/lst
/lst
/response

Thanks
Brendan


On Tue, Jul 23, 2013 at 3:19 PM, Dyer, James
james.d...@ingramcontent.comwrote:

 For this query:


 http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0

 ...do you get anything back in the spellcheck response?  Is it correcting
 the individual words and not giving collations?  Or are you getting no
 individual word suggestions also?

 James Dyer
 Ingram Content Group
 (615) 213-4311



Re: Node down, but not out

2013-07-23 Thread jimtronic
I think the best bet here would be a ping-like handler that would simply
return the state of only this box in the cluster:

Something like /admin/state, which would return
"down", "active", "leader", or "recovering".

I'm not really sure where to begin however. Any ideas?

jim
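
As a starting point, the state is already in ZooKeeper, so a client-side
check (a sketch, assuming SolrJ 4.x; the ZK address is an assumption) can
read it without writing a custom handler:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class NodeStateCheck {
      public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("localhost:2181"); // assumed ZK
        server.connect();
        ZkStateReader reader = server.getZkStateReader();
        ClusterState state = reader.getClusterState();
        // Nodes registered as live in ZooKeeper; a node still answering HTTP
        // but absent from this set is exactly the "down but not out" case.
        System.out.println("live nodes: " + state.getLiveNodes());
        server.shutdown();
      }
    }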

On Mon, Jul 22, 2013 at 12:52 PM, Timothy Potter [via Lucene] 
ml-node+s472066n4079518...@n3.nabble.com wrote:

 There is but I couldn't get it to work in my environment on Jetty, see:


 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3CCAJt9Wnib+p_woYODtrSPhF==v8Vx==mDBd_qH=x_knbw-bn...@mail.gmail.com%3E

 Let me know if you have any better luck. I had to resort to something
 hacky but was out of time I could devote to such unproductive
 endeavors ;-)

 On Mon, Jul 22, 2013 at 10:49 AM, jimtronic [hidden 
 email]http://user/SendEmail.jtp?type=nodenode=4079518i=0
 wrote:

  I'm not sure why it went down exactly -- I restarted the process and
 lost the
  logs. (d'oh!)
 
  An OOM seems likely, however. Is there a setting for killing the
 processes
  when solr encounters an OOM?
 
  Thanks!
 
  Jim
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Node-down-but-not-out-tp4079495p4079507.html

  Sent from the Solr - User mailing list archive at Nabble.com.







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Node-down-but-not-out-tp4079495p4079856.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Spellcheck field element and collation issues

2013-07-23 Thread Brendan Grainger
Hi James,

If I try:

http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0&maxCollationTries=0

I get the same result:

response
lst name=responseHeader
int name=status0/int
int name=QTime7/int
lst name=params
str name=indenttrue/str
str name=qPerfrm HVC/str
str name=maxCollationTries0/str
str name=rows0/str
/lst
/lst
result name=response numFound=0 start=0/result
lst name=spellcheck
lst name=suggestions
lst name=perfrm
int name=numFound3/int
int name=startOffset0/int
int name=endOffset6/int
int name=origFreq0/int
arr name=suggestion
lst
str name=wordperform/str
int name=freq4/int
/lst
lst
str name=wordperformed/str
int name=freq1/int
/lst
lst
str name=wordperformance/str
int name=freq3/int
/lst
/arr
/lst
lst name=hvc
int name=numFound2/int
int name=startOffset7/int
int name=endOffset10/int
int name=origFreq0/int
arr name=suggestion
lst
str name=wordhvac/str
int name=freq4/int
/lst
lst
str name=wordhave/str
int name=freq5/int
/lst
/arr
/lst
bool name=correctlySpelledfalse/bool
/lst
/lst
/response

However, you're right that my df field for the /select handler is in fact:

 str name=dfmarkup_texts title_texts/str

I would note that if I specify the query as follows:

http://localhost:8981/solr/articles/select?indent=true&q=markup_texts:(Perfrm%20HVC)+OR+title_texts:(Perfrm%20HVC)&rows=0&maxCollationTries=0

which is what I thought specifying a df would effectively do, I get
collation results:

lst name=collation
str name=collationQuery
markup_texts:(perform hvac) OR title_texts:(perform hvac)
/str
int name=hits4/int
lst name=misspellingsAndCorrections
str name=perfrmperform/str
str name=hvchvac/str
str name=perfrmperform/str
str name=hvchvac/str
/lst
/lst
lst name=collation
str name=collationQuery
markup_texts:(perform hvac) OR title_texts:(performed hvac)
/str
int name=hits4/int
lst name=misspellingsAndCorrections
str name=perfrmperform/str
str name=hvchvac/str
str name=perfrmperformed/str
str name=hvchvac/str
/lst
/lst

I think I'm confused about the relationship between the q parameter and
what the field and queryAnalyzerFieldType are for in the spellcheck
component definition, i.e. what is this for:

   str name=fieldspellcheck/str

is it even needed if I've specified how the spelling index terms should
analyzed with:

   str name=queryAnalyzerFieldTypetext_spell/str

Thanks again
Brendan





On Tue, Jul 23, 2013 at 3:58 PM, Dyer, James
james.d...@ingramcontent.comwrote:

 Try tacking &maxCollationTries=0 to the URL and see if the collation
 returns.

 If you get a collation, then try the same URL with the collation as the
 q parameter.  Does that get results?

 My suspicion here is that you are assuming that markup_texts is the
 default search field for /select but in fact it isn't.

 James Dyer
 Ingram Content Group
 (615) 213-4311



RE: Spellcheck field element and collation issues

2013-07-23 Thread Dyer, James
Try tacking &maxCollationTries=0 to the URL and see if the collation returns.

If you get a collation, then try the same URL with the collation as the q 
parameter.  Does that get results?

My suspicion here is that you are assuming that markup_texts is the default 
search field for /select but in fact it isn't.

James Dyer
Ingram Content Group
(615) 213-4311



socket write error Solrj 4.3.1

2013-07-23 Thread franagan
Hi all,

I'm testing SolrCloud (version 4.3.1) with 2 shards and 1 external ZooKeeper.
All of it is running OK: documents are indexed in 2 different shards, and select
*:* gives me all documents.

Now I'm trying to add/index a new document via SolrJ using CloudSolrServer.

*the code:*

server = new CloudSolrServer("localhost:2181");
server.setDefaultCollection("tika");
server.setZkConnectTimeout(9);

input = new FileInputStream(new File("C:\\sample.pdf"));

ContentStreamUpdateRequest up = new
ContentStreamUpdateRequest("/update/extract");
up.addFile(new File("C:\\caca.pdf"),
"application/octet-stream");
up.setParam("literal.id", "444");


Parser parser = new PDFParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(input, handler, metadata, context);

up.setParam("literal.text", handler.toString());
up.setMethod(SolrRequest.METHOD.POST);

server.request(up);
server.commit();
input.close();
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SolrServerException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

*My schema looks like this:*
 fields
 field name=id type=integer indexed=true stored=true
required=true/
   field name=title type=string indexed=true stored=true/
   field name=author type=string indexed=true stored=true /
   field name=text type=text_ind indexed=true stored=true /   
   field name=_version_ type=long indexed=true stored=true/
   dynamicField name=ignored_* type=string indexed=true
stored=true/
 /fields

*where text_ind type is like this:*
  fieldType name=text_ind class=solr.TextField
positionIncrementGap=100
analyzer type=index
tokenizer class=solr.LetterTokenizerFactory/
filter class=solr.EdgeNGramFilterFactory minGramSize=3
maxGramSize=25 /
filter class=solr.LowerCaseFilterFactory/
/analyzer
/fieldType

*when I execute the code, the following exception is thrown:*

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this
request:[http://192.168.1.12:8983/solr/tika_shard1_replica1,
http://192.168.1.12:8984/solr/tika_shard2_replica1]
at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
at
org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:306)
at solrCloud.solrJDemo.main(solrJDemo.java:51)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException
occured when talking to server at:
http://192.168.1.12:8983/solr/tika_shard1_replica1
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
... 2 more
Caused by: org.apache.http.client.ClientProtocolException
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
... 4 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot
retry request with a non-repeatable request entity.  The cause lists the
reason the original request failed.
at
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:691)
at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
... 7 

Re: zkHost in solr.xml goes missing after SPLITSHARD using Collections API

2013-07-23 Thread Ali, Saqib
Thanks Alan and Shawn. Just installed Solr 4.4, and no longer experiencing
the issue.

Thanks! :)


On Tue, Jul 23, 2013 at 7:21 AM, Shawn Heisey s...@elyograg.org wrote:

 On 7/23/2013 7:50 AM, Alan Woodward wrote:
  Can you try upgrading to the just-released 4.4?  Solr.xml persistence
 had all kinds of bugs in 4.3, which should have been fixed now.

 The 4.4.0 release has been finalized and uploaded, but the download link
 hasn't been changed yet because the mirror network isn't fully
 synchronized yet.  It is available from many mirrors, but until the
 website download links get changed, there's not yet a direct way to
 access it.

 Here's some generic instructions for situations where the new version is
 done, but the official announcement isn't out yet:

 http://lucene.apache.org/solr/

 1) Go to the Solr website (URL above) and click on the latest version
 download button, which at this moment is 4.3.1.  Wait for the redirect
 to take you to a mirror list.

 2) Click on one of the mirrors, the best option is usually the one right
 on top that the website chose for you.

 3) When the file list comes up, click the Parent Directory link.  If
 this isn't showing, it will most likely be labelled with .. instead.

 4) If a directory for the new version (in this case 4.4.0) is listed,
 click on it and then click the file that you want to download.

 If the new version is not listed, click the Back button on your browser
 twice, then go back to step 2, but this time choose a different mirror.

 One last reminder: This only works right before a release is officially
 announced.  These instructions cannot be used while a release is still
 in development.

 Thanks,
 Shawn




Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
Hello Solr users,

Question regarding processing a lot of docs returned from a query: I
potentially have millions of documents coming back. What is
the common design to deal with this?

2 ideas I have are:
- create a client service that is multithreaded to handled this
- Use the Solr pagination to retrieve a batch of rows at a time (start,
rows in Solr Admin console)

Any other ideas that I may be missing ?

Thanks,
Matt
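
A minimal sketch of the second idea (assuming SolrJ 4.x; the URL and sort
field are assumptions). Note that plain start/rows paging gets progressively
slower as start grows deep into the millions, so keep batches as shallow as
the use case allows:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class BatchProcessor {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr"); // assumed URL
        final int rows = 1000;
        SolrQuery query = new SolrQuery("*:*");
        query.setSort("id", SolrQuery.ORDER.asc); // stable sort so pages don't shift
        query.setRows(rows);
        int start = 0;
        long numFound;
        do {
          query.setStart(start);
          QueryResponse rsp = server.query(query);
          numFound = rsp.getResults().getNumFound();
          for (SolrDocument doc : rsp.getResults()) {
            // hand each document off here, e.g. to a worker pool (idea #1)
          }
          start += rows;
        } while (start < numFound);
      }
    }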











RE: Spellcheck field element and collation issues

2013-07-23 Thread Dyer, James
I don't believe you can specify more than 1 field on df (default field).  
What you want, I think, is qf (query fields), which is available only if 
using dismax/edismax.

http://wiki.apache.org/solr/SearchHandler#df
http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29

James Dyer
Ingram Content Group
(615) 213-4311



Re: socket write error Solrj 4.3.1

2013-07-23 Thread franagan
For people who have the same issue, it was solved by adding:

str name=fmap.contenttext/str

in the requestHandler /update/extract in solrconfig.xml:

  requestHandler name=/update/extract
class=org.apache.solr.handler.extraction.ExtractingRequestHandler
 lst name=defaults
  str name=fmap.Last-Modifiedlast_modified/str
  str name=uprefixignored_/str
 * str name=fmap.contenttext/str*
/lst
lst name=date.formats
  str-MM-dd/str
/lst
  /requestHandler

So there is no need to add the content in SolrJ:

up.setParam("literal.text", handler.toString());

Regards



--
View this message in context: 
http://lucene.472066.n3.nabble.com/socket-write-error-Solrj-4-3-1-tp4079869p4079881.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellcheck field element and collation issues

2013-07-23 Thread Brendan Grainger
Thanks James. That's it! Now:

http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0&maxCollationTries=0

returns:

lst name=collation
str name=collationQueryperform hvac/str
int name=hits4/int
lst name=misspellingsAndCorrections
str name=perfrmperform/str
str name=hvchvac/str
/lst
/lst
lst name=collation
str name=collationQueryperformed hvac/str
int name=hits4/int
lst name=misspellingsAndCorrections
str name=perfrmperformed/str
str name=hvchvac/str
/lst
/lst

If you have time, I'm still slightly unclear on the field element in the
spellcheck configuration. Maybe I should explain how I think it works:

1. You create a relatively unanalyzed field type (e.g. no stemming)
2. You copy text you want to be used to build the spellcheck index into
that field.
3. Build the spellcheck sidecar index (a no-op if using DirectSpellChecker,
in which case I assume it still uses the dedicated spellcheck field the text
was copied into).

When executing a spellcheck request, Solr uses the analyzer specified in
queryAnalyzerFieldType to tokenize the query passed in via the q or
spellcheck.q parameter, and this tokenized text is the input to the
spellchecking instance.

Does that sound right?

Thanks
Brendan








maximum number of documents per shard?

2013-07-23 Thread Ali, Saqib
still 2.1 billion documents?


RE: Spellcheck field element and collation issues

2013-07-23 Thread Dyer, James
You've got it.  The only other thing is that spellcheck.q does not analyze 
anything.  The whole purpose of this is to allow you to just send raw keywords 
to be spellchecked.  This is handy if you have a complex q parameter (say, 
you're using local params, etc.) and the SpellingQueryConverter cannot handle 
it.  You could write your own QueryConverter, but it's often just easier to 
strip out the keywords and send them over with spellcheck.q.

James Dyer
Ingram Content Group
(615) 213-4311
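
For completeness, the same request from SolrJ (a sketch, assuming SolrJ 4.x
and the /select defaults shown earlier in the thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.SpellCheckResponse;

    public class SpellcheckDemo {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8981/solr/articles");
        SolrQuery query = new SolrQuery("Perfrm HVC");
        query.set("spellcheck", true);
        query.set("spellcheck.q", "Perfrm HVC"); // raw keywords, no query converter
        query.setRows(0);
        QueryResponse rsp = server.query(query);
        SpellCheckResponse sc = rsp.getSpellCheckResponse();
        if (sc != null && sc.getCollatedResults() != null) {
          for (SpellCheckResponse.Collation c : sc.getCollatedResults()) {
            System.out.println(c.getCollationQueryString()
                + " -> " + c.getNumberOfHits() + " hits");
          }
        }
      }
    }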



How to make soft commit more reliable?

2013-07-23 Thread SolrLover
Currently I am using SOLR 3.5.X and I push updates to SOLR via queue (Active
MQ) and perform hard commit every 30 minutes (since my index is relatively
big around 30 million documents). I am thinking of using soft commit to
implement NRT search but I am worried about the reliability.

For ex: If I have the hard autocommit set to 10 minutes and a softcommit
every second, new documents will show up every second but in case of JVM
crash or power goes out I will lose all the documents after the last hard
commit. 

I was thinking of using a backup database or another SOLR index that I can
use as a backup, writing each document from the queue to both places (one with
soft commits, the other with just the pushed updates and normal hard
commits), or writing simultaneously to a DB and deleting the rows once the hard
commit is successful, after making sure that we didn't lose any records.

Does someone have any other idea to improve the reliability of the push
updates when using soft commit?
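
For reference, the commit settings described above would look roughly like
this in solrconfig.xml (a sketch for Solr 4.x; the updateLog element is
included because, as I understand it, it is the piece that replays documents
received after the last hard commit instead of losing them):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- transaction log: documents added since the last hard commit
       are replayed from here after a JVM crash or power loss -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <!-- hard commit every 10 minutes; flushes segments to disk -->
  <autoCommit>
    <maxTime>600000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit every second; makes new documents visible -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>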




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-make-soft-commit-more-reliable-tp4079892.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellcheck field element and collation issues

2013-07-23 Thread Brendan Grainger
Perfect thanks so much. You just cleared up the other little bit, i.e. when
the SpellingQueryConverter is used/not used and why you might implement
your own.

Thanks again.


On Tue, Jul 23, 2013 at 6:48 PM, Dyer, James
james.d...@ingramcontent.comwrote:

 You've got it.  The only other thing is that spellcheck.q does not
 analyze anything.  The whole purpose of this is to allow you to just send
 raw keywords to be spellchecked.  This is handy if you have a complex q
 parameter (say, you're using local params, etc) and the
 SpellingQueryConverter cannot handle it.  You could write your own Query
Converter, but it's often just easier to strip out the keywords and send them
 over with spellcheck.q.

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Brendan Grainger [mailto:brendan.grain...@gmail.com]
 Sent: Tuesday, July 23, 2013 4:41 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Spellcheck field element and collation issues

 Thanks James. That's it! Now:


 http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0&maxCollationTries=0

 returns:

<lst name="collation">
  <str name="collationQuery">perform hvac</str>
  <int name="hits">4</int>
  <lst name="misspellingsAndCorrections">
    <str name="perfrm">perform</str>
    <str name="hvc">hvac</str>
  </lst>
</lst>
<lst name="collation">
  <str name="collationQuery">performed hvac</str>
  <int name="hits">4</int>
  <lst name="misspellingsAndCorrections">
    <str name="perfrm">performed</str>
    <str name="hvc">hvac</str>
  </lst>
</lst>

 If you have time, I'm still slightly unclear on the field element in the
 spellcheck configuration. Maybe I should explain how I think it works:

 1. You create a relatively unanalyzed field type (e.g. no stemming)
 2. You copy text you want to be used to build the spellcheck index into
 that field.
3. Build the spellcheck sidecar index (or a no-op if using DirectSpellChecker,
in which case I assume it still uses the dedicated spellcheck field the text
was copied into).

When executing a spellcheck request, Solr uses the analyzer specified in
queryAnalyzerFieldType to tokenize the query passed in via the q or
spellcheck.q parameter, and this tokenized text is the input to the
spellchecking instance.

 Does that sound right?
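
In configuration terms - a hedged sketch, where the textSpell type name is
illustrative but the spellcheck field name and the parameter names match the
config discussed above - that understanding corresponds to:

<!-- schema.xml: a lightly-analyzed field that spellcheck terms are copied into -->
<field name="spellcheck" type="textSpell" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="markup_texts" dest="spellcheck"/>
<copyField source="title_texts" dest="spellcheck"/>

<!-- solrconfig.xml: the checker reads its terms from that field, while
     queryAnalyzerFieldType controls how q/spellcheck.q input is tokenized -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spellcheck</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>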

 Thanks
 Brendan







 On Tue, Jul 23, 2013 at 5:15 PM, Dyer, James
 james.d...@ingramcontent.comwrote:

  I don't believe you can specify more than 1 field on df (default
 field).
   What you want, I think, is qf (query fields), which is available only if
   using dismax/edismax.
 
  http://wiki.apache.org/solr/SearchHandler#df
  http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29
 
  James Dyer
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Brendan Grainger [mailto:brendan.grain...@gmail.com]
  Sent: Tuesday, July 23, 2013 3:22 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Spellcheck field element and collation issues
 
  Hi James,
 
  If I try:
 
 
 
  http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0&maxCollationTries=0
 
  I get the same result:
 
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">7</int>
      <lst name="params">
        <str name="indent">true</str>
        <str name="q">Perfrm HVC</str>
        <str name="maxCollationTries">0</str>
        <str name="rows">0</str>
      </lst>
    </lst>
    <result name="response" numFound="0" start="0"/>
    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="perfrm">
          <int name="numFound">3</int>
          <int name="startOffset">0</int>
          <int name="endOffset">6</int>
          <int name="origFreq">0</int>
          <arr name="suggestion">
            <lst>
              <str name="word">perform</str>
              <int name="freq">4</int>
            </lst>
            <lst>
              <str name="word">performed</str>
              <int name="freq">1</int>
            </lst>
            <lst>
              <str name="word">performance</str>
              <int name="freq">3</int>
            </lst>
          </arr>
        </lst>
        <lst name="hvc">
          <int name="numFound">2</int>
          <int name="startOffset">7</int>
          <int name="endOffset">10</int>
          <int name="origFreq">0</int>
          <arr name="suggestion">
            <lst>
              <str name="word">hvac</str>
              <int name="freq">4</int>
            </lst>
            <lst>
              <str name="word">have</str>
              <int name="freq">5</int>
            </lst>
          </arr>
        </lst>
        <bool name="correctlySpelled">false</bool>
      </lst>
    </lst>
  </response>
 
  However, you're right that my df field for the /select handler is in fact:
 
    <str name="df">markup_texts title_texts</str>
 
  I would note that if I specify the query as follows:
 
 
 
  http://localhost:8981/solr/articles/select?indent=true&q=markup_texts:(Perfrm%20HVC)+OR+title_texts:(Perfrm%20HVC)&rows=0&maxCollationTries=0
 
  which is what I thought specifying a df would effectively do, I get
  collation results:
 
  <lst name="collation">
    <str name="collationQuery">
      markup_texts:(perform hvac) OR title_texts:(perform hvac)
    </str>
    <int name="hits">4</int>
    <lst name="misspellingsAndCorrections">
      <str name="perfrm">perform</str>
      <str name="hvc">hvac</str>
      <str name="perfrm">perform</str>
      <str name="hvc">hvac</str>
    </lst>
  </lst>
  <lst name="collation">
    <str name="collationQuery">
      markup_texts:(perform hvac) OR title_texts:(performed hvac)
    </str>
    <int name="hits">4</int>
    <lst name="misspellingsAndCorrections">
      <str name="perfrm">perform</str>
  str 

Re: maximum number of documents per shard?

2013-07-23 Thread Jack Krupansky
2.1 billion documents (including deleted documents) per Lucene index, but 
essentially per Solr shard as well.


But don’t even think about going that high. In fact, don't plan on going 
above 100 million unless you do a proof of concept that validates that you 
get acceptable query and update performance. There is no hard limit besides
that 2.1 billion Lucene limit, but... performance will vary.


-- Jack Krupansky

-Original Message- 
From: Ali, Saqib

Sent: Tuesday, July 23, 2013 6:18 PM
To: solr-user@lucene.apache.org
Subject: maximum number of documents per shard?

still 2.1 billion documents? 



Re: problems about solr replication in 4.3

2013-07-23 Thread Erick Erickson
Are you mixing SolrCloud and old-style master/slave?

There was a bug a while ago (search the JIRA) where
replication was copying the entire index unnecessarily,
but I think that was fixed by 4.3.

Best
Erick

On Tue, Jul 23, 2013 at 6:33 AM, xiaoqi belivexia...@gmail.com wrote:

 hi all,

 I have two Solr instances: one is the master, one is the replica. Before, I
 used them under version 3.5 and it worked fine.
 When I upgraded to version 4.3, I found that while the replica is copying the
 index from the master, it cleans its current index and copies the new version
 into its own folder, so the slave can't search during this process!

 I am new to Solr 4 - is this normal? Any ideas? Thanks!





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/problems-about-solr-replication-in-4-3-tp4079665.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: softCommit doesn't work - ?

2013-07-23 Thread Erick Erickson
Right, issuing a commit after every document is not good
practice. Relying on the auto commit parameters in
solrconfig.xml is usually best, although I will sometimes
issue a commit at the very end of the indexing run.


Several things about this thread aren't making sense. First of
all, your commitWithin parameter (your server.add(doc, int)
is not soft committing anything; it's just telling Solr to commit
documents in the future). But you should be seeing
these after 10 seconds.

Check your solrconfig and ensure that your autocommit
settings have <openSearcher>true</openSearcher> set;
that could possibly be what you're seeing.
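
For reference, the block in question looks like this (the interval here is
illustrative):

<autoCommit>
  <maxTime>600000</maxTime>
  <!-- if this is false, hard commits flush segments to disk but do not
       open a new searcher, so newly indexed documents stay invisible -->
  <openSearcher>true</openSearcher>
</autoCommit>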

The fact that you have all those segments indicates that
the commits are going through, so maybe you just
have openSearcher set to false.

This is almost certainly an assumption you're making that's
not so; what you're doing _should_ work.

For that matter, what are your soft commit settings in solrconfig.xml?


Best
Erick


On Tue, Jul 23, 2013 at 11:48 AM, tskom tsiedlac...@hotmail.co.uk wrote:
 Thanks for your comment, Erick.

 When I use *server.add(doc);* everything is fine (though it takes a long time
 to hard commit every single doc), so I am sure the docs are uniquely indexed.

 Maybe I shouldn't do *server.commit();* at all from solrj code, so SOLR
 would use the autoCommit/autoSoftCommit configuration defined in
 solrconfig.xml?

 Maybe there are some bits missing ?











 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/softCommit-doesn-t-work-tp4079578p4079772.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question about field boost

2013-07-23 Thread Erick Erickson
Bah! I didn't notice that you'd used edismax, ignore
my comments.

Sorry for the confusion
Erick

On Tue, Jul 23, 2013 at 2:34 PM, Joe Zhang smartag...@gmail.com wrote:
 I'm not sure I understand, Erick. I don't have a text field in my schema;
 title and content are both legal fields.


 On Tue, Jul 23, 2013 at 5:15 AM, Erick Erickson 
 erickerick...@gmail.comwrote:

 this isn't doing what you think.
 title^100 content
 is actually parsed as

 text:title^100 text:content

 where text is my default search field.

 assuming title is a field. If you look a little
 farther up the debug output you'll see that.

 You probably want
 title:content^100 or some such?

 Erick

 On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky j...@basetechnology.com
 wrote:
  That means that for that document china occurs in the title vs.
 snowden
  found in a document but not in the title.
 
 
  -- Jack Krupansky
 
  -Original Message- From: Joe Zhang
  Sent: Tuesday, July 23, 2013 12:52 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Question about field boost
 
 
  Is my reading correct that the boost is only applied on china but not
  snowden? How can that be?
 
  My query is: q=china+snowden&qf=title^10 content
 
 
  On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote:
 
   Thanks for your hint, Jack. Here are the debug results, which I'm having a
   hard time deciphering (the two terms are china and snowden)...
 
  0.26839527 = (MATCH) sum of:
0.26839527 = (MATCH) sum of:
  0.26757246 = (MATCH) max of:
7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
  0.019873314 = queryWeight(content:china), product of:
1.6649085 = idf(docFreq=46832, maxDocs=91058)
0.01193658 = queryNorm
  0.039825942 = (MATCH) fieldWeight(content:china in 249), product
  of:
4.8989797 = tf(termFreq(content:china)=24)
1.6649085 = idf(docFreq=46832, maxDocs=91058)
0.0048828125 = fieldNorm(field=content, doc=249)
0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
  0.5836803 = queryWeight(title:china^10.0), product of:
10.0 = boost
4.8898454 = idf(docFreq=1861, maxDocs=91058)
0.01193658 = queryNorm
  0.45842302 = (MATCH) fieldWeight(title:china in 249), product
 of:
1.0 = tf(termFreq(title:china)=1)
4.8898454 = idf(docFreq=1861, maxDocs=91058)
0.09375 = fieldNorm(field=title, doc=249)
  8.2282536E-4 = (MATCH) max of:
8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
  0.03407834 = queryWeight(content:snowden), product of:
2.8549502 = idf(docFreq=14246, maxDocs=91058)
0.01193658 = queryNorm
  0.024145111 = (MATCH) fieldWeight(content:snowden in 249),
 product
  of:
1.7320508 = tf(termFreq(content:snowden)=3)
2.8549502 = idf(docFreq=14246, maxDocs=91058)
0.0048828125 = fieldNorm(field=content, doc=249)
 
 
  On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky
  j...@basetechnology.comwrote:
 
  Maybe you're not doing anything wrong - other than having an artificial
  expectation of what the true relevance of your data actually is. Many
  factors go into relevance scoring. You need to look at all aspects of
  your
  data.
 
  Maybe your terms don't occur in your titles the way you think they do.
 
  Maybe you need a boost of 500 or more...
 
  Lots of potential maybes.
 
  Relevancy tuning is an art and craft, hardly a science.
 
  Step one: Know your data, inside and out.
 
   Use the debugQuery=true parameter on your queries and see how much of the
   score is dominated by your query terms in the non-title fields.
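
   For example, something like:

   /select?defType=dismax&q=term1+term2&qf=title^100+content&debugQuery=true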
 
  -- Jack Krupansky
 
  -Original Message- From: Joe Zhang
  Sent: Monday, July 22, 2013 11:06 PM
  To: solr-user@lucene.apache.org
  Subject: Question about field boost
 
 
  Dear Solr experts:
 
  Here is my query:
 
  defType=dismax&q=term1+term2&qf=title^100 content
 
  Apparently (at least I thought) my intention is to boost the title
 field.
  While I'm getting some non-trivial results, I'm surprised that the
  documents with both term1 and term2 in title (I know such docs do exist
  in
  my repository) were not returned (or maybe ranked very low). The
  situation
  does not change even when I use much larger boost factors.
 
  What am I doing wrong?
 
 
 
 





Re: Processing a lot of results in Solr

2013-07-23 Thread Timothy Potter
Hi Matt,

This feature is commonly known as deep paging and Lucene and Solr have
issues with it ... take a look at
http://solr.pl/en/2011/07/18/deep-paging-problem/ as a potential
starting point using filters to bucketize a result set into sets of
sub result sets.
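
The idea from that post, roughly (a sketch - the field names here are
hypothetical): instead of walking start=0, 1000, 2000, ... ever deeper,
slice the result set with filter queries on an indexed field so that every
request is a shallow page, e.g.:

q=*:*&fq=price:[0 TO 99]&start=0&rows=1000
q=*:*&fq=price:[100 TO 199]&start=0&rows=1000
...

Each fq bucket is scanned from start=0, so Solr never has to collect and
discard millions of leading hits the way a large start value forces it to.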

Cheers,
Tim

On Tue, Jul 23, 2013 at 3:04 PM, Matt Lieber mlie...@impetus.com wrote:
 Hello Solr users,

 Question regarding processing a lot of docs returned from a query; I
 potentially have millions of documents returned back from a query. What is
 the common design to deal with this?

 2 ideas I have are:
 - Create a client service that is multithreaded to handle this
 - Use the Solr pagination to retrieve a batch of rows at a time (start,
 rows in the Solr Admin console)

 Any other ideas that I may be missing ?

 Thanks,
 Matt


 








Re: custom field type plugin

2013-07-23 Thread Kevin Stone
Sorry for the late response. I needed to find the time to load a lot of
extra data (closer to what we're anticipating). I have an index with close
to 220,000 documents, each with at least two coordinate regions anywhere
between -10 billion and +10 billion; a document could potentially have up to
maybe half a dozen regions. The reason for the negatives is that you can read
a chromosome either backwards or forwards, so many coordinates can be
negative.

Here is the schema field definition:

<fieldType name="geneticLocation"
           class="solr.SpatialRecursivePrefixTreeFieldType"
           multiValued="true"
           geo="false"
           worldBounds="-100000000000 -100000000000 100000000000 100000000000"
           distErrPct="0"
           maxDistErr="0.000009"
           units="degrees" />


Here is the first query in the log:

INFO: geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0, geo=false, multiValued=true, worldBounds=-100000000000 -100000000000 100000000000 100000000000, maxDistErr=0.000009, units=degrees}} strat: RecursivePrefixTreeStrategy(prefixGridScanLevel:46,SPG:(QuadPrefixTree(maxLevels:50,ctx:SpatialContext{geo=false, calculator=CartesianDistCalc, worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)}))) maxLevels: 50
Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+60330+6033041244+100)&rows=100} hits=81112 status=0 QTime=122





Here are some other queries to give different timings (the one above
brings back quite a lot):

INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+60+69+100)&rows=100} hits=6031 status=0 QTime=10
Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+0+1000+100)&rows=100} hits=500 status=0 QTime=15
Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+7831329+7831329+100)&rows=100} hits=4 status=0 QTime=17
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(-100+-1051057963+-1001057963+0)&rows=100} hits=661 status=0 QTime=8



The query times look pretty fast to me. Certainly I'm pretty impressed.
Our other backup solutions (involving SQL) likely wouldn't even touch this
in terms of speed.



We will be testing this more in depth in the coming month. I am sort of
jumping ahead of our team to research possible solutions, since this is
something that worried us. Looks like it might work!

Thanks,
-Kevin

On 7/23/13 1:47 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:

Oh cool!  I'm glad it at least seemed to work.  Can you post your
configuration of the field type and report from Solr's logs what the
maxLevels is used for this field, which is logged the first time you use
the field type?

Maybe there isn't a limit under 10B after all.  Some quick'n'dirty
calculations I just did indicate there shouldn't be a problem but
real-world
usage will be a better proof.  Indexing probably won't be terribly slow,
queries could get pretty slow if the amount of indexed data is really
high.
I'd love to hear how it works out for you.  Your use-case would benefit a
lot from an improved prefix tree implementation.

I don't gather how a 3rd dimension would play into this.  Support for
multi-dimensional spatial is on the drawing board.

~ David


Kevin Stone wrote
 What are the dangers of trying to use a range of 10 billion? Simply a
 slower index time? Or will I get inaccurate results?
 I have tried it on a very small sample of documents, and it seemed to
 work. I could spend some time this week trying to get a more robust (and
 accurate) dataset loaded to play around with. The reason for the 10
 billion is to support being able to query for a region on a chromosome.

 A user might want to know what genes overlap a point on a specific
 chromosome. Unless I can use 3 dimensional coordinates (which gave an
 error when I tried it), I'll need to multiply the coordinates by some
 offset for each chromosome to be able to normalise the data (at both
index
 and query time). The largest chromosome (chr 1) has almost 250,000,000
 base pairs. I could probably squeeze the rest a bit smaller, but I'd
 rather use one size for all chromosomes, since we have more than just
 human data to deal with. It would get quite messy otherwise.


 On 7/22/13 11:50 AM, David Smiley (@MITRE.org) lt;

 DSMILEY@

 gt; wrote:

Like Hoss said, you're going to have to solve this using
http://wiki.apache.org/solr/SpatialForTimeDurations
Using PointType is *not* going to work because your durations are
multi-valued per document.

It would be useful to create a 

Re: Processing a lot of results in Solr

2013-07-23 Thread Roman Chyla
Hello Matt,

You can consider writing a batch processing handler which receives a query
and, instead of sending results back, writes them into a file that is
then available for streaming (it has its own UUID). I am dumping many GBs
of data from Solr in a few minutes - your query + a streaming writer can go a
very long way :)
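
A client-side approximation of the same idea (a sketch in SolrJ - not my
actual handler, and the URL/field names are illustrative) pages through the
result set and streams each batch to a local file:

import java.io.PrintWriter;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class DumpToFile {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    PrintWriter out = new PrintWriter("dump.txt", "UTF-8");
    final int rows = 1000;                  // batch size per request
    long numFound = Long.MAX_VALUE;         // corrected after the first page
    for (long start = 0; start < numFound; start += rows) {
      SolrQuery q = new SolrQuery("*:*");
      q.setStart((int) start);
      q.setRows(rows);
      q.setFields("id");                    // fetch only the fields you need
      q.set("sort", "id asc");              // stable order across pages
      SolrDocumentList docs = server.query(q).getResults();
      numFound = docs.getNumFound();
      for (SolrDocument d : docs) {
        out.println(d.getFieldValue("id"));
      }
      if (docs.isEmpty()) break;            // defensive stop
    }
    out.close();
  }
}

Note that plain start/rows paging slows down as start grows, which is where
the filter-bucketing trick from Tim's link comes in.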

roman


On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:

 Hello Solr users,

 Question regarding processing a lot of docs returned from a query; I
 potentially have millions of documents returned back from a query. What is
 the common design to deal with this?

 2 ideas I have are:
 - Create a client service that is multithreaded to handle this
 - Use the Solr pagination to retrieve a batch of rows at a time (start,
 rows in the Solr Admin console)

 Any other ideas that I may be missing ?

 Thanks,
 Matt


 









Re: Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
That sounds like a satisfactory solution for the time being -
I am assuming you dump the data from Solr in CSV format?
How did you implement the streaming processor? (What tool did you use for
this? I'm not familiar with it.)
You say it takes only a few minutes to dump the data - how long does it take
to stream it back in? Is the performance acceptable (~ within minutes)?

Thanks,
Matt

On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote:

Hello Matt,

You can consider writing a batch processing handler which receives a query
and, instead of sending results back, writes them into a file that is
then available for streaming (it has its own UUID). I am dumping many GBs
of data from Solr in a few minutes - your query + a streaming writer can go a
very long way :)

roman


On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:

 Hello Solr users,

 Question regarding processing a lot of docs returned from a query; I
 potentially have millions of documents returned back from a query. What is
 the common design to deal with this?

 2 ideas I have are:
 - Create a client service that is multithreaded to handle this
 - Use the Solr pagination to retrieve a batch of rows at a time (start,
 rows in the Solr Admin console)

 Any other ideas that I may be missing ?

 Thanks,
 Matt


 


















Re: custom field type plugin

2013-07-23 Thread Smiley, David W.
Kevin,

Those are some good query response times but they could be better.  You've
configured the field type sub-optimally.  Look again at
http://wiki.apache.org/solr/SpatialForTimeDurations and note in particular
maxDistErr.  You've left it at the value that comes pre-configured with
Solr, 0.9, which is ~1 meter measured in degrees, and this value
makes no sense when your numeric range is in whole numbers.  I suspect you
inherited this value from Hoss's slides.  **Instead use 1.** (as shown on
the wiki). This affects performance in a big way since you've configured
the prefixTree to hold 2.22e18 values (calculated via (max-min) /
maxDistErr) as opposed to just 2e10.  Your log shows maxLevels is 50 for
quad tree.  The comments in QuadPrefixTree (and I put them there once)
indicate maxLevels of 50 is about as much as is supported.  But again, I'm
not certain what the limit really is without validating.  Hopefully you
can stay clear of 50.  To do some tests, try querying just on the edge on
either side of an indexed value to make sure you match the point and then
don't match the indexed point as you would expect based on the
instructions.  Also, be sure to read more of the details on Search on
this wiki page in which you are advised to buffer the query shape
slightly; you didn't do this in your examples below.  This is all a bit of
a hack when using a field that internally is using floating point instead
of fixed precision.
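
Concretely, keeping everything else from your definition and changing only
maxDistErr (a sketch):

<fieldType name="geneticLocation"
           class="solr.SpatialRecursivePrefixTreeFieldType"
           multiValued="true"
           geo="false"
           worldBounds="-100000000000 -100000000000 100000000000 100000000000"
           distErrPct="0"
           maxDistErr="1"
           units="degrees" />

And a buffered version of the point query further down would look something
like (half a unit added on either side of the point):

q=humanCoordinate:Intersects(0+7831328.5+7831329.5+100000000000)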

~ David Smiley

On 7/23/13 9:32 PM, Kevin Stone kevin.st...@jax.org wrote:

Sorry for the late response. I needed to find the time to load a lot of
extra data (closer to what we're anticipating). I have an index with close
to 220,000 documents, each with at least two coordinate regions anywhere
between -10 billion and +10 billion; a document could potentially have up to
maybe half a dozen regions. The reason for the negatives is that you can read
a chromosome either backwards or forwards, so many coordinates can be
negative.

Here is the schema field definition:

<fieldType name="geneticLocation"
           class="solr.SpatialRecursivePrefixTreeFieldType"
           multiValued="true"
           geo="false"
           worldBounds="-100000000000 -100000000000 100000000000 100000000000"
           distErrPct="0"
           maxDistErr="0.000009"
           units="degrees" />


Here is the first query in the log:

INFO: geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0, geo=false, multiValued=true, worldBounds=-100000000000 -100000000000 100000000000 100000000000, maxDistErr=0.000009, units=degrees}} strat: RecursivePrefixTreeStrategy(prefixGridScanLevel:46,SPG:(QuadPrefixTree(maxLevels:50,ctx:SpatialContext{geo=false, calculator=CartesianDistCalc, worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)}))) maxLevels: 50
Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+60330+6033041244+100)&rows=100} hits=81112 status=0 QTime=122





Here are some other queries to give different timings (the one above
brings back quite a lot):

INFO: [testIndex] webapp=/solr path=/select
params={wt=xmlq=humanCoordinate:Intersects(0+60+69+1
0
0)rows=100} hits=6031 status=0 QTime=10
Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select
params={wt=xmlq=humanCoordinate:Intersects(0+0+1000+100)ro
w
s=100} hits=500 status=0 QTime=15
Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute
INFO: [testIndex] webapp=/solr path=/select
params={wt=xmlq=humanCoordinate:Intersects(0+7831329+7831329+100
)
rows=100} hits=4 status=0 QTime=17
INFO: [testIndex] webapp=/solr path=/select
params={wt=xmlq=humanCoordinate:Intersects(-100+-1051057963+-100
1
057963+0)rows=100} hits=661 status=0 QTime=8



The query times look pretty fast to me. Certainly I'm pretty impressed.
Our other backup solutions (involving SQL) likely wouldn't even touch this
in terms of speed.



We will be testing this more in depth in the coming month. I am sort of
jumping ahead of our team to research possible solutions, since this is
something that worried us. Looks like it might work!

Thanks,
-Kevin

On 7/23/13 1:47 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:

Oh cool!  I'm glad it at least seemed to work.  Can you post your
configuration of the field type and report from Solr's logs what the
maxLevels is used for this field, which is logged the first time you
use
the field type?

Maybe there isn't a limit under 10B after all.  Some quick'n'dirty
calculations I just did indicate there shouldn't be a problem but
real-world
usage will be a better proof.  Indexing probably won't be terribly slow,
queries could get pretty slow if the amount of indexed data is really
high. 
I'd love to hear how it 
