Re: exact matches possible?
Hi Erik, thanks for the response. I have ensured the type is string and that the field is indexed. No luck though. (Schema setting under solr/conf): <field name="Word" type="string" indexed="true" stored="true" /> Query: Word:apple Desired result: apple Achieved results: apple, the red apple, pine-apple, etc. I have also tried your other suggestion: q={! f=Word}apple (attempting to eliminate any results with spaces), but that just gives errors (when calling from the solr/admin query interface). Am I doing something obviously wrong? Thanks again, Roland

Erik Hatcher wrote: It's certainly quite possible with Lucene/Solr. But you have to index the field to accommodate it. If you literally want an exact match query, use the string field type and then issue a term query. q=field:value will work in simple cases (where the value has no spaces or colons, or other query parser syntax), but q={!term f=field}value is the fail-safe way to do that. Erik

On Nov 2, 2011, at 07:08, Roland Tollenaar wrote: Hi, I am trying to do a search that will only match exact words on a field. I have read somewhere that this is not what Solr is meant for but I am still hoping that it's possible. This is an example of what I have tried (to exclude spaces) but the workaround does not seem to work. Word:apple NOT What I am really looking for is the = operator in SQL (eg Word='apple') but I cannot find its equivalent for Lucene. Thanks for the help. Regards, Roland
pingQuery problem ?
My Solr instance works well; when calling the ping page I get no problem. But in the logs I see these error lines repeated. Do you know how to solve this? solrconfig.xml Thanks
Re: Using Solr components for dictionary matching?
I really don't understand what you're asking. Could you give some examples of what you're trying to do? Best Erick

On Tue, Nov 1, 2011 at 10:38 AM, Nagendra Mishr nmi...@gmail.com wrote: Hi all, Is there a good guide on using Solr components as a dictionary matcher? I need to do some pre-processing that involves lots of dictionary lookups and it doesn't seem right to query Solr for each instance. Thanks in advance, Nagendra
Re: SOLRJ commitWithin inconsistent
Vijay: You may want to try Solr 3.3/3.4 with RankingAlgorithm as it supports NRT (Near Real Time updates). You can set the commit interval to about 15 mins or as desired. You can get more information about NRT with 3.3/3.4.0 from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.3/3.4.0 with RankingAlgorithm 1.3 from here: http://solr-ra.tgels.org Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org

On 11/2/2011 8:40 PM, Vijay Sampath wrote: Hi, I'm using commitWithin for immediate commit. The response times are inconsistent: sometimes less than a second, sometimes more than 25 seconds. I'm not sending concurrent requests. Any idea? http://wiki.apache.org/solr/CommitWithin Snippet: UpdateRequest req = new UpdateRequest(); req.add(solrDoc); req.setCommitWithin(5000); req.process(server); Thanks, Vijay
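For reference, a self-contained SolrJ sketch along the lines of Vijay's snippet; the server URL and field names are placeholders (not taken from the thread), and this only shows where setCommitWithin fits, not a tuned setup:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at your own Solr instance
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");           // placeholder fields
        doc.addField("name", "example");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setCommitWithin(5000);         // ask Solr to commit within 5 seconds
        req.process(server);
    }
}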
Re: exact matches possible?
Roland - Is it possible that you indexed with a different field type and then changed to string without reindexing? A query on a string field will only match literally the exact value (barring any wildcard/regex syntax), so something is fishy with your example. Your query example was odd, not sure if you meant it literally, but given the Word field name the query would be q={!term f=Word}apple - maybe you thought term was meta, but it is meant literally here. Erik

On Nov 3, 2011, at 04:45, Roland Tollenaar wrote: Hi Erik, thanks for the response. I have ensured the type is string and that the field is indexed. No luck though. (Schema setting under solr/conf): <field name="Word" type="string" indexed="true" stored="true" /> Query: Word:apple Desired result: apple Achieved results: apple, the red apple, pine-apple, etc. I have also tried your other suggestion: q={! f=Word}apple (attempting to eliminate any results with spaces), but that just gives errors (when calling from the solr/admin query interface). Am I doing something obviously wrong? Thanks again, Roland

Erik Hatcher wrote: It's certainly quite possible with Lucene/Solr. But you have to index the field to accommodate it. If you literally want an exact match query, use the string field type and then issue a term query. q=field:value will work in simple cases (where the value has no spaces or colons, or other query parser syntax), but q={!term f=field}value is the fail-safe way to do that. Erik

On Nov 2, 2011, at 07:08, Roland Tollenaar wrote: Hi, I am trying to do a search that will only match exact words on a field. I have read somewhere that this is not what Solr is meant for but I am still hoping that it's possible. This is an example of what I have tried (to exclude spaces) but the workaround does not seem to work. Word:apple NOT What I am really looking for is the = operator in SQL (eg Word='apple') but I cannot find its equivalent for Lucene. Thanks for the help. Regards, Roland
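As a rough illustration of the term-query approach Erik describes, a minimal SolrJ sketch; the server URL is a placeholder and the snippet is untested, it only shows how the {!term} syntax is passed as the q parameter:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermQueryExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Exact match on the string-typed "Word" field via the term query parser
        SolrQuery query = new SolrQuery("{!term f=Word}apple");
        QueryResponse rsp = server.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}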
Re: Multivalued fields question
multiValued has nothing to do with how many tokens are in the field; it's just whether you can call document.add(field1, val1) more than once on the same field. Or, equivalently, whether an input XML document has two field entries with the same name. So it strictly depends upon whether you want to take it upon yourself to make these long strings or call document.add once for each value in the field. The field is returned as an array if it's multiValued. Just to make your life interesting: if you define your increment gap as 0, there is no difference between how multiValued fields are searched as opposed to single-valued fields. FWIW Erick

On Tue, Nov 1, 2011 at 1:26 PM, Travis Low t...@4centurion.com wrote: Greetings. We're finally kicking off our little Solr project. We're indexing a paltry 25,000 records but each has MANY documents attached, so we're using Tika to parse those documents into a big long string, which we use in a call to solrj.addField(relateddoccontents, bigLongStringOfDocumentContents). We don't care about search results pointing back to a particular document, just one of the 25K records, so this should work. Now my question. Many of these records have related records in other tables, and there are several types of these related records. For example, we have record #100 that may have blue records with numbers , , , and , and red records with numbers , , , . Currently we're just handling these the same way as related document contents -- we concatenate them, separated by spaces, into one long string, then we do solrj.addField(redRecords, stringOfRedRecordNumbers). That is, stringOfRedRecordNumbers is . We have no need to show these records to the user in Solr search results, because we're going to use the database for displaying detailed information for any records found. Is there any reason to specify redRecords and blueRecords as multivalued fields in schema.xml? And if we did that, we'd call solrj.addField() once for each value, would we not? cheers, Travis
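To make the one-addField-call-per-value idea concrete, a small SolrJ sketch; the record numbers here are hypothetical (the actual values were not preserved in the message) and the field names are taken from Travis's description:

import org.apache.solr.common.SolrInputDocument;

public class MultiValuedExample {
    public static void main(String[] args) {
        // Hypothetical record numbers; the real values would come from the database
        String[] redRecordNumbers = {"1001", "1002", "1003"};

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "100");
        for (String number : redRecordNumbers) {
            // one addField call per value, rather than one concatenated string
            doc.addField("redRecords", number);
        }
        System.out.println(doc);
    }
}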
Re: pingQuery problem ?
One of my cores had a missing ping request handler.
RE: Jetty logging
Hi, remove slf4j-jdk14-1.6.1.jar from the war and repack it with slf4j-log4j12.jar and log4j-1.2.14.jar instead. - http://wiki.apache.org/solr/SolrLogging Regards, Kai Gülzau

-Original Message- From: darul [mailto:daru...@gmail.com] Sent: Thursday, November 03, 2011 11:26 AM To: solr-user@lucene.apache.org Subject: Jetty logging

Hello everybody, I do not find a solution on how to configure Jetty with slf4j and a log4j.properties file. In I have put: - log4j-1.2.14.jar - slf4j-api-1.3.1.jar in directory: - log4j.properties In the end, nothing happens when running Jetty. Do you have any ideas? Thanks, Julien
Re: Using Solr components for dictionary matching?
Assuming that by dictionary you mean (also) a thesaurus, you can consider using SIREn, which is a Solr/Lucene add-on able to index (and search) RDF data. In this way, you could index an already available thesaurus like LCSH or Agrovoc, or build and index your own vocabulary. Subsequently, querying its services for lookups will benefit from Solr/Lucene features. Best, Andrea

On 11/1/11, Nagendra Mishr nmi...@gmail.com wrote: Hi all, Is there a good guide on using Solr components as a dictionary matcher? I need to do some pre-processing that involves lots of dictionary lookups and it doesn't seem right to query Solr for each instance. Thanks in advance, Nagendra
RE: large scale indexing issues / single threaded bottleneck
Shishir, we have 35 million documents, and should be doing about 5000-1 new documents a day, but with very small documents: 40 fields which have at most a few terms, with many being single terms. You may occasionally see some impact from top level index merges but those should be very infrequent given your stated volumes. For more concrete advice, you should also provide information on the size of your documents, and your search volume. JRJ

-Original Message- From: Awasthi, Shishir [mailto:shishir.awas...@baml.com] Sent: Tuesday, November 01, 2011 10:58 PM To: solr-user@lucene.apache.org Subject: RE: large scale indexing issues / single threaded bottleneck

Roman, How frequently do you update your index? I have a need to do real time add/delete to SOLR documents at a rate of approximately 20/min. The total number of documents is in the range of 4 million. Will there be any performance issues? Thanks, Shishir

-Original Message- From: Roman Alekseenkov [mailto:ralekseen...@gmail.com] Sent: Sunday, October 30, 2011 6:11 PM To: solr-user@lucene.apache.org Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies. I think I have figured out a partial solution for the problem on Friday night. Adding a whole bunch of debug statements to the info stream showed that every document is following the "update document" path instead of the "add document" path, meaning that all document IDs are getting into the pending deletes queue, and Solr has to rescan its index on every commit for potential deletions. This is single threaded and seems to get progressively slower with the index size. Adding overwrite=false to the URL in the /update handler did NOT help, as my debug statements showed that messages still go to the updateDocument() function with deleteTerm not being null. So, I hacked Lucene a little bit and set deleteTerm=null as a temporary solution in the beginning of updateDocument(), and it does not call applyDeletes() anymore. This gave a 6-8x performance boost, and now we can index about 9 million documents/hour (producing 20GB of index every hour). Right now it's at 1TB index size and going, without noticeable degradation of the indexing speed. This is decent, but still the 24-core machine is barely utilized :) Now I think it's hitting a merge bottleneck, where all indexing threads are being paused. And ConcurrentMergeScheduler with 4 threads is not helping much. I guess the changes on the trunk would definitely help, but we will likely stay on 3.4. Will dig more into the issue on Monday. Really curious to see why overwrite=false didn't help, but the hack did. Once again, thank you for the answers and recommendations Roman
RE: change solr url
The file that he refers to, web.xml, is inside the solr WAR file in the folder WEB-INF. That WAR file is in ...\example\webapps. You would have to uncomment the init-param section under filter-class and change the param-value to something else. But, as the comments in the filter-class section explain, you would also have to make other changes. If you are unfamiliar with how JEE Java applications are packaged, it might be best to leave it alone. Note that both alternatives that he has suggested would change the path for all of Solr, not just admin. JRJ

-Original Message- From: Ankita Patil [mailto:ankita.pa...@germinait.com] Sent: Tuesday, November 01, 2011 11:44 PM To: solr-user@lucene.apache.org Subject: Re: change solr url

I am not very clear. Could you explain a bit in detail or give an example. Ankita.

On 2 November 2011 06:26, Chris Hostetter hossman_luc...@fucit.org wrote: : Is it possible to change the url for solr admin?? : What i want is : : http://192.168.0.89:8983/solr/private/coreName/admin : : i want to add /private/ before the coreName. Is that possible? If yes how? You can either do this via settings in your servlet container (to specify that the mapping of the solr application should be solr/private instead of solr/) or you can modify the path-prefix value in Solr's web.xml (but that is not very well tested/supported) -Hoss
RE: Questions about Solr's security
It seems to me that this issue needs to be addressed in the FAQ and in the tutorial, and that somewhere there should be a /select lock-down how-to. This is not obvious to many (most?) users of Solr. It certainly wasn't obvious to me before I read this. JRJ

-Original Message- From: Erik Hatcher [mailto:erik.hatc...@gmail.com] Sent: Tuesday, November 01, 2011 3:50 PM To: solr-user@lucene.apache.org Subject: Re: Questions about Solr's security

SSL and auth don't address that /select can hit any request handler defined (/select?qt=/update&stream.body=<delete><query>*:*</query></delete>&commit=true). Be careful! But certainly knowing all the issues mentioned on this thread, it is possible to lock Solr down and make it safe to hit directly. But not out of the box or trivially. Erik

On Nov 1, 2011, at 16:09, Alireza Salimi wrote: I'm not sure if anybody has asked these questions before or not. Sorry if they are duplicates. The problem is that the clients (smart phones) of our Solr machines are outside the network in which the Solr machines are located. So, we need to somehow expose their service to the outside world. What's the safest way to do that? If we implement just a controlled app sitting between those clients, we're going to waste lots of processing power because of proxying between Solr and the clients. We might also ignore some HTTP headers that Solr would generate, such as HTTP cache headers. Anyway, creating such an application seems to be a lot of work which is not really needed. Erik, do you think even if we use SSL and HTTP authentication, it's still not a good idea to expose Solr services?

On Tue, Nov 1, 2011 at 3:57 PM, Erik Hatcher erik.hatc...@gmail.com wrote: Be aware that even /select could have some harmful effects, see https://issues.apache.org/jira/browse/SOLR-2854 (addressed on trunk). Even disregarding that issue, /select is a potential gateway to any request handler defined, via /select?qt=/req_handler Again, in general it's not a good idea to expose Solr to anything but a controlled app server. Erik

On Nov 1, 2011, at 15:51, Alireza Salimi wrote: What if we just expose '/select' paths - by firewalls and load balancers - and also use SSL and HTTP basic or digest access control?

On Tue, Nov 1, 2011 at 2:20 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I was wondering if it's a good idea to expose Solr to the outside world, : so that our clients running on smart phones will be able to use Solr. As a general rule of thumb, i would say that it is not a good idea to expose solr directly to the public internet. there are exceptions to this rule -- AOL hosted some live solr instances of the Sarah Palin emails for HuffPo -- but it is definitely an expert level type thing for people who are so familiar with solr they know exactly what to lock down to make it safe for typical users: put an application between your untrusted users and solr and only let that application generate safe, well-formed requests to Solr... https://wiki.apache.org/solr/SolrSecurity -Hoss

-- Alireza Salimi Java EE Developer

-- Walter Underwood Venture Asst. Scoutmaster Troop 14, Palo Alto, CA
RE: large scale indexing issues / single threaded bottleneck
Hi, we are currently thinking about the performance facts too. I wonder if there are any sites on the net describing what a large index is? People always talk about huge indexes and heavy commits etc., but I can't find stats about it in numbers, and no information about the hardware used. Maybe an article in the wiki would help. I expect our index to be about 4 to 5 gig with 500,000 docs and 80,000 commits a day. Is that considered to be large, medium or small? Greets Sebastian

-Original Message- From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov] Sent: Thursday, November 3, 2011 2:00 PM To: 'solr-user@lucene.apache.org' Subject: RE: large scale indexing issues / single threaded bottleneck

Shishir, we have 35 million documents, and should be doing about 5000-1 new documents a day, but with very small documents: 40 fields which have at most a few terms, with many being single terms. You may occasionally see some impact from top level index merges but those should be very infrequent given your stated volumes. For more concrete advice, you should also provide information on the size of your documents, and your search volume. JRJ

-Original Message- From: Awasthi, Shishir [mailto:shishir.awas...@baml.com] Sent: Tuesday, November 01, 2011 10:58 PM To: solr-user@lucene.apache.org Subject: RE: large scale indexing issues / single threaded bottleneck

Roman, How frequently do you update your index? I have a need to do real time add/delete to SOLR documents at a rate of approximately 20/min. The total number of documents is in the range of 4 million. Will there be any performance issues? Thanks, Shishir

-Original Message- From: Roman Alekseenkov [mailto:ralekseen...@gmail.com] Sent: Sunday, October 30, 2011 6:11 PM To: solr-user@lucene.apache.org Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies. I think I have figured out a partial solution for the problem on Friday night. Adding a whole bunch of debug statements to the info stream showed that every document is following the "update document" path instead of the "add document" path, meaning that all document IDs are getting into the pending deletes queue, and Solr has to rescan its index on every commit for potential deletions. This is single threaded and seems to get progressively slower with the index size. Adding overwrite=false to the URL in the /update handler did NOT help, as my debug statements showed that messages still go to the updateDocument() function with deleteTerm not being null. So, I hacked Lucene a little bit and set deleteTerm=null as a temporary solution in the beginning of updateDocument(), and it does not call applyDeletes() anymore. This gave a 6-8x performance boost, and now we can index about 9 million documents/hour (producing 20GB of index every hour). Right now it's at 1TB index size and going, without noticeable degradation of the indexing speed. This is decent, but still the 24-core machine is barely utilized :) Now I think it's hitting a merge bottleneck, where all indexing threads are being paused. And ConcurrentMergeScheduler with 4 threads is not helping much. I guess the changes on the trunk would definitely help, but we will likely stay on 3.4. Will dig more into the issue on Monday. Really curious to see why overwrite=false didn't help, but the hack did.
Once again, thank you for the answers and recommendations Roman
RE: Jetty logging
Well, Jetty is running as a unix service. Here is the run command: jetty-logging.xml: With this configuration I have logs from Jetty but no logs from log4j. Example /logs/_mm_dd.stderrout.log:
2011-11-03 14:36:59.306:INFO::jetty-6.1-SNAPSHOT
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: JNDI not configured for solr (NoInitialContextEx)
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: using system property solr.solr.home: /opt/solr-slave/multicore
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader init INFO: Solr home set to '/opt/solr-slave/multicore/'
Nov 3, 2011 2:36:59 PM org.apache.solr.servlet.SolrDispatchFilter init INFO: SolrDispatchFilter.init()
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: JNDI not configured for solr (NoInitialContextEx)
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: using system property solr.solr.home: /opt/solr-slave/multicore
Nov 3, 2011 2:36:59 PM org.apache.solr.core.CoreContainer$Initializer initialize
I would like Jetty to use my resource/log4j.properties file:
Re: how to apply sort and search both on multivalued field in solr
What does sorting on a multivalued field mean? Should the document appear, in your example, in the a's? c's? e's? p's? In the general case I can't think of a logical place to sort a document into a list when it has more than one token. Why wouldn't searching on your multivalued field and sorting on your min and max fields give you what you want? Can you give an example? Best Erick

On Wed, Nov 2, 2011 at 8:32 AM, vrpar...@gmail.com vrpar...@gmail.com wrote: Hello all, I did googling and also, as per the wiki, we can not apply sorting on a multivalued field. A workaround for that is to add two more fields for the particular multivalued field, min and max. e.g. if the multivalued field has 4 values abc, cde, efg, pqr then min=abc and max=pqr and we can sort on those. This is fine if it is only required to sort on the multivalued field, but I want to do searching and sorting on the same multivalued field, and then the result would not be fine. How to solve this problem? Thanks vishal parekh
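A rough SolrJ sketch of the min/max workaround discussed above; the companion field names Array1_min and Array1_max are hypothetical single-valued schema fields used only for sorting, and the values are just the example ones from the thread:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;

public class MinMaxSortFields {
    public static void main(String[] args) {
        // Example values for the multivalued Array1 field
        List<String> values = Arrays.asList("abc", "cde", "efg", "pqr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        for (String v : values) {
            doc.addField("Array1", v);                 // searchable, multiValued
        }
        // single-valued companion fields used only for sorting
        doc.addField("Array1_min", Collections.min(values));
        doc.addField("Array1_max", Collections.max(values));
        System.out.println(doc);
    }
}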
Re: DIH doesn't handle bound namespaces?
Hi Gary, From http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource : It does not support namespaces, but it can handle xmls with namespaces. When you provide the xpath, just drop the namespace and give the rest (eg if the tag is 'dc:subject' the mapping should just contain 'subject'). Easy, isn't it? And you didn't need to write one line of code! Enjoy. You should be able to use xpath=//titleInfo/title without making any modifications (removing the namespace) to your xml. I hope that answers your question. Regards, Tricia

On Mon, Oct 31, 2011 at 9:24 AM, Moore, Gary gary.mo...@ars.usda.gov wrote: I'm trying to import some MODS XML using DIH. The XML uses bound namespacing:

<mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/mods/v3/mods-3-4.xsd" version="3.4">
  <mods:titleInfo>
    <mods:title>Malus domestica: Arnold</mods:title>
  </mods:titleInfo>
</mods>

However, XPathEntityProcessor doesn't seem to handle xpaths of the type xpath=//mods:titleInfo/mods:title. If I remove the namespaces from the source XML:

<mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/mods/v3/mods-3-4.xsd" version="3.4">
  <titleInfo>
    <title>Malus domestica: Arnold</title>
  </titleInfo>
</mods>

then xpath=//titleInfo/title works just fine. Can anyone confirm that this is the case and, if so, recommend a solution? Thanks Gary Gary Moore Technical Lead LCA Digital Commons Project NAL/ARS/USDA
Re: Stream still in memory after tika exception? Possible memoryleak?
Hi All, I'm experiencing a similar problem to the others' in the thread. I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to apache-solr-4.0-2011-10-14_08-56-59.war and then apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various sizes, using the TikaEntityProcessor. My indexing would run to completion and was completely successful under the June build. The only error was readability of the fulltext in highlighting. This was fixed in Tika 0.10 (TIKA-611). I chose to use the October 14 build of Solr because Tika 0.10 had recently been included (SOLR-2372). On the same machine, without changing any memory settings, my initial problem is a PermGen error. Fine, I increase the PermGen space. I've set the onError parameter to skip for the TikaEntityProcessor. Now I get several (6) "SEVERE: Exception thrown while getting data java.net.SocketTimeoutException: Read timed out" and "SEVERE: Exception in entity : tika:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url <url removed> # 2975" pairs. And after ~3881 documents, with auto commit set unreasonably frequently, I consistently get an Out of Memory error: "SEVERE: Exception while processing: f document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space" The stack trace points to org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151) and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718). The October 30 build performs identically. Funny thing is that monitoring via JConsole doesn't reveal any memory issues. Because the out of memory error did not occur in June, this leads me to believe that a bug has been introduced to the code since then. Should I open an issue in JIRA? Thanks, Tricia

On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs jacob...@gmail.com wrote: Hi Erick, I am using Solr 3.3.0, but with 1.4.1 the same problems. The connector is a homemade program in the C# programming language and is posting via http remote streaming (i.e. http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1 ). I'm using Tika to extract the content (comes with the Solr Cell). A possible problem is that the file stream needs to be closed, after extracting, by the client application, but it seems that something is going wrong while getting a Tika exception: the stream never leaves the memory. At least that is my assumption. What is the common way to extract content from office files (pdf, doc, rtf, xls etc) and index them? To write a content extractor / validator yourself? Or is it possible to do this with the Solr Cell without getting a huge memory consumption? Please let me know. Thanks in advance. Marc

2011/8/30 Erick Erickson erickerick...@gmail.com: What version of Solr are you using, and how are you indexing? DIH? SolrJ? I'm guessing you're using Tika, but how? Best Erick

On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs jacob...@gmail.com wrote: Hi all, Currently I'm testing Solr's indexing performance, but unfortunately I'm running into memory problems. It looks like Solr is not closing the file stream after an exception, but I'm not really sure. The current system I'm using has 150GB of memory and while I'm indexing the memory consumption is growing and growing (eventually more than 50GB). In the attached graph I indexed about 70k office documents (pdf, doc, xls etc) and between 1 and 2 percent throw an exception.
The commits are after 64MB, 60 seconds, or after a job (there are 6 evenly divided jobs). After indexing, the memory consumption isn't dropping. Even after an optimize command it's still there. What am I doing wrong? I can't imagine I'm the only one with this problem. Thanks in advance! Kind regards, Marc
Default value for dynamic fields
Is there any way to define the default value for the dynamic fields in SOLR? I use some dynamic fields of type float with _val_ and if they haven't been created at index time, the value defaults to 0. I would want this to be 1. Can that be changed?
Re: how to apply sort and search both on multivalued field in solr
Thanks Erick, what I gave ('abc', etc.) are values of one multivalued field in one document, but it might be confusing. Let's say I have one field named Array1 which has multiValued=true. Now I want to search on Array1, but I want only the affected values (which I can get in highlighting). I also want to sort on the field Array1, and whatever the response is, it should be sorted on only the affected values (the ones which contain the search term). Also, without search, sorting on Array1 sometimes works fine, sometimes not. Thanks Vishal Parekh
Re: exact matches possible?
Hi Erik, you are spot on with your guess. I had reinserted my data but apparently that does not reindex. Deleting everything and re-entering it was required. Behaviour now seems to be as desired. Thank you very much. PS, thanks for pointing out that the !term is literal. Where can I find that kind of information on the internet? I use the Lucene syntax page as my reference but it appears to be somewhat limited: http://lucene.apache.org/java/2_9_1/queryparsersyntax.html Kind regards, Roland

Erik Hatcher wrote: Roland - Is it possible that you indexed with a different field type and then changed to string without reindexing? A query on a string field will only match literally the exact value (barring any wildcard/regex syntax), so something is fishy with your example. Your query example was odd, not sure if you meant it literally, but given the Word field name the query would be q={!term f=Word}apple - maybe you thought term was meta, but it is meant literally here. Erik

On Nov 3, 2011, at 04:45, Roland Tollenaar wrote: Hi Erik, thanks for the response. I have ensured the type is string and that the field is indexed. No luck though. (Schema setting under solr/conf): <field name="Word" type="string" indexed="true" stored="true" /> Query: Word:apple Desired result: apple Achieved results: apple, the red apple, pine-apple, etc. I have also tried your other suggestion: q={! f=Word}apple (attempting to eliminate any results with spaces), but that just gives errors (when calling from the solr/admin query interface). Am I doing something obviously wrong? Thanks again, Roland

Erik Hatcher wrote: It's certainly quite possible with Lucene/Solr. But you have to index the field to accommodate it. If you literally want an exact match query, use the string field type and then issue a term query. q=field:value will work in simple cases (where the value has no spaces or colons, or other query parser syntax), but q={!term f=field}value is the fail-safe way to do that. Erik

On Nov 2, 2011, at 07:08, Roland Tollenaar wrote: Hi, I am trying to do a search that will only match exact words on a field. I have read somewhere that this is not what Solr is meant for but I am still hoping that it's possible. This is an example of what I have tried (to exclude spaces) but the workaround does not seem to work. Word:apple NOT What I am really looking for is the = operator in SQL (eg Word='apple') but I cannot find its equivalent for Lucene. Thanks for the help. Regards, Roland
Re: Selective Result Grouping
Ok I think I get this. I think this can be achieved if one could specify a filter inside a group, and only documents that pass the filter get grouped. For example, only group documents with the value image for the mimetype field. This filter should be specified per group command. Maybe we should open an issue for this? Martijn

On 1 November 2011 19:58, entdeveloper cameron.develo...@gmail.com wrote: Martijn v Groningen-2 wrote: When using the group.field option values must be the same otherwise they don't get grouped together. Maybe fuzzy grouping would be nice. Grouping videos and images based on mimetype should be easy, right? Videos have a mimetype that starts with video/ and images have a mimetype that starts with image/. Storing the mime type's subtype and type in separate fields and grouping on the type field would do the job. Of course you need to know the mimetype during indexing, but solutions like Apache Tika can do that for you. Not necessarily interested in grouping by mimetype (that's an analysis issue). I simply used videos and images as an example. I'm not sure what you mean by fuzzy grouping. But my goal is to have collapse be more selective somehow on what gets grouped. As a more specific example, I have a field called 'type', with the following possible field values: image, video, webpage. Basically I want to be able to collapse all the images into a single result so that they don't fill up the first page of the results. This is not possible with the current grouping implementation because if you call group.field=type, it'll group everything. I do not want to collapse videos or webpages, only images. I've attached a screenshot of Google's results page to help explain what I mean. http://lucene.472066.n3.nabble.com/file/n3471548/Screen_Shot_2011-11-01_at_11.52.04_AM.png Hopefully that makes more sense. If it's still not clear I can email you privately.

-- Met vriendelijke groet, Martijn van Groningen
Re: Default value for dynamic fields
On 11/3/2011 12:59 PM, Milan Dobrota wrote: Is there any way to define the default value for the dynamic fields in SOLR? I use some dynamic fields of type float with _val_ and if they haven't been created at index time, the value defaults to 0. I would want this to be 1. Can that be changed? Does specifying default=1 not work?
Re: Stopword filter - refreshing stop word list periodically
On Fri, Oct 14, 2011 at 10:06 PM, Jithin jithin1...@gmail.com wrote: What will be the name of this hard coded core? I was rearranging my directory structure, adding a separate directory for code. And it does work with a single core. In trunk, the core in a single-core setup is called collection1. So to reload that you'd call the URL: http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1 -- Sami Siren
Re: how to apply sort and search both on multivalued field in solr
Right, the behavior when sorting on a multivalued field is not defined, so results are unreliable. There's nothing that I know of that'll allow your sort to occur on the matched terms in a multiValued field. But, again, defining correct behavior here isn't easy. What if you searched for two terms and both terms matched a value in a single document's multiValued field? Which term should it sort by? Sorry, but sorting just doesn't work that way and I don't have any bright ideas for how to get this to work as you'd like. Best Erick

On Thu, Nov 3, 2011 at 1:06 PM, vrpar...@gmail.com vrpar...@gmail.com wrote: Thanks Erick, what I gave ('abc', etc.) are values of one multivalued field in one document, but it might be confusing. Let's say I have one field named Array1 which has multiValued=true. Now I want to search on Array1, but I want only the affected values (which I can get in highlighting). I also want to sort on the field Array1, and whatever the response is, it should be sorted on only the affected values (the ones which contain the search term). Also, without search, sorting on Array1 sometimes works fine, sometimes not. Thanks Vishal Parekh
Three questions about: Commit, single index vs multiple indexes and implementation advice
Hi guys! I have a couple of questions that I hope someone could help me with: 1) Recently I've implemented Solr in my app. My use case is not complicated. Suppose that there will be 50 concurrent users tops. This is an app like, let's say, a CRM. I tell you this so you have an idea in terms of how many read and write operations will be needed. What I do need is that the data that is added / updated be available right after it's added / updated (maybe a second later is ok). I know that the commit operation is expensive, so maybe doing a commit right after each write operation is not a good idea. I'm trying to use the autoCommit feature with a maxTime of 1000ms, but then the question arose: is this the best way to handle this type of situation? And if not, what should I do? 2) I'm using a single index per entity type because I've read that if the app is not handling lots of data (let's say, 1 million records) then it's safe to use a single index. Is this true? If not, why? 3) Is it a problem if I use a simple setup of Solr using a single core for this use case? If not, what do you recommend? Any help on any of these topics would be greatly appreciated. Thanks in advance!
Ordered proximity search
Hi, By ordered I mean term1 will always come before term2 in the document. I have two documents: 1. By ordered I mean term1 will always come before term2 in the document 2. By ordered I mean term2 will always come before term1 in the document If I make the query: "term1 term2"~Integer.MAX_VALUE my result is: 2 documents. How can I query to have one result (only if term1 comes before term2): By ordered I mean term1 will always come before term2 in the document Thanks
Re: Default value for dynamic fields
It doesn't work for me. 2011/11/3 Yury Kats yuryk...@yahoo.com On 11/3/2011 12:59 PM, Milan Dobrota wrote: Is there any way to define the default value for the dynamic fields in SOLR? I use some dynamic fields of type float with _val_ and if they haven't been created at index time, the value defaults to 0. I would want this to be 1. Can that be changed? Does specifying default=1 not work? -- Milan Dobrota Ruby on Rails developer milandobrota.com rubylove.info
Re: Can you please guide me through step-by-step installation of Solr Cell ?
: Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.extraction.ExtractingRequestHandler' : : With the jetty and the provided example, I have no problem. It all happens when I use tomcat and solr. : : My setup is as follows: : : I downloaded the apache-solr-3.3.0 and unpacked it. I am using the : apache-solr-3.3.0 folder as my solr-home folder. Inside the dist : folder I have the apache-solr-3.3.0.war and copied everything from the : contrib/extraction/lib into dist. just copying jars into dist isn't going to make things magically work for you -- what matters is that your solr instance knows how to find those plugin jars. when you use the example jetty instance, the solrconfig.xml file has lib directives with relative paths that indicate where to find them. If you use a different solr home dir and/or move files around then those lib directives are no longer going to work... https://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins https://wiki.apache.org/solr/SolrConfigXml#lib -Hoss
performance - dynamic fields versus static fields
Hi, Is there a handy resource on: a. the performance of dynamic fields versus static fields, and b. other pros and cons? Thanks.
Re: score based on unique words matching
: q=david bowie changes : : Problem : If a record mentions david bowie a lot, it beats out something : more relevant (more unique matches) ... : : A. (now appearing david bowie at the cineplex 7pm david bowie goes on stage, : then mr. bowie will sign autographs) : B. song: david bowie - changes : : (A) ends up more relevant because of the frequency or number of words in : it.. not cool... : I want it so the number of words matching will trump density/weight debugQuery=true is your friend .. it will show you exactly how the scores are being computed. The key factors in something like this are fieldNorm, tf, and the coord factor. The fieldNorm includes as a factor the length of the field, so as long as you have omitNorms=false configured for this field, doc #A should be penalized relative to doc #B for being longer -- but if you omit norms then that won't help you -- so start by checking that. The coord factor will penalize documents that don't match all of the clauses of a boolean query (ie: doc #A only matches 2/3 clauses because it doesn't match the word changes), so you could customize your Similarity implementation to make that coord penalty higher, but that requires some custom java code. As an extreme option, you could use omitTf to completely eliminate the term frequency from being a factor in scoring (so the number of times bowie appears won't affect the score, just that it appears at least once), but that probably isn't what you want: "david bowie changes some stuff" would get the same score as "david bowie changes david bowie". In general the simplest way to deal with a lot of this type of thing is to think about how you are structuring your query. Something as simple as using the dismax parser with your field in both the qf and pf params (and a little bit of slop in the ps param) may give you exactly what you want (since it will reward docs where the whole query string appears in the field)... https://wiki.apache.org/solr/DisMaxQParserPlugin -Hoss
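To illustrate Hoss's last suggestion, a small SolrJ sketch that builds a dismax request with qf, pf, and a little phrase slop; the field name "text" and the parameter values are assumptions for illustration only, not taken from the thread:

import org.apache.solr.client.solrj.SolrQuery;

public class DismaxPhraseBoost {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("david bowie changes");
        q.set("defType", "dismax");
        q.set("qf", "text");          // assumed field name
        q.set("pf", "text");          // reward docs where the whole query appears as a phrase
        q.set("ps", "2");             // a little phrase slop
        q.set("debugQuery", "true");  // inspect how the score is computed
        System.out.println(q);        // prints the resulting request parameters
    }
}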
Re: Access Document Score in Custom Function Query (ValueSource)
: In this value source I compute another score for every document : using some features. I want to access the score of the query myField^2 : (for a given document) in this same value source. : : Ideas? Your ValueSource can wrap the score from the other query using a QueryValueSource. Just keep in mind that by definition function queries match every document in the index, so you'll still need to use the other query in some way (or use something like the frange parser to constrain the set of docs returned based on a range of values produced by your function) -Hoss
admin index version not updating
I have a setup with a master and single slave, using the collection distribution scripts. I'm not sure if it's relevant, but I'm running multicore also. I am on version 3.4.0 (we are upgrading from 1.3). My understanding is that the indexVersion (a number) reported by the stats page (admin/stats.jsp) is a timestamp that should correspond to the time of the latest snapshot. At least that's how it has behaved on version 1.3. When I install a new snapshot on the slave (snapinstaller), it does not report any errors, and logs/snapshot.current is updated with the latest snapshot, but the admin/stats page still reports the old version. Actually, the version number increases by 4 each time I install a new index, but doesn't update to anywhere near the time of the latest snapshot (it's a few days off at this point). I have verified that the slave is actually running on the latest index by searching for something that only exists in the latest index. Am I misunderstanding how to interpret the indexVersion, or is the latest snapshot not getting fully installed? Thanks Nathan
Re: DIH doesn't handle bound namespaces?
: It does not support namespaces, but it can handle xmls with namespaces. The real crux of the issue is that XPathEntityProcessor is terribly named. It should have been called LimitedXPathishSyntaxEntityProcessor or something like that, because it doesn't support full xpath syntax... "The XPathEntityProcessor implements a streaming parser which supports a subset of xpath syntax. Complete xpath syntax is not supported but most of the common use cases are covered..." ...i thought there was a DIH FAQ about this, but if not there really should be. -Hoss
Re: Dismax and phrases
Interesting, in the case where you use quotes...

: <result name="response" numFound="6888" start="0" maxScore="3.0879765"> ...
: </lst><str name="rawquerystring">asuntojen hinnat</str>
: <str name="querystring">asuntojen hinnat</str>

...there is one DisjunctionMaxQuery (expected) for the entire phrase, but in the sub-clauses for each individual field the clauses coming from your _fi fields are just building boolean OR queries of the terms from your phrase (instead of building an actual phrase query)...

: <str name="parsedquery">+DisjunctionMaxQuery((table.title_t:"asuntojen hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)</str>

...is this perhaps a side effect of the new autoGeneratePhraseQueries option? ... you are explicitly specifying a quoted phrase, but maybe somewhere in the code path of the dismax parser that information is getting lost? Can you post the details of your schema.xml? (ie: the version property on the schema file, and the dynamicField/field + fieldType definitions for all these fields)

In contrast, your unquoted example is working exactly as I'd expect. A DisjunctionMaxQuery is built for each clause of the input, and the two DisjunctionMaxQuery objects are then combined in a BooleanQuery where the minNrShouldMatch property is set to 2.

: <result name="response" numFound="1065" start="0" maxScore="2.230382"></result> ...
: <str name="rawquerystring">asuntojen hinnat</str>
: <str name="querystring">asuntojen hinnat</str>
: <str name="parsedquery">+((DisjunctionMaxQuery((table.title_t:asuntojen^2.0 | title_t:asuntojen^2.0 | ingress_t:asuntojen | text_fi:asunto | table.description_fi:asunto | table.description_t:asuntojen | graphic.title_t:asuntojen^2.0 | graphic.title_fi:asunto^2.0 | table.title_fi:asunto^2.0 | table.contents_t:asuntojen | text_t:asuntojen | ingress_fi:asunto | table.contents_fi:asunto | title_fi:asunto^2.0)~0.01) DisjunctionMaxQuery((table.title_t:hinnat^2.0 | title_t:hinnat^2.0 | ingress_t:hinnat | text_fi:hinta | table.description_fi:hinta | table.description_t:hinnat | graphic.title_t:hinnat^2.0 | graphic.title_fi:hinta^2.0 | table.title_fi:hinta^2.0 | table.contents_t:hinnat | text_t:hinnat | ingress_fi:hinta | table.contents_fi:hinta | title_fi:hinta^2.0)~0.01))~2) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 FunctionQuery((1.0/(3.16E-11*float(ms(const(1319438484878),date(date.modified_dt)))+1.0))^100.0)</str>

-Hoss
UnInvertedField vs FieldCache for facets for single-token text fields
I have some fields I facet on that are TextFields but have just a single token. The fieldType looks like this:

<fieldType name="myStringFieldType" class="solr.TextField" indexed="true" stored="false" omitNorms="true" sortMissingLast="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

SimpleFacets uses an UnInvertedField for these fields because multiValuedFieldCache() returns true for TextField. I tried changing the type for these fields to the plain string type (StrField). The facets *seem* to be generated much faster. Is it expected that FieldCache would be faster than UnInvertedField for single-token strings like this? My goal is to make the facet re-generation after a commit as fast as possible. I would like to continue using TextField for these fields since I have a need for filters like LowerCaseFilterFactory, which still produces a single token. Is it safe to extend TextField and have multiValuedFieldCache() return false for these fields, so that UnInvertedField is not used? Or is there a better way to accomplish what I'm trying to do? -Michael
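For what it's worth, a minimal, untested Java sketch of the TextField subclass the question describes; whether overriding multiValuedFieldCache() this way is actually safe is exactly the open question in this message, and the class name is made up:

import org.apache.solr.schema.TextField;

// Hypothetical field type for TextFields known to produce a single token,
// so that faceting can use the FieldCache path instead of UnInvertedField.
public class SingleTokenTextField extends TextField {
    @Override
    public boolean multiValuedFieldCache() {
        return false;
    }
}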
Re: Dismax and phrases
: ...is this perhaps a side effect of the new autoGeneratePhraseQueries : option? ... you are explicitly specifying a quoted phrase, but : maybe somewhere in the code path of the dismax parser that information is : getting lost? FWIW: a) I just realized you said in your first message you were using Solr 1.4.1, which *definitely* predates the autoGeneratePhraseQueries option - so I'm really at a loss to understand how you are getting that query structure (definitely want to see your configs) b) I did some quick testing with Solr 3.4 using the example configs, and verified that regardless of how autoGeneratePhraseQueries is set on the fieldType for the name field, this request... http://localhost:8983/solr/select/?fl=name&debugQuery=true&q=%22samsung%20hard%20drive%22&defType=dismax&qf=name&qs=100 ...always produces a dismax query wrapped around a phrase query. -Hoss
RE: Questions about Solr's security
Me too!

-Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Tuesday, November 01, 2011 1:02 PM To: solr-user@lucene.apache.org Subject: Re: Questions about Solr's security

I once had to deal with a severe performance problem caused by a bot that was requesting results starting at 5000. We disallowed requests over a certain number of pages in the front end to fix it. wunder

On Nov 1, 2011, at 12:57 PM, Erik Hatcher wrote: Be aware that even /select could have some harmful effects, see https://issues.apache.org/jira/browse/SOLR-2854 (addressed on trunk). Even disregarding that issue, /select is a potential gateway to any request handler defined, via /select?qt=/req_handler Again, in general it's not a good idea to expose Solr to anything but a controlled app server. Erik

On Nov 1, 2011, at 15:51, Alireza Salimi wrote: What if we just expose '/select' paths - by firewalls and load balancers - and also use SSL and HTTP basic or digest access control?

On Tue, Nov 1, 2011 at 2:20 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I was wondering if it's a good idea to expose Solr to the outside world, : so that our clients running on smart phones will be able to use Solr. As a general rule of thumb, i would say that it is not a good idea to expose solr directly to the public internet. there are exceptions to this rule -- AOL hosted some live solr instances of the Sarah Palin emails for HuffPo -- but it is definitely an expert level type thing for people who are so familiar with solr they know exactly what to lock down to make it safe for typical users: put an application between your untrusted users and solr and only let that application generate safe, well-formed requests to Solr... https://wiki.apache.org/solr/SolrSecurity -Hoss

-- Alireza Salimi Java EE Developer

-- Walter Underwood Venture Asst. Scoutmaster Troop 14, Palo Alto, CA
Re: BaseTokenFilterFactory not found in plugin
: myorg/solr/analysis/*.java`. I then made a `.jar` file from the .class files : and put the .jar file in the solr/lib/ directory. I modified schema.xml to : include the new filter: what exactly do you mean by the solr/lib/ directory? ... if you mean that solr is the solr home dir where you are running solr, so you have a structure like this... solr/conf/solrconfig.xml solr/conf/schema.xml solr/lib/your-jar-name.jar ...then that should be correct. If however you put it in some other lib directory (like, perhaps jetty's lib directory) then it might get loaded by a lower level class loader, so it has no runtime visibility of the classes loaded by Solr. When Solr starts up, the SolrResourceLoader explicitly logs every jar file it finds in its lib dir, or any jars explicitly specified, or loaded because of @sharedLib or <lib/> configurations, so check your logs to make sure your jar is listed there -- if it's not, but it's still getting loaded, then it's getting loaded by a different classloader. -Hoss
Re: Default value for dynamic fields
On Thu, Nov 3, 2011 at 12:59 PM, Milan Dobrota mi...@milandobrota.com wrote: Is there any way to define the default value for the dynamic fields in SOLR? I use some dynamic fields of type float with _val_ and if they haven't been created at index time, the value defaults to 0. I would want this to be 1. Can that be changed? On trunk, there are some new (currently undocumented) function queries that can do this: def(myfield,1) If there are not normally 0 values anyway, you can also map any 0 values encountered via map(), or min() if existing values are all positive. -Yonik http://www.lucidimagination.com
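A couple of hedged examples of how those functions might be plugged in (the field name weight_f is purely illustrative):

  trunk/4.x:  q={!func}def(weight_f,1)
  3.x:        q={!func}map(weight_f,0,0,1)

map(x,min,max,target) rewrites any value falling in [min,max] (here an exact 0) to the target value, so it only behaves like a default if real values are never legitimately 0 -- which is the caveat in the reply above.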
Re: facet with group by (or field collapsing)
I'm attempting the following query: http://{host}/apache-solr-3.3.0/select/?q=cesy&version=2.2&start=0&rows=10&indent=on&group=true&group.field=SIP&group.limit=1&facet=true&facet.field=REPOSITORYNAME The result is 4 matches, all in 1 group (with group.limit=1). Rather than show facet.field=REPOSITORYNAME's 4 facets, I want to see the REPOSITORYNAME facet with a count of 1 (for the 1 group returned), with the value of the REPOSITORYNAME field in the 1 doc returned in the group. Is this possible? I tried adding the parameter collapse.facet=after, but that seemed to have no effect. -- View this message in context: http://lucene.472066.n3.nabble.com/facet-with-group-by-or-field-collapsing-tp497252p3478515.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: facet with group by (or field collapsing)
collapse.facet=after doesn't exist in Solr 3.3. This parameter exists in the SOLR-236 patches and is implemented differently in the released versions of Solr. From Solr 3.4 you can use group.truncate. The facet counts are then computed based on the most relevant document of each group. Martijn On 3 November 2011 22:47, erj35 eric.ja...@yale.edu wrote: I'm attempting the following query: http://{host}/apache-solr-3.3.0/select/?q=cesy&version=2.2&start=0&rows=10&indent=on&group=true&group.field=SIP&group.limit=1&facet=true&facet.field=REPOSITORYNAME The result is 4 matches, all in 1 group (with group.limit=1). Rather than show facet.field=REPOSITORYNAME's 4 facets, I want to see the REPOSITORYNAME facet with a count of 1 (for the 1 group returned), with the value of the REPOSITORYNAME field in the 1 doc returned in the group. Is this possible? I tried adding the parameter collapse.facet=after, but that seemed to have no effect. -- View this message in context: http://lucene.472066.n3.nabble.com/facet-with-group-by-or-field-collapsing-tp497252p3478515.html Sent from the Solr - User mailing list archive at Nabble.com. -- Met vriendelijke groet, Martijn van Groningen
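Assuming an upgrade to 3.4, the original request could then be kept as-is with one extra parameter appended (untested sketch):

...&group=true&group.field=SIP&group.limit=1&group.truncate=true&facet=true&facet.field=REPOSITORYNAME

With group.truncate=true the REPOSITORYNAME facet counts are computed over the group heads rather than over all 4 matching documents, which should give the count of 1 described in the question.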
Access Score in Custom Function Query
Hi, I have a custom function query (value source) where I want to use the score for some computation. For example, for every document I want to add some number (obtained from an external file) to its score. I am achieving this like the following: http://localhost:PORT/myCore/select?q=queryString&qt=my_request_handler&fl=field1,field2,score&debugQuery=on&sort=myfunc(query($qq)) desc where the definitions of my_request_handler and qq are as follows:

<requestHandler name="my_request_handler" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="qq">{!dismax v=$q}</str>
    <str name="qf">field1^2 field^3</str>
  </lst>
</requestHandler>

Questions: 1. To obtain the score in my function query I am executing the dismax query again (myfunc(query($qq))). Could it slow things down? Is there any way I can access the score without querying again? 2. I also want to normalize the (query) score I get to a range between 0 and 1. Is there any way to access the MAX_SCORE in the same function query/value source (so that I can divide every score by that)? Thanks a lot guys Sid -- View this message in context: http://lucene.472066.n3.nabble.com/Access-Score-in-Custom-Function-Query-tp3478597p3478597.html Sent from the Solr - User mailing list archive at Nabble.com.
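On question 2, one hedged possibility (not something confirmed in this thread) is the scale() function, which rescales the values of an embedded function source to a target range, e.g.:

  sort=myfunc(scale(query($qq),0,1)) desc

That avoids needing the maximum score up front, but scale() has to compute the min and max of the embedded source across the documents first, so it does nothing to help the re-execution concern in question 1.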
Re: Access Document Score in Custom Function Query (ValueSource)
I understand that. Thanks. I just posted a related question, titled Access Score in Custom Function Query, where (among other things) I am asking about the performance aspects of this method. As you said, I need to execute some query first to create a constrained recall set and then apply my custom function query (which in turn executes another query) to it. In my case I am using the same query again: first to create the recall set (which also scores the docs, though I don't use those scores), and then again inside my custom function to get the score. I am worried it may slow things down. Comments? Thanks Sid -- View this message in context: http://lucene.472066.n3.nabble.com/Access-Document-Score-in-Custom-Function-Query-ValueSource-tp3432459p3478619.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: UnInvertedField vs FieldCache for facets for single-token text fields
Hi Michael, The FieldCache is a simpler data structure and easier to create, so I also expect it to be faster. Unfortunately, for TextField an UnInvertedField is always used, even if you have one token per document. I think overriding the multiValuedFieldCache method to return false would work. If you're using 4.0-dev (trunk) I'd use facet.method=fcs (this option is only usable if the multiValuedFieldCache method returns false). This is per-segment faceting, and the cache only has to be extended for new segments; this facet approach is better for indexes with frequent changes. I think it is even faster in your case than just using the FieldCache method (which operates on a top-level reader: after each commit the complete cache is invalidated and has to be recreated). Otherwise I'd try facet.method=enum, which is fast if you have few distinct facet values (the number of docs doesn't influence the performance that much). The facet.method=enum option is also valid for normal TextFields, so no need to have custom code. Martijn On 3 November 2011 21:16, Michael Ryan mr...@moreover.com wrote: I have some fields I facet on that are TextFields but have just a single token. The fieldType looks like this:

<fieldType name="myStringFieldType" class="solr.TextField" indexed="true" stored="false" omitNorms="true" sortMissingLast="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

SimpleFacets uses an UnInvertedField for these fields because multiValuedFieldCache() returns true for TextField. I tried changing the type for these fields to the plain string type (StrField). The facets *seem* to be generated much faster. Is it expected that FieldCache would be faster than UnInvertedField for single-token strings like this? My goal is to make the facet re-generation after a commit as fast as possible. I would like to continue using TextField for these fields since I have a need for filters like LowerCaseFilterFactory, which still produces a single token. Is it safe to extend TextField and have multiValuedFieldCache() return false for these fields, so that UnInvertedField is not used? Or is there a better way to accomplish what I'm trying to do? -Michael -- Met vriendelijke groet, Martijn van Groningen
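For illustration, the enum approach is purely a request-time change, so the existing TextField/KeywordTokenizer analysis (with LowerCaseFilterFactory added) can stay as it is; the field name below is assumed rather than taken from the thread:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=myStringField&facet.method=enum

facet.method can also be overridden per field, e.g. f.myStringField.facet.method=enum, if other facet fields should keep the default behaviour.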
Highlighter showing matched query words only
Hello Folks, I am a newbie of Solr. I wonder if Solr Highlighter can show the matched query words only. Suppose my query is godfather AND pacino. I just want to display godfather and pacino in any of the highlighted fields. For the sake of performance, I do not want to use regular expressions to parse the text and locate the query words which are already enclosed between <em> and </em>. Solr obviously has already done the searching and highlighting, but the Solr output mixes what I want with what I do not want. I just want to get out the intermediate results, the matching query words, and nothing else. Is there a way to get the intermediate results, the matching query words, before they are mixed with other text? Thank you all very much for your help in advance! N. J. -- View this message in context: http://lucene.472066.n3.nabble.com/Highlighter-showing-matched-query-words-only-tp3478731p3478731.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Using Solr components for dictionary matching?
The scenarios that could use dictionary matching: 1. A document being processed to see if it contains one of 10,000 terms. 2. Query completion as you type. 3. Basically the inverse of finding a document: the document is the query and the dictionary of terms is matched against it in parallel. Nagendra Sent from my Windows Phone From: Erick Erickson Sent: 11/3/2011 8:13 AM To: solr-user@lucene.apache.org Subject: Re: Using Solr components for dictionary matching? I really don't understand what you're asking. Could you give some examples of what you're trying to do? Best Erick On Tue, Nov 1, 2011 at 10:38 AM, Nagendra Mishr nmi...@gmail.com wrote: Hi all, Is there a good guide on using Solr components as a dictionary matcher? I need to do some pre-processing that involves lots of dictionary lookups and it doesn't seem right to query solr for each instance. Thanks in advance, Nagendra
Re: Using Solr components for dictionary matching?
On Thu, Nov 3, 2011 at 4:06 PM, Nagendra Mishr nmi...@gmail.com wrote: The scenarios that could use dictionary matching: 1. A document being processed to see if it contains one of 10,000 terms. 2. Query completion as you type. 3. Basically the inverse of finding a document: the document is the query and the dictionary of terms is matched against it in parallel. Try the Aho-Corasick algorithm - http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the dictionary) within an input text. It matches all patterns simultaneously. The complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. HTH, Vijay
how to achieve google.com like results for phrase queries
Hello, I use nutch-1.3 crawled results in solr-3.4. I noticed that for two-word phrases like "newspaper latimes", latimes.com is not in the results at all. This may be due to the dismax defType that I use in the request handler:

<str name="defType">dismax</str>
<str name="qf">url^1.5 id^1.5 content^ title^1.2</str>
<str name="pf">url^1.5 id^1.5 content^0.5 title^1.2</str>

with mm as

<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

However, changing it to

<str name="mm">1&lt;-1 2&lt;-1 5&lt;-2 6&lt;90%</str>

and q.op to OR or AND does not solve the problem. In this case latimes.com is ranked higher, but still not in first place. Also, in this case results containing both words are ranked very low, almost at the end. We need latimes.com ranked first, followed by the results containing both words, and so on. Any ideas how to modify the config to this end? Thanks in advance. Alex.
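One thing to experiment with (the boost values below are only illustrative, not a known fix for this index) is leaning harder on the phrase fields and keeping the phrase slop tight, so documents matching the whole phrase get a much larger boost than single-word matches:

<str name="pf">url^10 title^5 content^2</str>
<str name="ps">0</str>

Running the query with debugQuery=on then shows how much the phrase clause actually contributes for latimes.com versus the single-word matches, which makes it easier to tune the boosts.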
Re: Stopword filter - refreshing stop word list periodically
Thanks Sami. I ended up setting up a proper core as per the documentation, named core0. On Thu, Nov 3, 2011 at 11:07 PM, Sami Siren-2 [via Lucene] ml-node+s472066n3477844...@n3.nabble.com wrote: On Fri, Oct 14, 2011 at 10:06 PM, Jithin [hidden email] wrote: What will be the name of this hard-coded core? I was rearranging my directory structure, adding a separate directory for code. And it does work with a single core. In trunk the single-core setup's core is called collection1. So to reload that you'd call the url: http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1 -- Sami Siren -- Thanks Jithin Emmanuel -- View this message in context: http://lucene.472066.n3.nabble.com/Stopword-filter-refreshing-stop-word-list-periodically-tp3421611p3479040.html Sent from the Solr - User mailing list archive at Nabble.com.