Re: how can solr search against a group of fields

2009-01-20 Thread Marc Sturlese

Check the DisMaxRequestHandler, maybe it helps.
It allows you to search across more than one field:
http://wiki.apache.org/solr/DisMaxRequestHandler#head-af452050ee272a1c88e2ff89dc0012049e69e180
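
For example, a request along these lines (handler name and field names are
just illustrative placeholders for yours) searches several fields at once,
with per-field weights:

http://localhost:8983/solr/select?qt=dismax&q=WORD&qf=field1^2.0+field2^1.0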


surfer10 wrote:
> 
> Good day, gentlemen.
> 
> In my search engine I have 4 groups of text:
> 1) user address
> 2) user description
> 3) ...
> 4) ...
> 
> I want to give users the ability to search all of them, with the ability to
> select a conjunction of some of them. Conjunction means that a user should be
> able to search fields 1) and 2), fields 1 AND 3, and so on.
> 
> I understand how I can give them the ability to search everywhere - it can be
> achieved with the copyField parameter - but how can a user search for a bunch
> of terms in different groups?
> 
> Now I'm using syntax like this:
> 
> +(default_field:WORD default_field:WORD2 default_field:WORD3)
> 
> If I want to give them the opportunity to search by 2 of the 4 fields, should
> I repeat the construction?
> i.e.
> 
> (field1:WORD field1:WORD2 field1:WORD3) (field2:WORD field2:WORD2
> field2:WORD3) ?
> 
> Is there any way to specify field1,field2:TERM ?
> 

-- 
View this message in context: 
http://www.nabble.com/how-can-solr-search-angainst-group-of-field-tp21557783p21559093.html
Sent from the Solr - User mailing list archive at Nabble.com.



How to modify the relevance sorting in Solr?

2009-01-20 Thread fei dong
Hi guys:

I am going to build an audio search based on Solr. I worked out a
prototype like:

schema.xml:
   <field name="id" type="string" indexed="true" stored="true"/>
   <field name="artist" type="text" indexed="true" stored="true"/>
   <field name="album" type="text" indexed="true" stored="true"/>
   <field name="mp3" type="text" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true"/>

Then I import the data from MySQL and add it to the index in XML format.

My problems are :
1. Support a query language of "songname + artist", "artist + album", or
"artist + album + songname"; some users would like to query like "because of
you ne-yo". So I need to cut words in the proper way. How do I modify the way
Solr cuts words (to recognize the song name, album, or artist)?

2. Relevance sorting: among the matching results, records whose album or
artist matches the query should be put ahead. I find Solr removes the stop
words and cuts the query into "because", "you", so results like "because I
love you" and "because you loved me" come first. Another bad case is songs
that lack artist information but have the right song name sorting to the
front, like:

Results:
1. id:602821  artist:  album:  mp3:because of you  links:1
2. id:612525  artist:  album:  mp3:because of you  links:1

The principle is to match the query as completely as possible, and records
that have more information should be put ahead. So how can I add more weight
to the album or artist terms and modify the sorting strategy? I am new to
Solr and really need help.
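
Ideally I imagine something like a dismax query with per-field boosts (the
field names are the ones from my schema above) - is that the right direction?

http://localhost:8983/solr/select?qt=dismax&q=because+of+you+ne-yo&qf=mp3^1.0+artist^4.0+album^2.0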

Regards,
Fei


Re: Querying Solr Index for date fields

2009-01-20 Thread Erik Hatcher


On Jan 20, 2009, at 12:10 AM, prerna07 wrote:

The below-mentioned fq tag gives me an error:
dateField:[NOW-45DAYS TO NOW]^1.0 DateField:[NOW TO
NOW+45DAYS]^1.0


What error did you get?   You've got dateField/DateField as two  
different cases, which would give a parse exception if one or both of  
those didn't exist in your schema.  Other than that, the syntax itself  
looks fine.  Of course boosting by ^1.0 isn't quite the number you'll  
want to use.


Erik



Re: Querying Solr Index for date fields

2009-01-20 Thread Erik Hatcher


On Jan 20, 2009, at 5:28 AM, Erik Hatcher wrote:

On Jan 20, 2009, at 12:10 AM, prerna07 wrote:

The below-mentioned fq tag gives me an error:
dateField:[NOW-45DAYS TO NOW]^1.0 DateField:[NOW TO
NOW+45DAYS]^1.0


What error did you get?   You've got dateField/DateField as two  
different cases, which would give a parse exception if one or both  
of those didn't exist in your schema.  Other than that, the syntax  
itself looks fine.  Of course boosting by ^1.0 isn't quite the  
number you'll want to use.


Oops... I misspoke... putting a boosting query like this into fq isn't  
going to help.  fq (filter query) doesn't factor into the scoring.  So  
you'll either need to SHOULD/OR include that clause into your main  
standard query, or if you're using dismax factor it into a bq  
(boosting query) parameter.
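
With dismax that might look something like this (boost value illustrative,
and the bq value would need URL-escaping in a real request):

  q=ipod&qt=dismax&bq=dateField:[NOW-45DAYS TO NOW]^5.0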


Erik



Re: Searching for 'A*' is not returning the same results as 'a*'

2009-01-20 Thread Manupriya

I got the answer to my problem. This is happening because I am using a
wildcard. Wildcard queries are not passed through the Analyzer.

http://wiki.apache.org/lucene-java/LuceneFAQ#head-4d62118417eaef0dcb87f4370583f809848ea695
http://markmail.org/message/25wm4mrdhs6yqnck#query:upper%20case%20solr+page:1+mid:7c6bf6e7p755eu67+state:results
http://www.mail-archive.com/solr-user@lucene.apache.org/msg08542.html
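
So the workaround on our side is to lowercase the term before building the
wildcard query - a minimal sketch in Java (variable names illustrative):

    String q = "institutionName:" + userInput.toLowerCase() + "*";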

Thanks,
Manu



Manupriya wrote:
> 
> Hi,
> 
> I am using the following analyzer for indexing and querying -
> --
> <analyzer type="index">
>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>     words="stopwords.txt" enablePositionIncrements="true"/>
>   <filter class="solr.WordDelimiterFilterFactory"
>     generateWordParts="1" generateNumberParts="1" catenateWords="1"
>     catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> </analyzer>
> <analyzer type="query">
>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>     ignoreCase="true" expand="true"/>
>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>     words="stopwords.txt"/>
>   <filter class="solr.WordDelimiterFilterFactory"
>     generateWordParts="1" generateNumberParts="1" catenateWords="0"
>     catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> </analyzer>
> -
> 
> I search using Solr admin console. When I search for - 
> institutionName:a*, I get 93 matching records. But when I search for -
> institutionName:A*, I DO NOT get any matching records.
> 
> I did field Analysis for a* and A* for the analyzer configuration.
> 
> For a*
> --
>   http://www.nabble.com/file/p21557926/a-analysis.gif 
> 
> 
> For A*
> --
>   http://www.nabble.com/file/p21557926/A1-analysis.gif 
> 
> As per my understanding, the analyzer is working fine in both cases. I am
> not able to understand why the query is not returning any results for A*.
> 
> I feel that I am missing something - can anyone help me with that?
> 
> Regards,
> Manu
> 

-- 
View this message in context: 
http://www.nabble.com/Searching-for-%27A*%27-is-not-returning-me-same-result-as-%27a*%27-tp21557926p21560742.html
Sent from the Solr - User mailing list archive at Nabble.com.



ERROR trying to just commit via /update

2009-01-20 Thread Marc Sturlese

Hey there,
I am trying to do just a commit via URL:
http://localhost:8084/nightly_web/es_jobs_core/update
I have also tried:
http://localhost:8084/nightly_web/es_jobs_core/update?commit=true
And I am getting this error:

2009-01-20 11:27:50,424 [http-8084-Processor25] ERROR
org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException:
missing content stream
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:49)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1341)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at
org.netbeans.modules.web.monitor.server.MonitorFilter.doFilter(MonitorFilter.java:368)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:619)

It looks like Solr is asking me for a file with info to update (it does the
commit after that). I just need to do a commit. The problem has appeared
because I am using the scripts of Solr Collection Distribution, and when I
try to do a snapinstaller it calls the commit script... and the commit script
tries to do what I wrote above.
Am I missing something, or is there something wrong in there...?
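
In case it helps anyone searching the archives: posting an explicit commit
command as the request body does seem to be what the handler expects, e.g.
with curl:

curl http://localhost:8084/nightly_web/es_jobs_core/update \
     -H 'Content-type:text/xml; charset=utf-8' --data-binary '<commit/>'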

Thanks in advance!
-- 
View this message in context: 
http://www.nabble.com/ERROR-trying-to-just-commit-via--update-tp21560718p21560718.html
Sent from the Solr - User mailing list archive at Nabble.com.



Date range query where doc has more than one date field

2009-01-20 Thread joeMcElroy

Hi 

sorry if this is a trivial question, but:

I have a doc which has more than one date field: start and end. Now I need
the user to specify a date range, and I need to find all docs where the
user's range falls between the doc's start and end date fields.

Searching this mailing list, someone has suggested having a single
multivalued date field and running the range query against that field,
which should work.
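
With the two separate fields I was expecting something along these lines to
work (dates illustrative) - docs whose start is before my range and whose
end is after it:

+start:[* TO 2009-01-01T00:00:00Z] +end:[2009-01-31T00:00:00Z TO *]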

Any other suggestions?
-- 
View this message in context: 
http://www.nabble.com/Date-range-query-where-doc-has-more-than-one-date-field-tp21560935p21560935.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Embedded Solr updates not showing until restart

2009-01-20 Thread edre...@ha




Grant Ingersoll-6 wrote:
> 
> Do they show up if you use non-embedded?  That is, if you hit that  
> slave over HTTP from your browser, are the changes showing up?
> 

Yes.  Changing the config to access the server over HTTP works fine.  When
looking at our console logs for the Solr server, I can see no discernible
difference between the embedded and HTTP approaches.  The snapinstaller
appears to be working in both cases, but changes to the index don't show up
in queries when the slave is configured as embedded.

I'm moving forward with the HTTP approach, but the embedded approach is
desirable for two (obvious) reasons: 1) performance improvement, 2) simpler
deployment.

Thanks.
-- 
View this message in context: 
http://www.nabble.com/Embedded-Solr-updates-not-showing-until-restart-tp21546235p21562955.html
Sent from the Solr - User mailing list archive at Nabble.com.



I get SEVERE: Lock obtain timed out

2009-01-20 Thread Julian Davchev
Hi,
Are there any documents or something I can read on how locks work, how I can
control them, and when they occur? The only way I got out of this mess was
restarting Tomcat.

SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
timed out: SingleInstanceLock: write.lock
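
So far the only related settings I have spotted are in the index sections of
solrconfig.xml (defaults shown - not sure these are the right knobs):

  <writeLockTimeout>1000</writeLockTimeout>
  <unlockOnStartup>false</unlockOnStartup>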


Cheers,


Wish to Unsubscribe from Solr Mailing List

2009-01-20 Thread kirk beers
My email address is kgb...@gmail.com

Thanks for everything !!!


problem with DIH and MySQL

2009-01-20 Thread Nick Friedrich

Hi,

I'm new to Solr and I have a problem.
I want to use DIH to index data stored in a MySQL database.

I added to solrconfig.xml

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

The schema.xml is modified. Now there are just two fields:

<field name="paper_ID_pk" ... required="true" />
<field name="..." ... required="false"/>


"paper_ID_pk" is set to be the uniqueKey.


The data-config.xml:

<dataConfig>
  <dataSource driver="..." url="jdbc:mysql://localhost/db_name"
      user="user" password="pw" />
  <document>
    <entity ...>
      ...
    </entity>
  </document>
</dataConfig>
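
(For reference, I trigger the import with
http://localhost:8983/solr/dataimport?command=full-import - host and port
are from my setup.)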

When I try to make a full import this way, nothing is indexed.
A look at the status always returns this:



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:0:5.766</str>
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Processed">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2009-01-20 14:21:36</str>
    <str name="">Indexing failed. Rolled back all changes.</str>
    <str name="Rolledback">2009-01-20 14:21:36</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is
likely to change in the future.</str>
</response>



Adding some data, e.g. via curl, works fine. So I think the schema.xml
is correct.


I hope somebody can help.

Thanks,
Nick




Re: problem with DIH and MySQL

2009-01-20 Thread Noble Paul നോബിള്‍ नोब्ळ्
It got rolled back.
Any exceptions on the Solr console?

On Tue, Jan 20, 2009 at 9:07 PM, Nick Friedrich wrote:
> Hi,
>
> I'm new to Solr and I have a problem.
> I want to use DIH to index data stored in a MySQL database.
> [...]
>
> Thanks,
> Nick



-- 
--Noble Paul


Re: problem with DIH and MySQL

2009-01-20 Thread Nick Friedrich

No, there are no exceptions.
But I have to admit that I'm not sure what you mean by "console".


Quoting Noble Paul:


it got rolled back
any exceptions on solr console?


--
--Noble Paul







RE: How to select *actual* match from a multi-valued field

2009-01-20 Thread Feak, Todd
Anyone who can shed some insight?

-Todd

-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com] 
Sent: Friday, January 16, 2009 9:55 AM
To: solr-user@lucene.apache.org
Subject: How to select *actual* match from a multi-valued field

At a high level, I'm trying to do some more intelligent searching using
an app that will send multiple queries to Solr. My current issue is
around multi-valued fields and determining which entry actually
generated the "hit" for a particular query.

 

For example, let's say that I have a multi-valued field containing
people's names, associated with the document (trying to be non-specific
on purpose). In one document, I have the following names:

Jane Smith, Bob Smith, Roger Smith, Jane Doe. If the user performs a
search for Bob Smith, this document is returned. What I want to know is
that this document was returned because of "Bob Smith", not because of
Jane or Roger. I've tried using the highlighting settings. They do
provide some help, as the Jane Doe entry doesn't come back highlighted,
but both Jane and Roger do. I've tried using hl.requireFieldMatch, but
that seems to pertain only to fields, not entries within a multi-valued
field.

 

Using Solr, is there a way to get the information I am looking for?
Specifically, that "Bob Smith" is the value in the multi-valued field
that triggered the hit?

 

-Todd Feak



SOLR Problem with special chars

2009-01-20 Thread Kraus, Ralf | pixelhouse GmbH

Hello,

My string in my DB is like "Kellogs, Corn- (Flakes)"

When I search with "Kellogs" or "Corn" or "Flakes" I can't find the entry
in my index :-(

Is there something I'm missing?

Greets,

--
Ralf Kraus


Re: How to select *actual* match from a multi-valued field

2009-01-20 Thread Toby Cole
We came across this problem; unfortunately we gave up and did our
hit-highlighting for multi-valued fields on the frontend. :-/
One approach would be to extend Solr to return every value of a multi-valued
field in the highlighting, regardless of whether that particular value
matched.
Just an idea, don't know if it's feasible or not. If anyone can point me in
the right direction I could probably bash together a plugin and some tests.

Toby.

On 20 Jan 2009, at 16:31, Feak, Todd wrote:


Anyone who can shed some insight?

-Todd

-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com]
Sent: Friday, January 16, 2009 9:55 AM
To: solr-user@lucene.apache.org
Subject: How to select *actual* match from a multi-valued field

At a high level, I'm trying to do some more intelligent searching using an
app that will send multiple queries to Solr. [...] Using Solr, is there a
way to get the information I am looking for? Specifically, that "Bob Smith"
is the value in the multi-valued field that triggered the hit?



Toby Cole
Software Engineer

Semantico
Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE
T: +44 (0)1273 358 238
F: +44 (0)1273 723 232
E: toby.c...@semantico.com
W: www.semantico.com



Re: How to get the score in the result

2009-01-20 Thread Ryan Grange
It would help to see your query, but you basically add ",score" to 
whatever you're sending over in the "fl" variable.  If you aren't 
passing "fl", you may want to use "fl=*,score".
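
A minimal example (host, port, and query illustrative):

http://localhost:8983/solr/select?q=ipod&fl=*,score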


Ryan T. Grange, IT Manager
DollarDays International, Inc.
rgra...@dollardays.com (480)922-8155 x106



ayyanar wrote:

final QueryResponse queryResponse = server.query(query);
final List<DocumentWrapper> results =
    queryResponse.getBeans(DocumentWrapper.class);

This is the way i do the query in the solr. DocumentWrapper is my class
which maps to the document fields.

Can anyone let me know how the documentwrapper can return the score of the
document? How to get the solr score of each document?
  


Re: SOLR Problem with special chars

2009-01-20 Thread Otis Gospodnetic
Ralf,

Can you paste the part of your schema.xml where you defined the relevant field?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: "Kraus, Ralf | pixelhouse GmbH" 
> To: solr-user@lucene.apache.org
> Sent: Tuesday, January 20, 2009 11:35:38 AM
> Subject: SOLR Problem with special chars
> 
> Hello,
> 
> My string in my DB is like "Kellogs, Corn- (Flakes)"
> 
> When I search with "Kellogs" or "Corn" or "Flakes" I can't find the entry
> in my index :-(
> Is there something I'm missing?
> 
> Greets,
> 
> -- Ralf Kraus



Re: advice on minimal solr/jetty

2009-01-20 Thread Otis Gospodnetic
Steve,

3s is pretty good.  I'd try it with Jetty without any webapps first.  Then
I'd try to trim Jetty and its config.  On the Solr end I'd comment out
various pieces of schema and solrconfig that I'm not using - there is lots
of noise there.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Steve Conover 
> To: solr-user@lucene.apache.org
> Sent: Monday, January 19, 2009 2:38:14 PM
> Subject: advice on minimal solr/jetty
> 
> Hi everyone,
> 
> I'd like to see how much I can reduce the startup time of jetty/solr.
> Right now I have it at about 3s - that's fast, but I'd like to see how
> close to zero I can get it.
> 
> I've minimized my schema and solrconfig down to what I use (my solr
> needs are pretty vanilla).  Now I'm looking at all the solr plugins
> that get loaded at startup that I don't use and wondering whether
> getting rid of those would help.
> 
> But before jumping off into jar manipulation I figured I'd pose this
> question to the group - what would you do?
> 
> -Steve



Re: Embedded Solr updates not showing until restart

2009-01-20 Thread Grant Ingersoll

Can you share your code?  Or reduce it down to a repeatable test?

On Jan 20, 2009, at 8:22 AM, edre...@ha wrote:






Grant Ingersoll-6 wrote:
> Do they show up if you use non-embedded?  That is, if you hit that
> slave over HTTP from your browser, are the changes showing up?

Yes.  Changing the config to access the server over HTTP works fine. [...]
I'm moving forward with the HTTP approach, but the embedded approach is
desirable for two (obvious) reasons: 1) performance improvement, 2) simpler
deployment.

Thanks.



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












Performance Hit for Zero Record Dataimport

2009-01-20 Thread wojtekpia

I have a transient SQL table that I use to load data into Solr using the
DataImportHandler. I run an update every 15 minutes
(dataimport?command=full-import&clean=false&optimize=false), but my table
will frequently have no new data for me to import. When the table contains
no data, it looks like Solr is doing a lot more work than it needs to. The
performance degradation is the same for loading zero records as it is for
loading a couple thousand records (while the system is under heavy load). 

I noticed that when no data is imported, no new index files are created, so
it seems like something (Lucene?) is aware of the empty update. But since
the performance degradation is the same, I'm guessing that a new Searcher is
still created, warmed, and registered. Is that correct? 
-- 
View this message in context: 
http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21572935.html
Sent from the Solr - User mailing list archive at Nabble.com.



New to Solr/Lucene design question

2009-01-20 Thread Yogesh Chawla - PD
Hello All,
We are using SOLR/Lucene as the search engine for an application
we are designing.  The application is a workflow application that can
receive different types of documents.

For example, we are currently working on getting booking documents but
will also accept arrest documents later this year.

We have defined a custom schema that incorporates some schemas designed
by federal consortiums.  From those schemas we pluck out values that we want 
SOLR/Lucene to index and search on and we go from our instance document to
a SOLR document.

The fields in our schema.xml look like this:

<fields>
   ...
   <field name="stash-content" type="text" indexed="true" stored="true"/>
</fields>

Above, there is a field called "stash-content".  The goal is to take any
searchable data from any document type and put it in this field.  For
example, we would store data like this in XML format:

<add>
  <doc>
    <field name="stash-content">arrestee_firstname_Yogesh</field>
    <field name="stash-content">arrestee_lastname_Chawla</field>
    <field name="stash-content">arrestee_middlename_myMiddleName</field>
  </doc>
</add>

The advantage of such an approach is that we can add new document types to
search on, and as long as they use the same semantics, such as
arrestee_firstname, we won't need to update any code.  It also makes
the code simple and generic for any document type.

We can search on first name like this for a starts-with query:
arrestee_firstname_Y*.  We had to use the _ instead of a space so that only
a single string would be searched, rather than each word separately when a
query was performed.  (Hope that makes sense.)

The cons could be a performance hit.  

The other approach is to add fields explicitly like this:

<add>
  <doc>
    <field name="arrestee_firstname">Yogesh</field>
    <field name="arrestee_lastname">Chawla</field>
    <field name="arrestee_middlename">myMiddleName</field>
  </doc>
</add>

This approach seems more traditional.  The pros are that it is
straightforward.  The cons are that every time we add a new document type to
search on, we have to update schema.xml and the Java code that creates SOLR
documents.

The number of documents that we will eventually want to search on is about 5 
million.  However, this will take a while
to ramp up to and we are more immediately looking at searching on about 100,000.

I am new to SOLR and just inherited this project with approach number 1.  Is 
this something that is going to bite us in the
future?

Thanks,
Yogesh


RE: New to Solr/Lucene design question

2009-01-20 Thread Feak, Todd
A third option - Use dynamic fields.

Add a dynamic field called "*_stash". This will allow new fields for
documents to be added down the road without changing schema.xml, yet
still allow you to query on fields like "arresteeFirstName_stash"
without extra overhead.
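
In schema.xml that would be something like (field type illustrative):

<dynamicField name="*_stash" type="text" indexed="true" stored="true"/>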

-Todd Feak

-Original Message-
From: Yogesh Chawla - PD [mailto:premiergenerat...@yahoo.com] 
Sent: Tuesday, January 20, 2009 2:30 PM
To: solr-user@lucene.apache.org
Subject: New to Solr/Lucene design question

Hello All,
We are using SOLR/Lucene as the search engine for an application
we are designing. [...]

I am new to SOLR and just inherited this project with approach number 1.
Is this something that is going to bite us in the future?

Thanks,
Yogesh



Re: New to Solr/Lucene design question

2009-01-20 Thread Yogesh Chawla - PD
Hi Todd,
I think I see what you are saying here.

In our schema.xml we can define it like this:

<fields>
   ...
   <dynamicField name="*_stash" type="text" indexed="true" stored="true"/>
</fields>

and then add data like this:

<add>
  <doc>
    <field name="arrestee_firstname_stash">Yogesh</field>
    <field name="arrestee_lastname_stash">Chawla</field>
    <field name="arrestee_middlename_stash">myMiddleName</field>
  </doc>
</add>


If we need to add other dynamic data types, we can do that at a later time
by adding a different type of dynamic field.

This way we are not querying a single field 'stash-content' but rather just
the fields we are interested in, and there is no need to change the Java
code or the schema.xml.

Are we on the same wave length here?

Thanks a lot for the suggestion,
Yogesh







- Original Message 
From: "Feak, Todd" 
To: solr-user@lucene.apache.org
Sent: Tuesday, January 20, 2009 4:49:56 PM
Subject: RE: New to Solr/Lucene design question

A third option - Use dynamic fields.

Add a dynamic field called "*_stash". This will allow new fields for
documents to be added down the road without changing schema.xml, yet
still allow you to query on fields like "arresteeFirstName_stash"
without extra overhead.

-Todd Feak

-Original Message-
From: Yogesh Chawla - PD [mailto:premiergenerat...@yahoo.com]
Sent: Tuesday, January 20, 2009 2:30 PM
To: solr-user@lucene.apache.org
Subject: New to Solr/Lucene design question

[...]


Re: Query Performance while updating the index

2009-01-20 Thread oleg_gnatovskiy

Hello again. It seems that we are still having these problems. Queries take
as long as 20 minutes to get back to their average response time after a
large index update, so it doesn't seem like the problem is the 12-second
autowarm time. Are there any more suggestions for things we can try? Taking
our servers out of the loop for as long as 20 minutes is a bit of a hassle,
and a risk.
-- 
View this message in context: 
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21573927.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: New to Solr/Lucene design question

2009-01-20 Thread Feak, Todd
Yes, that's what I was suggesting. :)

Might have to be careful with the extra underscore "_" characters. Not
sure if those will cause issues with dynamic fields.

-Todd Feak

-Original Message-
From: Yogesh Chawla - PD [mailto:premiergenerat...@yahoo.com] 
Sent: Tuesday, January 20, 2009 3:14 PM
To: solr-user@lucene.apache.org
Subject: Re: New to Solr/Lucene design question

Hi Todd,
I think I see what you are saying here. [...]

Are we on the same wave length here?

Thanks a lot for the suggestion,
Yogesh



Re: Using Threading while Indexing.

2009-01-20 Thread oleg_gnatovskiy

I can verify that multithreaded loading using HTTP does work. That's probably
the way to go.
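
A rough SolrJ sketch of that, assuming Solr 1.3's CommonsHttpSolrServer
(URL, field names, and document counts are all illustrative):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MultiThreadedLoader {
    public static void main(String[] args) throws Exception {
        // One HTTP server instance can be shared across threads.
        final SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        Thread[] workers = new Thread[3];
        for (int t = 0; t < workers.length; t++) {
            final int id = t;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 2000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", id + "-" + i);
                            doc.addField("text", "article body " + i);
                            server.add(doc); // queued/indexed server-side
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        server.commit(); // single commit once all threads finish
    }
}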



zayhen wrote:
> 
> Your 3 instances are trying to acquire the physical lock to the index.
> If you want to use multi-threaded indexing, I would suggest the HTTP
> interface, as Solr will control the request queue for you and index as
> many docs as it can receive from your open threads (resource wise,
> obviously).
> 
> 2009/1/19 Sagar Khetkade 
> 
>>
>> Hi,
>>
>> I was trying to index three sets of documents, each having 2000 articles,
>> using three threads of embedded Solr server. But while indexing, it gives
>> me the exception "org.apache.lucene.store.LockObtainFailedException: Lock
>> obtain timed out: SingleInstanceLock: write.lock".  I know this issue
>> persists with Lucene; is it the same with Solr?
>>
>> Thanks and Regards,
>> Sagar Khetkade.
>>
> 
> 
> 
> -- 
> Alexander Ramos Jardim
> 
> 
> -
> RPG da Ilha 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-Threading-while-Indexing.-tp21537667p21574047.html
Sent from the Solr - User mailing list archive at Nabble.com.



Newbie Design Questions

2009-01-20 Thread Gunaranjan Chandraraju

Hi All
We are considering SOLR for a large database of XMLs.  I have some  
newbie questions - if there is a place I can go read about them do let  
me know and I will go read up :)


1. Currently we are able to pull the XMLs from a file system using
FileDataSource.  The DIH is convenient since I can map my XML fields
using the XPathProcessor. This works for an initial load.  However,
after the initial load, we would like to 'post' changed XMLs to SOLR
whenever the XML is updated in a separate system.  I know we can post
XMLs with 'add', however I was not sure how to do this and maintain the
DIH mapping I use in data-config.xml.  I don't want to save the file
to disk and then call the DIH - I would prefer to directly post it.
Do I need to use solrj for this?


2.  If my Solr schema.xml changes, do I HAVE to reindex all the
old documents?  Suppose in future we have newer XML documents that
contain a new additional XML field.  The old documents that are
already indexed don't have this field and (so) I don't need to search
them with this field.  However, the new ones need to be searchable on
this new field.  Can I just add this new field to the SOLR schema,
restart the servers and just post the new documents, or do I need to
reindex everything?


3. Can I back up the index directory, so that in case of a disk crash
I can restore this directory and bring Solr up?  I realize that any
documents indexed after this backup would be lost - I can however keep
track of these outside and simply re-index documents 'newer' than that
backup date.  This question is really important to me in the context
of using a master server with a replicated index.  I would like to run
this backup for the 'master'.


4.  In general what happens when the solr application is bounced?  Is  
the index affected (anything maintained in memory)?


Regards
Guna


Query Matching all items in the catalog

2009-01-20 Thread Deo, Shantanu
Hi,
  Is there a query that will match and return all documents being
indexed by SOLR?

Thanks
Shantanu Deo
AT&T eCommerce Web Hosting - Release Management
Office: (425)288-6081
email: sd1...@att.com




RE: Query Matching all items in the catalog

2009-01-20 Thread Deo, Shantanu
My apologies - I found it using the following param q=*:*
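
For anyone searching the archives, the full request is just (host and port
illustrative):

http://localhost:8983/solr/select?q=*:*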


AT&T eCommerce Web Hosting - Release Management
Office: (425)288-6081
email: sd1...@att.com

-Original Message-
From: Deo, Shantanu 
Sent: Tuesday, January 20, 2009 4:05 PM
To: solr-user@lucene.apache.org
Subject: Query Matching all items in the catalog

Hi,
  Is there a query that will match and return all documents being
indexed by SOLR?

Thanks
Shantanu Deo
AT&T eCommerce Web Hosting - Release Management
Office: (425)288-6081
email: sd1...@att.com




Re: Newbie Design Questions

2009-01-20 Thread Grant Ingersoll


On Jan 20, 2009, at 6:45 PM, Gunaranjan Chandraraju wrote:


Hi All
We are considering SOLR for a large database of XMLs.  I have some  
newbie questions - if there is a place I can go read about them do  
let me know and I will go read up :)


1. Currently we are able to pull the XMLs from a file systems using  
FileDataSource.  The DIH is convenient since I can map my XML fields  
using the XPathProcessor. This works for an initial load.  However,
after the initial load, we would like to 'post' changed xmls to SOLR  
whenever the XML is updated in a separate system.  I know we can  
post xmls with 'add' however I was not sure how to do this and  
maintain the DIH mapping I use in data-config.xml?  I don't want to  
save the file to the disk and then call the DIH - would prefer to  
directly post it.  Do I need to use solrj for this?


You can likely use SolrJ, but then you probably need to parse the XML
an extra time.  You may also be able to use Solr Cell, which is the
Tika integration, such that you can send the XML straight to Solr and
have it deal with it.  See http://wiki.apache.org/solr/ExtractingRequestHandler
Solr Cell is a push technology, whereas DIH is a pull technology.


I don't know how compatible this would be w/ DIH.  Ideally, in the  
future, they will cooperate as much as possible, but we are not there  
yet.


As for your initial load, what if you ran a one time XSLT processor  
over all the files and transformed them to SolrXML and then just  
posted them the normal way?  Then, going forward, any new files could  
just be written out as SolrXML as well.
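
The normal posting format, for reference, is plain Solr XML along these
lines (field names illustrative):

<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title">some title text</field>
  </doc>
</add>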


If you can give some more info about your content, I think it would be  
helpful.





2.  If my solr schema.xml changes then do I HAVE to reindex all the
old documents?  Suppose in future we have newer XML documents that
contain a new additional xml field.  The old documents that are
already indexed don't have this field and (so) I don't need search
on them with this field.  However the new ones need to be
search-able on this new field.  Can I just add this new field to the
SOLR schema, restart the servers and just post the new documents, or
do I need to reindex everything?


Yes, you should be able to add new fields w/o problems.  Where you can  
run into problems is renaming, removing, etc.





3. Can I backup the index directory.  So that in case of a disk  
crash - I can restore this directory and bring solr up. I realize  
that any documents indexed after this backup would be lost - I can  
however keep track of these outside and simply re-index documents  
'newer' than that backup date.  This question is really important to  
me in the context of using a Master Server with replicated index.  I  
would like to run this backup for the 'Master'.


Yes, just use the master/slave replication approach for doing backups.




4.  In general what happens when the solr application is bounced?   
Is the index affected (anything maintained in memory)?


I would recommend doing a commit before bouncing and letting all  
indexing operations complete.  Worst case, assuming you are using Solr  
1.3 or later, is that you may lose what is in memory.


-Grant

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












Re: Performance Hit for Zero Record Dataimport

2009-01-20 Thread Shalin Shekhar Mangar
I guess Data Import Handler still calls commit even if there were no
documents created. We can add a short circuit in the code to make sure that
does not happen.

On Wed, Jan 21, 2009 at 3:49 AM, wojtekpia  wrote:

>
> I have a transient SQL table that I use to load data into Solr using the
> DataImportHandler. I run an update every 15 minutes
> (dataimport?command=full-import&clean=false&optimize=false), but my table
> will frequently have no new data for me to import. When the table contains
> no data, it looks like Solr is doing a lot more work than it needs to. The
> performance degradation is the same for loading zero records as it is for
> loading a couple thousand records (while the system is under heavy load).
>
> I noticed that when no data is imported, no new index files are created, so
> it seems like something (Lucene?) is aware of the empty update. But since
> the performance degradation is the same, I'm guessing that a new Searcher
> is
> still created, warmed, and registered. Is that correct?
> --
> View this message in context:
> http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21572935.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: Newbie Design Questions

2009-01-20 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
 wrote:
> Hi All
> We are considering SOLR for a large database of XMLs.  I have some newbie
> questions - if there is a place I can go read about them do let me know and
> I will go read up :)
>
> 1. Currently we are able to pull the XMLs from a file systems using
> FileDataSource.  The DIH is convenient since I can map my XML fields using
> the XPathProcessor. This works for an initial load.  However, after the
> initial load, we would like to 'post' changed xmls to SOLR whenever the XML
> is updated in a separate system.  I know we can post xmls with 'add' however
> I was not sure how to do this and maintain the DIH mapping I use in
> data-config.xml?  I don't want to save the file to the disk and then call
> the DIH - would prefer to directly post it.  Do I need to use solrj for
> this?

What is the source of your new data? Is it a DB?

>
> 2.  If my solr schema.xml changes then do I HAVE to reindex all the old
> documents?  Suppose in future we have newer XML documents that contain a new
> additional xml field.The old documents that are already indexed don't
> have this field and (so) I don't need search on them with this field.
>  However the new ones need to be search-able on this new field.Can I
> just add this new field to the SOLR schema, restart the servers just post
> the new new documents or do I need to reindex everything?
>
> 3. Can I backup the index directory.  So that in case of a disk crash - I
> can restore this directory and bring solr up. I realize that any documents
> indexed after this backup would be lost - I can however keep track of these
> outside and simply re-index documents 'newer' than that backup date.  This
> question is really important to me in the context of using a Master Server
> with replicated index.  I would like to run this backup for the 'Master'.
The snapshot script can be used to take backups on commit.
>
> 4.  In general what happens when the solr application is bounced?  Is the
> index affected (anything maintained in memory)?
>
> Regards
> Guna
>



-- 
--Noble Paul


Re: Query Performance while updating the index

2009-01-20 Thread Otis Gospodnetic
This is an old and long thread, and I no longer recall what the specific
suggestions were.
My guess is this has to do with the OS cache of your index files.  When you
make the large index update, that OS cache is useless (old files are gone,
new ones are in) and the OS cache has to get re-warmed, and this takes time.

Are you optimizing your index before the update?  Do you *really* need to do
that?
How large is your update, what makes it big, and could you make it smaller?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: oleg_gnatovskiy 
> To: solr-user@lucene.apache.org
> Sent: Tuesday, January 20, 2009 6:19:46 PM
> Subject: Re: Query Performance while updating the index
> 
> 
> Hello again. It seems that we are still having these problems. Queries take
> as long as 20 minutes to get back to their average response time after a
> large index update, so it doesn't seem like the problem is the 12-second
> autowarm time. Are there any more suggestions for things we can try? Taking
> our servers out of the loop for as long as 20 minutes is a bit of a hassle,
> and a risk.
> -- 
> View this message in context: 
> http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21573927.html
> Sent from the Solr - User mailing list archive at Nabble.com.