Re: correct escapes in csv-Update files

2008-01-04 Thread Michael Lackhoff
On 03.01.2008 17:16 Yonik Seeley wrote:

 CSV doesn't use backslash escaping.
 http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
 
 "This is text with a ""quoted"" string"

Thanks for the hint but the result is the same, that is, ""quoted""
behaves exactly like \"quoted\":
- both leave the single unescaped quote in the record: "quoted"
- both have the problem with a backslash before the escaped quote:
  "This is text with a \""quoted"" string" gives an error "invalid
  char between encapsulated token end delimiter".

So, is it possible to get a record into the index with csv that
originally looks like this?:
This is text with an unusual \"combination" of characters

A single quote is no problem: just double it (" - "").
A single backslash is no problem: just leave it alone (\ - \)
But what about a backslash followed by a quote (\" - ???)
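
For reference, a minimal sketch of how an RFC 4180-style writer (e.g. Python's
csv module) would encode such a value; whether Solr's CSV loader accepts this
exact form is the open question here:

import csv, io

# Illustrative only: quote the field and double the embedded quotes;
# the backslash is left untouched, since CSV has no backslash escaping.
value = 'This is text with an unusual \\"combination" of characters'
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(['12345', value])
print(buf.getvalue(), end='')
# -> "12345","This is text with an unusual \""combination"" of characters"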

-Michael



Query Syntax (Standard handler) Question

2008-01-04 Thread s d
Is there a simpler way to write this query (I'm using the standard handler)?
field1:t1 field1:t2 field1:"t1 t2" field2:t1 field2:t2 field2:"t1 t2"
Thanks,


Re: Query Syntax (Standard handler) Question

2008-01-04 Thread Erik Hatcher


On Jan 4, 2008, at 4:40 AM, s d wrote:
Is there a simpler way to write this query (I'm using the standard
handler)?
field1:t1 field1:t2 field1:"t1 t2" field2:t1 field2:t2 field2:"t1 t2"


Looks like you'd be better off using the DisMax handler for <t1 t2>
(without the brackets).
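
For example, a minimal sketch of such a dismax request (the qt=dismax handler
name, the field list, and the localhost URL are assumptions about the local
setup):

import urllib.parse, urllib.request

# Hypothetical dismax request: search "t1 t2" across field1 and field2 in one
# go, instead of spelling out every field/term combination in the query string.
params = urllib.parse.urlencode({
    'qt': 'dismax',           # assumes a dismax handler registered under this name
    'q': 't1 t2',
    'qf': 'field1 field2',    # fields to search; per-field boosts could go here
})
print(urllib.request.urlopen('http://localhost:8983/solr/select?' + params).read())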


Erik



Best practice for storing relational data in Solr

2008-01-04 Thread steve.lillywhite
Hi all,

 

This is a (possibly very naive) newbie question regarding Solr best practice...

 

I run a website that displays/stores data on job applicants, together with 
information on where they came from (e.g. which recruiter), which office they 
are applying to, etc. This data is stored in a mySQL database. I currently have 
a basic search facility, but I  plan to introduce Solr to improve this, by also 
storing applicant data in a Solr schema. 

 

My problem is that *related* applicant data can also be updated in the web GUI 
(e.g. if there was a typo, a recruiter could be changed from “My Rcruiter” to 
“My Recruiter”), and I don’t know how best to reflect this in the Solr schema.

Example:

We may have 2 applicants that came from recruiter “My Recruiter”. If the 
name of this recruiter is altered in the GUI then I would have to reindex both 
of those applicants in the Solr schema, which seems like overkill. The 
alternative would be if I didn’t store the recruiter name in the Solr schema, 
and instead only stored its mySQL database identifier. Then, I would need to 
parse any search results from Solr to put in the recruiter name before 
displaying the data in the GUI.

 

So I guess I’m asking which of these is the better approach:

 

1.   Use Solr to store the text value of related applicant data that exists 
in a relational mySQL database. Whenever that data is updated in the database 
reindex all dependent entries in the Solr schema. Advantage of this approach I 
guess is that search results can be returned from Solr and displayed as is (if 
XSLT is used). E.g. search result for “John Smith” of recruiter “My Recruiter” 
could be returned in the required HTML format from Solr, and displayed in the 
web GUI without any reformatting or further processing.

2.   Use Solr to store database Ids of related applicant data that exists 
in a relational mySQL database. When that data is updated in the database there 
is no need to reindex Solr. However, search results from Solr will need to be 
parsed before they can be output in the web GUI. E.g. if Solr returns “John 
Smith” of recruiter with database ID 143, then 143 will need to be mapped back 
to “My Recruiter” by my application before it can be displayed.
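
As a rough sketch of option 2 (the field names and the lookup source below are 
made up for illustration):

# Hypothetical post-processing for option 2: Solr stores only recruiter_id,
# and the application maps it back to the display name before rendering.
def resolve_recruiter_names(solr_docs, recruiter_lookup):
    for doc in solr_docs:
        doc['recruiter_name'] = recruiter_lookup.get(doc.get('recruiter_id'), 'Unknown')
    return solr_docs

# e.g. a hit for "John Smith" with recruiter id 143, resolved to "My Recruiter"
docs = [{'applicant': 'John Smith', 'recruiter_id': 143}]
print(resolve_recruiter_names(docs, {143: 'My Recruiter'}))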

 

Can anyone offer any guidance here?

 

Regards

 

Steve

 




Re: Another text I cannot get into SOLR with csv

2008-01-04 Thread Yonik Seeley
On Jan 4, 2008 10:25 AM, Michael Lackhoff [EMAIL PROTECTED] wrote:
 If the field's value is:
 's-Gravenhage
 I cannot get it into SOLR with CSV.

This one works for me fine.

$ cat t2.csv
id,name
12345,'s-Gravenhage
12345,'s-Gravenhage
12345,s-Gravenhage

$ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
@t2.csv -H 'Content-type:text/csv; charset=utf-8'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">78</int></lst>
</response>

-Yonik


Re: Duplicated Keyword

2008-01-04 Thread Robert Young
I don't quite understand what you're getting at. What is the problem
you're encountering or what are you trying to achieve?

Cheers
Rob

On Jan 4, 2008 3:26 PM, Jae Joo [EMAIL PROTECTED] wrote:
 Hi,

 Is there any way to dedup the keyword cross the document?

 Ex.

 china keyword is in doc1 and doc2. Will Solr index have only 1 china
 keyword for both document?

 Thanks,

 Jae Joo



How the star operator works

2008-01-04 Thread Leonardo Santagada
From both the lucene and solr docs, the star (*) operator used after a
word should find the word plus 0 or more characters after it.


I have some documents in a solr index (in fields of both type text and
string) and neither works like that. For example I have a document called
"Test Document"; if I search for Test* it doesn't find this document,
only if I search for Tes*. To me it appears that it works more like a +
in a regex than like a * in a regex or wildcard search.


What am I doing wrong?

[]'s
--
Leonardo Santagada





Re: Another text I cannot get into SOLR with csv

2008-01-04 Thread Ryan McKinley

Michael Lackhoff wrote:

If the field's value is:
's-Gravenhage
I cannot get it into SOLR with CSV.
I tried to double the single quote/apostrophe or escape it in several
ways but I either get an error or another character (the escape) in
front of the single quote. Is it not possible to have a field that
begins with an apostrophe/a single quote?
There is no error if the apostrophe is at the end of the field.
Is there anything I could try or do I have to use XML?



can you open your .csv file in Excel or equivalent?  If so, that should 
handle all escaping issues for you...


ryan


Re: correct escapes in csv-Update files

2008-01-04 Thread Walter Underwood
I recommend the opencsv library for Java or the csv package for Python.
Either one can write legal CSV files.

There are lots of corner cases in CSV and some differences between
applications, like whether newlines are allowed inside a quoted field.
It is best to use a library for this instead of hacking at it.

We should use opencsv in Solr, too: http://opencsv.sourceforge.net/
It is under the Apache 2.0 license.

If you really want to write it yourself, here is a Python routine
that I used before finding the csv package:

def csvsafe(s):
    if not s: return ''

    # normalize all whitespace to single spaces
    s = ' '.join(s.split())
    s = s.strip()
    if not s: return ''

    # quote the quotes
    s = s.replace('"', '""')

    return '"' + s + '"'
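
A quick check of the routine, using the example value from this thread
(output shown as a comment):

print(csvsafe('This is text with a "quoted" string'))
# -> "This is text with a ""quoted"" string"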

wunder

On 1/4/08 1:08 AM, Michael Lackhoff [EMAIL PROTECTED] wrote:

 On 03.01.2008 17:16 Yonik Seeley wrote:
 
 CSV doesn't use backslash escaping.
 http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
 
 "This is text with a ""quoted"" string"
 
 Thanks for the hint but the result is the same, that is, ""quoted""
 behaves exactly like \"quoted\":
 - both leave the single unescaped quote in the record: "quoted"
 - both have the problem with a backslash before the escaped quote:
   "This is text with a \""quoted"" string" gives an error "invalid
   char between encapsulated token end delimiter".
 
 So, is it possible to get a record into the index with csv that
 originally looks like this?:
 This is text with an unusual \"combination" of characters
 
 A single quote is no problem: just double it (" - "").
 A single backslash is no problem: just leave it alone (\ - \)
 But what about a backslash followed by a quote (\" - ???)
 
 -Michael
 



Re: correct escapes in csv-Update files

2008-01-04 Thread Yonik Seeley
On Jan 4, 2008 4:08 AM, Michael Lackhoff [EMAIL PROTECTED] wrote:
 Thanks for the hint but the result is the same, that is, ""quoted""
 behaves exactly like \"quoted\":
 - both leave the single unescaped quote in the record: "quoted"
 - both have the problem with a backslash before the escaped quote:
   "This is text with a \""quoted"" string" gives an error "invalid
   char between encapsulated token end delimiter".

Hmmm, this looks like it's probably a CSV bug.
Could you please file a bug report with that component?
http://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=12310491&component=12311182&sorter/field=issuekey&sorter/order=DESC

-Yonik


Re: Another text I cannot get into SOLR with csv

2008-01-04 Thread Michael Lackhoff
On 04.01.2008 16:55 Yonik Seeley wrote:

 On Jan 4, 2008 10:25 AM, Michael Lackhoff [EMAIL PROTECTED] wrote:
 If the field's value is:
 's-Gravenhage
 I cannot get it into SOLR with CSV.
 
 This one works for me fine.
 
 $ cat t2.csv
 id,name
 12345,'s-Gravenhage
 12345,'s-Gravenhage
 12345,s-Gravenhage
 
 $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
 @t2.csv -H 'Content-type:text/csv; charset=utf-8'

But you are cheating ;-) This works for me too but I am using a local
csv file for the update:
http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

Perhaps the problem is that I cannot define a charset for the stream.file?

-Michael



Re: Duplicated Keyword

2008-01-04 Thread Jae Joo
title of Document 1 - "This is document 1 regarding china" - fieldtype =
text
title of Document 2 - "This is document 2 regarding china" - fieldtype=text

Once it is indexed, will the index hold 2 "china" text fields or just 1 "china"
word which points to document1 and document2?

Jae

On Jan 4, 2008 10:54 AM, Robert Young [EMAIL PROTECTED] wrote:

 I don't quite understand what you're getting at. What is the problem
 you're encountering or what are you trying to achieve?

 Cheers
 Rob

 On Jan 4, 2008 3:26 PM, Jae Joo [EMAIL PROTECTED] wrote:
  Hi,
 
  Is there any way to dedup the keyword cross the document?
 
  Ex.
 
  china keyword is in doc1 and doc2. Will Solr index have only 1 china
  keyword for both document?
 
  Thanks,
 
  Jae Joo
 



Re: Duplicated Keyword

2008-01-04 Thread Robert Young
You can think of it as the latter but it's quite a bit more
complicated than that. For details on how Lucene stores its index,
check out the file formats page on the Lucene site.
http://lucene.apache.org/java/docs/fileformats.html
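
Conceptually it is closer to the latter; a toy sketch (not Lucene's actual
data structures) of an inverted index over the two example titles:

# Toy inverted index: each term is stored once, pointing at the documents
# that contain it. Lucene's real format also stores frequencies, positions, etc.
docs = {
    'doc1': 'This is document 1 regarding china',
    'doc2': 'This is document 2 regarding china',
}

index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

print(sorted(index['china']))   # -> ['doc1', 'doc2']: one term entry, two postings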

Cheers
Rob


On Jan 4, 2008 4:59 PM, Jae Joo [EMAIL PROTECTED] wrote:
 title of Document 1 - "This is document 1 regarding china" - fieldtype =
 text
 title of Document 2 - "This is document 2 regarding china" - fieldtype=text

 Once it is indexed, will the index hold 2 "china" text fields or just 1 "china"
 word which points to document1 and document2?

 Jae


 On Jan 4, 2008 10:54 AM, Robert Young [EMAIL PROTECTED] wrote:

  I don't quite understand what you're getting at. What is the problem
  you're encountering or what are you trying to achieve?
 
  Cheers
  Rob
 
  On Jan 4, 2008 3:26 PM, Jae Joo [EMAIL PROTECTED] wrote:
   Hi,
  
   Is there any way to dedup the keyword cross the document?
  
   Ex.
  
   china keyword is in doc1 and doc2. Will Solr index have only 1 china
   keyword for both document?
  
   Thanks,
  
   Jae Joo
  
 



Re: Backup of a Solr index

2008-01-04 Thread Jörg Kiegeland



A postCommit hook (configured in solrconfig.xml) is called in a safe
place for every commit.
You could have a program as a hook that normally did nothing unless
you had previously signaled to make a copy of the index.
  
Then I will give the postCommit trigger a try and hope that while the 
trigger is executed, the files in data/index are in a consistent state 
so that I can copy them.


Do you know how this signal can best be communicated to the trigger? 
I use Solrj and would call server.commit(), but unfortunately one cannot 
pass a commit message which could be used as a signal for the trigger.
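
One shape such a hook could take, as a sketch only: the marker-file convention 
and the paths below are made up, and it assumes the script is wired up as a 
postCommit listener (e.g. via solr.RunExecutableListener) in solrconfig.xml, 
with the Solrj client creating the marker file just before calling 
server.commit().

import os, shutil, time

# Hypothetical postCommit hook: do nothing unless a marker file exists,
# which the indexing client touches just before issuing the commit.
MARKER = '/var/solr/do-backup'        # assumed signal file created by the client
INDEX_DIR = '/var/solr/data/index'    # assumed location of the index
BACKUP_ROOT = '/var/solr/backups'     # assumed to already exist

if os.path.exists(MARKER):
    os.remove(MARKER)
    dest = os.path.join(BACKUP_ROOT, time.strftime('index-%Y%m%d-%H%M%S'))
    # relies on the index being in a consistent state while the hook runs
    shutil.copytree(INDEX_DIR, dest)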


Re: Another text I cannot get into SOLR with csv

2008-01-04 Thread Yonik Seeley
On Jan 4, 2008 11:18 AM, Michael Lackhoff [EMAIL PROTECTED] wrote:
 On 04.01.2008 16:55 Yonik Seeley wrote:

  On Jan 4, 2008 10:25 AM, Michael Lackhoff [EMAIL PROTECTED] wrote:
  If the field's value is:
  's-Gravenhage
  I cannot get it into SOLR with CSV.
 
  This one works for me fine.
 
  $ cat t2.csv
  id,name
  12345,'s-Gravenhage
  12345,'s-Gravenhage
  12345,s-Gravenhage
 
  $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
  @t2.csv -H 'Content-type:text/csv; charset=utf-8'

 But you are cheating ;-) This works for me too but I am using a local
 csv file for the update:
 http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

That works for me too if I remove the separator=%09 (since the file
uses comma as a separator and not tab)

-Yonik


Re: SolrJ Javadoc?

2008-01-04 Thread Ryan McKinley

run:
 ant javadoc-solrj
and that will build them...

Yes, they should be built into the nightly distribution...


Matthew Runo wrote:

Hello!

I've seen some SVN commits and heard some rumblings of SolrJ javadoc - 
but can't seem to find any. Is there any yet? I know that SolrJ is still 
pretty young =p


Thanks!

Matthew Runo
Software Developer
Zappos.com
702.943.7833






Re: solr with hadoop

2008-01-04 Thread Mike Klaas

On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:

I have a huge index (about 110 million documents, 100 fields
each). But the size of the index is reasonable, it's about 70 GB.
All I need is to increase performance, since some queries, which match
a big number of documents, are running slow.
So I was wondering: are there any benefits to using Hadoop for this? And if
so, what direction should I go? Has anybody done something to
integrate Solr with Hadoop? Does it give any performance boost?


Hadoop might be useful for organizing your data en route to Solr, but  
I don't see how it could be used to boost performance over a huge  
Solr index.  To accomplish that, you need to split it up over two  
machines (for which you might find hadoop useful).


-Mike


Re: solr with hadoop

2008-01-04 Thread Ryan McKinley

Mike Klaas wrote:

On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:

I have a huge index (about 110 million documents, 100 fields 
each). But the size of the index is reasonable, it's about 70 GB. All 
I need is to increase performance, since some queries, which match a big 
number of documents, are running slow.
So I was wondering: are there any benefits to using Hadoop for this? And if so, 
what direction should I go? Has anybody done something to integrate 
Solr with Hadoop? Does it give any performance boost?


Hadoop might be useful for organizing your data en route to Solr, but I 
don't see how it could be used to boost performance over a huge Solr 
index.  To accomplish that, you need to split it up over two machines 
(for which you might find hadoop useful).




you may want to check out:
https://issues.apache.org/jira/browse/SOLR-303

ryan


Re: Query Syntax (Standard handler) Question

2008-01-04 Thread Mike Klaas


On 4-Jan-08, at 1:12 PM, s d wrote:

but i want to sum the scores and not use max, can i still do it
with the DisMax? am i missing anything?


If you set tie=1.0, dismax functions like dissum.

-Mike


parsedquery_ToString

2008-01-04 Thread anuvenk

Is the parsedquery_toString the one passed to Solr after all the tokenizing
and analyzing of the query?
For the search term 'chapter 7' I have this parsedquery_toString:
<str name="parsedquery_toString">
+(text:"(bankruptci chap 7) (7 chapter chap) 7 bankruptci"^0.8 |
((name:bankruptci name:chap)^2.0))~0.01 (text:"(bankruptci chap 7) (7
chapter chap) 7 bankruptci"~50^0.8 | ((name:bankruptci name:chap)^2.0))~0.01
</str>

I have these synonyms
chap 7 => bankruptcy
chapter => bankruptcy
chap => chapter
chapter 7 => bankruptcy
bankrupcy => bankruptcy
chap,7,chap7,chapter 7,chapter 7 bankruptcy,chap 7

But I seem to have a little bit of trouble understanding how it's building this
parsedquery_toString.

Can someone explain? If I can understand this, I'll be able to debug better
and analyze why I don't get the expected results for some of the search terms
and what changes I could make to the associated synonyms.



Re: Query Syntax (Standard handler) Question

2008-01-04 Thread Mike Klaas
It is the fraction of the score of the non-max terms that gets added to the
score.  Hence, 1.0 = sum everything.
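
A small worked example (the per-field scores are made up):

# dismax score for one document: max field score plus tie * (sum of the others)
def dismax_score(field_scores, tie):
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

scores = [3.0, 2.0, 0.5]
print(dismax_score(scores, 0.0))    # 3.0   -> pure max
print(dismax_score(scores, 0.01))   # 3.025 -> max plus a small tiebreak
print(dismax_score(scores, 1.0))    # 5.5   -> plain sum ("dissum")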


-Mike

On 4-Jan-08, at 3:28 PM, anuvenk wrote:



Could you elaborate on what the tie param does? I did read the
definition in the solr wiki but still not crystal clear.

Mike Klaas wrote:



On 4-Jan-08, at 1:12 PM, s d wrote:


but i want to sum the scores and not use max, can i still do it
with the DisMax? am i missing anything?


If you set tie=1.0, dismax functions like dissum.

-Mike









spellcheckhandler

2008-01-04 Thread anuvenk

Is it possible to implement something like this with the spellcheck handler,
like Google does?

Say I search for 'chater 13 bakrupcy'; it should be able to display:

did you search for 'chapter 13 bankruptcy'?

Has someone been able to do this?



solr results debugging

2008-01-04 Thread anuvenk

I've been using the solr admin form with debug=true to do some in-depth
analysis on some results. Could someone explain how to make sense of
this? This is the debugging info for the first result I got:


10.201284 = (MATCH) sum of:
  6.2467875 = (MATCH) max plus 0.01 times others of:
    6.236769 = (MATCH) weight(text:"(probat trust live inherit) testament"^0.8 in 48784), product of:
      0.7070911 = queryWeight(text:"(probat trust live inherit) testament"^0.8), product of:
        0.8 = boost
        18.032305 = idf(text:"(probat trust live inherit) testament"^0.8)
        0.049015578 = queryNorm
      8.820319 = (MATCH) fieldWeight(text:"(probat trust live inherit) testament"^0.8 in 48784), product of:
        2.236068 = tf(phraseFreq=5.0)
        18.032305 = idf(text:"(probat trust live inherit)
.

and it continues some more..

search query: will

synonyms that I have: will, living will, last will and testament, living
trust, inheritance, probate

here is my request handler:

(portion of it)

<str name="echoParams">explicit</str>
 <float name="tie">0.01</float>
 <str name="qf">text^0.8 name^2.0</str>
 <!-- until 3 all should match; 4 -> 3 shld match; 5 -> 4 shld match; 6 -> 5
shld match; above 6 -> 90% match -->
 <str name="mm">3&lt;-1 4&lt;-1 5&lt;-1 6&lt;90%</str>
 <str name="pf">
  text^0.8 name^2.0
 </str>
 <int name="ps">50</int>



solr word delimiter

2008-01-04 Thread anuvenk

I have the word delimiter filter factory in the text field definition both at
index and query time.
But it does have some negative effects on some search terms like h1-b visa:
it splits this into three tokens h, 1, b. Now if I understand right, does
Solr look for matches for 'h' separately, '1' separately and 'b' separately
because they are three different tokens? This is giving some undesired
results: docs that have 'h' somewhere, '1' somewhere and 'b' somewhere. How
do I solve this problem?
I tried adding a synonym like h1-b => h1b visa.
It does filter some results, but I'm trying to find a global solution rather
than adding synonyms for all kinds of immigration forms like i-94, k-1, etc.
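
As a rough approximation of the splitting described above (this is not Solr's
WordDelimiterFilterFactory, just the letter/digit and punctuation boundaries it
splits on before any catenate options are applied):

import re

def rough_word_delimiter(token):
    # split on non-alphanumerics, then on letter<->digit transitions
    parts = re.split(r'[^A-Za-z0-9]+', token)
    out = []
    for part in parts:
        out.extend(re.findall(r'[A-Za-z]+|[0-9]+', part))
    return [p for p in out if p]

print(rough_word_delimiter('h1-b'))   # -> ['h', '1', 'b']
print(rough_word_delimiter('i-94'))   # -> ['i', '94']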