Re: correct escapes in csv-Update files
On 03.01.2008 17:16 Yonik Seeley wrote: CSV doesn't use backslash escaping. http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm "This is text with a ""quoted"" string"

Thanks for the hint, but the result is the same, that is, ""quoted"" behaves exactly like \"quoted\":
- both leave the single unescaped quote in the record: "quoted"
- both have the problem with a backslash before the escaped quote: "This is text with a \""quoted"" string" gives an error "invalid char between encapsulated token end delimiter".

So, is it possible to get a record into the index with CSV that originally looks like this?

This is text with an unusual \"combination of characters

A single quote is no problem: just double it (" - "").
A single backslash is no problem: just leave it alone (\ - \).
But what about a backslash followed by a quote (\" - ???)

-Michael
Query Syntax (Standard handler) Question
Is there a simpler way to write this query (I'm using the standard handler)?

field1:t1 field1:t2 field1:"t1 t2" field2:t1 field2:t2 field2:"t1 t2"

Thanks,
Re: Query Syntax (Standard handler) Question
On Jan 4, 2008, at 4:40 AM, s d wrote: Is there a simpler way to write this query (I'm using the standard handler)? field1:t1 field1:t2 field1:"t1 t2" field2:t1 field2:t2 field2:"t1 t2"

Looks like you'd be better off using the DisMax handler for t1 t2 (without the brackets).

Erik
Best practice for storing relational data in Solr
Hi all,

This is a (possibly very naive) newbie question regarding Solr best practice... I run a website that displays/stores data on job applicants, together with information on where they came from (e.g. which recruiter), which office they are applying to, etc. This data is stored in a MySQL database. I currently have a basic search facility, but I plan to introduce Solr to improve this, by also storing applicant data in a Solr schema.

My problem is that *related* applicant data can also be updated in the web GUI (e.g. if there was a typo, a recruiter could be changed from "My Rcruiter" to "My Recruiter"), and I don't know how best to reflect this in the Solr schema.

Example: We may have 2 applicants that came from recruiter "My Recruiter". If the name of this recruiter is altered in the GUI then I would have to reindex both of those applicants in the Solr schema, which seems like overkill. The alternative would be not to store the recruiter name in the Solr schema, and instead only store its MySQL database identifier. Then, I would need to parse any search results from Solr to put in the recruiter name before displaying the data in the GUI.

So I guess I'm asking which of these is the better approach:

1. Use Solr to store the text value of related applicant data that exists in a relational MySQL database. Whenever that data is updated in the database, reindex all dependent entries in the Solr schema. The advantage of this approach, I guess, is that search results can be returned from Solr and displayed as is (if XSLT is used). E.g. a search result for "John Smith" of recruiter "My Recruiter" could be returned in the required HTML format from Solr, and displayed in the web GUI without any reformatting or further processing.

2. Use Solr to store database IDs of related applicant data that exists in a relational MySQL database. When that data is updated in the database there is no need to reindex Solr.
However, search results from Solr will need to be parsed before they can be output in the web GUI. E.g. if Solr returns "John Smith" of recruiter with database ID 143, then 143 will need to be mapped back to "My Recruiter" by my application before it can be displayed.

Can anyone offer any guidance here?

Regards,
Steve
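To make option 2 concrete, here is a rough sketch of that post-processing step (all names, field names, and the in-memory lookup table here are hypothetical; in practice the lookup would be a query against MySQL):

```python
# Hypothetical lookup table; in practice this would be a SELECT against MySQL.
recruiter_names = {143: "My Recruiter", 144: "Other Recruiter"}

# Hypothetical Solr response docs, each storing only the recruiter's DB id.
solr_docs = [
    {"applicant": "John Smith", "recruiter_id": 143},
    {"applicant": "Jane Doe", "recruiter_id": 144},
]

def resolve_names(docs, names):
    """Replace each recruiter_id with the current name from the database."""
    out = []
    for doc in docs:
        enriched = dict(doc)
        enriched["recruiter"] = names[enriched.pop("recruiter_id")]
        out.append(enriched)
    return out

for doc in resolve_names(solr_docs, recruiter_names):
    print(doc["applicant"], "-", doc["recruiter"])
```

The trade-off is exactly as described: a rename in MySQL needs no reindex, at the cost of this extra mapping pass on every result page.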
Re: Another text I cannot get into SOLR with csv
On Jan 4, 2008 10:25 AM, Michael Lackhoff [EMAIL PROTECTED] wrote: If the field's value is: 's-Gravenhage I cannot get it into SOLR with CSV.

This one works for me fine.

$ cat t2.csv
id,name
12345,'s-Gravenhage
12345,'s-Gravenhage
12345,s-Gravenhage

$ curl http://localhost:8983/solr/update/csv?commit=true --data-binary @t2.csv -H 'Content-type:text/csv; charset=utf-8'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">78</int></lst>
</response>

-Yonik
Re: Duplicated Keyword
I don't quite understand what you're getting at. What is the problem you're encountering, or what are you trying to achieve?

Cheers
Rob

On Jan 4, 2008 3:26 PM, Jae Joo [EMAIL PROTECTED] wrote: Hi, Is there any way to dedup the keyword across the documents? Ex. the "china" keyword is in doc1 and doc2. Will the Solr index have only 1 "china" keyword for both documents? Thanks, Jae Joo
How the star operator works
From both the Lucene and Solr docs, the star (*) operator used after a word should find the word plus 0 or more characters after it. I have some documents in a Solr index (both in type text and string) and neither works like that. For example, I have a document called "Test Document": if I search for Test* it doesn't find this document, only if I search for Tes*. To me it appears that it works more like a + in a regex than like a * in a regex or wildcard search. What am I doing wrong?

[]'s
--
Leonardo Santagada
Re: Another text I cannot get into SOLR with csv
Michael Lackhoff wrote: If the field's value is: 's-Gravenhage I cannot get it into SOLR with CSV. I tried to double the single quote/apostrophe or escape it in several ways, but I either get an error or another character (the escape) in front of the single quote. Is it not possible to have a field that begins with an apostrophe/a single quote? There is no error if the apostrophe is at the end of the field. Is there anything I could try, or do I have to use XML?

Can you open your .csv file in Excel or equivalent? If so, that should handle all escaping issues for you...

ryan
Re: correct escapes in csv-Update files
I recommend the opencsv library for Java or the csv package for Python. Either one can write legal CSV files. There are lots of corner cases in CSV and some differences between applications, like whether newlines are allowed inside a quoted field. It is best to use a library for this instead of hacking at it.

We should use opencsv in Solr, too: http://opencsv.sourceforge.net/ It is under the Apache 2.0 license.

If you really want to write it yourself, here is a Python routine that I used before finding the csv package:

    def csvsafe(s):
        if not s: return ''
        # normalize all whitespace to single spaces
        s = ' '.join(s.split())
        s = s.strip()
        if not s: return ''
        # quote the quotes
        s = s.replace('"', '""')
        return '"' + s + '"'

wunder

On 1/4/08 1:08 AM, Michael Lackhoff [EMAIL PROTECTED] wrote: On 03.01.2008 17:16 Yonik Seeley wrote: CSV doesn't use backslash escaping. http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm "This is text with a ""quoted"" string" Thanks for the hint, but the result is the same, that is, ""quoted"" behaves exactly like \"quoted\": both leave the single unescaped quote in the record ("quoted"), and both have the problem with a backslash before the escaped quote: "This is text with a \""quoted"" string" gives an error "invalid char between encapsulated token end delimiter". So, is it possible to get a record into the index with CSV that originally looks like this? This is text with an unusual \"combination of characters. A single quote is no problem: just double it (" - ""). A single backslash is no problem: just leave it alone (\ - \). But what about a backslash followed by a quote (\" - ???) -Michael
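As an aside for anyone following the thread: the Python csv package recommended above handles exactly the awkward cases discussed here (doubled quotes, and a backslash directly before a quote) automatically. A small round-trip sketch (modern Python shown, not the 2008-era API; the field values are the examples from this thread):

```python
import csv
import io

# Fields containing the problem characters from this thread:
# a quoted word, and a backslash directly before a quote.
rows = [
    ["12345", 'This is text with a "quoted" string'],
    ["12346", 'This is text with an unusual \\"combination of characters'],
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["id", "text"])
for row in rows:
    writer.writerow(row)

data = buf.getvalue()
print(data)

# Round-trip: the reader recovers the original field values exactly,
# including the backslash-before-quote case.
parsed = list(csv.reader(io.StringIO(data)))
assert parsed[1][1] == rows[0][1]
assert parsed[2][1] == rows[1][1]
```

The writer quotes the whole field and doubles embedded quotes, leaving backslashes alone, which is the behaviour Michael was trying to get by hand.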
Re: correct escapes in csv-Update files
On Jan 4, 2008 4:08 AM, Michael Lackhoff [EMAIL PROTECTED] wrote: Thanks for the hint, but the result is the same, that is, ""quoted"" behaves exactly like \"quoted\": both leave the single unescaped quote in the record ("quoted"), and both have the problem with a backslash before the escaped quote: "This is text with a \""quoted"" string" gives an error "invalid char between encapsulated token end delimiter".

Hmmm, this looks like it's probably a CSV bug. Could you please file a bug report with that component?

http://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=12310491&component=12311182&sorter/field=issuekey&sorter/order=DESC

-Yonik
Re: Another text I cannot get into SOLR with csv
On 04.01.2008 16:55 Yonik Seeley wrote: This one works for me fine. $ cat t2.csv ... $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary @t2.csv -H 'Content-type:text/csv; charset=utf-8'

But you are cheating ;-) This works for me too, but I am using a local csv file for the update:

http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

Perhaps the problem is that I cannot define a charset for the stream.file?

-Michael
Re: Duplicated Keyword
title of Document 1 - "This is document 1 regarding china" - fieldtype = text
title of Document 2 - "This is document 2 regarding china" - fieldtype = text

Once it is indexed, will the index hold 2 "china" text entries or just 1 "china" word pointing to document1 and document2?

Jae

On Jan 4, 2008 10:54 AM, Robert Young [EMAIL PROTECTED] wrote: I don't quite understand what you're getting at. What is the problem you're encountering or what are you trying to achieve? Cheers Rob

On Jan 4, 2008 3:26 PM, Jae Joo [EMAIL PROTECTED] wrote: Hi, Is there any way to dedup the keyword across the documents? Ex. the "china" keyword is in doc1 and doc2. Will the Solr index have only 1 "china" keyword for both documents? Thanks, Jae Joo
Re: Duplicated Keyword
You can think of it as the latter, but it's quite a bit more complicated than that. For details on how Lucene stores its index, check out the file formats page on lucene: http://lucene.apache.org/java/docs/fileformats.html

Cheers
Rob

On Jan 4, 2008 4:59 PM, Jae Joo [EMAIL PROTECTED] wrote: Once it is indexed, will the index hold 2 "china" text entries or just 1 "china" word pointing to document1 and document2? Jae
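To illustrate Rob's point in miniature: an inverted index stores each distinct term once, with a postings list of the documents that contain it. A toy sketch (this is only the concept, not Lucene's actual file format):

```python
from collections import defaultdict

docs = {
    "doc1": "This is document 1 regarding china",
    "doc2": "This is document 2 regarding china",
}

# Build a toy inverted index: term -> set of doc ids (the postings list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# "china" exists once as a term; its postings list names both documents.
print(sorted(index["china"]))  # ['doc1', 'doc2']
```

The real index additionally stores positions, frequencies, and norms per posting, which is where the "quite a bit more complicated" comes in.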
Re: Backup of a Solr index
A postCommit hook (configured in solrconfig.xml) is called in a safe place for every commit. You could have a program as a hook that normally did nothing unless you had previously signaled to make a copy of the index.

Then I will give the postCommit trigger a try and hope that while the trigger is executed, the files in data/index are in a consistent state so that I can copy them. What is the best way to communicate this signal to the trigger? I use SolrJ and would call server.commit(), and unfortunately one cannot pass a commit message which could be used as a signal for the trigger.
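For reference, a postCommit hook of the kind described is wired up in solrconfig.xml with a listener. A minimal sketch using the stock RunExecutableListener (the script name is a placeholder; the "do nothing unless previously signaled" logic, e.g. checking for a marker file, would live inside that script, which also sidesteps the lack of a commit message in SolrJ):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Runs after every commit, while the files in data/index are consistent. -->
  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">backup-if-signaled.sh</str>  <!-- hypothetical script -->
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
  </listener>
</updateHandler>
```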
Re: Another text I cannot get into SOLR with csv
On Jan 4, 2008 11:18 AM, Michael Lackhoff [EMAIL PROTECTED] wrote: But you are cheating ;-) This works for me too, but I am using a local csv file for the update: http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

That works for me too if I remove the separator=%09 (since the file uses comma as a separator and not tab).

-Yonik
Re: SolrJ Javadoc?
run: ant javadoc-solrj and that will build them... Yes, they should be built into the nightly distribution... Matthew Runo wrote: Hello! I've seen some SVN commits and heard some rumblings of SolrJ javadoc - but can't seem to find any. Is there any yet? I know that SolrJ is still pretty young =p Thanks! Matthew Runo Software Developer Zappos.com 702.943.7833
Re: solr with hadoop
On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote: I have a huge index base (about 110 million documents, 100 fields each). But the size of the index base is reasonable, about 70 Gb. All I need is to increase performance, since some queries, which match a big number of documents, are running slow. So I was wondering, are there any benefits to using Hadoop for this? And if so, what direction should I go? Has anybody done anything to integrate Solr with Hadoop? Does it give any performance boost?

Hadoop might be useful for organizing your data en route to Solr, but I don't see how it could be used to boost performance over a huge Solr index. To accomplish that, you need to split it up over two machines (for which you might find hadoop useful).

-Mike
Re: solr with hadoop
Mike Klaas wrote: Hadoop might be useful for organizing your data en route to Solr, but I don't see how it could be used to boost performance over a huge Solr index. To accomplish that, you need to split it up over two machines (for which you might find hadoop useful).

You may want to check out: https://issues.apache.org/jira/browse/SOLR-303

ryan
Re: Query Syntax (Standard handler) Question
On 4-Jan-08, at 1:12 PM, s d wrote: but i want to sum the scores and not use max, can i still do it with the DisMax? am i missing anything ? If you set tie=1.0, dismax functions like dissum. -Mike
parsedquery_ToString
Is the parsedquery_toString the one passed to Solr after all the tokenizing and analyzing of the query? For the search term 'chapter 7' I have this parsedquery_toString:

<str name="parsedquery_toString">+(text:(bankruptci chap 7) (7 chapter chap) 7 bankruptci^0.8 | ((name:bankruptci name:chap)^2.0))~0.01 (text:(bankruptci chap 7) (7 chapter chap) 7 bankruptci~50^0.8 | ((name:bankruptci name:chap)^2.0))~0.01</str>

I have these synonyms:

chap 7 => bankruptcy
chapter => bankruptcy
chap => chapter
chapter 7 => bankruptcy
bankrupcy => bankruptcy
chap,7,chap7,chapter 7,chapter 7 bankruptcy,chap 7

But I seem to have a little bit of trouble understanding how it's building this parsedquery_toString. Can someone explain? If I can understand this, I'll be able to debug better and analyze why I don't get expected results for some of the search terms and what change I could make to the associated synonyms.

--
View this message in context: http://www.nabble.com/parsedquery_ToString-tp14627131p14627131.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query Syntax (Standard handler) Question
It is the fraction of the score of the non-max terms that gets added to the total. Hence, tie=1.0 = sum everything.

-Mike

On 4-Jan-08, at 3:28 PM, anuvenk wrote: Could you elaborate on what the tie param does? I did read the definition in the solr wiki but it's still not crystal clear.

Mike Klaas wrote: On 4-Jan-08, at 1:12 PM, s d wrote: but i want to sum the scores and not use max, can i still do it with the DisMax? am i missing anything? If you set tie=1.0, dismax functions like dissum. -Mike
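Put as a formula, for each clause dismax scores a document as max(per-field scores) + tie * sum(remaining per-field scores). A quick sketch of the arithmetic (the field scores here are made-up numbers):

```python
def dismax_score(field_scores, tie):
    """Score of one query clause across fields under dismax's tie parameter."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others

scores = [2.0, 0.5, 0.25]
print(dismax_score(scores, 0.0))   # pure max: 2.0
print(dismax_score(scores, 1.0))   # pure sum ("dissum"): 2.75
print(dismax_score(scores, 0.01))  # mostly max, slight tiebreak: 2.0075
```

So tie=0.0 is a pure disjunction-max, tie=1.0 sums everything, and small values like 0.01 just break ties between documents whose best field scores are equal.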
spellcheckhandler
Is it possible to implement something like this with the spellcheck handler? Like Google does: say I search for 'chater 13 bakrupcy', it should be able to display "did you mean 'chapter 13 bankruptcy'?" Has someone been able to do this?
solr results debugging
I've been using the solr admin form with debug=true to do some in-depth analysis on some results. Could someone explain how to make sense of this? This is the debugging info for the first result I got:

10.201284 = (MATCH) sum of:
  6.2467875 = (MATCH) max plus 0.01 times others of:
    6.236769 = (MATCH) weight(text:(probat trust live inherit) testament^0.8 in 48784), product of:
      0.7070911 = queryWeight(text:(probat trust live inherit) testament^0.8), product of:
        0.8 = boost
        18.032305 = idf(text:(probat trust live inherit) testament^0.8)
        0.049015578 = queryNorm
      8.820319 = (MATCH) fieldWeight(text:(probat trust live inherit) testament^0.8 in 48784), product of:
        2.236068 = tf(phraseFreq=5.0)
        18.032305 = idf(text:(probat trust live inherit) testament^0.8)

...and it continues some more.

search query: will

synonyms that I have: will, living will, last will and testament, living trust, inheritance, probate

here is my request handler (portion of it):

<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">text^0.8 name^2.0</str>
<!-- until 3 all should match; 4 - 3 shld match; 5 - 4 shld match; 6 - 5 shld match; above 6 - 90% match -->
<str name="mm">3&lt;-1 4&lt;-1 5&lt;-1 6&lt;90%</str>
<str name="pf">text^0.8 name^2.0</str>
<int name="ps">50</int>
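As an aside, the mm spec in that handler (3<-1 4<-1 5<-1 6<90%) can be read as a table of thresholds: each "n<v" clause applies when the query has more than n optional clauses, a negative v means "all but v", and a percentage is rounded down. The sketch below is my reading of that particular spec (worth double-checking against the dismax docs), matching the comment in the config:

```python
def min_should_match(num_clauses):
    """Required matches under mm = "3<-1 4<-1 5<-1 6<90%":
    up to 3 clauses all must match; 4 through 6 clauses allow one miss;
    above 6, 90% (rounded down) must match."""
    if num_clauses <= 3:
        return num_clauses          # all clauses required
    if num_clauses <= 6:
        return num_clauses - 1      # the "-1" clauses: all but one
    return int(num_clauses * 0.90)  # above 6: 90%, rounded down

for n in range(1, 9):
    print(n, "clauses ->", min_should_match(n), "required")
```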
solr word delimiter
I have the word delimiter filter factory in the text field definition both at index and query time. But it does have some negative effects on some search terms like "h1-b visa". It splits this into three tokens: h, 1, b. Now if I understand right, does Solr look for matches for 'h' separately, '1' separately and 'b' separately because they are three different tokens? This is giving some undesired results: docs that have 'h' somewhere, '1' somewhere and 'b' somewhere. How to solve this problem? I tried adding a synonym like

h1-b => h1b visa

It does filter some results, but I'm trying to find a global solution rather than adding synonyms for all kinds of immigration forms like i-94, k-1 etc.
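For what it's worth, the splitting described above can be imitated roughly like this (a toy approximation of WordDelimiterFilter's default behavior of splitting on intra-word delimiters and on letter/digit transitions, not the actual filter):

```python
import re

def word_delimiter_split(token):
    """Split on non-alphanumeric delimiters, then on letter<->digit
    boundaries, roughly as WordDelimiterFilter does by default."""
    parts = []
    for piece in re.split(r"[^A-Za-z0-9]+", token):
        # alpha<->digit boundary split: "h1" -> "h", "1"
        parts.extend(re.findall(r"[A-Za-z]+|[0-9]+", piece))
    return [p for p in parts if p]

print(word_delimiter_split("h1-b"))  # ['h', '1', 'b']
print(word_delimiter_split("i-94"))  # ['i', '94']
print(word_delimiter_split("k-1"))   # ['k', '1']
```

Seen this way, the single-character tokens are exactly why unrelated docs containing stray 'h', '1' and 'b' match; the usual knobs are the filter's catenate options or per-field analyzer changes rather than per-term synonyms.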