Re: Indexing TIKA extracted text. Are there some issues?
Sure. The java command I use with Tika to extract text from a URL is:

    java -jar tika-0.3-standalone.jar -t $url

I have also attached the screenshots of the web page, the post documents produced in the two different ways (Perl and Tika) for that web page, and the screenshots of the search result for a string contained in that web page. The index in each case contains just this one URL. To keep everything else identical, I used the same instance for creating the index in each case. First I posted the Tika document, checked the results, emptied the index, posted the Perl document, and checked the results.

Debug query for Tika:

    <str name="parsedquery">+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0 | title:高通公司展现了海量的优质多媒体内容能^2.0 | content_china:高通 通公 公司 司展 展现 现了 了海 海量 量的 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()</str>

Debug query for Perl:

    <str name="parsedquery">+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0 | title:高通公司展现了海量的优质多媒体内容能^2.0 | content_china:高通 通公 公司 司展 展现 现了 了海 海量 量的 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()</str>

The screenshots: http://www.nabble.com/file/p24728917/Tika%2BIssue.docx (Tika+Issue.docx)
Perl extracted doc: http://www.nabble.com/file/p24728917/china.perl.xml (china.perl.xml)
Tika extracted doc: http://www.nabble.com/file/p24728917/china.tika.xml (china.tika.xml)

Grant Ingersoll-6 wrote:

Hmm, looks very much like an encoding problem. Can you post a sample showing it, along with the commands you invoked? Thanks, Grant

On Jul 28, 2009, at 6:14 PM, ashokc wrote:

I am finding that the search results based on indexing Tika-extracted text are very different from results based on indexing text extracted via other means. This shows up, for example, with a Chinese web site that I am trying to index. I created the documents (for posting to Solr) in two ways. The source text of the web pages is full of HTML entities like &#12345;, with some English characters mixed in.

(a) Simple text extraction from the page source by a Perl script. The resulting content field looks like

    <field name="content_china">Who We Are &#20844;&#21496;&#21382;&#21490; &#24744;&#30340;&#25104;&#21151;&#26696;&#20363; &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376; Innovation &#21019; etc...</field>

I posted these documents to a Solr instance.

(b) Used Tika (command line). The resulting content field looks like

    <field name="content_china">Who We Are Ã¥ ŒÂ¸à ¥ÂŽÂ†Ã¥Â² 您的æˆÂ功æ¡ ˆä¾‹ 领导团队 业务部门  Innovation à ¥Â etc...</field>

I posted these documents to a different instance.

When I search the first instance for a string (copied and pasted from the web site) I find a number of hits, including the page from which I copied the string. But when I do the same on the instance with Tika-extracted text, I get nothing. Has anyone seen this? I believe it may have to do with encoding. In both cases the posted documents were UTF-8 compliant. Thanks for your insights. - ashok

--
View this message in context: http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
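What is described above is the classic double-encoding failure: UTF-8 bytes get decoded with a single-byte charset somewhere in the pipeline. A minimal Python sketch (not from the thread; the sample string is illustrative) showing how that produces this style of mojibake:

```python
# UTF-8 bytes misread as ISO-8859-1: each multi-byte CJK character
# turns into several Latin-1 characters, the pattern seen in the
# Tika-extracted field above.
text = "公司"                        # "company" - two CJK characters
utf8_bytes = text.encode("utf-8")    # 3 bytes per character, 6 total

mangled = utf8_bytes.decode("iso-8859-1")
print(len(text), len(mangled))       # 2 6

# The damage is reversible only while no bytes have been dropped:
restored = mangled.encode("iso-8859-1").decode("utf-8")
print(restored == text)              # True
```

Once the mangled string is re-encoded as UTF-8 and indexed, the original characters are gone for good, which would explain why searches against the Tika-fed index find nothing.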
Re: Indexing TIKA extracted text. Are there some issues?
Could very well be... I will rectify it and try again. Thanks - ashok

Robert Muir wrote:

It appears there is an encoding problem. In the screenshot I can see the title is mangled, and if I open up the URL in IE or Firefox, both browsers think it is iso-8859-1. I think this is why (from the W3C validator):

    Character Encoding mismatch! The character encoding specified in the HTTP header (iso-8859-1) is different from the value in the meta element (utf-8). I will use the value from the HTTP header (iso-8859-1) for this validation.

On Wed, Jul 29, 2009 at 6:02 PM, ashokc ash...@qualcomm.com wrote:

Sure. The java command I use with Tika to extract text from a URL is:

    java -jar tika-0.3-standalone.jar -t $url

I have also attached the screenshots of the web page, the post documents produced in the two different ways (Perl and Tika) for that web page, and the screenshots of the search result for a string contained in that web page. The index in each case contains just this one URL. To keep everything else identical, I used the same instance for creating the index in each case. First I posted the Tika document, checked the results, emptied the index, posted the Perl document, and checked the results.

Debug query for Tika:

    <str name="parsedquery">+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0 | title:高通公司展现了海量的优质多媒体内容能^2.0 | content_china:高通 通公 公司 司展 展现 现了 了海 海量 量的 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()</str>

Debug query for Perl:

    <str name="parsedquery">+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0 | title:高通公司展现了海量的优质多媒体内容能^2.0 | content_china:高通 通公 公司 司展 展现 现了 了海 海量 量的 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()</str>

The screenshots: http://www.nabble.com/file/p24728917/Tika%2BIssue.docx (Tika+Issue.docx)
Perl extracted doc: http://www.nabble.com/file/p24728917/china.perl.xml (china.perl.xml)
Tika extracted doc: http://www.nabble.com/file/p24728917/china.tika.xml (china.tika.xml)

--
Robert Muir
rcm...@gmail.com
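The mismatch the validator flags (HTTP header charset vs. the page's own meta tag) can be detected mechanically before extraction. A rough sketch under the assumption that the header value has already been pulled from the HTTP response; the regex is deliberately simplistic and the sample HTML mirrors the case in this thread:

```python
import re

def meta_charset(html: str):
    """Return the charset declared in a meta tag, or None if absent."""
    m = re.search(r'charset\s*=\s*["\']?([\w-]+)', html, re.IGNORECASE)
    return m.group(1).lower() if m else None

# The header charset would normally come from the Content-Type HTTP
# header; it is hard-coded here for illustration.
header_charset = "iso-8859-1"
html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

page_charset = meta_charset(html)
if page_charset and page_charset != header_charset:
    print("Encoding mismatch: header says %s, meta says %s"
          % (header_charset, page_charset))
```

Per the HTTP specification the header takes precedence over the meta element, which is presumably why Tika (like the W3C validator) decoded the page as iso-8859-1 even though the bytes were UTF-8.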
Indexing TIKA extracted text. Are there some issues?
I am finding that the search results based on indexing Tika-extracted text are very different from results based on indexing text extracted via other means. This shows up, for example, with a Chinese web site that I am trying to index. I created the documents (for posting to Solr) in two ways. The source text of the web pages is full of HTML entities like &#12345;, with some English characters mixed in.

(a) Simple text extraction from the page source by a Perl script. The resulting content field looks like

    <field name="content_china">Who We Are &#20844;&#21496;&#21382;&#21490; &#24744;&#30340;&#25104;&#21151;&#26696;&#20363; &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376; Innovation &#21019; etc...</field>

I posted these documents to a Solr instance.

(b) Used Tika (command line). The resulting content field looks like

    <field name="content_china">Who We Are Ã¥ ŒÂ¸åŽ†å² 您的æˆÂ功æ¡ ˆä¾‹ 领导团队 业务部门  Innovation å etc...</field>

I posted these documents to a different instance.

When I search the first instance for a string (copied and pasted from the web site) I find a number of hits, including the page from which I copied the string. But when I do the same on the instance with Tika-extracted text, I get nothing. Has anyone seen this? I believe it may have to do with encoding. In both cases the posted documents were UTF-8 compliant. Thanks for your insights. - ashok
Re: CJKTokenizerFactory seems to work for Korea but not for China and Japan
Yes, I reindexed the entire repository after each of my changes. Here is the output with debug on.

== DEBUG OUTPUT BEGIN ==

    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">83</int>
      <lst name="params">
        <str name="wt">standard</str>
        <str name="rows">10</str>
        <str name="explainOther"/>
        <str name="start">0</str>
        <str name="hl.fl">content</str>
        <str name="indent">on</str>
        <str name="fl">*,score</str>
        <str name="hl">on</str>
        <str name="q">创意或商业创新、</str>
        <str name="debugQuery">on</str>
        <str name="qt">dismax</str>
        <str name="version">2.2</str>
      </lst>
    </lst>
    <result name="response" numFound="0" start="0" maxScore="0.0"/>
    <lst name="debug">
      <str name="rawquerystring">创意或商业创新、</str>
      <str name="querystring">创意或商业创新、</str>
      <str name="parsedquery">+DisjunctionMaxQuery((content:创意或商业创新、 | urltext:创意或商业创新、^2.0 | title:创意或商业创新、^2.0)~0.01) ()</str>
      <str name="parsedquery_toString">+(content:创意或商业创新、 | urltext:创意或商业创新、^2.0 | title:创意或商业创新、^2.0)~0.01 ()</str>
      <lst name="explain"/>
      <str name="QParser">DismaxQParser</str>
      <null name="altquerystring"/>
      <null name="boostfuncs"/>
    </lst>

== DEBUG OUTPUT END ==

Shalin Shekhar Mangar wrote:

That is strange. Can you add a request parameter debugQuery=on and post the response? Also, whenever you change the field type (use a different tokenizer etc.), make sure you re-index the documents.

--
Regards,
Shalin Shekhar Mangar.
CJKTokenizerFactory seems to work for Korea but not for China and Japan
Hi, I have the following fieldType that processes Korean/Chinese/Japanese text:

    <fieldType name="cjk_text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.CJKTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.CJKTokenizerFactory"/>
      </analyzer>
    </fieldType>

When I supply Korean words/phrases in the query, I do get several expected Korean URLs as search results, and my keywords are correctly highlighted in the excerpt. But for Chinese and Japanese I almost always draw a blank, i.e. no hits. When I run sample Chinese/Japanese text through 'analysis' (/search/admin/analysis.jsp), it does highlight the matches it found for the query words I supplied. But when I actually search for it (/search/admin/form.jsp) I get no hits.

For Chinese text I have also tried

    <fieldType name="cn_text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.ChineseTokenizerFactory"/>
        <filter class="solr.ChineseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ChineseTokenizerFactory"/>
        <filter class="solr.ChineseFilterFactory"/>
      </analyzer>
    </fieldType>

Same behavior. I am using Solr for several other languages like Russian/Spanish/Italian/French/German etc. (each with its own tokenizers and stemmers, where available) and I do get results that correctly highlight the words I am supplying in the query. While I can't judge the meaningful quality of the results, I am satisfied that Solr is returning documents that contain the query string(s). Not sure what the problem may be with Chinese and Japanese. I have updated my Solr distribution to the latest nightly, solr-2009-06-29.zip, just in case. Has not helped, of course. Thanks for your help. - ashok
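For context on why index-time and query-time analysis must agree: CJKTokenizer indexes runs of CJK characters as overlapping bigrams, so a query analyzed differently (e.g. left as one long token) matches nothing. A rough illustration of the bigram scheme in Python (a simplification, not the actual Lucene code):

```python
def cjk_bigrams(text: str) -> list:
    """Overlapping two-character tokens, the scheme CJKTokenizer
    applies to runs of CJK characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("高通公司"))   # ['高通', '通公', '公司']
```

If the indexed tokens are these bigrams but the query side emits the whole string as a single token, the posting lists never intersect, which would be consistent with analysis.jsp showing matches while real searches return zero hits.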
copyfield and 'store' and highlighting
Hi, I copy 'field1' to 'field2' so that I can apply a different set of analyzers and filters. Content-wise, they are identical. 'field2' has to be stored because it is used for highlighting. Do I have to declare 'field1' also to be stored? 'field1' is never returned in the response. Thanks. - ashok
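For what it's worth, a field that is only searched and never returned or highlighted does not have to be stored; copyField copies the raw incoming value before analysis, so the copy's stored setting is independent of the source's. A hypothetical schema fragment (field and type names are made up, not from this thread):

```xml
<!-- searched only: indexed but not stored -->
<field name="field1" type="text_variant_a" indexed="true" stored="false"/>
<!-- used for highlighting, so it must be stored -->
<field name="field2" type="text_variant_b" indexed="true" stored="true"/>
<copyField source="field1" dest="field2"/>
```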
qf boost Versus field boost for Dismax queries
When 'dismax' queries are used, where is the best place to apply boost values/factors: while indexing, by supplying the 'boost' attribute to the field, or in solrconfig.xml, by specifying the 'qf' parameter with the same boosts? What are the advantages/disadvantages of each? What happens if both boosts are present? Do they get multiplied? Thanks - ashok
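As a point of reference, dismax query-time boosts live in the qf parameter of the request handler configuration; a sketch with illustrative field names and weights:

```xml
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- query-time, per-field boosts: tunable without re-indexing -->
    <str name="qf">title^2.0 urltext^2.0 content</str>
  </lst>
</requestHandler>
```

An index-time boost, by contrast, is baked into the index when the document is written and requires re-indexing to change; in classic Lucene scoring both kinds of boost end up multiplied into the document's score, so setting both compounds them.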
How to disable posting updates from a remote server
Hi, I find that I am freely able to post to my production Solr server from any other host that can run the post command. So somebody could wipe out the whole index by posting a delete query. Is there a way Solr can be configured so that it will take updates ONLY from the server on which it is running? Thanks - ashok
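Solr of this era has no built-in authentication, so the restriction is usually applied in the servlet container or in front of it. One hedged sketch, assuming Tomcat: a RemoteAddrValve on the Solr context that only admits local requests (note this blocks remote queries too, not just updates; a firewall rule or reverse proxy scoped to the /update path is the finer-grained alternative):

```xml
<!-- e.g. in the Solr context descriptor (context.xml) -->
<Context path="/solr" docBase="solr.war">
  <Valve className="org.apache.catalina.valves.RemoteAddrValve"
         allow="127\.0\.0\.1"/>
</Context>
```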
Highlighting and Field options
Hi, the 'content' field that I am indexing is usually large (e.g. a PDF doc a few MB in size). I need highlighting to be on, which seems to require that the 'content' field be STORED. This returns the whole content field in the search result XML for each matching document; the highlighted text is also returned in a separate block. But I do NOT need the entire content field to display the search results. I only use the highlighted segments to display a brief description of each hit. The fact that Solr returns the entire content field makes the returned XML unnecessarily huge, and makes for larger response times. How can I have Solr return ONLY the highlighted text for each hit and NOT the entire 'content' field? Thanks - ashok
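The returned stored fields and the highlighting block are controlled independently: fl decides what appears inside each doc element, while the hl.* parameters drive the separate highlighting section. A hedged example request (host, port, and field names are illustrative):

```
http://localhost:8983/solr/select?q=umts&hl=on&hl.fl=content&fl=id,title,score
```

Because content is omitted from fl, the full field is not echoed per document, yet the highlighting section can still return fragments, since the field itself remains stored in the index.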
Re: Boosting by facets with standard query
Thanks for the tip. Looks like a neat idea. I have never used the sort feature, so I have to create a new numeric key with values 1 or 2 - value 1 for white_papers/pdfs, 2 for others? The problem also is that the facets I need to boost can vary by query. That is, if the query term is 'a', boost the facets 'facet1' and 'facet2'. If the query term is 'b', then boost the facets 'facet4' and 'facet5'. Perhaps I can identify the most frequently used boost orders, and create as many fields as there are orders. Would that be the way? - ashok

Shalin Shekhar Mangar wrote:

On Fri, Apr 17, 2009 at 11:32 AM, ashokc ash...@qualcomm.com wrote:

What we need is for the white_papers and pdfs to be boosted, but if and only if such documents are valid results for the search term in question. How would I write my above 'q' to accomplish that?

Thanks for explaining in detail. Basically, all you want to do is sort the results in the following order:

1. White papers
2. PDFs
3. Others

or maybe #1 and #2 are equivalent and can be intermingled. The easiest way to do this is to index a new field whose values (when sorted) give you the desired order. Then you can simply sort on that field and score.

--
Regards,
Shalin Shekhar Mangar.
Re: Boosting by facets with standard query
What you indicated here is for a different purpose, is it not? I already do something similar with my 'q'. For example, a sample query logged in 'catalina.out' looks like

    webapp=/search path=/select params={rows=15&start=0&q=(+(content:umts)+OR+(title:umts)^2+OR+(urltext:umts)^2)}

when the search term is umts. I am looking for this term umts in the fields (a) content, (b) title (boosted by a factor of 2) and (c) urltext (boosted by a factor of 2). So the presence of the term umts in the title or URL is weighed more than its presence in the regular content. So far so good.

Now, I have other fields as well, like document type, file type, etc., that serve as facets to telescope down. Among the above set of search results, I want to boost a specific document type, 'white_papers', and a specific file type, pdf. By boosting I mean that these white_paper pdf documents should float to the top of the heap in the search results, if such documents are at all present in the search results. So would I simply add the following to the above q?

    q=(+(content:umts)+OR+(title:umts)^2+OR+(urltext:umts)^2)+AND+(doctype:white_papers)^2+AND+(filetype:pdf)^2

But wouldn't the above give 0 results if there are no white_papers pdfs (because of the AND)? If I use OR, then the meaning of the query is lost altogether. What we need is for the white_papers pdfs to be boosted, but if and only if such documents are valid results for the search term in question. How would I write my above 'q' to accomplish that? Thanks - ashok

Shalin Shekhar Mangar wrote:

On Fri, Apr 17, 2009 at 1:03 AM, ashokc ash...@qualcomm.com wrote:

I have a query that yields results binned in several facets. How can I boost the results that fall in certain facets over the rest of them that do not belong to those facets? I use the standard query format. Thank you

I'm not sure what you mean by boosting by facet. Do you mean that you want to boost documents which match a term query? If yes, you can use your_field_name:value^2.0 in the q parameter.

--
Regards,
Shalin Shekhar Mangar.
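With the standard query parser, the usual trick is to make the main clause mandatory and leave the boost clauses optional, so they raise the score of matching documents without excluding anything. A sketch built from the fields in the question (weights illustrative):

```
q=+(content:umts OR (title:umts)^2 OR (urltext:umts)^2) (doctype:white_papers)^2 (filetype:pdf)^2
```

The leading + keeps the term clause required, while the unprefixed clauses are optional SHOULD clauses: documents that happen to be white papers or PDFs float upward, but nothing is filtered out. The dismax handler's bq (boost query) parameter exists for exactly this purpose.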
Boosting by facets with standard query
I have a query that yields results binned in several facets. How can I boost the results that fall in certain facets over the rest of them that do not belong to those facets? I use the standard query format. Thank you - ashok
DIH uniqueKey
Hi, I have separate JDBC datasources (DS1 and DS2) that I want to index with DIH in a single Solr instance. The unique keys for the two sources are different. Do I have to synthesize a uniqueKey that spans both datasources? That is, the uniqueKey values would be like (+ indicating concatenation):

    DS1 + primary key for DS1
    DS2 + primary key for DS2

Thanks - ashok
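One way to synthesize such a key inside DIH itself is the TemplateTransformer, which can prefix each source's primary key with a source tag. A sketch; the entity, column, and key names here are invented for illustration:

```xml
<entity name="ds1_log" dataSource="DS1" transformer="TemplateTransformer"
        query="SELECT PK, other_cols FROM table1">
  <!-- prefix the primary key so ids from DS1 and DS2 cannot collide -->
  <field column="id" template="DS1-${ds1_log.PK}"/>
</entity>
```

A parallel entity for DS2 would use a DS2- prefix in its template, and the schema's uniqueKey would point at the synthesized id field.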
Re: More than one language in the same document
What I am doing right now is to capture all the content under content_korea, for example, and use 'copyField' to duplicate that content to content_english. content_korea gets processed with CJK analyzers, and content_english gets processed with the usual detailed index/query analyzers, filters, and synonyms. Some results do come up, but I have not been able to verify that this approach is yielding better results.

A related question: what does 'copyField' actually do? Does it 'append' content from the source field to the target field, or does it replace/overwrite it? Thank you. - ashok

hossman wrote:

: I have documents where text from two languages, e.g. (english korean) or
: (english german) are mixed up in a fairly intensive way. 20-30% of the

If you search the list archives you'll find a lot of results for languages ... it's not something I deal with much but I believe using separate fields (or dynamic fields) for each language is considered the best strategy.

-Hoss
Re: Oracle Clob column with DIH does not turn to String
Yes, you are correct. But the documentation for DIH says the column names are case insensitive. That should be fixed. Here is what it says:

    A shorter data-config
    In the above example, there are mappings of fields to Solr fields. It is possible to totally avoid the field entries in entities if the names of the fields are same (case does not matter) as those in Solr schema.

Noble Paul നോബിള് नोब्ळ् wrote:

It is very expensive to do a case insensitive lookup. It must first convert all the keys to lower case and try looking up there, because it may not always be in uppercase; it can be in mixed case as well.

On Sat, Apr 4, 2009 at 12:58 AM, ashokc ash...@qualcomm.com wrote:

Happy to report that it is working. Looks like we have to use UPPER CASE for all the column names. When I examined the map 'aRow', it had the column names in upper case, whereas my config had lower case. No match was found, so nothing happened. Changed my config and it works now. Thanks for your help. Perhaps this transformer can be modified to be case-insensitive for the column names. If you wrote it, perhaps it is a quick change for you?

Noble Paul നോബിള് नोब्ळ् wrote:

I guess you can write a custom transformer which gets a String out of the oracle.sql.CLOB. I am just out of clue why this may happen. I even wrote a testcase and it seems to work fine. --Noble

On Fri, Apr 3, 2009 at 10:23 PM, ashokc ash...@qualcomm.com wrote:

I downloaded the nightly build yesterday (2nd April), modified the ClobTransformer.java file with some prints, and compiled it all (ant dist). It produced a war file, apache-solr-1.4-dev.war. That is what I am using. My modification and compilation have not affected the results. I was getting the same behavior with the 'war' that the download came with. Thanks Noble.

Noble Paul നോബിള് नोब्ळ् wrote: and which version of Solr are you using?

On Fri, Apr 3, 2009 at 10:09 PM, ashokc ash...@qualcomm.com wrote:

Sure:

data-config XML:

    <dataConfig>
      <dataSource driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@x" user="remedy" password="y"/>
      <document name="remedy">
        <entity name="log" transformer="ClobTransformer"
                query="SELECT mylog_ato, name_char, dsc FROM log_tbl">
          <field column="mylog_ato" name="log_no"/>
          <field column="name_char" name="short_desc"/>
          <field column="dsc" clob="true" name="description"/>
        </entity>
      </document>
    </dataConfig>

A search result on the field short_desc:

    <doc>
      <float name="score">1.8670129</float>
      <str name="description">oracle.sql.c...@155e3ab</str>
      <int name="log_no">4486</int>
      <str name="short_desc">Develop Rating functionality for QIN</str>
      <date name="timestamp">2009-04-03T11:47:32.635Z</date>
    </doc>

Noble Paul നോബിള് नोब्ळ् wrote:

There is something else wrong with your setup. Can you just paste the whole data-config.xml? --Noble

On Fri, Apr 3, 2009 at 5:39 PM, ashokc ash...@qualcomm.com wrote:

Noble, I put in a few 'System.out.println' statements in the ClobTransformer.java file and remade the war. But I see none of these prints coming up in my 'catalina.out' file. Is that the right file to be looking at? As an aside, is 'catalina.out' the ONLY log file for Solr? I turned on the logging to 'FINE' for everything. Also, these settings seem to go away when Tomcat is restarted. - ashok

Noble Paul നോബിള് नोब्ळ् wrote:

Yeah, ant dist will give you the .war file you may need; just drop it in and you are set to go. Or, if you can hook up a debugger to a running Solr, that is the easiest. --Noble

On Fri, Apr 3, 2009 at 9:35 AM, ashokc ash...@qualcomm.com wrote:

That would require me to recompile (with ant/maven scripts?) the source and replace the jar for DIH, right? I can try - for the first time. - ashok

Noble Paul നോബിള് नोब्ळ् wrote:

This looks strange. Apparently the Transformer did not get applied. Is it possible for you to debug ClobTransformer? Adding System.out.println into ClobTransformer may help.

On Fri, Apr 3, 2009 at 6:04 AM, ashokc ash...@qualcomm.com wrote:

Correcting my earlier post. It lost some lines somehow. Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

    <entity name="log" transformer="ClobTransformer" ...>
      <field column="description" clob="true" name="description"/>
    </entity>

But it does not seem to turn this CLOB into a String. The search results show:

    <doc>
      <float name="score">1.8670129</float>
      <str name="description">oracle.sql.c...@aed3a5</str>
      <int name="log_no">4486</int>
    </doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

ashokc wrote:

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly
Re: Multi-valued fields with DIH
That worked. Thanks again.

Noble Paul നോബിള് नोब्ळ् wrote:

The column names are case sensitive. Try this:

    <field column="PROJECT_AREA" name="projects"/>
    <field column="PROJECT_VERSION" name="projects"/>

On Sat, Apr 4, 2009 at 3:58 AM, ashokc ash...@qualcomm.com wrote:

Hi, I need to assign multiple values to a field, with each value coming from a different column of the SQL query. My data config snippet has lines like

    <field column="project_area" name="projects"/>
    <field column="project_version" name="projects"/>

where 'project_area' and 'project_version' are output by the SQL query to the datasource. The 'verbose-output' from dataimport.jsp does show that these columns have values returned by the query:

    <lst name="verbose-output">
      <lst name="entity:log">
        <lst name="document#1">
          <str name="query">x</str>
          <str name="time-taken">0:0:0.142</str>
          <str>--- row #1 ---</str>
          <str name="PROJECT_AREA">MySource/Area/Admin</str>
          <str name="PROJECT_VERSION">MySource/Version/06.02</str>
          <date name="LAST_MODIFIED_DATE">2008-10-21T07:00:00Z</date>
          ...

But the resulting index has no data in the field 'projects'. Is it NOT possible to create multi-valued fields with DIH? Thanks

--
--Noble Paul
Re: Oracle Clob column with DIH does not turn to String
Noble, I put in a few 'System.out.println' statements in the ClobTransformer.java file and remade the war. But I see none of these prints coming up in my 'catalina.out' file. Is that the right file to be looking at? As an aside, is 'catalina.out' the ONLY log file for Solr? I turned on the logging to 'FINE' for everything. Also, these settings seem to go away when Tomcat is restarted. - ashok

Noble Paul നോബിള് नोब्ळ् wrote:

Yeah, ant dist will give you the .war file you may need; just drop it in and you are set to go. Or, if you can hook up a debugger to a running Solr, that is the easiest. --Noble

On Fri, Apr 3, 2009 at 9:35 AM, ashokc ash...@qualcomm.com wrote:

That would require me to recompile (with ant/maven scripts?) the source and replace the jar for DIH, right? I can try - for the first time. - ashok

Noble Paul നോബിള് नोब्ळ् wrote:

This looks strange. Apparently the Transformer did not get applied. Is it possible for you to debug ClobTransformer? Adding System.out.println into ClobTransformer may help.

On Fri, Apr 3, 2009 at 6:04 AM, ashokc ash...@qualcomm.com wrote:

Correcting my earlier post. It lost some lines somehow. Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

    <entity name="log" transformer="ClobTransformer" ...>
      <field column="description" clob="true" name="description"/>
    </entity>

But it does not seem to turn this CLOB into a String. The search results show:

    <doc>
      <float name="score">1.8670129</float>
      <str name="description">oracle.sql.c...@aed3a5</str>
      <int name="log_no">4486</int>
    </doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

--
--Noble Paul
Re: Oracle Clob column with DIH does not turn to String
Sure: data-config Xml === dataConfig dataSource driver=oracle.jdbc.driver.OracleDriver url=jdbc:oracle:thin:@x user=remedy password=y/ document name=remedy entity name=log transformer=ClobTransformer query=SELECT mylog_ato, name_char, dsc FROM log_tbl field column=mylog_ato name=log_no / field column=name_char name=short_desc / field column=dsc clob=true name=description / /entity /document /dataConfig === A search result on the field short_desc: -- doc float name=score1.8670129/float str name=descriptionoracle.sql.c...@155e3ab/str int name=log_no4486/int str name=short_descDevelop Rating functionality for QIN/str date name=timestamp2009-04-03T11:47:32.635Z/date /doc Noble Paul നോബിള് नोब्ळ् wrote: There is something else wrong with your setup. can you just paste the whole data-config.xml --Noble On Fri, Apr 3, 2009 at 5:39 PM, ashokc ash...@qualcomm.com wrote: Noble, I put in a few 'System.out.println' statements in the ClobTransformer.java file remade the war. But I see none of these prints coming up in my 'catalina.out' file. Is that the right file to be looking at? As an aside, is 'catalina.out' the ONLY log file for SOLR? I turned on the logging to 'FINE' for everything. Also, these settings seem to go away when Tomcat is restarted. - ashok Noble Paul നോബിള് नोब्ळ् wrote: yeah, ant dist will give you the .war file you may need . just drop it in and you are set to go. or if you can hook up a debugger to a running Solr that is the easiest --Noble On Fri, Apr 3, 2009 at 9:35 AM, ashokc ash...@qualcomm.com wrote: That would require me to recompile (with ant/maven scripts?) the source and replace the jar for DIH, right? I can try - for the first time. - ashok Noble Paul നോബിള് नोब्ळ् wrote: This looks strange. Apparently the Transformer did not get applied. Is it possible for you to debug ClobTransformer adding(System.out.println into ClobTransformer may help) On Fri, Apr 3, 2009 at 6:04 AM, ashokc ash...@qualcomm.com wrote: Correcting my earlier post. 
It lost some lines somehow.

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="log" transformer="ClobTransformer" ...>
  <field column="description" clob="true" name="description" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

ashokc wrote:

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="description" transformer="ClobTransformer" ...>
  <field column="description" clob="true" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859865.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- --Noble Paul
-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22861630.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- --Noble Paul
-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22867161.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- --Noble Paul
-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22872184.html
Sent from the Solr - User mailing list archive at Nabble.com.
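For context, the conversion the ClobTransformer is supposed to perform — draining a java.sql.Clob's character stream into a String — can be sketched in plain Java. This is an illustrative helper, not the actual ClobTransformer source: the class and method names are invented, and the JDK's SerialClob stands in here for the oracle.sql.CLOB a real JDBC row would contain.

```java
import java.io.Reader;
import java.sql.Clob;
import javax.sql.rowset.serial.SerialClob;

public class ClobToString {

    // Drain the CLOB's character stream into a String.
    static String clobToString(Clob clob) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = clob.getCharacterStream()) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // SerialClob is a stand-in for the oracle.sql.CLOB a driver would return.
        Clob clob = new SerialClob("Develop Rating functionality".toCharArray());
        System.out.println(clobToString(clob));
    }
}
```

The real transformer operates on the row map DIH hands it, but the heart of the job is just this stream-to-String copy; when the transformer never fires, the raw CLOB object's toString() (e.g. oracle.sql.CLOB@...) is what ends up indexed, which matches the symptom above.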
Re: Oracle Clob column with DIH does not turn to String
I downloaded the nightly build yesterday (2nd April), modified the ClobTransformer.java file with some prints, and compiled it all (ant dist). It produced a war file, apache-solr-1.4-dev.war. That is what I am using. My modification and compilation have not affected the results; I was getting the same behavior with the 'war' that the download came with. Thanks Noble.

Noble Paul നോബിള്‍ नोब्ळ् wrote:

And which version of Solr are you using?

On Fri, Apr 3, 2009 at 10:09 PM, ashokc ash...@qualcomm.com wrote:

Sure: data-config XML

===
<dataConfig>
  <dataSource driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@x" user="remedy" password="y"/>
  <document name="remedy">
    <entity name="log" transformer="ClobTransformer" query="SELECT mylog_ato, name_char, dsc FROM log_tbl">
      <field column="mylog_ato" name="log_no" />
      <field column="name_char" name="short_desc" />
      <field column="dsc" clob="true" name="description" />
    </entity>
  </document>
</dataConfig>
===

A search result on the field short_desc:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@155e3ab</str>
  <int name="log_no">4486</int>
  <str name="short_desc">Develop Rating functionality for QIN</str>
  <date name="timestamp">2009-04-03T11:47:32.635Z</date>
</doc>

Noble Paul നോബിള്‍ नोब्ळ् wrote:

There is something else wrong with your setup. Can you just paste the whole data-config.xml? --Noble

On Fri, Apr 3, 2009 at 5:39 PM, ashokc ash...@qualcomm.com wrote:

Noble, I put in a few 'System.out.println' statements in the ClobTransformer.java file and remade the war. But I see none of these prints coming up in my 'catalina.out' file. Is that the right file to be looking at? As an aside, is 'catalina.out' the ONLY log file for SOLR? I turned on the logging to 'FINE' for everything. Also, these settings seem to go away when Tomcat is restarted. - ashok

Noble Paul നോബിള്‍ नोब्ळ् wrote:

Yeah, ant dist will give you the .war file you may need. Just drop it in and you are set to go.
Or, if you can hook up a debugger to a running Solr, that is the easiest. --Noble

On Fri, Apr 3, 2009 at 9:35 AM, ashokc ash...@qualcomm.com wrote:

That would require me to recompile (with ant/maven scripts?) the source and replace the jar for DIH, right? I can try - for the first time. - ashok

Noble Paul നോബിള്‍ नोब्ळ् wrote:

This looks strange. Apparently the Transformer did not get applied. Is it possible for you to debug ClobTransformer? (Adding System.out.println into ClobTransformer may help.)

On Fri, Apr 3, 2009 at 6:04 AM, ashokc ash...@qualcomm.com wrote:

Correcting my earlier post. It lost some lines somehow.

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="log" transformer="ClobTransformer" ...>
  <field column="description" clob="true" name="description" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

ashokc wrote:

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="description" transformer="ClobTransformer" ...>
  <field column="description" clob="true" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859865.html
Sent from the Solr - User mailing list archive at Nabble.com.
-- --Noble Paul
-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22861630.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- --Noble Paul
-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22867161.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- --Noble Paul
-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22872184.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- --Noble Paul
-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH
Re: Oracle Clob column with DIH does not turn to String
Happy to report that it is working. It looks like we have to use UPPER CASE for all the column names. When I examined the map 'aRow', it had the column names in upper case, whereas my config had lower case. No match was found, so nothing happened. I changed my config and it works now. Thanks for your help. Perhaps this transformer can be modified to be case-insensitive about the column names. Since you wrote it, perhaps it is a quick change for you?

Noble Paul നോബിള്‍ नोब्ळ् wrote:

I guess you can write a custom transformer which gets a String out of the oracle.sql.CLOB. I am just out of clues why this may happen. I even wrote a testcase and it seems to work fine. --Noble

On Fri, Apr 3, 2009 at 10:23 PM, ashokc ash...@qualcomm.com wrote:

I downloaded the nightly build yesterday (2nd April), modified the ClobTransformer.java file with some prints, and compiled it all (ant dist). It produced a war file, apache-solr-1.4-dev.war. That is what I am using. My modification and compilation have not affected the results; I was getting the same behavior with the 'war' that the download came with. Thanks Noble.

Noble Paul നോബിള്‍ नोब्ळ् wrote:

And which version of Solr are you using?

On Fri, Apr 3, 2009 at 10:09 PM, ashokc ash...@qualcomm.com wrote:

Sure: data-config XML

===
<dataConfig>
  <dataSource driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@x" user="remedy" password="y"/>
  <document name="remedy">
    <entity name="log" transformer="ClobTransformer" query="SELECT mylog_ato, name_char, dsc FROM log_tbl">
      <field column="mylog_ato" name="log_no" />
      <field column="name_char" name="short_desc" />
      <field column="dsc" clob="true" name="description" />
    </entity>
  </document>
</dataConfig>
===

A search result on the field short_desc:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@155e3ab</str>
  <int name="log_no">4486</int>
  <str name="short_desc">Develop Rating functionality for QIN</str>
  <date name="timestamp">2009-04-03T11:47:32.635Z</date>
</doc>

Noble Paul നോബിള്‍ नोब्ळ् wrote:

There is something else wrong with your setup.
Can you just paste the whole data-config.xml? --Noble

On Fri, Apr 3, 2009 at 5:39 PM, ashokc ash...@qualcomm.com wrote:

Noble, I put in a few 'System.out.println' statements in the ClobTransformer.java file and remade the war. But I see none of these prints coming up in my 'catalina.out' file. Is that the right file to be looking at? As an aside, is 'catalina.out' the ONLY log file for SOLR? I turned on the logging to 'FINE' for everything. Also, these settings seem to go away when Tomcat is restarted. - ashok

Noble Paul നോബിള്‍ नोब्ळ् wrote:

Yeah, ant dist will give you the .war file you may need. Just drop it in and you are set to go. Or, if you can hook up a debugger to a running Solr, that is the easiest. --Noble

On Fri, Apr 3, 2009 at 9:35 AM, ashokc ash...@qualcomm.com wrote:

That would require me to recompile (with ant/maven scripts?) the source and replace the jar for DIH, right? I can try - for the first time. - ashok

Noble Paul നോബിള്‍ नोब्ळ् wrote:

This looks strange. Apparently the Transformer did not get applied. Is it possible for you to debug ClobTransformer? (Adding System.out.println into ClobTransformer may help.)

On Fri, Apr 3, 2009 at 6:04 AM, ashokc ash...@qualcomm.com wrote:

Correcting my earlier post. It lost some lines somehow.

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="log" transformer="ClobTransformer" ...>
  <field column="description" clob="true" name="description" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

ashokc wrote:

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release.
My config says:

<entity name="description" transformer="ClobTransformer" ...>
  <field column="description" clob="true" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859865.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- --Noble Paul
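The mismatch diagnosed above — DIH's 'aRow' map carrying upper-case column names while data-config used lower case — could be guarded against with a case-insensitive lookup. A hedged sketch follows; the class and method names are invented, and the real ClobTransformer API may differ.

```java
import java.util.Map;
import java.util.TreeMap;

public class ColumnLookup {

    // Look up a column value ignoring key case, mirroring the situation
    // where the Oracle driver returns "DSC" but the config says "dsc".
    static Object getIgnoreCase(Map<String, ?> row, String column) {
        TreeMap<String, Object> ci = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        ci.putAll(row);
        return ci.get(column);
    }

    public static void main(String[] args) {
        Map<String, Object> aRow = Map.of("DSC", "some clob text", "MYLOG_ATO", 4486);
        System.out.println(getIgnoreCase(aRow, "dsc")); // prints: some clob text
    }
}
```

A transformer doing its lookups this way would have matched the lower-case config and converted the CLOB, at the cost of copying the row map once per lookup (trivial for rows this small).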
Multi-valued fields with DIH
Hi, I need to assign multiple values to a field, with each value coming from a different column of the SQL query. My data-config snippet has lines like:

<field column="project_area" name="projects" />
<field column="project_version" name="projects" />

where 'project_area' and 'project_version' are output by the SQL query to the datasource. The 'verbose-output' from dataimport.jsp does show that these columns have values returned by the query:

===
<lst name="verbose-output">
  <lst name="entity:log">
    <lst name="document#1">
      <str name="query">x</str>
      <str name="time-taken">0:0:0.142</str>
      <str>--- row #1 ---</str>
      <str name="PROJECT_AREA">MySource/Area/Admin</str>
      <str name="PROJECT_VERSION">MySource/Version/06.02</str>
      <date name="LAST_MODIFIED_DATE">2008-10-21T07:00:00Z</date>
      ...
===

But the resulting index has no data in the field 'projects'. Is it NOT possible to create multi-valued fields with DIH? Thanks

-- View this message in context: http://www.nabble.com/Multi-valued-fields-with-DIH-tp22877509p22877509.html
Sent from the Solr - User mailing list archive at Nabble.com.
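Two things seem worth double-checking here (offered as guesses, not a confirmed diagnosis): the destination field must be declared multi-valued in schema.xml, and — as in the CLOB thread above — the verbose output shows the columns coming back in UPPER CASE while the config refers to them in lower case. The schema side would look something like this sketch:

```xml
<!-- schema.xml sketch: 'projects' must accept more than one value -->
<field name="projects" type="string" indexed="true" stored="true" multiValued="true"/>
```

If 'projects' were single-valued, the second mapped column could silently fail rather than append a second value.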
Oracle Clob column with DIH does not turn to String
Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says, But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859837.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Oracle Clob column with DIH does not turn to String
Correcting my earlier post. It lost some lines somehow.

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="log" transformer="ClobTransformer" ...>
  <field column="description" clob="true" name="description" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

ashokc wrote:

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="description" transformer="ClobTransformer" ...>
  <field column="description" clob="true" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859865.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Oracle Clob column with DIH does not turn to String
That would require me to recompile (with ant/maven scripts?) the source and replace the jar for DIH, right? I can try - for the first time. - ashok

Noble Paul നോബിള്‍ नोब्ळ् wrote:

This looks strange. Apparently the Transformer did not get applied. Is it possible for you to debug ClobTransformer? (Adding System.out.println into ClobTransformer may help.)

On Fri, Apr 3, 2009 at 6:04 AM, ashokc ash...@qualcomm.com wrote:

Correcting my earlier post. It lost some lines somehow.

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="log" transformer="ClobTransformer" ...>
  <field column="description" clob="true" name="description" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

ashokc wrote:

Hi, I have set up to import some Oracle CLOB columns with DIH. I am using the latest nightly release. My config says:

<entity name="description" transformer="ClobTransformer" ...>
  <field column="description" clob="true" />
</entity>

But it does not seem to turn this CLOB into a String. The search results show:

<doc>
  <float name="score">1.8670129</float>
  <str name="description">oracle.sql.c...@aed3a5</str>
  <int name="log_no">4486</int>
</doc>

Any pointers on why I do not get the 'string' out of the CLOB for indexing? Is the nightly war NOT the right one to use? Thanks for your help. - ashok

-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22859865.html
Sent from the Solr - User mailing list archive at Nabble.com.
-- --Noble Paul
-- View this message in context: http://www.nabble.com/Oracle-Clob-column-with-DIH-does-not-turn-to-String-tp22859837p22861630.html
Sent from the Solr - User mailing list archive at Nabble.com.
More than one language in the same document
Hi, I have documents where text from two languages, e.g. (English and Korean) or (English and German), is mixed up in a fairly intensive way: 20-30% of the text is in English and the rest is in the other language. Can somebody indicate how I should set up the 'analyzers' and 'fields' in schema.xml? Should I have 2 fields with the same content, and 'analyze' them as English and non-English to build the index? Will the analyzer for non-English corrupt the index while processing the English text? And should my query look at both fields to fetch the results? Has somebody looked at this already? Thanks for your help. - ashok

-- View this message in context: http://www.nabble.com/More-than-one-language-in-the-same-document-tp22726478p22726478.html
Sent from the Solr - User mailing list archive at Nabble.com.
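One common approach — sketched here with invented field and type names, so treat it as a starting point rather than a verified recipe — is to copy the raw content into one field per language, each with its own analyzer chain, and query both:

```xml
<!-- schema.xml sketch: two analyzed views of the same content -->
<field name="content_en" type="text_en" indexed="true" stored="false" multiValued="true"/>
<field name="content_ko" type="text_cjk" indexed="true" stored="false" multiValued="true"/>
<copyField source="content" dest="content_en"/>
<copyField source="content" dest="content_ko"/>
```

A dismax-style query over both fields (e.g. qf=content_en content_ko) then lets whichever analyzer handled a given run of text produce the match. An English analyzer will typically pass non-Latin characters through as opaque tokens rather than corrupt the index, but that is worth verifying in the analysis page for the specific analyzers chosen.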
Re: Highlighting Oddities
I have seen some of the oddities that Chris is referring to. In my case, terms that are NOT in the query get highlighted. For example, searching for 'Intel' highlights 'Microsoft Corp' as well. I do not have them as synonyms either. Do these filter factories add some extra intelligence to the index, in that if you search for 'Samsung' even 'LG' is considered a highlightable term? I believe this was not the case when I was working with an earlier development version (from Nov or early Dec). Right now I am using solr-2008-12-29.war. - ashok

ryguasu wrote:

I'm testing out the default (gap) fragmenter with some simple, single-word queries on a patched 1.3.0 release populated with some real-world data. (I think the primary quirk in my setup is that I'm using ShingleFilterFactory to put word bigrams (aka shingles) into my index. I was worried that this might mess up highlighting, but highlighting is *mostly* working.) There are some oddities here, and I'm wondering if people have any suggestions for debugging my setup and/or trying to make a good, reproducible test case.

1. The main weird thing is that, the vast majority of the time, the highlighted term is the last term in the fragment. For example, if I search for cat, then almost all my fragments look like this:

fragment 1: to the *cat*
fragment 2: with the *cat*
fragment 3: it's what the *cat*
fragment 4: Once upon a time the *cat*

(My actual fragments are longer. The key thing to note is that all of these examples end in cat.) Sometimes cat will appear somewhere other than the last position, but this is rare. My expectation, in contrast, is that cat would tend to be more or less evenly distributed throughout the fragment positions. Note: I tried to reproduce this on 1.3.0 with my patches applied but using the example dataset/schema from the Solr source tree rather than my own dataset/schema. With the example dataset this didn't seem to be an issue.
I've experienced three other highlighting issues, which may or may not be related:

2. Sometimes, if a term appears multiple times in a fragment, not just the term but all the words in between the two appearances will get highlighted too. For example, I searched for fear, and got this as one of the snippets:

SETTLEMENT AGREEMENT This Agreement (the Agreement) is entered into this 18th day of August, 2008, by and between Cape <em>Fear Bank Corporation, a North Carolina corporation (the Company), and Cape Fear</em>

In contrast, I would have expected:

SETTLEMENT AGREEMENT This Agreement (the Agreement) is entered into this 18th day of August, 2008, by and between Cape <em>Fear</em> Bank Corporation, a North Carolina corporation (the Company), and Cape <em>Fear</em>

3. My install seems to have a curiously liberal interpretation of hl.fragsize. If I put hl.fragsize=0, then things are as expected, i.e. it highlights the whole field. And it also seems more or less true (as it should be) that as I increase hl.fragsize, the fragments get longer. However, I was surprised to see that when I put hl.fragsize=1 or hl.fragsize=5, I can get fragments as long as this one:

addition, we believe the wireless feature for our controller will facilitate exceptional customer services and response time. About GpsLatitude GpsLatitude, a Montreal-based company, is a provider of security solutions and tracking for mobile assets. It is also a developer of advanced Videlocalisation, a cost-effective, integrated mobile digital <em>video</em>

That seems shockingly long for something of size five.

4. Very rarely I'll get a fragment that doesn't actually contain any of the search terms. For example, maybe I'll search for cat, and I'll get back three ounces of milk as a snippet.
I need to explore this more, though the last time it happened I opened the document and found that, when I located three ounces of milk in the document text, the word cat did appear nearby; so maybe the document did contain three ounces of milk for the cat. Obviously I'm not describing my setup in much detail. Let me know what you think would be helpful to know more about. Thanks, Chris

-- View this message in context: http://www.nabble.com/Highlighting-Oddities-tp20351015p21841992.html
Sent from the Solr - User mailing list archive at Nabble.com.
Single index - multiple SOLR instances
Hello, Is it possible to have the index created by a single SOLR instance, but have several SOLR instances field the search queries? Or do I HAVE to replicate the index for each SOLR instance that I want to answer queries? I need to set up a fail-over instance. Thanks - ashok

-- View this message in context: http://www.nabble.com/Single-index---multiple-SOLR-instances-tp21422543p21422543.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Single index - multiple SOLR instances
Thanks, Otis. That is great, as I plan to place the index on NAS and make it writable to a single solr instance (write load is not heavy) and readable by many solr instances to handle fail-over and also share the query load (query load can be high) - ashok Otis Gospodnetic wrote: Ashok, You can put your index on any kind of shared storage - SAN, NAS, NFS (this one is not recommended). That will let you point all your Solr instances to a single copy of your index. Of course, you will want to test performance to ensure the network is not slowing things down too much, if there is network in the picture. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: ashokc ash...@qualcomm.com To: solr-user@lucene.apache.org Sent: Monday, January 12, 2009 3:05:40 PM Subject: Single index - multiple SOLR instances Hello, Is it possible to have the index created by a single SOLR instance, but have several SOLR instances field the search queries. Or do I HAVE to replicate the index for each SOLR instance that I want to answer queries? I need to set up a fail-over instance. Thanks - ashok -- View this message in context: http://www.nabble.com/Single-index---multiple-SOLR-instances-tp21422543p21422543.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/Single-index---multiple-SOLR-instances-tp21422543p21423138.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Boost a query by field at query time - Standard Request Handler
Thanks for the reply. I figured there is no simple solution here. I am parsing the query in my code, separating out negations, assertions and such, and building the final SOLR query to issue. I simply use the boost as given by the user. If none is given, I use a default boost for title and url matches. - ashok

hossman wrote:

: Query (can be quite complex, as it gets built from an advanced search form):
: term1^2.0 OR term2 OR term3 term4 ...
: Any matches in the title or url fields should be weighed more. I can specify

If I'm understanding you correctly: the client app can provide any arbitrary Lucene-syntax query, and you want to (server side) specify additional information about how specific fields (if specified) should be boosted ... is that correct? There is no way to do this with the standard request handler out of the box ... but you could subclass the LuceneQParser, and use your own QueryParser subclass that knows about your field boosts and adds them -- don't forget you'll have to decide what to do when the client specifies a boost on a field you've got a configured boost for (add them? ... multiply them? ...)

-Hoss

-- View this message in context: http://www.nabble.com/Boost-a-query--by-field-at-query-time---Standard-Request-Handler-tp20842675p20920307.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Indices
Thanks for the help, Yonik and Shalin. It really makes it easy for me if I do not have to stop/start the SOLR app during the merge operations. The reason I have to do this many times a day is that I am implementing a simple-minded entity-extraction procedure for the content I am indexing. I have a user-defined taxonomy into which the current documents, and any new documents, should be classified. The taxonomy defines the nested facet fields for SOLR. When a new document is posted, the user expects to have it available in the right facet right away. My classification procedure when a new document is added is as follows.

1. Create a new temporary index with that document (no taxonomy fields at this time).
2. Search this index with each of the taxonomy terms (synonyms are employed as well, through synonyms.txt) and find out which of these categories is a hit for this document.
3. Add a new "field ..." line into the document for each category that is a match for this document.
4. Repost this updated document. Now I have a new index that facets this document the same way the big index does.
5. Merge these two indices so that the new document is also part of the big index.
6. Delete the temporary index.

The reason for a new temporary index is that step 2 is A LOT quicker with a single document (or a handful). If I simply posted this new doc into the big index and then tried to classify it, the search would take a while - I have over 200 nested taxonomy fields to search over. Are there better approaches? Thanks - ashok

Yonik Seeley wrote:

On Thu, Dec 4, 2008 at 6:39 PM, ashokc [EMAIL PROTECTED] wrote:

The SOLR wiki says "3. Make sure both indexes you want to merge are closed." What exactly does 'closed' mean?

If you do a commit, and then prevent updates, the index should be closed (no open IndexWriter).

1. Do I need to stop SOLR search on both indexes before running the merge command? So a brief downtime is required?
Or do I simply prevent any 'updates/deletes' to these indices during the merge time, so they can still serve up results (read only?) while I am creating a new merged index?

Preventing updates/deletes should be sufficient.

2. Before the new index replaces the old index, do I need to stop SOLR for that instance? Or can I simply move the old index out and place the new index in the same place, without having to stop SOLR?

Yes, simply moving the index should work if you are careful to avoid any updates since the last commit.

3. If SOLR has to be stopped during the merge operation, can we work with a redundant/failover instance and stagger the merge so the search service will not go down?

Any guidelines here are welcome. Thanks - ashok

-- View this message in context: http://www.nabble.com/Merging-Indices-tp20845009p20845009.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- View this message in context: http://www.nabble.com/Merging-Indices-tp20845009p20859513.html
Sent from the Solr - User mailing list archive at Nabble.com.
Boost a query by field at query time - Standard Request Handler
Here is the problem I am trying to solve. I have to use the Standard Request Handler. Query (can be quite complex, as it gets built from an advanced search form): term1^2.0 OR term2 OR term3 term4. I have 3 fields - content (the default search field), title and url. Any matches in the title or url fields should be weighed more. I can specify index-time boosting for these two fields, but I would rather not, as it is a heavy-handed solution. I need to make it user-configurable for advanced search. What should my query to SOLR be? Something like this?

content:term1^2.0 OR content:term2 OR content:term3 term4 OR
title:term1^2.0 OR title:term2 OR title:term3 term4 OR
url:term1^2.0 OR url:term2 OR url:term3 term4

Looks like it can get pretty long and error-prone. With the 'dismax' handler I can simply specify qf=content title^2 url^2, no matter how complex the 'q' parameter is. Is there a similar, easier way I can do query-time boosting with the Standard Request Handler that I am missing? Thanks for your help - ashok

-- View this message in context: http://www.nabble.com/Boost-a-query--by-field-at-query-time---Standard-Request-Handler-tp20842675p20842675.html
Sent from the Solr - User mailing list archive at Nabble.com.
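For contrast, the dismax route mentioned above can be configured server-side in solrconfig.xml, so the per-field boosts never have to appear in the client query at all. A sketch (the handler name and boost values here are illustrative, and the exact handler class varies by Solr version):

```xml
<requestHandler name="/boosted" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- weigh title and url matches above content -->
    <str name="qf">content title^2.0 url^2.0</str>
  </lst>
</requestHandler>
```

Clients then send only q=term1 term2, and the handler expands it across the boosted fields.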
Merging Indices
The SOLR wiki says "3. Make sure both indexes you want to merge are closed." What exactly does 'closed' mean?

1. Do I need to stop SOLR search on both indexes before running the merge command? So a brief downtime is required? Or do I simply prevent any 'updates/deletes' to these indices during the merge time, so they can still serve up results (read only?) while I am creating a new merged index?

2. Before the new index replaces the old index, do I need to stop SOLR for that instance? Or can I simply move the old index out and place the new index in the same place, without having to stop SOLR?

3. If SOLR has to be stopped during the merge operation, can we work with a redundant/failover instance and stagger the merge so the search service will not go down?

Any guidelines here are welcome. Thanks - ashok

-- View this message in context: http://www.nabble.com/Merging-Indices-tp20845009p20845009.html
Sent from the Solr - User mailing list archive at Nabble.com.
solrQueryParser does not take effect - nightly build
Hi, I have set <solrQueryParser defaultOperator="AND"/> but it is not taking effect. It continues to take it as OR. I am working with the latest nightly build (11/20/2008). For a query like term1 term2, debug shows:

<str name="parsedquery">content:term1 content:term2</str>

Bug? Thanks - ashok

-- View this message in context: http://www.nabble.com/solrQueryParser-does-not-take-effect---nightly-build-tp20609974p20609974.html
Sent from the Solr - User mailing list archive at Nabble.com.