Tracing Files Which Have Errors
Hi there, I have posted 190,000 simple XML files using post.jar, and only 8 of them had errors. How can I find out which files had the errors? Thank you in advance, Simon Cheng.
Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs
Hi Tim,

I'm working on a similar project with some differences, and maybe we can share our knowledge in this area:

1) I have no problem with Chinese characters. You can try this link: http://123.100.239.158:8983/solr/collection1/browse?q=%E4%B8%AD%E5%9B%BD
Solr can find the record even when the phrase 中国 (meaning China) is in the middle of a sentence.

2) My problem relates more to other Asian languages; Thai and Arabic are two examples. I read at https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters that solr.ICUTokenizerFactory can overcome the problem, and I am exploring this approach at the moment.

Simon.

On Sat, Jun 21, 2014 at 7:37 AM, T. Kuro Kurosaka k...@healthline.com wrote:

On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:

Let's say a predominantly English document contains a Chinese sentence. If the English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, the Chinese sentence could be tokenized as one big token (if it doesn't have any punctuation, of course) and will be effectively unsearchable...barring use of wildcards.

In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer generate a token per Han character. So they are searchable, though precision suffers. But in your scenario, Chinese text is rare, so some precision loss may not be a real issue.

Kuro
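The solr.ICUTokenizerFactory approach mentioned in point 2 could be tried with a field type along the following lines. This is only a minimal sketch under stated assumptions: the type name text_icu is hypothetical, the ICUFoldingFilterFactory step is an addition that the thread does not mention, and the ICU classes are assumed to be on the classpath via Solr's analysis-extras contrib jars. It has not been checked against Thai or Arabic data.

```xml
<!-- Hypothetical field type name; requires the analysis-extras contrib jars (ICU) to be loaded -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Script-aware tokenization, instead of splitting on whitespace only -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Assumption: Unicode case and accent folding is wanted; remove this filter if it is not -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Sample Thai or Arabic text can then be run through this type in the Analysis screen of the admin UI to see what tokens it produces before any re-indexing.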
Re: Simple Sort Is Not Working In Solr 4.7?
Hi Alex,

It's okay after I added in a new field s_title in the schema and re-indexed.

<field name="s_title" type="string" indexed="true" stored="false" multiValued="false"/>
<copyField source="title" dest="s_title"/>

But how can I ignore the articles (A, An, The) in the sorting? As you can see from the example below:

http://localhost:8983/solr/bibs/select?q=singapore&fl=id,title&sort=s_title+asc&wt=xml&start=0&rows=20&indent=true

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">singapore</str>
      <str name="indent">true</str>
      <str name="fl">id,title</str>
      <str name="start">0</str>
      <str name="sort">s_title asc</str>
      <str name="rows">20</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="18" start="0">
    <doc>
      <str name="id">36</str>
      <str name="title">5th SEACEN-Toronto Centre Leadership Seminar for Senior Management of Central Banks on Financial System Oversight, 16-21 Oct 2005, Singapore</str>
    </doc>
    <doc>
      <str name="id">70</str>
      <str name="title">Anti-money laundering counter-terrorism financing / Commercial Affairs Dept</str>
    </doc>
    <doc>
      <str name="id">15</str>
      <str name="title">China's anti-secession law : a legal perspective / Zou, Keyuan</str>
    </doc>
    <doc>
      <str name="id">12</str>
      <str name="title">China's currency peg : firm in the eye of the storm / Calla Wiemer</str>
    </doc>
    <doc>
      <str name="id">22</str>
      <str name="title">China's politics in 2004 : dawn of the Hu Jintao era / Zheng Yongnian Lye Liang Fook</str>
    </doc>
    <doc>
      <str name="id">92</str>
      <str name="title">Goods and Services Tax Act [2005 ed.] (Chapter 117A)</str>
    </doc>
    <doc>
      <str name="id">13</str>
      <str name="title">Governing capacity in China : creating a contingent of qualified personnel / Kjeld Erik Brodsgaard</str>
    </doc>
    <doc>
      <str name="id">21</str>
      <str name="title">Health care marketization in urban China / Gu Xin</str>
    </doc>
    <doc>
      <str name="id">85</str>
      <str name="title">Lianhe Zaobao, Sunday</str>
    </doc>
    <doc>
      <str name="id">84</str>
      <str name="title">Singapore : vision of a global city / Jones Lang LaSalle</str>
    </doc>
    <doc>
      <str name="id">7</str>
      <str name="title">Singapore real estate investment trusts : leveraged value / Tony Darwell</str>
    </doc>
    <doc>
      <str name="id">96</str>
      <str name="title">Singapore's success : engineering economic growth / Henri Ghesquiere</str>
    </doc>
    <doc>
      <str name="id">23</str>
      <str name="title">The Chen-Soong meeting : the beginning of inter-party rapprochement in Taiwan? / Raymond R. Wu</str>
    </doc>
    <doc>
      <str name="id">17</str>
      <str name="title">The Haw Par saga in the 1970s / project sponsor, Low Kwok Mun; team leader, Sandy Ho; team members, Audrey Low ... et al</str>
    </doc>
    <doc>
      <str name="id">78</str>
      <str name="title">The New paper on Sunday</str>
    </doc>
    <doc>
      <str name="id">95</str>
      <str name="title">The little Red Dot : reflections by Singapore's diplomats / editors, Tommy Koh, Chang Li Lin</str>
    </doc>
    <doc>
      <str name="id">52</str>
      <str name="title">[Press releases and articles on policy changes affecting the Singapore property market] / compiled by the Information Resource Centre, Monetary Authority of Singapore</str>
    </doc>
    <doc>
      <str name="id">dataq</str>
      <str name="title">Simon is testing Solr - This one is in English. Color of the Wind. 我是中国人 , БOΛbШ OЙ PYCCKO-KИTAЙCKИЙ CΛOBAPb , Français-Chinois</str>
    </doc>
  </result>
</response>
Re: Simple Sort Is Not Working In Solr 4.7?
Hi Alex,

It's simply defined like this in schema.xml:

<field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>

and it is copied to the other multi-valued field o_title:

<copyField source="title" dest="o_title"/>

Should I simply change the type to string instead?

Thanks again, Simon.

On Wed, Feb 18, 2015 at 12:00 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

What's the field definition for your title field? Is it just string, or are you doing some tokenizing? It should be a string or a single token cleaned up (e.g. lower-cased) using KeywordTokenizer. In the example schema, you will normally see the original field tokenized and a separate sort field, connected by a copyField. In the latest Solr, docValues are also recommended for sort fields.

Regards, Alex.
Simple Sort Is Not Working In Solr 4.7?
Hi,

I don't know whether it is my setup or something else, but a very simple sort is not working in my Solr 4.7 environment. The query is very simple:

http://localhost:8983/solr/bibs/select?q=author:soros&fl=id,author,title&sort=title+asc&wt=xml&start=0&indent=true

And the output is NOT sorted according to title:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="sort">title asc</str>
      <str name="fl">id,author,title</str>
      <str name="indent">true</str>
      <str name="start">0</str>
      <str name="q">author:soros</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="13" start="0">
    <doc>
      <str name="id">9018</str>
      <arr name="author">
        <str>Soros, George, 1930-</str>
      </arr>
      <str name="title">The alchemy of finance : reading the mind of the market / George Soros</str>
    </doc>
    <doc>
      <str name="id">15785</str>
      <arr name="author">
        <str>Soros, George, 1930-</str>
        <str>Soros Foundations</str>
      </arr>
      <str name="title">Bosnia / by George Soros</str>
    </doc>
    <doc>
      <str name="id">16281</str>
      <arr name="author">
        <str>Soros, George, 1930-</str>
        <str>Soros Foundations</str>
      </arr>
      <str name="title">Prospect for European disintegration / by George Soros</str>
    </doc>
    <doc>
      <str name="id">25807</str>
      <arr name="author">
        <str>Soros, George</str>
      </arr>
      <str name="title">Open society : reforming global capitalism / George Soros</str>
    </doc>
    <doc>
      <str name="id">27440</str>
      <str name="title">George Soros on globalization</str>
      <arr name="author">
        <str>Soros, George, 1930-</str>
      </arr>
    </doc>
    <doc>
      <str name="id">22254</str>
      <arr name="author">
        <str>Soros, George, 1930-</str>
      </arr>
      <str name="title">The crisis of global capitalism : open society endangered / George Soros</str>
    </doc>
    <doc>
      <str name="id">16914</str>
      <arr name="author">
        <str>Soros, George, 1930-</str>
        <str>Soros Fund Management</str>
      </arr>
      <str name="title">The theory of reflexivity / by George Soros</str>
    </doc>
    <doc>
      <str name="id">17343</str>
      <str name="title">Financial turmoil in Europe and the United States : essays / George Soros</str>
      <arr name="author">
        <str>Soros, George, 1930-</str>
      </arr>
    </doc>
    <doc>
      <str name="id">15542</str>
      <arr name="author">
        <str>Soros, George, 1930-</str>
        <str>Harvard Club of New York City</str>
      </arr>
      <str name="title">Nationalist dictatorships versus open society / by George Soros</str>
    </doc>
    <doc>
      <str name="id">15891</str>
      <arr name="author">
        <str>Soros, George</str>
      </arr>
      <str name="title">The new paradigm for financial markets : the credit crisis of 2008 and what it means / George Soros</str>
    </doc>
  </result>
</response>

Thank you for the help in advance, Simon.
Re: Simple Sort Is Not Working In Solr 4.7?
Great help and thanks to you, Alex.

On Wed, Feb 18, 2015 at 2:48 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Like I mentioned before, you could use the string type if you just want the title as it is. Or you can use a custom type to normalize the indexed value, as long as you end up with a single token. So, if you want to strip a leading A/An/The, you can use KeywordTokenizer combined with whatever post-processing you need. I would suggest a LowerCase filter and perhaps a Regex filter to strip off those leading articles.

You may need to iterate a couple of times on that specific chain. The good news is that you can just make a couple of type definitions with different values/order, reload the index (from the Cores screen of the Web Admin UI), and run some of your sample titles through those different definitions in the Analysis screen without having to reindex.

Regards, Alex.

On 17 February 2015 at 22:36, Simon Cheng simonwhch...@gmail.com wrote:

Hi Alex, It's okay after I added in a new field s_title in the schema and re-indexed.

<field name="s_title" type="string" indexed="true" stored="false" multiValued="false"/>
<copyField source="title" dest="s_title"/>

But how can I ignore the articles (A, An, The) in the sorting? As you can see from the below example :
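For reference, the analysis chain Alex describes in this message (KeywordTokenizer, then lower-casing, then a regex to strip leading articles) could be sketched in schema.xml roughly as follows. This is a minimal sketch, not a tested configuration: the type name title_sort, the added TrimFilterFactory step, and the exact regex are assumptions, and the chain will likely need a few iterations in the Analysis screen, as Alex suggests.

```xml
<!-- Hypothetical type name "title_sort"; the pattern below is an assumption and may need tuning -->
<fieldType name="title_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Keep the whole title as one token so the field stays sortable -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lower-case so that sorting is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: trim surrounding whitespace before matching the leading article -->
    <filter class="solr.TrimFilterFactory"/>
    <!-- Strip a single leading "a ", "an " or "the " -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(a|an|the)\s+" replacement="" replace="first"/>
  </analyzer>
</fieldType>

<!-- The sort field from the thread, switched from string to the custom type -->
<field name="s_title" type="title_sort" indexed="true" stored="false" multiValued="false"/>
<copyField source="title" dest="s_title"/>
```

With this chain, a title such as "The New paper on Sunday" would be indexed for sorting as "new paper on sunday". Titles that begin with punctuation, such as "[Press releases ...", would still sort on the leading bracket unless the pattern is extended. Changing the type of s_title requires re-indexing before the new sort order takes effect.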
How to trace error records during POST?
Good morning, I used Solr 4.7 to post 186,745 XML files, and 186,622 of them were indexed. That means 123 XML files had errors. How can I find out which files these are? Thank you in advance, Simon Cheng.