Tracing Files Which Have Errors

2014-06-19 Thread Simon Cheng
Hi there,

I have posted 190,000 simple XML files using POST.JAR, and only 8 of them
had errors. How do I find out which files those are?

Thank you in advance,
Simon Cheng.
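
One possible way to find the failing files is to send them one at a time and record every file for which the request fails. The following is a minimal sketch of that idea, not a replacement for POST.JAR. The update URL (host, port, and the core name collection1) and the helper names send_to_solr and post_files are assumptions made for illustration.

```python
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# Assumed update endpoint; adjust host, port, and core name to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def send_to_solr(path: Path) -> None:
    """POST one XML file to Solr; urlopen raises on HTTP 4xx/5xx or connection errors."""
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=path.read_bytes(),
        headers={"Content-Type": "application/xml"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()


def post_files(
    paths: Iterable[Path],
    send: Callable[[Path], None] = send_to_solr,
) -> List[Tuple[Path, str]]:
    """Send files one at a time; return (path, error message) for each failure."""
    failures = []
    for path in paths:
        try:
            send(path)
        except Exception as exc:  # HTTPError and OSError both land here
            failures.append((path, str(exc)))
    return failures
```

A call such as post_files(sorted(Path("xml_dir").glob("*.xml"))) would return the list of failing files together with the reported error; the directory name is hypothetical, and a commit would still have to be issued afterwards.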


Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-20 Thread Simon Cheng
Hi Tim,

I'm working on a similar project with some differences, and maybe we can
share our knowledge in this area:

1) I have no problem with the Chinese characters. You can try this link :

http://123.100.239.158:8983/solr/collection1/browse?q=%E4%B8%AD%E5%9B%BD

Solr can find the record even when the phrase 中国 (meaning China) is in the
middle of the sentence.

2) My problem relates more to other Asian languages; Thai and Arabic
are two examples. I read at
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters that
solr.ICUTokenizerFactory can overcome the problem, and I am exploring this
approach at the moment.

Simon.



On Sat, Jun 21, 2014 at 7:37 AM, T. Kuro Kurosaka k...@healthline.com
wrote:

 On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:

 Let's say a predominantly English document contains a Chinese sentence.
  If the English field uses the WhitespaceTokenizer with a basic
 WordDelimiterFilter, the Chinese sentence could be tokenized as one big
 token (if it doesn't have any punctuation, of course) and will be
 effectively unsearchable...barring use of wildcards.


 In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer
 generate a token per Han character. So they are searchable, though
 precision suffers. But in your scenario, Chinese text is rare, so some
 precision loss may not be a real issue.

 Kuro
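
The difference described above can be illustrated with a rough model. This is only a sketch of the two behaviours (one big token versus one token per Han character), not Lucene's actual tokenizer code. The function names are hypothetical, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re

# Basic CJK Unified Ideographs block only; an assumption that keeps the sketch short.
HAN = "\u4e00-\u9fff"


def whitespace_tokens(text: str) -> list:
    """Split on whitespace only, so an unspaced Chinese sentence stays one token."""
    return text.split()


def per_han_tokens(text: str) -> list:
    """Emit each Han character as its own token; keep other non-space runs whole."""
    return re.findall(f"[{HAN}]|[^\\s{HAN}]+", text)


sample = "Color of the Wind 我是中国人"
print(whitespace_tokens(sample))  # the Chinese sentence is a single token
print(per_han_tokens(sample))     # each Han character is a separate token
```

With per-character tokens a query for 中国 can match inside the sentence, whereas the single big token would only match a wildcard or the exact whole string.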




Re: Simple Sort Is Not Working In Solr 4.7?

2015-02-17 Thread Simon Cheng
Hi Alex,

It's okay after I added in a new field s_title in the schema and
re-indexed.

   <field name="s_title" type="string" indexed="true" stored="false"
multiValued="false"/>
   <copyField source="title" dest="s_title"/>

But how can I ignore the articles (A, An, The) when sorting? As you
can see from the example below:

http://localhost:8983/solr/bibs/select?q=singapore&fl=id,title&sort=s_title+asc&wt=xml&start=0&rows=20&indent=true

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="q">singapore</str>
<str name="indent">true</str>
<str name="fl">id,title</str>
<str name="start">0</str>
<str name="sort">s_title asc</str>
<str name="rows">20</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="18" start="0">
<doc>
<str name="id">36</str>
<str name="title">
5th SEACEN-Toronto Centre Leadership Seminar for Senior Management of
Central Banks on Financial System Oversight, 16-21 Oct 2005, Singapore
</str>
</doc>
<doc>
<str name="id">70</str>
<str name="title">
Anti-money laundering &amp; counter-terrorism financing / Commercial Affairs
Dept
</str>
</doc>
<doc>
<str name="id">15</str>
<str name="title">
China's anti-secession law : a legal perspective / Zou, Keyuan
</str>
</doc>
<doc>
<str name="id">12</str>
<str name="title">
China's currency peg : firm in the eye of the storm / Calla Wiemer
</str>
</doc>
<doc>
<str name="id">22</str>
<str name="title">
China's politics in 2004 : dawn of the Hu Jintao era / Zheng Yongnian &amp; Lye
Liang Fook
</str>
</doc>
<doc>
<str name="id">92</str>
<str name="title">
Goods and Services Tax Act [2005 ed.] (Chapter 117A)
</str>
</doc>
<doc>
<str name="id">13</str>
<str name="title">
Governing capacity in China : creating a contingent of qualified personnel
/ Kjeld Erik Brodsgaard
</str>
</doc>
<doc>
<str name="id">21</str>
<str name="title">Health care marketization in urban China / Gu Xin</str>
</doc>
<doc>
<str name="id">85</str>
<str name="title">Lianhe Zaobao, Sunday</str>
</doc>
<doc>
<str name="id">84</str>
<str name="title">
Singapore : vision of a global city / Jones Lang LaSalle
</str>
</doc>
<doc>
<str name="id">7</str>
<str name="title">
Singapore real estate investment trusts : leveraged value / Tony Darwell
</str>
</doc>
<doc>
<str name="id">96</str>
<str name="title">
Singapore's success : engineering economic growth / Henri Ghesquiere
</str>
</doc>
<doc>
<str name="id">23</str>
<str name="title">
The Chen-Soong meeting : the beginning of inter-party rapprochement in
Taiwan? / Raymond R. Wu
</str>
</doc>
<doc>
<str name="id">17</str>
<str name="title">
The Haw Par saga in the 1970s / project sponsor, Low Kwok Mun; team leader,
Sandy Ho; team members, Audrey Low ... et al
</str>
</doc>
<doc>
<str name="id">78</str>
<str name="title">The New paper on Sunday</str>
</doc>
<doc>
<str name="id">95</str>
<str name="title">
The little Red Dot : reflections by Singapore's diplomats / editors, Tommy
Koh, Chang Li Lin
</str>
</doc>
<doc>
<str name="id">52</str>
<str name="title">
[Press releases and articles on policy changes affecting the Singapore
property market] / compiled by the Information Resource Centre, Monetary
Authority of Singapore
</str>
</doc>
<doc>
<str name="id">dataq</str>
<str name="title">
Simon is testing Solr - This one is in English. Color of the Wind. 我是中国人 ,
БOΛbШ OЙ PYCCKO-KИTAЙCKИЙ CΛOBAPb , Français-Chinois
</str>
</doc>
</result>
</response>


Re: Simple Sort Is Not Working In Solr 4.7?

2015-02-17 Thread Simon Cheng
Hi Alex,

It's simply defined like this in the schema.xml :

   <field name="title" type="text_general" indexed="true" stored="true"
multiValued="false"/>

and it is cloned to the other multi-valued field o_title :

   <copyField source="title" dest="o_title"/>

Should I simply change the type to be string instead?

Thanks again,
Simon.


On Wed, Feb 18, 2015 at 12:00 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 What's the field definition for your title field? Is it just string
 or are you doing some tokenizing?

 It should be a string or a single token cleaned up (e.g. lower-cased)
 using KeywordTokenizer. In the example schema, you will normally see
 the original field tokenized and a separate sort field connected to it
 by copyField. In recent Solr versions, docValues are also recommended
 for sort fields.

 Regards,
Alex.



Simple Sort Is Not Working In Solr 4.7?

2015-02-17 Thread Simon Cheng
Hi,

I don't know whether it is my setup or something else, but a very simple
sort is not working in my Solr 4.7 environment.

The query is very simple :
http://localhost:8983/solr/bibs/select?q=author:soros&fl=id,author,title&sort=title+asc&wt=xml&start=0&indent=true

And the output is NOT sorted according to title :

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="sort">title asc</str>
<str name="fl">id,author,title</str>
<str name="indent">true</str>
<str name="start">0</str>
<str name="q">author:soros</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="13" start="0">
<doc>
<str name="id">9018</str>
<arr name="author">
<str>Soros, George, 1930-</str>
</arr>
<str name="title">
The alchemy of finance : reading the mind of the market / George Soros
</str>
</doc>
<doc>
<str name="id">15785</str>
<arr name="author">
<str>Soros, George, 1930-</str>
<str>Soros Foundations</str>
</arr>
<str name="title">Bosnia / by George Soros</str>
</doc>
<doc>
<str name="id">16281</str>
<arr name="author">
<str>Soros, George, 1930-</str>
<str>Soros Foundations</str>
</arr>
<str name="title">
Prospect for European disintegration / by George Soros
</str>
</doc>
<doc>
<str name="id">25807</str>
<arr name="author">
<str>Soros, George</str>
</arr>
<str name="title">
Open society : reforming global capitalism / George Soros
</str>
</doc>
<doc>
<str name="id">27440</str>
<str name="title">George Soros on globalization</str>
<arr name="author">
<str>Soros, George, 1930-</str>
</arr>
</doc>
<doc>
<str name="id">22254</str>
<arr name="author">
<str>Soros, George, 1930-</str>
</arr>
<str name="title">
The crisis of global capitalism : open society endangered / George Soros
</str>
</doc>
<doc>
<str name="id">16914</str>
<arr name="author">
<str>Soros, George, 1930-</str>
<str>Soros Fund Management</str>
</arr>
<str name="title">The theory of reflexivity / by George Soros</str>
</doc>
<doc>
<str name="id">17343</str>
<str name="title">
Financial turmoil in Europe and the United States : essays / George Soros
</str>
<arr name="author">
<str>Soros, George, 1930-</str>
</arr>
</doc>
<doc>
<str name="id">15542</str>
<arr name="author">
<str>Soros, George, 1930-</str>
<str>Harvard Club of New York City</str>
</arr>
<str name="title">
Nationalist dictatorships versus open society / by George Soros
</str>
</doc>
<doc>
<str name="id">15891</str>
<arr name="author">
<str>Soros, George</str>
</arr>
<str name="title">
The new paradigm for financial markets : the credit crisis of 2008 and what
it means / George Soros
</str>
</doc>
</result>
</response>

Thank you for the help in advance,
Simon.


Re: Simple Sort Is Not Working In Solr 4.7?

2015-02-18 Thread Simon Cheng
Great help and thanks to you, Alex.


On Wed, Feb 18, 2015 at 2:48 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 As I mentioned before, you could use the string type if you just want
 the title as it is. Or you can use a custom type to normalize the indexed
 value, as long as you end up with a single token.

 So, if you want to strip leading A/An/The, you can use
 KeywordTokenizer, combined with whatever post-processing you need. I
 would suggest LowerCase filter and perhaps Regex filter to strip off
 those leading articles. You may need to iterate a couple of times on
 that specific chain.

 The good news is that you can just make a couple of type definitions
 with different values/order, reload the index (from the Cores screen of
 the Web Admin UI) and run some of your sample titles through those
 different definitions in the Analysis screen, without having to
 reindex.

 Regards,
   Alex.

 

 On 17 February 2015 at 22:36, Simon Cheng simonwhch...@gmail.com wrote:
  Hi Alex,
 
  It's okay after I added in a new field s_title in the schema and
  re-indexed.
 
     <field name="s_title" type="string" indexed="true" stored="false"
  multiValued="false"/>
     <copyField source="title" dest="s_title"/>
 
  But how can I ignore the articles (A, An, The) in the sorting. As you
  can see from the below example :
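
The chain Alex describes (the whole value kept as one token, then lower-cased, then a regex that strips a leading article) can be modelled outside Solr to check the resulting order. This is only a sketch of the normalization, not Solr code. In a schema the corresponding pieces would presumably be solr.KeywordTokenizerFactory, solr.LowerCaseFilterFactory, and solr.PatternReplaceFilterFactory. The regex and the function name sort_key are assumptions for illustration, and the titles are taken from the result list above.

```python
import re

# A leading article followed by whitespace; applied after lower-casing.
LEADING_ARTICLE = re.compile(r"^(a|an|the)\s+")


def sort_key(title: str) -> str:
    """Model of the chain: one token for the whole value, lower-case, strip article."""
    token = title.strip().lower()
    return LEADING_ARTICLE.sub("", token)


titles = [
    "The New paper on Sunday",
    "Lianhe Zaobao, Sunday",
    "The little Red Dot : reflections by Singapore's diplomats / editors, "
    "Tommy Koh, Chang Li Lin",
    "Health care marketization in urban China / Gu Xin",
]

# Sorted by the normalized key, the two "The ..." titles file under L and N.
for title in sorted(titles, key=sort_key):
    print(title)
```

Lower-casing before stripping also removes the case sensitivity seen in the plain string sort, where "The New paper" came before "The little Red Dot".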



How to trace error records during POST?

2015-04-07 Thread Simon Cheng
Good morning,

I used Solr 4.7 to post 186,745 XML files and 186,622 files have been
indexed. That means there are 123 XML files with errors. How can I trace
what these files are?

Thank you in advance,
Simon Cheng.
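
A gap of 186,745 - 186,622 = 123 does not have to mean 123 rejected files: documents that share the same unique key overwrite each other and lower the indexed count as well. One way to tell the cases apart is to compare the ids found in the posted files with the ids present in the index. The sketch below assumes the files use Solr's XML update format (add/doc/field elements with an id field) and that the set of indexed ids has already been obtained by some other means. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, List, Set, Tuple


def ids_in_add_xml(xml_text: str, id_field: str = "id") -> List[str]:
    """Return the id value of every <doc> in a Solr XML update message."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed XML
    ids = []
    for doc in root.iter("doc"):
        for field in doc.findall("field"):
            if field.get("name") == id_field and field.text:
                ids.append(field.text.strip())
    return ids


def compare(
    posted: Dict[str, List[str]], indexed: Set[str]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """posted maps file name -> ids; returns (missing ids per file, duplicated ids)."""
    missing = {}
    for name, ids in posted.items():
        absent = [doc_id for doc_id in ids if doc_id not in indexed]
        if absent:
            missing[name] = absent
    counts = Counter(doc_id for ids in posted.values() for doc_id in ids)
    duplicates = sorted(doc_id for doc_id, n in counts.items() if n > 1)
    return missing, duplicates
```

Files that fail to parse at all would raise in ids_in_add_xml, which is itself a useful hint about which files were rejected.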