RE: SmartChineseAnalyzer and stopwords.txt

2012-01-07 Thread Delbosc, Sylvain
Hello,

Has anyone used SmartChineseAnalyzer to index and search Chinese content?
I would like to discuss a few things.

Best Regards,
Sylvain

From: Delbosc, Sylvain [mailto:sylvain.delb...@capgemini.com]
Sent: Thursday, January 5, 2012 14:02
To: solr-user@lucene.apache.org
Cc: Delance, Quentin
Subject: SmartChineseAnalyzer and stopwords.txt

Hello,

I would like to know how to use stopwords with SmartChineseAnalyzer.
Following what is described at 
http://lucene.apache.org/java/2_9_0/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html
it seems to be possible, but I have not managed to make it work.

Presently I am defining my analyzer like this but the stopwords.txt file 
located in the same directory as schema.xml does not seem to be taken into 
account.
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>

Has somebody managed to make this work?

NB: I am using Solr 1.4 with several cores.

Best Regards,

Sylvain DELBOSC/ Capgemini Sud / Toulouse
Application Architect Senior / TIC - ADC

Tel.: +33 5 61 31 55 70 / http://www.capgemini.com/
Fax: +33 5 61 31 53 85

15, avenue du Docteur Grynfogel
BP 53655 - 31036 Toulouse Cedex 1


Re: Indexing Failed.Rolled back all changes Issue

2012-01-07 Thread Gora Mohanty
On Fri, Jan 6, 2012 at 12:28 PM, Rajdeep Alapati
<rajdeep.alap...@benefitfocus.com> wrote:
> Hi,
>
> I am new to Solr. I have been digging into the data import request handler
> for the past few days, and I am now doing a POC after downloading the Solr
> server.
[...]

The dataimport.properties file should be created by DIH
on completion of indexing. This should be a problem only
if that file was not writeable by the user that the web
interface runs as. In any case, this should only generate
a logged error, and the only problem should be that this
would hamper future delta-imports.

There are probably other errors in your Solr log file. Please
share any such errors with us, along with your data-config.xml.

Regards,
Gora


Implementing complex token matching algo using solr

2012-01-07 Thread Sumit Thakur
Hi All,

Problem Description

I'm trying to implement a custom algorithm to match user-provided
free-text input, a company name such as "Ford Motor", against a
reference data source consisting of 1.4 million company names.

The algorithm executes the following steps:

Step 1) Performs an Exact match, followed by a Begins match and
finally a Contains match on the user-provided search input. Results from
this step are sorted in that same order.

Step 2) Performs a token-by-token match of the search input against the
reference company name.

Every token is matched in the following order: Exact, Begins, Contains,
Levenshtein Distance ( 0.2) and Refined Soundex.

E.g. if the user input is "Foord Motur Holding" and it is being matched
against "The Ford Motor Holdings Company", then the first token "Foord"
will match "Ford" via a Soundex match, the second token "Motur" will
match "Motor" via the edit-distance algorithm, and the last token "Holding"
will match "Holdings" via a Begins match.
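
The per-token cascade of Step 2 can be sketched in Python roughly as
follows. This is only an illustration under stated assumptions, not the
actual implementation: the function names are made up, the Refined Soundex
letter codes are the ones I recall from Apache Commons Codec and should be
verified, and the Levenshtein threshold is assumed to mean "edit distance
divided by the length of the reference token is at most 0.2".

```python
# Refined Soundex letter codes for A-Z, as recalled from Apache Commons Codec
# (assumption: verify against the library before relying on it).
REFINED_CODES = "01360240043788015936020505"


def refined_soundex(word):
    """First letter, then one code per letter, skipping repeated codes."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out, last = [letters[0]], None
    for c in letters:
        code = REFINED_CODES[ord(c) - ord("A")]
        if code != last:
            out.append(code)
        last = code
    return "".join(out)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


MAX_RATIO = 0.2  # assumed meaning: distance / len(reference token) <= 0.2


def match_token(token, ref_tokens):
    """Return (technique, reference token) for the first technique that hits."""
    t = token.lower()
    refs = [r.lower() for r in ref_tokens]
    checks = [
        ("exact", lambda r: r == t),
        ("begins", lambda r: r.startswith(t)),
        ("contains", lambda r: t in r),
        ("levenshtein", lambda r: levenshtein(t, r) / len(r) <= MAX_RATIO),
        ("soundex", lambda r: refined_soundex(r) == refined_soundex(t)),
    ]
    # Try the techniques from best to worst across all reference tokens.
    for name, test in checks:
        for r, original in zip(refs, ref_tokens):
            if test(r):
                return name, original
    return None, None
```

Under these assumptions, calling match_token for each of "Foord", "Motur"
and "Holding" against "The Ford Motor Holdings Company".split() yields the
Soundex, edit-distance and Begins matches described in the example.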

Scoring: Every token match is first scored on a scale that rates the
matching technique, with Exact match being the best and Soundex being
the worst.

The overall score is calculated, on a scale of 0-100%, as a weighted
average of the individual token-match scores. Weights are assigned
according to each token's position, i.e. the first token has the
highest weight and the last token the lowest.
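
As a rough sketch of that scoring, assuming made-up technique scores (only
their ordering, Exact best and Soundex worst, comes from the description)
and assuming linearly decreasing position weights:

```python
# Hypothetical per-technique scores; only their ordering is given above.
TECHNIQUE_SCORE = {
    "exact": 1.0,
    "begins": 0.8,
    "contains": 0.6,
    "levenshtein": 0.4,
    "soundex": 0.2,
}


def overall_score(techniques):
    """Weighted average in percent.

    techniques[i] is the technique that matched input token i, or None
    if the token did not match at all (scored as 0).
    """
    n = len(techniques)
    if n == 0:
        return 0.0
    # Assumed weighting: the first token weighs n, the last token weighs 1.
    weights = [n - i for i in range(n)]
    total = sum(w * TECHNIQUE_SCORE.get(t, 0.0) for w, t in zip(weights, techniques))
    return 100.0 * total / sum(weights)
```

For instance, overall_score(["soundex", "levenshtein", "begins"]) gives
(3*0.2 + 2*0.4 + 1*0.8) / 6, i.e. about 36.7%, with these illustrative
numbers.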



My Partial Solution

I have implemented a simple schema in Solr to store the reference company
names: a string field (called companyName), a simple text field
(called companyText) copied from the string field, and another text field
(called companySoundex), also copied from the string field, that uses
PhoneticFilterFactory for Refined Soundex matching.

I have been able to replicate step 1) in a single solr query.

For step 2) I plan to fire 3 parallel queries at the Solr server: the first
query performs a simple text search on the companyText field, the second
performs a fuzzy match using the ~ operator on the companyText field, and
the third performs a Soundex match on the companySoundex field. I plan
to somehow combine the results from these 3 parallel queries to get the
desired final result.
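
One possible way to do the combining, sketched here only as an idea and not
as something Solr provides: treat the three queries purely as candidate
generators, ignore their Solr scores, and re-rank the merged candidates on
the client with a scoring function of one's own. The names merge_candidates,
rerank and score_fn are hypothetical.

```python
def merge_candidates(*result_lists):
    """Union of company names across the queries, keeping first-seen order."""
    seen, merged = set(), []
    for results in result_lists:
        for name in results:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


def rerank(user_input, candidates, score_fn):
    """Sort candidates by a caller-supplied score, best first.

    score_fn(user_input, candidate_name) is assumed to return a number;
    candidates with equal scores keep their merge order.
    """
    return sorted(candidates, key=lambda name: score_fn(user_input, name), reverse=True)
```

The trade-off is that each query has to return enough rows for the true best
match to be among the candidates, since the final order no longer depends on
the Solr scores.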



Questions:

1) Is there a better way to replicate Step 2) of the original algorithm?

2) Even if I go with my three-parallel-queries approach, how do I
get the same sort order as in the original algorithm? I
guess the main problem is how to compare the Solr scores from these 3
entirely different queries when doing the final combining of results.

Thanks for reading this long question. Any help/pointers would be
greatly appreciated.

-- 
Thanks


question

2012-01-07 Thread Steve Chen
Hi Everybody,

How can we read e-mail content from a PST file, the storage file used by
Microsoft Outlook? This is not e-mail held on the Exchange server.

Thanks

Regards

Shu (Steve) Chen

Tel: 425-818-0568

Fax: 425-641-8908

Cell: 425-785-9971