Re: Fwd: Language detection for solr 3.6.1

2014-07-09 Thread T. Kuro Kurosaka


On 07/08/2014 03:17 AM, Poornima Jay wrote:

I'm using the Google library which I mentioned in my first mail, saying I'm
using http://code.google.com/p/language-detection/. I have downloaded the jar
file from the URL below:

https://www.versioneye.com/java/org.apache.solr:solr-langid/3.6.1


Please let me know from where I need to download the correct jar file.

Regards,
I don't think you need to download anything. It's included in the Solr 3.6.1
package:


$ ls contrib/langid/lib
jsonic-1.2.7.jar jsonic-NOTICE.txt langdetect-LICENSE-ASL.txt
jsonic-LICENSE-ASL.txt langdetect-1.1-20120112.jar langdetect-NOTICE.txt

langdetect-1.1-20120112.jar is the one you find on the Google Code site.
It isn't developed by Google, but by the Japanese company Cybozu.

I used this some years ago for comparison purposes,
but I don't remember exactly how. You'd have to copy the
JARs from contrib/langid/lib into your Solr lib directory, and
use
LangDetectLanguageIdentifierUpdateProcessorFactory
instead of
TikaLanguageIdentifierUpdateProcessorFactory
in the commented-out portion of example/solr/conf/solrconfig.xml
(and you need to un-comment that portion, of course).
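For reference, the configuration is roughly shaped like this; a sketch only, where the langid.fl field names ("text,title") and the langField name ("language") are assumptions you'd adjust to your own schema:

```xml
<!-- in solrconfig.xml: an update chain using the langdetect-based factory -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text,title</str>        <!-- fields to detect the language from -->
    <str name="langid.langField">language</str>   <!-- field that receives the detected code -->
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```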

Hope this helps.

--
T. Kuro Kurosaka • Senior Software Engineer
Healthline - The Power of Intelligent Health
www.healthline.com  |@Healthline  | @HealthlineCorp



Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-20 Thread T. Kuro Kurosaka

On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:

Let's say a predominantly English document contains a Chinese sentence.  If the 
English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, 
the Chinese sentence could be tokenized as one big token (if it doesn't have 
any punctuation, of course) and will be effectively unsearchable...barring use 
of wildcards.


In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer
generate a token per Han character. So the text is searchable, though
precision suffers. But in your scenario Chinese text is rare, so some
precision loss may not be a real issue.

Kuro



Re: Strict mode at searching and indexing

2014-06-03 Thread T. Kuro Kurosaka


On 05/30/2014 08:29 AM, Erick Erickson wrote:

I see errors in both cases. Do you
1) have schemaless configured, or
2) have a dynamic field pattern that matches your non_exist_field?

Maybe
 <!-- <dynamicField name="*" type="ignored" multiValued="true" /> -->
is un-commented-out in schema.xml?

Kuro



Re: Stemming for Chinese and Japanese

2014-06-03 Thread T. Kuro Kurosaka

On 05/20/2014 11:31 AM, Geepalem wrote:

Hi,

What is the filter to be used to implement stemming for Chinese and Japanese
language field types?
For English, I have used <filter class="solr.SnowballPorterFilterFactory"
language="English"/> and it's working fine.

What do you mean by "working fine"?
Try analyzing this with the text_en field type:
単語は何個ありますか?
This Japanese sentence means "How many tokens are there?", and the correct
answer is 5, 6 or 7, depending on how you count some compound words.
You should be seeing 10 with text_en instead.

Try using text_ja. You will see 7.

I don't recommend using text_cjk for Chinese, Japanese and Korean.
They are *very* different languages, and you should be using a different
analyzer for each.

StandardTokenizer just doesn't work for Chinese and Japanese at all since
there are no spaces between words in these languages.
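For Japanese, a minimal text_ja field type looks something like the following; this is a sketch abridged from the example schema shipped with recent Solr versions, and the exact filter list varies by version:

```xml
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- kuromoji: dictionary-based morphological tokenizer for Japanese -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```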

Kuro



Any Solrj API to obtain field list?

2014-05-27 Thread T. Kuro Kurosaka
I'd like to write Solr client code that writes text to a language-specific
field, say myfield_es for Spanish, if the field myfield_es is defined in
schema.xml, and otherwise to a fall-back field myfield. To do this, I need
to obtain the list of defined fields (and dynamic fields) from the server,
but I cannot find a suitable SolrJ API. Is there one? I'm using Solr 4.6.1.
I could write code against the Schema REST API
(https://wiki.apache.org/solr/SchemaRESTAPI) but I would much prefer to use
existing code if it exists.

--
T. Kuro Kurosaka • Senior Software Engineer




Re: Any Solrj API to obtain field list?

2014-05-27 Thread T. Kuro Kurosaka

On 05/27/2014 02:29 PM, Jack Krupansky wrote:
You might consider an update request processor as an alternative. It 
runs on the server and might be simpler. You can even use the 
stateless script update processor to avoid having to write any custom 
Java code.


-- Jack Krupansky 


That's an interesting approach. I'd consider it.


On 05/27/2014 02:04 PM, Sujit Pal wrote:

Have you looked at IndexSchema? That would offer you methods to query index
metadata using SolrJ.

http://lucene.apache.org/solr/4_7_2/solr-core/org/apache/solr/schema/IndexSchema.html

-sujit

The question was essentially how to get the IndexSchema from a SolrJ client,
ideally without needing to parse the XML file.


On 05/27/2014 02:16 PM, Ahmet Arslan wrote:

Hi,

https://wiki.apache.org/solr/LukeRequestHandler (make sure numTerms=0, for
performance)


I'm afraid this won't work, because when the index is empty, Luke won't
return any fields.
And for the fields that have been written, this method returns more
information than I need.

I just want to know if a field is valid or not.


Kuro



Re: Any Solrj API to obtain field list?

2014-05-27 Thread T. Kuro Kurosaka

On 05/27/2014 02:55 PM, Steve Rowe wrote:
You can call the Schema API from SolrJ - see Shawn Heisey’s example code here:http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201307.mbox/%3c51daecd2.6030...@elyograg.org%3e  


Steve

It looks like this returns a JSON representation of the fields if I do
query.setRequestHandler("/schema/fields");

I guess this is the closest SolrJ can do.

Thank  you, Steve.



Re: Any Solrj API to obtain field list?

2014-05-27 Thread T. Kuro Kurosaka

On 05/27/2014 04:21 PM, Steve Rowe wrote:

Shawn’s code shows that SolrJ parses the JSON for you into NamedList 
(response.getResponse()). - Steve

Thank you for pointing it out.

It wasn't apparent what get(key) returns, since the method signature
of getResponse() merely says it returns a NamedList<Object>.
After running test code under a debugger, I found out that, for key="fields",
the returned object is an ArrayList<SimpleOrderedMap<NameValuePair>>.
This is what I came up with:

private static final String url = "http://localhost:8983/solr/hlbase";
private static final SolrServer server = new HttpSolrServer(url);
   ...
SolrQuery query = new SolrQuery();
query.setRequestHandler("/schema/fields");
QueryResponse response = server.query(query);

List<SimpleOrderedMap<NameValuePair>> fields =
    (ArrayList<SimpleOrderedMap<NameValuePair>>)
    response.getResponse().get("fields");

for (SimpleOrderedMap<NameValuePair> fmap : fields) {
  System.out.println(fmap.get("name"));
}


Kuro



Re: Solr special characters like '(' and '&'?

2014-04-08 Thread T. Kuro Kurosaka
I don't think & is special to the parser. Classic examples like AT&T
just work, as far as the query parser is concerned.

https://wiki.apache.org/solr/SolrQuerySyntax
even tells you that you can escape the special meaning with a backslash.

& is special in the URL, however, and has to be percent-encoded as %26 there.
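If you need to escape a query string programmatically, SolrJ ships ClientUtils.escapeQueryChars for this. A self-contained sketch of the same idea follows; the exact character set is an assumption mirroring what that utility handles, and it may differ slightly across Solr versions:

```java
public class QueryEscaper {
    // Characters treated as special by the classic/lucene query parser.
    // This list is an approximation of SolrJ's ClientUtils.escapeQueryChars.
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&;/ ";

    /** Prefix every special character with a backslash. */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("AT&T (US)")); // prints: AT\&T\ \(US\)
    }
}
```

Note that the backslash-escaped form still has to be percent-encoded separately when placed in a URL.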

On 04/08/2014 06:37 AM, Peter Kirk wrote:

Hi

How do I search for Solr special characters like '(' and '&'?



Kuro



Re: Analysis of Japanese characters

2014-04-07 Thread T. Kuro Kurosaka

Tom,
You should be using JapaneseAnalyzer (kuromoji).
Neither CJK nor ICU tokenize at word boundaries.

On 04/02/2014 10:33 AM, Tom Burton-West wrote:

Hi Shawn,

I'm not sure I understand the problem, and why you need to solve it at the
ICUTokenizer level rather than at the CJKBigramFilter level.
Can you perhaps give a few examples of the problem?

Have you looked at the flags for the CJKBigramFilter?
You can tell it to make bigrams of different Japanese character sets. For
example, the config given in the JavaDocs tells it to make bigrams across 3
of the different Japanese character sets. (Is the issue related to Romaji?)

  <filter class="solr.CJKBigramFilterFactory"
          han="true" hiragana="true"
          katakana="true" hangul="true" outputUnigrams="false"/>



http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html

Tom


On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey s...@elyograg.org wrote:


My company is setting up a system for a customer from Japan.  We have an
existing system that handles primarily English.

Here's my general text analysis chain:

http://apaste.info/xa5

After talking to the customer about problems they are encountering with
search, we have determined that some of the problems are caused because
ICUTokenizer splits on *any* character set change, including changes
between different Japanese character sets.

Knowing the risk of this being an XY problem, here's my question: Can
someone help me develop a rule file for the ICU Tokenizer that will *not*
split when the character set changes from one of the japanese character
sets to another japanese character set, but still split on other character
set changes?

Thanks,
Shawn






Re: w/10 ? [was: Partial Counts in SOLR]

2014-03-24 Thread T. Kuro Kurosaka

On 3/19/14 5:13 PM, Otis Gospodnetic wrote: Hi,

 Guessing it's surround query parser's support for within backed by span
 queries.

 Otis

You mean this?
http://wiki.apache.org/solr/SurroundQueryParser

I guess this parser's documentation needs improvement.
It doesn't explain or give an example of the w/<int> syntax at all.
(Is this an infix notation of W?)
An example would also help explain the difference between W and N;
some readers may not understand what "ordered" and "unordered"
mean in this context.

Kuro



w/10 ? [was: Partial Counts in SOLR]

2014-03-19 Thread T. Kuro Kurosaka

In the thread "Partial Counts in SOLR", Salman gave us this sample query:


((stock or share*) w/10 (sale or sell* or sold or bought or buy* or
purchase* or repurchase*)) w/10 (executive or director)


I'm not familiar with this w/10 notation. What does this mean,
and what parser(s) supports this syntax?

Kuro



Re: Apache Solr Configuration Problem (Japanese Language)

2014-03-06 Thread T. Kuro Kurosaka

Andy,
I don't have a direct answer to your question but I have a question.

On 03/05/2014 07:21 AM, Andy Alexander wrote:

fq=ss_language:ja&q=製品


I am guessing you have a field called ss_language where the language code
of each document is stored, and you have Solr documents in different
languages.


<str name="parsedquery">+DisjunctionMaxQuery((content:製品)~0.01)</str>
This indicates your default query field is "content". What does the
analyzer for this field look like?

Does the analyzer work for all of the languages that you want to support?
Many analyzers are language-dependent and won't work with multilingual
fields.


--
T. Kuro Kurosaka • Senior Software Engineer
Healthline - The Power of Intelligent Health
www.healthline.com  |@Healthline  | @HealthlineCorp



What types is supported by Solrj addBean() in the fields of POJO objects?

2014-03-03 Thread T. Kuro Kurosaka
What are the supported types for the fields of POJO objects sent to
SolrServer.addBean(obj)?

A quick glance at DocumentObjectBinder seems to suggest that an arbitrary
combination of Collection, List, ArrayList, array ([]), Map, HashMap
of primitive types, String and Date is supported, but I'm not too sure.
I would also like to know which Solr field types are allowed for each
object's (Java) field types.
Is there documentation explaining this?

Kuro


search across cores

2014-02-21 Thread T. Kuro Kurosaka
If I want to search across cores, can I use (abuse?) the distributed 
search?

My simple experiment seems to confirm this but I'd like to know if there is
any drawbacks other than those of distributed search listed here?
https://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

If all cores are served by the same machine, does a distributed
search actually make sub-search requests over HTTP? Or is it
clever enough to skip the HTTP connection?
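For the record, the kind of cross-core distributed request I mean looks something like this (host, port, and core names are made up for illustration):

```
http://localhost:8983/solr/core0/select?q=foo&shards=localhost:8983/solr/core0,localhost:8983/solr/core1
```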

Kuro



Re: Escape \\n from getting highlighted - highlighter component

2014-02-18 Thread T. Kuro Kurosaka

Your search expression means 'talk' OR 'n' OR 'text'.
I think you want to do a phrase search. To do that, quote the whole
thing with double-quotes, "talk n text", if you are using one of the
standard Solr query parsers.
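For example, the q parameter would carry the quoted phrase; the quotes and spaces must additionally be percent-encoded when placed directly in a raw URL:

```
q="talk n text"          (in a raw URL: q=%22talk%20n%20text%22)
```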



On 02/17/2014 03:53 PM, Developer wrote:

Hi,

When searching for text like 'talk n text', the highlighter component also
adds the <em> tags to special characters like \n. Is there a way to
avoid highlighting the special characters?

\\r\\n Family Messaging

  is getting replaced as

\\r\\<em>n</em> Family Messaging


Kuro



Re: geo/spatial search performance comparison using different methods

2013-11-06 Thread T. Kuro Kurosaka

Thank you, David.
I believe the field doesn't need to be multivalued.
Can you give me some idea how much query-time performance gain
we can expect by switching to LatLonType from Solr-2155?

On 11/06/2013 09:56 AM, Smiley, David W. wrote:

Hi Kuro,

I don't know of any benchmarks featuring distance-sort performance.

Presumably you are using SOLR-2155 because you have multi-valued spatial
fields?  If so, LatLonType is not an option.  SOLR-2155 sorting
performance is *probably* about the same as the equivalent in Solr 4 RPT.
If you actually do have single valued spatial to sort on, then definitely
don't use SOLR-2155 or RPT for that, use LatLonType.  It's surely faster
but I haven't measured it.

The best multi-valued distance sort option for Solr 4 is currently this:
https://issues.apache.org/jira/browse/SOLR-5170


~ David

On 11/5/13 1:36 PM, T. Kuro Kurosaka k...@healthline.com wrote:


Are there any performance comparison results available comparing various
methods
to sort result by distance (not just filtering) on Solr 3 and 4?

We are using Solr 3.5 with Solr-2155 patch. I am particularly interested
in learning
performance difference among Solr 3 LatLongType, Solr-2155 GeoHash,
Solr 4 implementation of GeoHash and Solr 4's
SpatialRecursivePrefixTreeFieldType
(location_rpt).

I see comparison of Solr 3 LatLongType vs Solr-2155
3.6.2-work/example/solr/conf/
but it is 2 years old.

--
-
T. Kuro Kurosaka • Senior Software Engineer
Healthline Networks, Inc. • Connect to Better Health
www.healthline.com




--
-
T. Kuro Kurosaka • Senior Software Engineer
p: 415-281-3100x3261  f: 415-281-3199
Healthline Networks, Inc. • Connect to Better Health
660 Third Street, San Francisco, CA 94107 www.healthline.com
About Us: www.healthlinenetworks.net | Media Kit: mediakit.healthline.com



geo/spatial search performance comparison using different methods

2013-11-05 Thread T. Kuro Kurosaka
Are there any performance comparison results available comparing various 
methods

to sort result by distance (not just filtering) on Solr 3 and 4?

We are using Solr 3.5 with Solr-2155 patch. I am particularly interested 
in learning

performance difference among Solr 3 LatLongType, Solr-2155 GeoHash,
Solr 4 implementation of GeoHash and Solr 4's 
SpatialRecursivePrefixTreeFieldType

(location_rpt).

I see comparison of Solr 3 LatLongType vs Solr-2155
3.6.2-work/example/solr/conf/
but it is 2 years old.

--
-
T. Kuro Kurosaka • Senior Software Engineer
Healthline Networks, Inc. • Connect to Better Health
www.healthline.com




Re: character encoding issue...

2013-11-05 Thread T. Kuro Kurosaka

It sounds like the characters were mishandled at index-build time.
I would use Luke to see whether a character that appears correctly
when you change the output to SHIFT-JIS is actually
stored as one Unicode character. I bet it's stored as two characters,
each having the character value of the high or low byte of the
SHIFT-JIS character.

There are many possible causes of this. If you are indexing
HTML documents from HTTP servers, the HTTP server may
be configured to send the wrong charset= info in the Content-Type
header. If the document comes directly from a file system,
and the document doesn't have a META header declaring
the charset, then the system assumes a default charset,
which is typically ISO-8859-1 or UTF-8, and misinterprets
SHIFT-JIS-encoded characters.

You need to debug to find out where the characters
get corrupted.
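The two-junk-characters-per-byte effect is easy to reproduce in plain Java; a small sketch, where the sample character 日 is just an illustration:

```java
import java.nio.charset.Charset;

public class MojibakeDemo {
    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        Charset latin1 = Charset.forName("ISO-8859-1");

        String original = "\u65e5";                 // 日, a single character
        byte[] bytes = original.getBytes(sjis);     // 2 bytes in SHIFT-JIS

        // Misinterpreting those bytes as ISO-8859-1 yields two junk
        // characters, exactly the corruption described above.
        String garbled = new String(bytes, latin1);

        // Round-tripping back through the wrong charset recovers the
        // original, which is one way to confirm this kind of corruption.
        String repaired = new String(garbled.getBytes(latin1), sjis);

        System.out.println(bytes.length + " " + garbled.length() + " "
                + repaired.equals(original));
        // prints: 2 2 true
    }
}
```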

On 11/04/2013 11:15 PM, Chris wrote:

Sorry, I was away for a bit, hence the delay.

I am inserting Java strings into a Java bean class, and then calling the
addBean() method to insert the POJO into Solr.

When I query using either Tomcat or Jetty, I get these special characters.
But I have noticed that if I change the output to Shift-JIS encoding, then
those characters appear as some Japanese characters, I think.

But this solution doesn't work for all special characters, as I can still
see some of them... isn't there an encoding that can cover all the
characters, whatever they might be? Any ideas on what I should do?

Regards,
Chris


On Mon, Nov 4, 2013 at 6:27 PM, Erick Erickson erickerick...@gmail.com wrote:


The problem is there are about a dozen places where the character
encoding can be mis-configured. The problem you're seeing above
actually looks like a problem with the character set configured in
your browser, it may have nothing to do with what's actually in Solr.

You might write a small SolrJ program and see if you can dump the contents
in binary and examine them to see...

Best
Erick


On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski rajinima...@gmail.com
wrote:


How are you extracting the text that is on the website [1] you are
referring to? Apache Nutch or some other crawler? If so, first check
whether that crawler engine is giving you data in the correct format before
you invoke the Solr index method.

[1] http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/

URI encoding should resolve this problem.




On Fri, Nov 1, 2013 at 10:50 AM, Chris christu...@gmail.com wrote:


Hi Rajani,

I followed the steps exactly as in



http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/

However, when I send a query to this new instance in Tomcat, I again get
the error:

   <str name="fulltxt">Scheduled Groups Maintenance
In preparation for the new release roll-out, Diigo groups won’t be
accessible on Sept 28 (Mon) around midnight 0:00 PST for several hours.
Stay tuned to say hello to Diigo V4 soon!

location of the text  -
http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/

same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/

All text in the title comes out like:

 - �
</str>
 <arr name="text">
   <str> -
� </str>
 </arr>


Can you please advice?

Chris




On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski rajinima...@gmail.com

wrote:
Hi,

If you are using Apache Tomcat server, I hope you are not missing the
below-mentioned configuration:

  <Connector port="port number" protocol="HTTP/1.1"
    connectionTimeout="2"
    redirectPort="8443" URIEncoding="UTF-8"/>

I had faced a similar issue with Chinese characters and had resolved it
with the above config.

Links for reference :



http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/



http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8


Thanks



On Tue, Oct 29, 2013 at 9:20 PM, Chris christu...@gmail.com wrote:


Hi All,

I get characters like -

�� - CTA -

in the solr index. I am adding Java beans to solr by the addBean()
function.

This seems to be a character encoding issue. Any pointers on how to
resolve this one?

I have seen that this occurs mostly for Japanese & Chinese characters.


--
-
T. Kuro Kurosaka • Senior Software Engineer



Phrase query with prefix query

2013-08-02 Thread T. Kuro Kurosaka
Is there a query parser that supports a phrase query with prefix query 
at the end, such as San Fran* ?


--
-
T. Kuro Kurosaka • Senior Software Engineer



Re: predefined variables usable in schema.xml ?

2012-11-30 Thread T. Kuro Kurosaka
I tried to use ${solr.core.instanceDir} in schema.xml with Solr 4.0,
where every deployment is multi-core, and it didn't work.
Either the description of pre-defined properties on the CoreAdmin wiki
page is wrong, or it only works in solrconfig.xml, perhaps?


On 11/28/12 5:17 PM, T. Kuro Kurosaka wrote:

Thank you, Hoss.

I found this SolrWiki page talks about pre-defined properties such as 
solr.core.instanceDir:

http://wiki.apache.org/solr/CoreAdmin

I tried to use ${solr.core.instanceDir} in the default single-core 
schema.xml, and it didn't work.
Is this page wrong, or are these properties available only in
multi-core deployments?


On 11/27/12 2:27 PM, Chris Hostetter wrote:
: The default solrconfig.xml seems to suggest ${solr.data.dir} can be 
used.
: So I am hoping there is another pre-defined variable like this that 
points to

: the solr core directory.

there's nothing special about solr.data.dir ... it's used in the example
configs as a convenient way to let you override it on the command line
when running the example, otherwise it defaults to the empty string which
triggers the default dataDir logic (ie: ./data in the instanceDir)...

<dataDir>${solr.data.dir:}</dataDir>

: <charFilter class="com.basistech.rlp.solr.RCLUNormalizeCharFilterFactory"
:   rlpContext="solr/conf/rlp-context-rclu.xml"/>
:
: This only works if Solr is started from $SOLR_HOME/example, as it 
is relative

: to the current working directory.

if your factories are using SolrResourceLoader.openResource to load
those files, then you can change that to just be
'rlpContext="rlp-context-rclu.xml"'
and it will just plain work -- the SolrResourceLoader is
SolrCloud/ZooKeeper aware, and in standalone mode checks the conf dir,
the classpath, and as a last resort attempts to resolve it as a relative
path -- if your custom factories just call new File(rlpContext) on the
string, then you're stuck using absolute paths, or needing to define
system properties at runtime.


-Hoss






Re: predefined variables usable in schema.xml ?

2012-11-30 Thread T. Kuro Kurosaka

Sorry, correction.
${solr.core.instanceDir} is working in a sense.  It is replaced by the 
core name, rather than a directory path.

In an earlier startup time Solr prints out:
INFO: Creating SolrCore 'collection1' using instanceDir: solr/collection1
But judging from the error message I get, ${solr.core.instanceDir} is 
replaced by the value collection1  (no solr/).


I was hoping that ${solr.core.instanceDir} would be replaced by the 
absolute path to the examples/core/collection1 directory.


On 11/30/12 2:41 PM, T. Kuro Kurosaka wrote:
I tried to use ${solr.core.instanceDir} in schema.xml with Solr 4.0,
where every deployment is multi-core, and it didn't work.
Either the description of pre-defined properties on the CoreAdmin wiki
page is wrong, or it only works in solrconfig.xml, perhaps?


On 11/28/12 5:17 PM, T. Kuro Kurosaka wrote:

Thank you, Hoss.

I found this SolrWiki page talks about pre-defined properties such as 
solr.core.instanceDir:

http://wiki.apache.org/solr/CoreAdmin

I tried to use ${solr.core.instanceDir} in the default single-core 
schema.xml, and it didn't work.
Is this page wrong, or are these properties available only in
multi-core deployments?


On 11/27/12 2:27 PM, Chris Hostetter wrote:
: The default solrconfig.xml seems to suggest ${solr.data.dir} can 
be used.
: So I am hoping there is another pre-defined variable like this 
that points to

: the solr core directory.

there's nothing special about solr.data.dir ... it's used in the example
configs as a convenient way to let you override it on the command line
when running the example, otherwise it defaults to the empty string which
triggers the default dataDir logic (ie: ./data in the instanceDir)...

<dataDir>${solr.data.dir:}</dataDir>

: <charFilter class="com.basistech.rlp.solr.RCLUNormalizeCharFilterFactory"
:   rlpContext="solr/conf/rlp-context-rclu.xml"/>
:
: This only works if Solr is started from $SOLR_HOME/example, as it 
is relative

: to the current working directory.

if your factories are using SolrResourceLoader.openResource to load
those files, then you can change that to just be
'rlpContext="rlp-context-rclu.xml"'
and it will just plain work -- the SolrResourceLoader is
SolrCloud/ZooKeeper aware, and in standalone mode checks the conf dir,
the classpath, and as a last resort attempts to resolve it as a relative
path -- if your custom factories just call new File(rlpContext) on the
string, then you're stuck using absolute paths, or needing to define
system properties at runtime.


-Hoss








Re: predefined variables usable in schema.xml ?

2012-11-28 Thread T. Kuro Kurosaka

Thank you, Hoss.

I found this SolrWiki page talks about pre-defined properties such as 
solr.core.instanceDir:

http://wiki.apache.org/solr/CoreAdmin

I tried to use ${solr.core.instanceDir} in the default single-core 
schema.xml, and it didn't work.
Is this page wrong, or are these properties available only in multi-core
deployments?


On 11/27/12 2:27 PM, Chris Hostetter wrote:

: The default solrconfig.xml seems to suggest ${solr.data.dir} can be used.
: So I am hoping there is another pre-defined variable like this that points to
: the solr core directory.

there's nothing special about solr.data.dir ... it's used in the example
configs as a convenient way to let you override it on the command line
when running the example, otherwise it defaults to the empty string which
triggers the default dataDir logic (ie: ./data in the instanceDir)...

   <dataDir>${solr.data.dir:}</dataDir>

: <charFilter class="com.basistech.rlp.solr.RCLUNormalizeCharFilterFactory"
:   rlpContext="solr/conf/rlp-context-rclu.xml"/>
:
: This only works if Solr is started from $SOLR_HOME/example, as it is relative
: to the current working directory.

if your factories are using SolrResourceLoader.openResource to load
those files, then you can change that to just be
'rlpContext="rlp-context-rclu.xml"'
and it will just plain work -- the SolrResourceLoader is
SolrCloud/ZooKeeper aware, and in standalone mode checks the conf dir,
the classpath, and as a last resort attempts to resolve it as a relative
path -- if your custom factories just call new File(rlpContext) on the
string, then you're stuck using absolute paths, or needing to define
system properties at runtime.


-Hoss




predefined variables usable in schema.xml ?

2012-11-27 Thread T. Kuro Kurosaka
Is there a pre-defined variable that can be used in schema.xml to point 
to the solr core directory, or the conf subdirectory?
I thought ${solr.home} or perhaps ${solr.solr.home} might work but they 
didn't (unless -Dsolr.home=/my/solr/home is supplied, that is).

The default solrconfig.xml seems to suggest ${solr.data.dir} can be used.
So I am hoping there is another pre-defined variable like this that 
points to the solr core directory.


Use case, in case you wonder:
We have our own custom CharFilter, Tokenizer and TokenFilter, and their
corresponding factories.

Currently we ship a schema.xml that contains lines like:

<charFilter class="com.basistech.rlp.solr.RCLUNormalizeCharFilterFactory"
  rlpContext="solr/conf/rlp-context-rclu.xml"/>

This only works if Solr is started from $SOLR_HOME/example, as the path is
relative to the current working directory.
Our customers have to change the value to an absolute path if they'd
like to use Tomcat or any web container other than Solr's built-in Jetty.

We'd rather write something like this:

<charFilter class="com.basistech.rlp.solr.RCLUNormalizeCharFilterFactory"
  rlpContext="${solr.conf.dir}/rlp-context-rclu.xml"/>

Kuro



Re: Any filter to map mutiple tokens into one ?

2012-10-15 Thread T. Kuro Kurosaka

On 10/14/12 12:19 PM, Jack Krupansky wrote:
There's a miscommunication here somewhere. Is Solr 4.0 still passing 
*:* to the analyzer? Show us the parsed query for *:*, as well as 
the debugQuery explain for the score.

I'm not quite sure what you mean by "the parsed query for *:*".
This fake analyzer using NGramTokenizer divides "*:*" into three tokens,
"*", ":", and "*", on purpose, to simulate our Tokenizer's behavior.


An excerpt of the XML results from the query is pasted at the bottom of
this message.


I mean, *:* (MatchAllDocsQuery) has a constant score, so there 
isn't any way for it to be suboptimal.

That's exactly the point I'd like to raise.
No matter what analyzers are assigned to fields, the hit score for *:*
must remain 1.0, but that's not happening when an analyzer that divides
"*:*" is in use.



Here's an excerpt of the query response. Notice this element, which
should not be there, in my opinion:

DisjunctionMaxQuery((name:"* : *"^0.5))

There is a space between * and :, and another space between : and *.

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">33</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="wt"/>
    <str name="version">2.2</str>
    <str name="rows">10</str>
    <str name="defType">edismax</str>
    <str name="pf">name^0.5</str>
    <str name="fl">*,score</str>
    <str name="debugQuery">on</str>
    <str name="start">0</str>
    <str name="q">*:*</str>
    <str name="qt"/>
    <str name="fq"/>
  </lst>
</lst>
<result name="response" numFound="32" start="0" maxScore="0.14764866">
  <doc>
    <str name="id">GB18030TEST</str>
    <str name="name">Test with some GB18030 encoded characters</str>
    <arr name="features">
      <str>No accents here</str>
      <str>这是一个功能</str>
      <str>This is a feature (translated)</str>
      <str>这份文件是很有光泽</str>
      <str>This document is very shiny (translated)</str>
    </arr>
    <float name="price">0.0</float>
    <str name="price_c">0,USD</str>
    <bool name="inStock">true</bool>
    <long name="_version_">1415830106215022592</long>
    <float name="score">0.14764866</float>
  </doc>
  ...
</result>
<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">
    (+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:"* : *"^0.5)))/no_coord
  </str>
  <str name="parsedquery_toString">+*:* (name:"* : *"^0.5)</str>
  <lst name="explain">
    <str name="GB18030TEST">
      0.14764866 = (MATCH) sum of: 0.14764866 = (MATCH) MatchAllDocsQuery,
      product of: 0.14764866 = queryNorm
    </str>
  </lst>
  <str name="QParser">ExtendedDismaxQParser</str>
  <null name="altquerystring"/>
  <null name="boostfuncs"/>
  ...
</lst>
</response>



Re: Any filter to map mutiple tokens into one ?

2012-10-15 Thread T. Kuro Kurosaka

On 10/15/12 10:35 AM, Jack Krupansky wrote:
And you're absolutely certain you see *:* being passed to your 
analyzer in the final release of Solr 4.0???
I don't have direct evidence. This is the only theory I have that
explains why changing the FieldType causes the sub-optimal scores.

If you know of a way to tell whether a tokenizer is really invoked, let me know.



-- Jack Krupansky

-Original Message- From: T. Kuro Kurosaka
Sent: Monday, October 15, 2012 1:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Any filter to map mutiple tokens into one ?

On 10/14/12 12:19 PM, Jack Krupansky wrote:
There's a miscommunication here somewhere. Is Solr 4.0 still passing 
*:* to the analyzer? Show us the parsed query for *:*, as well as 
the debugQuery explain for the score.

I'm not quite sure what you mean by "the parsed query for *:*".
This fake analyzer using NGramTokenizer divides "*:*" into three tokens,
"*", ":", and "*", on purpose, to simulate our Tokenizer's behavior.

An excerpt of the XML results from the query is pasted at the bottom of
this message.


I mean, *:* (MatchAllDocsQuery) has a constant score, so there 
isn't any way for it to be suboptimal.

That's exactly the point I'd like to raise.
No matter what analyzers are assigned to fields, the hit score for *:*
must remain 1.0, but that's not happening when an analyzer that divides
"*:*" is in use.


Here's an excerpt of the query response. Notice this element, which
should not be there, in my opinion:
DisjunctionMaxQuery((name:"* : *"^0.5))
There is a space between * and :, and another space between : and *.

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">33</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="wt"/>
    <str name="version">2.2</str>
    <str name="rows">10</str>
    <str name="defType">edismax</str>
    <str name="pf">name^0.5</str>
    <str name="fl">*,score</str>
    <str name="debugQuery">on</str>
    <str name="start">0</str>
    <str name="q">*:*</str>
    <str name="qt"/>
    <str name="fq"/>
  </lst>
</lst>
<result name="response" numFound="32" start="0" maxScore="0.14764866">
  <doc>
    <str name="id">GB18030TEST</str>
    <str name="name">Test with some GB18030 encoded characters</str>
    <arr name="features">
      <str>No accents here</str>
      <str>这是一个功能</str>
      <str>This is a feature (translated)</str>
      <str>这份文件是很有光泽</str>
      <str>This document is very shiny (translated)</str>
    </arr>
    <float name="price">0.0</float>
    <str name="price_c">0,USD</str>
    <bool name="inStock">true</bool>
    <long name="_version_">1415830106215022592</long>
    <float name="score">0.14764866</float>
  </doc>
  ...
</result>
<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">
    (+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:* : *^0.5)))/no_coord
  </str>
  <str name="parsedquery_toString">+*:* (name:* : *^0.5)</str>
  <lst name="explain">
    <str name="GB18030TEST">
      0.14764866 = (MATCH) sum of: 0.14764866 = (MATCH) MatchAllDocsQuery,
      product of: 0.14764866 = queryNorm
    </str>
  </lst>
  <str name="QParser">ExtendedDismaxQParser</str>
  <null name="altquerystring"/>
  <null name="boostfuncs"/>
  ...
</lst>
</response>
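The queryNorm in the explain output above suggests why the score falls below 1.0. In the classic Lucene TF-IDF similarity, queryNorm = 1/sqrt(sumOfSquaredWeights) over all query clauses; once the unwanted DisjunctionMaxQuery clause joins the MatchAllDocsQuery, the sum grows and the norm shrinks. The weights below are made up for illustration (the real ones depend on idf and boosts); only the mechanism is the point:

```python
import math

def query_norm(squared_weights):
    # Classic Lucene TF-IDF: queryNorm = 1 / sqrt(sumOfSquaredWeights)
    return 1.0 / math.sqrt(sum(squared_weights))

# "*:*" alone: one clause with weight 1.0 -> the norm (and the score) is 1.0
print(query_norm([1.0 ** 2]))

# "*:*" plus a stray pf clause (illustrative weight 2.5): the norm, and
# with it the MatchAllDocsQuery score, drops below 1.0
print(query_norm([1.0 ** 2, 2.5 ** 2]))
```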




Re: Any filter to map multiple tokens into one?

2012-10-14 Thread T. Kuro Kurosaka

Jack,
I don't think SOLR-3261 describes this issue.
I ran the same experiment with Solr 3.6, and the score for all the 
matches was 0.1626374.

The newly released Solr 4.0.0 also returns a suboptimal score of 0.14764866.

Kuro

On 10/12/12 2:03 PM, Jack Krupansky wrote:
I don't have a Solr 3.5 to check, but SOLR-3261, which was fixed in 
Solr 3.6 may be your culprit.


See:
https://issues.apache.org/jira/browse/SOLR-3261

So, try Solr 3.6, 3.6.1, or 4.0 to see if your issue goes away.

-- Jack Krupansky

-Original Message- From: T. Kuro Kurosaka
Sent: Friday, October 12, 2012 3:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Any filter to map multiple tokens into one?

Jack,
It goes like this:

http://myhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on 



and edismax is the default query parser in solrconfig.xml.

There is a field named text_jpn that uses a Tokenizer that we developed
as a product, which we can't share here.

But I can simulate our situation using NGramTokenizer.
After indexing the Solr sample docs normally, stop the Solr and insert:

<fieldtype name="text_fake" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory"
               maxGramSize="1"
               minGramSize="1" />
  </analyzer>
</fieldtype>
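The query analyzer above can be simulated outside Solr. This is a rough Python stand-in for an NGramTokenizer configured with minGramSize=maxGramSize=1 (the real tokenizer's whitespace handling may differ), showing why *:* reaches the query side as three one-character tokens:

```python
def unigram_tokenize(text):
    """Emit one token per non-space character, approximating an
    NGramTokenizer configured with minGramSize=1 and maxGramSize=1."""
    return [ch for ch in text if not ch.isspace()]

print(unigram_tokenize("*:*"))  # ['*', ':', '*']
```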

Replace the field definition for name, for example:
<field name="name" type="text_fake" indexed="true" stored="true"/>

In solrconfig.xml, change the default search handler's definition like 
this:

<str name="defType">edismax</str>
<str name="pf">name^0.5</str>
(I guess I could just have these in the URL.)

Start Solr and give this URL:

http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on&explainOther=&hl.fl= 



Hopefully you'll see
<float name="score">0.3663672</float>
and
+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:* : *^0.5))

in the debug output.

The score calculation should not be done when the query is "*:*", which
has a special meaning, should it?
And even if the score calculation is done, "*:*" shouldn't be fed to
Tokenizers, should it?

On 10/12/12 9:44 AM, Jack Krupansky wrote:

Okay, let's back up. First, hold off mixing in your proposed solution
until after we understand the actual, original problem:

1. What is your field and field type (with analyzer details)?
2. What is your query parser (defType)?
3. What is your query request URL?
4. What is the parsed query (add debugQuery=true to your query
request)? (Actually, I think you gave us that)

I just tried the following query with the fresh 4.0 release and it
works fine:

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&debugQuery=true&defType=edismax 




<str name="rawquerystring">*:*</str>

The parsed query is:

<str name="parsedquery">(+MatchAllDocsQuery(*:*))/no_coord</str>

And this was with the 4.0 example schema, adding *.xml and books.json
documents.

If you could try your scenario with 4.0 that would be a help. If it's
a bug in 3.5 that is fixed now... oh well. I mean, feel free to check
the revision history for edismax since the 3.5 release.

-- Jack Krupansky

-Original Message- From: T. Kuro Kurosaka
Sent: Friday, October 12, 2012 11:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Any filter to map multiple tokens into one?

On 10/11/12 4:47 PM, Jack Krupansky wrote:

The ":" which normally separates a field name from a term (or quoted
string or parenthesized sub-query) is parsed by the query parser
before analysis gets called, and "*:*" is recognized before analysis
as well. So, any attempt to recreate "*:*" in analysis will be too
late to affect query parsing and other pre-analysis processing.

That's why I suspect a bug in Solr. The Tokenizer shouldn't play any role
here, but it is affecting the score calculation. I am seeing evidence
that "*:*" is being passed to my tokenizer.
I'm trying to find a way to work around this by reconstructing "*:*" in
the analysis chain.


But, what is it you are really trying to do? What's the real problem?
(This sounds like a proverbial XY Problem.)

-- Jack Krupansky

-Original Message- From: T. Kuro Kurosaka
Sent: Thursday, October 11, 2012 7:35 PM
To: solr-user@lucene.apache.org
Subject: Any filter to map multiple tokens into one?

I am looking for a way to fold a particular sequence of tokens into one
token.
Concretely, I'd like to detect a three-token sequence of "*", ":" and
"*", and replace it with a single token with the text "*:*".
I tried SynonymFilter, but it seems it can only deal with a single input
token. "* : *" => "*:*" seems to be interpreted
as one input token of 5 characters: "*", space, ":", space and "*".

I'm using Solr 3.5.

Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens
of one character each.
The edismax parser, when given the query "*:*", i.e. find every doc,
seems to pass the entire string "*:*" to the query analyzer (I suspect
a bug.)

Re: Any filter to map multiple tokens into one?

2012-10-12 Thread T. Kuro Kurosaka

On 10/11/12 4:47 PM, Jack Krupansky wrote:
The ":" which normally separates a field name from a term (or quoted 
string or parenthesized sub-query) is parsed by the query parser 
before analysis gets called, and "*:*" is recognized before analysis 
as well. So, any attempt to recreate "*:*" in analysis will be too 
late to affect query parsing and other pre-analysis processing.
That's why I suspect a bug in Solr. The Tokenizer shouldn't play any role 
here, but it is affecting the score calculation. I am seeing evidence 
that "*:*" is being passed to my tokenizer.
I'm trying to find a way to work around this by reconstructing "*:*" in 
the analysis chain.


But, what is it you are really trying to do? What's the real problem? 
(This sounds like a proverbial XY Problem.)


-- Jack Krupansky

-Original Message- From: T. Kuro Kurosaka
Sent: Thursday, October 11, 2012 7:35 PM
To: solr-user@lucene.apache.org
Subject: Any filter to map multiple tokens into one?

I am looking for a way to fold a particular sequence of tokens into one
token.
Concretely, I'd like to detect a three-token sequence of "*", ":" and
"*", and replace it with a single token with the text "*:*".
I tried SynonymFilter, but it seems it can only deal with a single input
token. "* : *" => "*:*" seems to be interpreted
as one input token of 5 characters: "*", space, ":", space and "*".

I'm using Solr 3.5.

Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens
of one character each.
The edismax parser, when given the query "*:*", i.e. find every doc,
seems to pass the entire string "*:*" to the query analyzer (I suspect
a bug.),
and feeds the tokenized result to a DisjunctionMaxQuery object,
according to this debug output:

<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*)
DisjunctionMaxQuery((body:* : *~100^0.5 | title:* :
*~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:* : *~100^0.5 | title:* :
*~100^1.2)~0.01</str>

Notice that there is a space between "*" and ":" in
DisjunctionMaxQuery((body:* : * )

Probably because of this, the hit score is as low as 0.109, while it is
1.000 if an analyzer that doesn't break "*:*" is used.
So I'd like to stitch together "*", ":", "*" into "*:*" again to make
DisjunctionMaxQuery happy.


Thanks.


T. Kuro Kurosaka





Re: Any filter to map multiple tokens into one?

2012-10-12 Thread T. Kuro Kurosaka

Jack,
It goes like this:

http://myhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on

and edismax is the default query parser in solrconfig.xml.

There is a field named text_jpn that uses a Tokenizer that we developed 
as a product, which we can't share here.


But I can simulate our situation using NGramTokenizer.
After indexing the Solr sample docs normally, stop the Solr and insert:

<fieldtype name="text_fake" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory"
               maxGramSize="1"
               minGramSize="1" />
  </analyzer>
</fieldtype>

Replace the field definition for name, for example:
<field name="name" type="text_fake" indexed="true" stored="true"/>

In solrconfig.xml, change the default search handler's definition like this:
<str name="defType">edismax</str>
<str name="pf">name^0.5</str>
(I guess I could just have these in the URL.)

Start Solr and give this URL:

http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on&explainOther=&hl.fl=

Hopefully you'll see
<float name="score">0.3663672</float>
and
+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:* : *^0.5))

in the debug output.

The score calculation should not be done when the query is "*:*", which
has a special meaning, should it?
And even if the score calculation is done, "*:*" shouldn't be fed to
Tokenizers, should it?


On 10/12/12 9:44 AM, Jack Krupansky wrote:
Okay, let's back up. First, hold off mixing in your proposed solution 
until after we understand the actual, original problem:


1. What is your field and field type (with analyzer details)?
2. What is your query parser (defType)?
3. What is your query request URL?
4. What is the parsed query (add debugQuery=true to your query 
request)? (Actually, I think you gave us that)


I just tried the following query with the fresh 4.0 release and it 
works fine:


http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&debugQuery=true&defType=edismax 



<str name="rawquerystring">*:*</str>

The parsed query is:

<str name="parsedquery">(+MatchAllDocsQuery(*:*))/no_coord</str>

And this was with the 4.0 example schema, adding *.xml and books.json 
documents.


If you could try your scenario with 4.0 that would be a help. If it's 
a bug in 3.5 that is fixed now... oh well. I mean, feel free to check 
the revision history for edismax since the 3.5 release.


-- Jack Krupansky

-Original Message- From: T. Kuro Kurosaka
Sent: Friday, October 12, 2012 11:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Any filter to map multiple tokens into one?

On 10/11/12 4:47 PM, Jack Krupansky wrote:
The ":" which normally separates a field name from a term (or quoted 
string or parenthesized sub-query) is parsed by the query parser 
before analysis gets called, and "*:*" is recognized before analysis 
as well. So, any attempt to recreate "*:*" in analysis will be too 
late to affect query parsing and other pre-analysis processing.

That's why I suspect a bug in Solr. The Tokenizer shouldn't play any role
here, but it is affecting the score calculation. I am seeing evidence
that "*:*" is being passed to my tokenizer.
I'm trying to find a way to work around this by reconstructing "*:*" in
the analysis chain.


But, what is it you are really trying to do? What's the real problem? 
(This sounds like a proverbial XY Problem.)


-- Jack Krupansky

-Original Message- From: T. Kuro Kurosaka
Sent: Thursday, October 11, 2012 7:35 PM
To: solr-user@lucene.apache.org
Subject: Any filter to map multiple tokens into one?

I am looking for a way to fold a particular sequence of tokens into one
token.
Concretely, I'd like to detect a three-token sequence of "*", ":" and
"*", and replace it with a single token with the text "*:*".
I tried SynonymFilter, but it seems it can only deal with a single input
token. "* : *" => "*:*" seems to be interpreted
as one input token of 5 characters: "*", space, ":", space and "*".

I'm using Solr 3.5.

Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens
of one character each.
The edismax parser, when given the query "*:*", i.e. find every doc,
seems to pass the entire string "*:*" to the query analyzer (I suspect
a bug.),
and feeds the tokenized result to a DisjunctionMaxQuery object,
according to this debug output:

<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*)
DisjunctionMaxQuery((body:* : *~100^0.5 | title:* :
*~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:* : *~100^0.5 | title:* :
*~100^1.2)~0.01</str>

Notice that there is a space between "*" and ":" in
DisjunctionMaxQuery((body:* : * )

Probably because of this, the hit score is as low as 0.109, while it is
1.000 if an analyzer that doesn't break "*:*" is used.
So I'd like to stitch together "*", ":", "*" into "*:*" again to make
DisjunctionMaxQuery happy.


Thanks

Any filter to map multiple tokens into one?

2012-10-11 Thread T. Kuro Kurosaka
I am looking for a way to fold a particular sequence of tokens into one 
token.
Concretely, I'd like to detect a three-token sequence of "*", ":" and 
"*", and replace it with a single token with the text "*:*".
I tried SynonymFilter, but it seems it can only deal with a single input 
token. "* : *" => "*:*" seems to be interpreted
as one input token of 5 characters: "*", space, ":", space and "*".

I'm using Solr 3.5.

Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens 
of one character each.
The edismax parser, when given the query "*:*", i.e. find every doc, 
seems to pass the entire string "*:*" to the query analyzer (I suspect 
a bug.),
and feeds the tokenized result to a DisjunctionMaxQuery object,
according to this debug output:

<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*)
DisjunctionMaxQuery((body:* : *~100^0.5 | title:* :
*~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:* : *~100^0.5 | title:* :
*~100^1.2)~0.01</str>


Notice that there is a space between "*" and ":" in 
DisjunctionMaxQuery((body:* : * )


Probably because of this, the hit score is as low as 0.109, while it is 
1.000 if an analyzer that doesn't break "*:*" is used.
So I'd like to stitch together "*", ":", "*" into "*:*" again to make 
DisjunctionMaxQuery happy.
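The stitching step can be prototyped outside Solr. This is only an illustrative Python filter, not Lucene's actual TokenFilter API (a real filter would need streaming lookahead and offset bookkeeping); it collapses the exact token sequence *, :, * back into one *:* token and passes everything else through:

```python
def merge_star_colon_star(tokens):
    """Collapse the token sequence '*', ':', '*' into a single '*:*'
    token, leaving all other tokens unchanged."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        # When the last three buffered tokens form the target sequence,
        # replace them with the merged token.
        if buf[-3:] == ["*", ":", "*"]:
            del buf[-3:]
            buf.append("*:*")
    return buf

print(merge_star_colon_star(["*", ":", "*"]))       # ['*:*']
print(merge_star_colon_star(["a", "*", ":", "b"]))  # ['a', '*', ':', 'b']
```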



Thanks.


T. Kuro Kurosaka




Why does Solr (1.4.1) keep so many Tokenizer objects?

2012-09-08 Thread T. Kuro Kurosaka

While investigating a bug, I found that Solr keeps many Tokenizer objects.

This experimental 80-core Solr 1.4.1 system runs on Tomcat. It was 
continuously sent indexing requests in parallel, and it eventually died 
due to an OutOfMemoryError.
The heap dump taken by the JVM shows there were 14477 Tokenizer 
objects, or about 180 Tokenizer objects per core, at the time it died.
Each core's schema.xml has only 5 fields that use this Tokenizer, so 
I'd think 5 Tokenizers per indexing thread are needed at most.
Tomcat at its default configuration can run up to 200 threads, so at 
most 1000 Tokenizer objects should be enough.


My colleague ran a similar experiment on a 10-core Solr 3.6 system and 
observed fewer Tokenizer objects there, but there were still 48 
Tokenizers per core.

Why does Solr keep this many Tokenizer objects ?
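One plausible explanation (an assumption about Lucene's internals, worth verifying against the versions in question): each Analyzer caches a reusable TokenStream per thread in a ThreadLocal, so live Tokenizer instances grow roughly as threads × tokenized fields × cores, and are only reclaimed when the threads or cores go away. The multiplication is easy to reproduce with thread-local storage:

```python
import threading

_lock = threading.Lock()

class FakeTokenizer:
    """Stands in for a Lucene Tokenizer; counts created instances."""
    instances = 0
    def __init__(self):
        with _lock:
            FakeTokenizer.instances += 1

class Field:
    """Each field caches one tokenizer per thread, mimicking a
    ThreadLocal-based reusable-TokenStream cache."""
    def __init__(self):
        self._local = threading.local()
    def tokenizer(self):
        if not hasattr(self._local, "tok"):
            self._local.tok = FakeTokenizer()   # one per (field, thread)
        return self._local.tok

fields = [Field() for _ in range(5)]            # 5 tokenized fields, one core

def index_request():
    for f in fields:                            # a request touches every field
        f.tokenizer()

threads = [threading.Thread(target=index_request) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(FakeTokenizer.instances)  # 5 fields x 200 threads = 1000
```

With 80 cores instead of one, the same arithmetic reaches tens of thousands of cached tokenizers, which is consistent with the heap-dump counts reported above.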

Kuro