How to rank an exact match higher?

2012-03-05 Thread Tommy Chheng
I'm using Solr 3.5 for a type-ahead search system. I want to rank
exact matches (lowercased) higher than non-exact matches.

For example, if I have two docs:
Doc One: title=New York
Doc Two: title=New York City

I would expect a query of new york to rank New York over New York City.

It looks like I need to take into account the # of matches vs the
total # of tokens in a field. I'm not sure how to do this.
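
One way to picture the "matches vs. total tokens" idea is a coverage ratio: the number of field tokens hit by the query divided by the field's token count. The sketch below is plain Python, not Solr code; the `coverage` function and its whitespace/lowercase tokenization are assumptions made only for illustration.

```python
def coverage(query: str, field_value: str) -> float:
    """Fraction of the field's tokens that also appear in the query."""
    query_tokens = set(query.lower().split())
    field_tokens = field_value.lower().split()
    if not field_tokens:
        return 0.0
    matched = sum(1 for token in field_tokens if token in query_tokens)
    return matched / len(field_tokens)


titles = ["New York City", "New York"]
# Sort so that the title most fully covered by the query comes first.
ranked = sorted(titles, key=lambda title: coverage("new york", title), reverse=True)
print(ranked)
```

Lucene's length normalization (fieldNorm) plays a similar role when norms are enabled on the field, since shorter fields then score higher for the same match.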

My debug query shows the two docs with exactly the same score:
<lst name="debug">
<str name="rawquerystring">new york</str>
<str name="querystring">new york</str>
<str name="parsedquery">+DisjunctionMaxQuery((title:new york^50.0 |
textng:new york^40.0))</str>
<str name="parsedquery_toString">+(title:new york^50.0 | textng:new
york^40.0)</str>
<lst name="explain">
<str name="4f553cbc03643929d093d467">
1.1890696 = (MATCH) max of:
  1.1890696 = (MATCH) weight(title:new york^50.0 in 0), product of:
    0.9994 = queryWeight(title:new york^50.0), product of:
      50.0 = boost
      1.1890697 = idf(title: new=2 york=2)
      0.01681987 = queryNorm
    1.1890697 = fieldWeight(title:new york in 0), product of:
      1.0 = tf(phraseFreq=1.0)
      1.1890697 = idf(title: new=2 york=2)
      1.0 = fieldNorm(field=title, doc=0)
</str>
<str name="4f553cbc03643929d093d468">
1.1890696 = (MATCH) max of:
  1.1890696 = (MATCH) weight(title:new york^50.0 in 1), product of:
    0.9994 = queryWeight(title:new york^50.0), product of:
      50.0 = boost
      1.1890697 = idf(title: new=2 york=2)
      0.01681987 = queryNorm
    1.1890697 = fieldWeight(title:new york in 1), product of:
      1.0 = tf(phraseFreq=1.0)
      1.1890697 = idf(title: new=2 york=2)
      1.0 = fieldNorm(field=title, doc=1)
</str>
</lst>

I posted my solrconfig/schema here:
https://gist.github.com/1984052

-- 
Tommy Chheng


Re: Solr with Scala

2012-02-06 Thread Tommy Chheng
I have created a solr plugin using scala. It works without problems.

I wouldn't go as far as using Scala to improve Solr performance, but you
can definitely use Scala to add missing functionality or custom
query parsing. Just build a jar using maven/sbt and put it in Solr's
lib directory.


On Sun, Feb 5, 2012 at 4:06 PM, deniz denizdurmu...@gmail.com wrote:
 Hi all,

 I have a question about scala and solr... I am curious if we can use solr
 with scala (plugins etc) to improve performance.

 anybody used scala on solr? could you tell me opinions about them?

 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-with-Scala-tp3718539p3718539.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Tommy Chheng


Re: phrase auto-complete with suggester component

2012-01-25 Thread Tommy Chheng
Thanks for the link, that's the approach I'm going to try.

On Wed, Jan 25, 2012 at 2:39 PM, O. Klein kl...@octoweb.nl wrote:

 O. Klein wrote

 I agree. Suggester could use some attention. Looking at Wiki there were
 some features planned, but not much has happened lately.


 Or check out this post
 http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
 looking very promising as an alternative.




-- 
Tommy Chheng


phrase auto-complete with suggester component

2012-01-24 Thread Tommy Chheng
I'm testing out the various auto-complete functionalities on the
wikipedia dataset.

I first tried the facet.prefix and found it slow at times. I'm now
looking at the Suggester component. Given a query like new york, I
would like to get results like New York or New York City.

When I tried the suggest component, it suggested entries for each
word rather than for the phrase (even if I add quotes). How can I change
my config to get title matches and not have the query broken into
separate words?

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="new">
      <int name="numFound">5</int>
      <int name="startOffset">0</int>
      <int name="endOffset">3</int>
      <arr name="suggestion">
        <str>newt</str>
        <str>newwy patitta</str>
        <str>newyddion</str>
        <str>newyorker</str>
        <str>newyork–presbyterian hospital</str>
      </arr>
    </lst>
    <lst name="york">
      <int name="numFound">5</int>
      <int name="startOffset">4</int>
      <int name="endOffset">8</int>
      <arr name="suggestion">
        <str>york</str>
        <str>york–dauphin (septa station)</str>
        <str>york—humber</str>
        <str>york—scarborough</str>
        <str>york—simcoe</str>
      </arr>
    </lst>
    <str name="collation">newt york</str>
  </lst>
</lst>

/solr/suggest?q=new%20york&omitHeader=true&spellcheck.count=5&spellcheck.collate=true

solrconfig.xml:
  <searchComponent name="suggest" class="solr.SpellCheckComponent">
   <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">title_autocomplete</str>
    <str name="buildOnCommit">true</str>
   </lst>
  </searchComponent>

  <requestHandler name="/suggest"
      class="org.apache.solr.handler.component.SearchHandler">
   <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
   </lst>
   <arr name="components">
    <str>suggest</str>
   </arr>
  </requestHandler>

schema.xml:
    <fieldType name="text_auto" class="solr.TextField">
     <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
    </fieldType>

   <field name="title_autocomplete" type="text_auto" indexed="true"
          stored="false" multiValued="false" />
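
To make the wanted behaviour concrete, here is a small sketch in plain Python of prefix lookup over whole lowercased titles, which is what a keyword-tokenized, lowercased field stores. It is not how TSTLookup is implemented; `build_index`, `suggest` and the sample titles are hypothetical. The point is that the whole query string is treated as one prefix and is never split into words.

```python
import bisect


def build_index(titles):
    """Lowercase each title and keep it as one entry, like a keyword tokenizer would."""
    return sorted(title.lower() for title in titles)


def suggest(index, prefix, count=5):
    """Return up to `count` entries that start with the whole lowercased prefix."""
    prefix = prefix.lower()
    start = bisect.bisect_left(index, prefix)
    results = []
    for entry in index[start:]:
        if not entry.startswith(prefix) or len(results) == count:
            break
        results.append(entry)
    return results


index = build_index(["New York", "New York City", "Newt", "York"])
print(suggest(index, "new york"))
```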


-- 
Tommy Chheng


Re: phrase auto-complete with suggester component

2012-01-24 Thread Tommy Chheng
Thanks, I'll try out the custom class file. Is there any possibility this
class can be merged into Solr? It seems like expected behavior.


On Tue, Jan 24, 2012 at 11:29 AM, O. Klein kl...@octoweb.nl wrote:
 You might wanna read
 http://lucene.472066.n3.nabble.com/suggester-issues-td3262718.html#a3264740
 which contains the solution to your problem.




-- 
Tommy Chheng


Re: snapshot-4.0 and maven

2010-10-26 Thread Tommy Chheng
You can use maven-assembly-plugin's jar-with-dependencies descriptor to
build a single jar with all its dependencies:


http://stackoverflow.com/questions/574594/how-can-i-create-an-executable-jar-with-dependencies-using-maven

@tommychheng

On 10/19/10 6:53 AM, Matt Mitchell wrote:

Hey thanks Tommy. To be more specific, I'm trying to use SolrJ in a
clojure project. When I try to use SolrJ using what you showed me, I
get errors saying lucene classes can't be found etc.. Is there a way
to build everything SolrJ (snapshot-4.0) needs into one jar?

Matt

On Mon, Oct 18, 2010 at 11:01 PM, Tommy Chheng tommy.chh...@gmail.com wrote:

Once you've built the Solr 4.0 jar, you can use mvn's install command like
this:

mvn install:install-file -DgroupId=org.apache -DartifactId=solr
-Dpackaging=jar -Dversion=4.0-SNAPSHOT -Dfile=solr-4.0-SNAPSHOT.jar
-DgeneratePom=true

@tommychheng

On 10/18/10 7:28 PM, Matt Mitchell wrote:

I'd like to get solr snapshot-4.0 pushed into my local maven repo. Is
this possible to do? If so, could someone give me a tip or two on
getting started?

Thanks,
Matt



Re: snapshot-4.0 and maven

2010-10-18 Thread Tommy Chheng
 Once you've built the Solr 4.0 jar, you can use mvn's install command
like this:


mvn install:install-file -DgroupId=org.apache -DartifactId=solr 
-Dpackaging=jar -Dversion=4.0-SNAPSHOT -Dfile=solr-4.0-SNAPSHOT.jar 
-DgeneratePom=true


@tommychheng


On 10/18/10 7:28 PM, Matt Mitchell wrote:

I'd like to get solr snapshot-4.0 pushed into my local maven repo. Is
this possible to do? If so, could someone give me a tip or two on
getting started?

Thanks,
Matt


Re: DIH - deleting documents, high performance (delta) imports, and passing parameters

2010-08-30 Thread Tommy Chheng

 Thanks for the section on Passing parameters to DIH config:

I'm going to try the parameter passing to allow the DIH to index 
different DBs based on the system environment(local dev machine or 
production machine)


@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 8/30/10 5:07 AM, Ephraim Ofir wrote:

After wasting a few days navigating the somewhat uncharted and murky
waters of DIH, thought I'd share my insights with the community to save
other newbies time, so here goes...

First off, this is not to say DIH is bad, I think it's great and it
works really well for my uses, but it has a few undocumented quirks
which cost me a lot of time.



Deleting documents - several options:
1. the deletedPkQuery in delta import - you'll need to make a DB query
which generates the IDs to be deleted (something like: SELECT id FROM
yourTable WHERE deletedFlag = 1).  Make sure that you have a pk in your
entity and that it's the same one returned by your query (in this case -
pk=id).
2. Add the $deleteDocById or $deleteDocByQuery special command to your
full/delta import.  This one is a bit tricky, see comment below**.
3. Use preImportDeleteQuery/postImportDeleteQuery in your full/delta
query (contrary to what the wiki says, this works for delta-import as
well as full-import).

Any one of these can be used separately from your import, you can put
them in a separate entity and do a full/delta import just on that entity
if that's what you want.



High performance imports with sub entities:
DIH's sub entity architecture is very easy to understand and makes a lot
of sense, but it performs sub queries for each row in the root entity,
which is not practical for high volumes.  I opted for a solution I found
in the Solr book by Packt (excellent book BTW) which involves pushing
multi-valued data into a single field with a separator which is then
split by DIH with a RegexTransformer.  This way Solr issues only one
query to the DB and the DB does all the heavy-lifting.  I actually
implemented my query as a stored procedure so it can be optimized by the
DBA and by the DB and be kept separate from the Solr config.  The
following (MySql) query concatenates 3 lang_code fields from the main
table into one field and multiple emails from a secondary table into
another field:
SELECT u.id,
u.name,
IF((u.lang_code1 IS NULL AND u.lang_code2 IS NULL AND
u.lang_code3 IS NULL), NULL,
CONVERT(CONCAT_WS('|', u.lang_code1, u.lang_code2,
u.lang_code3) USING ascii)) AS multi_lang_codes,
GROUP_CONCAT(e.email SEPARATOR '|') AS multiple_emails
FROM users_tb u
LEFT JOIN emails_tb e ON u.id = e.id
GROUP BY u.id

The entity in data-config.xml looks something like:
<entity name="my_entity"
 query="call get_solr_full();"
 transformer="RegexTransformer">
<field name="email" column="multiple_emails" splitBy="\|" />
<field name="lang_code" column="multi_lang_codes" splitBy="\|" />
</entity>
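
What the splitBy step does to each row can be sketched in a few lines of Python. This only illustrates the regex split; it is not DIH code, and `split_multivalued` plus the sample row are hypothetical.

```python
import re


def split_multivalued(row, column, pattern=r"\|"):
    """Split one separator-joined column into a list; NULL (None) becomes an empty list."""
    value = row.get(column)
    if value is None:
        return []
    return re.split(pattern, value)


row = {
    "id": 7,
    "multiple_emails": "a@example.com|b@example.com",
    "multi_lang_codes": None,
}
print(split_multivalued(row, "multiple_emails"))
print(split_multivalued(row, "multi_lang_codes"))
```

The separator has to be escaped because `|` means alternation in a regular expression, which is why the config uses `\|`.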

High performance delta imports:
DIH's delta import architecture suffers from the same problem as above:
it performs one query to create a list of IDs which need to be updated
and then performs one query to update each ID, which is not practical
for high volumes of data.  I was fervently looking for a way to do this
in a single simple query which would be basically like the full import
query, only adding a WHERE last_updated >
'${dataimporter.last_index_time}' clause.  The closest thing I found was
how to do a delta-import using full-import (DIH FAQ).  I fiddled around
with it a bit until I finally realized that you can actually do exactly
what I wanted very simply - you just need to put a dummy query in the
deltaQuery (you have to have a query there which returns one row for
each time you want the deltaImportQuery to run - once in my case) and
put whatever query you want in the deltaImportQuery.  You could even use
the deltaQuery to get some parameter from the DB to use with the
deltaImportQuery instead of using the dataimporter's timestamp (I saw a
lot of questions concerning time differences between the Solr host and
the DB or other methods of determining the delta which could be solved
this way). I have no need for this so my entity in data-config.xml looks
something like:
<entity name="my_entity"
 pk="id"
 deltaQuery="SELECT 1 AS dummy;"
 deltaImportQuery="call get_solr_delta('${dataimporter.last_index_time}');"
 ...>
<field ... />
</entity>
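
The single-query delta idea can be tried outside Solr with an in-memory SQLite table. This is only a sketch: the body of get_solr_delta is not shown in this post, so the `last_updated` column on `users_tb`, the sample rows and the `delta_rows` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users_tb (id INTEGER PRIMARY KEY, name TEXT, last_updated TEXT)"
)
conn.executemany(
    "INSERT INTO users_tb VALUES (?, ?, ?)",
    [
        (1, "alice", "2010-08-01 10:00:00"),
        (2, "bob", "2010-08-29 09:30:00"),
        (3, "carol", "2010-08-30 04:00:00"),
    ],
)


def delta_rows(connection, last_index_time):
    """Fetch every row changed since the last import with one query."""
    cursor = connection.execute(
        "SELECT id, name FROM users_tb WHERE last_updated > ? ORDER BY id",
        (last_index_time,),
    )
    return cursor.fetchall()


print(delta_rows(conn, "2010-08-28 00:00:00"))
```

One round trip returns all changed rows, instead of one ID query followed by one query per ID.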



Passing parameters to DIH config:
I have multiple Solr shards in my setup, and wanted to reuse as many
config files as possible, the problem is that data-config.xml doesn't
seem to support system property substitution like solrconfig.xml does
(at least not in 1.4.1, I think I saw something about that in JIRA
somewhere). I found a workaround for this by using the property
substitution in solrconfig.xml and passing it as a parameter to DIH.
Here's an excerpt from my 

Re: specifying the doc id in clustering component

2010-08-20 Thread Tommy Chheng
 Yes, that's the approach I'm taking right now. I look up the doc
ids in the result set to find the matching documents.


I can live with the manual lookup, I wanted to see if it would be 
possible to pick a custom field to represent the document in the docs 
array.
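
The manual lookup amounts to something like the following Python sketch on the client side. The response shape is simplified, and `label_clusters` and the sample documents are hypothetical.

```python
def label_clusters(clusters, documents, id_field="id", label_field="name"):
    """Replace the ids in each cluster's docs list with another field from the result set."""
    by_id = {str(doc[id_field]): doc for doc in documents}
    labelled = []
    for cluster in clusters:
        names = [
            by_id[str(doc_id)][label_field]
            for doc_id in cluster["docs"]
            if str(doc_id) in by_id  # skip ids missing from this page of results
        ]
        labelled.append({"labels": cluster["labels"], "docs": names})
    return labelled


clusters = [{"labels": ["Devices"], "docs": ["200066", "195650"]}]
documents = [
    {"id": "200066", "name": "Sensor array", "desc": "..."},
    {"id": "195650", "name": "Fuel cell stack", "desc": "..."},
]
print(label_clusters(clusters, documents))
```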


Thanks for contributing the plugin to solr!

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 8/19/10 10:51 PM, Stanislaw Osinski wrote:

The solr schema has the fields, id,  name and desc.

  I would like to get docs:[name Field here ] instead of the doc Id
field as in
docs:[200066, 195650,


The idea behind using the document ids was that based on them you could
access the individual documents' content, including the other fields, right
from the response field. Using ids limits duplication in the response text
as a whole. Is it possible to use this approach in your application?

Staszek



changable DIH datasource based on environment variables

2010-08-17 Thread Tommy Chheng
 I defined my DIH datasource in solrconfig.xml. Is there a way to 
define two sets of data sources and use one based on the current 
system's environment variable?(ex. APP_ENV=production or 
APP_ENV=development)


I run the DIH on my local machine and remote server. They use different 
mysql datasources for importing.
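
For illustration only, the selection logic being asked for looks like this in Python. It is not a Solr feature; the connection settings and the `pick_datasource` helper are made up for the example.

```python
import os

# Hypothetical connection settings; real URLs and credentials would differ.
DATASOURCES = {
    "development": {"url": "jdbc:mysql://localhost/app_dev", "user": "dev"},
    "production": {"url": "jdbc:mysql://db.example.com/app", "user": "solr"},
}


def pick_datasource(environ=os.environ):
    """Choose the datasource settings named by APP_ENV, defaulting to development."""
    env = environ.get("APP_ENV", "development")
    if env not in DATASOURCES:
        raise ValueError(f"unknown APP_ENV: {env}")
    return DATASOURCES[env]


print(pick_datasource({"APP_ENV": "production"})["url"])
```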


--
@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com



specifying the doc id in clustering component

2010-08-14 Thread Tommy Chheng

 I'm using the clustering component with solr 1.4.

The response is given by the id field in the doc array like:
labels:[Devices],
docs:[200066,
 195650,
 204850,
Is there a way to change the doc label to be another field?

I couldn't find this option in http://wiki.apache.org/solr/ClusteringComponent

--
@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com



Re: DIH and multivariable fields problems

2010-08-06 Thread Tommy Chheng
 For multi-valued fields using the DIH, I use group_concat with the
RegexTransformer's splitBy:

ex:
<entity dataSource="grad_schools" query="
  SELECT group_concat(professors.name separator '|') as university_professors
  FROM professors
  WHERE professors.university_guid = '${universities.guid}'"
  transformer="RegexTransformer">
<field column="university_professors" splitBy="\|" />
</entity>

hope that's helpful.

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 8/6/10 4:39 PM, harrysmith wrote:

I'm having a difficult time understanding how multivariable fields work with
the DataImportHandler when the source is a RDBMS. I've read the following
from the wiki:

--
What is a row?

A row in DataImportHandler is a Map (Map<String, Object>). In the map, the
key is the name of the field and the value can be anything which is a valid
Solr type. The value can also be a Collection of the valid Solr types (this
may get mapped to a multi-valued field). If the DataSource is RDBMS a query
cannot emit a multivalued field. But it is possible to create a multivalued
field by joining an entity with another, i.e. if the sub-entity returns
multiple rows for one row from parent entity it can go into a multivalued
field. If the datasource is xml, it is possible to return a multivalued
field.
--

How does one 'join an entity with another'?  Below are the relevant sections
of my schema.xml and data-config.xml.

schema.xml

<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"
multiValued="true" />

=

data-config.xml

<entity name="item" query="select * from project_items where
projectid_fk=1">
 <field column="ID_PK" name="id" />
  <entity name="terms" query="select distinct DESC_TERM from
tem_metadata where item_id=${item.ID_PK}">
   <entity name="metadata" query="select * from
term_metadata where item_id=${item.ID_PK} AND
desc_term='${terms.DESC_TERM}'">
  <field name="${terms.DESC_TERM}_s"
column="TEXT_VALUE" />
</entity>
   </entity>
</entity>



I have multiple terms (rows) in the term_metadata table that are returned
from the query, but only the first one gets added. Am I missing something
obvious?




























Re: Design questions/Schema Help

2010-07-26 Thread Tommy Chheng
 Alternatively, have you considered storing (or I should say indexing)
the search logs with Solr?


This lets you run text searches across your logged queries. You can
perform time range queries with Solr as well.
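
As a rough sketch of the "most popular searches in a time range" question answered that way, the Python below builds a faceting request over a log index. The field names `timestamp`, `query` and `hits`, the base URL and the `popular_queries_url` helper are assumptions about a schema not shown here; the request parameters themselves (fq, facet, facet.field, facet.limit) are standard Solr ones.

```python
from urllib.parse import urlencode


def popular_queries_url(base_url, window="1DAY", limit=10, zero_hits_only=False):
    """Build a request that facets on the logged query string within a time window."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),  # only the facet counts are needed, not the documents
        ("fq", f"timestamp:[NOW-{window} TO NOW]"),
        ("facet", "true"),
        ("facet.field", "query"),
        ("facet.limit", str(limit)),
    ]
    if zero_hits_only:
        params.append(("fq", "hits:0"))
    return f"{base_url}?{urlencode(params)}"


print(popular_queries_url("http://localhost:8983/solr/select", window="1HOUR", limit=5))
```

For the counts to be per whole query, the `query` field would need to be an untokenized type such as string.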


@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 7/26/10 4:43 PM, Mark wrote:
We are thinking about using Cassandra to store our search logs. Can 
someone point me in the right direction/lend some guidance on design? 
I am new to Cassandra and I am having trouble wrapping my head around 
some of these new concepts. My brain keeps wanting to go back to a 
RDBMS design.


We will be storing the user query, # of hits returned and their 
session id. We would like to be able to answer the following questions.


- What is the n most popular queries and their counts within the last 
x (mins/hours/days/etc). Basically the most popular searches within a 
given time range.
- What is the most popular query within the last x where hits = 0. 
Same as above but with an extra where clause

- For session id x give me all their other queries
- What are all the session ids that searched for 'foos'

We accomplish the above functionality w/ MySQL using 2 tables. One for 
the raw search log information and the other to keep the 
aggregate/running counts of queries.


Would this sort of ad-hoc querying be better implemented using Hadoop 
+ Hive? If so, should I be storing all this information in Cassandra 
then using Hadoop to retrieve it?


Thanks for your suggestions



DIH stalling, how to debug?

2010-07-22 Thread Tommy Chheng

 Hi,
When I run my DIH script, it says it's busy but the Total Requests 
made to DataSource and Total Rows Fetched remain unchanged at 4 and 
6. It hasn't reported a failure.


How can I debug what is blocking the DIH?

--

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com



Re: DIH stalling, how to debug?

2010-07-22 Thread Tommy Chheng

 OK, it was a runaway SQL query that wasn't using an index.

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 7/22/10 4:26 PM, Tommy Chheng wrote:

 Hi,
When I run my DIH script, it says it's busy but the Total Requests 
made to DataSource and Total Rows Fetched remain unchanged at 4 and 
6. It hasn't reported a failure.


How can I debug what is blocking the DIH?



Re: csv response writer

2010-07-14 Thread Tommy Chheng
  I fixed the path of the queryResponseWriter class in the example
solrconfig.xml. The patch applied successfully against Solr 4.0 trunk.


A few quirks:

   * When I didn't specify a default delimiter, it printed out null as
 the delimiter. I couldn't figure out why, because init(NamedList args)
 specifies it'll use a default of ","
 organizationnull2null

   * If I don't specify the column names, the output doesn't emit
 empty values correctly.
 e.g. the output has a mismatched number of commas:
 organization,1,Test,Name,2, ,200,8,
 organization,4,Solar,4,0,

added the patch to https://issues.apache.org/jira/browse/SOLR-1925

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 7/13/10 1:41 PM, Erik Hatcher wrote:

Tommy,

It's not committed to trunk or any other branch at the moment, so no 
future released version until then.


Have you tested it out?  Any feedback we should incorporate?

When I can carve out some time over the next week or so I'll review 
and commit if there are no issues brought up.


Erik

On Jul 13, 2010, at 3:42 PM, Tommy Chheng wrote:


Hi,
Which next version of solr is the csv response writer set to be 
included in?

https://issues.apache.org/jira/browse/SOLR-1925

--
@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: 
http://gradschoolnow.com






csv response writer

2010-07-13 Thread Tommy Chheng

 Hi,
Which next version of solr is the csv response writer set to be included 
in?

https://issues.apache.org/jira/browse/SOLR-1925

--
@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com



Re: csv response writer

2010-07-13 Thread Tommy Chheng

  I'll try it out and let you know!

@tommychheng

Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 7/13/10 1:41 PM, Erik Hatcher wrote:

Tommy,

It's not committed to trunk or any other branch at the moment, so no 
future released version until then.


Have you tested it out?  Any feedback we should incorporate?

When I can carve out some time over the next week or so I'll review 
and commit if there are no issues brought up.


Erik

On Jul 13, 2010, at 3:42 PM, Tommy Chheng wrote:


Hi,
Which next version of solr is the csv response writer set to be 
included in?

https://issues.apache.org/jira/browse/SOLR-1925

--
@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: 
http://gradschoolnow.com






Re: Query modification

2010-07-02 Thread Tommy Chheng

 Hi,
I actually did something similar on http://researchwatch.net/
if you search for stanford university solar, it will process the query 
by tagging the stanford university to the organization field.


I created a query component class and altered the query string like
this (in Scala, but easily translatable to Java):

  override def prepare(rb: ResponseBuilder) {
    val params: SolrParams = rb.req.getParams

    // Only rewrite the query when the component is enabled on the request.
    if (params.getBool(COMPONENT_NAME, false)) {
      val queryString = params.get("q").trim // rb.getQueryString()
      val entityTransform = new ClearboxEntityDetection
      // transformQuery returns the rewritten query plus an explanation map.
      val (transformedQuery, explainMap) =
        entityTransform.transformQuery(queryString)

      rb.setQueryString(transformedQuery)
      rb.rsp.add("clearboxExplain", explainMap)
    }
  }
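
The kind of rewrite described here can be sketched in Python with a simple dictionary lookup. This stands in for the entity detection step only for illustration: the real ClearboxEntityDetection is not shown in this thread, and `KNOWN_ORGANIZATIONS` and `transform_query` are hypothetical.

```python
KNOWN_ORGANIZATIONS = ["stanford university", "uc irvine"]  # hypothetical entity list


def transform_query(query, organizations=KNOWN_ORGANIZATIONS):
    """Move the first known organization found in the query into a fielded phrase clause."""
    text = " ".join(query.lower().split())
    explain = {}
    padded = f" {text} "  # pad so that only whole words can match
    for name in organizations:
        if f" {name} " in padded:
            rest = padded.replace(f" {name} ", " ", 1).strip()
            explain[name] = "organization"
            return f'organization:"{name}" {rest}'.strip(), explain
    return text, explain


print(transform_query("stanford university solar"))
```

Returning the explanation alongside the rewritten query mirrors the (transformedQuery, explainMap) pair above, so the response can show what was tagged.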


@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 7/2/10 3:12 PM, osocurious2 wrote:

If I wanted to intercept a query and turn
 q=romantic italian restaurant in seattle
into
 q=romantic tag:restaurant city:seattle cuisine:italian

would I subclass QueryComponent, modify the query, and pass it to super? Or
is there a standard way already to do this?

What about changing it to
q=romantic city:seattle cuisine:italian&fq=type:restaurant

would that be the same process, or is there a nuance to modifying a query
into a query+filterQuery?

Ken



Re: Query modification

2010-07-02 Thread Tommy Chheng
 I tried OpenNLP but found it's not very good for search queries
because it relies on grammar features like capitalization.


I coded up a Bayesian model with mutual information to model dependence
between terms, e.g. grouping stanford university together in the query
stanford university solar
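
To give a feel for the mutual information part, here is a Python sketch that scores adjacent word pairs by pointwise mutual information (PMI) over a tiny query log. It is not the Bayesian model mentioned above, only the PMI ingredient; `pmi_table` and the sample log are made up.

```python
import math
from collections import Counter


def pmi_table(queries):
    """Pointwise mutual information for each adjacent word pair in a list of queries."""
    unigrams, bigrams = Counter(), Counter()
    for query in queries:
        words = query.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    scores = {}
    for (left, right), count in bigrams.items():
        p_pair = count / total_pairs
        p_left = unigrams[left] / total_words
        p_right = unigrams[right] / total_words
        scores[(left, right)] = math.log(p_pair / (p_left * p_right))
    return scores


log = [
    "stanford university solar",
    "stanford university",
    "solar cell",
    "solar panel",
    "university ranking",
]
scores = pmi_table(log)
# A pair that co-occurs more than chance predicts is a candidate phrase.
print(scores[("stanford", "university")] > scores[("university", "solar")])
```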


@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 7/2/10 3:26 PM, caman wrote:

And what did you use for entity detection?

GATE,openNLP?

Do you mind sharing that please?



From: Tommy Chheng-2 [via Lucene]
[mailto:ml-node+939600-682384129-124...@n3.nabble.com]
Sent: Friday, July 02, 2010 3:20 PM
To: caman
Subject: Re: Query modification



   Hi,
I actually did something similar on http://researchwatch.net/
if you search for stanford university solar, it will process the query
by tagging the stanford university to the organization field.

I created a querycomponent class and altered the query string like
this(in scala but translatable to java easily):
override def prepare(rb: ResponseBuilder){
  val params: SolrParams = rb.req.getParams

  if(params.getBool(COMPONENT_NAME, false)){
val queryString = params.get(q).trim //rb.getQueryString()
val entityTransform = new ClearboxEntityDetection
val (transformedQuery, explainMap) =
entityTransform.transformQuery(queryString)

rb.setQueryString(transformedQuery)
rb.rsp.add(clearboxExplain, explainMap)
  }
}


@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests:
http://gradschoolnow.com


On 7/2/10 3:12 PM, osocurious2 wrote:



If I wanted to intercept a query and turn
  q=romantic italian restaurant in seattle
into
  q=romantic tag:restaurant city:seattle cuisine:italian

would I subclass QueryComponent, modify the query, and pass it to super?

Or

is there a standard way already to do this?

What about changing it to
 q=romantic city:seattle cuisine:italian&fq=type:restaurant

would that be the same process, or is there a nuance to modifying a query
into a query+filterQuery?

Ken










dismax and AND as the default operator

2010-06-17 Thread Tommy Chheng
 I'm using the dismax request handler and want to set the default
operator to AND.
With the standard handler, I could just use q.op or the defaultOperator
in the schema, but this doesn't work with the dismax request handler.


For example, if I call solr/select/?q=fuel+cell, I want Solr to handle
it as solr/select/?q=fuel+AND+cell
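
As far as I know, dismax of this era does not consult q.op; the usual knob is the mm (minimum should match) parameter, where mm=100% asks for every term to match. A minimal Python sketch of building such a request follows; the base URL and the `dismax_all_terms_url` helper are assumptions.

```python
from urllib.parse import urlencode


def dismax_all_terms_url(base_url, user_query):
    """Build a dismax request that requires every query term to match."""
    params = {"q": user_query, "defType": "dismax", "mm": "100%"}
    return f"{base_url}?{urlencode(params)}"


print(dismax_all_terms_url("http://localhost:8983/solr/select", "fuel cell"))
```

The same mm value can also be set as a default in the request handler's configuration instead of on every request.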


--
@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com



readonly access for all host except for localhost

2010-05-10 Thread Tommy Chheng
 Is there a way to configure Solr to allow only read-only access for all
external hosts, except when the request is coming from localhost?


ex. solr-server.com:8983/solr/select is read-only accessible from a remote
server, and the remote server is not allowed to do any update/delete POST
actions.


--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



Re: use a solr-built index with lucene?

2010-04-09 Thread Tommy Chheng
 I was thinking of the reverse case: from solr to lucene. lucene 
doesn't use a schema.xml


Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 4/9/10 12:15 AM, Paul Libbrecht wrote:
This looks like an interesting avenue for a smooth transition from 
lucene to solr.


thanks for more hints you find around.
(e.g. maybe it is not too hard to pre-generate a schema.xml from an 
actual index for the field-types?)


paul


Le 09-avr.-10 à 02:32, Erik Hatcher a écrit :


Yes... gotta jive with schema.xml though.

Erik

On Apr 8, 2010, at 7:18 PM, Tommy Chheng wrote:

If i build an index with solr, is it possible to use the index 
folder with lucene?


--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com







Re: Drill down a solr result set by facets

2010-03-29 Thread Tommy Chheng

 Try adding quotes to your query:

DepartmentName:Chemistry+fSponsor:"US Cancer/Diabetic Research Institute"


 Otherwise the parser will split the value on whitespace.
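
A small Python sketch of quoting facet values before sending them follows. Note one substitution: it puts the selections in fq filter parameters rather than in q, since the dismax parser generally treats field:value text in q as plain words. The `quoted_clause` and `drill_down_url` helpers are hypothetical.

```python
from urllib.parse import urlencode


def quoted_clause(field, value):
    """Wrap a facet value in double quotes so whitespace does not split it."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def drill_down_url(base_url, keywords, selections):
    """Keep the keyword search in q and add one fq filter per selected facet value."""
    params = [("q", keywords), ("qt", "dismax")]
    params += [("fq", quoted_clause(field, value)) for field, value in selections]
    return f"{base_url}?{urlencode(params)}"


selections = [
    ("fDepartmentName", "Chemistry"),
    ("fSponsor", "US Cancer/Diabetic Research Institute"),
]
print(drill_down_url("http://localhost:8983/solr/select/", "cancer stem", selections))
```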

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/29/10 8:49 AM, Dhanushka Samarakoon wrote:

Hi,

I'm trying to perform a search based on keywords and then reduce the result
set based on facets that user selects.
First query for a search would look like this.

http://localhost:8983/solr/select/?q=cancer+stem&version=2.2&wt=php&start=&rows=10&indent=on&qt=dismax&facet=on&facet.mincount=1&facet.field=fDepartmentName&facet.field=fInvestigatorName&facet.field=fSponsor&facet.date=DateAwarded&facet.date.start=2009-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1MONTH

In the above query (as per dismax on the solr config file) it searches
multiple fields such as GrantTitle, DepartmentName, InvestigatorName, etc...

Then if user select 'Chemistry' from the facet field 'fDepartmentName'  and
'US Cancer/Diabetic Research Institute' from 'fSponsor' I need to reduce the
result set above to only records from where fDepartmentName is 'Chemistry'
and 'fSponsor' is 'US Cancer/Diabetic Research Institute'
The following query is not working.
select/?q=cancer+stem+fDepartmentName:Chemistry+fSponsor:US Cancer/Diabetic
Research Institute&version=2.2

Fields starting with 'f' are defined in the schema.xml as copy fields.
<field name="DepartmentName" type="text" indexed="true" stored="true"
multiValued="true" />
<field name="fDepartmentName" type="string" indexed="true" stored="false"
multiValued="true" />
<copyField source="DepartmentName" dest="fDepartmentName"/>

Any ideas on the correct syntax?

Thanks,
Dhanushka.



Re: document categorization using solr?

2010-03-25 Thread Tommy Chheng

 Hi Joel,
Do you need supervised or unsupervised classification?
supervised: you have examples of your classes
unsupervised: you don't know your classes in advance

In the contribs, there is a Solr clustering component which will handle
unsupervised classification:

http://wiki.apache.org/solr/ClusteringComponent
*I think the component is meant to support small quantities of documents.

For supervised solutions (or larger-scale unsupervised solutions), Mahout
could be a good start as it can use the Solr index.


Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/25/10 6:40 PM, Joel Nylund wrote:

Hi,

Does solr have something built in, or recommended add-on that does 
document categorization? ( I found a thread about a year ago, but not 
exact same topic)


For example, here is a commercial categorization product that will 
take a website and categorize it


http://grapeshot.co.uk/online-demo-3.php?url=http://www.solutionstreet.com 



I am looking for something similar that works with Solr/Lucene and is 
open source based.


Seems like Weka 
(http://weka.wikispaces.com/Frequently+Asked+Questions)  might be 
close, but not sure. Also not sure how to come up with a category 
list


thanks
Joel



Re: keyword query tokenizer

2010-03-25 Thread Tommy Chheng

 Multi-field searches are one reason for doing the tokenizing in the parser.

Imagine if your query was name:bob content:climate

The parser can tokenize the query into name:bob and content:climate and
pass each into its own analyzer.
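
The split-then-analyze order can be sketched in Python as below. It is a toy model, not the Lucene QueryParser: `split_clauses`, `analyze` and the per-field analyzers are hypothetical.

```python
def split_clauses(query, default_field="text"):
    """Split a query on whitespace and pair each term with its field."""
    clauses = []
    for chunk in query.split():
        field, separator, term = chunk.partition(":")
        if separator:
            clauses.append((field, term))
        else:
            clauses.append((default_field, chunk))
    return clauses


# Hypothetical per-field analyzers: names keep their case, other fields are lowercased.
ANALYZERS = {"name": lambda term: term, "content": str.lower}


def analyze(query):
    """Run each clause through the analyzer configured for its field."""
    return [
        (field, ANALYZERS.get(field, str.lower)(term))
        for field, term in split_clauses(query)
    ]


print(analyze("name:Bob content:Climate"))
```

Because the split happens first, each analyzer only ever sees one whitespace-free chunk, which is why a keyword tokenizer on the query side still receives separate words.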


Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/25/10 7:37 PM, Jason Chaffee wrote:
I am curious as to why the query parser does any tokenizing?  I would 
think you would want to control/configure this with your analyzers?


Does anyone know the answer to this? Is there a performance gain or 
something?


Thanks,

Jason

On Mar 25, 2010, at 4:04 PM, Ahmet Arslan iori...@yahoo.com wrote:


 I have the following configured for a particular field:

 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory" />
   <filter class="solr.LowerCaseFilterFactory" />
 </analyzer>

 I am using dismax and querying multiple fields and I expect the query to
 be parsed differently for each field.  For some reason, it is not kept as a
 single token for this field's query.  For example, the query "Apple
 Store" is being broken into two tokens, "apple" and "store".  I would
 expect it to be "apple store".

 Does anyone have ideas of what might be going on here?

Before the analysis phase, QueryParser splits on whitespace. You can 
alter this behavior by escaping the whitespace with a backslash: apple\ store
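
A small client-side sketch of that escaping, in Python (the helper name is hypothetical; it handles plain spaces only, and the result still has to be URL encoded before it goes into a request):

```python
def escape_whitespace(term):
    """Backslash-escape spaces so the query parser keeps the term whole."""
    return term.replace(" ", "\\ ")


print(escape_whitespace("apple store"))  # prints: apple\ store
```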








phrase segmentation plugin in component, analyzer, filter or parser?

2010-03-23 Thread Tommy Chheng

 I'm writing an experimental phrase segmentation plugin for solr.

My current plan is to write it as a SearchComponent that overrides the 
queryString with the new grouped query.
ex. (university of california irvine 2009) will be re-written to 
"university of california irvine" 2009



Is the SearchComponent the right class to extend for this type of logic?
I picked the component because it was one place where I could get access 
to overwrite the whole query string.


Or is it better design to write it as an analyzer, tokenizer, filter or 
parser plugin?



--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



trimfilterfactory on string fieldtype?

2010-03-18 Thread Tommy Chheng

 Can the trim filter factory work on string fieldtypes?

When I define a trim filter factory on a string fieldtype, I get an 
exception:
org.apache.solr.common.SolrException: Unknown fieldtype 'string' 
specified on field id
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:477)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:520)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)


This is how I define the field in the schema:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" 
omitNorms="true">

<analyzer type="index">
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>

--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



Re: XML data in solr field

2010-03-16 Thread Tommy Chheng
 Do you have the option of just importing each xml node as a 
field/value when you add the document?


That'll let you do the search easily. If you need to store the raw XML, 
you can use an extra field.
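
A rough sketch of that flattening step in Python, assuming every child of the root carries its data in a "value" attribute as in your sample (the function name is made up, and using the tag name as the Solr field name is just one possible mapping):

```python
import xml.etree.ElementTree as ET


def xml_to_fields(xml_string):
    """Turn <Tag value="..."/> children of the root into a field/value dict."""
    root = ET.fromstring(xml_string)
    fields = {}
    for child in root:
        value = child.get("value")
        if value is not None:
            fields[child.tag] = value
    return fields


sample = """<root>
<Venue value="Radio City Music Hall" />
<Address value="New-York, USA" />
</root>"""

fields = xml_to_fields(sample)
# Keep the raw XML in an extra stored field if it is still needed.
fields["inputxml"] = sample
```

Each key in the resulting dict would become its own field when you add the document, so a search on Venue returns just the value.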


Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/16/10 12:59 PM, Nair, Manas wrote:

Hello Experts,

I need help on this issue of mine. I am unsure if this scenario is possible.
I have a field in my solr document named inputxml, the value of which is an xml string as below. This xml 
structure is within the inputxml field value. I need help searching this xml structure, i.e. if I search 
for Venue, I should get Radio City Music Hall as the result and not the complete tag like <Venue 
value="Radio City Music Hall" />. Is this supported in solr? If it is, how can this be 
implemented?

<root>
<Venue value="Radio City Music Hall" />
<Link value="http://bit.ly/Rndab" />
<LinkText value="En savoir +" />
<Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result, instead I 
need the tag value.

Thanks in advance,
Manas Nair



Re: DIH field options

2010-03-12 Thread Tommy Chheng
 Haven't tried this myself, but try adding a default value in the schema 
and don't specify the field during the import.

http://wiki.apache.org/solr/SchemaXml


On 3/12/10 7:56 AM, blargy wrote:

Forgive me but I'm slightly retarded... I grew up underneath some power lines
;)

I've read through that wiki but I still can't find what I'm looking for. I
just want to give one of the DIH entities/fields a static value (i.e. it
doesn't come from a database column). How can I configure this?

FYI this is data-config.xml not schema.xml.

   <document>
     <entity name="item" query="select * from items">
       <field name="my_field" column="static_value_not_from_db"/>
     </entity>
   </document>




Tommy Chheng-4 wrote:

   The wiki page has most of the info you need
http://wiki.apache.org/solr/DataImportHandler

To use multi-value fields, your schema.xml must define it with
multiValued=true


On 3/11/10 10:58 PM, blargy wrote:

How can you simply add a static value like? <field name="id"
value="123"/>
How does one add a static multi-value field? <field name="category_ids"
values="123, 456"/>

Is there any documentation on all the options for the field tag in
data-config.xml?

Thanks for the help

--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com





--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



Re: How to get Term Positions?

2010-03-12 Thread Tommy Chheng

 I contributed a little reward to whoever can complete this task too
http://nextsprocket.com/tasks/solr-1337-spans-and-payloads-query-support-asf-jira

Feel free to contribute to the reward if you need this done too!

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/12/10 2:14 PM, Grant Ingersoll wrote:

OK, you need https://issues.apache.org/jira/browse/SOLR-1337 and its related 
item: https://issues.apache.org/jira/browse/SOLR-1485

Unfortunately, not implemented yet.

On Mar 12, 2010, at 1:36 PM, MitchK wrote:


Thanks for your response, Grant!

Imagine you are searching for "foo".
"foo" occurs in doc1 three times. It is the 5th, the 20th, and the 50th
term in the document.
I want to get these positions.

Of course, if I am searching for "foo bar" and "bar" occurs at the 4th and
the 21st position, I also want to know that. I am not sure, but I think this
is what you mean by "per doc basis", right?

Since I need the TermPosition at scoring time, TermVectorComponent seems to
be no option in this case, or do you think it could be one, if I create such
Vectors at index-time?





Re: DIH field options

2010-03-11 Thread Tommy Chheng

 The wiki page has most of the info you need
http://wiki.apache.org/solr/DataImportHandler

To use multi-value fields, your schema.xml must define it with 
multiValued=true



On 3/11/10 10:58 PM, blargy wrote:

How can you simply add a static value like? <field name="id" value="123"/>
How does one add a static multi-value field? <field name="category_ids"
values="123, 456"/>

Is there any documentation on all the options for the field tag in
data-config.xml?

Thanks for the help


--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



Re: persistent cache

2010-02-12 Thread Tommy Chheng
 One solution is to add the persistent cache with memcache at the 
application layer.
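
Roughly what I mean, sketched in Python. The DictCache class is only an in-memory stand-in so the example runs on its own; a real memcached client would take its place, and run_query is a placeholder for whatever function actually calls Solr:

```python
import hashlib
import json


class DictCache:
    """In-memory stand-in for a memcached client (get/set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def cached_search(cache, run_query, params):
    """Return a cached result for params, querying only on a miss."""
    # memcached keys cannot contain whitespace, so hash the parameters.
    digest = hashlib.sha1(
        json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
    key = "solr:" + digest
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_query(params)
    cache.set(key, json.dumps(result))
    return result
```

Since the cache lives outside the Solr process, entries survive a Solr restart, but they also need some expiry or invalidation when the index changes.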


--
Tommy Chheng

Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



On 2/12/10 5:19 AM, Tim Terlegård wrote:

2010/2/12 Shalin Shekhar Mangar <shalinman...@gmail.com>:

2010/2/12 Tim Terlegård <tim.terleg...@gmail.com>


Does Solr use some sort of a persistent cache?


Solr does not have a persistent cache. That is the operating system's file
cache at work.

Aha, that's very interesting and seems to make sense.

So is the primary goal of warmup queries to allow the operating system
to cache all the files in the data/index directory? Because I think
the difference (768ms vs 52ms) is pretty big. I just do one warmup
query and get 52 ms response on a 40 million documents index. I think
that's pretty nice performance without tinkering with the caches at
all. The only tinkering that seems to be needed is this operating
system file caching. What's the best way to make sure that my warmup
queries have cached all the files? And does a file cache have the
complete file in memory? I guess it can get tough to get my 100GB
index into the 16GB memory.

/Tim






DataImportHandlerException for custom DIH Transformer

2010-02-07 Thread Tommy Chheng

 I'm having trouble making a custom DIH transformer in solr 1.4.

I compiled the General TrimTransformer into a jar. (just copy/paste 
sample code from http://wiki.apache.org/solr/DIHCustomTransformer)
I placed the jar along with the dataimporthandler jar in solr/lib (same 
directory as the jetty jar)


Then I added to my DIH data-config.xml file: 
transformer="DateFormatTransformer, RegexTransformer, 
com.chheng.dih.transformers.TrimTransformer"


Now I get this exception when I try running the import.
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.NoSuchMethodException: 
com.chheng.dih.transformers.TrimTransformer.transformRow(java.util.Map)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.loadTransformers(EntityProcessorWrapper.java:120)


I noticed the exception lists 
TrimTransformer.transformRow(java.util.Map) but the abstract Transformer 
class defines a two-parameter method: transformRow(Map<String, Object> 
row, Context context)?



--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



Re: Using solr to store data

2010-02-03 Thread Tommy Chheng

Hey AJ,
For simplicity's sake, I am using Solr to serve as both storage and search for 
http://researchwatch.net.
The dataset is 110K NSF grants from 1999 to 2009. The faceting is all on 
dynamic fields and I use a catch-all to copy all fields to a default 
text field. All fields are also stored and used for the individual grant view.
The performance seems fine for my purposes, though I haven't done any extensive 
benchmarking with it. The site was built using a light RoR/rsolr layer 
on a small EC2 instance.


Feel free to bang against the site with jmeter if you want to stress 
test a sample server to failure.  :)


--
Tommy Chheng
Developer  UC Irvine Graduate Student
http://tommy.chheng.com

On 2/3/10 5:41 PM, AJ Asver wrote:

Hi all,

I work on search at Scoopler.com, a real-time search engine which uses Solr.
  We currently use solr for indexing but then fetch data from our couchdb
cluster using the IDs solr returns.  We are now considering storing a larger
portion of data in Solr's index itself so we don't have to hit the DB too.
  Assuming that we are still storing data on the db (for backend and back up
purposes) are there any significant disadvantages to using solr as a data
store too?

We currently run a master-slave setup on EC2 using x-large slave instances
to allow for the disk cache to use as much memory as possible.  I imagine we
would definitely have to add more slave instances to accommodate the extra
data we're storing (and make sure it stays in memory).

Any tips would be really helpful.
--
AJ Asver
Co-founder, Scoopler.com

+44 (0) 7834 609830 / +1 (415) 670 9152
a...@scoopler.com


Follow me on Twitter: http://www.twitter.com/_aj
Add me on Linkedin: http://www.linkedin.com/in/ajasver
or YouNoodle: http://younoodle.com/people/ajmal_asver

My Blog: http://ajasver.com

   


filter querying working on dynamic int fields but not dynamic string fields?

2010-01-20 Thread Tommy Chheng
I'm having trouble doing a filter query on a string field. Any ideas why 
it's working on dynamic int fields but not dynamic string fields?


ex.
http://localhost:8983/solr/select?indent=on&version=2.2&q=climate - correct
http://localhost:8983/solr/select?version=2.2&q=climate&fq=awardedamounttodate_i%3A88900 
FQ with dynamic int field returns one result - correct
http://localhost:8983/solr/select?indent=on&version=2.2&q=climate&fq=awardinstrument_s:Continuing+grant 
returns zero results - Incorrect


In my schema.xml, I set up dynamic fields like this:
<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>

In my index, I have a record like this, which should have matched the last query:
<str name="id">9987644</str>
<int name="awardedamounttodate_i">88900</int>
<str name="awardinstrument_s">Continuing grant </str>
<str name="abstract_t">Abstract  ATM-987644  Zeng, Ning  University of 
California, Los Angeles  Title: Hierarchical Modeling of 
Vegetation-Climate </str>


This is the query debug section:
<lst name="debug">
<str name="rawquerystring">climate</str>
<str name="querystring">climate</str>
<str name="parsedquery">text:climat</str>
<str name="parsedquery_toString">text:climat</str>
<lst name="explain"/>
<str name="QParser">LuceneQParser</str>
<arr name="filter_queries">
<str>awardinstrument_s:Continuing grant</str>
</arr>
<arr name="parsed_filter_queries">
<str>+awardinstrument_s:Continuing +text:grant</str></arr>



Re: filter querying working on dynamic int fields but not dynamic string fields?

2010-01-20 Thread Tommy Chheng
Thanks,  quoting it fixed it. I'm also going to strip the 
leading/trailing whitespace at index time.


Tommy

On 1/20/10 1:47 PM, Erik Hatcher wrote:


On Jan 20, 2010, at 4:27 PM, Tommy Chheng wrote:

I'm having trouble doing a filter query on a string field. Any ideas 
why it's working on dynamic int fields but not dynamic string fields?


ex.
http://localhost:8983/solr/select?indent=on&version=2.2&q=climate - 
correct
http://localhost:8983/solr/select?version=2.2&q=climate&fq=awardedamounttodate_i%3A88900 FQ 
with dynamic int field returns one result - correct
http://localhost:8983/solr/select?indent=on&version=2.2&q=climate&fq=awardinstrument_s:Continuing+grant returns 
zero results - Incorrect


fq=field:value with spaces

is problematic - it is being parsed as a SolrQueryParser expression.  
It should work if you quote it - fq=field:"value with spaces"


However, as I mentioned earlier today on the list, I think the best 
option for facet narrowing on string fields is this:


   fq={!raw f=field}value with spaces

Of course all of the above need to be URL encoded too.
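
For what it's worth, here is a sketch of building both forms with URL encoding from a Python client. The helper names are hypothetical and the field and value are taken from the example in this thread; only urlencode from the standard library does the real work:

```python
from urllib.parse import urlencode


def quoted_fq(field, value):
    """Build field:"value", escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '%s:"%s"' % (field, escaped)


def raw_fq(field, value):
    """Build {!raw f=field}value, which bypasses query parsing of the value."""
    return "{!raw f=%s}%s" % (field, value)


quoted = urlencode([("q", "climate"),
                    ("fq", quoted_fq("awardinstrument_s", "Continuing grant"))])
raw = urlencode([("q", "climate"),
                 ("fq", raw_fq("awardinstrument_s", "Continuing grant"))])

print(quoted)  # q=climate&fq=awardinstrument_s%3A%22Continuing+grant%22
print(raw)     # q=climate&fq=%7B%21raw+f%3Dawardinstrument_s%7DContinuing+grant
```

Note that the raw form matches the indexed value exactly, so any trailing whitespace in the stored string would have to be part of the value too.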


<str>+awardinstrument_s:Continuing +text:grant</str></arr>


This explains the problem exactly.  Note how it parsed the second word 
to the text field, not the field you specified.


Erik




Re: Facet query help

2009-10-12 Thread Tommy Chheng
ok, so fq != facet.query. I thought it was an alias. I'm trying your 
suggestion fq=Memory_s:1 GB and now it's returning zero documents even 
though there is one document that has tommy and Memory_s:1 GB as 
seen in the original pastie (http://pastie.org/650932). I tried the fq 
query body with quotes and without quotes.


http://lh:8983/solr/select/?facet=true&facet.field=CPU_s&facet.field=Memory_s&facet.field=Video+Card_s&wt=ruby&fq=%22Memory_s:1+GB%22&q=tommy&indent=on

Any thoughts?

thanks,
tommy

On 10/12/09 1:00 AM, Shalin Shekhar Mangar wrote:

On Mon, Oct 12, 2009 at 6:07 AM, Tommy Chheng <tommy.chh...@gmail.com> wrote:

   

The dummy data set is composed of 6 docs.

My query is set for 'tommy' with the facet query of Memory_s:1+GB

http://lh:8983/solr/select/?facet=true&facet.field=CPU_s&facet.field=Memory_s&facet.field=Video+Card_s&wt=ruby&facet.query=Memory_s:1+GB&q=tommy&indent=on

However, in the response (http://pastie.org/650932), I get two docs: one
which has the correct field Memory_s:1 GB and the second document which has
a Memory_s:3+GB. Why did the second document match if i set the facet.query
to just 1+GB??


 

facet.query does not limit documents. It is used for finding the number of
documents matching the query. In order to filter the result set you should
use filter query e.g. fq=Memory_s:1 GB
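
To make the difference concrete, here is a toy in-memory model in Python. It is not Solr code; the two documents and the search function are invented purely to mirror the behaviour described above:

```python
docs = [
    {"id": "a", "Memory_s": "1 GB"},
    {"id": "b", "Memory_s": "3 GB"},
]


def search(docs, fq=None, facet_query=None):
    """fq narrows the result set; facet_query only counts within it."""
    results = [d for d in docs if fq is None or fq(d)]
    response = {"docs": results}
    if facet_query is not None:
        response["facet_count"] = sum(1 for d in results if facet_query(d))
    return response


def is_1gb(doc):
    return doc["Memory_s"] == "1 GB"


counted = search(docs, facet_query=is_1gb)   # both docs come back, count is 1
filtered = search(docs, fq=is_1gb)           # only the 1 GB doc comes back
```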

   


Facet query help

2009-10-11 Thread Tommy Chheng

The dummy data set is composed of 6 docs.

My query is set for 'tommy' with the facet query of Memory_s:1+GB
http://lh:8983/solr/select/?facet=true&facet.field=CPU_s&facet.field=Memory_s&facet.field=Video+Card_s&wt=ruby&facet.query=Memory_s:1+GB&q=tommy&indent=on

However, in the response (http://pastie.org/650932), I get two docs: one 
which has the correct field Memory_s:1 GB and a second document which 
has Memory_s:3+GB. Why did the second document match if I set the 
facet.query to just 1+GB?


I'm using Solr 1.4 trunk

thanks
tommy