date:20121203

Re: behavior of solr.KeepWordFilterFactory

2012-12-03 Thread Xi Shen

A Solr index is case-sensitive by default unless you use the lowercase
filter. I remember seeing this topic on the Solr list, and the solution is simple:

copy the field;
use a new analyzer/tokenizer to process the copied field, and do not apply the
lowercase filter to it.

When querying, make sure both fields are included.
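
As a rough illustration of the two-field idea above, here is a minimal plain-Python sketch. It is not Solr code: the whitespace tokenizer, the field names `content_lc` and `content_cs`, and the helper functions are all made up for the example, and Solr's real analyzers behave differently in detail.

```python
# Toy model of two analyzers applied to the same source text.
# All names here are hypothetical; this only mimics the idea, not Solr itself.

def analyze_lowercased(text):
    """Mimic a tokenizer followed by a lowercase filter."""
    return [tok.lower() for tok in text.split()]


def analyze_case_preserving(text):
    """Mimic the same tokenizer with no lowercase filter."""
    return text.split()


def index_document(text):
    # The "copy" step: the same source text feeds both fields.
    return {
        "content_lc": set(analyze_lowercased(text)),
        "content_cs": set(analyze_case_preserving(text)),
    }


def matches(doc, term, case_sensitive=False):
    """Use the case-preserving field for exact-case lookups, else the lowercased one."""
    if case_sensitive:
        return term in doc["content_cs"]
    return term.lower() in doc["content_lc"]


doc = index_document("IT stocks rallied as it was expected")
print(matches(doc, "IT", case_sensitive=True))
print(matches(doc, "It", case_sensitive=True))
print(matches(doc, "RALLIED"))
```

The point of the sketch is only that the case-sensitive lookup and the normal lookup hit different fields built from the same text, which is why a query has to include both.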


On Mon, Dec 3, 2012 at 3:04 PM, Joe Zhang smartag...@gmail.com wrote:

 In other words, what I wanted to achieve is case-sensitive indexing on a
 small set of words. Can anybody help?

 On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang smartag...@gmail.com wrote:

  To be more specific, this is the data type I was using:
 
  <fieldType name="textspecial" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.KeepWordFilterFactory"
              words="tickers.txt" ignoreCase="false"/>
      <filter class="solr.StopFilterFactory"
              ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0"
              splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory"
              protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
 
 
  On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang smartag...@gmail.com wrote:
 
  Yes, that is the correct behavior. But how do I achieve my goal, i.e.,
  special treatment for a list of uppercase/special words and normal
  treatment for everything else?
 
 
  On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com wrote:
 
  By the definition at

  https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html

  I am pretty sure this is the correct behavior of this filter :)

  I guess you are trying to use this filter to index some special words in
  Chinese?
 
 
  On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com
 wrote:
 
   I defined the following data type in my Solr schema.xml:

   <fieldtype name="testkeep" class="solr.TextField">
     <analyzer>
       <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
               ignoreCase="false"/>
     </analyzer>
   </fieldtype>

   When I use the type testkeep to index a test field, my expectation
   was that Solr would index the uppercase form of a small list of words
   in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the
   closed list is achieved, but NO OTHER WORD outside the list is indexed!

   Can anybody help? Thanks in advance!
  
   Joe
  
 
 
 
  --
  Regards，
  David Shen
 
  http://about.me/davidshen
  https://twitter.com/#!/davidshen84
 
 
 
 




-- 
Regards，
David Shen

http://about.me/davidshen
https://twitter.com/#!/davidshen84

Re: duplicated URL sent from Nutch to solr index

2012-12-03 Thread Xi Shen

Then the URL must be the same.
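
The update-or-create behavior described in this thread can be pictured with a small Python sketch. This is only a toy model under the assumption that the index behaves like a map keyed by the uniqueKey field; the function name and the example URLs are hypothetical, and it says nothing about how Solr stores documents internally.

```python
# Toy model: an index keyed by the uniqueKey field ("id", which holds the URL).
# Hypothetical helper, not a Solr API.

def add_document(index, doc, unique_key="id"):
    """Add a document; a document with an existing key replaces the old entry."""
    index[doc[unique_key]] = doc
    return index


index = {}
add_document(index, {"id": "http://example.com/a", "content": "v1", "tstamp": 1})
add_document(index, {"id": "http://example.com/a", "content": "v2", "tstamp": 2})
add_document(index, {"id": "http://example.com/b", "content": "other", "tstamp": 3})

print(len(index))
print(index["http://example.com/a"]["content"])
```

Sending the same URL twice leaves a single entry carrying the newer content and timestamp, while a new URL creates a new entry.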


On Mon, Dec 3, 2012 at 2:34 PM, Joe Zhang smartag...@gmail.com wrote:

 Sorry I didn't make it perfectly clear. The id field is URL.

 On Sun, Dec 2, 2012 at 11:33 PM, Joe Zhang smartag...@gmail.com wrote:

  Thanks!
 
 
  On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen davidshe...@gmail.com wrote:
 
  If the value of the id field is the same, the old entry will be updated;
  if it is new, a new entry will be created and indexed.
 
  This is my experience. :)
 
 
  On Mon, Dec 3, 2012 at 1:45 PM, Joe Zhang smartag...@gmail.com wrote:
 
   Dear list,
  
   I just want to confirm an expected behavior of Solr:

   Assuming we have <uniqueKey>id</uniqueKey> in schema.xml for Solr,
   when we send the same URL from Nutch to Solr multiple times, would there be
   ONLY ONE entry for that URL, with the content (if changed) and timestamp
   updated?
  
  
   Thanks!
  
   Joe
  
 
 
 
  --
  Regards，
  David Shen
 
  http://about.me/davidshen
  https://twitter.com/#!/davidshen84
 
 
 




-- 
Regards，
David Shen

http://about.me/davidshen
https://twitter.com/#!/davidshen84

Re: Solr 4: Join Query

2012-12-03 Thread Vikash Sharma

Hi Erick,
One more thing: is there any other way to get the result?
I mean, I need to get both the parent and child documents, in either nested or non-nested format.

Regards,
Vikash

Regards,
Vikash Sharma
vikash0...@gmail.com


On Sat, Dec 1, 2012 at 10:29 PM, Erick Erickson erickerick...@gmail.com wrote:

 That's the way joins work, and why they're called pseudo-joins: they don't
 work like DB joins, which return data from both records.

 Joins were put in for a specific use case; when you try to treat Solr like
 a DB you're bound to be disappointed. I'd think about reworking the
 solution to de-normalize the data so you don't have to do joins.

 Best
 Erick
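
To make the pseudo-join behavior discussed above concrete, here is a small Python sketch using the two example records from the original question. It is a simplified model written for this thread, not Solr's implementation: `pseudo_join` is a hypothetical helper that collects the `from` field values of the documents matching the inner query and returns only the documents whose `to` field carries one of those values.

```python
# Simplified model of {!join from=doc_id to=id}emp_id:EMP001.
# pseudo_join is a hypothetical helper, not part of any Solr client.

def pseudo_join(docs, from_field, to_field, predicate):
    """Return the docs whose to_field equals a from_field value of a matching doc."""
    keys = {d[from_field] for d in docs if predicate(d) and d.get(from_field)}
    return [d for d in docs if d.get(to_field) in keys]


docs = [
    {"id": "ABC", "emp_id": "EMP001", "doc_id": "DOC001", "doc_content": "My Parent Doc"},
    {"id": "DOC001", "emp_id": "", "doc_id": "", "doc_content": "My Document Data"},
]

result = pseudo_join(docs, "doc_id", "id", lambda d: d["emp_id"] == "EMP001")
print([d["id"] for d in result])
```

In this model only the record on the `to` side comes back, which matches the "I only get child records" observation; the parent's fields are not merged into the result the way a DB join would merge them.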


 On Fri, Nov 30, 2012 at 10:38 AM, Vikash Sharma vikash0...@gmail.com
 wrote:

  Hi All,
  I have my field definition in schema.xml like below
 
  field name=id type=string indexed=true. /
  field name=Emp_id type=string indexed=true. /
  field name=doc_id type=string indexed=true. /
  field name=content type=string indexed=true. /
 
 
  I need to create a separate record in Solr for each parent-child
  relationship, such that if a child is the same across different parents
  it gets stored only once.
 
  For e.g.
   --- Record 1
  <id>ABC</id>
  <emp_id>EMP001</emp_id>
  <doc_id>DOC001</doc_id>
  <doc_content>My Parent Doc</doc_content>
 
   --- Record 2
  <id>DOC001</id>
  <emp_id></emp_id>
  <doc_id></doc_id>
  <doc_content>My Document Data</doc_content>
 
 
  This ensures that if any doc_id content is duplicated, the record is
  inserted into Solr only once.
 
  Lastly, I want the result as a join: if emp_id=EMP001, then both records
  should be returned, as there is a relationship between the two records via
  doc_id = id.
 
  If I query:
 
  http://localhost:8983/solr/select?q={!join%20from=doc_id%20to=id}emp_id:EMP001&wt=json
 
  http://localhost:8983/solr/select?q={!join%20from=sha_one%20to=id}project_id:10&wt=json
 
  I expect both records to be returned, either one after another or
  nested. But I only get child records...
 
 
  Please help..
 
 
 
  Regards,
  Vikash Sharma
  vikash0...@gmail.com

How to change Solr UI

2012-12-03 Thread Romita Saha

Hi,

I want to change the Solr UI. As far as I understand, Solritas is just for 
prototyping, where I can change the UI according to a predefined template 
(Velocity) but cannot add any additional functionality to that page. 
How else can I change the Solr UI? Any guidance would be appreciated.

Thanks and regards,
Romita

AW: Edismax query parser and phrase queries

2012-12-03 Thread Tantius, Richard

Hi,
the use case we have in mind is that we would like to achieve exact matches for 
explicit phrases. Our users expect an explicit phrase to respect not only 
the order of terms but also the exact wording. Therefore, if we search on 
fields whose data type is not meant for exact matching, we need to 
change that for explicit phrases. In a usual query, our qf 
default fields use advanced tokenization (for query processing and indexing), 
for example stemming via SnowballPorterFilterFactory. So our idea was to 
change the default search fields for explicit phrases to achieve exact matches 
by using a simple data type such as “string” (StrField, without 
advanced options).

Extending our example from the last mail: 

qf=title text

Datatype of title, text, something like “text_advanced”:

<fieldtype ...>
 <analyzer type="index"> <!--(and also analyzer type="query")-->
  <filter class="solr.WordDelimiterFilterFactory" ... />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.SnowballPorterFilterFactory" language="German2" />
...

Data type of the additional fields titleExact, textExact:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" 
omitNorms="true"/>

q="ran away from home" Cat Dog 

-transformTo-

q=( titleExact:"ran away from home" OR textExact:"ran away from home" ) Cat Dog.
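
The transformation above could be sketched roughly as follows in Python. This is only an illustration of the pre-processing idea, assuming explicit phrases are marked with double quotes in the user query; `expand_phrases` and its regular expression are hypothetical, and the sketch deliberately ignores escaped quotes, field prefixes and boolean operators, which is where the approach gets complicated.

```python
import re

# Matches a double-quoted phrase; escaped quotes are not handled in this sketch.
PHRASE = re.compile(r'"([^"]+)"')


def expand_phrases(query, exact_fields=("titleExact", "textExact")):
    """Rewrite each double-quoted phrase into an OR over the exact-match fields."""
    def repl(match):
        phrase = match.group(1)
        clauses = " OR ".join(f'{field}:"{phrase}"' for field in exact_fields)
        return f"( {clauses} )"
    return PHRASE.sub(repl, query)


rewritten = expand_phrases('"ran away from home" Cat Dog')
print(rewritten)
```

Terms outside the quotes are left untouched, so they would still be searched against the regular qf fields.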

Regards,
Richard.

BINSERV
Gesellschaft für interaktive Konzepte und neue Medien mbH
Software Engineer

Gotenstr. 7-9
53175 Bonn
Tel.: +49 (0)228 / 4 22 86 - 38 
Fax.: +49 (0)228 / 4 22 86 - 538
E-Mail:   r.tant...@binserv.de  
Web:  www.binserv.de
  www.binforcepro.de

Geschäftsführer: Rüdiger Jakob
Amtsgericht: Siegburg HRB 6765
Hauptsitz der Gesellschaft.: Pfarrer-Wichert-Str. 35, 53639 Königswinter


- Original message -
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Friday, 30 November 2012 23:04
To: solr-user@lucene.apache.org
Subject: Re: Edismax query parser and phrase queries

I don’t have a simple answer for your stated issue, but maybe part of that is 
because I’m not sure what the exact problem/goal is. I mean, what’s so 
special about phrase queries in your app that they need distinct processing 
from individual terms?

And, ultimately, what goal are you trying to achieve? For instance, how will the 
outcome of the query affect what users see and do?

-- Jack Krupansky

From: Tantius, Richard
Sent: Friday, November 30, 2012 8:44 AM
To: solr-user@lucene.apache.org
Subject: Edismax query parser and phrase queries

Hi,

we are using the edismax query parser and execute queries on specific fields by 
using the qf option. Like others, we face the problem that we do not want 
explicit phrase queries to be performed on some of the qf fields, and we also 
require additional search fields for those kinds of queries.

We tried to expand explicit phrases in a query by implementing some 
pre-processing logic, which did not seem very convenient.

For example (let's assume qf=title text, and we want phrase queries to be 
performed on the additional fields titleAlt and textAlt): q="ran away from home" 
Cat Dog -transformTo- q=( titleAlt:"ran away from home" OR textAlt:"ran away 
from home" ) Cat Dog. Unfortunately this gets rather complicated if logical 
operators are involved in the query. Is there some kind of best practice? 
Should we, for example, extend the query parser, or stick to our pre-processing 
approach?


Regards,
Richard.

Re: Replication in SolrCloud

2012-12-03 Thread Arkadi Colson


  
  
Thanks for the explanation. It's clear now...
  
  I expanded the setup to:
  4 hosts with 2 shards and 1 replica for each shard. When I
  shut down Tomcat on solr01-dcg, which is the master of shard 1 for
  both collections, the replica (solr01-gs) seems NOT to
  take over.
  See logs below.
  
  Dec 3, 2012 9:55:34 AM
  org.apache.solr.cloud.ShardLeaderElectionContext
  runLeaderProcess
  INFO: Running the leader process.
  Dec 3, 2012 9:55:34 AM
  org.apache.solr.cloud.ShardLeaderElectionContext
  shouldIBeLeader
  INFO: Checking if I should try and be the leader.
  Dec 3, 2012 9:55:34 AM
  org.apache.solr.cloud.ShardLeaderElectionContext
  shouldIBeLeader
  INFO: My last published State was Active, it's okay to be the
  leader.
  Dec 3, 2012 9:55:34 AM
  org.apache.solr.cloud.ShardLeaderElectionContext
  runLeaderProcess
  INFO: I may be the new leader - try and sync
  Dec 3, 2012 9:55:34 AM org.apache.solr.cloud.SyncStrategy sync
  INFO: Sync replicas to http://solr01-gs:8983/solr/intradesk/
  Dec 3, 2012 9:55:34 AM org.apache.solr.update.PeerSync sync
  INFO: PeerSync: core=intradesk url=http://solr01-gs:8983/solr
  START replicas=[http://solr01-dcg:8983/solr/intradesk/]
  nUpdates=100
  Dec 3, 2012 9:55:34 AM org.apache.solr.update.PeerSync sync
  INFO: PeerSync: core=intradesk url=http://solr01-gs:8983/solr
  DONE. We have no versions. sync failed.
  Dec 3, 2012 9:55:34 AM org.apache.solr.common.SolrException
  log
  SEVERE: Sync Failed
  Dec 3, 2012 9:55:34 AM
  org.apache.solr.cloud.ShardLeaderElectionContext
  rejoinLeaderElection
  INFO: There is a better leader candidate than us - going back
  into recovery
  Dec 3, 2012 9:55:35 AM
  org.apache.solr.update.DefaultSolrCoreState doRecovery
  INFO: Running recovery - first canceling any ongoing recovery
  Dec 3, 2012 9:55:35 AM org.apache.solr.cloud.RecoveryStrategy
  run
  INFO: Starting recovery process. core=intradesk
  recoveringAfterStartup=false
  Dec 3, 2012 9:55:35 AM org.apache.solr.cloud.RecoveryStrategy
  doRecovery
  INFO: Attempting to PeerSync from
  http://solr01-dcg:8983/solr/intradesk/ core=intradesk -
  recoveringAfterStartup=false
  Dec 3, 2012 9:55:35 AM org.apache.solr.update.PeerSync sync
  INFO: PeerSync: core=intradesk url=http://solr01-gs:8983/solr
  START replicas=[http://solr01-dcg:8983/solr/intradesk/]
  nUpdates=100
  Dec 3, 2012 9:55:35 AM org.apache.solr.update.PeerSync sync
  WARNING: no frame of reference to tell of we've missed updates
  Dec 3, 2012 9:55:35 AM org.apache.solr.cloud.RecoveryStrategy
  doRecovery
  INFO: PeerSync Recovery was not successful - trying
  replication. core=intradesk
  Dec 3, 2012 9:55:35 AM org.apache.solr.cloud.RecoveryStrategy
  doRecovery
  INFO: Starting Replication Recovery. core=intradesk
  Dec 3, 2012 9:55:35 AM
  org.apache.solr.client.solrj.impl.HttpClientUtil createClient
  INFO: Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
  Dec 3, 2012 9:55:35 AM org.apache.solr.common.SolrException
  log
  SEVERE: Error while trying to recover.
  core=intradesk:org.apache.solr.client.solrj.SolrServerException:
  Server refused connection at: http://solr01-dcg:8983/solr
   at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
   at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
   at
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:199)
   at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:388)
   at
  org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:220)
  Caused by: org.apache.http.conn.HttpHostConnectException:
  Connection to http://solr01-dcg:8983 refused
   at
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
   at
org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
   at
org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
   at

Re: Replication in SolrCloud

2012-12-03 Thread Arkadi Colson


Never mind, I think I found it.

There must be some documents in each shard so they have a version 
number. Then everything seems to work...


On 11/30/2012 04:57 PM, Mark Miller wrote:

Thanks for all the detailed info!

Yes, that is confusing. One of the sore points we have while supporting both 
std Solr and SolrCloud mode.

In SolrCloud, every node is a Master when thinking about std Solr replication. 
However, as you see on the cloud page, only one of them is a *leader*. A leader 
is different than a master.

Being a Master when it comes to the replication handler simply means you can 
replicate the index to other nodes - in SolrCloud we need every node to be 
capable of doing that. Each shard only has one leader, but every node in your 
cluster will be a replication master.

- Mark


On Nov 30, 2012, at 10:32 AM, Arkadi Colson ark...@smartbit.be wrote:


This is my setup for solrCloud 4.0 on Tomcat 7.0.33 and zookeeper 3.4.5

hosts:
- solr01-dcg (first started)
- solr01-gs (started second, so it becomes the replica)

collections:
- smsc

shards:
- mydoc

zookeeper:
- on solr01-dcg
- on solr01-gs

SOLR_OPTS=-Dsolr.solr.home=/opt/solr/ -Dport=8983 -Dcollection.configName=smsc 
-DzkClientTimeout=2 -DzkHost=solr01-dcg:2181,solr01-gs:2181

solr.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
   <cores adminPath="/admin/cores" zkClientTimeout="2" hostPort="8983">
     <core schema="schema.xml" shard="shard1" instanceDir="/solr/mydoc/" name="mydoc"
           config="solrconfig.xml" collection="mydoc"/>
   </cores>
</solr>

I upload the config to zookeeper:
java -classpath .:/usr/local/tomcat/webapps/solr/WEB-INF/lib/* 
org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost 
solr01-dcg:2181,solr01-gs:2181 -confdir /opt/solr/conf -confname smsc

Linking the config to the collection:
java -classpath .:/usr/local/tomcat/webapps/solr/WEB-INF/lib/* 
org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection mydoc -zkhost 
solr01-dcg.intnet.smartbit.be:2181,solr01-gs.intnet.smartbit.be:2181 -confname 
smsc

cloud on both hosts:

dcddagii.png

solr01-dcg

hhfgdeab.png

solr01-gs:

daafhdef.png
Any idea?

Thanks!

On 11/30/2012 03:15 PM, Mark Miller wrote:

On Nov 30, 2012, at 5:08 AM, Arkadi Colson ark...@smartbit.be
  wrote:



Hi

I've set up a simple 2-machine cloud with 1 shard, one replica and 2 
collections. Everything went fine. However, when I look at the interface,
http://localhost:8983/solr/#/coll1/replication
  is reporting that both machines are master. Did I do something wrong in my 
config, or is it a report for manual replication configuration? Can someone else 
check this?


How? You don't really give anything to look at :)



Is it possible to link 2 collections to the same conf in zookeeper?



Yes, that is no problem.

- Mark









--
Kind regards

Arkadi Colson

Smartbit bvba . Hoogstraat 13 . 3670 Meeuwen
T +32 11 64 08 80 . F +32 11 64 08 81

Re: News clustering

2012-12-03 Thread Stanislaw Osinski

One of our clients uses Solr's search results clustering for grouping news.
Instead of the default Carrot2 algorithm that ships with Solr they use a
commercial one, but Carrot2 should give you decent clusters too. Here's an
example clustering result:

http://imagebin.org/238001

Staszek

--
Stanislaw Osinski
http://carrotsearch.com

On Fri, Nov 30, 2012 at 4:44 PM, Jorge Luis Betancourt Gonzalez 
jlbetanco...@uci.cu wrote:

 Hi all:

 I'm thinking of using Nutch combined with Solr to index some news sites on
 an intranet, and I was wondering how effective the clustering component
 would be for clustering the search results. Are there any success stories
 of using the Solr clustering component for news clustering? Is there any
 existing solution for clustering/classification at index time?

 Greetings!
 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSITY OF INFORMATICS
 SCIENCES...
 CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

 http://www.uci.cu
 http://www.facebook.com/universidad.uci
 http://www.flickr.com/photos/universidad_uci

Re: behavior of solr.KeepWordFilterFactory

2012-12-03 Thread Joe Zhang

Across-the-board case-sensitive indexing is not what I want...

Let me make sure I understand your suggestion:

   <fieldType name="text1" class="solr.TextField"
              positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

   <fieldType name="text2" class="solr.TextField"
              positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
     </analyzer>
   </fieldType>


And define content1 as text1, content2 as text2?

Re: News clustering

2012-12-03 Thread Iwan Hanjoyo

Hi Stanislaw Osinski,



Re: News clustering

2012-12-03 Thread Iwan Hanjoyo

Hi Stanislaw Osinski,

Was the picture generated using the Lingo 3G algorithm?
I saw some sub-clusters inside it.
Nice pic :)

I am interested in learning it.
How long is the Lingo 3G trial period?

Is there any way to programmatically measure the performance of the Carrot2
clustering algorithm?
Thanks

cheers

Hanjoyo


Re: How to change Solr UI

2012-12-03 Thread Iwan Hanjoyo

Hi Romita,

In my opinion, if you are new to Solr, you can start by learning from Solritas.
Solritas uses Apache Velocity (a templating language), CSS and jQuery to
manage its look and behavior.
Besides that, you can write a custom SearchComponent for the /browse
SearchHandler
to add more functionality to your search application.

Kind regards,

Hanjoyo


Re: News clustering

2012-12-03 Thread Stanislaw Osinski

 Was the picture generated using the Lingo 3G algorithm?
 I saw some sub-clusters inside it.
 Nice pic :)


That is correct.


I am interested to learn it.
 How long is the Lingo 3G trial period?


I'll send you the details in a private e-mail in a second.



 Is there any way to programmatically measure the performance of Carrot2
 clustering algorithm?


I'm not sure what you mean by performance. Measuring clustering time is
pretty straightforward; measuring the quality of clusters is not, since a lot
depends on your specific data and application.

Staszek

Whole Phrase search in Solr

2012-12-03 Thread NickA

Hello,

I am trying to achieve searching with a phrase in SOLR. Specifically I have
the following field in my schema:

   <field name="search_field" type="phrase_search" indexed="true"
          stored="false" multiValued="true"/>

<fieldType name="phrase_search" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Also (as a second, similar problem), in “synonyms.txt” I have values like
these:

aword => a whole phrase

and I even tried:

aword => "a whole phrase"

Now I tried searching for “check this” in several ways:

fq=search_field:check this
fq=search_field:check+this
fq=search_field:"check this"
fq=search_field:'check this'

but in all cases the search seems to run for “check OR this”!

Similarly, if I search for “aword”, which matches the synonyms file, the
search also looks for “a OR whole OR phrase”.

What am I doing wrong? Is there any way to force the query for the whole
phrase and not for each word separately?
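
One thing worth ruling out is the quoting and URL-encoding of the filter query itself. The Python sketch below shows one way to build the request parameters so that the phrase reaches Solr inside double quotes; `phrase_filter_params` is a hypothetical helper, and whether the quoted phrase then matches anything still depends on the index-time and query-time analyzers of the field.

```python
from urllib.parse import urlencode


def phrase_filter_params(field, phrase, q="*:*"):
    """Build URL-encoded parameters with the phrase wrapped in double quotes."""
    return urlencode({"q": q, "fq": f'{field}:"{phrase}"'})


params = phrase_filter_params("search_field", "check this")
print(params)
```

The resulting string can be appended to the select URL after a `?`; the double quotes are sent as `%22` and the space as `+`, so the phrase is not split into separate terms before it reaches the query parser.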




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Whole-Phrase-search-in-Solr-tp4023931.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Luke and SOLR search giving different results

2012-12-03 Thread Erol Akarsu

Jack,

Thanks for help.

I removed the data folder of SOLR and indexed this sample doc from scratch,
so there is only this one document in SOLR.

When I ran the analysis, I could see that stemming is correct, and I can see
the four words bul, baş, gör and umut in the SF row.
I attached the analysis screens.

Erol Akarsu

On Sun, Dec 2, 2012 at 11:00 PM, Jack Krupansky j...@basetechnology.com wrote:

 Have you tried using the Solr Admin Analysis page, using the word and a
 few words of context for index analysis and the word alone for query
 analysis?

 And be sure to fully reindex if you change ANYTHING in the schema fields
 or field types.

 -- Jack Krupansky

 From: Erol Akarsu
 Sent: Sunday, December 02, 2012 10:38 PM
 To: solr-user@lucene.apache.org
 Subject: Luke and SOLR search giving different results

 Hi,

 I am trying to apply SOLR to the Turkish language for my research.

 Instead of using language identification, I manually assigned the Turkish
 language to a sample test document. I have configured the SOLR schema.xml and
 activated the part below. I have added the attached document
 testTurkishDoc.xml, which is inserted into the SOLR database.

 But searching the raw Lucene index through Luke and searching through the
 SOLR 4.0 GUI give different results. In picture Selection_006.png, the word
 baş is listed as the top term. I searched for the word baş in Luke and got the
 only document as the result, shown in Selection_004.png.

 But in the SOLR GUI, I get an empty result for the word baş, as shown in
 picture Selection_002.png.

 In the text we have the features field, which contains the word baştan, derived
 from the root word baş in Turkish grammar. Somehow, the SOLR GUI is searching
 differently from Luke. I could not figure out why I cannot find it in SOLR
 while finding it in Luke. The same thing happens for the words umut, bul
 and gör.

 I would appreciate it if you could help me get the same results from the SOLR UI.


 <field name="features">
Firmalarsa Nasılsa buldum oynatacak ünlüyü, neyleyim senaryoyu!
 diyerek baştan savma reklamlarla kotarmaya bakıyor işi. Futbolcu Arda Turan
 ve büyük umutlarla Türkiye'ye getirilen Paris Hilton'un oynatıldığı giyim
 firması reklamı da tam bir fiyasko. Birbirinden ünlü bu iki ismin oynadığı
 reklam Arda'nın kabinde papağan gibi tekrarladığı My darling! repliği,
 sonunda Paris'i görünce anlam veremediğimiz uyduruk bayılma sahnesi, bir de
 Paris'in ancak 5 kez izledikten sonra anlaşılan Paris seçti, firma yaptı,
 Arda bayıldı. sözleriyle kazındı hafızalara, Keşke unutabilsek!
 dedirterek.
 </field>



 Added to schema.xml for SOLR:

 <field name="features" type="text_tr" indexed="true" stored="true"
        multiValued="true"/>
 <fieldType name="text_tr" class="solr.TextField"
            positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
     <filter class="solr.SnowballPorterFilterFactory"
             language="Turkish"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
     <filter class="solr.SnowballPorterFilterFactory"
             language="Turkish"/>
   </analyzer>
 </fieldType>

Backing up SolR 4.0

2012-12-03 Thread Andy D'Arcy Jewell


Hi all.

I'm new to SolR, and I have recently had to set up a SolR server running 
4.0.


I've been searching for info on backing it up, but all I've managed to 
come up with is "it'll be different", "you'll be able to do push 
replication", or using HTTP and the command=backup parameter, which 
doesn't sound like it will be effective for a production setup (unless 
I've got that wrong)...



I was wondering if I can just stop or suspend the SolR server, then take 
an LVM snapshot of the data store before bringing it back online, but 
I'm not sure if that will cut it. I gather merely rsyncing the data 
files won't do...


Can anyone give me a pointer to that easy-to-find document I have so 
far failed to find? Or failing that, maybe some sound advice on how to 
proceed?


Regards,
-Andy




--
Andy D'Arcy Jewell

SysMicro Limited
Linux Support
E:  andy.jew...@sysmicro.co.uk
W:  www.sysmicro.co.uk

Re: News clustering

2012-12-03 Thread Iwan Hanjoyo

Hi Stanislaw,

I mean measuring the similarity between the documents in each cluster,
and also the difference between the documents in one cluster and those in
another.

I saw the sample code ClusteringQualityBenchmark.java.
However, I do not know how to use it to assess my Solr clustering
performance.

Kind regards,

Hanjoyo

On Mon, Dec 3, 2012 at 8:11 PM, Stanislaw Osinski stanis...@osinski.name wrote:

  Was the picture generated using the Lingo 3G algorithm?
  I saw some sub-clusters inside it.
  Nice pic :)
 

 That is correct.


 I am interested to learn it.
  How long is the Lingo 3G trial period?
 

 I'll send you the details in a private e-mail in a second.



  Is there any way to programmatically measure the performance of Carrot2
  clustering algorithm?
 

 I'm not sure what you mean by performance. Measuring clustering time is
 pretty straightforward, measuring the quality of clusters is not, a lot
 depends on your specific data and application.

 Staszek

PHP client

2012-12-03 Thread Arkadi Colson


Hi

Has anyone tested the PECL Solr client in combination with SolrCloud? It 
seems to be broken since 4.0.


Best regards
Arkadi

Re: PHP client

2012-12-03 Thread Bill Au

https://bugs.php.net/bug.php?id=62332

There is a fork with patches applied.


On Mon, Dec 3, 2012 at 9:38 AM, Arkadi Colson ark...@smartbit.be wrote:

 Hi

 Has anyone tested the PECL Solr client in combination with SolrCloud? It
 seems to be broken since 4.0.

 Best regards
 Arkadi

Re: AW: Edismax query parser and phrase queries

2012-12-03 Thread Jack Krupansky

Okay, so the bottom line here is that you wish to change the semantics of 
quoted phrases. Fine, that's your prerogative, but a change in semantics 
would require a change to the query parser, or as you originally indicated, 
a pre-processor. It does sound as if a pre-processor is the way to go here.


You still have a choice: An application-level preprocessor that generates an 
edismax query, or implement a Solr SearchComponent that pre-processes the 
query after Solr receives it but before edismax sees it. The former is 
probably easier. The only question is whether there might be multiple 
applications that access the same Solr node, so that maybe centralizing the 
pre-processing in Solr might be warranted.
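
The application-level option could look roughly like the Python sketch below. It is only an illustration under stated assumptions: phrases are marked with plain double quotes, the exact-match fields are named titleExact and textExact as in the example further down the thread, and the helper name expand_phrases is made up. It does not handle fielded phrases, escaped quotes or nested boolean logic, which is where the real complexity lies.

```python
import re

# Field names taken from the thread's example; adjust them to your schema.
EXACT_FIELDS = ("titleExact", "textExact")

# Matches a double-quoted phrase with no embedded quote characters.
PHRASE = re.compile(r'"([^"]+)"')


def expand_phrases(q, fields=EXACT_FIELDS):
    """Rewrite each quoted phrase as an OR over the exact-match fields."""
    def replace(match):
        phrase = match.group(1)
        clauses = " OR ".join('{}:"{}"'.format(f, phrase) for f in fields)
        return "( " + clauses + " )"
    return PHRASE.sub(replace, q)


print(expand_phrases('"ran away from home" Cat Dog'))
# ( titleExact:"ran away from home" OR textExact:"ran away from home" ) Cat Dog
```

Terms outside quotes pass through untouched, so edismax still applies the normal qf fields to them.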


-- Jack Krupansky

-Original Message- 
From: Tantius, Richard

Sent: Monday, December 03, 2012 5:03 AM
To: solr-user@lucene.apache.org
Subject: AW: Edismax query parser and phrase queries

Hi,
the use case we have in mind is that we would like to achieve exact matches 
for explicit phrases. Our users expect an explicit phrase to respect not only 
the order of terms, but also the exact wording. Therefore, if we search on 
fields whose data type is not meant for exact matching, we need to change 
that for explicit phrases. In a usual query our qf default fields use 
advanced tokenization (for query processing and indexing), for example 
stemming via SnowballPorterFilterFactory. So our idea was to change the 
default search fields for explicit phrases to achieve exact matches, by 
using a simple data type such as "string" (StrField, without advanced 
options).


Extending our example from the last mail:

qf=title text

Datatype of title, text, something like “text_advanced”:

<fieldtype ...>
<analyzer type="index"> <!--(and also analyzer type="query")-->
 <filter class="solr.WordDelimiterFilterFactory" ... />
 <filter class="solr.LowerCaseFilterFactory" />
 <filter class="solr.SnowballPorterFilterFactory" language="German2" />
...

Data type of the additional fields titleExact, textExact:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" 
omitNorms="true"/>


q="ran away from home" Cat Dog

-transformTo-

q=( titleExact:"ran away from home" OR textExact:"ran away from home" ) Cat 
Dog.


Regards,
Richard.

BINSERV
Gesellschaft für interaktive Konzepte und neue Medien mbH
Software Engineer

Gotenstr. 7-9
53175 Bonn
Tel.: +49 (0)228 / 4 22 86 - 38
Fax.: +49 (0)228 / 4 22 86 - 538
E-Mail:   r.tant...@binserv.de
Web:  www.binserv.de
   www.binforcepro.de

Managing director: Rüdiger Jakob
District court: Siegburg HRB 6765
Registered office: Pfarrer-Wichert-Str. 35, 53639 Königswinter



- Original message -
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, November 30, 2012 23:04
To: solr-user@lucene.apache.org
Subject: Re: Edismax query parser and phrase queries

I don’t have a simple answer for your stated issue, but maybe part of that 
is because I’m not so sure what the exact problem/goal is. I mean, what’s so 
special about phrase queries for your app that they need distinct processing 
from individual terms?


And, ultimately, what goal are you trying to achieve? For instance, how will 
the outcome of the query affect what users see and do?


-- Jack Krupansky

From: Tantius, Richard
Sent: Friday, November 30, 2012 8:44 AM
To: solr-user@lucene.apache.org
Subject: Edismax query parser and phrase queries

Hi,

we are using the edismax query parser and execute queries on specific fields 
by using the qf option. Like others, we are facing the problem that we do 
not want explicit phrase queries to be performed on some of the qf fields, 
and we also require additional search fields for those kinds of queries.


We tried to expand explicit phrases in a query by implementing some 
pre-processing logic, which did not seem to be very convenient.


So for example (let's assume qf=title text, and we want phrase queries to be 
performed on the additional fields titleAlt and textAlt): q="ran away from 
home" Cat Dog -transformTo- q=( titleAlt:"ran away from home" OR 
textAlt:"ran away from home" ) Cat Dog. Unfortunately this gets rather 
complicated if logical operators are involved within the query. Is there 
some kind of best practice? Should we, for example, extend the query parser, 
or stick to our pre-processing approach?



Regards,
Richard.

Re: Luke and SOLR search giving different results

2012-12-03 Thread Jack Krupansky

So, does that highlight the problem for you or not? Is the term analyzed as you 
expected?

-- Jack Krupansky

From: Erol Akarsu 
Sent: Monday, December 03, 2012 8:44 AM
To: solr-user@lucene.apache.org 
Subject: Re: Luke and SOLR search giving different results

Jack,

Thanks for help.

I removed the data folder of SOLR and indexed this sample doc from scratch, 
so this is the only document in SOLR. 

When I analysed it, I could see that the stemming is correct, and I can see 
"bul", "baş", "gör" and "umut" in the SF row.
I attached the analysis screens.

Erol Akarsu

On Sun, Dec 2, 2012 at 11:00 PM, Jack Krupansky j...@basetechnology.com wrote:

  Have you tried using the Solr Admin Analysis page, using the word and a few 
words of context for index analysis and the word alone for query analysis?

  And be sure to fully reindex if you change ANYTHING in the schema fields or 
field types.

  -- Jack Krupansky

  From: Erol Akarsu
  Sent: Sunday, December 02, 2012 10:38 PM
  To: solr-user@lucene.apache.org
  Subject: Luke and SOLR search giving different results

  Hi,

  I am trying to apply SOLR to the Turkish language for my research.

  Instead of using language identification, I manually assigned Turkish as 
  the language of a sample test document. I configured the SOLR schema.xml 
  and activated the part below, then added the attached document 
  testTurkishDoc.xml to the SOLR index.

  However, searching the raw Lucene index through Luke and searching through 
  the SOLR 4.0 GUI give different results. In picture Selection_006.png, the 
  word "baş" is listed as the top term. When I search for "baş" in Luke, I 
  get the only document as the result, as shown in Selection_004.png.

  But in the SOLR GUI, I get an empty result for "baş" (picture 
  Selection_002.png).

  The features field in the text contains the word "baştan", which is 
  derived from the root word "baş" in Turkish grammar. Somehow, the SOLR GUI 
  searches differently from Luke. I could not figure out why SOLR does not 
  find the document while Luke does. The same thing happens for the words 
  "umut", "bul" and "gör".

  I would appreciate it if you could help me get the same results from the 
  SOLR UI.

  <field name="features">
  Firmalarsa “Nasılsa buldum oynatacak ünlüyü, neyleyim senaryoyu!” 
  diyerek baştan savma reklamlarla kotarmaya bakıyor işi. Futbolcu Arda Turan ve 
  büyük umutlarla Türkiye’ye getirilen Paris Hilton’un oynatıldığı giyim firması 
  reklamı da tam bir fiyasko. Birbirinden ünlü bu iki ismin oynadığı reklam 
  Arda’nın kabinde papağan gibi tekrarladığı “My darling!” repliği, sonunda 
  Paris’i görünce anlam veremediğimiz uyduruk bayılma sahnesi, bir de Paris’in 
  ancak 5 kez izledikten sonra anlaşılan “Paris seçti, firma yaptı, Arda 
  bayıldı.” sözleriyle kazındı hafızalara, “Keşke unutabilsek!” dedirterek.
  </field>

  Added to schema.xml for SOLR:

  <field name="features" type="text_tr" indexed="true" stored="true" 
         multiValued="true"/>
  <fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.TurkishLowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" 
              words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.TurkishLowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" 
              words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
    </analyzer>
  </fieldType>

Re: Solr 4: Join Query

2012-12-03 Thread Erick Erickson

Not that I know of. Also, your performance will be much better if you can
denormalize the data.
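
As a rough illustration of what denormalizing could mean here, the Python sketch below copies the child document's content into each parent record before indexing, so a single query on emp_id returns everything without a join. The record layout follows the example quoted below; the field name child_content and the helper denormalize are hypothetical, and the trade-off is that shared child content gets stored once per parent.

```python
def denormalize(parents, children_by_id):
    """Return one flat document per parent with the child's content copied in."""
    docs = []
    for parent in parents:
        child = children_by_id.get(parent["doc_id"], {})
        doc = dict(parent)  # copy so the input records are not modified
        # "child_content" is a made-up field name for this sketch.
        doc["child_content"] = child.get("doc_content")
        docs.append(doc)
    return docs


parents = [{"id": "ABC", "emp_id": "EMP001", "doc_id": "DOC001",
            "doc_content": "My Parent Doc"}]
children = {"DOC001": {"id": "DOC001", "doc_content": "My Document Data"}}

flat = denormalize(parents, children)
print(flat[0]["child_content"])
# My Document Data
```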


On Mon, Dec 3, 2012 at 12:44 AM, Vikash Sharma vikash0...@gmail.com wrote:

 Hi Erick,
 One more thing: So is there any other way to get the result?
 I mean, I need to get both parent and child document in/not nested format.

 Regards,
 Vikash

 Regards,
 Vikash Sharma
 vikash0...@gmail.com


 On Sat, Dec 1, 2012 at 10:29 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  That's the way joins work, and why they're called "pseudo joins": they
  don't work like DB joins and return data from both records.
 
  Joins were put in for a specific use case. When you try to treat Solr
  like a DB you're bound to be disappointed. I'd think about reworking the
  solution to de-normalize the data so you don't have to do joins.
 
  Best
  Erick
 
 
  On Fri, Nov 30, 2012 at 10:38 AM, Vikash Sharma vikash0...@gmail.com
  wrote:
 
   Hi All,
   I have my field definition in schema.xml like below
  
   field name=id type=string indexed=true. /
   field name=Emp_id type=string indexed=true. /
   field name=doc_id type=string indexed=true. /
   field name=content type=string indexed=true. /
  
  
   I need to create separate record in solr for each parent child
   relationship... such that if child is same across different parent that
  it
   gets stored only once.
  
   For e.g.
   --- Record 1
   <id>ABC</id>
   <emp_id>EMP001</emp_id>
   <doc_id>DOC001</doc_id>
   <doc_content>My Parent Doc</doc_content>
  
   --- Record 2
   <id>DOC001</id>
   <emp_id></emp_id>
   <doc_id></doc_id>
   <doc_content>My Document Data</doc_content>
  
  
   This will ensure that if any doc_id content is duplicated, the record
   is inserted into Solr only once.
  
   Lastly, I want the result as a join: if emp_id=EMP001, then both records
   should be returned, as there is a relationship between the two records
   via doc_id = id.
  
   If I query:
  
   http://localhost:8983/solr/select?q={!join%20from=doc_id%20to=id}emp_id:EMP001&wt=json
  
   http://localhost:8983/solr/select?q={!join%20from=sha_one%20to=id}project_id:10&wt=json
  
   I expect both records to be returned, either one after another or
   nested.
   But I only get child records...
  
  
   Please help..
  
  
  
   Regards,
   Vikash Sharma
   vikash0...@gmail.com

Re: How to change Solr UI

2012-12-03 Thread Erick Erickson

Adding to what Iwan said, I want to be sure you're not confusing
prototyping with a full-fledged application. The Velocity code included is
mostly intended as a rapid-prototyping vehicle. There are significant
security issues if you try to use it as your user-facing application, so be
sure you trust your users if you go down this route.

But to change it, see the Apache Velocity project, and the code in
<solr home>/conf/velocity.

Note that Velocity _can_ be used for user-facing code, but be very sure you
secure your Solr. If you allow direct access, a user can easily enter
something like
http://solr/update?commit=true&stream.body=<delete><query>*:*</query></delete>
and all your documents will be gone.

Most installations use a middle layer between Solr and the user that
controls access.
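
A very small sketch of such a middle layer, in Python, might look like the following. The Solr URL, the parameter whitelist and the function name are all assumptions made for illustration. The idea is simply that the application builds the Solr request itself, talks only to the select handler, and drops every parameter it does not explicitly allow, while Solr itself is not reachable from outside.

```python
from urllib.parse import urlencode

# Assumptions for illustration: both values are hypothetical.
SOLR_SELECT = "http://localhost:8983/solr/select"
ALLOWED_PARAMS = ("q", "start", "rows")


def build_select_url(user_params):
    """Forward only whitelisted parameters, and only to the select handler."""
    safe = [(k, user_params[k]) for k in ALLOWED_PARAMS if k in user_params]
    safe.append(("wt", "json"))  # response format is fixed by the application
    return SOLR_SELECT + "?" + urlencode(safe)


url = build_select_url({
    "q": "solr",
    "stream.body": "<delete><query>*:*</query></delete>",  # silently dropped
})
print(url)
# http://localhost:8983/solr/select?q=solr&wt=json
```

A whitelist like this is only a starting point; the value of q would still need validating in a real application.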

Best
Erick


On Mon, Dec 3, 2012 at 5:01 AM, Iwan Hanjoyo ihanj...@gmail.com wrote:

 Hi Romita,

 In my opinion, if you are new to Solr, you can start learning from
 Solritas.
 Solritas uses Apache Velocity (a templating language), CSS and jQuery to
 manage its look and behavior.
 Besides that you can write a custom SearchComponent inside the /browse
 SearchHandler
 to add more functionality to your search application.

 Kind regards,

 Hanjoyo

 On Mon, Dec 3, 2012 at 4:35 PM, Romita Saha romita.s...@sg.panasonic.com
 wrote:

  Hi,
 
  I want to change the Solr UI. As far as i understand, Solritas is just
 for
  prototyping, where I can change the UI according to a predefined template
  (Velocity) and cannot add on any additional functionality to that page.
  How can I change the Solr UI otherwise. Any guidance would be
 appreciated.
 
  Thanks and regards,
  Romita

Re: Luke and SOLR search giving different results

2012-12-03 Thread Erol Akarsu

Jack,

Yes.

I expect SOLR to give the same search results as Luke does.

The term analyzer gives the correct answer in SOLR, as expected. But SOLR
does not return the correct search results.

I don't know why.

Erol Akarsu

On Mon, Dec 3, 2012 at 11:21 AM, Jack Krupansky j...@basetechnology.com wrote:

 So, does that highlight the problem for you or not? Is the term analyzed
 as you expected?

 -- Jack Krupansky

 From: Erol Akarsu
 Sent: Monday, December 03, 2012 8:44 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Luke and SOLR search giving different results

 Jack,

 Thanks for help.

 I removed data folder  of SOLR and indexed this sample doc from scratch,
 there was no document in SOLR but only one.

 When I analysed , I can see stemming is correct and I can see these for
 words bul, baş ,gör and umut in SF row
 I attached analyse screens

 Erol Akarsu


 On Sun, Dec 2, 2012 at 11:00 PM, Jack Krupansky j...@basetechnology.com
 wrote:

   Have you tried using the Solr Admin Analysis page, using the word and a
 few words of context for index analysis and the word alone for query
 analysis?

   And be sure to fully reindex if you change ANYTHING in the schema fields
 or field types.

   -- Jack Krupansky

   From: Erol Akarsu
   Sent: Sunday, December 02, 2012 10:38 PM
   To: solr-user@lucene.apache.org
   Subject: Luke and SOLR search giving different results


   Hi,

   I am trying to apply SOLR for Turkish Language for my research.

   Instead of using language identification, I manually assigned Turkish
 language for a sample test document. I have configured SOLR schema.xml,
 activated the part below. I have added the attached document
 testTurkishDoc.xml that is inserted to SOLR database.

   But searching for raw Lucene index through Luke and SOLR 4.0 search
 though GUI is giving different results. In picture Selection_006.png, the
 word baş is listed as top term. I search the word baş in Luke and I got
 the result result that is only document, shown in Selection_004.png.

   But in SOLR GUI, I am getting empty result for word baş in picture
 Selection_002.png.

   In the text we have  features field, that has word baştan that is
 being derived from root word baş in Turkish Grammar. Somehow, SOLR GUI is
 doing search different than Luke. I could not figure it out why I could not
 find it while getting in Luke. The same thing happens for words umut,
 bul and gör.

   I will appreciate if you can help me to get same results from SOLR UI.


   <field name="features">
   Firmalarsa “Nasılsa buldum oynatacak ünlüyü, neyleyim senaryoyu!”
   diyerek baştan savma reklamlarla kotarmaya bakıyor işi. Futbolcu Arda Turan
   ve büyük umutlarla Türkiye'ye getirilen Paris Hilton'un oynatıldığı giyim
   firması reklamı da tam bir fiyasko. Birbirinden ünlü bu iki ismin oynadığı
   reklam Arda'nın kabinde papağan gibi tekrarladığı “My darling!” repliği,
   sonunda Paris'i görünce anlam veremediğimiz uyduruk bayılma sahnesi, bir de
   Paris'in ancak 5 kez izledikten sonra anlaşılan “Paris seçti, firma yaptı,
   Arda bayıldı.” sözleriyle kazındı hafızalara, “Keşke unutabilsek!”
   dedirterek.
   </field>

   Added to schema.xml for SOLR:

   <field name="features" type="text_tr" indexed="true" stored="true"
          multiValued="true"/>
   <fieldType name="text_tr" class="solr.TextField"
              positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.TurkishLowerCaseFilterFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
       <filter class="solr.SnowballPorterFilterFactory"
               language="Turkish"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.TurkishLowerCaseFilterFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
       <filter class="solr.SnowballPorterFilterFactory"
               language="Turkish"/>
     </analyzer>
   </fieldType>

Re: Whole Phrase search in Solr

2012-12-03 Thread NickA

Thank you Jack,

the problem with the AND is that it does not search for a PHRASE but for
the 2 words being SOMEWHERE in the article.

For example, "Check this" will NOT search for "Check this" as a PHRASE
but for the word "Check" and the word "this" somewhere in the article, even
far away from each other.

So the suggestions that you made do not work for searching as a PHRASE.

Unless we do something wrong?

Any other ideas on the PHRASE search?

Thank you again!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Whole-Phrase-search-in-Solr-tp4023931p4024029.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Backing up SolR 4.0

2012-12-03 Thread Erick Erickson

There's no real need to do what you ask.

First thing is that you should always be prepared, in the worst-case
scenario, to regenerate your entire index.

That said, perhaps the easiest way to back up Solr is just to use
master/slave replication. Consider having a machine that's a slave to the
master (but not necessarily searched against) and periodically poll your
master (say daily or whatever your interval is). You can configure Solr to
keep N copies of the index as extra insurance. These will be fairly static
so if you _really_ wanted to you could just copy the <solrhome>/data
directory somewhere, but I don't know if that's necessary.

See: http://wiki.apache.org/solr/SolrReplication
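
For the HTTP route mentioned in the question (command=backup on the replication handler), a request URL can be assembled as in the Python sketch below. This is a hedged illustration: the base URL is an assumption, and the location and numberToKeep parameters are used as I understand them from the replication wiki page above, so check them against your Solr version before relying on this. The sketch only builds the URL; it does not call Solr.

```python
from urllib.parse import urlencode


def backup_url(base, location=None, number_to_keep=None):
    """Build a replication-handler backup URL for the given Solr base URL."""
    params = [("command", "backup")]
    if location is not None:
        params.append(("location", location))  # directory for the snapshot
    if number_to_keep is not None:
        params.append(("numberToKeep", str(number_to_keep)))
    return base.rstrip("/") + "/replication?" + urlencode(params)


# Assumed base URL and backup directory, for illustration only.
print(backup_url("http://localhost:8983/solr", "/var/backups/solr", 3))
# http://localhost:8983/solr/replication?command=backup&location=%2Fvar%2Fbackups%2Fsolr&numberToKeep=3
```

Snapshots written this way are ordinary directories, so they could then be copied to tape or other storage by whatever tooling is already in place.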

Best
Erick


On Mon, Dec 3, 2012 at 6:07 AM, Andy D'Arcy Jewell 
andy.jew...@sysmicro.co.uk wrote:

 Hi all.

 I'm new to SolR, and I have recently had to set up a SolR server running
 4.0.

 I've been searching for info on backing it up, but all I've managed to
 come up with is it'll be different or you'll be able to do push
 replication or using http and the command=backup parameter, which doesn't
 sound like it will be effective for a production setup (unless I've got
 that wrong)...


 I was wondering if I can just stop or suspend the SolR server, then do an
 LVM snapshot of the data store, before bringing it back on line, but I'm
 not sure if that will cut it. I gather merely rsyncing the data files won't
 do...

 Can anyone give me a pointer to that easy-to-find document I have so far
 failed to find? Or failing that, maybe some sound advice on how to proceed?

 Regards,
 -Andy




 --
 Andy D'Arcy Jewell

 SysMicro Limited
 Linux Support
 E:  andy.jew...@sysmicro.co.uk
 W:  www.sysmicro.co.uk

Re: Backing up SolR 4.0

2012-12-03 Thread Andy D'Arcy Jewell


On 03/12/12 16:39, Erick Erickson wrote:

There's no real need to do what you ask.

First thing is that you should always be prepared, in the worst-case
scenario, to regenerate your entire index.

That said, perhaps the easiest way to back up Solr is just to use
master/slave replication. Consider having a machine that's a slave to the
master (but not necessarily searched against) and periodically poll your
master (say daily or whatever your interval is). You can configure Solr to
keep N copies of the index as extra insurance. These will be fairly static
so if you _really_ wanted to you could just copy the <solrhome>/data
directory somewhere, but I don't know if that's necessary.

See: http://wiki.apache.org/solr/SolrReplication

Best
Erick

Hi Erick,

Thanks for that, I'll take a look.

However, wouldn't re-creating the index on a large dataset take an 
inordinate amount of time? The system I will be backing up is likely to 
undergo rapid development and thus schema changes, so I need some kind 
of insurance against corruption if we need to roll-back after a change.


How should I go about creating multiple backup versions I can put aside 
(e.g. on tape) to hedge against the down-time which would be required to 
regenerate the indexes from scratch?


Regards,
-Andy

--
Andy D'Arcy Jewell

SysMicro Limited
Linux Support
E:  andy.jew...@sysmicro.co.uk
W:  www.sysmicro.co.uk

Re: AW: Edismax query parser and phrase queries

2012-12-03 Thread Erick Erickson

It _seems_ like just adding phrase fields (pf) to your edismax defaults
gets you close. It would have the problem of matching if the field were
longer... but it might be close enough.

Otherwise, why not just add in fq clauses on your exact fields? Because one
problem you'll have is that you need to get the parameters past the parser
to the field, which will be...er...interesting.

And one note. Rather than String fields (which are case sensitive),
consider KeywordTokenizer and LowerCaseFilter or some such.

But I'd _really_ prove that you can't get close enough with current
functionality before I went down the custom route. Often things like this
seem like a good idea but then don't improve results enough to be worth the
complexity.

Best
Erick


On Mon, Dec 3, 2012 at 8:00 AM, Jack Krupansky j...@basetechnology.com wrote:

 Okay, so the bottom line here is that you wish to change the semantics of
 quoted phrases. Fine, that's your prerogative, but a change in semantics
 would require a change to the query parser, or as you originally indicated,
 a pre-processor. It does sound as if a pre-processor is the way to go here.

 You still have a choice: An application-level preprocessor that generates
 an edismax query, or implement a Solr SearchComponent that pre-processes
 the query after Solr receives it but before edismax sees it. The former is
 probably easier. The only question is whether there might be multiple
 applications that access the same Solr node, so that maybe centralizing the
 pre-processing in Solr might be warranted.

 -- Jack Krupansky

 -Original Message- From: Tantius, Richard
 Sent: Monday, December 03, 2012 5:03 AM
 To: solr-user@lucene.apache.org
 Subject: AW: Edismax query parser and phrase queries


 Hi,
 the use case we have in mind is that we would like to achieve exact
 matches for explicit phrases. Our users expect that an explicit phrase not
 only considers the order of terms, but also the exact wording. Therefore if
 we search on fields using a data type that is not meant performing exact
 matches, we need to change that for explicit phrases. This means in a usual
 query we have qf default fields using advanced tokenization (for query
 processing and indexing), for example like stemming via
 SnowballPorterFilterFactory. So our idea was to change the default search
 fields for explicit phrases to achieve exact matches, by using a simple
 data format like for example “string“ (StrField, without advanced options).

 Extending our example from the last mail:

 qf=title text

 Datatype of title, text, something like “text_advanced”:

 <fieldtype ...>
 <analyzer type="index"> <!--(and also analyzer type="query")-->
  <filter class="solr.WordDelimiterFilterFactory" ... />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.SnowballPorterFilterFactory" language="German2" />
 ...

 Data type of the additional fields titleExact, textExact:
 <fieldType name="string" class="solr.StrField" sortMissingLast="true"
 omitNorms="true"/>

 q="ran away from home" Cat Dog

 -transformTo-

 q=( titleExact:"ran away from home" OR textExact:"ran away from home" )
 Cat Dog.

 Regards,
 Richard.

 BINSERV
 Gesellschaft für interaktive Konzepte und neue Medien mbH
 Software Engineer

 Gotenstr. 7-9
 53175 Bonn
 Tel.: +49 (0)228 / 4 22 86 - 38
 Fax.: +49 (0)228 / 4 22 86 - 538
 E-Mail:   r.tant...@binserv.de
 Web:  www.binserv.de
www.binforcepro.de

 Managing director: Rüdiger Jakob
 District court: Siegburg HRB 6765
 Registered office: Pfarrer-Wichert-Str. 35, 53639 Königswinter


 - Original message -
 From: Jack Krupansky [mailto:j...@basetechnology.com]
 Sent: Friday, November 30, 2012 23:04
 To: solr-user@lucene.apache.org
 Subject: Re: Edismax query parser and phrase queries

 I don’t have a simple answer for your stated issue, but maybe part of that
 is because I’m not so sure what the exact problem/goal is. I mean, what’s
 so special about phrase queries for your app than they need distinct
 processing from individual terms?

 And, ultimately, what goal are you trying to achieve? Such as, how will
 the outcome of the query affect what users see and do.

 -- Jack Krupansky

 From: Tantius, Richard
 Sent: Friday, November 30, 2012 8:44 AM
 To: solr-user@lucene.apache.org
 Subject: Edismax query parser and phrase queries

 Hi,

 we are using the edismax query parser and execute queries on specific
 fields by using the qf option. Like others, we are facing

Re: Whole Phrase search in Solr

2012-12-03 Thread Erick Erickson

As Jack suggested, show the results of adding &debugQuery=on; it'll help us
help you. Particularly with this form: q=search_field:"check this". It
should be doing what you want.
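
To make the suggestion concrete, here is a small Python sketch that builds such a request. It assumes the intended query is a quoted phrase (the archive appears to have dropped the quote characters), and the Solr URL and the helper name are placeholders. The quotes and the space must be URL-encoded, which is easy to get wrong when typing the URL by hand.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # assumed URL


def phrase_debug_url(field, phrase):
    """Build a select URL for a quoted phrase query with debug output on."""
    params = [
        ("q", '{}:"{}"'.format(field, phrase)),  # quotes make it a phrase query
        ("debugQuery", "on"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


print(phrase_debug_url("search_field", "check this"))
# http://localhost:8983/solr/select?q=search_field%3A%22check+this%22&debugQuery=on
```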

Best
Erick


On Mon, Dec 3, 2012 at 8:37 AM, NickA nickathen...@gmail.com wrote:

 Thank you Jack,

 the problem with the AND is that it does not search for a PHRASE but for
 the 2 words being SOMEWHERE in the article.

 For example the Check this will NOT search for Check this as a PHRASE
 but for the Check word and the this word somewhere in the article, even
 far away the one from the other.

 So the suggestions that you made do not work for searching as a PHRASE.

 Unless we do something wrong?

 Any other ideas on the PHRASE search?

 Thank you again!



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Whole-Phrase-search-in-Solr-tp4023931p4024029.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Whole Phrase search in Solr

2012-12-03 Thread Jack Krupansky

If you use the edismax query parser and set the "pf", "pf2", and "pf3" 
fields, your phrases should show up as top results. This will not eliminate 
non-phrase matches, but will ensure that phrase matches get boosted.


See:
http://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29
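
A minimal Python sketch of the request parameters this implies is shown below. The field names title and text are assumptions for illustration, as is the helper name; the parameter names defType, qf, pf, pf2 and pf3 are the edismax ones described on the wiki page above.

```python
from urllib.parse import urlencode, parse_qs


def edismax_params(user_query, fields=("title", "text")):
    """Return edismax parameters that boost phrase matches on the given fields."""
    field_list = " ".join(fields)
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": field_list,   # fields searched term by term
        "pf": field_list,   # boost when the whole query matches as a phrase
        "pf2": field_list,  # boost for adjacent word pairs
        "pf3": field_list,  # boost for adjacent word triples
    }


query_string = urlencode(edismax_params("check this"))
print(query_string)
```

Because pf only boosts, documents containing both words far apart still match; they just rank lower.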

-- Jack Krupansky

-Original Message- 
From: NickA

Sent: Monday, December 03, 2012 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Whole Phrase search in Solr

Thank you Jack,

the problem with the AND is that it does not search for a PHRASE but for
the 2 words being SOMEWHERE in the article.

For example the Check this will NOT search for Check this as a PHRASE
but for the Check word and the this word somewhere in the article, even
far away the one from the other.

So the suggestions that you made do not work for searching as a PHRASE.

Unless we do something wrong?

Any other ideas on the PHRASE search?

Thank you again!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Whole-Phrase-search-in-Solr-tp4023931p4024029.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Downloading files from the solr replication Handler

2012-12-03 Thread Eva Lacy

They are the '\0' character.
What is a marker?

Getting the following with a wget:
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
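
One reading that would be consistent with "skip the first and last 4 bytes" is that the handler frames the file as length-prefixed packets: a 4-byte big-endian size, then that many bytes of data, repeated, and finally a size of 0. That framing is an assumption on my part and is not confirmed anywhere in this thread. Under that assumption, and assuming no checksums were requested, the Python sketch below (with a made-up helper name) recovers the file content, including for files large enough to span several packets.

```python
import struct


def unframe(stream_bytes):
    """Strip assumed length-prefixed framing and return the raw file bytes."""
    out = bytearray()
    pos = 0
    while pos + 4 <= len(stream_bytes):
        (size,) = struct.unpack_from(">i", stream_bytes, pos)  # big-endian int
        pos += 4
        if size <= 0:  # a zero-length packet marks the end of the stream
            break
        out += stream_bytes[pos:pos + size]
        pos += size
    return bytes(out)


# Self-made sample data in the assumed format: two packets plus terminator.
framed = (struct.pack(">i", 5) + b"hello"
          + struct.pack(">i", 3) + b"abc"
          + struct.pack(">i", 0))
print(unframe(framed))
# b'helloabc'
```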


On Fri, Nov 30, 2012 at 4:58 PM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 What mime type you get for binary files? Maybe server is misconfigured for
 that extension and sends them as text. Then they could be the markers.

 Do they look like markers?

 Regards,
 Alex
 On 30 Nov 2012 04:06, Eva Lacy e...@lacy.ie wrote:

  Doesn't make much sense if they are in binary files as well.
 
 
  On Thu, Nov 29, 2012 at 10:16 PM, Lance Norskog goks...@gmail.com
 wrote:
 
   Maybe these are text encoding markers?
  
   - Original Message -
   | From: Eva Lacy e...@lacy.ie
   | To: solr-user@lucene.apache.org
   | Sent: Thursday, November 29, 2012 3:53:07 AM
   | Subject: Re: Downloading files from the solr replication Handler
   |
   | I tried downloading them with my browser and also with a c#
   | WebRequest.
   | If I skip the first and last 4 bytes it seems work fine.
   |
   |
   | On Thu, Nov 29, 2012 at 2:28 AM, Erick Erickson
   | erickerick...@gmail.comwrote:
   |
   |  How are you downloading them? I suspect the issue is
   |  with the download process rather than Solr, but I'm just guessing.
   | 
   |  Best
   |  Erick
   | 
   | 
   |  On Wed, Nov 28, 2012 at 12:19 PM, Eva Lacy e...@lacy.ie wrote:
   | 
   |   Just to add to that, I'm using solr 3.6.1
   |  
   |  
   |   On Wed, Nov 28, 2012 at 5:18 PM, Eva Lacy e...@lacy.ie wrote:
   |  
   |I downloaded some configuration and data files directly from
   |solr in an
   |attempt to develop a backup solution.
   |I noticed there is some characters at the start and end of the
   |file
   |  that
   |aren't in configuration files, I notice the same characters at
   |the
   |  start
   |and end of the data files.
   |Anyone with any idea how I can download these files without the
   |extra
   |characters or predict how many there are going to be so I can
   |skip
   |  them?
   |   
   |  
   | 
   |

Re: Luke and SOLR search giving different results

2012-12-03 Thread Jack Krupansky


Two points:

1. Possibly an encoding problem with your container? Is UTF-8 encoding 
enabled?
2. Add debugQuery=true to your query (from the browser) and see if the 
parsedquery has the expected term that matches what Luke reports for the 
index and what Solr Admin Analysis also reports for index analysis.


-- Jack Krupansky

-Original Message- 
From: Erol Akarsu

Sent: Monday, December 03, 2012 11:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Luke and SOLR search giving different results

Jack,

Yes.

I expect SOLR to give the same search results as Luke does.

Term analyzer gives correct answer in SOLR as expected. But SOLR does not
return correct search results.

I don't know why.

Erol Akarsu

On Mon, Dec 3, 2012 at 11:21 AM, Jack Krupansky 
j...@basetechnology.comwrote:



So, does that highlight the problem for you or not? Is the term analyzed
as you expected?

-- Jack Krupansky

From: Erol Akarsu
Sent: Monday, December 03, 2012 8:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Luke and SOLR search giving different results

Jack,

Thanks for help.

I removed the data folder of SOLR and indexed this sample doc from scratch,
so there is only this one document in SOLR.

When I analysed it, I can see that stemming is correct, and I can see the
words bul, baş, gör and umut in the SF row.
I attached the analysis screens.

Erol Akarsu


On Sun, Dec 2, 2012 at 11:00 PM, Jack Krupansky j...@basetechnology.com
wrote:

  Have you tried using the Solr Admin Analysis page, using the word and a
few words of context for index analysis and the word alone for query
analysis?

  And be sure to fully reindex if you change ANYTHING in the schema fields
or field types.

  -- Jack Krupansky

  From: Erol Akarsu
  Sent: Sunday, December 02, 2012 10:38 PM
  To: solr-user@lucene.apache.org
  Subject: Luke and SOLR search giving different results


  Hi,

  I am trying to apply SOLR to the Turkish language for my research.

  Instead of using language identification, I manually assigned the Turkish
language to a sample test document. I configured the SOLR schema.xml and
activated the part below. I have attached the document
testTurkishDoc.xml that was inserted into the SOLR database.

  But searching the raw Lucene index through Luke and searching through the
SOLR 4.0 GUI give different results. In picture Selection_006.png, the
word baş is listed as the top term. I searched for the word baş in Luke and
got the only document as the result, shown in Selection_004.png.

  But in the SOLR GUI, I get an empty result for the word baş, as shown in
picture Selection_002.png.

  In the text we have a features field that contains the word baştan, which
is derived from the root word baş in Turkish grammar. Somehow, the SOLR GUI
is searching differently than Luke. I could not figure out why I cannot
find it in SOLR while Luke finds it. The same thing happens for the words
umut, bul and gör.

  I would appreciate it if you could help me get the same results from the
SOLR UI.


  <field name="features">
 Firmalarsa Nasılsa buldum oynatacak ünlüyü, neyleyim senaryoyu!
diyerek baştan savma reklamlarla kotarmaya bakıyor işi. Futbolcu Arda Turan
ve büyük umutlarla Türkiye'ye getirilen Paris Hilton'un oynatıldığı giyim
firması reklamı da tam bir fiyasko. Birbirinden ünlü bu iki ismin oynadığı
reklam Arda'nın kabinde papağan gibi tekrarladığı My darling! repliği,
sonunda Paris'i görünce anlam veremediğimiz uyduruk bayılma sahnesi, bir de
Paris'in ancak 5 kez izledikten sonra anlaşılan Paris seçti, firma yaptı,
Arda bayıldı. sözleriyle kazındı hafızalara, Keşke unutabilsek!
dedirterek.
  </field>



  Added to schema.xml for SOLR:

  <field name="features" type="text_tr" indexed="true" stored="true"
         multiValued="true"/>
  <fieldType name="text_tr" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.TurkishLowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
      <filter class="solr.SnowballPorterFilterFactory"
              language="Turkish"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.TurkishLowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
      <filter class="solr.SnowballPorterFilterFactory"
              language="Turkish"/>
    </analyzer>
  </fieldType>

Re: News clustering

2012-12-03 Thread Stanislaw Osinski

 I mean measuring the similarity between the document in each cluster.
 Also, difference between document on one cluster with another cluster.

 I saw the sample code ClusteringQualityBencmark.java
 However, I do not know how to make use of it for assessing my Solr
 Clustering performance.


You'd need to write your own code for this, here are the most common
clustering quality measures you mentioned:

http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results

These are meant for the general case (numeric attributes), to apply them to
texts, you'd need to use the vector representation of the documents.
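
As a rough illustration of what "use the vector representation" could look like, here is a minimal Python sketch that turns each document into a bag-of-words vector and averages the pairwise cosine similarity inside one cluster and between two clusters. The function names and the whitespace tokenisation are assumptions made for this example; they are not part of Solr or of the clustering component, and a real evaluation would more likely use TF-IDF weights.

```python
import math
from collections import Counter
from itertools import combinations, product


def term_vector(text):
    """Bag-of-words term-frequency vector (naive whitespace tokenisation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mean_intra_cluster_similarity(docs):
    """Average similarity over all document pairs inside one cluster."""
    pairs = list(combinations([term_vector(d) for d in docs], 2))
    if not pairs:
        return 1.0  # a single-document cluster is trivially cohesive
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def mean_inter_cluster_similarity(docs_a, docs_b):
    """Average similarity over all pairs drawn from two different clusters."""
    pairs = list(product([term_vector(d) for d in docs_a],
                         [term_vector(d) for d in docs_b]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

A well-separated clustering should show a high intra-cluster value and a low inter-cluster value.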

On a more general note, synthetic measures test only the document-cluster
assignments, but none take the quality of labels into account (this is
really hard to measure objectively).

Staszek

Re: Whole Phrase search in Solr

2012-12-03 Thread Jack Krupansky

The edismax phrase boost feature boosts the phrase IF it occurs - it's 
optional.


If you want Solr to search ONLY by whole phrase, Solr does have a precise 
way to request that - simply enclose the phrase in quotes. But I presume 
that you knew that.


You can certainly preprocess your query to convert raw phrases into quoted 
phrases.
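
A minimal sketch of such preprocessing on the client side, in Python. The helper name as_phrase_query and the optional field argument are made up for this example; the only thing taken from Solr's query syntax is that a phrase is enclosed in double quotes and that embedded quotes and backslashes are escaped with a backslash.

```python
def as_phrase_query(raw, field=None):
    """Wrap raw user input in double quotes so it is parsed as one phrase.

    Backslashes and double quotes inside the input are escaped first so
    they cannot terminate the phrase early.
    """
    escaped = raw.strip().replace("\\", "\\\\").replace('"', '\\"')
    phrase = '"' + escaped + '"'
    return field + ":" + phrase if field else phrase
```

For example, as_phrase_query("your products") yields "your products" in quotes, which can then be sent as the q parameter.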


-- Jack Krupansky

-Original Message- 
From: NickA

Sent: Monday, December 03, 2012 12:40 PM
To: solr-user@lucene.apache.org
Subject: Re: Whole Phrase search in Solr

Thank you Jack,

Before making this major change, please note that the problem is that there
are ZERO matches for the "your products" phrase (in my example below). It is
not that the search finds this phrase but ranks it very low;
it is that it NEVER returns this phrase as a result.

So how will the search show them in the top results, since there are ZERO?

Or do you mean that with this new parser we WILL get phrase results too?

Thank you again!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Whole-Phrase-search-in-Solr-tp4023931p4024048.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Backing up SolR 4.0

2012-12-03 Thread Shawn Heisey


On 12/3/2012 9:47 AM, Andy D'Arcy Jewell wrote:
However, wouldn't re-creating the index on a large dataset take an 
inordinate amount of time? The system I will be backing up is likely 
to undergo rapid development and thus schema changes, so I need some 
kind of insurance against corruption if we need to roll-back after a 
change.


How should I go about creating multiple backup versions I can put aside 
(e.g. on tape) to hedge against the down-time which would be required 
to regenerate the indexes from scratch?


Serious production Solr installs require at least two copies of your 
index.  Failures *will* happen, and sometimes they'll be the kind of 
failures that will take down an entire machine.  You can plan for some 
failures -- redundant power supply and RAID are important for this.  
Some failures will cause downtime, though -- multiple disk failures, 
motherboard, CPU, memory, software problems wiping out your index, user 
error, etc.  If you have at least one other copy of your index, you'll be 
able to keep the system operational while you fix the down machine.


Replication is a very good way to accomplish getting two or more copies 
of your index.  I would expect that most production Solr installations 
use either plain replication or SolrCloud.  I do my redundancy a 
different way that gives me a lot more flexibility, but replication is a 
VERY solid way to go.


If you are running on a UNIX/Linux platform (just about anything *other* 
than Windows), and backups via replication are not enough for you, you 
can use the hardlink capability in the OS to avoid taking Solr down 
while you make backups.  Here's the basic sequence:


1) Pause indexing, wait for all commits and merges to complete.
2) Create a target directory on the same filesystem as your Solr index.
3) Make hardlinks of all files in your Solr index in the target directory.
4) Resume indexing.
5) Copy the target directory to your backup location at your leisure.
6) Delete the hardlink copies from the target directory.

Making hardlinks is a near-instantaneous operation.  The way that 
Solr/Lucene works will guarantee that your hardlink copy will continue 
to be a valid index snapshot no matter what happens to the live index.  
If you can make the backup and get the hardlinks deleted before your 
index undergoes a merge, the hardlinks will use very little extra disk 
space.


If you leave the hardlink copies around, eventually your live index will 
diverge to the point where the copy has different files and therefore 
takes up disk space.  If you have a *LOT* of extra disk space on the 
Solr server, you can keep multiple hardlink copies around as snapshots.
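
Steps 2 and 3 of the sequence above could be sketched in Python roughly like this. It is only an illustration under assumptions: the function name and the directory arguments are invented for the example, indexing is assumed to be paused already, and the target must be on the same filesystem as the index because hardlinks cannot cross filesystems.

```python
import os


def hardlink_snapshot(index_dir, target_dir):
    """Hard-link every regular file of index_dir into a new target_dir.

    os.makedirs() fails if target_dir already exists, so an older
    snapshot is never mixed with a new one by accident.
    """
    os.makedirs(target_dir)
    linked = []
    for name in sorted(os.listdir(index_dir)):
        src = os.path.join(index_dir, name)
        if os.path.isfile(src):
            # os.link creates a second directory entry for the same data,
            # so no file content is copied here.
            os.link(src, os.path.join(target_dir, name))
            linked.append(name)
    return linked
```

After the call, indexing can resume and target_dir can be copied to the backup location with whatever tool you prefer, then removed.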


Recent versions of Windows do have features similar to UNIX links, so 
there may in fact be a way to do this on Windows.  I will leave that for 
someone else to pursue.


Thanks,
Shawn

Re: Luke and SOLR search giving different results

2012-12-03 Thread Erol Akarsu

Jack,

I had already set the Tomcat server to UTF-8 encoding. I added
URIEncoding="UTF-8" to all <Connector ...> elements in server.xml in Tomcat
7.

As you see below, when I search for the word baş in debug mode, I get an
empty response. But when I search for the word baştan, I get the correct
response.

It seems to me that TurkishAnalyzer is not being used in the SOLR search,
because only the full-word search baştan works and not the root word
baş. Probably EnglishAnalyzer is being used and cannot find the root
word. For example, in Luke, if I change the analyzer used for query parsing
to EnglishAnalyzer, it cannot find the word baş, but it can with
TurkishAnalyzer. I guess SOLR is not using TurkishAnalyzer.

Is this assumption true? I could not find any other reason.


<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">58</int>
    <lst name="params">
      <str name="debugQuery">true</str>
      <str name="q">baş</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" />
  <lst name="debug">
    <str name="rawquerystring">baş</str>
    <str name="querystring">baş</str>
    <str name="parsedquery">text:baş</str>
    <str name="parsedquery_toString">text:baş</str>
    <lst name="explain" />
    <str name="QParser">LuceneQParser</str>
    <lst name="timing">
      <double name="time">38.0</double>
      <lst name="prepare">
        <double name="time">16.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">3.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.StatsComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">0.0</double>
        </lst>
      </lst>
      <lst name="process">
        <double name="time">10.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.StatsComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.DebugComponent">
          <double name="time">10.0</double>
        </lst>
      </lst>
    </lst>
  </lst>
</response>

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="debugQuery">true</str>
      <str name="q">baştan</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="url">htt://111.a.b1</str>
      <str name="id">6H500F0</str>
      <str name="lang">tr</str>
      <str name="name">Maxtor DiamondMax 11 - hard drive - 500 GB -
        SATA-300
      </str>
      <str name="manu">Maxtor Corp.</str>
      <str name="manu_id_s">maxtor</str>
      <arr name="cat">
        <str>electronics</str>
        <str>hard drive</str>
      </arr>
      <arr name="features">
        <str>SATA 3.0Gb/s, NCQ</str>
        <str>8.5ms seek</str>
        <str>16MB cache</str>
        <str>
          Firmalarsa Nasılsa buldum oynatacak ünlüyü, neyleyim senaryoyu!
          diyerek baştan savma reklamlarla kotarmaya bakıyor işi.
          Futbolcu Arda Turan ve büyük umutlarla Türkiye'ye getirilen
          Paris Hilton'un oynatıldığı giyim firması reklamı da tam bir
          fiyasko. Birbirinden ünlü bu iki ismin oynadığı reklam Arda'nın
          kabinde papağan gibi tekrarladığı My darling! repliği, sonunda
          Paris'i görünce anlam veremediğimiz uyduruk bayılma sahnesi,
          bir de Paris'in ancak 5 kez izledikten sonra anlaşılan Paris
          seçti, firma yaptı, Arda bayıldı. sözleriyle kazındı
          hafızalara, Keşke unutabilsek! dedirterek.

Re: Luke and SOLR search giving different results

2012-12-03 Thread Jack Krupansky

Ah! See where it says <str name="parsedquery_toString">text:baş</str>? 
Your query is against the "text" field, which probably doesn't have the 
Turkish analysis.


There is probably a copyField from "features" to "text". You use the 
"text_tr" field type for "features", but probably not for the "text" field.
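
If you want to check this from a script rather than by eye, a small Python sketch like the following pulls the field name out of the parsedquery entry of a debugQuery=true XML response. The function name is made up for this example, and it assumes a simple single-term query whose parsed form looks like field:term; more complex queries would need real parsing.

```python
import xml.etree.ElementTree as ET


def parsed_query_field(response_xml):
    """Return the field name from a simple 'field:term' parsedquery, or None."""
    root = ET.fromstring(response_xml)
    node = root.find("./lst[@name='debug']/str[@name='parsedquery']")
    if node is None or not node.text:
        return None
    field, sep, _term = node.text.partition(":")
    return field if sep else None
```

For the response quoted in this thread it would report "text", i.e. the query never touched the "features" field.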


-- Jack Krupansky


Re: Luke and SOLR search giving different results

2012-12-03 Thread Erol Akarsu

Jack,

I have these definitions in schema.xml, which declare features as type text_tr.

But unfortunately, the search still fails.

<field name="features" type="text_tr" indexed="true" stored="true"
       multiValued="true"/>
<copyField source="features" dest="text"/>

<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="Turkish"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="Turkish"/>
  </analyzer>
</fieldType>




Re: Whole Phrase search in Solr

2012-12-03 Thread NickA

Jack thank you again,

however, we have the major problem that using QUOTES to get phrase
results actually does not return any results AT ALL!

I mentioned in the initial post that we also tried these:

fq=search_field:"check this"
fq=search_field:'check this' 

But no results appear when quotes are used. What may we be doing wrong in our
configuration?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Whole-Phrase-search-in-Solr-tp4023931p4024071.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: News clustering

2012-12-03 Thread Jorge Luis Betancourt Gonzalez

I'm trying to use it to search through news websites, but I was interested in 
classification at index time. Is there any available solution for this?

Greetings!

On Dec 3, 2012, at 12:37 PM, Stanislaw Osinski stanis...@osinski.name wrote:

 I mean measuring the similarity between the document in each cluster.
 Also, difference between document on one cluster with another cluster.
 
 I saw the sample code ClusteringQualityBencmark.java
 However, I do not know how to make use of it for assessing my Solr
 Clustering performance.
 
 
 You'd need to write your own code for this, here are the most common
 clustering quality measures you mentioned:
 
 http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results
 
 These are meant for the general case (numeric attributes), to apply them to
 texts, you'd need to use the vector representation of the documents.
 
 One a more general note, synthetic measures test only the document-cluster
 assignments, but none take the quality of labels into account (this is
 really hard to measure objectively).
 
 Staszek
 
 
 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
 INFORMATICAS...
 CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
 
 http://www.uci.cu
 http://www.facebook.com/universidad.uci
 http://www.flickr.com/photos/universidad_uci



Re: Luke and SOLR search giving different results

2012-12-03 Thread Erol Akarsu

Jack,

I see interesting stuff here now.

As the search query, I tried not baş but features:baş in the q field of the
SOLR GUI. And I got a result!

In that one document, I had some fields of type text_eng and text_general and
one field, features, of type text_tr. If I don't specify a field name, SOLR uses
EnglishAnalyzer. If I do, it uses the analyzer of the field specified
in the search query string.

Is this true?

Erol Akarsu


Re: Luke and SOLR search giving different results

2012-12-03 Thread Jack Krupansky

As I pointed out in my message, your query indicates that "text" is your 
default search field. So, either choose a different default search field or 
ensure that the "text" field has the desired field type.


If you want to change the default search field, either use a "df" request 
parameter or change the "df" default value for the request handler in 
solrconfig.xml.
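
For the request-parameter route, here is a small Python sketch of how a client might build the select URL with df set. The helper name and the base URL are assumptions for illustration only; q, df, wt and debugQuery are the request parameters already discussed in this thread.

```python
from urllib.parse import urlencode


def select_url(base_url, query, default_field=None, debug=False):
    """Build a /select URL, optionally overriding the default search field."""
    params = {"q": query, "wt": "xml"}
    if default_field:
        params["df"] = default_field  # search this field when q names none
    if debug:
        params["debugQuery"] = "true"
    # urlencode percent-encodes non-ASCII terms such as baş as UTF-8
    return base_url + "/select?" + urlencode(params)
```

With default_field="features", a bare q=baş would be analyzed with the text_tr chain instead of whatever the "text" field uses.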


-- Jack Krupansky


solr war - osgi

2012-12-03 Thread Marcos Mendez

Hi,

Has anyone had any experience repackaging the solr war for osgi? And while I'm 
at it, has anyone done this in geronimo 3.0?

Regards,
Marcos

Re: Luke and SOLR search giving different results

2012-12-03 Thread Shawn Heisey


On 12/3/2012 1:44 PM, Erol Akarsu wrote:

Instead of baş, I tried features:baş as the search query in the q field of the
Solr GUI, and I got a result!

The document has some fields of type text_eng and text_general, and one
field, features, of type text_tr. If I don't specify a field name, Solr uses
EnglishAnalyzer. If I do, it uses the analyzer of the field specified in the
search query string.


Your config is set up to search against a field named text by default 
- either by a setting in schema.xml or a df parameter in your search 
handler definition in solrconfig.xml.  If you are using (e)dismax, it 
might be qf/pf parameters instead of df.


The field named text is not properly set up for this search.  Your 
attachment at the beginning of this thread indicates that either you do 
not have a text field for this document at all, or that field is not 
stored.  If the text field is a copyField as Jack has mentioned, note 
that it doesn't matter what analysis you are doing on features -- the 
copy is done before analysis, so it is completely separate.
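
The difference between the two queries in this thread can be illustrated with a minimal Python sketch that builds both request URLs. The host, port, and /solr/select path are placeholder assumptions and are not taken from the thread; only the q values and debugQuery come from the messages above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port, and core path for your install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(q, **extra):
    """Build a select URL; urlencode percent-encodes non-ASCII text as UTF-8."""
    params = {"q": q, "wt": "xml", "debugQuery": "true"}
    params.update(extra)
    return SOLR_SELECT + "?" + urlencode(params)


# Unqualified term: parsed against the default field ("text" in this thread).
default_field_url = select_url("baş")

# Field-qualified term: parsed against the features field instead.
features_url = select_url("features:baş")

print(default_field_url)
print(features_url)
```

Comparing the parsedquery entries in the debug output of the two requests should show which field each query actually ran against.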


Thanks,
Shawn

Re: Whole Phrase search in Solr

2012-12-03 Thread Jack Krupansky

Ah! You have conflicting tokenizers in your index and query analyzers. They 
should be the same.


Your index has:
 <tokenizer class="solr.StandardTokenizerFactory"/>

Your query has:
  <tokenizer class="solr.KeywordTokenizerFactory"/>

That has the effect of treating the entire query term as one index term. 
That actually works for simple terms, but a quoted phrase is passed to the 
query analyzer as one string, and the keyword tokenizer will treat it as one 
token. The result is a single query term, which will not match the two terms 
that were indexed by the standard tokenizer.


Stick with the same tokenizer as you used at index time.
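
As a rough illustration of the mismatch, here is a small Python sketch. The two functions are crude stand-ins for the Solr tokenizers (whitespace splitting is an assumption, not the real StandardTokenizer rules), and the matching logic is a simplification of what Lucene does.

```python
def standard_tokenize(text):
    """Crude stand-in for StandardTokenizer: split on whitespace."""
    return text.split()


def keyword_tokenize(text):
    """Stand-in for KeywordTokenizer: the whole input is one token."""
    return [text]


# Index time: the standard tokenizer stores two separate terms.
indexed_terms = set(standard_tokenize("check this"))

# Query time with the keyword tokenizer: the quoted phrase stays one term.
mismatched_query = keyword_tokenize("check this")
# Query time with the same tokenizer as the index: two terms again.
matched_query = standard_tokenize("check this")

mismatched_hit = all(t in indexed_terms for t in mismatched_query)
matched_hit = all(t in indexed_terms for t in matched_query)
print(mismatched_hit, matched_hit)  # False True
```

The single term "check this" never appears in the index, which is why the quoted query returns nothing.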

-- Jack Krupansky

-Original Message- 
From: NickA

Sent: Monday, December 03, 2012 1:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Whole Phrase search in Solr

Jack, thank you again.

However, our main problem is that using QUOTES to get phrase
results does not return any results AT ALL!

I mentioned in the initial post that we also tried these:

fq=search_field:check this
fq=search_field:'check this'

But no results appear when quotes are used. What might we be doing wrong in our
configuration?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Whole-Phrase-search-in-Solr-tp4023931p4024071.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Problem with ping handler, SolrJ 4.1-SNAPSHOT, Solr 3.5.0

2012-12-03 Thread Shawn Heisey


On 11/8/2012 3:25 PM, Dyer, James wrote:

Could this be a side effect of SOLR-4019 (commit r1405894 in branch_4.0)?
Prior to this commit, PingRequestHandler would throw a SolrException for 503 (Service Unavailable).
The change is that the exception isn't actually thrown but rather sent in place of the 
response.  This prevents the container from logging huge stack traces just because 
PingRequestHandler is in a disabled state.  Prior to this, SolrException had 
logging disabled for 503s with hardcoding, but this broke other uses of 503 SolrExceptions.


While working on another issue (SOLR-4143), I figured out why this isn't 
working.  Initially I did not connect the exceptions in the Solr 3.5 log 
to my problems getting ping responses, but the light eventually turned on.


My requests to the 3.5 ping handler from SolrJ 4.1-SNAPSHOT use the 
setRequestHandler method to talk to /admin/ping.  In addition to using 
/admin/ping as the URL path, this also sets the qt parameter to 
/admin/ping.  The PingRequestHandler in Solr 3.x looks at the qt 
parameter that it receives, and if that handler is an instance of 
PingRequestHandler, throws an exception saying that you can't call PRH 
recursively.  This is why I get an exception and no response, but it 
works perfectly in a browser -- I wasn't setting qt in my browser.  Once 
I did that, I get the bad response in the browser too.
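
The check described above can be modeled with a toy Python sketch. This is not Solr's actual code: the class, the handler registry, and the error message are all hypothetical, and only the behavior (reject a ping whose qt resolves to a ping handler) follows the description.

```python
class PingHandler:
    """Toy model of the recursion check described above; not Solr's code."""

    def __init__(self, handlers):
        self.handlers = handlers  # maps handler name -> handler object

    def handle(self, params):
        # Look up the handler named by qt, if the client sent one.
        target = self.handlers.get(params.get("qt"))
        if isinstance(target, PingHandler):
            raise RuntimeError("cannot call the ping handler recursively")
        return "OK"


handlers = {}
ping = PingHandler(handlers)
handlers["/admin/ping"] = ping

# Browser-style request: path only, no qt parameter.
print(ping.handle({}))

# SolrJ-style request: the request handler name is also sent as qt.
try:
    ping.handle({"qt": "/admin/ping"})
except RuntimeError as err:
    print("error:", err)
```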


There is no way in SolrJ 4.x or trunk to set the request handler without 
also setting qt.  When I looked at SolrJ code trying to make a patch for 
SOLR-4143, I discovered that it's not a trivial change, and it may not 
be possible to even do in branch_4x.


Is there possibly a workaround I can use in SolrJ?  Other thoughts?

Thanks,
Shawn

Re: News clustering

2012-12-03 Thread Iwan Hanjoyo

Hi Stanislaw,

I see. Thank you for the reference.

Kind regards,

Hanjoyo

On Tue, Dec 4, 2012 at 12:37 AM, Stanislaw Osinski stanis...@osinski.name wrote:

  I mean measuring the similarity between the documents in each cluster,
  and also the difference between documents in one cluster and those in another.
 
  I saw the sample code ClusteringQualityBenchmark.java.
  However, I do not know how to use it to assess my Solr
  clustering performance.
 

 You'd need to write your own code for this. Here are the most common
 clustering quality measures you mentioned:


 http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_clustering_results

 These are meant for the general case (numeric attributes). To apply them to
 texts, you'd need to use the vector representation of the documents.

 On a more general note, synthetic measures test only the document-cluster
 assignments, but none take the quality of labels into account (this is
 really hard to measure objectively).

 Staszek

Re: How to change Solr UI

2012-12-03 Thread Iwan Hanjoyo



 Note that Velocity _can_ be used for user-facing code, but be very sure you
 secure your Solr. If you allow direct access, a user can easily enter
 something like http://
 solr/update?commit=true&stream.body=<delete><query>*:*</query></delete>.
 And all your documents will be gone.

Hi Erickson,

Thank you for the input.
I'll take note and filter out this URL.
* http://
solr/update?commit=true&stream.body=<delete><query>*:*</query></delete>

Kind regards,

Hanjoyo

Re: solr war - osgi

2012-12-03 Thread Iwan Hanjoyo

 Has anyone had any experience repackaging the solr war for osgi? And while
 I'm at it, has anyone done this in geronimo 3.0?


Hi Marcos,

Start the GlassFish server.
Put the Solr war file inside the autodeploy folder.
Finally, you need to find the Solr home folder location.
Different operating systems will have different Solr home locations for
GlassFish.

You need to find it yourself in the GlassFish log file.
It is a bit difficult.

Good luck.

Kind regards,

Hanjoyo

Re: How to change Solr UI

2012-12-03 Thread Jack Krupansky


It is annoying to have to repeat these explanations so much.

Any serious objection to removing the VW UI from Solr proper and replacing 
it with a standalone app?


I mean, Solr should have PHP, python, Java, and ruby example apps, right?

-- Jack Krupansky

-Original Message- 
From: Iwan Hanjoyo

Sent: Monday, December 03, 2012 8:28 PM
To: solr-user@lucene.apache.org
Subject: Re: How to change Solr UI




Note that Velocity _can_ be used for user-facing code, but be very sure you
secure your Solr. If you allow direct access, a user can easily enter
something like http://
solr/update?commit=true&stream.body=<delete><query>*:*</query></delete>.
And all your documents will be gone.

Hi Erickson,

Thank you for the input.
I'll take note and filter out this URL.
* http://
solr/update?commit=true&stream.body=<delete><query>*:*</query></delete>

Kind regards,

Hanjoyo

Solr Query Parameter : ids - What is this used for?

2012-12-03 Thread deniz

Hello, as the title says, I want to know what Solr uses this
parameter for. I see it in a sharded cloud environment, so I guess it is related
to cloud, but there is no explanation of it in any of the wiki pages
that I have checked. Can someone explain the usage and purpose of this
parameter?



-
Smart, but doesn't study... If he studied, he would succeed...
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Query-Parameter-ids-What-is-this-used-for-tp4024152.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Query Parameter : ids - What is this used for?

2012-12-03 Thread Yonik Seeley

On Mon, Dec 3, 2012 at 10:55 PM, deniz denizdurmu...@gmail.com wrote:
 Hello, as the title says, I want to know what Solr uses this
 parameter for. I see it in a sharded cloud environment, so I guess it is related
 to cloud, but there is no explanation of it in any of the wiki pages
 that I have checked. Can someone explain the usage and purpose of this
 parameter?

It's an internal implementation detail of distributed search - the
second phase selects specific ids on each shard via the ids
parameter.
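
A toy Python model of the two phases may make this concrete. It is only a sketch: the shard contents, field names, and function names are made up for illustration, and real Solr handles routing, sorting, and ties in more detail than this.

```python
import heapq

# Toy shards: each maps a document id to (score, stored fields).
SHARDS = [
    {"a": (0.9, {"title": "A"}), "b": (0.2, {"title": "B"})},
    {"c": (0.7, {"title": "C"}), "d": (0.4, {"title": "D"})},
]


def phase_one(shard, rows):
    """Each shard returns only ids and scores for its top hits."""
    hits = [(score, doc_id) for doc_id, (score, _) in shard.items()]
    return heapq.nlargest(rows, hits)


def phase_two(shard, ids):
    """Each shard returns stored fields for the specific ids requested."""
    return {i: shard[i][1] for i in ids if i in shard}


def distributed_search(rows):
    merged = []
    for shard in SHARDS:
        merged.extend(phase_one(shard, rows))
    # The coordinator merges the per-shard hits and keeps the global top rows.
    top_ids = [doc_id for _, doc_id in heapq.nlargest(rows, merged)]
    docs = {}
    for shard in SHARDS:
        # This is the role played by the ids parameter in the second request.
        docs.update(phase_two(shard, top_ids))
    return [(i, docs[i]) for i in top_ids]


print(distributed_search(rows=2))
```

The second request only needs to fetch the few documents that survived the merge, which is why it asks for specific ids rather than rerunning the query.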

-Yonik
http://lucidworks.com

Difference between 'bf' and 'boost' when using eDismax

2012-12-03 Thread Floyd Wu

Hi there,

I'm not sure I understand this correctly.

Does 'bf' mean that the value returned by bf is added to the score?
For example: score + bf = final score

Does 'boost' mean that the score is multiplied by the value returned by boost?
For example: score * boost = final score

When using both 'bf' and 'boost':
score * boost + bf = final score

If I would like to rank recently created documents higher, which is the
better solution, 'bf' or 'boost' (assuming both use the same
function, recip(ms(NOW,datefield),3.16e-11,1,1))?


Please help with this.

search behavior on a case-sensitive field

2012-12-03 Thread Joe Zhang

I have a field type like this:

<fieldType name="text_cs" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<!--<filter class="solr.LowerCaseFilterFactory"/>  -->
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>

When I query COST, it gives reasonable results (n1).
When I query CoSt, however, it gives me n2 (> n1) results, and I can't
locate an actual occurrence of CoSt in the docs at all. Can anybody advise?

Re: Solr Query Parameter : ids - What is this used for?

2012-12-03 Thread deniz

Yonik Seeley-4 wrote
 It's an internal implementation detail of distributed search - the
 second phase selects specific ids on each shard via the ids
 parameter.
 
 -Yonik
 http://lucidworks.com

So I suppose it is the unique field? Or does it depend on which field we are
using for querying the shards?



-
Smart, but doesn't study... If he studied, he would succeed...
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Query-Parameter-ids-What-is-this-used-for-tp4024152p4024159.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: search behavior on a case-sensitive field

2012-12-03 Thread Jack Krupansky

CoSt was split into two terms and the query parser generated an OR of 
them. Adding the autoGeneratePhraseQueries="true" attribute to your field 
type should fix the problem.


You can also change splitOnCaseChange="1" to splitOnCaseChange="0" to avoid 
the term splitting issue.


Be sure to completely reindex in either case.
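
A simplified Python sketch of the splitting may help. It models only the lowercase-to-uppercase split of splitOnCaseChange, not the full WordDelimiterFilter (catenation and stemming are ignored), and the three sample documents are invented for illustration. It assumes Python 3.7 or later, where re.split accepts zero-width matches.

```python
import re


def split_on_case_change(token):
    """Split where a lowercase letter is followed by an uppercase letter."""
    return re.split(r"(?<=[a-z])(?=[A-Z])", token)


# Invented index: document id -> set of indexed terms.
DOCS = {1: {"COST"}, 2: {"Co"}, 3: {"St"}}


def search_or(terms):
    """Match any document containing at least one of the terms (an OR query)."""
    wanted = set(terms)
    return sorted(doc_id for doc_id, indexed in DOCS.items() if indexed & wanted)


print(split_on_case_change("COST"))  # ['COST']
print(split_on_case_change("CoSt"))  # ['Co', 'St']
print(search_or(split_on_case_change("COST")))  # [1]
print(search_or(split_on_case_change("CoSt")))  # [2, 3]
```

The CoSt query matches documents that contain either fragment, which is why it can return more hits than COST even though CoSt itself never occurs.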

-- Jack Krupansky

-Original Message- 
From: Joe Zhang

Sent: Monday, December 03, 2012 11:10 PM
To: solr-user@lucene.apache.org
Subject: search behavior on a case-sensitive field

I have a field type like this:

   <fieldType name="text_cs" class="solr.TextField"
   positionIncrementGap="100">
   <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.StopFilterFactory"
   ignoreCase="true" words="stopwords.txt"/>
   <filter class="solr.WordDelimiterFilterFactory"
   generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="1" catenateAll="0"
   splitOnCaseChange="1"/>
   <!--<filter class="solr.LowerCaseFilterFactory"/>  -->
   <filter class="solr.EnglishPorterFilterFactory"
   protected="protwords.txt"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   </fieldType>

When I query COST, it gives reasonable results (n1).
When I query CoSt, however, it gives me n2 (> n1) results, and I can't
locate an actual occurrence of CoSt in the docs at all. Can anybody advise?

Re: search behavior on a case-sensitive field

2012-12-03 Thread Joe Zhang

haha, makes perfect sense! Thanks a lot!

On Mon, Dec 3, 2012 at 9:25 PM, Jack Krupansky j...@basetechnology.com wrote:

 CoSt was split into two terms and the query parser generated an OR of
 them. Adding the autoGeneratePhraseQueries="true" attribute to your
 field type should fix the problem.

 You can also change splitOnCaseChange="1" to splitOnCaseChange="0" to
 avoid the term splitting issue.

 Be sure to completely reindex in either case.

 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Monday, December 03, 2012 11:10 PM
 To: solr-user@lucene.apache.org
 Subject: search behavior on a case-sensitive field


 I have a field type like this:

 <fieldType name="text_cs" class="solr.TextField"
 positionIncrementGap="100">
 <analyzer>
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.StopFilterFactory"
 ignoreCase="true" words="stopwords.txt"/>
 <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1" generateNumberParts="1"
 catenateWords="1" catenateNumbers="1" catenateAll="0"
 splitOnCaseChange="1"/>
 <!--<filter class="solr.LowerCaseFilterFactory"/>  -->
 <filter class="solr.EnglishPorterFilterFactory"
 protected="protwords.txt"/>
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 </fieldType>

 When I query COST, it gives reasonable results (n1).
 When I query CoSt, however, it gives me n2 (> n1) results, and I can't
 locate an actual occurrence of CoSt in the docs at all. Can anybody advise?

Re: Difference between 'bf' and 'boost' when using eDismax

2012-12-03 Thread Jack Krupansky


bf is processed first, then boost.

All the bf's will be added, then the resulting scores will be boosted by the 
product of all the boost function queries.
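
As a worked example of that order of operations, here is a small Python sketch. The recip formula a / (m * x + b) is the documented definition of Solr's recip function; the base score of 2.0, the document ages, and the final_score helper are assumptions for illustration, and real Lucene scoring applies normalization that this ignores. It needs Python 3.8 or later for math.prod.

```python
import math


def recip(x, m, a, b):
    """Solr's recip function query: a / (m * x + b)."""
    return a / (m * x + b)


def final_score(score, bfs=(), boosts=()):
    """Add the bf values first, then multiply by the product of the boosts."""
    return (score + sum(bfs)) * math.prod(boosts)


MS_PER_DAY = 24 * 60 * 60 * 1000
ages_ms = {"new doc": 1 * MS_PER_DAY, "year-old doc": 365 * MS_PER_DAY}

for label, age in ages_ms.items():
    # Same function as in the question: recip(ms(NOW,datefield),3.16e-11,1,1)
    recency = recip(age, 3.16e-11, 1, 1)
    as_bf = final_score(2.0, bfs=[recency])
    as_boost = final_score(2.0, boosts=[recency])
    print(f"{label}: recency={recency:.3f} bf={as_bf:.3f} boost={as_boost:.3f}")
```

With these constants the recency value is close to 1 for a day-old document and close to 0.5 for a year-old one. Used as bf it adds a fixed amount regardless of the base score, while used as boost it scales the base score proportionally.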


-- Jack Krupansky

-Original Message- 
From: Floyd Wu

Sent: Monday, December 03, 2012 11:00 PM
To: solr-user@lucene.apache.org
Subject: Difference between 'bf' and 'boost' when using eDismax

Hi there,

I'm not sure I understand this correctly.

Does 'bf' mean that the value returned by bf is added to the score?
For example: score + bf = final score

Does 'boost' mean that the score is multiplied by the value returned by boost?
For example: score * boost = final score

When using both 'bf' and 'boost':
score * boost + bf = final score

If I would like to rank recently created documents higher, which is the
better solution, 'bf' or 'boost' (assuming both use the same
function, recip(ms(NOW,datefield),3.16e-11,1,1))?


Please help with this.

Re: How to change Solr UI

2012-12-03 Thread Erick Erickson

That's only one example; there are others, such as
stream.body=<delete><id>blah</id></delete> or
<delete><query>id:*</query></delete>.

Jack's comment is well taken, consider a real middleware application.
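
One way to picture the middleware idea is a layer that never forwards user-supplied URLs and instead builds the only Solr request it is willing to send. The Python sketch below is an illustration under assumptions: the internal URL, the function name, and the parameter set are all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical internal address; Solr itself is not exposed to users.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_search_request(user_text, page=0, rows=10):
    """Build the only kind of request the application sends for a user."""
    params = {
        "q": user_text,  # user input is passed as a query value only
        "start": max(page, 0) * rows,
        "rows": rows,
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


# Even hostile input ends up percent-encoded inside q on the select handler.
url = build_search_request("stream.body=<delete><query>*:*</query></delete>")
print(url)
```

Because the handler path and parameter names are fixed by the application, a user cannot reach /update or add stream.body; filtering out individual bad URLs is not needed.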


Best
Erick


On Mon, Dec 3, 2012 at 5:28 PM, Iwan Hanjoyo ihanj...@gmail.com wrote:

 
 
  Note that Velocity _can_ be used for user-facing code, but be very sure you
  secure your Solr. If you allow direct access, a user can easily enter
  something like http://
  solr/update?commit=true&stream.body=<delete><query>*:*</query></delete>.
  And all your documents will be gone.
 
  Hi Erickson,

 Thank you for the input.
 I'll take note and filter out this URL.
 * http://
 solr/update?commit=true&stream.body=<delete><query>*:*</query></delete>

 Kind regards,

 Hanjoyo

Re: Difference between 'bf' and 'boost' when using eDismax

2012-12-03 Thread Floyd Wu

Thanks, Jack!
It helps a lot.

Floyd



2012/12/4 Jack Krupansky j...@basetechnology.com

 bf is processed first, then boost.

 All the bf's will be added, then the resulting scores will be boosted by
 the product of all the boost function queries.

 -- Jack Krupansky

 -Original Message- From: Floyd Wu
 Sent: Monday, December 03, 2012 11:00 PM
 To: solr-user@lucene.apache.org
 Subject: Difference between 'bf' and 'boost' when using eDismax


 Hi there,

 I'm not sure I understand this correctly.

 Does 'bf' mean that the value returned by bf is added to the score?
 For example: score + bf = final score

 Does 'boost' mean that the score is multiplied by the value returned by boost?
 For example: score * boost = final score

 When using both 'bf' and 'boost':
 score * boost + bf = final score

 If I would like to rank recently created documents higher, which is the
 better solution, 'bf' or 'boost' (assuming both use the same
 function, recip(ms(NOW,datefield),3.16e-11,1,1))?


 Please help with this.

Migrating solr 3.6 to solr 4.0

2012-12-03 Thread Shaveta_Chawla

Hi,

I had Solr 3.6 installed on my system, and I am now migrating from Solr 3.6 to
Solr 4.0, but I am getting this error:

SEVERE: Unable to create core: collection1
java.io.IOException: Can't find resource 'solrconfig.xml' in classpath or
'solr/collection1/conf/', cwd=/opt/tomcat/bin

I don't know how to resolve this.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Migrating-solr-3-6-to-solr-4-0-tp4024173.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Migrating solr 3.6 to solr 4.0

2012-12-03 Thread Tirthankar Chatterjee

Can you paste the content of solr.xml?
On Dec 4, 2012, at 1:26 AM, Shaveta_Chawla wrote:

 Hi,
 
 I had Solr 3.6 installed on my system, and I am now migrating from Solr 3.6 to
 Solr 4.0, but I am getting this error:
 
 SEVERE: Unable to create core: collection1
 java.io.IOException: Can't find resource 'solrconfig.xml' in classpath or
 'solr/collection1/conf/', cwd=/opt/tomcat/bin
 
 I don't know how to resolve this.
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Migrating-solr-3-6-to-solr-4-0-tp4024173.html
 Sent from the Solr - User mailing list archive at Nabble.com.

