Entity with multiple datasources

2012-02-16 Thread Radu Toev
Hello,

I created a data-config.xml file where I define a datasource and an entity
with 12 fields.
In my use case I have 2 databases with the same schema, so I want to
combine in one index the 2 databases.
I defined a second dataSource tag and duplicated the entity with its
fields (changed the name and the datasource).
What I'm expecting is to get around 7k results (I have around 6k in the
first db and 1k in the second). However, I'm getting a total of 2k.
Where could the problem be?

Thanks


Re: 'foruns' don't match 'forum' with NGramFilterFactory (or EdgeNGramFilterFactory)

2012-02-16 Thread Dirceu Vieira
Hi,

It's funny that if you try fóruns it matches:
http://bhakta.casadomato.org:8982/solr/select/?q=f%C3%B3runs&version=2.2&start=0&rows=10&indent=on
but when you try foruns, it does not.

Check this out...

http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3rum&qverbose=on&qval=foruns

See that stemming does not work for the word foruns.

Could it be because fórum is part of the PT dictionary but not forum?

Regards,

2012/2/14 Bráulio Bhavamitra brauli...@gmail.com

 Hello all,

 I'm experimenting with NGramFilterFactory and EdgeNGramFilterFactory.

 Both of them show a match in my Solr admin analysis, but when I query
 'foruns'
 it doesn't find any 'forum'.
 analysis

 http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3runs&qverbose=on&qval=f%C3%B3runs
 search

 http://bhakta.casadomato.org:8982/solr/select/?q=foruns&version=2.2&start=0&rows=10&indent=on

 Anybody knows what's the problem?

 bráulio




-- 
Dirceu Vieira Júnior
---
+47 9753 2473
dirceuvjr.blogspot.com
twitter.com/dirceuvjr


problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
Hi all,
I have a problem configuring PDF indexing from a directory in my Solr
with DIH:

with this data-config


<dataConfig>
 <dataSource type="BinFileDataSource" />
 <document>
  <entity
    name="tika-test"
    processor="FileListEntityProcessor"
    baseDir="D:\gioconews_archivio\marzo2011"
    fileName=".*pdf"
    recursive="true"
    rootEntity="false"
    dataSource="null"/>
  <entity processor="FileListEntityProcessor"
    url="D:\gioconews_archivio\marzo2011" format="text">
   <field column="author" name="author" meta="true"/>
   <field column="title" name="title" meta="true"/>
   <field column="description" name="description" />
   <field column="comments" name="comments" />
   <field column="content_type" name="content_type" />
   <field column="last_modified" name="last_modified" />
  </entity>
 </document>
</dataConfig>

I obtain this result:



  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse" />
  <lst name="statusMessages">
    <str name="Time Elapsed">0:0:2.44</str>
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">43</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2012-02-12 19:06:00</str>
    <str name="">Indexing failed. Rolled back all changes.</str>
    <str name="Rolledback">2012-02-12 19:06:00</str>
  </lst>


suggestions?
thank you
alessio


Best requestHandler for typing error.

2012-02-16 Thread stockii
Hello.

Which RequestHandler do you use to find typing errors, like goolge => "did you
mean google"?

I want to use my EdgeNGram autosuggestion with a clever auto-correction!


What do you use?

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents, other Cores < 200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-requestHandler-for-typing-error-tp3749576p3749576.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread Kashif Khan
I kept the old schema and solrconfig files, but there were some errors due to
which Solr was not loading. I don't know what those were. We have a few of our
own custom plugins developed against 1.4.1.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3749629.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread Kashif Khan
We have both stored=true and stored=false fields in the schema, so we can't
reindex the way you suggested; we tried that earlier.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3749631.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?

2012-02-16 Thread Mikhail Khludnev
Please find my replies inline.

On Thu, Feb 16, 2012 at 10:30 AM, Alexey Verkhovsky 
alexey.verkhov...@gmail.com wrote:

 Hi, all,

 I'm new here. Used Solr on a couple of projects before, but didn't need to
 dive deep into anything until now. These days, I'm doing a spike for a
 yellow pages type search server with the following technical
 requirements:

 ~10 mln listings in the database. A listing has a name, address,
 description, coordinates and a number of tags / filtering fields; no more
 than a kilobyte all told; i.e. theoretically the whole thing should fit in
 RAM without sharding. A typical query is either all text matches on name
 and/or description within a bounded box, or some combination of tag
 matches within a bounded box. Bounded boxes are 1 to 50 km wide, and
 contain up to 10^5 unfiltered listings (the average is more like 10^3).
 More than 50% of all the listings are in the frequently requested bounding
 boxes, however a vast majority of listings are almost never displayed
 (because they don't match the other filters).

 Data never changes (i.e., a daily batch update; rebuild of the entire
 index and restart of all search servers is feasible, as long as it takes
 minutes, not hours).

Everybody starts with a daily bounce, but ends up with an UPDATED_AT column and
delta updates; just consider the urgent-content-fix use case. I don't think it's
worth relying on a daily bounce as the cornerstone of the architecture.
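
For illustration, a DIH entity shaped roughly like this (table and column names
are placeholders, not from any real setup) picks up only the rows changed since
the last import; it is run with command=delta-import instead of a full bounce:

<entity name="listing" pk="id"
        query="select * from listing"
        deltaQuery="select id from listing
                    where updated_at &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from listing
                          where id = '${dataimporter.delta.id}'"/>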


 This thing ideally should serve up to 10^3 requests
 per second on a small (as in, less than 10 commodity boxes) cluster. In
 other words, a typical request should be CPU bound and take ~100-200 msec
 to process. Because of coordinates (that are almost never the same),
 caching of queries makes no sense;

You can snap coordinates to a grid to reduce their entropy. Also, if you filter
by bounding box, the filter argument is the bounding box, not the coordinates.
Anyway, consider post-filtering and cache=false for such filters:
http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/
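
For example, with the trunk/4.x syntax described there, a distance filter can be
run uncached as a post-filter after the cheaper clauses (field name, point and
query term are placeholders):

q=plumber&sfield=geopoint&pt=49.594857,8.468614
  &fq={!frange l=0 u=50 cache=false cost=150}geodist()

A cost of 100 or more is what pushes the filter to run as a post-filter over the
documents that survive the other clauses.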


 from what little I understand about
 Lucene internals, caching of filters probably doesn't make sense either.

But Solr does cache filters: http://wiki.apache.org/solr/SolrCaching#filterCache


 After perusing documentation and some googling (but almost no source code
 exploring yet), I understand how the schema and the queries will look like,
 and now have to figure out a specific configuration that fits the
 performance/scalability requirements. Here is what I'm thinking:

 1. Search server is an internal service that uses embedded Solr for the
 indexing part. RAMDirectoryFactory as index storage.

Bad idea. It's intended mostly for tests; the closest production-oriented
analogue is org.apache.lucene.store.instantiated.InstantiatedIndex.


 2. All data is in some sort of persistent storage on a file system, and is
 loaded into the memory when a search server starts up.

AFAIK the state of the art is to use a file-based directory (MMap or whatever)
and rely on the Linux filesystem cache. Also, Solr (and partially Lucene) caches
some things on the heap itself:
http://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration.
So this is mostly done for you already.
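
For instance, in solrconfig.xml (just a sketch; whether MMapDirectoryFactory is
available and appropriate depends on your Solr version and OS):

<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory"/>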


 3. Data updates are handled as update the persistent storage, start
 another cluster, load the world into RAM, flip the load balancer, kill the
 old cluster

No again. Lucene has a pretty nice model of segments and generations designed
for incremental updates, and Solr does a lot to search the old generation and
warm up the new one simultaneously (it just takes some memory, you know, two
times). I don't think a manual A/B scheme is applicable. Anyway, you can (but
don't really need to) play around with the replication facilities, e.g. disable
traffic for half of the nodes, push the new index to them, let them warm up,
then enable traffic (such machinery never works smoothly due to the number of
moving parts).


 4. Solr returns IDs with relevance scores; actual presentations of listings
 (as JSON documents) are constructed outside of Solr and cached in
 Memcached, as a mostly static content with a few templated bits, like
 distance%=DISTANCE_TO(-123.0123, 45.6789) %.

Using separate nodes to do the search and other nodes to stream the content
sounds good (it's mentioned in every book). It also looks like, besides the
score, you can return the distance to the user, i.e. there is no need for
<%= DISTANCE_TO(-123.0123, 45.6789) %>, just <%= doc.DISTANCE %>; see
http://wiki.apache.org/solr/SpatialSearch?#Returning_the_distance
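
A sketch of the query shape that page describes, with placeholder field and
point values (the distance comes back as each document's score):

q={!func}geodist()&sfield=coordinates&pt=45.6789,-123.0123&sort=score asc&fl=id,score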



 5. All Solr caching is switched off.

But why?




 Obviously, we are not the first people to do something like this with Solr,
 so I'm hoping for some collective wisdom on the following:

 Does this sounds like a feasible set of requirements in terms of
 performance and scalability for Solr? Are we on the right path to solving
 this problem well? If not, what should we be doing instead? What nasty
 technical/architectural gotchas are we probably missing at this stage?

 One particular piece of advice I'd be really happy to hear is you may not need
 RAMDirectoryFactory if 

Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
1. Do you see any errors / exceptions in the logs?
2. Could you have duplicates?

On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev radut...@gmail.com wrote:

 Hello,

 I created a data-config.xml file where I define a datasource and an entity
 with 12 fields.
 In my use case I have 2 databases with the same schema, so I want to
 combine in one index the 2 databases.
 I defined a second dataSource tag and duplicateed the entity with its
 field(changed the name and the datasource).
 What I'm expecting is to get around 7k results(I have around 6k in the
 first db and 1k in the second). However I'm getting a total of 2k.
 Where could be the problem?

 Thanks




-- 
Regards,

Dmitry Kan


Re: Spatial Search and faceting

2012-02-16 Thread Eric Grobler
Hi William,

Thanks for the feedback.

I will try the group query and see how the performance with 2 queries is.

Best Regards
Ericz

On Thu, Feb 16, 2012 at 4:06 AM, William Bell billnb...@gmail.com wrote:

 One way to do it is to group by city and then sort=geodist() asc

 select?group=true&group.field=city&sort=geodist() desc&rows=10&fl=city

 It might require 2 calls to SOLR to get it the way you want.

 On Wed, Feb 15, 2012 at 5:51 PM, Eric Grobler impalah...@googlemail.com
 wrote:
  Hi Solr community,
 
  I am doing a spatial search and then do a facet by city.
  Is it possible to then sort the faceted cities by distance?
 
  We would like to display the hits per city, but sort them by distance.
 
  Thanks  Regards
  Ericz
 
  q=iphone
  fq={!bbox}
  sfield=geopoint
  pt=49.594857,8.468614
  d=50
  fl=id,description,city,geopoint
 
  facet=true
  facet.field=city
  f.city.facet.limit=10
  f.city.facet.sort=count //geodist() asc



 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Realtime search with multi clients updating index simultaneously.

2012-02-16 Thread v_shan
I have a helpdesk application developed in PHP/MySQL. I want to implement
real-time full-text search and I have shortlisted Solr. The MySQL database will
store all the tickets and their updates, and that data will be imported to
build the Solr index. All search requests will be handled by Solr.

What I want is a real time search. The moment someone updates a ticket, it
should be available for search. 

As per my understanding of Solr, this is how I think the system will work:
a user updates a ticket -> the database record is modified -> a request is sent
to the Solr server to modify the corresponding document in the index.
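
For example, I imagine the update request in that last step could look roughly
like this (field names are made up), with a commitWithin hint so the change
becomes searchable within a bounded delay:

<add commitWithin="5000">
  <doc>
    <field name="id">TICKET-123</field>
    <field name="subject">Printer does not work</field>
  </doc>
</add>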

I have read a book on Solr and the questions below are troubling me.
1. The book mentions that commits are slow in Solr. Depending on the index
size, Solr's auto-warming configuration, and Solr's cache state prior to
committing, a commit can take a non-trivial amount of time. Typically it takes
a few seconds, but it can take some number of minutes in extreme cases. If this
is true, then how will I know when the data will be available for search, and
how can I implement realtime search? Also, I don't want the ticket update
operation to be slowed down (by adding the extra step of updating the Solr
index).

2. It is also mentioned that there is no transaction isolation. This means
that if more than one Solr client
were to submit modifications and commit them at overlapping times, it is
possible for part of one client's set of changes to be committed before that
client told Solr to commit. This applies to rollback as well. If this is a
problem
for your architecture then consider using one client process responsible for
updating Solr.

Does it mean that due to lack of transactional commits, Solr can mess up the
updates when multiple people update the ticket simultaneously?

Now the question before me is: Is Solr fit in my case? If yes, How?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Realtime-search-with-multi-clients-updating-index-simultaneously-tp3749881p3749881.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
1. Nothing in the logs
2. No.

On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan dmitry@gmail.com wrote:

 1. Do you see any errors / exceptions in the logs?
 2. Could you have duplicates?

 On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev radut...@gmail.com wrote:

  Hello,
 
  I created a data-config.xml file where I define a datasource and an
 entity
  with 12 fields.
  In my use case I have 2 databases with the same schema, so I want to
  combine in one index the 2 databases.
  I defined a second dataSource tag and duplicateed the entity with its
  field(changed the name and the datasource).
  What I'm expecting is to get around 7k results(I have around 6k in the
  first db and 1k in the second). However I'm getting a total of 2k.
  Where could be the problem?
 
  Thanks
 



 --
 Regards,

 Dmitry Kan



Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
It sounds a bit as if Solr stopped processing data once it had queried
everything from the smaller dataset; that's why you have 2k. If you point a
handler at just the bigger dataset (6k), do you manage to get all 6k db
entries into Solr?

On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev radut...@gmail.com wrote:

 1. Nothing in the logs
 2. No.

 On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan dmitry@gmail.com wrote:

  1. Do you see any errors / exceptions in the logs?
  2. Could you have duplicates?
 
  On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev radut...@gmail.com wrote:
 
   Hello,
  
   I created a data-config.xml file where I define a datasource and an
  entity
   with 12 fields.
   In my use case I have 2 databases with the same schema, so I want to
   combine in one index the 2 databases.
   I defined a second dataSource tag and duplicateed the entity with its
   field(changed the name and the datasource).
   What I'm expecting is to get around 7k results(I have around 6k in the
   first db and 1k in the second). However I'm getting a total of 2k.
   Where could be the problem?
  
   Thanks
  
 
 
 
  --
  Regards,
 
  Dmitry Kan
 




-- 
Regards,

Dmitry Kan


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
I tried running with just one datasource (the one that has 6k entries) and
it indexes them OK.
The same if I run the 1k database separately: it indexes OK.

On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan dmitry@gmail.com wrote:

 It sounds a bit, as if SOLR stopped processing data once it queried all
 from the smaller dataset. That's why you have 2000. If you just have a
 handler pointed to the bigger data set (6k), do you manage to get all 6k db
 entries into solr?

 On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev radut...@gmail.com wrote:

  1. Nothing in the logs
  2. No.
 
  On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan dmitry@gmail.com
 wrote:
 
   1. Do you see any errors / exceptions in the logs?
   2. Could you have duplicates?
  
   On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev radut...@gmail.com
 wrote:
  
Hello,
   
I created a data-config.xml file where I define a datasource and an
   entity
with 12 fields.
In my use case I have 2 databases with the same schema, so I want to
combine in one index the 2 databases.
I defined a second dataSource tag and duplicateed the entity with its
field(changed the name and the datasource).
What I'm expecting is to get around 7k results(I have around 6k in
 the
first db and 1k in the second). However I'm getting a total of 2k.
Where could be the problem?
   
Thanks
   
  
  
  
   --
   Regards,
  
   Dmitry Kan
  
 



 --
 Regards,

 Dmitry Kan



Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
OK, maybe you can show the db-data-config.xml, just in case?
Also, in schema.xml, does your uniqueKey correspond to the unique field in
the db?

On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev radut...@gmail.com wrote:

 I tried running with just one datasource(the one that has 6k entries) and
 it indexes them ok.
 The same, if I do sepparately the 1k database. It indexes ok.

 On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan dmitry@gmail.com wrote:

  It sounds a bit, as if SOLR stopped processing data once it queried all
  from the smaller dataset. That's why you have 2000. If you just have a
  handler pointed to the bigger data set (6k), do you manage to get all 6k
 db
  entries into solr?
 
  On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev radut...@gmail.com wrote:
 
   1. Nothing in the logs
   2. No.
  
   On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan dmitry@gmail.com
  wrote:
  
1. Do you see any errors / exceptions in the logs?
2. Could you have duplicates?
   
On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev radut...@gmail.com
  wrote:
   
 Hello,

 I created a data-config.xml file where I define a datasource and an
entity
 with 12 fields.
 In my use case I have 2 databases with the same schema, so I want
 to
 combine in one index the 2 databases.
 I defined a second dataSource tag and duplicateed the entity with
 its
 field(changed the name and the datasource).
 What I'm expecting is to get around 7k results(I have around 6k in
  the
 first db and 1k in the second). However I'm getting a total of 2k.
 Where could be the problem?

 Thanks

   
   
   
--
Regards,
   
Dmitry Kan
   
  
 
 
 
  --
  Regards,
 
  Dmitry Kan
 




-- 
Regards,

Dmitry Kan


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
<dataConfig>
  <dataSource
     name="s"
     driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
     url=""
     user=""
     password=""/>
  <dataSource
     name="p"
     driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
     url=""
     user=""
     password=""/>
  <document>
    <entity name="ms"
        datasource="s"
        query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
               m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
               m_delivery_date, m.hotsite as m_hotsite, m.guardian as m_guardian,
               m.warranty as m_warranty, m.contract as m_contract,
               st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
               sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
               c.clusterMinor as m_c_cluster_minor, c.country as m_c_country,
               c.code as m_c_code
               FROM Machine AS m
               LEFT JOIN SystemType AS st ON m.fk_systemType = st.id
               LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
               LEFT JOIN Platform AS p ON m.fk_platform = p.id
               LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
               LEFT JOIN Country AS c ON fk_country = c.id"
        readOnly="true"
        transformer="DateFormatTransformer">
      <field column="id" />
      <field column="m_machine_serial"/>
      <field column="m_machine_ivk"/>
      <field column="m_sitename"/>
      <filed column="m_delivery_date" dateTimeFormat="-MM-dd"/>
      <field column="m_hotsite"/>
      <field column="m_guardian"/>
      <field column="m_warranty"/>
      <field column="m_contract"/>
      <field column="m_st_name"/>
      <field column="m_pm_name"/>
      <field column="m_p_name"/>
      <field column="m_sv_name"/>
      <field column="m_c_cluster_major"/>
      <field column="m_c_cluster_minor"/>
      <field column="m_c_country"/>
      <field column="m_c_code"/>
    </entity>

    <entity name="mp"
        datasource="p"
        query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
               m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
               m_delivery_date, m.hotsite as m_hotsite, m.guardian as m_guardian,
               m.warranty as m_warranty, m.contract as m_contract,
               st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
               sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
               c.clusterMinor as m_c_cluster_minor, c.country as m_c_country,
               c.code as m_c_code
               FROM Machine AS m
               LEFT JOIN SystemType AS st ON m.fk_systemType = st.id
               LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
               LEFT JOIN Platform AS p ON m.fk_platform = p.id
               LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
               LEFT JOIN Country AS c ON fk_country = c.id"
        readOnly="true"
        transformer="DateFormatTransformer">
      <field column="id" />
      <field column="m_machine_serial"/>
      <field column="m_machine_ivk"/>
      <field column="m_sitename"/>
      <filed column="m_delivery_date" dateTimeFormat="-MM-dd"/>
      <field column="m_hotsite"/>
      <field column="m_guardian"/>
      <field column="m_warranty"/>
      <field column="m_contract"/>
      <field column="m_st_name"/>
      <field column="m_pm_name"/>
      <field column="m_p_name"/>
      <field column="m_sv_name"/>
      <field column="m_c_cluster_major"/>
      <field column="m_c_cluster_minor"/>
      <field column="m_c_country"/>
      <field column="m_c_code"/>
    </entity>
  </document>
</dataConfig>

I've removed the connection params
The unique key is id.

On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan dmitry@gmail.com wrote:

 OK, maybe you can show the db-data-config.xml just in case?
 Also in schema.xml, does you uniqueKey correspond to the unique field in
 the db?

 On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev radut...@gmail.com wrote:

  I tried running with just one datasource(the one that has 6k entries) and
  it indexes them ok.
  The same, if I do sepparately the 1k database. It indexes ok.
 
  On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan dmitry@gmail.com
 wrote:
 
   It sounds a bit, as if SOLR stopped processing data once it queried all
   from the smaller dataset. That's why you have 2000. If you just have a
   handler pointed to the bigger data set (6k), do you manage to get all
 6k
  db
   entries into solr?
  
   On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev radut...@gmail.com wrote:
  
1. Nothing in the logs
2. No.
   
On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan dmitry@gmail.com
   wrote:
   
 1. Do you see any errors / exceptions in the logs?
 2. Could you have duplicates?

 On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev radut...@gmail.com
   wrote:

  Hello,
 
  I created a data-config.xml file where I define a datasource and
 an
 entity
  with 12 fields.
  In my use case I have 2 databases with the same schema, so I want
  to
  combine in one index the 2 databases.
  I defined a second dataSource tag and duplicateed the entity with
  its
  field(changed the name and the datasource).
  What I'm expecting is to get around 7k results(I have around 6k
 in
   the
  first db and 1k in the second). However I'm getting a total of
 2k.
  Where could be the problem?
 
  Thanks
 



 --
 Regards,

 Dmitry Kan

   
  
  
  
   --
   Regards,
  
   Dmitry Kan
  
 



 --
 Regards,

 Dmitry Kan



How to loop through the DataImportHandler query results?

2012-02-16 Thread K, Baraneetharan

Hi Solr community,

I'm new to Solr and DataImportHandler. I have a requirement to fetch the data
from a database table and pass it to Solr.

Part of existing data-config.xml and solr schema.xml are given below,

data-config.xml

<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver"
    url="jdbc:mysql://demo22122011.com" user="" password="1234" />
<document name="sample">
<entity name="adaptation" pk="sample_id"
    query="Select * from adap"
    transformer="TemplateTransformer,DateFormatTransformer">

  <field column="field_mrmid_value" name="mrm_id_camp_s_t" />
  <field column="field_sample_scope_value" name="sample_scope_camp_s_s" />
  <field column="field_quarterly_plan_value" name="quarterly_plan_camp_s_s" />
  <field column="field_business_unit_value" name="field_business_unit_value_camp_s_t" />
  <field column="field_sub_business_value" name="field_sub_business_value_camp_s_t" />
  <field column="field_cdescription_value" name="sample_description_camp_s_t" />
  <field column="field_sample_owner_value" name="sample_owner_camp_s_s" />

</entity>
</document>
</dataConfig>


Schema.xml

<schema name="example" version="1.2">
<fields>
   <dynamicField name="*_camp_m_i" type="int"    indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="*_camp_s_i" type="int"    indexed="true" stored="true" multiValued="false"/>
   <dynamicField name="*_camp_m_t" type="text"   indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="*_camp_s_t" type="text"   indexed="true" stored="true" multiValued="false"/>
   <dynamicField name="*_camp_m_s" type="string" indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="*_camp_s_s" type="string" indexed="true" stored="true" multiValued="false"/>
   <dynamicField name="*_camp_m_l" type="long"   indexed="true" stored="true" multiValued="true"/>
   <dynamicField name="*_camp_s_l" type="long"   indexed="true" stored="true" multiValued="false"/>
</fields>
</schema>


The table used in the query (adap) is often modified; the number of columns in
this table changes frequently. Hence we are supposed to change the
data-config.xml whenever a field is added or deleted.
To avoid that, we don't want to mention the column names in the field tags, but
instead want to write a query that maps all the fields in the table to Solr
fields, even if we don't know how many columns are in the table. I need a kind
of loop which runs through all the query results and maps them to Solr fields.

Please help me.

Regards,
Baranee


Re: SolrCloud Replication Question

2012-02-16 Thread Mark Miller

On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote:

  Not sure if this is
 expected or not.

Nope - should be already resolved or will be today though.

- Mark Miller
lucidimagination.com













Re: SolrCloud Replication Question

2012-02-16 Thread Jamie Johnson
Ok, great.  Just wanted to make sure someone was aware.  Thanks for
looking into this.

On Thu, Feb 16, 2012 at 8:26 AM, Mark Miller markrmil...@gmail.com wrote:

 On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote:

  Not sure if this is
 expected or not.

 Nope - should be already resolved or will be today though.

 - Mark Miller
 lucidimagination.com













PatternReplaceFilterFactory group

2012-02-16 Thread O. Klein
PatternReplaceFilterFactory has no option to select the group to replace.

Is there a reason for this, or could this be a nice feature? 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-tp3750201p3750201.html
Sent from the Solr - User mailing list archive at Nabble.com.


custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello all:

We'd like to score the matching documents using a combination of SOLR's IR
score with another application-specific score that we store within the
documents themselves (i.e. a float field containing the app-specific
score). In particular, we'd like to calculate the final score by doing some
operations with both numbers (e.g. product, sqrt, ...).

According to what we know, there are two ways to do this in SOLR:

A) Sort by function [1]: We've tested an expression like
sort=product(score, query_score) in the SOLR query, where score is the
common SOLR IR score and query_score is our own precalculated score, but it
seems that SOLR can only do this with stored/indexed fields (and obviously
score is not stored/indexed).

B) Function queries: We've used _val_ and function queries like max, sqrt
and query, and we've obtained the desired results from a functional point
of view. However, our index is quite large (400M documents) and the
performance degrades heavily, given that function queries are AFAIK
matching all the documents.

I have two questions:

1) Apart from the two options I mentioned, is there any other (simple) way
to achieve this that we're not aware of?

2) If we have to choose the function queries path, would it be very
difficult to modify the actual implementation so that it doesn't match all
the documents, that is, to pass a query so that it only operates over the
documents matching the query?. Looking at the FunctionQuery.java source
code, there's a comment that says // instead of matching all docs, we
could also embed a query. the score could either ignore the subscore, or
boost it, which is giving us some hope that maybe it's possible and even
desirable to go in this direction. If you can give us some directions about
how to go about this, we may be able to do the actual implementation.

BTW, we're using Lucene/SOLR trunk.

Thanks a lot for your help.
Carlos

[1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function


Re: problem to indexing pdf directory

2012-02-16 Thread Gora Mohanty
On 16 February 2012 14:33, alessio crisantemi
alessio.crisant...@gmail.com wrote:
 Hi all,
 I have a problem to configure a pdf indexing from a directory in my solr
 wit DIH:

 with this data-config


 dataConfig
  dataSource type=BinFileDataSource /
  document
  entity
    name=tika-test
    processor=FileListEntityProcessor
    baseDir=D:\gioconews_archivio\marzo2011
    fileName=.*pdf
    recursive=true
    rootEntity=false
    dataSource=null/
  entity processor=FileListEntityProcessor
 url=D:\gioconews_archivio\marzo2011 format=text 
   field column=author  name=author meta=true/
   field column=title name=title meta=true/
     field column=description name=description /
     field column=comments name=comments /

     field column=content_type name=content_type /
     field column=last_modified name=last_modified /
  /entity
  /document
 /dataConfig
[...]

You should look in your Solr logs for more details about
the exception, but as things stand, the above setup will
not work for indexing PDF files. You need Tika. Searching
Google for solr tika index pdf turns up many possibilities,
e.g.,
http://www.abcseo.com/tech/search/integrating-solr-and-tika
http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/
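
Roughly, the nested-entity shape those articles describe looks like this (a
sketch only, untested here; the Solr field names must match your schema):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <entity name="pdf" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>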

Regards,
Gora


Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Mikhail Khludnev
Hi Baranee,

Some time ago I played with
http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was
pretty good stuff.
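
For example, something along these lines should copy every column of each row
into a dynamic field without listing the columns one by one (an untested
sketch; the "_camp_s_t" suffix is just borrowed from your schema):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
      url="jdbc:mysql://demo22122011.com" user="" password="1234" />
  <script><![CDATA[
    function mapAllColumns(row) {
      // snapshot the keys first, then add a dynamic-field copy of each column
      var keys = row.keySet().toArray();
      for (var i = 0; i < keys.length; i++) {
        row.put(keys[i].toLowerCase() + '_camp_s_t', row.get(keys[i]));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="adaptation" query="Select * from adap"
            transformer="script:mapAllColumns"/>
  </document>
</dataConfig>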

Regards


On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan baraneethara...@hp.comwrote:

 To avoid that we don't want to mention the column names in the field tag ,
 but want to write a query to map all the fields in the table with solr
 fileds even if we don't know, how many columns are there in the table.  I
 need a kind of loop which runs through all the query results and map that
 with solr fileds.




-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr soft commit feature

2012-02-16 Thread Nagendra Nagarajayya
The slaves will be able to replicate from the master as before, but not in
NRT, depending on your commit interval. The commit interval can be set higher
for NRT since it is not needed for searches, only for consolidating the index
changes on the master, and can be an hour or even more. It may be easier to
update the slaves directly, as the update/query performance is high
(replication in the cloud in 4.0 also follows a similar paradigm, as the docs
are sent across as a whole to be replicated, so for now you may have to do
this manually).


- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org


On 2/15/2012 8:35 AM, Dipti Srivastava wrote:

Hi Nagendra,

Certainly interesting! Would this work in a Master/slave setup where the
reads are from the slaves and all writes are to the master?

Regards,
Dipti Srivastava


On 2/15/12 5:40 AM, Nagendra Nagarajayyannagaraja...@transaxtions.com
wrote:


If you are looking for NRT functionality with Solr 3.5, you may want to
take a look at Solr 3.5 with RankingAlgorithm. This allows you to
add/update documents without a commit while being able to search
concurrently. The add/update performance to add 1m docs is about 5000
docs in about 498 ms  with one concurrent searcher. You can get more
information about Solr 3.5 with RankingAlgorithm from here:

http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 2/14/2012 4:41 PM, Dipti Srivastava wrote:

Hi All,
Is there a way to soft commit in the current released version of solr
3.5?

Regards,
Dipti Srivastava


This message is private and confidential. If you have received it in
error, please notify the sender and remove it from your system.
















Re: Can I rebuild an index and remove some fields?

2012-02-16 Thread Robert Stewart
I will test it with my big production indexes first, if it works I
will port to Java and add to contrib I think.

On Wed, Feb 15, 2012 at 10:03 PM, Li Li fancye...@gmail.com wrote:
 great. I think you could make it a public tool. maybe others also need such
 functionality.

 On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart bstewart...@gmail.comwrote:

 I implemented an index shrinker and it works.  I reduced my test index
 from 6.6 GB to 3.6 GB by removing a single shingled field I did not
 need anymore.  I'm actually using Lucene.Net for this project so code
 is C# using Lucene.Net 2.9.2 API.  But basic idea is:

 Create an IndexReader wrapper that only enumerates the terms you want
 to keep, and that removes terms from documents when returning
 documents.

 Use the SegmentMerger to re-write each segment (where each segment is
 wrapped by the wrapper class), writing new segment to a new directory.
 Collect the SegmentInfos and do a commit in order to create a new
 segments file in new index directory

 Done - you now have a shrunk index with specified terms removed.

 Implementation uses separate thread for each segment, so it re-writes
 them in parallel.  Took about 15 minutes to do 770,000 doc index on my
 macbook.


 On Tue, Feb 14, 2012 at 10:12 PM, Li Li fancye...@gmail.com wrote:
  I have roughly read the codes of 4.0 trunk. maybe it's feasible.
     SegmentMerger.add(IndexReader) will add to be merged Readers
     merge() will call
       mergeTerms(segmentWriteState);
       mergePerDoc(segmentWriteState);
 
    mergeTerms() will construct fields from IndexReaders
     for(int
  readerIndex=0;readerIndexmergeState.readers.size();readerIndex++) {
       final MergeState.IndexReaderAndLiveDocs r =
  mergeState.readers.get(readerIndex);
       final Fields f = r.reader.fields();
       final int maxDoc = r.reader.maxDoc();
       if (f != null) {
         slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
         fields.add(f);
       }
       docBase += maxDoc;
     }
     So If you wrapper your IndexReader and override its fields() method,
  maybe it will work for merge terms.
 
     for DocValues, it can also override AtomicReader.docValues(). just
  return null for fields you want to remove. maybe it should
  traverse CompositeReader's getSequentialSubReaders() and wrapper each
  AtomicReader
 
     other things like term vectors norms are similar.
  On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com
 wrote:
 
  I was thinking if I make a wrapper class that aggregates another
  IndexReader and filter out terms I don't want anymore it might work.
 And
  then pass that wrapper into SegmentMerger.  I think if I filter out
 terms
  on GetFieldNames(...) and Terms(...) it might work.
 
  Something like:
 
  HashSetstring ignoredTerms=...;
 
  FilteringIndexReader wrapper=new FilterIndexReader(reader);
 
  SegmentMerger merger=new SegmentMerger(writer);
 
  merger.add(wrapper);
 
  merger.Merge();
 
 
 
 
 
  On Feb 14, 2012, at 1:49 AM, Li Li wrote:
 
   for method 2, delete is wrong. we can't delete terms.
     you also should hack with the tii and tis file.
  
   On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote:
  
   method1, dumping data
   for stored fields, you can traverse the whole index and save it to
   somewhere else.
   for indexed but not stored fields, it may be more difficult.
      if the indexed and not stored field is not analyzed(fields such as
   id), it's easy to get from FieldCache.StringIndex.
      But for analyzed fields, though theoretically it can be restored
 from
   term vector and term position, it's hard to recover from index.
  
   method 2, hack with metadata
   1. indexed fields
        delete by query, e.g. field:*
   2. stored fields
         because all fields are stored sequentially. it's not easy to
  delete
   some fields. this will not affect search speed. but if you want to
 get
   stored fields,  and the useless fields are very long, then it will
 slow
   down.
         also it's possible to hack with it. but need more effort to
   understand the index file format  and traverse the fdt/fdx file.
  
 
 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
  
   this will give you some insight.
  
  
   On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart 
 bstewart...@gmail.com
  wrote:
  
   Lets say I have a large index (100M docs, 1TB, split up between 10
   indexes).  And a bunch of the stored and indexed fields are not
  used in
   search at all.  In order to save memory and disk, I'd like to
 rebuild
  that
   index *without* those fields, but I don't have original documents to
   rebuild entire index with (don't have the full-text anymore, etc.).
  Is
   there some way to rebuild or optimize an existing index with only a
  sub-set
   of the existing indexed fields?  Or alternatively is there a way to
  avoid
   loading some indexed fields at all ( to avoid loading term infos and
  

Payload and exact search - 2

2012-02-16 Thread leonardo2
Hello,
I already posted this question, but for some reason it was attached to a
thread with a different topic.


Is there a way to perform an 'exact search' on a payload field?

I have to index text with auxiliary info for each word. In particular, each
word is associated with the bounding box containing it in the original PDF
page (it is used for highlighting the search terms in the PDF). I used the
payload to store that information.

In the schema.xml, the fieldType definition is: 

--- 
<fieldtype name="wppayloads" stored="false" indexed="true"
    class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
  </analyzer>
</fieldtype>
--- 

while the field definition is: 

--- 
<field name="words" type="wppayloads" indexed="true" stored="true"
    required="true" multiValued="true"/>
--- 

When indexing, the field 'words' contains a list of word|box as in the
following example: 

--- 
doc_id=example 
words={Fonte:|307.62,948.16,324.62,954.25 Comune|326.29,948.16,349.07,954.25
di|350.74,948.16,355.62,954.25 Bologna|358.95,948.16,381.28,954.25} 
--- 

This solution works well except in the case of an exact search. For example,
assuming the only indexed doc is the 'example' doc shown above, the query
words:Comune di Bologna returns no results.

Does someone know whether it is possible to perform an 'exact search' on a
payload field?

Thanks in advance, 
Leonardo

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Payload-and-exact-search-2-tp3750355p3750355.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
I think the problem here is that you are actually trying to create separate
documents for two different tables, while your config is aiming to create
only one document. Here is one solution (not tried by me):

--
You can have multiple documents generated by the same data-config:

dataConfig
 dataSource name=ds1 .../
 dataSource name=ds2 .../
 dataSource name=ds3 .../
 document
   entity blah blah rootEntity=false
   entity blah blah this is a document
  entity sets unique id/
   /document
   document blah blah this is another document
  entity sets unique id
   /document
 /document
/dataConfig

It's the rootEntity="false" that makes the child entity a document.
--

Dmitry

On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev radut...@gmail.com wrote:

 dataConfig
  dataSource
 name=s
 driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
 url=
 user=
 password=/
  dataSource
 name=p
  driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
 url=
 user=
 password=/
  document
entity name=ms
datasource=s
 query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
 m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as m_delivery_date,
 m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as m_warranty,
 m.contract as m_contract,
   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
 sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
 c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
 m_c_code
   FROM Machine AS m
   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
   LEFT JOIN Platform AS p ON m.fk_platform = p.id
   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
   LEFT JOIN Country AS c ON fk_country = c.id
 readOnly=true
 transformer=DateFormatTransformer
 field column=id /
 field column=m_machine_serial/
 field column=m_machine_ivk/
 field column=m_sitename/
 filed column=m_delivery_date dateTimeFormat=-MM-dd/
 field column=m_hotsite/
 field column=m_guardian/
 field column=m_warranty/
 field column=m_contract/
 field column=m_st_name/
 field column=m_pm_name/
 field column=m_p_name/
 field column=m_sv_name/
 field column=m_c_cluster_major/
 field column=m_c_cluster_minor/
 field column=m_c_country/
 field column=m_c_code/
   /entity

   entity name=mp
datasource=p
 query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
 m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as m_delivery_date,
 m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as m_warranty,
 m.contract as m_contract,
   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
 sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
 c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
 m_c_code
   FROM Machine AS m
   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
   LEFT JOIN Platform AS p ON m.fk_platform = p.id
   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
   LEFT JOIN Country AS c ON fk_country = c.id
 readOnly=true
 transformer=DateFormatTransformer
 field column=id /
 field column=m_machine_serial/
 field column=m_machine_ivk/
 field column=m_sitename/
 filed column=m_delivery_date dateTimeFormat=-MM-dd/
 field column=m_hotsite/
 field column=m_guardian/
 field column=m_warranty/
 field column=m_contract/
 field column=m_st_name/
 field column=m_pm_name/
 field column=m_p_name/
 field column=m_sv_name/
 field column=m_c_cluster_major/
 field column=m_c_cluster_minor/
 field column=m_c_country/
 field column=m_c_code/
   /entity
  /document
 /dataConfig

 I've removed the connection params
 The unique key is id.

 On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan dmitry@gmail.com wrote:

  OK, maybe you can show the db-data-config.xml just in case?
  Also in schema.xml, does you uniqueKey correspond to the unique field
 in
  the db?
 
  On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev radut...@gmail.com wrote:
 
   I tried running with just one datasource(the one that has 6k entries)
 and
   it indexes them ok.
   The same, if I do sepparately the 1k database. It indexes ok.
  
   On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan dmitry@gmail.com
  wrote:
  
It sounds a bit, as if SOLR stopped processing data once it queried
 all
from the smaller dataset. That's why you have 2000. If you just have
 a
handler pointed to the bigger data set (6k), do you manage to get all
  6k
   db
entries into solr?
   
On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev radut...@gmail.com
 wrote:
   
 1. Nothing in the logs
 2. No.

 On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan dmitry@gmail.com
 
wrote:

  1. Do you see any errors / exceptions in the logs?
  2. Could you have duplicates?
 
  On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 

Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
I'm not sure I follow.
The idea is to have only one document. Do the multiple documents have the
same structure then (different datasources), and if so, how are they actually
indexed?

Thanks.

On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan dmitry@gmail.com wrote:

 I think the problem here is that initially you trying to create separate
 documents for two different tables, while your config is aiming to create
 only one document. Here there is one solution (not tried by me):

 --
 You can have multiple documents generated by the same data-config:

 dataConfig
  dataSource name=ds1 .../
  dataSource name=ds2 .../
  dataSource name=ds3 .../
  document
   entity blah blah rootEntity=false
   entity blah blah this is a document
  entity sets unique id/
   /document
   document blah blah this is another document
  entity sets unique id
   /document
  /document
 /dataConfig

 It's the 'rootEntity=false that makes the child entity a document.
 --

 Dmitry

 On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev radut...@gmail.com wrote:

  dataConfig
   dataSource
  name=s
  driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
  url=
  user=
  password=/
   dataSource
  name=p
   driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
  url=
  user=
  password=/
   document
 entity name=ms
 datasource=s
  query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
  m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
 m_delivery_date,
  m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
 m_warranty,
  m.contract as m_contract,
st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
  sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
  c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
  m_c_code
FROM Machine AS m
LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
LEFT JOIN Platform AS p ON m.fk_platform = p.id
LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
LEFT JOIN Country AS c ON fk_country = c.id
  readOnly=true
  transformer=DateFormatTransformer
  field column=id /
  field column=m_machine_serial/
  field column=m_machine_ivk/
  field column=m_sitename/
  filed column=m_delivery_date dateTimeFormat=-MM-dd/
  field column=m_hotsite/
  field column=m_guardian/
  field column=m_warranty/
  field column=m_contract/
  field column=m_st_name/
  field column=m_pm_name/
  field column=m_p_name/
  field column=m_sv_name/
  field column=m_c_cluster_major/
  field column=m_c_cluster_minor/
  field column=m_c_country/
  field column=m_c_code/
/entity
 
entity name=mp
 datasource=p
  query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
  m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
 m_delivery_date,
  m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
 m_warranty,
  m.contract as m_contract,
st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
  sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
  c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
  m_c_code
FROM Machine AS m
LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
LEFT JOIN Platform AS p ON m.fk_platform = p.id
LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
LEFT JOIN Country AS c ON fk_country = c.id
  readOnly=true
  transformer=DateFormatTransformer
  field column=id /
  field column=m_machine_serial/
  field column=m_machine_ivk/
  field column=m_sitename/
  filed column=m_delivery_date dateTimeFormat=-MM-dd/
  field column=m_hotsite/
  field column=m_guardian/
  field column=m_warranty/
  field column=m_contract/
  field column=m_st_name/
  field column=m_pm_name/
  field column=m_p_name/
  field column=m_sv_name/
  field column=m_c_cluster_major/
  field column=m_c_cluster_minor/
  field column=m_c_country/
  field column=m_c_code/
/entity
   /document
  /dataConfig
 
  I've removed the connection params
  The unique key is id.
 
  On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan dmitry@gmail.com
 wrote:
 
   OK, maybe you can show the db-data-config.xml just in case?
   Also in schema.xml, does you uniqueKey correspond to the unique field
  in
   the db?
  
   On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev radut...@gmail.com wrote:
  
I tried running with just one datasource(the one that has 6k entries)
  and
it indexes them ok.
The same, if I do sepparately the 1k database. It indexes ok.
   
On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan dmitry@gmail.com
   wrote:
   
 It sounds a bit, as if SOLR stopped processing data once it queried
  all
 from the smaller dataset. That's why you have 2000. If you just
 have
  a
 handler pointed to the bigger data set (6k), do you manage to get
 

Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
yes, but if I use TikaEntityProcessor the result of my full-import is

<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">1</str>

<str name="Total Documents Skipped">0</str>

<str name="">Indexing failed. Rolled back all changes.</str>




2012/2/16 alessio crisantemi alessio.crisant...@gmail.com

 Hi all,
 I have a problem to configure a pdf indexing from a directory in my solr
 wit DIH:

 with this data-config


 dataConfig
  dataSource type=BinFileDataSource /
  document
   entity
 name=tika-test
 processor=FileListEntityProcessor
 baseDir=D:\gioconews_archivio\marzo2011
 fileName=.*pdf
 recursive=true
 rootEntity=false
 dataSource=null/
   entity processor=FileListEntityProcessor
 url=D:\gioconews_archivio\marzo2011 format=text 
field column=author  name=author meta=true/
field column=title name=title meta=true/
  field column=description name=description /
  field column=comments name=comments /

  field column=content_type name=content_type /
  field column=last_modified name=last_modified /
   /entity
  /document
 /dataConfig

 I obtain this result:



   str name=commandfull-import/str

   str name=statusidle/str

   str name=importResponse /

 - lst name=statusMessages

   str name=Time Elapsed0:0:2.44/str

   str name=Total Requests made to DataSource0/str

   str name=Total Rows Fetched43/str

   str name=Total Documents Skipped0/str

   str name=Full Dump Started2012-02-12 19:06:00/str

   str name=Indexing failed. Rolled back all changes./str

   str name=Rolledback2012-02-12 19:06:00/str
   /lst


 suggestions?
 thank you
 alessio



Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
Each document in SOLR will correspond to one db record and since both
databases have the same schema, you can't index two records from two
databases into the same SOLR document.

So after indexing, you should have 7k different documents, each of which
holds data from a db record.

Also, one problem I see here is that since the record id in each table is
unique only within the table and (most probably) not globally, there will
be collisions. To avoid this, I would prepend the record id with some static
value, like: concat('t1', CONVERT(id, CHAR(8))).
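
(Since the config above uses the SQL Server driver, the equivalent there would
be something along these lines; the 'db1_' prefix is arbitrary:)

SELECT 'db1_' + CAST(m.id AS VARCHAR(16)) AS id,
       m.serial AS m_machine_serial
FROM Machine AS m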

Dmitry

On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev radut...@gmail.com wrote:

 I'm not sure I follow.
 The idea is to have only one document. Do the multiple documents have the
 same structure then(different datasources), and if so how are they actually
 indexed?

 Thanks.

 On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan dmitry@gmail.com wrote:

  I think the problem here is that initially you trying to create separate
  documents for two different tables, while your config is aiming to create
  only one document. Here there is one solution (not tried by me):
 
  --
  You can have multiple documents generated by the same data-config:
 
  dataConfig
   dataSource name=ds1 .../
   dataSource name=ds2 .../
   dataSource name=ds3 .../
   document
entity blah blah rootEntity=false
entity blah blah this is a document
   entity sets unique id/
/document
document blah blah this is another document
   entity sets unique id
/document
   /document
  /dataConfig
 
  It's the 'rootEntity=false that makes the child entity a document.
  --
 
  Dmitry
 
  On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev radut...@gmail.com wrote:
 
   dataConfig
dataSource
   name=s
   driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
   url=
   user=
   password=/
dataSource
   name=p
driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
   url=
   user=
   password=/
document
  entity name=ms
  datasource=s
   query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
   m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
  m_delivery_date,
   m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
  m_warranty,
   m.contract as m_contract,
 st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
   sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
   c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
 as
   m_c_code
 FROM Machine AS m
 LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
 LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
 LEFT JOIN Platform AS p ON m.fk_platform = p.id
 LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
 LEFT JOIN Country AS c ON fk_country = c.id
   readOnly=true
   transformer=DateFormatTransformer
   field column=id /
   field column=m_machine_serial/
   field column=m_machine_ivk/
   field column=m_sitename/
   filed column=m_delivery_date dateTimeFormat=-MM-dd/
   field column=m_hotsite/
   field column=m_guardian/
   field column=m_warranty/
   field column=m_contract/
   field column=m_st_name/
   field column=m_pm_name/
   field column=m_p_name/
   field column=m_sv_name/
   field column=m_c_cluster_major/
   field column=m_c_cluster_minor/
   field column=m_c_country/
   field column=m_c_code/
 /entity
  
 entity name=mp
  datasource=p
   query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
   m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
  m_delivery_date,
   m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
  m_warranty,
   m.contract as m_contract,
 st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
   sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
   c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
 as
   m_c_code
 FROM Machine AS m
 LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
 LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
 LEFT JOIN Platform AS p ON m.fk_platform = p.id
 LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
 LEFT JOIN Country AS c ON fk_country = c.id
   readOnly=true
   transformer=DateFormatTransformer
   field column=id /
   field column=m_machine_serial/
   field column=m_machine_ivk/
   field column=m_sitename/
   filed column=m_delivery_date dateTimeFormat=-MM-dd/
   field column=m_hotsite/
   field column=m_guardian/
   field column=m_warranty/
   field column=m_contract/
   field column=m_st_name/
   field column=m_pm_name/
   field column=m_p_name/
   field column=m_sv_name/
   field column=m_c_cluster_major/
   field column=m_c_cluster_minor/
   field column=m_c_country/
   field column=m_c_code/
 /entity
/document
   /dataConfig
  
   I've removed the connection params
   The unique key is id.
  
   On Thu, Feb 16, 2012 at 2:27 

Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

could you show us what your Solr call looks like?
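
One direction that might be worth a try (not verified against your setup): the
{!boost} query parser multiplies the wrapped query's score by a function value
and only scores documents matching the wrapped query, e.g.:

q={!boost b=sqrt(query_score) v=$qq}&qq=your normal user query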

Regards,
Em

Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
 Hello all:
 
 We'd like to score the matching documents using a combination of SOLR's IR
 score with another application-specific score that we store within the
 documents themselves (i.e. a float field containing the app-specific
 score). In particular, we'd like to calculate the final score doing some
 operations with both numbers (i.e product, sqrt, ...)
 
 According to what we know, there are two ways to do this in SOLR:
 
 A) Sort by function [1]: We've tested an expression like
 sort=product(score, query_score) in the SOLR query, where score is the
 common SOLR IR score and query_score is our own precalculated score, but it
 seems that SOLR can only do this with stored/indexed fields (and obviously
 score is not stored/indexed).
 
 B) Function queries: We've used _val_ and function queries like max, sqrt
 and query, and we've obtained the desired results from a functional point
 of view. However, our index is quite large (400M documents) and the
 performance degrades heavily, given that function queries are AFAIK
 matching all the documents.
 
 I have two questions:
 
 1) Apart from the two options I mentioned, is there any other (simple) way
 to achieve this that we're not aware of?
 
 2) If we have to choose the function queries path, would it be very
 difficult to modify the actual implementation so that it doesn't match all
 the documents, that is, to pass a query so that it only operates over the
 documents matching the query?. Looking at the FunctionQuery.java source
 code, there's a comment that says // instead of matching all docs, we
 could also embed a query. the score could either ignore the subscore, or
 boost it, which is giving us some hope that maybe it's possible and even
 desirable to go in this direction. If you can give us some directions about
 how to go about this, we may be able to do the actual implementation.
 
 BTW, we're using Lucene/SOLR trunk.
 
 Thanks a lot for your help.
 Carlos
 
 [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
 


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
Really good point on the ids, I completely overlooked that matter.
I will give it a try.
Thanks again.

On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan dmitry@gmail.com wrote:

 Each document in SOLR will correspond to one db record and since both
 databases have the same schema, you can't index two records from two
 databases into the same SOLR document.

 So after indexing, you should have 7k different documents, each of which
 holds data from a db record.

 Also one problem I see here is that since the record id in each table is
 unique only within the table and (most probably) not globally, there will
  be collisions. To avoid this, I would prepend the record id with some static
  value, like: concat('t1', CONVERT(id, CHAR(8))).
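 A minimal sketch of how that could look in data-config.xml (the 's_' / 'p_'
 prefixes are illustrative, and the CONCAT/CONVERT syntax is MySQL-style, so
 adjust the string concatenation to whatever the actual database supports):

   <entity name="ms" dataSource="s"
           query="SELECT CONCAT('s_', CONVERT(m.id, CHAR(8))) AS id,
                         m.serial AS m_machine_serial
                  FROM Machine AS m">
     <field column="id"/>
     <field column="m_machine_serial"/>
   </entity>
   <entity name="mp" dataSource="p"
           query="SELECT CONCAT('p_', CONVERT(m.id, CHAR(8))) AS id,
                         m.serial AS m_machine_serial
                  FROM Machine AS m">
     <field column="id"/>
     <field column="m_machine_serial"/>
   </entity>

 With the ids prefixed like this, documents coming from the two databases can
 never overwrite each other on the uniqueKey.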

 Dmitry

 On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev radut...@gmail.com wrote:

  I'm not sure I follow.
  The idea is to have only one document. Do the multiple documents have the
  same structure then(different datasources), and if so how are they
 actually
  indexed?
 
  Thanks.
 
  On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan dmitry@gmail.com
 wrote:
 
   I think the problem here is that initially you trying to create
 separate
   documents for two different tables, while your config is aiming to
 create
   only one document. Here there is one solution (not tried by me):
  
   --
   You can have multiple documents generated by the same data-config:
  
   dataConfig
dataSource name=ds1 .../
dataSource name=ds2 .../
dataSource name=ds3 .../
document
 entity blah blah rootEntity=false
 entity blah blah this is a document
entity sets unique id/
 /document
 document blah blah this is another document
entity sets unique id
 /document
/document
   /dataConfig
  
    It's the 'rootEntity=false' that makes the child entity a document.
   --
  
   Dmitry
  
   On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev radut...@gmail.com wrote:
  
dataConfig
 dataSource
name=s
driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
url=
user=
password=/
 dataSource
name=p
 driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
url=
user=
password=/
 document
   entity name=ms
   datasource=s
query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
   m_delivery_date,
m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
   m_warranty,
m.contract as m_contract,
  st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
  as
m_c_code
  FROM Machine AS m
  LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
  LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
  LEFT JOIN Platform AS p ON m.fk_platform = p.id
  LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
  LEFT JOIN Country AS c ON fk_country = c.id
readOnly=true
transformer=DateFormatTransformer
field column=id /
field column=m_machine_serial/
field column=m_machine_ivk/
field column=m_sitename/
 field column=m_delivery_date dateTimeFormat=yyyy-MM-dd/
field column=m_hotsite/
field column=m_guardian/
field column=m_warranty/
field column=m_contract/
field column=m_st_name/
field column=m_pm_name/
field column=m_p_name/
field column=m_sv_name/
field column=m_c_cluster_major/
field column=m_c_cluster_minor/
field column=m_c_country/
field column=m_c_code/
  /entity
   
  entity name=mp
   datasource=p
query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
   m_delivery_date,
m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
   m_warranty,
m.contract as m_contract,
  st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
  as
m_c_code
  FROM Machine AS m
  LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
  LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
  LEFT JOIN Platform AS p ON m.fk_platform = p.id
  LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
  LEFT JOIN Country AS c ON fk_country = c.id
readOnly=true
transformer=DateFormatTransformer
field column=id /
field column=m_machine_serial/
field column=m_machine_ivk/
field column=m_sitename/
 field column=m_delivery_date dateTimeFormat=yyyy-MM-dd/
field column=m_hotsite/
field column=m_guardian/
field column=m_warranty/
field column=m_contract/
field column=m_st_name/
field column=m_pm_name/

Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
no problem, hope it helps, you're welcome.

On Thu, Feb 16, 2012 at 5:03 PM, Radu Toev radut...@gmail.com wrote:

 Really good point on the ids, I completely overlooked that matter.
 I will give it a try.
 Thanks again.

 On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan dmitry@gmail.com wrote:

  Each document in SOLR will correspond to one db record and since both
  databases have the same schema, you can't index two records from two
  databases into the same SOLR document.
 
  So after indexing, you should have 7k different documents, each of which
  holds data from a db record.
 
  Also one problem I see here is that since the record id in each table is
  unique only within the table and (most probably) not globally, there will
  be collisions. To aviod this, I would prepend a record_id with some
 static
  value, like: concat(t1,  CONVERT(id, CHAR(8))).
 
  Dmitry
 
  On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev radut...@gmail.com wrote:
 
   I'm not sure I follow.
   The idea is to have only one document. Do the multiple documents have
 the
   same structure then(different datasources), and if so how are they
  actually
   indexed?
  
   Thanks.
  
   On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan dmitry@gmail.com
  wrote:
  
I think the problem here is that initially you trying to create
  separate
documents for two different tables, while your config is aiming to
  create
only one document. Here there is one solution (not tried by me):
   
--
You can have multiple documents generated by the same data-config:
   
dataConfig
 dataSource name=ds1 .../
 dataSource name=ds2 .../
 dataSource name=ds3 .../
 document
  entity blah blah rootEntity=false
  entity blah blah this is a document
 entity sets unique id/
  /document
  document blah blah this is another document
 entity sets unique id
  /document
 /document
/dataConfig
   
 It's the 'rootEntity=false' that makes the child entity a document.
--
   
Dmitry
   
On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev radut...@gmail.com
 wrote:
   
 dataConfig
  dataSource
 name=s
 driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
 url=
 user=
 password=/
  dataSource
 name=p
  driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
 url=
 user=
 password=/
  document
entity name=ms
datasource=s
 query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
 m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
m_delivery_date,
 m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
m_warranty,
 m.contract as m_contract,
   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
 sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
 c.clusterMinor as m_c_cluster_minor, c.country as m_c_country,
 c.code
   as
 m_c_code
   FROM Machine AS m
   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
   LEFT JOIN Platform AS p ON m.fk_platform = p.id
   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
   LEFT JOIN Country AS c ON fk_country = c.id
 readOnly=true
 transformer=DateFormatTransformer
 field column=id /
 field column=m_machine_serial/
 field column=m_machine_ivk/
 field column=m_sitename/
  field column=m_delivery_date dateTimeFormat=yyyy-MM-dd/
 field column=m_hotsite/
 field column=m_guardian/
 field column=m_warranty/
 field column=m_contract/
 field column=m_st_name/
 field column=m_pm_name/
 field column=m_p_name/
 field column=m_sv_name/
 field column=m_c_cluster_major/
 field column=m_c_cluster_minor/
 field column=m_c_country/
 field column=m_c_code/
   /entity

   entity name=mp
datasource=p
 query=SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
 m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
m_delivery_date,
 m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
m_warranty,
 m.contract as m_contract,
   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
 sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
 c.clusterMinor as m_c_cluster_minor, c.country as m_c_country,
 c.code
   as
 m_c_code
   FROM Machine AS m
   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
   LEFT JOIN Platform AS p ON m.fk_platform = p.id
   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
   LEFT JOIN Country AS c ON fk_country = c.id
 readOnly=true
 transformer=DateFormatTransformer
 field column=id /
 field column=m_machine_serial/
 field column=m_machine_ivk/
 

Frequent garbage collections after a day of operation

2012-02-16 Thread Matthias Käppler
Hey everyone,

we're running into some operational problems with our SOLR production
setup here and were wondering if anyone else is affected or has even
solved these problems before. We're running a vanilla SOLR 3.4.0 in
several Tomcat 6 instances, so nothing out of the ordinary, but after
a day or so of operation we see increased response times from SOLR, up
to 3 times higher on average. During this time we see increased CPU
load due to heavy garbage collection in the JVM, which bogs down
the whole system, so throughput decreases, naturally. When restarting
the slaves, everything goes back to normal, but that's more like a
brute force solution.

The thing is, we don't know what's causing this and we don't have that
much experience with Java stacks since we're for most parts a Rails
company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
seeing this, or can you think of a reason for this? Most of our
queries to SOLR involve the DismaxHandler and the spatial search query
components. We don't use any custom request handlers so far.

Thanks in advance,
-Matthias

-- 
Matthias Käppler
Lead Developer API  Mobile

Qype GmbH
Großer Burstah 50-52
20457 Hamburg
Telephone: +49 (0)40 - 219 019 2 - 160
Skype: m_kaeppler
Email: matth...@qype.com

Managing Director: Ian Brotherston
Amtsgericht Hamburg
HRB 95913

This e-mail and its attachments may contain confidential and/or
privileged information. If you are not the intended recipient (or have
received this e-mail in error) please notify the sender immediately
and destroy this e-mail and its attachments. Any unauthorized copying,
disclosure or distribution of this e-mail and  its attachments is
strictly forbidden. This notice also applies to future messages.


RE: PatternReplaceFilterFactory group

2012-02-16 Thread Steven A Rowe
Hi O.,

PatternReplaceFilter(Factory) uses Matcher.replaceAll() or replaceFirst(), both 
of which take in a string that can include any or all groups using the syntax 
$n, where n is the group number.  See the Matcher.appendReplacement() 
javadocs for an explanation of the functionality and syntax: 
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement%28java.lang.StringBuffer,%20java.lang.String%29
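
For example, a minimal sketch of a filter definition that keeps only a capture
group (the pattern and replacement here are purely illustrative):

  <filter class="solr.PatternReplaceFilterFactory"
          pattern="([a-z]+)-(\d+)" replacement="$1" replace="all"/>

This would turn a token like foo-123 into foo, i.e. the replacement string
refers to group 1 of the match via $1.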

Steve

 -Original Message-
 From: O. Klein [mailto:kl...@octoweb.nl]
 Sent: Thursday, February 16, 2012 8:34 AM
 To: solr-user@lucene.apache.org
 Subject: PatternReplaceFilterFactory group
 
 PatternReplaceFilterFactory has no option to select the group to replace.
 
 Is there a reason for this, or could this be a nice feature?
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-
 tp3750201p3750201.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

The URL is quite large (w/ shards, ...), maybe it's best if I paste the
relevant parts.

Our q parameter is:

  
q:_val_:\product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\,

The subqueries q8, q7, q4 and q3 are regular queries, for example:

q7:stopword_phrase:colomba~1 AND stopword_phrase:santa AND
wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
(stopword_phrase:las AND stopword_phrase:de)

We've executed the subqueries q3-q8 independently and they're very fast,
but when we introduce the function queries as described below, it all goes
10X slower.

Let me know if you need anything else.

Thanks
Carlos


Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 4:02 PM, Em mailformailingli...@yahoo.de wrote:

 Hello carlos,

 could you show us how your Solr-call looks like?

 Regards,
 Em

 Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
  Hello all:
 
  We'd like to score the matching documents using a combination of SOLR's
 IR
  score with another application-specific score that we store within the
  documents themselves (i.e. a float field containing the app-specific
  score). In particular, we'd like to calculate the final score doing some
  operations with both numbers (i.e product, sqrt, ...)
 
  According to what we know, there are two ways to do this in SOLR:
 
  A) Sort by function [1]: We've tested an expression like
  sort=product(score, query_score) in the SOLR query, where score is the
  common SOLR IR score and query_score is our own precalculated score, but
 it
  seems that SOLR can only do this with stored/indexed fields (and
 obviously
  score is not stored/indexed).
 
  B) Function queries: We've used _val_ and function queries like max, sqrt
  and query, and we've obtained the desired results from a functional point
  of view. However, our index is quite large (400M documents) and the
  performance degrades heavily, given that function queries are AFAIK
  matching all the documents.
 
  I have two questions:
 
  1) Apart from the two options I mentioned, is there any other (simple)
 way
  to achieve this that we're not aware of?
 
  2) If we have to choose the function queries path, would it be very
  difficult to modify the actual implementation so that it doesn't match
 all
  the documents, that is, to pass a query so that it only operates over the
  documents matching the query?. Looking at the FunctionQuery.java source
  code, there's a comment that says // instead of matching all docs, we
  could also embed a query. the score could either ignore the subscore, or
  boost it, which is giving us some hope that maybe it's possible and even
  desirable to go in this direction. If you can give us some directions
 about
  how to go about this, we may be able to do the actual implementation.
 
  BTW, we're using Lucene/SOLR trunk.
 
  Thanks a lot for your help.
  Carlos
 
  [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
 



Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Chantal Ackermann
If your script turns out to be too complex to maintain, and you are developing
in Java anyway, you could extend EntityProcessor and handle the data in
a custom way. I've done that to transform a datamart-like data structure
back into a row-based one.

Basically you override the method that gets the data in a Map and
transform it into a different Map which contains the fields as
understood by your schema.
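
A rough sketch of that approach, assuming the rows come from a SQL source so
that SqlEntityProcessor can be extended (class, column and field names here
are made up):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.solr.handler.dataimport.SqlEntityProcessor;

  public class RowTransformingEntityProcessor extends SqlEntityProcessor {
    @Override
    public Map<String, Object> nextRow() {
      Map<String, Object> row = super.nextRow();   // raw row as fetched by DIH
      if (row == null) return null;                // end of the result set
      Map<String, Object> doc = new HashMap<String, Object>();
      doc.put("id", row.get("PK"));                // map source columns to schema fields
      doc.put("title", row.get("SOME_WIDE_COLUMN"));
      return doc;                                  // DIH indexes the returned map
    }
  }

The entity then references it via its processor attribute (use the full
package name unless the class lives in the dataimport package).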

Chantal


On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote:
 Hi Baranee,
 
 Some time ago I played with
 http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was a
 pretty good stuff.
 
 Regards
 
 
 On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan 
 baraneethara...@hp.comwrote:
 
  To avoid that we don't want to mention the column names in the field tag ,
  but want to write a query to map all the fields in the table with solr
  fileds even if we don't know, how many columns are there in the table.  I
  need a kind of loop which runs through all the query results and map that
  with solr fileds.
 
 
 
 



Re: Frequent garbage collections after a day of operation

2012-02-16 Thread Chantal Ackermann
Make sure your Tomcat instances are started each with a max heap size
that adds up to something a lot lower than the complete RAM of your
system.
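
For example, something like this in Tomcat's setenv.sh (sizes and paths are
only placeholders, tune them to your hardware):

  export CATALINA_OPTS="-Xms2g -Xmx2g -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/tomcat6/gc.log"

The GC log then shows whether the old generation keeps filling up over time
(which would point to a leak) or whether the heap is simply too small for the
configured caches.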

Frequent Garbage collection means that your applications request more
RAM but your Java VM has no more resources, so it requires the Garbage
Collector to free memory so that the requested new objects can be
created. It's not indicating a memory leak unless you are running a
custom EntityProcessor in DIH that runs into an infinite loop and
creates huge amounts of schema fields. ;-)

Also - if you are doing hot deploys on Tomcat, you will have to restart
the Tomcat instance on a regular basis, as hot deploys DO leak memory
after a while. (You might be seeing class undeploy messages in
catalina.out and later on OutOfMemory error messages.)

If this is not of any help you will probably have to provide a bit more
information on your Tomcat and SOLR configuration setup.

Chantal


On Thu, 2012-02-16 at 16:22 +0100, Matthias Käppler wrote:
 Hey everyone,
 
 we're running into some operational problems with our SOLR production
 setup here and were wondering if anyone else is affected or has even
 solved these problems before. We're running a vanilla SOLR 3.4.0 in
 several Tomcat 6 instances, so nothing out of the ordinary, but after
 a day or so of operation we see increased response times from SOLR, up
 to 3 times increases on average. During this time we see increased CPU
 load due to heavy garbage collection in the JVM, which bogs down the
 the whole system, so throughput decreases, naturally. When restarting
 the slaves, everything goes back to normal, but that's more like a
 brute force solution.
 
 The thing is, we don't know what's causing this and we don't have that
 much experience with Java stacks since we're for most parts a Rails
 company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
 seeing this, or can you think of a reason for this? Most of our
 queries to SOLR involve the DismaxHandler and the spatial search query
 components. We don't use any custom request handlers so far.
 
 Thanks in advance,
 -Matthias
 



RE: PatternReplaceFilterFactory group

2012-02-16 Thread O. Klein

steve_rowe wrote
 
 Hi O.,
 
 PatternReplaceFilter(Factory) uses Matcher.replaceAll() or replaceFirst(),
 both of which take in a string that can include any or all groups using
 the syntax $n, where n is the group number.  See the
 Matcher.appendReplacement() javadocs for an explanation of the
 functionality and syntax:
  <http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement%28java.lang.StringBuffer,%20java.lang.String%29>
 
 Steve
 
 -Original Message-
 From: O. Klein [mailto:klein@]
 Sent: Thursday, February 16, 2012 8:34 AM
 To: solr-user@.apache
 Subject: PatternReplaceFilterFactory group
 
 PatternReplaceFilterFactory has no option to select the group to replace.
 
 Is there a reason for this, or could this be a nice feature?
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-
 tp3750201p3750201.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 

Thanks. I should get it working then.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-tp3750201p3750650.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is it possible to run deltaimport command with out delta query?

2012-02-16 Thread Shawn Heisey

On 2/15/2012 11:26 PM, nagarjuna wrote:

hi all..
   i am new to solr .can any body explain me about the delta-import and
delta query and also i have the below questions
1. is it possible to run deltaimport without deltaQuery?
2. is it possible to write a delta query without having last_modified column
in database? if yes pls explain me


Assuming I understand what you're asking:

Define deltaImportQuery to be the same as query, then set deltaQuery to 
something that always returns some kind of value in the field you have 
designated as your primary key.  The data doesn't have to be relevant to 
anything at all, it just needs to return something for the primary key 
field.  Here's what I have in mine, my pk is did:


  deltaQuery=SELECT 1 AS did

If you wish, you can completely ignore lastModified and track your own 
information about what data is new, then pass parameters via the 
dataimport handler URL to be used in your queries.  This is what both my 
query and deltaImportQuery are set to:


SELECT * FROM ${dataimporter.request.dataView}
WHERE (
  (
  did &gt; ${dataimporter.request.minDid}
  AND did &lt;= ${dataimporter.request.maxDid}
  )
  ${dataimporter.request.extraWhere}
) AND (crc32(did) % ${dataimporter.request.numShards})
  IN (${dataimporter.request.modVal})
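
A delta run then just passes those values on the URL, for example (core name
and parameter values are illustrative):

  http://localhost:8983/solr/core0/dataimport?command=delta-import&clean=false&dataView=docView&minDid=1000000&maxDid=1100000&extraWhere=&numShards=7&modVal=3

Each ${dataimporter.request.xxx} placeholder is filled in from the matching
request parameter of the dataimport call.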

Thanks,
Shawn



Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
here is the log:


org.apache.solr.handler.dataimport.DataImporter doFullImport
Grave: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
a required attribute Processing Document # 1
 at
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:117)
 at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: start rollback
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: end_rollback
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter
doFullImport
Informazioni: Starting Full Import
feb 12, 2012 7:06:02 PM org.apache.solr.core.SolrCore execute
Informazioni: [] webapp=/solr path=/select
params={clean=falsecommit=truecommand=full-importqt=/dataimport}
status=0 QTime=16
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
Informazioni: Read dataimport.properties
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter
doFullImport
Grave: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
a required attribute Processing Document # 1
 at
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:117)
 at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
feb 12, 2012 7:06:02 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: start rollback
feb 12, 2012 7:06:02 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: end_rollback
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
Informazioni: Pausing ProtocolHandler [http-bio-8983]
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
Informazioni: Pausing ProtocolHandler [ajp-bio-8009]
feb 12, 2012 7:06:42 PM org.apache.catalina.core.StandardService
stopInternal
Informazioni: Stopping service Catalina
feb 12, 2012 7:06:42 PM org.apache.solr.core.SolrCore close
Informazioni: []  CLOSING SolrCore org.apache.solr.core.SolrCore@7d1217
feb 12, 2012 7:06:42 PM org.apache.solr.core.SolrCore closeSearcher
Informazioni: [] Closing main searcher on request.
feb 12, 2012 7:06:42 PM org.apache.solr.search.SolrIndexSearcher close
Informazioni: Closing Searcher@19fabda main
 
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=2,evictions=0,size=2,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
feb 12, 2012 7:06:42 PM org.apache.solr.update.DirectUpdateHandler2 close
Informazioni: closing
DirectUpdateHandler2{commits=0,autocommits=0,optimizes=0,rollbacks=4,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
feb 12, 2012 7:06:42 PM org.apache.solr.update.DirectUpdateHandler2 close
Informazioni: closed
DirectUpdateHandler2{commits=0,autocommits=0,optimizes=0,rollbacks=4,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol stop
Informazioni: Stopping 

Re: problem to indexing pdf directory

2012-02-16 Thread Gora Mohanty
On 16 February 2012 21:37, alessio crisantemi
alessio.crisant...@gmail.com wrote:
 here the log:


 org.apache.solr.handler.dataimport.DataImporter doFullImport
 Grave: Full Import failed
 org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
 a required attribute Processing Document # 1
[...]

The exception message above is pretty clear. You need to define a
baseDir attribute for the second entity.

However, even if you fix this, the setup will *not* work for indexing
PDFs. Did you read the URLs that I sent earlier?

Regards,
Gora


Re: Setting solrj server connection timeout

2012-02-16 Thread Shawn Heisey

On 2/3/2012 1:12 PM, Shawn Heisey wrote:
Is the following a reasonable approach to setting a connection timeout 
with SolrJ?


queryCore.getHttpClient().getHttpConnectionManager().getParams()
.setConnectionTimeout(15000);
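
For what it's worth, CommonsHttpSolrServer also exposes the same settings
directly, so each server object can get its own connect and read timeout; a
small sketch (the URL is a placeholder):

  CommonsHttpSolrServer server =
      new CommonsHttpSolrServer("http://localhost:8983/solr/core0"); // throws MalformedURLException
  server.setConnectionTimeout(15000); // ms allowed to establish the TCP connection
  server.setSoTimeout(60000);         // ms to wait for data once connected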

Right now I have all my solr server objects sharing a single 
HttpClient that gets created using the multithreaded connection 
manager, where I set the timeout for all of them.  Now I will be 
letting each server object create its own HttpClient object, and using 
the above statement to set the timeout on each one individually.  
It'll use up a bunch more memory, as there are 56 server objects, but 
maybe it'll work better.  The total of 56 objects comes about from 7 
shards, a build core and a live core per shard, two complete index 
chains, and for each of those, one server object for access to 
CoreAdmin and another for the index.


The impetus for this, as it's possible I'm stating an XY problem: 
Currently I have an occasional problem where SolrJ connections throw 
an exception.  When it happens, nothing is logged in Solr.  My code is 
smart enough to notice the problem, send an email alert, and simply 
try again at the top of the next minute.  The simple explanation is 
that this is a Linux networking problem, but I never had any problem 
like this when I was using Perl with LWP to keep my index up to date.  
I sent a message to the list some time ago on this exception, but I 
never got a response that helped me figure it out.


Caused by: org.apache.solr.client.solrj.SolrServerException: 
java.net.SocketException: Connection reset


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)


at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)


at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)

at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)

... 3 more

Caused by: java.net.SocketException: Connection reset

at java.net.SocketInputStream.read(SocketInputStream.java:168)

at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

at java.io.BufferedInputStream.read(BufferedInputStream.java:237)

at 
org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)


at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)

at 
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)


at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)


at 
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)


at 
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)


at 
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)


at 
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)


at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)


at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)


at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)


... 7 more


No response in quite some time, so I'm bringing it up again.  I brought 
up the Exception issue before, and though I did get some responses, I 
didn't feel that I got an answer.


http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201112.mbox/%3c4eeaf6e5.9030...@elyograg.org%3E

Thanks,
Shawn



Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
Yes, I read it. But I don't know the cause.
One more thing: I work on Windows, so I configured Tika and Solr manually
because I don't have Maven...

2012/2/16 Gora Mohanty g...@mimirtech.com

 On 16 February 2012 21:37, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  here the log:
 
 
  org.apache.solr.handler.dataimport.DataImporter doFullImport
  Grave: Full Import failed
  org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
 is
  a required attribute Processing Document # 1
 [...]

 The exception message above is pretty clear. You need to define a
 baseDir attribute for the second entity.

 However, even if you fix this, the setup will *not* work for indexing
 PDFs. Did you read the URLs that I sent earlier?

 Regards,
 Gora



Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread tamanjit.bin...@yahoo.co.in
There may be issues with your solrconfig. Kindly post the exception that you
are receiving.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3750937.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: is it possible to run deltaimport command with out delta query?

2012-02-16 Thread Dyer, James
There is a good example on how to do a delta update using
command=full-import&clean=false on the wiki, here:
http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta

This can be advantageous if you are updating a ton of data at once and do not 
want it executing as many queries to the database.  It also can be easier to 
maintain just 1 set of queries for both full and delta imports.
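
Following that FAQ entry, such a delta-as-full-import call looks like this
(the core name is illustrative):

  http://localhost:8983/solr/core0/dataimport?command=full-import&clean=false&commit=true

Whatever restriction the entity's query applies (last_index_time, request
parameters, ...) decides which rows get re-indexed, and clean=false keeps the
documents that are already in the index.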

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Thursday, February 16, 2012 10:04 AM
To: solr-user@lucene.apache.org
Subject: Re: is it possible to run deltaimport command with out delta query?

On 2/15/2012 11:26 PM, nagarjuna wrote:
 hi all..
i am new to solr .can any body explain me about the delta-import and
 delta query and also i have the below questions
 1.is it possible to run deltaimport without delataquery?
 2. is it possible to write a delta query without having last_modified column
 in database? if yes pls explain me

Assuming I understand what you're asking:

Define deltaImportQuery to be the same as query, then set deltaQuery to 
something that always returns some kind of value in the field you have 
designated as your primary key.  The data doesn't have to be relevant to 
anything at all, it just needs to return something for the primary key 
field.  Here's what I have in mine, my pk is did:

   deltaQuery=SELECT 1 AS did

If you wish, you can completely ignore lastModified and track your own 
information about what data is new, then pass parameters via the 
dataimport handler URL to be used in your queries.  This is what both my 
query and deltaImportQuery are set to:

 SELECT * FROM ${dataimporter.request.dataView}
 WHERE (
   (
  did &gt; ${dataimporter.request.minDid}
  AND did &lt;= ${dataimporter.request.maxDid}
   )
   ${dataimporter.request.extraWhere}
 ) AND (crc32(did) % ${dataimporter.request.numShards})
   IN (${dataimporter.request.modVal})

Thanks,
Shawn



Re: Best requestHandler for typing error.

2012-02-16 Thread tamanjit.bin...@yahoo.co.in
You can enable the spellcheck component and add it to your default request
handler.

This might be of use:
http://wiki.apache.org/solr/SpellCheckComponent

It can be used both for autosuggest and for 'did you mean' suggestions.
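
Once the component is wired into the request handler, the query side is just a
couple of extra parameters, e.g. (the misspelled term is only an example):

  .../select?q=documnet&spellcheck=true&spellcheck.count=5&spellcheck.collate=true

Run one request with spellcheck.build=true (or set buildOnCommit=true in
solrconfig.xml) so the spellcheck dictionary actually gets built.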

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-requestHandler-for-typing-error-tp3749576p3750995.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr edismax clarification

2012-02-16 Thread Indika Tantrigoda
Hi All,

I am using the edismax search handler and I have some issues with the
search results. As I understand it, if the defaultOperator is set to OR, the
search query The quick brown fox will implicitly be passed as
The OR quick OR brown OR fox. However, if I search for The quick brown fox,
I get fewer results than when explicitly adding the OR. Another issue is that
if I search for The quick brown fox, other documents that contain the word
fox are not in the search results.

Thanks.


copyField: multivalued field to joined singlevalue field

2012-02-16 Thread flyingeagle-de
Hello,

I want to copy all values from a multivalued field, joined together, into a
single-valued field.

Is there any way to do this using standard Solr features?

kind regards

--
View this message in context: 
http://lucene.472066.n3.nabble.com/copyField-multivalued-field-to-joined-singlevalue-field-tp3750857p3750857.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: copyField: multivalued field to joined singlevalue field

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 11:35 AM, flyingeagle-de
flyingeagle...@yahoo.de wrote:
 Hello,

 I want to copy all data from a multivalued field joined together in a single
 valued field.

 Is there any opportunity to do this by using solr-standards?

There is not currently, but it certainly makes sense.

Anyone know of an open issue for this yet?  If not, we should create one!

-Yonik
lucidimagination.com
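
Until something like that exists, one workaround is a small custom
UpdateRequestProcessor that does the join at index time. A rough sketch (the
field names "tags" and "tags_joined" and the Solr 3.x package layout are
assumptions):

  import java.io.IOException;
  import java.util.Collection;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  public class JoinFieldProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
        SolrQueryResponse rsp, UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.solrDoc;
          Collection<Object> values = doc.getFieldValues("tags"); // multivalued source
          if (values != null) {
            StringBuilder joined = new StringBuilder();
            for (Object v : values) {
              if (joined.length() > 0) joined.append(' ');
              joined.append(v);
            }
            doc.setField("tags_joined", joined.toString());       // single-valued target
          }
          super.processAdd(cmd);                                  // continue the chain
        }
      };
    }
  }

The factory is registered in an updateRequestProcessorChain in solrconfig.xml
and referenced from the update handler.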


Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
I am attempting to execute a query with the following parameters

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=index
rows=10

When doing this I get the following exception

null  java.lang.ArrayIndexOutOfBoundsException

request: http://hostname:8983/solr/select
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

if I play with some of the parameters the query works as expected, i.e.

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=index
rows=0

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=count
rows=10

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.sort=index
rows=10


I am running on an old snapshot of Solr, but will try this on a new
version relatively soon.  Unfortunately I can't duplicate locally so
I'm a bit baffled by the error.

All of the shards have the field which we are faceting on


Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

well, you must take into account that you are executing up to 8 queries
per request instead of one query per request.

I am not totally sure about the details of the implementation of the
max-function-query, but I guess it first iterates over the results of
the first max-query, afterwards over the results of the second max-query
and so on. This is a much higher complexity than in the case of a normal
query.

I would suggest you to optimize your request. I don't think that this
particular function query is matching *all* docs. Instead I think it
just matches those docs specified by your inner-query (although I might
be wrong about that).

What are you trying to achieve by your request?

Regards,
Em

Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
 Hello Em:
 
 The URL is quite large (w/ shards, ...), maybe it's best if I paste the
 relevant parts.
 
 Our q parameter is:
 
   
 q:_val_:\product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\,
 
 The subqueries q8, q7, q4 and q3 are regular queries, for example:
 
 q7:stopword_phrase:colomba~1 AND stopword_phrase:santa AND
 wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
 (stopword_phrase:las AND stopword_phrase:de)
 
 We've executed the subqueries q3-q8 independently and they're very fast,
 but when we introduce the function queries as described below, it all goes
 10X slower.
 
 Let me know if you need anything else.
 
 Thanks
 Carlos
 
 
 Carlos Gonzalez-Cadenas
 CEO, ExperienceOn - New generation search
 http://www.experienceon.com
 
 Mobile: +34 652 911 201
 Skype: carlosgonzalezcadenas
 LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
 
 
 On Thu, Feb 16, 2012 at 4:02 PM, Em mailformailingli...@yahoo.de wrote:
 
 Hello carlos,

 could you show us how your Solr-call looks like?

 Regards,
 Em

 Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
 Hello all:

 We'd like to score the matching documents using a combination of SOLR's
 IR
 score with another application-specific score that we store within the
 documents themselves (i.e. a float field containing the app-specific
 score). In particular, we'd like to calculate the final score doing some
 operations with both numbers (i.e product, sqrt, ...)

 According to what we know, there are two ways to do this in SOLR:

 A) Sort by function [1]: We've tested an expression like
 sort=product(score, query_score) in the SOLR query, where score is the
 common SOLR IR score and query_score is our own precalculated score, but
 it
 seems that SOLR can only do this with stored/indexed fields (and
 obviously
 score is not stored/indexed).

 B) Function queries: We've used _val_ and function queries like max, sqrt
 and query, and we've obtained the desired results from a functional point
 of view. However, our index is quite large (400M documents) and the
 performance degrades heavily, given that function queries are AFAIK
 matching all the documents.

 I have two questions:

 1) Apart from the two options I mentioned, is there any other (simple)
 way
 to achieve this that we're not aware of?

 2) If we have to choose the function queries path, would it be very
 difficult to modify the actual implementation so that it doesn't match
 all
 the documents, that is, to pass a query so that it only operates over the
 documents matching the query?. Looking at the FunctionQuery.java source
 code, there's a comment that says // instead of matching all docs, we
 could also embed a query. the score could either ignore the subscore, or
 boost it, which is giving us some hope that maybe it's possible and even
 desirable to go in this direction. If you can give us some directions
 about
 how to go about this, we may be able to do the actual implementation.

 BTW, we're using Lucene/SOLR trunk.

 Thanks a lot for your help.
 Carlos

 [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function


 


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
please ignore this, it has nothing to do with the faceting component.
I was able to disable a custom component that I had and it worked
perfectly fine.


On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote:
 I am attempting to execute a query with the following parameters

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=10

 When doing this I get the following exception

 null  java.lang.ArrayIndexOutOfBoundsException

 request: http://hostname:8983/solr/select
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at 
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

 if I play with some of the parameters the query works as expected, i.e.

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=0

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=count
 rows=10

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.sort=index
 rows=10


 I am running on an old snapshot of Solr, but will try this on a new
 version relatively soon.  Unfortunately I can't duplicate locally so
 I'm a bit baffled by the error.

 All of the shards have the field which we are faceting on


Re: Distributed Faceting Bug?

2012-02-16 Thread Em
Hi Jamie,

what version of Solr/SolrJ are you using?

Regards,
Em

Am 16.02.2012 18:42, schrieb Jamie Johnson:
 I am attempting to execute a query with the following parameters
 
 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=10
 
 When doing this I get the following exception
 
 null  java.lang.ArrayIndexOutOfBoundsException
 
 request: http://hostname:8983/solr/select
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
   at 
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
   at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
   at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 
 if I play with some of the parameters the query works as expected, i.e.
 
 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=0
 
 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=count
 rows=10
 
 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.sort=index
 rows=10
 
 
 I am running on an old snapshot of Solr, but will try this on a new
 version relatively soon.  Unfortunately I can't duplicate locally so
 I'm a bit baffled by the error.
 
 All of the shards have the field which we are faceting on
 


Re: Distributed Faceting Bug?

2012-02-16 Thread Em
Hi Jamie,

nice to hear.
Maybe you can share what kind of bug you ran into, so that other
developers with similarly buggy components can benefit from your
experience. :)

Regards,
Em

Am 16.02.2012 19:23, schrieb Jamie Johnson:
 please ignore this, it has nothing to do with the faceting component.
 I was able to disable a custom component that I had and it worked
 perfectly fine.
 
 
 On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote:
 I am attempting to execute a query with the following parameters

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=10

 When doing this I get the following exception

 null  java.lang.ArrayIndexOutOfBoundsException

 request: http://hostname:8983/solr/select
at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

 if I play with some of the parameters the query works as expected, i.e.

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=0

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=count
 rows=10

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.sort=index
 rows=10


 I am running on an old snapshot of Solr, but will try this on a new
 version relatively soon.  Unfortunately I can't duplicate locally so
 I'm a bit baffled by the error.

 All of the shards have the field which we are faceting on
 


Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Mikhail Khludnev
Chantal,

if you prefer Java, here is http://wiki.apache.org/solr/DIHCustomTransform
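
The contract on that page is just a class with a transformRow method, so a
whole row can be reshaped without listing every column as a field tag. A rough
sketch (the class name and the lowercasing rule are made up):

  import java.util.HashMap;
  import java.util.Map;

  public class LowercaseColumnsTransformer {
    public Object transformRow(Map<String, Object> row) {
      // iterate over a copy so the row map can be modified while looping
      for (Map.Entry<String, Object> e : new HashMap<String, Object>(row).entrySet()) {
        row.put(e.getKey().toLowerCase(), e.getValue());
      }
      return row;
    }
  }

The entity then declares it in its transformer attribute (full package name
unless the class sits in the dataimport package).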



On Thu, Feb 16, 2012 at 7:24 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 If your script turns out too complex to maintain, and you are developing
 in Java, anyway, you could extend EntityProcessor and handle the data in
 a custom way. I've done that to transform a datamart like data structure
 back into a row based one.

 Basically you override the method that gets the data in a Map and
 transform it into a different Map which contains the fields as
 understood by your schema.

 Chantal


 On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote:
  Hi Baranee,
 
  Some time ago I played with
  http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it
 was a
  pretty good stuff.
 
  Regards
 
 
  On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan 
 baraneethara...@hp.comwrote:
 
   To avoid that we don't want to mention the column names in the field
 tag ,
   but want to write a query to map all the fields in the table with solr
   fileds even if we don't know, how many columns are there in the table.
  I
   need a kind of loop which runs through all the query results and map
 that
   with solr fileds.
 
 
 
 




-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

Thanks for your answer.

Yes, we initially also thought that the excessive increase in response time
was caused by the several queries being executed, and we did another test.
We executed one of the subqueries that I've shown to you directly in the
q parameter and then we tested this same subquery (only this one, without
the others) with the function query query($q1) in the q parameter.

Theoretically the times for these two queries should be more or less the
same, but the second one is several times slower than the first one. After
this observation we learned more about function queries and we learned from
the code and from some comments in the forums [1] that the FunctionQueries
are expected to match all documents.

We have some more tests on that matter: now we're moving from issuing this
large query through the SOLR interface to creating our own QueryParser. The
initial tests we've done in our QParser (that internally creates multiple
queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
getting very good response times and high quality answers. But when we've
tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
QueryValueSource that wraps the DisMaxQuery), then the times move from
10-20 msec to 200-300msec.

Note that we're using early termination of queries (via a custom
collector), and therefore (as shown by the numbers I included above) even
if the query is very complex, we're getting very fast answers. The only
situation where the response time explodes is when we include a
FunctionQuery.

Re: your question of what we're trying to achieve ... We're implementing a
powerful query autocomplete system, and we use several fields to a) improve
performance on wildcard queries and b) have a very precise control over the
score.

Thanks a lot for your help,
Carlos

[1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
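
One more option that might be worth benchmarking here is the boost query
parser: it multiplies the score of the wrapped query by a function value and
only scores documents that actually match the wrapped query, so it avoids the
match-all behaviour of a bare function query. A sketch (parameter names are
illustrative):

  q={!boost b=query_score v=$qq}&qq=stopword_phrase:santa AND wildcard_stopword_phrase:car^0.7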

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 7:09 PM, Em mailformailingli...@yahoo.de wrote:

 Hello Carlos,

 well, you must take into account that you are executing up to 8 queries
 per request instead of one query per request.

 I am not totally sure about the details of the implementation of the
 max-function-query, but I guess it first iterates over the results of
 the first max-query, afterwards over the results of the second max-query
 and so on. This is a much higher complexity than in the case of a normal
 query.

 I would suggest you to optimize your request. I don't think that this
 particular function query is matching *all* docs. Instead I think it
 just matches those docs specified by your inner-query (although I might
 be wrong about that).

 What are you trying to achieve by your request?

 Regards,
 Em

 Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
  Hello Em:
 
  The URL is quite large (w/ shards, ...), maybe it's best if I paste the
  relevant parts.
 
  Our q parameter is:
 
 
 q:_val_:\product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\,
 
  The subqueries q8, q7, q4 and q3 are regular queries, for example:
 
  q7:stopword_phrase:colomba~1 AND stopword_phrase:santa AND
  wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
  (stopword_phrase:las AND stopword_phrase:de)
 
  We've executed the subqueries q3-q8 independently and they're very fast,
  but when we introduce the function queries as described below, it all
 goes
  10X slower.
 
  Let me know if you need anything else.
 
  Thanks
  Carlos
 
 
  Carlos Gonzalez-Cadenas
  CEO, ExperienceOn - New generation search
  http://www.experienceon.com
 
  Mobile: +34 652 911 201
  Skype: carlosgonzalezcadenas
  LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
 
 
  On Thu, Feb 16, 2012 at 4:02 PM, Em mailformailingli...@yahoo.de
 wrote:
 
  Hello carlos,
 
  could you show us how your Solr-call looks like?
 
  Regards,
  Em
 
  Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
  Hello all:
 
  We'd like to score the matching documents using a combination of SOLR's
  IR
  score with another application-specific score that we store within the
  documents themselves (i.e. a float field containing the app-specific
  score). In particular, we'd like to calculate the final score doing
 some
  operations with both numbers (i.e product, sqrt, ...)
 
  According to what we know, there are two ways to do this in SOLR:
 
  A) Sort by function [1]: We've tested an expression like
  sort=product(score, query_score) in the SOLR query, where score is
 the
  common SOLR IR score and query_score is our own precalculated score,
 but
  it
  seems that SOLR can only do this with stored/indexed fields (and
  obviously
  score is not stored/indexed).
 
  B) Function queries: We've used _val_ and function queries like max,
 

Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

 We have some more tests on that matter: now we're moving from issuing this
 large query through the SOLR interface to creating our own
QueryParser. The
 initial tests we've done in our QParser (that internally creates multiple
 queries and inserts them inside a DisjunctionMaxQuery) are very good,
we're
 getting very good response times and high quality answers. But when we've
 tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
 QueryValueSource that wraps the DisMaxQuery), then the times move from
 10-20 msec to 200-300msec.
I reviewed the sourcecode and yes, the FunctionQuery iterates over the
whole index, however... let's see!

In relation to the DisMaxQuery you create within your parser: What kind
of clause is the FunctionQuery and what kind of clause are your other
queries (MUST, SHOULD, MUST_NOT...)?

*I* would expect that with a shrinking set of matching documents to the
overall-query, the function query only checks those documents that are
guaranteed to be within the result set.

 Note that we're using early termination of queries (via a custom
 collector), and therefore (as shown by the numbers I included above) even
 if the query is very complex, we're getting very fast answers. The only
 situation where the response time explodes is when we include a
 FunctionQuery.
Could you give us some details about how/where did you plugin the
Collector, please?

Kind regards,
Em

Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas:
 Hello Em:
 
 Thanks for your answer.
 
 Yes, we initially also thought that the excessive increase in response time
 was caused by the several queries being executed, and we did another test.
 We executed one of the subqueries that I've shown to you directly in the
 q parameter and then we tested this same subquery (only this one, without
 the others) with the function query query($q1) in the q parameter.
 
 Theoretically the times for these two queries should be more or less the
 same, but the second one is several times slower than the first one. After
 this observation we learned more about function queries and we learned from
 the code and from some comments in the forums [1] that the FunctionQueries
 are expected to match all documents.
 
 We have some more tests on that matter: now we're moving from issuing this
 large query through the SOLR interface to creating our own QueryParser. The
 initial tests we've done in our QParser (that internally creates multiple
 queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
 getting very good response times and high quality answers. But when we've
 tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
 QueryValueSource that wraps the DisMaxQuery), then the times move from
 10-20 msec to 200-300msec.
 
 Note that we're using early termination of queries (via a custom
 collector), and therefore (as shown by the numbers I included above) even
 if the query is very complex, we're getting very fast answers. The only
 situation where the response time explodes is when we include a
 FunctionQuery.
 
 Re: your question of what we're trying to achieve ... We're implementing a
 powerful query autocomplete system, and we use several fields to a) improve
 performance on wildcard queries and b) have a very precise control over the
 score.
 
 Thanks a lot for your help,
 Carlos
 
 [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
 
 Carlos Gonzalez-Cadenas
 CEO, ExperienceOn - New generation search
 http://www.experienceon.com
 
 Mobile: +34 652 911 201
 Skype: carlosgonzalezcadenas
 LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
 
 
 On Thu, Feb 16, 2012 at 7:09 PM, Em mailformailingli...@yahoo.de wrote:
 
 Hello Carlos,

 well, you must take into account that you are executing up to 8 queries
 per request instead of one query per request.

 I am not totally sure about the details of the implementation of the
 max-function-query, but I guess it first iterates over the results of
 the first max-query, afterwards over the results of the second max-query
 and so on. This is a much higher complexity than in the case of a normal
 query.

 I would suggest you to optimize your request. I don't think that this
 particular function query is matching *all* docs. Instead I think it
 just matches those docs specified by your inner-query (although I might
 be wrong about that).

 What are you trying to achieve by your request?

 Regards,
 Em

 Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
 Hello Em:

 The URL is quite large (w/ shards, ...), maybe it's best if I paste the
 relevant parts.

 Our q parameter is:


  q: _val_:"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))"

 The subqueries q8, q7, q4 and q3 are regular queries, for example:

 q7:stopword_phrase:colomba~1 AND stopword_phrase:santa AND
 wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
 (stopword_phrase:las AND 

Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

1) Here's a printout of an example DisMax query (as you can see mostly MUST
terms except for some SHOULD terms used for boosting scores for stopwords)

((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
 | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
 | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)
 | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))

2) The collector is inserted in the SolrIndexSearcher (replacing the
TimeLimitingCollector). We trigger it through the SOLR interface by passing
the timeAllowed parameter. We know this is a hack but AFAIK there's no
out-of-the-box way to specify custom collectors by now (
https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
part works perfectly as of now, so clearly this is not the problem.

3) Re: your sentence:

"*I* would expect that with a shrinking set of matching documents to
the overall-query, the function query only checks those documents that are
guaranteed to be within the result set."

Yes, I agree with this, but this snippet of code in FunctionQuery.java
seems to say otherwise:

// instead of matching all docs, we could also embed a query.
// the score could either ignore the subscore, or boost it.
// Containment:  floatline(foo:myTerm, myFloatField, 1.0, 0.0f)
// Boost:foo:myTerm^floatline(myFloatField,1.0,0.0f)
@Override
public int nextDoc() throws IOException {
  for(;;) {
++doc;
    if (doc>=maxDoc) {
  return doc=NO_MORE_DOCS;
}
    if (acceptDocs != null && !acceptDocs.get(doc)) continue;
return doc;
  }
}

It seems that the author also thought of maybe embedding a query in order
to restrict matches, but this doesn't seem to be in place as of now (or
maybe I'm not understanding how the whole thing works :) ).
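
Coming back to point 2): the collector itself is roughly like the following
(untested) sketch, written against the 3.x-style Collector API (on trunk
setNextReader takes an AtomicReaderContext instead), with made-up names; our
real code differs in details:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Hypothetical delegating collector: stop collecting after maxHits hits by
// throwing an unchecked exception that the caller catches around search(),
// similar in spirit to TimeLimitingCollector's TimeExceededException.
public class EarlyTerminatingCollector extends Collector {

  public static class EarlyTerminationException extends RuntimeException {}

  private final Collector delegate;
  private final int maxHits;
  private int collected = 0;

  public EarlyTerminatingCollector(Collector delegate, int maxHits) {
    this.delegate = delegate;
    this.maxHits = maxHits;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    delegate.setScorer(scorer);
  }

  @Override
  public void collect(int doc) throws IOException {
    delegate.collect(doc); // keep the hit in the wrapped collector
    if (++collected >= maxHits) {
      throw new EarlyTerminationException(); // partial results are kept
    }
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    delegate.setNextReader(reader, docBase);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return delegate.acceptsDocsOutOfOrder();
  }
}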

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 8:09 PM, Em mailformailingli...@yahoo.de wrote:

 Hello Carlos,

  We have some more tests on that matter: now we're moving from issuing
 this
  large query through the SOLR interface to creating our own
 QueryParser. The
  initial tests we've done in our QParser (that internally creates multiple
  queries and inserts them inside a DisjunctionMaxQuery) are very good,
 we're
  getting very good response times and high quality answers. But when we've
  tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
  QueryValueSource that wraps the DisMaxQuery), then the times move from
  10-20 msec to 200-300msec.
 I reviewed the sourcecode and yes, the FunctionQuery iterates over the
 whole index, however... let's see!

 In relation to the DisMaxQuery you create within your parser: What kind
 of clause is the FunctionQuery and what kind of clause are your other
 queries (MUST, SHOULD, MUST_NOT...)?

 *I* would expect that with a shrinking set of matching documents to the
 overall-query, the function query only checks those documents that are
 guaranteed to be within the result set.

  Note that we're using early termination of queries (via a custom
  collector), and therefore (as shown by the numbers I included above) even
  if the query is very complex, we're getting very fast answers. The only
  situation where the response time explodes is when we include a
  FunctionQuery.
 Could you give us some details about how/where did you plugin the
 Collector, please?

 Kind regards,
 Em

 Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas:
  Hello Em:
 
  Thanks for your answer.
 
  Yes, we initially also thought that the excessive increase in response
 time
  was caused by the several queries being executed, and we did another
 test.
  We executed one of the subqueries that I've shown to you directly in the
  q parameter and then we tested this same subquery (only this one,
 without
  the others) with the function query query($q1) in the q parameter.
 
  Theoretically the times for these two queries should be more or less the
  same, but the second one is several times slower than the first one.
 After
  this observation we learned more about function queries and we learned
 from
  the code and from some comments in the forums [1] that the
 FunctionQueries
  are expected to match all documents.
 
  We 

Re: copyField: multivalued field to joined singlevalue field

2012-02-16 Thread Chris Hostetter

:  I want to copy all data from a multivalued field joined together in a single
:  valued field.
: 
:  Is there any opportunity to do this by using solr-standards?
: 
: There is not currently, but it certainly makes sense.

Part of it has just recently been commited to trunk actually...

https://issues.apache.org/jira/browse/SOLR-2802

https://builds.apache.org/job/Solr-trunk/javadoc/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html

...with that, it's easy to say anytime multiple values are found for a 
single valued string field, join them together with a comma.

the only piece that's missing is to copy from a source field in an 
(earlier) UpdateProcessor.  

There's a patch for this in SOLR-2599 but I haven't had a chance to look at 
it yet.




-Hoss


Re: Specify a cores roles through core add command

2012-02-16 Thread Mark Miller
https://issues.apache.org/jira/browse/SOLR-3138

On Feb 9, 2012, at 4:16 PM, Jamie Johnson wrote:

 per SOLR-2765 we can add roles to specific cores such that it's
 possible to give custom roles to solr instances, is it possible to
 specify this when adding a core through curl
 'http://host:port/solr/admin/cores...'?
 
 
 https://issues.apache.org/jira/browse/SOLR-2765

- Mark Miller
lucidimagination.com



Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?

2012-02-16 Thread Alexey Verkhovsky
On Thu, Feb 16, 2012 at 3:37 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Everybody start from daily bounce, but end up with UPDATED_AT column and
 delta updates , just consider urgent content fix usecase. Don't think it's
 worth to rely on daily bounce as a cornerstone of architecture.


I'd be happy to avoid it, for all the obvious reasons.

I do know that performance of this type of service tends to be not that
great (as in 700 to 5000 msec), and there should be ways to do it several
times faster than this.


 you can use grid of coordinates to reduce their entropy


I don't understand this statement. Can you elaborate, please?

Since my bounding boxes are small, one [premature optimization] idea could
be to divide Earth into 2x2 degree overlapping tiles at 1 degree step in
both directions (such that any bounding box fits within at least one of
them, and any location belongs to 4 of them), then use tileId=X as a cached
filter and geofilt as a post-filter. Is that along the lines of what you
are talking about?
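
For concreteness, a rough (untested) sketch of what I mean, with made-up field
names tile_id and loc, using SolrJ:

import org.apache.solr.client.solrj.SolrQuery;

// Tiles are 2x2 degrees, laid out on a 1-degree grid, so every point falls
// into 4 tiles and any small bounding box fits entirely inside at least one.
public class TileQueryExample {

  // Lower-left corner of the tile that keeps the query point at least half a
  // degree away from the tile edges in both directions.
  static int tileCorner(double coord) {
    double floor = Math.floor(coord);
    return (int) (coord - floor >= 0.5 ? floor : floor - 1);
  }

  public static SolrQuery build(String userQuery, double lat, double lon, double dKm) {
    SolrQuery q = new SolrQuery(userQuery);
    // Cheap, highly reusable filter: only ~360*180 distinct tile ids exist,
    // so the filterCache hit ratio should be decent.
    q.addFilterQuery("tile_id:" + tileCorner(lon) + "_" + tileCorner(lat));
    // Exact distance check over the survivors; cache=false/cost=200 requests
    // post-filtering as described in the blog post above, assuming the
    // geofilt implementation supports it in this version.
    q.addFilterQuery("{!geofilt cache=false cost=200 sfield=loc pt=" + lat + ","
        + lon + " d=" + dKm + "}");
    return q;
  }
}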


 http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/
  Lucene internals, caching of filters probably doesn't make sense either.
  from what little I understand about
 But solr does it http://wiki.apache.org/solr/SolrCaching#filterCache


I didn't realize that multiple fq's (filter queries) in the same query were applied in
parallel as set intersections. In that case, the non-geography filters
should be cached (and added to the prewarming routine, I guess) even when
they are usually far less specific than the bounding box. Makes sense.


  1. Search server is an internal service that uses embedded Solr for the
  indexing part. RAMDirectoryFactory as index storage.
 Bad idea. It's purposed mostly for tests, the closest purposed for
 production analogue is
 org.apache.lucene.store.instantiated.InstantiatedIndex

...

 AFAIK the state of art is use file directory (MMAP or whatever), rely on
 Linux file system RAM cache.


OK, I may as well start the spike from this angle, too. By the way, this is
precisely the kind of advice I was hoping for. Thanks a lot.

 5. All Solr caching is switched off.

 But why?


Because (a) I shouldn't need to cache documents, if they are all in memory
anyway; (b) query caching will have abysmal hit/miss because of the spatial
component; and (c) I misunderstood how query filters work. So, now I'm
thinking a FastLFU query filter cache for non-geo filters.


 Btw, if you need multivalue geofield pls vote for SOLR-2155

Our data has one lon/lat pair per entity... so no, I don't need it. Or at
least haven't figured out that I do yet. :)

-- 
Alexey Verkhovsky
http://alex-verkhovsky.blogspot.com/
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
still digging ;)  Once I figure it out I'll be happy to share.

On Thu, Feb 16, 2012 at 1:32 PM, Em mailformailingli...@yahoo.de wrote:
 Hi Jamie,

 nice to hear.
 Maybe you can share in what kind of bug you ran, so that other
 developers with similar bugish components can benefit from your
 experience. :)

 Regards,
 Em

 Am 16.02.2012 19:23, schrieb Jamie Johnson:
 please ignore this, it has nothing to do with the faceting component.
 I was able to disable a custom component that I had and it worked
 perfectly fine.


 On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote:
 I am attempting to execute a query with the following parameters

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=10

 When doing this I get the following exception

 null  java.lang.ArrayIndexOutOfBoundsException

 request: http://hostname:8983/solr/select
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at 
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

 if I play with some of the parameters the query works as expected, i.e.

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=0

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=count
 rows=10

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.sort=index
 rows=10


 I am running on an old snapshot of Solr, but will try this on a new
 version relatively soon.  Unfortunately I can't duplicate locally so
 I'm a bit baffled by the error.

 All of the shards have the field which we are faceting on



Re: SolrCloud - issues running with embedded zookeeper ensemble

2012-02-16 Thread arin g
I have the same problem. It seems that there is a bug in the SolrZkServer class
(parseProperties method) that doesn't work well when you have an external
zookeeper ensemble.

Thanks,
 arin

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-issues-running-with-embedded-zookeeper-ensemble-tp3694004p3751629.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 3:03 PM, Alexey Verkhovsky
alexey.verkhov...@gmail.com wrote:
 5. All Solr caching is switched off.

 But why?


 Because (a) I shouldn't need to cache documents, if they are all in memory
 anyway;

Your're making many assumptions about how Solr works internally.

One example of many:
  Solr streams documents (requests the stored fields right before they
are written to the response stream) to support returning any number of
documents.
If you highlight documents, the stored fields need to be retrieved
first.  When streaming those same documents later, Solr will retrieve
the stored fields again - reying on the fact that they should be
cached by the document cache since they were just used.

There are tons of examples of how things are architected to take
advantage of the caches - it pretty much never makes sense to outright
disable them.  If they take up too much memory, then just reduce the
size.

-Yonik
lucidimagination.com


Re: custom scoring

2012-02-16 Thread Chris Hostetter

: We'd like to score the matching documents using a combination of SOLR's IR
: score with another application-specific score that we store within the
: documents themselves (i.e. a float field containing the app-specific
: score). In particular, we'd like to calculate the final score doing some
: operations with both numbers (i.e product, sqrt, ...)

let's back up a minute.

if your ultimate goal is to have the final score of all documents be a 
simple multiplication of an indexed field (query_score) against the 
score of your base query, that's fairely trivial use of the 
BoostQParser...

q={!boost b=query_score}your base query

...or to split it out using param dereferencing...

q={!boost b=query_score v=$qq}
qq=your base query

: A) Sort by function [1]: We've tested an expression like
: sort=product(score, query_score) in the SOLR query, where score is the
: common SOLR IR score and query_score is our own precalculated score, but it
: seems that SOLR can only do this with stored/indexed fields (and obviously
: score is not stored/indexed).

you could do this by replacing score with the query whose score you 
want, which could be a ref back to $q -- but that's really only needed 
if you want the scores returned for each document to be different than the 
value used for sorting (ie: score comes from solr, sort value includes your 
query_score and the score from the main query -- or some completely diff 
query)

based on what you've said, you don't need that and it would be 
unnecessary overhead.

: B) Function queries: We've used _val_ and function queries like max, sqrt
: and query, and we've obtained the desired results from a functional point
: of view. However, our index is quite large (400M documents) and the
: performance degrades heavily, given that function queries are AFAIK
: matching all the documents.

based on the examples you've given in your subsequent queries, it's not 
hard to see why...

 q:_val_:\product(query_score,max(query($q8),max(query($q7),

wrapping queries in functions in queries can have that effect, because 
functions ultimately match all documents -- even when that function wraps 
a query -- so your outermost query is still scoring every document in the 
index.

you want to do as much pruning with the query as possible, and only 
multiply by your boost function on matching docs, hence the 
purpose of the BoostQParser.
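
As an (untested) sketch of that request from SolrJ, assuming the precomputed
field really is named query_score (note the boost function goes in the "b"
local param):

import org.apache.solr.client.solrj.SolrQuery;

public class BoostExample {
  public static SolrQuery build() {
    SolrQuery q = new SolrQuery();
    // final score = score(qq) * query_score, computed only for docs matching qq
    q.setQuery("{!boost b=query_score v=$qq}");
    q.set("qq", "+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en");
    return q;
  }
}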

-Hoss


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
The issue appears to be that I put an empty array into the doc scores
instead of null in DocSlice.  DocSlice then just checks if scores is
null when hasScore is called which caused a further issue down the
line.  I'll follow up with anything else that I find along the way.

On Thu, Feb 16, 2012 at 3:05 PM, Jamie Johnson jej2...@gmail.com wrote:
 still digging ;)  Once I figure it out I'll be happy to share.

 On Thu, Feb 16, 2012 at 1:32 PM, Em mailformailingli...@yahoo.de wrote:
 Hi Jamie,

 nice to hear.
 Maybe you can share in what kind of bug you ran, so that other
 developers with similar bugish components can benefit from your
 experience. :)

 Regards,
 Em

 Am 16.02.2012 19:23, schrieb Jamie Johnson:
 please ignore this, it has nothing to do with the faceting component.
 I was able to disable a custom component that I had and it worked
 perfectly fine.


 On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote:
 I am attempting to execute a query with the following parameters

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=10

 When doing this I get the following exception

 null  java.lang.ArrayIndexOutOfBoundsException

 request: http://hostname:8983/solr/select
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at 
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
        at 
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at 
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

 if I play with some of the parameters the query works as expected, i.e.

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=0

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=count
 rows=10

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.sort=index
 rows=10


 I am running on an old snapshot of Solr, but will try this on a new
 version relatively soon.  Unfortunately I can't duplicate locally so
 I'm a bit baffled by the error.

 All of the shards have the field which we are faceting on



Re: custom scoring

2012-02-16 Thread Robert Muir
On Thu, Feb 16, 2012 at 8:34 AM, Carlos Gonzalez-Cadenas
c...@experienceon.com wrote:
 Hello all:

 We'd like to score the matching documents using a combination of SOLR's IR
 score with another application-specific score that we store within the
 documents themselves (i.e. a float field containing the app-specific
 score). In particular, we'd like to calculate the final score doing some
 operations with both numbers (i.e product, sqrt, ...)
...

 1) Apart from the two options I mentioned, is there any other (simple) way
 to achieve this that we're not aware of?


In general there is always a third option, that may or may not fit,
depending really upon how you are trying to model relevance and how
you want to integrate with scoring, and thats to tie in your factors
directly into Similarity (lucene's term weighting api). For example,
some people use index-time boosting, but in lucene index-time boost
really just means 'make the document appear shorter'. You might for
example, have other boosts that modify term-frequency before
normalization, or however you want to do it. Similarity is pluggable
into Solr via schema.xml.

Since you are using trunk, this is a lot more flexible than previous
releases, e.g. you can access things from FieldCache, DocValues, or
even your own rapidly-changing float[] or whatever you want :) There
are also a lot more predefined models than just the vector space model
to work with if you find you can easily imagine your notion of
relevance in terms of an existing model.

-- 
lucidimagination.com


Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?

2012-02-16 Thread Alexey Verkhovsky
On Thu, Feb 16, 2012 at 1:32 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 Your're making many assumptions about how Solr works internally.


True that. If this spike turns into a project, digging through the source
code will come. Meantime, we have to start somewhere, and the default
configuration may not be the greatest starting point for this problem.

We don't need highlighting, and only need ids, scores and total number of
results out of Solr. Presentation of selected entities will have to include
some write-heavy data (from RDBMS and/or memcached), therefore won't be
Solr's business anyway.

From what you said, I guess it won't hurt to give it a small document
cache, just big enough to prevent streaming the same document twice within
the same query. Still don't have a reason to have a query cache - because
of lon/lat coming from the mobile devices, there are virtually no repeated
queries in our production logs. Or am I making a bad assumption here, too?

-- 
Alexey Verkhovsky
http://alex-verkhovsky.blogspot.com/
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]


Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 4:06 PM, Alexey Verkhovsky
alexey.verkhov...@gmail.com wrote:
 only need ids, scores and total number of results out of Solr. Presentation of
 selected entities will have to include some write-heavy data (from RDBMS
 and/or memcached), therefore won't be Solr's business anyway.

It depends on if you're going to be doing distributed search - there
may be some scenarios there where it's used, but in general the query
cache is the least useful.
The filterCache is useful in a ton of ways if you're doing faceting too.

-Yonik
lucidimagination.com


Re: SolrCloud - issues running with embedded zookeeper ensemble

2012-02-16 Thread Mark Miller

On Feb 16, 2012, at 2:53 PM, arin g wrote:

 i have the same problem, it seems that there is a bug in SolrZkServer class
 (parseProperties method), that doesn't work well when you have an external
 zookeeper ensemble.
 

This issue was around using an embedded ensemble - an external ensemble makes 
SolrZkServer irrelevant.

What issue are you having? I just tried a basic test against an external 
ensemble.

What version are you using?

 Thanks,
 arin
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SolrCloud-issues-running-with-embedded-zookeeper-ensemble-tp3694004p3751629.html
 Sent from the Solr - User mailing list archive at Nabble.com.

- Mark Miller
lucidimagination.com



Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

I think we missunderstood eachother.

As an example:
BooleanQuery (
  clauses: (
 MustMatch(
   DisjunctionMaxQuery(
   TermQuery(stopword_field, barcelona),
   TermQuery(stopword_field, hoteles)
   )
 ),
 ShouldMatch(
  FunctionQuery(
*please insert your function here*
 )
 )
  )
)

Explanation:
You construct an artificial BooleanQuery which wraps your user's query
as well as your function query.
Your user's query - in that case - is just a DisjunctionMaxQuery
consisting of two TermQueries.
In the real world you might construct another BooleanQuery around your
DisjunctionMaxQuery in order to have more flexibility.
However the interesting part of the given example is, that we specify
the user's query as a MustMatch-condition of the BooleanQuery and the
FunctionQuery just as a ShouldMatch.
Constructed that way, I am expecting the FunctionQuery only scores those
documents which fit the MustMatch-Condition.
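
In Lucene code the shape would be roughly the following (untested sketch;
3.x package names, the function query classes moved in trunk):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.search.function.FunctionQuery;
import org.apache.solr.search.function.QueryValueSource;

public class WrappedFunctionQueryExample {

  // The user query is a MUST clause, the function query only a SHOULD clause,
  // so it can only add to the score of documents that already match.
  public static Query build(Query scoreQuery) {
    DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
    userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
    userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));

    // Plug your real function (e.g. product(query_score, ...)) in here; a
    // plain QueryValueSource is used just for brevity.
    FunctionQuery boost = new FunctionQuery(new QueryValueSource(scoreQuery, 0.0f));

    BooleanQuery bq = new BooleanQuery();
    bq.add(userQuery, Occur.MUST);
    bq.add(boost, Occur.SHOULD);
    return bq;
  }
}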

I conclude that from the fact that the FunctionQuery-class also has a
skipTo-method and I would expect that the scorer will use it to score
only matching documents (however I did not search where and how it might
get called).

If my conclusion is wrong then hopefully Robert Muir (as far as I can
see the author of that class) can tell us what was the intention by
constructing an every-time-match-all-function-query.

Can you validate whether your QueryParser constructs a query in the form
I drew above?

Regards,
Em

Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:
 Hello Em:
 
 1) Here's a printout of an example DisMax query (as you can see mostly MUST
 terms except for some SHOULD terms used for boosting scores for stopwords)
 *
 *
 *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
 stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
 +stopword_phrase:barcelona
 stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short
 ened_phrase:barcelona stopword_shortened_phrase:en) | 
 (+stopword_phrase:hoteles
 +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor
 tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
 stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw
 ord_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles
 +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
 stopword_phrase:en))*
 *
 *
 2)* *The collector is inserted in the SolrIndexSearcher (replacing the
 TimeLimitingCollector). We trigger it through the SOLR interface by passing
 the timeAllowed parameter. We know this is a hack but AFAIK there's no
 out-of-the-box way to specify custom collectors by now (
 https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
 part works perfectly as of now, so clearly this is not the problem.
 
 3) Re: your sentence:
 *
 *
 **I* would expect that with a shrinking set of matching documents to
 the overall-query, the function query only checks those documents that are
 guaranteed to be within the result set.*
 *
 *
 Yes, I agree with this, but this snippet of code in FunctionQuery.java
 seems to say otherwise:
 
 // instead of matching all docs, we could also embed a query.
 // the score could either ignore the subscore, or boost it.
 // Containment:  floatline(foo:myTerm, myFloatField, 1.0, 0.0f)
 // Boost:foo:myTerm^floatline(myFloatField,1.0,0.0f)
 @Override
 public int nextDoc() throws IOException {
   for(;;) {
 ++doc;
  if (doc>=maxDoc) {
   return doc=NO_MORE_DOCS;
 }
  if (acceptDocs != null && !acceptDocs.get(doc)) continue;
 return doc;
   }
 }
 
 It seems that the author also thought of maybe embedding a query in order
 to restrict matches, but this doesn't seem to be in place as of now (or
 maybe I'm not understanding how the whole thing works :) ).
 
 Thanks
 Carlos
 *
 *
 
 Carlos Gonzalez-Cadenas
 CEO, ExperienceOn - New generation search
 http://www.experienceon.com
 
 Mobile: +34 652 911 201
 Skype: carlosgonzalezcadenas
 LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
 
 
 On Thu, Feb 16, 2012 at 8:09 PM, Em mailformailingli...@yahoo.de wrote:
 
 Hello Carlos,

 We have some more tests on that matter: now we're moving from issuing
 this
 large query through the SOLR interface to creating our own
 QueryParser. The
 initial tests we've done in our QParser (that internally creates multiple
 queries and inserts them inside a DisjunctionMaxQuery) are very good,
 we're
 getting very good response times and high quality answers. But when we've
 tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
 QueryValueSource that wraps the DisMaxQuery), then the times move from
 10-20 msec to 200-300msec.
 I reviewed the 

files left open?

2012-02-16 Thread Paulo Magalhaes
Hi all,

I was loading a big (60 million docs) csv in solr 4 when something odd
happened.
I got a solr error in the log saying that it could not write the file.
du -s indicated I had used 30GB of the 50GB available, but df -k indicated
that the disk was 100% used.
du and df giving different results could be an indication that there are
file descriptors left open.
After a solr bounce, df -k came down and agreed with du.
Has anyone seen anything like that?

Thanks,
Paulo.

environment;
Linux 2.6.18-238.19.1.el5.centos.
plus #1 SMP Mon Jul 18 10:05:09 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
Java(TM) SE Runtime Environment (build 1.6.0_17-b04) Java HotSpot(TM)
64-Bit Server VM (build 14.3-b01, mixed mode)
apache-solr-4.0-2012-02-10_09-58-50
solr config is the one in the distribution package. i had my own schema.


Re: custom scoring

2012-02-16 Thread Em
I just modified some TestCases a little bit to see how the FunctionQuery
behaves.

Given that you got an index containing 14 docs, where 13 of them
contain the term batman and two contain the term superman, a
search for

q=+text:superman _val_:query($qq)&qq=text:superman

Leads to two hits and the FunctionQuery has two iterations.

If you remove that little plus-symbol before text:superman, it
wouldn't be a mustMatch-condition anymore and the whole query results
into 14 hits (default-operator is OR):

q=text:superman _val_:query($qq)&qq=text:superman

If both queries, the TermQuery and the FunctionQuery must match, it
would also result into two hits:

q=text:superman AND _val_:query($qq)&qq=text:superman

There is some behaviour that I currently don't understand (if 14 docs
match, the FunctionQuery's AllScorer iterates twice over the
0th and the 1st doc, and the reason for that seems to be the construction
of two AllScorers), but as far as I can see the performance of your
queries *should* increase if you construct your query as I explained in
my last email.

Kind regards,
Em

Am 16.02.2012 23:43, schrieb Em:
 Hello Carlos,
 
 I think we missunderstood eachother.
 
 As an example:
 BooleanQuery (
   clauses: (
  MustMatch(
DisjunctionMaxQuery(
TermQuery(stopword_field, barcelona),
TermQuery(stopword_field, hoteles)
)
  ),
  ShouldMatch(
   FunctionQuery(
 *please insert your function here*
  )
  )
   )
 )
 
 Explanation:
 You construct an artificial BooleanQuery which wraps your user's query
 as well as your function query.
 Your user's query - in that case - is just a DisjunctionMaxQuery
 consisting of two TermQueries.
 In the real world you might construct another BooleanQuery around your
 DisjunctionMaxQuery in order to have more flexibility.
 However the interesting part of the given example is, that we specify
 the user's query as a MustMatch-condition of the BooleanQuery and the
 FunctionQuery just as a ShouldMatch.
 Constructed that way, I am expecting the FunctionQuery only scores those
 documents which fit the MustMatch-Condition.
 
 I conclude that from the fact that the FunctionQuery-class also has a
 skipTo-method and I would expect that the scorer will use it to score
 only matching documents (however I did not search where and how it might
 get called).
 
 If my conclusion is wrong than hopefully Robert Muir (as far as I can
 see the author of that class) can tell us what was the intention by
 constructing an every-time-match-all-function-query.
 
 Can you validate whether your QueryParser constructs a query in the form
 I drew above?
 
 Regards,
 Em
 
 Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:
 Hello Em:

 1) Here's a printout of an example DisMax query (as you can see mostly MUST
 terms except for some SHOULD terms used for boosting scores for stopwords)
 *
 *
 *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
 stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
 +stopword_phrase:barcelona
 stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short
 ened_phrase:barcelona stopword_shortened_phrase:en) | 
 (+stopword_phrase:hoteles
 +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor
 tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
 stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw
 ord_phrase:barcelona stopword_phrase:en) | 
 (+stopword_shortened_phrase:hoteles
 +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
 stopword_phrase:en))*
 *
 *
 2)* *The collector is inserted in the SolrIndexSearcher (replacing the
 TimeLimitingCollector). We trigger it through the SOLR interface by passing
 the timeAllowed parameter. We know this is a hack but AFAIK there's no
 out-of-the-box way to specify custom collectors by now (
 https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
 part works perfectly as of now, so clearly this is not the problem.

 3) Re: your sentence:
 *
 *
 **I* would expect that with a shrinking set of matching documents to
 the overall-query, the function query only checks those documents that are
 guaranteed to be within the result set.*
 *
 *
 Yes, I agree with this, but this snippet of code in FunctionQuery.java
 seems to say otherwise:

 // instead of matching all docs, we could also embed a query.
 // the score could either ignore the subscore, or boost it.
 // Containment:  floatline(foo:myTerm, myFloatField, 1.0, 0.0f)
 // Boost:foo:myTerm^floatline(myFloatField,1.0,0.0f)
 @Override
 public int nextDoc() throws IOException {
   for(;;) {
 ++doc;
  if (doc>=maxDoc) {
   return doc=NO_MORE_DOCS;
 }
 if (acceptDocs != null  

Re: files left open?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 5:56 PM, Paulo Magalhaes
paulo.magalh...@gmail.com wrote:
 I was loading a big (60 million docs) csv in solr 4 when something odd
 happened.
 I got a solr error in the log saying that it could not write the file.
 du -s indicated I had used 30Gb of a 50Gb available but df -k  indicated
 that the disk was I00% used.

You probably hit a big segment merge, which does require more disk
space temporarily.
The difference between du and df probably just indicates how they
work internally (du may just look at file sizes, and files that have not
been closed yet can register as smaller than, or even 0 compared to, the
amount of disk space they actually take up).

-Yonik
lucidimagination.com


Re: Setting solrj server connection timeout

2012-02-16 Thread Mark Miller
I'm not sure that timeout will help you here - I believe it's the timeout on
'creating' the connection.

Try setting the socket timeout (setSoTimeout) - that should let you try
sooner.

It looks like perhaps the server is timing out and closing the connection.

I guess all you can do is time out reasonably (so that you don't wait too
long for the exception) and retry.
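
Something along these lines (untested sketch, SolrJ 3.x CommonsHttpSolrServer):

import java.net.MalformedURLException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TimeoutExample {
  public static CommonsHttpSolrServer create(String url) throws MalformedURLException {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    server.setConnectionTimeout(15000); // ms allowed to establish the connection
    server.setSoTimeout(15000);         // ms of socket inactivity before giving up
    return server;
  }
}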

On Fri, Feb 3, 2012 at 3:12 PM, Shawn Heisey s...@elyograg.org wrote:

 Is the following a reasonable approach to setting a connection timeout
 with SolrJ?

  queryCore.getHttpClient().getHttpConnectionManager().getParams()
      .setConnectionTimeout(15000);

 Right now I have all my solr server objects sharing a single HttpClient
 that gets created using the multithreaded connection manager, where I set
 the timeout for all of them.  Now I will be letting each server object
 create its own HttpClient object, and using the above statement to set the
 timeout on each one individually.  It'll use up a bunch more memory, as
 there are 56 server objects, but maybe it'll work better.  The total of 56
 objects comes about from 7 shards, a build core and a live core per shard,
 two complete index chains, and for each of those, one server object for
 access to CoreAdmin and another for the index.

 The impetus for this, as it's possible I'm stating an XY problem:
 Currently I have an occasional problem where SolrJ connections throw an
 exception.  When it happens, nothing is logged in Solr.  My code is smart
 enough to notice the problem, send an email alert, and simply try again at
 the top of the next minute.  The simple explanation is that this is a Linux
 networking problem, but I never had any problem like this when I was using
 Perl with LWP to keep my index up to date.  I sent a message to the list
 some time ago on this exception, but I never got a response that helped me
 figure it out.

 Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)
        at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)
        ... 3 more
 Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
        at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
        at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
        at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
        ... 7 more


 Thanks,
 Shawn




-- 
- Mark

http://www.lucidimagination.com


how to delta index linked entities in 3.5.0

2012-02-16 Thread AdamLane
The delta instructions from
https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
works for me in solr 1.4 but crashes in 3.5.0 (error: deltaQuery has no
column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID'  issue:
https://issues.apache.org/jira/browse/SOLR-2907) 

Is there anyone out there that can confirm my bug?  Because I am new to solr
and hopefully I am just doing something wrong based on a misunderstanding of
the wiki.  Anyone successfully indexing the join of items and multiple
item_categories just like the wiki example that would be willing to share
their workaround or suggest a workaround?

Thanks,
Adam   


--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-delta-index-linked-entities-in-3-5-0-tp3752455p3752455.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Setting solrj server connection timeout

2012-02-16 Thread Shawn Heisey

On 2/16/2012 6:28 PM, Mark Miller wrote:

Im not sure that timeout will help you here - I believe it's the timeout on
'creating' the connection.

Try setting the socket timeout (setSoTimeout) - that should let you try
sooner.

It looks like perhaps the server is timing out and closing the connection.

I guess all you can do is timeout reasonably (if it takes too long to we
for the exception) and retry.


When the timeout exception happens, it is happening within the same 
second as the beginning of the update cycle, which involves a lot of 
other things happening (such as talking to a database) before it even 
gets around to talking to Solr.  I do not have millisecond timestamps, 
but from what little I can tell, it's a handful of milliseconds from 
when SolrJ starts the request until the exception is logged.  It happens 
relatively rarely - no more than once every few days, usually less often 
than that.  I cannot reproduce it at will.  Nobody is doing any work on 
either Solr or the network when it happens.  Nothing is logged in the 
Solr server log or syslog at the OS level, the only mention of anything 
bad going on is in the log of my SolrJ application.


I never had this problem when my build system was written in Perl, using 
LWP to make HTTP requests with URLs that I constructed myself.  The perl 
system ran on CentOS 5 with Xen virtualization, now I'm running CentOS 6 
on the bare metal.  I'm using a bonded interface (for failover, not load 
balancing) comprised of two NICs plugged into separate switches.  When 
it was virtualized, the Xen host was also using an identically 
configured bonded interface, bridged to the guests, which used eth0.


The last time the error happened, which was on Feb 15th at 2:04 PM MST, 
the query that failed was 'did:(289800299 OR 289800157)', a very simple 
query against a tlong field.  The application tests for the existence of 
the did values that it is trying to delete before it issues the delete 
request.


I'm willing to look deeper into possible networking issues, but I am 
skeptical about that being the problem, and because there are no log 
messages to investigate, I have no idea how to proceed.  The application 
runs on one of four Solr servers, sometimes the error even happens when 
connecting to Solr on the same server it's running on, which takes the 
gigabit switches out of the equation.  If it's an actual networking 
problem, it's either in the hardware (Dell PowerEdge 2950 III, built-in 
NICs) or the CentOS 6 kernel.


At this point, I am thinking it's one of the following problems, in 
order of decreasing probability: 1) I am using SolrJ incorrectly. 2) 
There is a SolrJ problem that only appears under specific circumstances 
that happen to exist in my setup. 3) My hardware or OS software has an 
extremely intermittent problem.


What other info can I provide?

Thanks,
Shawn



Sort by the number of matching terms (coord value)

2012-02-16 Thread Nicholas Clark
Hi,

I'm looking for a way to sort results by the number of matching terms.
Being able to sort by the coord() value or by the overlap value that gets
passed into the coord() function would do the trick. Is there a way I can
expose those values to the sort function?

I'd appreciate any help that points me in the right direction. I'm OK with
making basic code modifications.

Thanks!

-Nick


Re: how to delta index linked entities in 3.5.0

2012-02-16 Thread Shawn Heisey

On 2/16/2012 6:31 PM, AdamLane wrote:

The delta instructions from
https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
works for me in solr 1.4 but crashes in 3.5.0 (error: deltaQuery has no
column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID'  issue:
https://issues.apache.org/jira/browse/SOLR-2907)

Is there anyone out there that can confirm my bug?  Because I am new to solr
and hopefully I am just doing something wrong based on a misunderstanding of
the wiki.  Anyone successfully indexing the join of items and multiple
item_categories just like the wiki example that would be willing to share
their workaround or suggest a workaround?


I ran into something like this, possibly even this exact problem.

Things have been tightened up in 3.x.  All query results now need to 
have a field corresponding to what you've defined as pk, or it's 
considered an error.  I was not using the results from my deltaQuery, 
but I still had to adjust it so that it returned a field with the same 
name as my primary key.  You have defined more than one field for your 
pk, so I don't really know exactly what you'll have to do - perhaps you 
need to have both ITEM_ID and CATEGORY_ID fields in your query results.


Thanks,
Shawn



Re: Sort by the number of matching terms (coord value)

2012-02-16 Thread Li Li
You can fool the lucene scoring function: override each function such as idf,
queryNorm and lengthNorm and let them simply return 1.0f.
I don't know whether lucene 4 will expose more details, but for 2.x/3.x, lucene
can only score by the vector space model and the formula can't be replaced by users.
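
Roughly like this (untested sketch against the 3.x DefaultSimilarity; instead of
overriding lengthNorm it relies on omitNorms="true" on the field so document
length does not creep back in, and you register the class via the similarity
element in schema.xml):

import org.apache.lucene.search.DefaultSimilarity;

// Neutralise tf, idf and queryNorm so the final score is dominated by the
// coord() factor, i.e. by the fraction of query terms that matched.
public class MatchCountSimilarity extends DefaultSimilarity {

  @Override
  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f; // count a term once, however often it occurs
  }

  @Override
  public float idf(int docFreq, int numDocs) {
    return 1.0f; // ignore how rare a term is
  }

  @Override
  public float queryNorm(float sumOfSquaredWeights) {
    return 1.0f; // keep scores comparable across queries
  }
}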

On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark clark...@gmail.com wrote:

 Hi,

 I'm looking for a way to sort results by the number of matching terms.
 Being able to sort by the coord() value or by the overlap value that gets
 passed into the coord() function would do the trick. Is there a way I can
 expose those values to the sort function?

 I'd appreciate any help that points me in the right direction. I'm OK with
 making basic code modifications.

 Thanks!

 -Nick



Improving proximity search performance

2012-02-16 Thread Bryan Loofbourrow
Here’s my use case. I expect to set up a Solr index that is approximately
1.4GB (this is a real number from the proof-of-concept using the real data,
which consists of about 10 million documents, many of significant size, and
making use of the FastVectorHighlighter to do highlighting on the body text
field, which is of course stored, and with termVectors, termPositions, and
termOffsets on).



I no longer have the proof-of-concept Solr core available (our live site
uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical
answer to this question: Will storing that extra information about the
location of terms help the performance of proximity searches?



A significant and important subset of my users make extensive use of
proximity searches. These sophisticated users have found that they are best
able to locate what they want by doing searches about THISWORD within 5
words of THATWORD, or much more sophisticated variants on that theme,
including plenty of booleans and wildcards. The problem I’m facing is
performance. Some of these searches, when common words are used, can take
many minutes, even with the index on an SSD.



The question is, how to improve the performance. It occurred to me as
possible that all of that term vector information, stored for the benefit
of the FastVectorHighlighter, might be a significant aid to the performance
of these searches.



First question: is that already the case? Will storing this extra
information automatically improve my proximity search performance?



Second question: If not, I’m very willing to dive into the code and come up
with a patch that would do this. Can someone with knowledge of the
internals comment on whether this is a plausible strategy for improving
performance, and, if so, give tips about the outlines of what a successful
approach to the problem might look like?



Third question: Any tips in general for improving the performance of these
proximity searches? I have explored the question of whether the customers
might be weaned off of them, and that does not appear to be an option.



Thanks,



-- Bryan Loofbourrow


RE: Frequent garbage collections after a day of operation

2012-02-16 Thread Bryan Loofbourrow
A couple of thoughts:

We wound up doing a bunch of tuning on the Java garbage collection.
However, the pattern we were seeing was periodic very extreme slowdowns,
because we were then using the default garbage collector, which blocks
when it has to do a major collection. This doesn't sound like your
problem, but it's something to be aware of.

One thing that could fit the pattern you describe would be Solr caches
filling up and getting you too close to your JVM or memory limit. For
example, if you have large documents, and have defined a large document
cache, that might do it.

I found it useful to point jconsole (free with the JDK) at my JVM, and
watch the pattern of memory usage. If the troughs at the bottom of the GC
cycles keep rising, you know you've got something that is continuing to
grab more memory and not let go of it. Now that our JVM is running
smoothly, we just see a sawtooth pattern, with the troughs approximately
level. When the system is under load, the frequency of the wave rises. Try
it and see what sort of pattern you're getting.

-- Bryan

 -Original Message-
 From: Matthias Käppler [mailto:matth...@qype.com]
 Sent: Thursday, February 16, 2012 7:23 AM
 To: solr-user@lucene.apache.org
 Subject: Frequent garbage collections after a day of operation

 Hey everyone,

 we're running into some operational problems with our SOLR production
 setup here and were wondering if anyone else is affected or has even
 solved these problems before. We're running a vanilla SOLR 3.4.0 in
 several Tomcat 6 instances, so nothing out of the ordinary, but after
 a day or so of operation we see increased response times from SOLR, up
 to 3 times increases on average. During this time we see increased CPU
 load due to heavy garbage collection in the JVM, which bogs down the
 the whole system, so throughput decreases, naturally. When restarting
 the slaves, everything goes back to normal, but that's more like a
 brute force solution.

 The thing is, we don't know what's causing this and we don't have that
 much experience with Java stacks since we're for most parts a Rails
 company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
 seeing this, or can you think of a reason for this? Most of our
 queries to SOLR involve the DismaxHandler and the spatial search query
 components. We don't use any custom request handlers so far.

 Thanks in advance,
 -Matthias

 --
 Matthias Käppler
 Lead Developer API  Mobile

 Qype GmbH
 Großer Burstah 50-52
 20457 Hamburg
 Telephone: +49 (0)40 - 219 019 2 - 160
 Skype: m_kaeppler
 Email: matth...@qype.com

 Managing Director: Ian Brotherston
 Amtsgericht Hamburg
 HRB 95913



Re: distributed deletes working?

2012-02-16 Thread Mark Miller
Yup - deletes are fine.
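For example (host, collection, and id below are placeholders), a delete
sent to any node should be hashed on the unique key and forwarded to the
right shard leader, and from there to its replicas:

  curl 'http://anynode:8983/solr/collection1/update?commit=true' \
    -H 'Content-type: text/xml' \
    --data-binary '<delete><id>doc123</id></delete>'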


On Thu, Feb 16, 2012 at 8:56 PM, Jamie Johnson jej2...@gmail.com wrote:

 With SOLR-2358 being committed to trunk, do deletes and updates get
 distributed/routed like adds do? Also, when a down shard comes back up, are
 the deletes/updates forwarded as well? Reading the JIRA issue I believe the
 answer is yes; I just want to verify before bringing the latest into my
 environment.




-- 
- Mark

http://www.lucidimagination.com


Re: Sort by the number of matching terms (coord value)

2012-02-16 Thread Nicholas Clark
I want to leave the score intact so I can sort by matching term frequency
and then by score. I don't think I can do that if I modify all the
similarity functions, but I think your solution would have worked otherwise.

It would be great if there was a way I could expose this information
through a function query (similar to the new relevance functions in version
4.0). I'll have to see if I can figure out how those functions work.

-Nick


On Thu, Feb 16, 2012 at 6:58 PM, Li Li fancye...@gmail.com wrote:

 you can fool the Lucene scoring function: override each function such as idf,
 queryNorm, and lengthNorm, and let them simply return 1.0f.
 I don't know whether Lucene 4 will expose more details, but for 2.x/3.x, Lucene
 can only score with the vector space model and the formula can't be replaced by users.
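
A minimal sketch of that trick against the 3.x API (the class name is made
up, and note it flattens the normal score rather than preserving it, which
is the trade-off mentioned above):

  import org.apache.lucene.search.DefaultSimilarity;

  // Illustrative only: with these factors flattened, a document's score
  // reduces to the number of matching query clauses times the coord()
  // factor, so ordering by score effectively orders by how many terms matched.
  public class CoordOnlySimilarity extends DefaultSimilarity {
      @Override
      public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }

      @Override
      public float idf(int docFreq, int numDocs) { return 1.0f; }

      @Override
      public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }

      @Override
      public float lengthNorm(String fieldName, int numTerms) { return 1.0f; }
  }

It would then be registered index-wide in schema.xml with something like
<similarity class="com.example.CoordOnlySimilarity"/> (package name is a
placeholder), which is why it cannot coexist with the ordinary relevance score.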

 On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark clark...@gmail.com
 wrote:

  Hi,
 
  I'm looking for a way to sort results by the number of matching terms.
  Being able to sort by the coord() value or by the overlap value that gets
  passed into the coord() function would do the trick. Is there a way I can
  expose those values to the sort function?
 
  I'd appreciate any help that points me in the right direction. I'm OK
 with
  making basic code modifications.
 
  Thanks!
 
  -Nick
 



Ranking based on number of matches in a multivalued field?

2012-02-16 Thread Steven Ou
So suppose I have a multivalued field for categories. Let's say we have 3
items with these categories:

Item 1: category ids [1,2,5,7,9]
Item 2: category ids [4,8,9]
Item 3: category ids [1,4,9]

I now run a filter query for any of the following category ids [1,4,9]. I
should get all of them back as results because they all include at least
one category which I'm querying.

Now, how do I order the results based on the number of matching categories? In
this case, I would like Item 3 (matched all of [1,4,9]) to be ranked highest,
followed by Item 2 (matched [4,9]) and Item 1 (matched [1,9]). Is there a
way I can boost documents based on the number of matches?

I don't want an absolute rank where Item 3 is definitely the first
result, but rather a way to boost Item 3's score higher than that of Item 1
and 2 so that it's more likely to show up higher (depending on the query
string).
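
One common way to get that effect, sketched here with a guessed field name
(category_ids) and dismax parameters, is to keep the filter query for
matching and repeat the categories as an optional boost query, so each
matching category adds to the score instead of only filtering:

  q=<user query>&defType=dismax
    &fq=category_ids:(1 OR 4 OR 9)
    &bq=category_ids:(1 OR 4 OR 9)

Each bq clause that matches contributes to the score, so Item 3 tends to
land above Items 2 and 1 without being forced to the top, though idf
differences between category values will skew the contributions somewhat.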

Thanks!
--
Steven Ou | 歐偉凡

*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880


UpdateRequestHandler coding

2012-02-16 Thread Lance Norskog
If I want to write a complex UpdateRequestHandler, should I do it on
trunk or the 3.x branch? The criteria are a stable, debugged,
full-featured environment.

-- 
Lance Norskog
goks...@gmail.com


Re: Size of suggest dictionary

2012-02-16 Thread Mike Hugo
Thanks Em!

What if we use a threshold value in the suggest configuration, like 

  <float name="threshold">0.005</float>

I assume the dictionary size will then be smaller than the total number of
distinct terms; is there any way to determine what that size is?
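
As a point of reference, the Schema Browser numbers come from the Luke
request handler, so a request along these lines (host and path are
placeholders for the actual core) should report the distinct term count for
the label field:

  http://localhost:8983/solr/admin/luke?fl=label

That is the unpruned count; because threshold drops terms by document
frequency, the dictionary built with it will be smaller, and I am not aware
of a stat that reports the pruned size directly.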

Thanks,

Mike


On Wednesday, February 15, 2012 at 4:39 PM, Em wrote:

 Hello Mike,
 
 have a look at Solr's Schema Browser. Click on FIELDS, select label
 and have a look at the number of distinct (term-)values.
 
 Regards,
 Em
 
 
 Am 15.02.2012 23:07, schrieb Mike Hugo:
  Hello,
  
  We're building an auto suggest component based on the label field of
  documents. Is there a way to see how many terms are in the dictionary, or
  how much memory it's taking up? I looked on the statistics page but didn't
  find anything obvious.
  
  Thanks in advance,
  
  Mike
  
  ps- here's the config:
  
   <searchComponent name="suggestlabel" class="solr.SpellCheckComponent">
     <lst name="spellchecker">
       <str name="name">suggestlabel</str>
       <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
       <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
       <str name="field">label</str>
       <str name="buildOnOptimize">true</str>
     </lst>
   </searchComponent>
   
   <requestHandler name="suggestlabel"
                   class="org.apache.solr.handler.component.SearchHandler">
     <lst name="defaults">
       <str name="spellcheck">true</str>
       <str name="spellcheck.dictionary">suggestlabel</str>
       <str name="spellcheck.count">10</str>
     </lst>
     <arr name="components">
       <str>suggestlabel</str>
     </arr>
   </requestHandler>
  
 
 
 




Re: Frequent garbage collections after a day of operation

2012-02-16 Thread Jason Rutherglen
 One thing that could fit the pattern you describe would be Solr caches
 filling up and getting you too close to your JVM or memory limit

This [uncommitted] issue would solve that problem by allowing the GC
to collect caches that become too large, though in practice, the cache
setting would need to be fairly large for an OOM to occur from them:
https://issues.apache.org/jira/browse/SOLR-1513

On Thu, Feb 16, 2012 at 7:14 PM, Bryan Loofbourrow
bloofbour...@knowledgemosaic.com wrote:
 A couple of thoughts:

 We wound up doing a bunch of tuning on the Java garbage collection.
 However, the pattern we were seeing was periodic very extreme slowdowns,
 because we were then using the default garbage collector, which blocks
 when it has to do a major collection. This doesn't sound like your
 problem, but it's something to be aware of.

 One thing that could fit the pattern you describe would be Solr caches
 filling up and getting you too close to your JVM or memory limit. For
 example, if you have large documents, and have defined a large document
 cache, that might do it.

 I found it useful to point jconsole (free with the JDK) at my JVM, and
 watch the pattern of memory usage. If the troughs at the bottom of the GC
 cycles keep rising, you know you've got something that is continuing to
 grab more memory and not let go of it. Now that our JVM is running
 smoothly, we just see a sawtooth pattern, with the troughs approximately
 level. When the system is under load, the frequency of the wave rises. Try
 it and see what sort of pattern you're getting.

 -- Bryan

 -Original Message-
 From: Matthias Käppler [mailto:matth...@qype.com]
 Sent: Thursday, February 16, 2012 7:23 AM
 To: solr-user@lucene.apache.org
 Subject: Frequent garbage collections after a day of operation

 Hey everyone,

 we're running into some operational problems with our SOLR production
 setup here and were wondering if anyone else is affected or has even
 solved these problems before. We're running a vanilla SOLR 3.4.0 in
 several Tomcat 6 instances, so nothing out of the ordinary, but after
  a day or so of operation we see increased response times from SOLR, up
  to a 3x increase on average. During this time we see increased CPU
  load due to heavy garbage collection in the JVM, which bogs down the
  whole system, so throughput decreases, naturally. When restarting
 the slaves, everything goes back to normal, but that's more like a
 brute force solution.

 The thing is, we don't know what's causing this and we don't have that
 much experience with Java stacks since we're for most parts a Rails
 company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
 seeing this, or can you think of a reason for this? Most of our
 queries to SOLR involve the DismaxHandler and the spatial search query
 components. We don't use any custom request handlers so far.

 Thanks in advance,
 -Matthias

 --
 Matthias Käppler
 Lead Developer API  Mobile

 Qype GmbH
 Großer Burstah 50-52
 20457 Hamburg
 Telephone: +49 (0)40 - 219 019 2 - 160
 Skype: m_kaeppler
 Email: matth...@qype.com

 Managing Director: Ian Brotherston
 Amtsgericht Hamburg
 HRB 95913
