Entity with multiple datasources
Hello, I created a data-config.xml file where I define a dataSource and an entity with 12 fields. In my use case I have 2 databases with the same schema, so I want to combine the 2 databases in one index. I defined a second dataSource tag and duplicated the entity with its fields (changed the name and the dataSource). What I'm expecting is to get around 7k results (I have around 6k in the first db and 1k in the second). However I'm getting a total of 2k. Where could the problem be? Thanks
Re: 'foruns' don't match 'forum' with NGramFilterFactory (or EdgeNGramFilterFactory)
Hi, It's funny that if you try fóruns it matches: http://bhakta.casadomato.org:8982/solr/select/?q=f%C3%B3runs&version=2.2&start=0&rows=10&indent=on But when you try foruns, it does not. Check this out: http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3rum&qverbose=on&qval=foruns See that stemming does not work for the word foruns. Could it be because fórum is part of the PT dictionary but not forum? Regards, 2012/2/14 Bráulio Bhavamitra brauli...@gmail.com Hello all, I'm experimenting with NGramFilterFactory and EdgeNGramFilterFactory. Both of them show a match in my Solr admin analysis, but when I query 'foruns' it doesn't find any 'forum'. analysis http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3runs&qverbose=on&qval=f%C3%B3runs search http://bhakta.casadomato.org:8982/solr/select/?q=foruns&version=2.2&start=0&rows=10&indent=on Anybody know what the problem is? bráulio -- Dirceu Vieira Júnior --- +47 9753 2473 dirceuvjr.blogspot.com twitter.com/dirceuvjr
Problem indexing a PDF directory
Hi all, I have a problem configuring PDF indexing from a directory in my Solr with DIH. With this data-config:

    <dataConfig>
      <dataSource type="BinFileDataSource" />
      <document>
        <entity name="tika-test" processor="FileListEntityProcessor"
                baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
                recursive="true" rootEntity="false" dataSource="null"/>
        <entity processor="FileListEntityProcessor"
                url="D:\gioconews_archivio\marzo2011" format="text">
          <field column="author" name="author" meta="true"/>
          <field column="title" name="title" meta="true"/>
          <field column="description" name="description" />
          <field column="comments" name="comments" />
          <field column="content_type" name="content_type" />
          <field column="last_modified" name="last_modified" />
        </entity>
      </document>
    </dataConfig>

I obtain this result:

    <str name="command">full-import</str>
    <str name="status">idle</str>
    <str name="importResponse" />
    <lst name="statusMessages">
      <str name="Time Elapsed">0:0:2.44</str>
      <str name="Total Requests made to DataSource">0</str>
      <str name="Total Rows Fetched">43</str>
      <str name="Total Documents Skipped">0</str>
      <str name="Full Dump Started">2012-02-12 19:06:00</str>
      <str name="">Indexing failed. Rolled back all changes.</str>
      <str name="Rolledback">2012-02-12 19:06:00</str>
    </lst>

Suggestions? thank you, alessio
Best requestHandler for typing error.
Hello. Which request handler do you use to catch typing errors like goolge = did you mean google? I want to combine my EdgeNGram autosuggestion with a clever autocorrection! What do you use? --- System: One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 1 Core with 45 Million Documents, other Cores 200.000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Requests - delta every Minute - 4GB Xmx
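A common building block for did-you-mean (an aside, not from the thread) is the SpellCheckComponent hooked into a SearchHandler via last-components; collation then yields a single corrected query to offer the user. A sketch for solrconfig.xml, with the source field name spell and the handler name /didyoumean assumed:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>

    <requestHandler name="/didyoumean" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.count">5</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>

Note that buildOnCommit on a 45-million-document core can be expensive; with a commit every minute you would probably rebuild the dictionary on a schedule instead.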
Re: Do we need reindexing from solr 1.4.1 to 3.5.0?
I kept the old schema and solrconfig files, but there were some errors due to which Solr was not loading. I don't know what those were. We have a few of our own custom plugins developed against 1.4.1.
Re: Do we need reindexing from solr 1.4.1 to 3.5.0?
We have both stored=true and stored=false fields in the schema, so we can't reindex the way you suggested; we tried that earlier.
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
Please find my replies inline. On Thu, Feb 16, 2012 at 10:30 AM, Alexey Verkhovsky alexey.verkhov...@gmail.com wrote:

Hi, all, I'm new here. Used Solr on a couple of projects before, but didn't need to dive deep into anything until now. These days, I'm doing a spike for a yellow-pages-type search server with the following technical requirements: ~10 mln listings in the database. A listing has a name, address, description, coordinates and a number of tags / filtering fields; no more than a kilobyte all told; i.e., theoretically the whole thing should fit in RAM without sharding. A typical query is either all text matches on name and/or description within a bounded box, or some combination of tag matches within a bounded box. Bounded boxes are 1 to 50 km wide, and contain up to 10^5 unfiltered listings (the average is more like 10^3). More than 50% of all the listings are in the frequently requested bounding boxes; however, a vast majority of listings are almost never displayed (because they don't match the other filters). Data never changes (i.e., a daily batch update; rebuild of the entire index and restart of all search servers is feasible, as long as it takes minutes, not hours).

Everybody starts with a daily bounce, but ends up with an UPDATED_AT column and delta updates; just consider the urgent-content-fix use case. I don't think it's worth relying on a daily bounce as a cornerstone of the architecture.

This thing ideally should serve up to 10^3 requests per second on a small (as in, less than 10 commodity boxes) cluster. In other words, a typical request should be CPU bound and take ~100-200 msec to process. Because of coordinates (that are almost never the same), caching of queries makes no sense;

You can use a grid of coordinates to reduce their entropy. If you filter by bounding box, the argument is a bounding box, not coordinates. Anyway, use postfiltering and cache=false for such filters: http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/

from what little I understand about Lucene internals, caching of filters probably doesn't make sense either.

But Solr does it: http://wiki.apache.org/solr/SolrCaching#filterCache

After perusing documentation and some googling (but almost no source code exploring yet), I understand how the schema and the queries will look, and now have to figure out a specific configuration that fits the performance/scalability requirements. Here is what I'm thinking: 1. Search server is an internal service that uses embedded Solr for the indexing part. RAMDirectoryFactory as index storage.

Bad idea. It's purposed mostly for tests; the closest production-purposed analogue is org.apache.lucene.store.instantiated.InstantiatedIndex.

2. All data is in some sort of persistent storage on a file system, and is loaded into memory when a search server starts up.

AFAIK the state of the art is to use a file directory (MMap or whatever) and rely on the Linux file system RAM cache. Also, Solr (and partially Lucene) caches some stuff on the heap itself: http://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration. So this is mostly done already.

3. Data updates are handled as: update the persistent storage, start another cluster, load the world into RAM, flip the load balancer, kill the old cluster.

No again. Lucene has a pretty cool model of segments and generations purposed for incremental updates. And Solr does a lot to search in the old generation and warm up the new one simultaneously (it just takes some memory, you know, two times). I don't think a manual A/B scheme is applicable. Anyway, you can (but don't really need to) play around with the replication facilities, e.g. disable traffic for half of the nodes, push the new index onto them, let them warm up, enable traffic (such machinery never works smoothly due to the number of moving parts).

4. Solr returns IDs with relevance scores; actual presentations of listings (as JSON documents) are constructed outside of Solr and cached in Memcached, as mostly static content with a few templated bits, like distance <%= DISTANCE_TO(-123.0123, 45.6789) %>.

Using separate nodes to do the search and other nodes to stream the content sounds good (mentioned in every book). It looks like besides the score you can also return the distance to the user, i.e. there is no need for <%= DISTANCE_TO(-123.0123, 45.6789) %>, just <%= doc.DISTANCE %>; see http://wiki.apache.org/solr/SpatialSearch?#Returning_the_distance

5. All Solr caching is switched off.

But why?

Obviously, we are not the first people to do something like this with Solr, so I'm hoping for some collective wisdom on the following: Does this sound like a feasible set of requirements in terms of performance and scalability for Solr? Are we on the right path to solving this problem well? If not, what should we be doing instead? What nasty technical/architectural gotchas are we probably missing at this stage? One particular advice I'd be really happy to hear is you may not need RAMDirectoryFactory if
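A sketch of what the postfiltering advice above adds up to in a single request — assuming a field named geopoint and trunk-era spatial syntax; illustrative, not a tested recipe:

    q=name:pizza
    &fq={!bbox cache=false cost=100 sfield=geopoint pt=45.6789,-123.0123 d=50}
    &fl=id,score
    &rows=20

Here cache=false keeps the one-off bounding box out of the filter cache, and a cost of 100 or more asks Solr to apply it as a post filter, per the advanced-filter-caching post linked above.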
Re: Entity with multiple datasources
1. Do you see any errors / exceptions in the logs? 2. Could you have duplicates? On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev radut...@gmail.com wrote: Hello, I created a data-config.xml file where I define a dataSource and an entity with 12 fields. In my use case I have 2 databases with the same schema, so I want to combine the 2 databases in one index. I defined a second dataSource tag and duplicated the entity with its fields (changed the name and the dataSource). What I'm expecting is to get around 7k results (I have around 6k in the first db and 1k in the second). However I'm getting a total of 2k. Where could the problem be? Thanks -- Regards, Dmitry Kan
Re: Spatial Search and faceting
Hi William, Thanks for the feedback. I will try the group query and see how the performance with 2 queries is. Best Regards, Ericz

On Thu, Feb 16, 2012 at 4:06 AM, William Bell billnb...@gmail.com wrote: One way to do it is to group by city and then sort=geodist() asc:

    select?group=true&group.field=city&sort=geodist() desc&rows=10&fl=city

It might require 2 calls to SOLR to get it the way you want.

On Wed, Feb 15, 2012 at 5:51 PM, Eric Grobler impalah...@googlemail.com wrote: Hi Solr community, I am doing a spatial search and then do a facet by city. Is it possible to then sort the faceted cities by distance? We would like to display the hits per city, but sort them by distance. Thanks Regards Ericz

    q=iphone
    fq={!bbox} sfield=geopoint pt=49.594857,8.468614 d=50
    fl=id,description,city,geopoint
    facet=true
    facet.field=city
    f.city.facet.limit=10
    f.city.facet.sort=count //geodist() asc

-- Bill Bell billnb...@gmail.com cell 720-256-8076
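Putting William's two-call suggestion together — a sketch, not verified against a live index, reusing the field names from Eric's query:

Call 1, one group per city, ordered by distance:

    select?q=iphone&sfield=geopoint&pt=49.594857,8.468614
          &group=true&group.field=city&sort=geodist() asc&fl=city&rows=10

Call 2, hit counts per city within the radius:

    select?q=iphone&fq={!bbox sfield=geopoint pt=49.594857,8.468614 d=50}
          &facet=true&facet.field=city&f.city.facet.limit=-1

The client then reads the city ordering from call 1 and looks up each city's hit count in call 2's facet results.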
Realtime search with multi clients updating index simultaneously.
I have a helpdesk application developed in PHP/MySQL. I want to implement real-time full-text search and I have shortlisted Solr. The MySQL database will store all the tickets and their updates, and that data will be imported to build the Solr index. All search requests will be handled by Solr. What I want is real-time search: the moment someone updates a ticket, it should be available for search. As per my understanding of Solr, this is how I think the system will work: a user updates a ticket - the database record is modified - a request is sent to the Solr server to modify the corresponding document in the index. I have read a book on Solr and the questions below are troubling me. 1. The book mentions that commits are slow in Solr: Depending on the index size, Solr's auto-warming configuration, and Solr's cache state prior to committing, a commit can take a non-trivial amount of time. Typically, it takes a few seconds, but it can take some number of minutes in extreme cases. If this is true, then how will I know when the data will be available for search, and how can I implement real-time search? Also, I don't want the ticket update operation to be slowed down (by adding the extra step of updating the Solr index). 2. It is also mentioned that there is no transaction isolation: This means that if more than one Solr client were to submit modifications and commit them at overlapping times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit. This applies to rollback as well. If this is a problem for your architecture then consider using one client process responsible for updating Solr. Does it mean that, due to the lack of transactional commits, Solr can mess up the updates when multiple people update tickets simultaneously? Now the question before me is: does Solr fit my case? If yes, how?
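One knob worth knowing about here (not raised in the thread): since Solr 1.4, an update request can carry a commitWithin attribute, which asks Solr to commit within the given number of milliseconds without the client issuing an explicit, blocking commit. A sketch of the XML update message, with hypothetical field names:

    <add commitWithin="5000">
      <doc>
        <field name="id">ticket-1042</field>
        <field name="subject">Printer offline after firmware update</field>
      </doc>
    </add>

This bounds the staleness window (here, 5 seconds) instead of making each ticket update wait on a commit, and it also sidesteps the overlapping-commit concern, since clients stop issuing commits themselves.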
Re: Entity with multiple datasources
1. Nothing in the logs. 2. No. On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan dmitry@gmail.com wrote: 1. Do you see any errors / exceptions in the logs? 2. Could you have duplicates? [...]
Re: Entity with multiple datasources
It sounds a bit as if SOLR stopped processing data once it had queried everything from the smaller dataset; that's why you have 2000. If you just have a handler pointed at the bigger data set (6k), do you manage to get all 6k db entries into Solr? On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev radut...@gmail.com wrote: 1. Nothing in the logs. 2. No. [...] -- Regards, Dmitry Kan
Re: Entity with multiple datasources
I tried running with just one datasource (the one that has 6k entries) and it indexes them OK. The same if I run the 1k database separately: it indexes OK. On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan dmitry@gmail.com wrote: It sounds a bit as if SOLR stopped processing data once it had queried everything from the smaller dataset; that's why you have 2000. If you just have a handler pointed at the bigger data set (6k), do you manage to get all 6k db entries into Solr? [...]
Re: Entity with multiple datasources
OK, maybe you can show the db-data-config.xml just in case? Also, in schema.xml, does your uniqueKey correspond to the unique field in the db? On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev radut...@gmail.com wrote: I tried running with just one datasource (the one that has 6k entries) and it indexes them OK. The same if I run the 1k database separately: it indexes OK. [...] -- Regards, Dmitry Kan
Re: Entity with multiple datasources
    <dataConfig>
      <dataSource name="s" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="" user="" password=""/>
      <dataSource name="p" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="" user="" password=""/>
      <document>
        <entity name="ms" datasource="s"
                query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as m_machine_ivk,
                       m.sitename as m_sitename, m.deliveryDate as m_delivery_date, m.hotsite as m_hotsite,
                       m.guardian as m_guardian, m.warranty as m_warranty, m.contract as m_contract,
                       st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
                       sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
                       c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as m_c_code
                       FROM Machine AS m
                       LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
                       LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
                       LEFT JOIN Platform AS p ON m.fk_platform = p.id
                       LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
                       LEFT JOIN Country AS c ON fk_country = c.id"
                readOnly="true" transformer="DateFormatTransformer">
          <field column="id" />
          <field column="m_machine_serial"/>
          <field column="m_machine_ivk"/>
          <field column="m_sitename"/>
          <filed column="m_delivery_date" dateTimeFormat="yyyy-MM-dd"/>
          <field column="m_hotsite"/>
          <field column="m_guardian"/>
          <field column="m_warranty"/>
          <field column="m_contract"/>
          <field column="m_st_name"/>
          <field column="m_pm_name"/>
          <field column="m_p_name"/>
          <field column="m_sv_name"/>
          <field column="m_c_cluster_major"/>
          <field column="m_c_cluster_minor"/>
          <field column="m_c_country"/>
          <field column="m_c_code"/>
        </entity>
        <entity name="mp" datasource="p"
                query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as m_machine_ivk,
                       m.sitename as m_sitename, m.deliveryDate as m_delivery_date, m.hotsite as m_hotsite,
                       m.guardian as m_guardian, m.warranty as m_warranty, m.contract as m_contract,
                       st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
                       sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
                       c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as m_c_code
                       FROM Machine AS m
                       LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
                       LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
                       LEFT JOIN Platform AS p ON m.fk_platform = p.id
                       LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
                       LEFT JOIN Country AS c ON fk_country = c.id"
                readOnly="true" transformer="DateFormatTransformer">
          <field column="id" />
          <field column="m_machine_serial"/>
          <field column="m_machine_ivk"/>
          <field column="m_sitename"/>
          <filed column="m_delivery_date" dateTimeFormat="yyyy-MM-dd"/>
          <field column="m_hotsite"/>
          <field column="m_guardian"/>
          <field column="m_warranty"/>
          <field column="m_contract"/>
          <field column="m_st_name"/>
          <field column="m_pm_name"/>
          <field column="m_p_name"/>
          <field column="m_sv_name"/>
          <field column="m_c_cluster_major"/>
          <field column="m_c_cluster_minor"/>
          <field column="m_c_country"/>
          <field column="m_c_code"/>
        </entity>
      </document>
    </dataConfig>

I've removed the connection params. The unique key is id. On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan dmitry@gmail.com wrote: OK, maybe you can show the db-data-config.xml just in case? Also, in schema.xml, does your uniqueKey correspond to the unique field in the db? [...]
How to loop through the DataImportHandler query results?
Hi Solr community, I'm new to Solr and DataImportHandler. I have a requirement to fetch data from a database table and pass it to Solr. Parts of the existing data-config.xml and Solr schema.xml are given below.

data-config.xml:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://demo22122011.com" user="" password="1234" />
      <document name="sample">
        <entity name="adaptation" pk="sample_id" query="Select * from adap"
                transformer="TemplateTransformer,DateFormatTransformer">
          <field column="field_mrmid_value" name="mrm_id_camp_s_t" />
          <field column="field_sample_scope_value" name="sample_scope_camp_s_s" />
          <field column="field_quarterly_plan_value" name="quarterly_plan_camp_s_s" />
          <field column="field_business_unit_value" name="field_business_unit_value_camp_s_t" />
          <field column="field_sub_business_value" name="field_sub_business_value_camp_s_t" />
          <field column="field_cdescription_value" name="sample_description_camp_s_t" />
          <field column="field_sample_owner_value" name="sample_owner_camp_s_s" />
        </entity>
      </document>
    </dataConfig>

schema.xml:

    <schema name="example" version="1.2">
      <fields>
        <dynamicField name="*_camp_m_i" type="int"    indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_camp_s_i" type="int"    indexed="true" stored="true" multiValued="false"/>
        <dynamicField name="*_camp_m_t" type="text"   indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_camp_s_t" type="text"   indexed="true" stored="true" multiValued="false"/>
        <dynamicField name="*_camp_m_s" type="string" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_camp_s_s" type="string" indexed="true" stored="true" multiValued="false"/>
        <dynamicField name="*_camp_m_l" type="long"   indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_camp_s_l" type="long"   indexed="true" stored="true" multiValued="false"/>
      </fields>
    </schema>

The table used in the query (adap) is often modified; the number of columns in this table changes frequently. Hence we have to change data-config.xml whenever a column is added or deleted. To avoid that, we don't want to name the columns in the field tags, but want to map all the fields in the table to Solr fields even when we don't know how many columns the table has. I need a kind of loop which runs through all the query results and maps them to Solr fields. Please help me. Regards, Baranee
Re: SolrCloud Replication Question
On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote: Not sure if this is expected or not. Nope - should be already resolved or will be today though. - Mark Miller lucidimagination.com
Re: SolrCloud Replication Question
Ok, great. Just wanted to make sure someone was aware. Thanks for looking into this. On Thu, Feb 16, 2012 at 8:26 AM, Mark Miller markrmil...@gmail.com wrote: On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote: Not sure if this is expected or not. Nope - should be already resolved or will be today though. - Mark Miller lucidimagination.com
PatternReplaceFilterFactory group
PatternReplaceFilterFactory has no option to select the group to replace. Is there a reason for this, or could this be a nice feature?
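For what it's worth, the replacement string already supports capture-group references ($1, $2, ...), which covers many keep-only-this-group cases. A sketch that keeps just the digits captured by group 1:

    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^.*?(\d+).*$"
            replacement="$1"
            replace="all"/>

Selecting which group the pattern matches against, though, indeed has no option, so the feature request stands.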
custom scoring
Hello all: We'd like to score the matching documents using a combination of SOLR's IR score with another application-specific score that we store within the documents themselves (i.e., a float field containing the app-specific score). In particular, we'd like to calculate the final score by doing some operations with both numbers (e.g., product, sqrt, ...). According to what we know, there are two ways to do this in SOLR: A) Sort by function [1]: We've tested an expression like sort=product(score, query_score) in the SOLR query, where score is the common SOLR IR score and query_score is our own precalculated score, but it seems that SOLR can only do this with stored/indexed fields (and obviously score is not stored/indexed). B) Function queries: We've used _val_ and function queries like max, sqrt and query, and we've obtained the desired results from a functional point of view. However, our index is quite large (400M documents) and the performance degrades heavily, given that function queries AFAIK match all the documents. I have two questions: 1) Apart from the two options I mentioned, is there any other (simple) way to achieve this that we're not aware of? 2) If we have to choose the function-queries path, would it be very difficult to modify the actual implementation so that it doesn't match all the documents, that is, to pass a query so that it only operates over the documents matching that query? Looking at the FunctionQuery.java source code, there's a comment that says // instead of matching all docs, we could also embed a query. the score could either ignore the subscore, or boost it, which is giving us some hope that maybe it's possible and even desirable to go in this direction. If you can give us some directions about how to go about this, we may be able to do the actual implementation. BTW, we're using Lucene/SOLR trunk. Thanks a lot for your help. Carlos [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
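One option the question doesn't list: the {!boost} query parser wraps a query and multiplies its score by a function value, and the function is evaluated only for documents that match the wrapped query, so it avoids the match-all cost of a bare function query. A sketch, assuming the stored float field is called query_score and the user query is passed via $qq:

    q={!boost b=sqrt(query_score) v=$qq}&qq=hotels in barcelona

Whether this composes the two scores exactly the way you want (it multiplies) is the main thing to check.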
Re: Problem indexing a PDF directory
On 16 February 2012 14:33, alessio crisantemi alessio.crisant...@gmail.com wrote: Hi all, I have a problem configuring PDF indexing from a directory in my Solr with DIH: [...] You should look in your Solr logs for more details about the exception, but as things stand, the above setup will not work for indexing PDF files. You need Tika. Searching Google for solr tika index pdf turns up many possibilities, e.g., http://www.abcseo.com/tech/search/integrating-solr-and-tika http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/ Regards, Gora
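For readers landing here from a search: the usual shape of this setup is an outer FileListEntityProcessor that walks the directory and an inner TikaEntityProcessor that parses each file. A sketch along those lines — untested, reusing the paths and field names from the original post; the schema must actually define the target fields, and content is an assumed field name:

    <dataConfig>
      <dataSource type="BinFileDataSource" name="bin"/>
      <document>
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
                recursive="true" rootEntity="false" dataSource="null">
          <entity name="tika-test" processor="TikaEntityProcessor"
                  url="${files.fileAbsolutePath}" format="text" dataSource="bin">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="content"/>
          </entity>
        </entity>
      </document>
    </dataConfig>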
Re: How to loop through the DataImportHandler query results?
Hi Baranee, Some time ago I played with http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was pretty good stuff. Regards On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan baraneethara...@hp.com wrote: To avoid that, we don't want to name the columns in the field tags, but want to map all the fields in the table to Solr fields even when we don't know how many columns the table has. I need a kind of loop which runs through all the query results and maps them to Solr fields. -- Sincerely yours, Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
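A sketch of how a ScriptTransformer could do that column-agnostic mapping — not tried, and the _camp_s_t suffix is just one of the dynamic-field suffixes from Baranee's schema:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://demo22122011.com" user="" password="1234"/>
      <script><![CDATA[
        function mapAllColumns(row) {
          var cols = row.keySet().toArray();   // snapshot the keys first; we mutate the map below
          for (var i = 0; i < cols.length; i++) {
            var col = cols[i];
            row.put(col + '_camp_s_t', row.get(col));  // route every column to a dynamic field
          }
          return row;
        }
      ]]></script>
      <document name="sample">
        <entity name="adaptation" query="Select * from adap"
                transformer="script:mapAllColumns"/>
      </document>
    </dataConfig>

With this, adding or dropping columns in adap needs no data-config change; only columns that should map to a different type (int, long, multiValued) would need explicit handling.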
Re: Solr soft commit feature
The slaves will be able to replicate from the master as before, but not in NRT, depending on your commit interval. The commit interval can be set higher, as it is not needed for searches except for consolidating the index changes on the master, and can be an hour or even more. It may be easier to update the slaves directly, as the update/query performance is high (replication in the cloud in 4.0 also follows a similar paradigm, with the docs sent across as a whole to be replicated; so for now you may have to do this manually). - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org

On 2/15/2012 8:35 AM, Dipti Srivastava wrote: Hi Nagendra, Certainly interesting! Would this work in a master/slave setup where the reads are from the slaves and all writes are to the master? Regards, Dipti Srivastava

On 2/15/12 5:40 AM, Nagendra Nagarajayya nagaraja...@transaxtions.com wrote: If you are looking for NRT functionality with Solr 3.5, you may want to take a look at Solr 3.5 with RankingAlgorithm. This allows you to add/update documents without a commit while being able to search concurrently. The add/update performance to add 1m docs is about 5000 docs in about 498 ms with one concurrent searcher. You can get more information about Solr 3.5 with RankingAlgorithm from here: http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x Regards, - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org

On 2/14/2012 4:41 PM, Dipti Srivastava wrote: Hi All, Is there a way to soft commit in the current released version of Solr 3.5? Regards, Dipti Srivastava
Re: Can I rebuild an index and remove some fields?
I will test it with my big production indexes first; if it works I will port it to Java and add it to contrib, I think. On Wed, Feb 15, 2012 at 10:03 PM, Li Li fancye...@gmail.com wrote: Great. I think you could make it a public tool; maybe others also need such functionality. On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart bstewart...@gmail.com wrote: I implemented an index shrinker and it works. I reduced my test index from 6.6 GB to 3.6 GB by removing a single shingled field I did not need anymore. I'm actually using Lucene.Net for this project, so the code is C# using the Lucene.Net 2.9.2 API. But the basic idea is: Create an IndexReader wrapper that only enumerates the terms you want to keep, and that removes terms from documents when returning documents. Use the SegmentMerger to re-write each segment (where each segment is wrapped by the wrapper class), writing the new segment to a new directory. Collect the SegmentInfos and do a commit in order to create a new segments file in the new index directory. Done - you now have a shrunk index with the specified terms removed. The implementation uses a separate thread for each segment, so it re-writes them in parallel. Took about 15 minutes to do a 770,000 doc index on my macbook. On Tue, Feb 14, 2012 at 10:12 PM, Li Li fancye...@gmail.com wrote: I have roughly read the code of the 4.0 trunk; maybe it's feasible. SegmentMerger.add(IndexReader) will add the to-be-merged readers; merge() will call mergeTerms(segmentWriteState) and mergePerDoc(segmentWriteState); mergeTerms() will construct fields from the IndexReaders:

    for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
      final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
      final Fields f = r.reader.fields();
      final int maxDoc = r.reader.maxDoc();
      if (f != null) {
        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
        fields.add(f);
      }
      docBase += maxDoc;
    }

So if you wrap your IndexReader and override its fields() method, maybe it will work for merging terms. For DocValues, it can also override AtomicReader.docValues() - just return null for the fields you want to remove. Maybe it should traverse CompositeReader's getSequentialSubReaders() and wrap each AtomicReader. Other things like term vectors and norms are similar. On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com wrote: I was thinking if I make a wrapper class that aggregates another IndexReader and filters out the terms I don't want anymore it might work. And then pass that wrapper into SegmentMerger. I think if I filter out terms on GetFieldNames(...) and Terms(...) it might work. Something like:

    HashSet<string> ignoredTerms = ...;
    FilteringIndexReader wrapper = new FilteringIndexReader(reader);
    SegmentMerger merger = new SegmentMerger(writer);
    merger.add(wrapper);
    merger.Merge();

On Feb 14, 2012, at 1:49 AM, Li Li wrote: For method 2, delete is wrong - we can't delete terms. You also would have to hack the tii and tis files. On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote: Method 1, dumping data: for stored fields, you can traverse the whole index and save it somewhere else. For indexed but not stored fields, it may be more difficult. If the indexed, not-stored field is not analyzed (fields such as id), it's easy to get from FieldCache.StringIndex. But for analyzed fields, though theoretically they can be restored from term vectors and term positions, it's hard to recover from the index. Method 2, hacking the metadata: 1. indexed fields: delete by query, e.g. field:* 2. stored fields: because all fields are stored sequentially, it's not easy to delete some fields. This will not affect search speed, but if you want to get stored fields, and the useless fields are very long, then it will slow down. It's also possible to hack with it, but it needs more effort to understand the index file format and traverse the fdt/fdx files. http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html will give you some insight. On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.com wrote: Let's say I have a large index (100M docs, 1TB, split up between 10 indexes). And a bunch of the stored and indexed fields are not used in search at all. In order to save memory and disk, I'd like to rebuild that index *without* those fields, but I don't have the original documents to rebuild the entire index with (don't have the full text anymore, etc.). Is there some way to rebuild or optimize an existing index with only a subset of the existing indexed fields? Or alternatively is there a way to avoid loading some indexed fields at all (to avoid loading term infos and
Payload and exact search - 2
Hello, I already posted this question but for some reason it was attached to a thread with a different topic. Is there the possibility of performing an 'exact search' in a payload field? I have to index text with auxiliary info for each word. In particular, each word is associated with the bounding box containing it in the original PDF page (it is used for highlighting the search terms in the PDF). I used the payload to store that information. In schema.xml, the fieldType definition is:

    <fieldtype name="wppayloads" stored="false" indexed="true" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
      </analyzer>
    </fieldtype>

while the field definition is:

    <field name="words" type="wppayloads" indexed="true" stored="true" required="true" multiValued="true"/>

When indexing, the field 'words' contains a list of word|box pairs, as in the following example:

    doc_id=example
    words={Fonte:|307.62,948.16,324.62,954.25 Comune|326.29,948.16,349.07,954.25
           di|350.74,948.16,355.62,954.25 Bologna|358.95,948.16,381.28,954.25}

Such a solution works well except in the case of an exact search. For example, assuming the only indexed doc is the 'example' doc shown above, the query words:"Comune di Bologna" returns no results. Does someone know if there is the possibility of performing an 'exact search' in a payload field? Thanks in advance, Leonardo
Re: Entity with multiple datasources
I think the problem here is that you are trying to create separate documents from two different tables, while your config is aiming to create only one document. Here is one solution (not tried by me) -- you can have multiple documents generated by the same data-config:

    <dataConfig>
      <dataSource name="ds1" .../>
      <dataSource name="ds2" .../>
      <dataSource name="ds3" .../>
      <document>
        <entity blah blah rootEntity="false">
          <entity blah blah/>  <!-- this is a document; entity sets unique id -->
        </entity>
      </document>
      <document>
        blah blah  <!-- this is another document; entity sets unique id -->
      </document>
    </dataConfig>

It's the rootEntity="false" that makes the child entity a document. -- Dmitry On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev radut...@gmail.com wrote: [...] I've removed the connection params. The unique key is id. [...]
Re: Entity with multiple datasources
I'm not sure I follow. The idea is to have only one document. Do the multiple documents have the same structure then (different datasources), and if so, how are they actually indexed? Thanks. On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan dmitry@gmail.com wrote: I think the problem here is that you are trying to create separate documents from two different tables, while your config is aiming to create only one document. [...] It's the rootEntity="false" that makes the child entity a document. -- Dmitry [...]
Re: Problem indexing a PDF directory
Yes, but if I use TikaEntityProcessor the result of my full-import is:

    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">1</str>
    <str name="Total Documents Skipped">0</str>
    <str name="">Indexing failed. Rolled back all changes.</str>

2012/2/16 alessio crisantemi alessio.crisant...@gmail.com: Hi all, I have a problem configuring PDF indexing from a directory in my Solr with DIH: [...]
Re: Entity with multiple datasources
Each document in SOLR will correspond to one db record, and since both databases have the same schema, you can't index two records from two databases into the same SOLR document. So after indexing, you should have 7k different documents, each of which holds data from one db record. Also, one problem I see here is that since the record id in each table is unique only within the table and (most probably) not globally, there will be collisions. To avoid this, I would prepend the record id with some static value, like: concat('t1', CONVERT(id, CHAR(8))). Dmitry On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev radut...@gmail.com wrote: I'm not sure I follow. The idea is to have only one document. Do the multiple documents have the same structure then (different datasources), and if so, how are they actually indexed? Thanks. [...]
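A sketch of that id fix applied to the posted config — SQL Server flavored (string concatenation via +), with made-up db1_/db2_ prefixes and the long SELECT elided:

    <entity name="ms" dataSource="s"
            query="SELECT 'db1_' + CAST(m.id AS VARCHAR(16)) AS id, m.serial AS m_machine_serial, ... FROM Machine AS m ...">
    ...
    <entity name="mp" dataSource="p"
            query="SELECT 'db2_' + CAST(m.id AS VARCHAR(16)) AS id, m.serial AS m_machine_serial, ... FROM Machine AS m ...">

Two other details worth double-checking in the posted config: DIH's entity attribute is spelled dataSource (capital S), not datasource, and the delivery-date line reads filed where it presumably means field.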
Re: custom scoring
Hello Carlos, could you show us what your Solr call looks like? Regards, Em On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote: Hello all: We'd like to score the matching documents using a combination of SOLR's IR score with another application-specific score that we store within the documents themselves (i.e., a float field containing the app-specific score). [...]
Re: Entity with multiple datasources
Really good point on the ids, I completely overlooked that matter. I will give it a try. Thanks again. On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan dmitry@gmail.com wrote: Each document in SOLR will correspond to one db record, and since both databases have the same schema, you can't index two records from two databases into the same SOLR document. So after indexing, you should have 7k different documents, each of which holds data from one db record. Also, one problem I see here is that since the record id in each table is unique only within the table and (most probably) not globally, there will be collisions. To avoid this, I would prepend the record id with some static value, like: concat('t1', CONVERT(id, CHAR(8))). Dmitry [...]
Re: Entity with multiple datasources
no problem, hope it helps, you're welcome. On Thu, Feb 16, 2012 at 5:03 PM, Radu Toev radut...@gmail.com wrote: Really good point on the ids, I completely overlooked that matter. I will give it a try. Thanks again. On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan dmitry@gmail.com wrote: [rest of quoted thread snipped; identical to the message above]
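For readers who hit the same duplicate-id collision: Dmitry's prefixing trick can also be done on the Solr side with DIH's TemplateTransformer instead of SQL concat. A minimal, untested sketch (connection details and most columns omitted; the db1-/db2- prefixes are made up for the example):

<dataConfig>
  <dataSource name="s" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="..." user="..." password="..."/>
  <dataSource name="p" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="..." user="..." password="..."/>
  <document>
    <!-- two root entities: each row from each database becomes its own document -->
    <entity name="ms" dataSource="s" transformer="TemplateTransformer" query="SELECT id, serial AS m_machine_serial FROM Machine">
      <!-- static prefix keeps ids from the two databases from colliding -->
      <field column="id" template="db1-${ms.id}"/>
      <field column="m_machine_serial"/>
    </entity>
    <entity name="mp" dataSource="p" transformer="TemplateTransformer" query="SELECT id, serial AS m_machine_serial FROM Machine">
      <field column="id" template="db2-${mp.id}"/>
      <field column="m_machine_serial"/>
    </entity>
  </document>
</dataConfig>

With that in place the index should end up with the expected ~7k documents, since no document from the second database can overwrite one from the first.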
Frequent garbage collections after a day of operation
Hey everyone, we're running into some operational problems with our SOLR production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla SOLR 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from SOLR, up to 3x the average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more like a brute-force solution. The thing is, we don't know what's causing this, and we don't have that much experience with Java stacks since we're for the most part a Rails company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else seeing this, or can you think of a reason for this? Most of our queries to SOLR involve the DismaxHandler and the spatial search query components. We don't use any custom request handlers so far. Thanks in advance, -Matthias -- Matthias Käppler Lead Developer API Mobile Qype GmbH Großer Burstah 50-52 20457 Hamburg Telephone: +49 (0)40 - 219 019 2 - 160 Skype: m_kaeppler Email: matth...@qype.com
RE: PatternReplaceFilterFactory group
Hi O., PatternReplaceFilter(Factory) uses Matcher.replaceAll() or replaceFirst(), both of which take in a string that can include any or all groups using the syntax $n, where n is the group number. See the Matcher.appendReplacement() javadocs for an explanation of the functionality and syntax: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement%28java.lang.StringBuffer,%20java.lang.String%29 Steve -----Original Message----- From: O. Klein [mailto:kl...@octoweb.nl] Sent: Thursday, February 16, 2012 8:34 AM To: solr-user@lucene.apache.org Subject: PatternReplaceFilterFactory group PatternReplaceFilterFactory has no option to select the group to replace. Is there a reason for this, or could this be a nice feature?
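To make the $n syntax concrete, here is a sketch of a filter that keeps only the first capture group; the pattern and the field type name are invented for illustration:

<fieldType name="text_grouped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "abc-123" becomes "abc": $1 refers to the first parenthesized group -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\w+)-(\d+)" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

Setting replacement="$2" would keep the digits instead, so the group to keep is selected through the replacement string rather than a dedicated option.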
Re: custom scoring
Hello Em: The URL is quite large (w/ shards, ...), maybe it's best if I paste the relevant parts. Our q parameter is: q=_val_:"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))" The subqueries q8, q7, q4 and q3 are regular queries, for example: q7=stopword_phrase:colomba~1 AND stopword_phrase:santa AND wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR (stopword_phrase:las AND stopword_phrase:de) We've executed the subqueries q3-q8 independently and they're very fast, but when we introduce the function queries as described below, it all goes 10X slower. Let me know if you need anything else. Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Thu, Feb 16, 2012 at 4:02 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, could you show us how your Solr call looks? Regards, Em Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas: Hello all: We'd like to score the matching documents using a combination of SOLR's IR score with another application-specific score that we store within the documents themselves (i.e. a float field containing the app-specific score). In particular, we'd like to calculate the final score doing some operations with both numbers (i.e. product, sqrt, ...) According to what we know, there are two ways to do this in SOLR: A) Sort by function [1]: We've tested an expression like sort=product(score, query_score) in the SOLR query, where score is the common SOLR IR score and query_score is our own precalculated score, but it seems that SOLR can only do this with stored/indexed fields (and obviously score is not stored/indexed). B) Function queries: We've used _val_ and function queries like max, sqrt and query, and we've obtained the desired results from a functional point of view. However, our index is quite large (400M documents) and the performance degrades heavily, given that function queries are AFAIK matching all the documents. I have two questions: 1) Apart from the two options I mentioned, is there any other (simple) way to achieve this that we're not aware of? 2) If we have to choose the function queries path, would it be very difficult to modify the actual implementation so that it doesn't match all the documents, that is, to pass a query so that it only operates over the documents matching the query? Looking at the FunctionQuery.java source code, there's a comment that says // instead of matching all docs, we could also embed a query. the score could either ignore the subscore, or boost it, which is giving us some hope that maybe it's possible and even desirable to go in this direction. If you can give us some directions about how to go about this, we may be able to do the actual implementation. BTW, we're using Lucene/SOLR trunk. Thanks a lot for your help. Carlos [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
Re: How to loop through the DataImportHandler query results?
If your script turns out too complex to maintain, and you are developing in Java anyway, you could extend EntityProcessor and handle the data in a custom way. I've done that to transform a datamart-like data structure back into a row-based one. Basically you override the method that gets the data in a Map and transform it into a different Map which contains the fields as understood by your schema. Chantal On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote: Hi Baranee, Some time ago I played with http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was pretty good stuff. Regards On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan baraneethara...@hp.com wrote: To avoid that, we don't want to mention the column names in the field tag, but want to write a query to map all the fields in the table to Solr fields even if we don't know how many columns are there in the table. I need a kind of loop which runs through all the query results and maps them to Solr fields.
Re: Frequent garbage collections after a day of operation
Make sure your Tomcat instances are started each with a max heap size that adds up to something a lot lower than the complete RAM of your system. Frequent garbage collection means that your applications request more RAM but your Java VM has no more resources, so it requires the garbage collector to free memory so that the requested new objects can be created. It's not indicating a memory leak, unless you are running a custom EntityProcessor in DIH that runs into an infinite loop and creates huge amounts of schema fields. ;-) Also, if you are doing hot deploys on Tomcat, you will have to restart the Tomcat instance on a regular basis, as hot deploys DO leak memory after a while. (You might be seeing class undeploy messages in catalina.out and later on OutOfMemory error messages.) If this is not of any help, you will probably have to provide a bit more information on your Tomcat and SOLR configuration setup. Chantal On Thu, 2012-02-16 at 16:22 +0100, Matthias Käppler wrote: Hey everyone, we're running into some operational problems with our SOLR production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla SOLR 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from SOLR, up to 3x the average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more like a brute-force solution. [rest of quoted message snipped]
RE: PatternReplaceFilterFactory group
steve_rowe wrote: [quoted reply snipped; see Steve's message above] Thanks. I should get it working then.
Re: is it possible to run deltaimport command without delta query?
On 2/15/2012 11:26 PM, nagarjuna wrote: hi all.. i am new to solr. can anybody explain to me the delta-import and delta query? also i have the below questions: 1. is it possible to run deltaimport without deltaQuery? 2. is it possible to write a delta query without having a last_modified column in the database? if yes, please explain Assuming I understand what you're asking: Define deltaImportQuery to be the same as query, then set deltaQuery to something that always returns some kind of value in the field you have designated as your primary key. The data doesn't have to be relevant to anything at all, it just needs to return something for the primary key field. Here's what I have in mine, my pk is did: deltaQuery="SELECT 1 AS did" If you wish, you can completely ignore lastModified and track your own information about what data is new, then pass parameters via the dataimport handler URL to be used in your queries. This is what both my query and deltaImportQuery are set to: SELECT * FROM ${dataimporter.request.dataView} WHERE ( ( did > ${dataimporter.request.minDid} AND did <= ${dataimporter.request.maxDid} ) ${dataimporter.request.extraWhere} ) AND (crc32(did) % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal}) Thanks, Shawn
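Reduced to its essentials, the arrangement Shawn describes looks roughly like this in data-config.xml (table and request parameter names are illustrative, not a drop-in config):

<entity name="item" pk="did"
        query="SELECT * FROM ${dataimporter.request.dataView} WHERE did &gt; ${dataimporter.request.minDid} AND did &lt;= ${dataimporter.request.maxDid}"
        deltaQuery="SELECT 1 AS did"
        deltaImportQuery="SELECT * FROM ${dataimporter.request.dataView} WHERE did &gt; ${dataimporter.request.minDid} AND did &lt;= ${dataimporter.request.maxDid}"/>

It would then be invoked with something like /solr/dataimport?command=delta-import&dataView=itemView&minDid=1000&maxDid=2000, so the delta is driven entirely by the request parameters rather than by a last_modified column.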
Re: problem to indexing pdf directory
here the log:

org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1
at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:117)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
feb 12, 2012 7:06:02 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={clean=false&commit=true&command=full-import&qt=/dataimport} status=0 QTime=16
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1
[identical stack trace and rollback as above snipped]
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler [http-bio-8983]
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
INFO: Pausing ProtocolHandler [ajp-bio-8009]
feb 12, 2012 7:06:42 PM org.apache.catalina.core.StandardService stopInternal
INFO: Stopping service Catalina
feb 12, 2012 7:06:42 PM org.apache.solr.core.SolrCore close
INFO: [] CLOSING SolrCore org.apache.solr.core.SolrCore@7d1217
[remaining Tomcat shutdown log with searcher close and cache statistics snipped]
Re: problem to indexing pdf directory
On 16 February 2012 21:37, alessio crisantemi alessio.crisant...@gmail.com wrote: here the log: org.apache.solr.handler.dataimport.DataImporter doFullImport SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1 [...] The exception message above is pretty clear. You need to define a baseDir attribute for the second entity. However, even if you fix this, the setup will *not* work for indexing PDFs. Did you read the URLs that I sent earlier? Regards, Gora
Re: Setting solrj server connection timeout
On 2/3/2012 1:12 PM, Shawn Heisey wrote: Is the following a reasonable approach to setting a connection timeout with SolrJ? queryCore.getHttpClient().getHttpConnectionManager().getParams().setConnectionTimeout(15000); Right now I have all my solr server objects sharing a single HttpClient that gets created using the multithreaded connection manager, where I set the timeout for all of them. Now I will be letting each server object create its own HttpClient object, and using the above statement to set the timeout on each one individually. It'll use up a bunch more memory, as there are 56 server objects, but maybe it'll work better. The total of 56 objects comes about from 7 shards, a build core and a live core per shard, two complete index chains, and for each of those, one server object for access to CoreAdmin and another for the index. The impetus for this, as it's possible I'm stating an XY problem: currently I have an occasional problem where SolrJ connections throw an exception. When it happens, nothing is logged in Solr. My code is smart enough to notice the problem, send an email alert, and simply try again at the top of the next minute. The simple explanation is that this is a Linux networking problem, but I never had any problem like this when I was using Perl with LWP to keep my index up to date. I sent a message to the list some time ago on this exception, but I never got a response that helped me figure it out.

Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)
at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)
... 3 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
... 7 more

No response in quite some time, so I'm bringing it up again. I brought up the Exception issue before, and though I did get some responses, I didn't feel that I got an answer.
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201112.mbox/%3c4eeaf6e5.9030...@elyograg.org%3E Thanks, Shawn
Re: problem to indexing pdf directory
Yes, I read it. But I don't know the cause. What's more, I work on Windows, so I configured Tika and Solr manually because I don't have Maven... 2012/2/16 Gora Mohanty g...@mimirtech.com On 16 February 2012 21:37, alessio crisantemi alessio.crisant...@gmail.com wrote: here the log: org.apache.solr.handler.dataimport.DataImporter doFullImport SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1 [...] The exception message above is pretty clear. You need to define a baseDir attribute for the second entity. However, even if you fix this, the setup will *not* work for indexing PDFs. Did you read the URLs that I sent earlier? Regards, Gora
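For reference, the shape of a config that usually works for this job pairs FileListEntityProcessor (which only lists files) with a nested TikaEntityProcessor (from the dataimporthandler-extras contrib) that actually parses each PDF. A sketch only, with the poster's paths kept and everything else to be checked against your Solr/Tika version:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <!-- the inner entity receives each file path and extracts text and metadata via Tika -->
      <entity name="doc" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The key differences from the failing config: the inner entity uses TikaEntityProcessor rather than a second FileListEntityProcessor, it points at the outer entity's ${files.fileAbsolutePath} rather than a directory, and only the outer entity needs baseDir.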
Re: Do we need reindexing from solr 1.4.1 to 3.5.0?
There may be issues with your solrconfig. Kindly post the exception that you are receiving.
RE: is it possible to run deltaimport command without delta query?
There is a good example on the wiki of how to do a delta update using command=full-import&clean=false, here: http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta This can be advantageous if you are updating a ton of data at once and do not want it executing as many queries against the database. It also can be easier to maintain just one set of queries for both full and delta imports. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -----Original Message----- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Thursday, February 16, 2012 10:04 AM To: solr-user@lucene.apache.org Subject: Re: is it possible to run deltaimport command without delta query? [quoted reply snipped; see Shawn's message above]
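Putting the two replies together, a full import that behaves like a delta would be triggered with a request along these lines (host and parameter values hypothetical):

http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true&minDid=1000&maxDid=2000

clean=false keeps the existing documents instead of wiping the index, and the extra parameters feed the ${dataimporter.request.*} placeholders in queries like Shawn's earlier in the thread.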
Re: Best requestHandler for typing error.
You can enable the spellcheck component and add it to your default request handler. This might be of use: http://wiki.apache.org/solr/SpellCheckComponent It can be used both during autosuggest and for did you mean.
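A minimal solrconfig.xml sketch of that setup for the did-you-mean case; the component wiring follows the wiki page, but the field name spell is an assumption that must exist in your schema:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- field the dictionary is built from -->
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

A query for goolge would then come back with google as a suggestion (assuming it occurs in the spell field), and spellcheck.collate produces a ready-to-run corrected query.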
Solr edismax clarification
Hi All, I am using the edismax SearchHandler in my search and I have some issues with the search results. As I understand it, if the defaultOperator is set to OR, the search query The quick brown fox will be passed implicitly as The OR quick OR brown OR fox. However, if I search for The quick brown fox, I get fewer results than when explicitly adding the OR. Another issue is that if I search for The quick brown fox, other documents that contain the word fox are not in the search results. Thanks.
copyField: multivalued field to joined singlevalue field
Hello, I want to copy all data from a multivalued field, joined together, into a single-valued field. Is there any way to do this using standard Solr features? kind regards
Re: copyField: multivalued field to joined singlevalue field
On Thu, Feb 16, 2012 at 11:35 AM, flyingeagle-de flyingeagle...@yahoo.de wrote: Hello, I want to copy all data from a multivalued field joined together in a single valued field. Is there any opportunity to do this by using solr-standards? There is not currently, but it certainly makes sense. Anyone know of an open issue for this yet? If not, we should create one! -Yonik lucidimagination.com
Distributed Faceting Bug?
I am attempting to execute a query with the following parameters: q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=10 When doing this I get the following exception:

null java.lang.ArrayIndexOutOfBoundsException request: http://hostname:8983/solr/select
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

if I play with some of the parameters the query works as expected, i.e. q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=0 or q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=count rows=10 or q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.sort=index rows=10 I am running on an old snapshot of Solr, but will try this on a new version relatively soon. Unfortunately I can't duplicate locally so I'm a bit baffled by the error. All of the shards have the field which we are faceting on.
Re: custom scoring
Hello Carlos, well, you must take into account that you are executing up to 8 queries per request instead of one query per request. I am not totally sure about the details of the implementation of the max function query, but I guess it first iterates over the results of the first max-query, afterwards over the results of the second max-query, and so on. This is a much higher complexity than in the case of a normal query. I would suggest optimizing your request. I don't think that this particular function query is matching *all* docs. Instead I think it just matches those docs specified by your inner query (although I might be wrong about that). What are you trying to achieve with your request? Regards, Em Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas: Hello Em: The URL is quite large (w/ shards, ...), maybe it's best if I paste the relevant parts. Our q parameter is: q=_val_:"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))" The subqueries q8, q7, q4 and q3 are regular queries, for example: q7=stopword_phrase:colomba~1 AND stopword_phrase:santa AND wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR (stopword_phrase:las AND stopword_phrase:de) We've executed the subqueries q3-q8 independently and they're very fast, but when we introduce the function queries as described below, it all goes 10X slower. Let me know if you need anything else. Thanks Carlos [rest of quoted thread snipped]
Re: Distributed Faceting Bug?
please ignore this, it has nothing to do with the faceting component. I was able to disable a custom component that I had and it worked perfectly fine. On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote: [original message quoted in full, snipped]
Re: Distributed Faceting Bug?
Hi Jamie, what version of Solr/SolrJ are you using? Regards, Em Am 16.02.2012 18:42, schrieb Jamie Johnson: I am attempting to execute a query with the following parameters: q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=10 [rest of quoted message snipped]
Re: Distributed Faceting Bug?
Hi Jamie, nice to hear. Maybe you can share what kind of bug you ran into, so that other developers with similarly buggy components can benefit from your experience. :) Regards, Em Am 16.02.2012 19:23, schrieb Jamie Johnson: please ignore this, it has nothing to do with the faceting component. I was able to disable a custom component that I had and it worked perfectly fine. [rest of quoted thread snipped]
Re: How to loop through the DataImportHandler query results?
Chantal, if you prefer Java, here is http://wiki.apache.org/solr/DIHCustomTransform On Thu, Feb 16, 2012 at 7:24 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: If your script turns out too complex to maintain, and you are developing in Java anyway, you could extend EntityProcessor and handle the data in a custom way. I've done that to transform a datamart-like data structure back into a row-based one. Basically you override the method that gets the data in a Map and transform it into a different Map which contains the fields as understood by your schema. Chantal [earlier quoted thread snipped] -- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
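For the looping part of the original question, the ScriptTransformer route looks roughly like this: the function receives each row as a java.util.Map, so the columns can be iterated without being declared. A sketch under the assumption that a catch-all dynamicField exists in schema.xml to receive the generated names:

<dataConfig>
  <script><![CDATA[
    function mapAllColumns(row) {
      // snapshot the column names, then rewrite each one to a lowercased field name
      var keys = row.keySet().toArray();
      for (var i = 0; i < keys.length; i++) {
        row.put(keys[i].toLowerCase(), row.get(keys[i]));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="t" query="SELECT * FROM some_table" transformer="script:mapAllColumns"/>
  </document>
</dataConfig>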
Re: custom scoring
Hello Em: Thanks for your answer. Yes, we initially also thought that the excessive increase in response time was caused by the several queries being executed, and we did another test. We executed one of the subqueries that I've shown to you directly in the q parameter, and then we tested this same subquery (only this one, without the others) with the function query query($q1) in the q parameter. Theoretically the times for these two queries should be more or less the same, but the second one is several times slower than the first one. After this observation we learned more about function queries, and we learned from the code and from some comments in the forums [1] that FunctionQueries are expected to match all documents. We have some more tests on that matter: now we're moving from issuing this large query through the SOLR interface to creating our own QueryParser. The initial tests we've done in our QParser (that internally creates multiple queries and inserts them inside a DisjunctionMaxQuery) are very good: we're getting very good response times and high-quality answers. But when we've tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a QueryValueSource that wraps the DisMaxQuery), the times move from 10-20 msec to 200-300 msec. Note that we're using early termination of queries (via a custom collector), and therefore (as shown by the numbers I included above) even if the query is very complex, we're getting very fast answers. The only situation where the response time explodes is when we include a FunctionQuery. Re: your question of what we're trying to achieve: we're implementing a powerful query autocomplete system, and we use several fields to a) improve performance on wildcard queries and b) have very precise control over the score. Thanks a lot for your help, Carlos [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0 Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Thu, Feb 16, 2012 at 7:09 PM, Em mailformailingli...@yahoo.de wrote: [quoted reply snipped]
Re: custom scoring
Hello Carlos, We have some more tests on that matter: now we're moving from issuing this large query through the SOLR interface to creating our own QueryParser. The initial tests we've done in our QParser (that internally creates multiple queries and inserts them inside a DisjunctionMaxQuery) are very good, we're getting very good response times and high quality answers. But when we've tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a QueryValueSource that wraps the DisMaxQuery), then the times move from 10-20 msec to 200-300 msec. I reviewed the sourcecode and yes, the FunctionQuery iterates over the whole index, however... let's see! In relation to the DisMaxQuery you create within your parser: What kind of clause is the FunctionQuery and what kind of clause are your other queries (MUST, SHOULD, MUST_NOT...)? *I* would expect that with a shrinking set of matching documents to the overall query, the function query only checks those documents that are guaranteed to be within the result set. Note that we're using early termination of queries (via a custom collector), and therefore (as shown by the numbers I included above) even if the query is very complex, we're getting very fast answers. The only situation where the response time explodes is when we include a FunctionQuery. Could you give us some details about how/where you plugged in the Collector, please? Kind regards, Em Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas: [quoted message snipped]
Re: custom scoring
Hello Em: 1) Here's a printout of an example DisMax query (as you can see mostly MUST terms except for some SHOULD terms used for boosting scores for stopwords) * * *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short ened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw ord_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))* * * 2)* *The collector is inserted in the SolrIndexSearcher (replacing the TimeLimitingCollector). We trigger it through the SOLR interface by passing the timeAllowed parameter. We know this is a hack but AFAIK there's no out-of-the-box way to specify custom collectors by now ( https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector part works perfectly as of now, so clearly this is not the problem. 3) Re: your sentence: * * **I* would expect that with a shrinking set of matching documents to the overall-query, the function query only checks those documents that are guaranteed to be within the result set.* * * Yes, I agree with this, but this snippet of code in FunctionQuery.java seems to say otherwise: // instead of matching all docs, we could also embed a query. // the score could either ignore the subscore, or boost it. // Containment: floatline(foo:myTerm, myFloatField, 1.0, 0.0f) // Boost:foo:myTerm^floatline(myFloatField,1.0,0.0f) @Override public int nextDoc() throws IOException { for(;;) { ++doc; if (doc=maxDoc) { return doc=NO_MORE_DOCS; } if (acceptDocs != null !acceptDocs.get(doc)) continue; return doc; } } It seems that the author also thought of maybe embedding a query in order to restrict matches, but this doesn't seem to be in place as of now (or maybe I'm not understanding how the whole thing works :) ). Thanks Carlos * * Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Thu, Feb 16, 2012 at 8:09 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, We have some more tests on that matter: now we're moving from issuing this large query through the SOLR interface to creating our own QueryParser. The initial tests we've done in our QParser (that internally creates multiple queries and inserts them inside a DisjunctionMaxQuery) are very good, we're getting very good response times and high quality answers. But when we've tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a QueryValueSource that wraps the DisMaxQuery), then the times move from 10-20 msec to 200-300msec. I reviewed the sourcecode and yes, the FunctionQuery iterates over the whole index, however... let's see! In relation to the DisMaxQuery you create within your parser: What kind of clause is the FunctionQuery and what kind of clause are your other queries (MUST, SHOULD, MUST_NOT...)? 
*I* would expect that with a shrinking set of matching documents to the overall-query, the function query only checks those documents that are guaranteed to be within the result set. Note that we're using early termination of queries (via a custom collector), and therefore (as shown by the numbers I included above) even if the query is very complex, we're getting very fast answers. The only situation where the response time explodes is when we include a FunctionQuery. Could you give us some details about how/where you plugged in the Collector, please? Kind regards, Em Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas: Hello Em: Thanks for your answer. Yes, we initially also thought that the excessive increase in response time was caused by the several queries being executed, and we did another test. We executed one of the subqueries that I've shown to you directly in the q parameter, and then we tested this same subquery (only this one, without the others) with the function query query($q1) in the q parameter. Theoretically the times for these two queries should be more or less the same, but the second one is several times slower than the first one. After this observation we learned more about function queries, and we learned from the code and from some comments in the forums [1] that FunctionQueries are expected to match all documents. We
Re: copyField: multivalued field to joined singlevalue field
: I want to copy all data from a multivalued field joined together in a single : valued field. : : Is there any opportunity to do this by using solr-standards? : : There is not currently, but it certainly makes sense. Part of it has just recently been committed to trunk, actually... https://issues.apache.org/jira/browse/SOLR-2802 https://builds.apache.org/job/Solr-trunk/javadoc/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html ...with that, it's easy to say "anytime multiple values are found for a single-valued string field, join them together with a comma." The only piece that's missing is to copy from a source field in an (earlier) UpdateProcessor. There's a patch for this in SOLR-2599 but I haven't had a chance to look at it yet. -Hoss
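For illustration, a minimal sketch of how that trunk-only processor might be wired up in solrconfig.xml (the chain name and target field are placeholders; the delimiter param is the one documented in the javadoc linked above):

<updateRequestProcessorChain name="concat-single-valued">
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">description</str>
    <str name="delimiter">, </str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

As Hoss notes, the copy-from-a-source-field step would still have to happen in an earlier processor once SOLR-2599 lands.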
Re: Specify a cores roles through core add command
https://issues.apache.org/jira/browse/SOLR-3138 On Feb 9, 2012, at 4:16 PM, Jamie Johnson wrote: Per SOLR-2765 we can add roles to specific cores such that it's possible to give custom roles to Solr instances. Is it possible to specify this when adding a core through curl 'http://host:port/solr/admin/cores...'? https://issues.apache.org/jira/browse/SOLR-2765 - Mark Miller lucidimagination.com
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
On Thu, Feb 16, 2012 at 3:37 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Everybody starts with a daily bounce, but ends up with an UPDATED_AT column and delta updates; just consider the urgent content fix use case. I don't think it's worth relying on a daily bounce as a cornerstone of the architecture. I'd be happy to avoid it, for all the obvious reasons. I do know that performance of this type of service tends to be not that great (as in 700 to 5000 msec), and there should be ways to do it several times faster than this. you can use a grid of coordinates to reduce their entropy I don't understand this statement. Can you elaborate, please? Since my bounding boxes are small, one [premature optimization] idea could be to divide Earth into 2x2 degree overlapping tiles at a 1 degree step in both directions (such that any bounding box fits within at least one of them, and any location belongs to 4 of them), then use tileId=X as a cached filter and geofilt as a post-filter. Is that along the lines of what you are talking about? http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/ from what little I understand about Lucene internals, caching of filters probably doesn't make sense either. But Solr does it: http://wiki.apache.org/solr/SolrCaching#filterCache I didn't realize that multiple fq's in the same query were applied in parallel as set intersections. In that case, the non-geography filters should be cached (and added to the prewarming routine, I guess) even when they are usually far less specific than the bounding box. Makes sense. 1. Search server is an internal service that uses embedded Solr for the indexing part. RAMDirectoryFactory as index storage. Bad idea. It's intended mostly for tests; the closest production analogue is org.apache.lucene.store.instantiated.InstantiatedIndex ... AFAIK the state of the art is to use a file-based directory (MMap or whatever) and rely on the Linux file system cache. OK, I may as well start the spike from this angle, too. By the way, this is precisely the kind of advice I was hoping for. Thanks a lot. 5. All Solr caching is switched off. But why? Because (a) I shouldn't need to cache documents if they are all in memory anyway; (b) query caching will have an abysmal hit/miss ratio because of the spatial component; and (c) I misunderstood how query filters work. So now I'm thinking of a FastLFU query filter cache for non-geo filters. Btw, if you need a multivalued geo field, please vote for SOLR-2155 Our data has one lon/lat pair per entity... so no, I don't need it. Or at least haven't figured out that I do yet. :) -- Alexey Verkhovsky http://alex-verkhovsky.blogspot.com/ CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]
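For concreteness, the tiling idea above might translate into filters like these (field names are hypothetical; the cache/cost post-filter syntax is the one from the blog post linked above):

fq=tileId:41_2
fq={!geofilt sfield=coords pt=41.39,2.16 d=5 cache=false cost=200}

The tileId filter is cheap and cacheable, while cache=false with cost=200 makes geofilt run as a post-filter over only the documents that survive the other clauses.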
Re: Distributed Faceting Bug?
still digging ;) Once I figure it out I'll be happy to share. On Thu, Feb 16, 2012 at 1:32 PM, Em mailformailingli...@yahoo.de wrote: Hi Jamie, nice to hear. Maybe you can share what kind of bug you ran into, so that other developers with similar buggy components can benefit from your experience. :) Regards, Em Am 16.02.2012 19:23, schrieb Jamie Johnson: please ignore this, it has nothing to do with the faceting component. I was able to disable a custom component that I had and it worked perfectly fine. On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote: I am attempting to execute a query with the following parameters: q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=10 When doing this I get the following exception:

null java.lang.ArrayIndexOutOfBoundsException request: http://hostname:8983/solr/select
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
  at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
  at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
  at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)

If I play with some of the parameters the query works as expected, i.e.:
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=0
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=count rows=10
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.sort=index rows=10
I am running on an old snapshot of Solr, but will try this on a new version relatively soon. Unfortunately I can't duplicate it locally, so I'm a bit baffled by the error. All of the shards have the field which we are faceting on
Re: SolrCloud - issues running with embedded zookeeper ensemble
I have the same problem. It seems that there is a bug in the SolrZkServer class (parseProperties method) that doesn't work well when you have an external zookeeper ensemble. Thanks, arin
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
On Thu, Feb 16, 2012 at 3:03 PM, Alexey Verkhovsky alexey.verkhov...@gmail.com wrote: 5. All Solr caching is switched off. But why? Because (a) I shouldn't need to cache documents, if they are all in memory anyway; You're making many assumptions about how Solr works internally. One example of many: Solr streams documents (requests the stored fields right before they are written to the response stream) to support returning any number of documents. If you highlight documents, the stored fields need to be retrieved first. When streaming those same documents later, Solr will retrieve the stored fields again, relying on the fact that they should be cached by the document cache since they were just used. There are tons of examples of how things are architected to take advantage of the caches - it pretty much never makes sense to outright disable them. If they take up too much memory, then just reduce the size. -Yonik lucidimagination.com
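In solrconfig.xml that reduction is a one-line change; a sketch with illustrative sizes (a few hundred entries is enough to avoid re-fetching stored fields within a single request, per the streaming behavior described above):

<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>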
Re: custom scoring
: We'd like to score the matching documents using a combination of SOLR's IR : score with another application-specific score that we store within the : documents themselves (i.e. a float field containing the app-specific : score). In particular, we'd like to calculate the final score doing some : operations with both numbers (i.e. product, sqrt, ...) let's back up a minute. if your ultimate goal is to have the final score of all documents be a simple multiplication of an indexed field (query_score) against the score of your base query, that's a fairly trivial use of the BoostQParser... q={!boost f=query_score}your base query ...or to split it out using param dereferencing... q={!boost f=query_score v=$qq}&qq=your base query : A) Sort by function [1]: We've tested an expression like : sort=product(score, query_score) in the SOLR query, where score is the : common SOLR IR score and query_score is our own precalculated score, but it : seems that SOLR can only do this with stored/indexed fields (and obviously : score is not stored/indexed). you could do this by replacing score with the query whose score you want, which could be a ref back to $q -- but that's really only needed if you want the scores returned for each document to be different than the value used for sorting (ie: score comes from solr, sort value includes your query_score and the score from the main query -- or some completely different query). based on what you've said, you don't need that and it would be unnecessary overhead. : B) Function queries: We've used _val_ and function queries like max, sqrt : and query, and we've obtained the desired results from a functional point : of view. However, our index is quite large (400M documents) and the : performance degrades heavily, given that function queries are AFAIK : matching all the documents. based on the examples you've given in your subsequent queries, it's not hard to see why... q=_val_:"product(query_score,max(query($q8),max(query($q7),... wrapping queries in functions in queries can have that effect, because functions ultimately match all documents -- even when that function wraps a query -- so your outermost query is still scoring every document in the index. you want to do as much pruning with the query as possible, and only multiply by your boost function on matching docs, hence the purpose of the BoostQParser. -Hoss
Re: Distributed Faceting Bug?
The issue appears to be that I put an empty array into the doc scores instead of null in DocSlice. DocSlice then just checks whether scores is null when hasScores is called, which caused a further issue down the line. I'll follow up with anything else that I find along the way. On Thu, Feb 16, 2012 at 3:05 PM, Jamie Johnson jej2...@gmail.com wrote: still digging ;) Once I figure it out I'll be happy to share. On Thu, Feb 16, 2012 at 1:32 PM, Em mailformailingli...@yahoo.de wrote: Hi Jamie, nice to hear. Maybe you can share what kind of bug you ran into, so that other developers with similar buggy components can benefit from your experience. :) Regards, Em Am 16.02.2012 19:23, schrieb Jamie Johnson: please ignore this, it has nothing to do with the faceting component. I was able to disable a custom component that I had and it worked perfectly fine. On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson jej2...@gmail.com wrote: I am attempting to execute a query with the following parameters: q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=10 When doing this I get the following exception:

null java.lang.ArrayIndexOutOfBoundsException request: http://hostname:8983/solr/select
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
  at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
  at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
  at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)

If I play with some of the parameters the query works as expected, i.e.:
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=index rows=0
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.limit=10 f.manu.facet.sort=count rows=10
q=*:* distrib=true facet=true facet.limit=10 facet.field=manu f.manu.facet.mincount=1 f.manu.facet.sort=index rows=10
I am running on an old snapshot of Solr, but will try this on a new version relatively soon. Unfortunately I can't duplicate it locally, so I'm a bit baffled by the error. All of the shards have the field which we are faceting on
Re: custom scoring
On Thu, Feb 16, 2012 at 8:34 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello all: We'd like to score the matching documents using a combination of SOLR's IR score with another application-specific score that we store within the documents themselves (i.e. a float field containing the app-specific score). In particular, we'd like to calculate the final score doing some operations with both numbers (i.e. product, sqrt, ...) ... 1) Apart from the two options I mentioned, is there any other (simple) way to achieve this that we're not aware of? In general there is always a third option, that may or may not fit, depending really upon how you are trying to model relevance and how you want to integrate with scoring, and that's to tie your factors directly into Similarity (Lucene's term weighting API). For example, some people use index-time boosting, but in Lucene an index-time boost really just means 'make the document appear shorter'. You might, for example, have other boosts that modify term frequency before normalization, or however you want to do it. Similarity is pluggable into Solr via schema.xml. Since you are using trunk, this is a lot more flexible than in previous releases; e.g. you can access things from FieldCache, DocValues, or even your own rapidly-changing float[] or whatever you want :) There are also a lot more predefined models than just the vector space model to work with, if you find you can easily imagine your notion of relevance in terms of an existing model. -- lucidimagination.com
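For reference, plugging a custom Similarity into Solr is a single element at the top level of schema.xml (the class name here is a made-up placeholder):

<similarity class="com.example.MyAppSimilarity"/>

The class itself would extend DefaultSimilarity (or, on trunk, one of the other scoring models Robert mentions) and override the factors into which the application score should be folded.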
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
On Thu, Feb 16, 2012 at 1:32 PM, Yonik Seeley yo...@lucidimagination.comwrote: Your're making many assumptions about how Solr works internally. True that. If this spike turns into a project, digging through the source code will come. Meantime, we have to start somewhere, and the default configuration may not be the greatest starting point for this problem. We don't need highlighting, and only need ids, scores and total number of results out of Solr. Presentation of selected entities will have to include some write-heavy data (from RDBMS and/or memcached), therefore won't be Solr's business anyway. From what you said, I guess it won't hurt to give it a small document cache, just big enough to prevent streaming the same document twice within the same query. Still don't have a reason to have a query cache - because of lon/lat coming from the mobile devices, there are virtually no repeated queries in our production logs. Or am I making a bad assumption here, too? -- Alexey Verkhovsky http://alex-verkhovsky.blogspot.com/ CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]
Re: Using Solr for a rather busy Yellow Pages-type index - good idea or not really?
On Thu, Feb 16, 2012 at 4:06 PM, Alexey Verkhovsky alexey.verkhov...@gmail.com wrote: only need ids, scores and total number of results out of Solr. Presentation of selected entities will have to include some write-heavy data (from RDBMS and/or memcached), therefore won't be Solr's business anyway. It depends on whether you're going to be doing distributed search - there may be some scenarios there where it's used, but in general the query cache is the least useful. The filterCache is useful in a ton of ways if you're doing faceting too. -Yonik lucidimagination.com
Re: SolrCloud - issues running with embedded zookeeper ensemble
On Feb 16, 2012, at 2:53 PM, arin g wrote: i have the same problem, it seems that there is a bug in SolrZkServer class (parseProperties method), that doesn't work well when you have an external zookeeper ensemble. This issue was around using an embedded ensemble - an external ensemble makes SolrZkServer irrelevant. What issue are you having? I just tried a basic test against an external ensemble. What version are you using? Thanks, arin - Mark Miller lucidimagination.com
Re: custom scoring
Hello Carlos, I think we misunderstood each other. As an example:

BooleanQuery (
  clauses: (
    MustMatch(
      DisjunctionMaxQuery(
        TermQuery(stopword_field, barcelona),
        TermQuery(stopword_field, hoteles)
      )
    ),
    ShouldMatch(
      FunctionQuery( *please insert your function here* )
    )
  )
)

Explanation: You construct an artificial BooleanQuery which wraps your user's query as well as your function query. Your user's query - in that case - is just a DisjunctionMaxQuery consisting of two TermQueries. In the real world you might construct another BooleanQuery around your DisjunctionMaxQuery in order to have more flexibility. However, the interesting part of the given example is that we specify the user's query as a MustMatch condition of the BooleanQuery and the FunctionQuery just as a ShouldMatch. Constructed that way, I expect the FunctionQuery to score only those documents which fit the MustMatch condition. I conclude that from the fact that the FunctionQuery class also has a skipTo method, and I would expect that the scorer will use it to score only matching documents (however, I did not search where and how it might get called). If my conclusion is wrong, then hopefully Robert Muir (as far as I can see, the author of that class) can tell us what the intention was in constructing an every-time-match-all function query. Can you validate whether your QueryParser constructs a query in the form I drew above? Regards, Em Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas: Hello Em: 1) Here's a printout of an example DisMax query (as you can see, mostly MUST terms except for some SHOULD terms used for boosting scores for stopwords): ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)) 2) The collector is inserted in the SolrIndexSearcher (replacing the TimeLimitingCollector). We trigger it through the SOLR interface by passing the timeAllowed parameter. We know this is a hack, but AFAIK there's no out-of-the-box way to specify custom collectors for now (https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector part works perfectly as of now, so clearly this is not the problem. 3) Re: your sentence: "*I* would expect that with a shrinking set of matching documents to the overall-query, the function query only checks those documents that are guaranteed to be within the result set." Yes, I agree with this, but this snippet of code in FunctionQuery.java seems to say otherwise:

// instead of matching all docs, we could also embed a query.
// the score could either ignore the subscore, or boost it.
// Containment: floatline(foo:myTerm, myFloatField, 1.0, 0.0f)
// Boost: foo:myTerm^floatline(myFloatField,1.0,0.0f)
@Override
public int nextDoc() throws IOException {
  for (;;) {
    ++doc;
    if (doc >= maxDoc) {
      return doc = NO_MORE_DOCS;
    }
    if (acceptDocs != null && !acceptDocs.get(doc)) continue;
    return doc;
  }
}

It seems that the author also thought of maybe embedding a query in order to restrict matches, but this doesn't seem to be in place as of now (or maybe I'm not understanding how the whole thing works :) ). Thanks, Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas On Thu, Feb 16, 2012 at 8:09 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, We have some more tests on that matter: now we're moving from issuing this large query through the SOLR interface to creating our own QueryParser. The initial tests we've done in our QParser (which internally creates multiple queries and inserts them inside a DisjunctionMaxQuery) are very good; we're getting very good response times and high quality answers. But when we've tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a QueryValueSource that wraps the DisMaxQuery), the times move from 10-20 msec to 200-300 msec. I reviewed the
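In code, the query shape Em describes might be built roughly like this (a sketch against the Lucene 3.x-era API; the field names and the query_score value source are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.FloatFieldSource;
import org.apache.lucene.search.function.FunctionQuery;

public class MustPlusFunctionExample {
    public static Query build() {
        // The user's query is the MUST clause, so it prunes the candidate set;
        // the FunctionQuery is only a SHOULD clause contributing to the score.
        BooleanQuery bq = new BooleanQuery();
        DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
        userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
        userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));
        bq.add(userQuery, Occur.MUST);
        bq.add(new FunctionQuery(new FloatFieldSource("query_score")), Occur.SHOULD);
        return bq;
    }
}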
files left open?
Hi all, I was loading a big (60 million docs) CSV in Solr 4 when something odd happened. I got a Solr error in the log saying that it could not write the file. du -s indicated I had used 30GB of the 50GB available, but df -k indicated that the disk was 100% used. du and df giving different results could be an indication that there are file descriptors left open. After a Solr bounce, df -k came down and agreed with du. Has anyone seen anything like that? Thanks, Paulo. Environment: Linux 2.6.18-238.19.1.el5.centos.plus #1 SMP Mon Jul 18 10:05:09 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux Java(TM) SE Runtime Environment (build 1.6.0_17-b04) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode) apache-solr-4.0-2012-02-10_09-58-50 Solr config is the one in the distribution package; I had my own schema.
Re: custom scoring
I just modified some TestCases a little bit to see how the FunctionQuery behaves. Given an index containing 14 docs, where 13 of them contain the term batman and two contain the term superman, a search for q=+text:superman _val_:query($qq)&qq=text:superman leads to two hits, and the FunctionQuery has two iterations. If you remove that little plus symbol before text:superman, it isn't a mustMatch condition anymore and the whole query results in 14 hits (the default operator is OR): q=text:superman _val_:query($qq)&qq=text:superman If both queries, the TermQuery and the FunctionQuery, must match, it also results in two hits: q=text:superman AND _val_:query($qq)&qq=text:superman There is some behaviour that I currently don't understand (if 14 docs match, the FunctionQuery's AllScorer re-iterates twice over the 0th and the 1st doc, and the reason for that seems to be the construction of two AllScorers), but as far as I can see the performance of your queries *should* increase if you construct your query as I explained in my last eMail. Kind regards, Em Am 16.02.2012 23:43, schrieb Em: Hello Carlos, I think we misunderstood each other. As an example: BooleanQuery ( clauses: ( MustMatch( DisjunctionMaxQuery( TermQuery(stopword_field, barcelona), TermQuery(stopword_field, hoteles) ) ), ShouldMatch( FunctionQuery( *please insert your function here* ) ) ) ) Explanation: You construct an artificial BooleanQuery which wraps your user's query as well as your function query. Your user's query - in that case - is just a DisjunctionMaxQuery consisting of two TermQueries. In the real world you might construct another BooleanQuery around your DisjunctionMaxQuery in order to have more flexibility. However, the interesting part of the given example is that we specify the user's query as a MustMatch condition of the BooleanQuery and the FunctionQuery just as a ShouldMatch. Constructed that way, I expect the FunctionQuery to score only those documents which fit the MustMatch condition. I conclude that from the fact that the FunctionQuery class also has a skipTo method, and I would expect that the scorer will use it to score only matching documents (however, I did not search where and how it might get called). If my conclusion is wrong, then hopefully Robert Muir (as far as I can see, the author of that class) can tell us what the intention was in constructing an every-time-match-all function query. Can you validate whether your QueryParser constructs a query in the form I drew above? Regards, Em Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas: Hello Em: 1) Here's a printout of an example DisMax query (as you can see, mostly MUST terms except for some SHOULD terms used for boosting scores for stopwords): ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)) 2) The collector is inserted in the SolrIndexSearcher (replacing the TimeLimitingCollector). We trigger it through the SOLR interface by passing the timeAllowed parameter. We know this is a hack, but AFAIK there's no out-of-the-box way to specify custom collectors for now (https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector part works perfectly as of now, so clearly this is not the problem. 3) Re: your sentence: "*I* would expect that with a shrinking set of matching documents to the overall-query, the function query only checks those documents that are guaranteed to be within the result set." Yes, I agree with this, but this snippet of code in FunctionQuery.java seems to say otherwise: // instead of matching all docs, we could also embed a query. // the score could either ignore the subscore, or boost it. // Containment: floatline(foo:myTerm, myFloatField, 1.0, 0.0f) // Boost: foo:myTerm^floatline(myFloatField,1.0,0.0f) @Override public int nextDoc() throws IOException { for (;;) { ++doc; if (doc >= maxDoc) { return doc = NO_MORE_DOCS; } if (acceptDocs != null
Re: files left open?
On Thu, Feb 16, 2012 at 5:56 PM, Paulo Magalhaes paulo.magalh...@gmail.com wrote: I was loading a big (60 million docs) CSV in Solr 4 when something odd happened. I got a Solr error in the log saying that it could not write the file. du -s indicated I had used 30GB of the 50GB available, but df -k indicated that the disk was 100% used. You probably hit a big segment merge, which does require more disk space temporarily. The difference between du and df probably just reflects how they work internally (du may just look at file sizes, and files that have not yet been closed can register as smaller, or even zero, relative to the amount of disk space they actually take up). -Yonik lucidimagination.com
Re: Setting solrj server connection timeout
I'm not sure that timeout will help you here - I believe it's the timeout on 'creating' the connection. Try setting the socket timeout (setSoTimeout) - that should let you try sooner. It looks like perhaps the server is timing out and closing the connection. I guess all you can do is time out reasonably (if it takes too long to wait for the exception) and retry. On Fri, Feb 3, 2012 at 3:12 PM, Shawn Heisey s...@elyograg.org wrote: Is the following a reasonable approach to setting a connection timeout with SolrJ?

queryCore.getHttpClient().getHttpConnectionManager().getParams().setConnectionTimeout(15000);

Right now I have all my solr server objects sharing a single HttpClient that gets created using the multithreaded connection manager, where I set the timeout for all of them. Now I will be letting each server object create its own HttpClient object, and using the above statement to set the timeout on each one individually. It'll use up a bunch more memory, as there are 56 server objects, but maybe it'll work better. The total of 56 objects comes about from 7 shards, a build core and a live core per shard, two complete index chains, and for each of those, one server object for access to CoreAdmin and another for the index. The impetus for this, as it's possible I'm stating an XY problem: Currently I have an occasional problem where SolrJ connections throw an exception. When it happens, nothing is logged in Solr. My code is smart enough to notice the problem, send an email alert, and simply try again at the top of the next minute. The simple explanation is that this is a Linux networking problem, but I never had any problem like this when I was using Perl with LWP to keep my index up to date. I sent a message to the list some time ago on this exception, but I never got a response that helped me figure it out.

Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
  at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
  at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)
  at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)
  ... 3 more
Caused by: java.net.SocketException: Connection reset
  at java.net.SocketInputStream.read(SocketInputStream.java:168)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
  at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
  at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
  at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
  at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
  at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
  at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
  at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
  at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
  at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
  at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
  at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
  ... 7 more

Thanks, Shawn -- - Mark http://www.lucidimagination.com
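For what it's worth, Mark's suggestion translates to something like this against the commons-httpclient 3.x API that CommonsHttpSolrServer uses (a sketch; the URL is a placeholder):

import java.net.MalformedURLException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TimeoutSetup {
    public static CommonsHttpSolrServer create() throws MalformedURLException {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://host:8983/solr");
        HttpClient client = server.getHttpClient();
        // timeout for establishing the TCP connection
        client.getHttpConnectionManager().getParams().setConnectionTimeout(15000);
        // timeout for waiting on data once the connection is established
        client.getHttpConnectionManager().getParams().setSoTimeout(15000);
        return server;
    }
}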
how to delta index linked entities in 3.5.0
The delta instructions from https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command work for me in Solr 1.4 but crash in 3.5.0 (error: "deltaQuery has no column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID'"; issue: https://issues.apache.org/jira/browse/SOLR-2907). Is there anyone out there who can confirm my bug? Because I am new to Solr, hopefully I am just doing something wrong based on a misunderstanding of the wiki. Is anyone out there successfully indexing the join of items and multiple item_categories just like the wiki example, and willing to share their workaround or suggest one? Thanks, Adam
Re: Setting solrj server connection timeout
On 2/16/2012 6:28 PM, Mark Miller wrote: Im not sure that timeout will help you here - I believe it's the timeout on 'creating' the connection. Try setting the socket timeout (setSoTimeout) - that should let you try sooner. It looks like perhaps the server is timing out and closing the connection. I guess all you can do is timeout reasonably (if it takes too long to we for the exception) and retry. When the timeout exception happens, it is happening within the same second as the beginning of the update cycle, which involves a lot of other things happening (such as talking to a database) before it even gets around to talking to Solr. I do not have millisecond timestamps, but from what little I can tell, it's a handful of milliseconds from when SolrJ starts the request until the exception is logged. It happens relatively rarely - no more than once every few days, usually less often than that. I cannot reproduce it at will. Nobody is doing any work on either Solr or the network when it happens. Nothing is logged in the Solr server log or syslog at the OS level, the only mention of anything bad going on is in the log of my SolrJ application. I never had this problem when my build system was written in Perl, using LWP to make HTTP requests with URLs that I constructed myself. The perl system ran on CentOS 5 with Xen virtualization, now I'm running CentOS 6 on the bare metal. I'm using a bonded interface (for failover, not load balancing) comprised of two NICs plugged into separate switches. When it was virtualized, the Xen host was also using an identically configured bonded interface, bridged to the guests, which used eth0. The last time the error happened, which was on Feb 15th at 2:04 PM MST, the query that failed was 'did:(289800299 OR 289800157)', a very simple query against a tlong field. The application tests for the existence of the did values that it is trying to delete before it issues the delete request. I'm willing to look deeper into possible networking issues, but I am skeptical about that being the problem, and because there are no log messages to investigate, I have no idea how to proceed. The application runs on one of four Solr servers, sometimes the error even happens when connecting to Solr on the same server it's running on, which takes the gigabit switches out of the equation. If it's an actual networking problem, it's either in the hardware (Dell PowerEdge 2950 III, built-in NICs) or the CentOS 6 kernel. At this point, I am thinking it's one of the following problems, in order of decreasing probability: 1) I am using SolrJ incorrectly. 2) There is a SolrJ problem that only appears under specific circumstances that happen to exist in my setup. 3) My hardware or OS software has an extremely intermittent problem. What other info can I provide? Thanks, Shawn
Sort by the number of matching terms (coord value)
Hi, I'm looking for a way to sort results by the number of matching terms. Being able to sort by the coord() value or by the overlap value that gets passed into the coord() function would do the trick. Is there a way I can expose those values to the sort function? I'd appreciate any help that points me in the right direction. I'm OK with making basic code modifications. Thanks! -Nick
Re: how to delta index linked entities in 3.5.0
On 2/16/2012 6:31 PM, AdamLane wrote: The delta instructions from https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command works for me in solr 1.4 but crashes in 3.5.0 (error: deltaQuery has no column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID' issue: https://issues.apache.org/jira/browse/SOLR-2907) Is there anyone out there that can confirm my bug? Because I am new to solr and hopefully I am just doing something wrong based on a misunderstanding of the wiki. Anyone successfully indexing the join of items and multiple item_categories just like the wiki example that would be willing to share their workaround or suggest a workaround? I ran into something like this, possibly even this exact problem. Things have been tightened up in 3.x. All query results now need to have a field corresponding to what you've defined as pk, or it's considered an error. I was not using the results from my deltaQuery, but I still had to adjust it so that it returned a field with the same name as my primary key. You have defined more than one field for your pk, so I don't really know exactly what you'll have to do - perhaps you need to have both ITEM_ID and CATEGORY_ID fields in your query results. Thanks, Shawn
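As an illustration, a single-pk variant of the wiki's delta setup that satisfies the 3.x check might look like this in data-config.xml (table and column names are assumptions; the key point is that deltaQuery returns a column with the same name as pk; variable syntax per the DIH wiki):

<entity name="item" pk="ITEM_ID"
        query="SELECT * FROM item"
        deltaQuery="SELECT ITEM_ID FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE ITEM_ID = '${dih.delta.ITEM_ID}'">
  <!-- field mappings as usual -->
</entity>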
Re: Sort by the number of matching terms (coord value)
You can fool the Lucene scoring function: override each factor, such as idf, queryNorm and lengthNorm, and let them simply return 1.0f. I think Lucene 4 will expose more details, but for 2.x/3.x, Lucene can only score by the vector space model and the formula can't be replaced by users. On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark clark...@gmail.com wrote: Hi, I'm looking for a way to sort results by the number of matching terms. Being able to sort by the coord() value or by the overlap value that gets passed into the coord() function would do the trick. Is there a way I can expose those values to the sort function? I'd appreciate any help that points me in the right direction. I'm OK with making basic code modifications. Thanks! -Nick
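A minimal sketch of that trick (against the 3.x DefaultSimilarity API; recent 3.x releases moved length normalization from lengthNorm into computeNorm, so override whichever your version exposes):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

// Flatten every factor to 1.0 and saturate tf, so the final score is
// dominated by coord(), i.e. the fraction of query terms that matched.
public class CoordOnlySimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f; // ignore how often a term occurs
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f; // ignore term rarity
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f; // don't normalize across queries
    }

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return 1.0f; // ignore field length and index-time boosts
    }
}

Registered via a <similarity> element in schema.xml, this leaves coord() untouched, so the default score ordering approximates "number of matching terms" - though it changes scoring globally rather than exposing the overlap value to the sort function.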
Improving proximity search performance
Here’s my use case. I expect to set up a Solr index that is approximately 1.4GB (this is a real number from the proof-of-concept using the real data, which consists of about 10 million documents, many of significant size, and making use of the FastVectorHighlighter to do highlighting on the body text field, which is of course stored, and with termVectors, termPositions, and termOffsets on). I no longer have the proof-of-concept Solr core available (our live site uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical answer to this question: Will storing that extra information about the location of terms help the performance of proximity searches? A significant and important subset of my users make extensive use of proximity searches. These sophisticated users have found that they are best able to locate what they want by doing searches about THISWORD within 5 words of THATWORD, or much more sophisticated variants on that theme, including plenty of booleans and wildcards. The problem I’m facing is performance. Some of these searches, when common words are used, can take many minutes, even with the index on an SSD. The question is, how to improve the performance. It occurred to me as possible that all of that term vector information, stored for the benefit of the FastVectorHighlighter, might be a significant aid to the performance of these searches. First question: is that already the case? Will storing this extra information automatically improve my proximity search performance? Second question: If not, I’m very willing to dive into the code and come up with a patch that would do this. Can someone with knowledge of the internals comment on whether this is a plausible strategy for improving performance, and, if so, give tips about the outlines of what a successful approach to the problem might look like? Third question: Any tips in general for improving the performance of these proximity searches? I have explored the question of whether the customers might be weaned off of them, and that does not appear to be an option. Thanks, -- Bryan Loofbourrow
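For concreteness, the searches in question are Lucene sloppy phrase queries, e.g.:

body:"THISWORD THATWORD"~5

(the wildcard-and-boolean variants would typically need a span-capable parser, such as the contrib surround parser, but the performance question is the same).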
RE: Frequent garbage collections after a day of operation
A couple of thoughts: We wound up doing a bunch of tuning on the Java garbage collection. However, the pattern we were seeing was periodic very extreme slowdowns, because we were then using the default garbage collector, which blocks when it has to do a major collection. This doesn't sound like your problem, but it's something to be aware of. One thing that could fit the pattern you describe would be Solr caches filling up and getting you too close to your JVM or memory limit. For example, if you have large documents, and have defined a large document cache, that might do it. I found it useful to point jconsole (free with the JDK) at my JVM, and watch the pattern of memory usage. If the troughs at the bottom of the GC cycles keep rising, you know you've got something that is continuing to grab more memory and not let go of it. Now that our JVM is running smoothly, we just see a sawtooth pattern, with the troughs approximately level. When the system is under load, the frequency of the wave rises. Try it and see what sort of pattern you're getting. -- Bryan -Original Message- From: Matthias Käppler [mailto:matth...@qype.com] Sent: Thursday, February 16, 2012 7:23 AM To: solr-user@lucene.apache.org Subject: Frequent garbage collections after a day of operation Hey everyone, we're running into some operational problems with our SOLR production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla SOLR 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from SOLR, up to 3 times increases on average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more like a brute force solution. The thing is, we don't know what's causing this and we don't have that much experience with Java stacks since we're for the most part a Rails company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else seeing this, or can you think of a reason for this? Most of our queries to SOLR involve the DismaxHandler and the spatial search query components. We don't use any custom request handlers so far. Thanks in advance, -Matthias -- Matthias Käppler Lead Developer API Mobile Qype GmbH Großer Burstah 50-52 20457 Hamburg Telephone: +49 (0)40 - 219 019 2 - 160 Skype: m_kaeppler Email: matth...@qype.com Managing Director: Ian Brotherston Amtsgericht Hamburg HRB 95913 This e-mail and its attachments may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail and its attachments. Any unauthorized copying, disclosure or distribution of this e-mail and its attachments is strictly forbidden. This notice also applies to future messages.
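For reference, the collector switch described above is a JVM-flag change; a sketch for the Sun/Oracle JVMs of that era (heap sizes illustrative):

java -Xmx4g -Xms4g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log \
     -jar start.jar

-XX:+UseConcMarkSweepGC avoids the long stop-the-world major collections of the default collector, and the GC log gives the same rising-trough signal that jconsole shows.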
Re: distributed deletes working?
Yup - deletes are fine. On Thu, Feb 16, 2012 at 8:56 PM, Jamie Johnson jej2...@gmail.com wrote: With SOLR-2358 being committed to trunk, do deletes and updates get distributed/routed like adds do? Also, when a down shard comes back up, are the deletes/updates forwarded as well? Reading the JIRA I believe the answer is yes, I just want to verify before bringing the latest into my environment. -- - Mark http://www.lucidimagination.com
Re: Sort by the number of matching terms (coord value)
I want to leave the score intact so I can sort by matching term frequency and then by score. I don't think I can do that if I modify all the similarity functions, but I think your solution would have worked otherwise. It would be great if there was a way I could expose this information through a function query (similar to the new relevance functions in version 4.0). I'll have to see if I can figure out how those functions work. -Nick On Thu, Feb 16, 2012 at 6:58 PM, Li Li fancye...@gmail.com wrote: you can fool the lucene scoring fuction. override each function such as idf queryNorm lengthNorm and let them simply return 1.0f. I don't lucene 4 will expose more details. but for 2.x/3.x, lucene can only score by vector space model and the formula can't be replaced by users. On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark clark...@gmail.com wrote: Hi, I'm looking for a way to sort results by the number of matching terms. Being able to sort by the coord() value or by the overlap value that gets passed into the coord() function would do the trick. Is there a way I can expose those values to the sort function? I'd appreciate any help that points me in the right direction. I'm OK with making basic code modifications. Thanks! -Nick
Ranking based on number of matches in a multivalued field?
So suppose I have a multivalued field for categories. Let's say we have 3 items with these categories: Item 1: category ids [1,2,5,7,9] Item 2: category ids [4,8,9] Item 3: category ids [1,4,9] I now run a filter query for any of the following category ids: [1,4,9]. I should get all of them back as results because they all include at least one category which I'm querying. Now, how do I order them based on the number of matching categories? In this case, I would like Item 3 (matched all of [1,4,9]) to be ranked highest, followed by Item 2 (matched [4,9]) and Item 1 (matched [1,9]). Is there a way I can boost documents based on the number of matches? I don't want an absolute rank where Item 3 is definitely the first result, but rather a way to boost Item 3's score higher than that of Item 1 and 2 so that it's more likely to show up higher (depending on the query string). Thanks! -- Steven Ou | 歐偉凡 *ravn.com* | Chief Technology Officer steve...@gmail.com | +1 909-569-9880
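One way to get that effect with stock Solr is to keep the filter query for matching but repeat the category clause as the scoring query (or as a dismax bq); with an assumed category_ids field:

q=category_ids:(1 OR 4 OR 9)&fq=category_ids:(1 OR 4 OR 9)

Documents matching more of the OR'd clauses score higher (via the coord factor), though idf differences between the individual category terms can still skew the order.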
UpdateRequestHandler coding
If I want to write a complex UpdateRequestHandler, should I do it on trunk or the 3.x branch? The criteria are a stable, debugged, full-featured environment. -- Lance Norskog goks...@gmail.com
Re: Size of suggest dictionary
Thanks Em! What if we use a threshold value in the suggest configuration, like

<float name="threshold">0.005</float>

I assume the dictionary size will then be smaller than the total number of distinct terms; is there any way to determine what that size is? Thanks, Mike On Wednesday, February 15, 2012 at 4:39 PM, Em wrote: Hello Mike, have a look at Solr's Schema Browser. Click on FIELDS, select label and have a look at the number of distinct (term-)values. Regards, Em Am 15.02.2012 23:07, schrieb Mike Hugo: Hello, We're building an auto suggest component based on the label field of documents. Is there a way to see how many terms are in the dictionary, or how much memory it's taking up? I looked on the statistics page but didn't find anything obvious. Thanks in advance, Mike ps- here's the config:

<searchComponent name="suggestlabel" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggestlabel</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">label</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<requestHandler name="suggestlabel" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggestlabel</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggestlabel</str>
  </arr>
</requestHandler>
Re: Frequent garbage collections after a day of operation
One thing that could fit the pattern you describe would be Solr caches filling up and getting you too close to your JVM or memory limit. This [uncommitted] issue would solve that problem by allowing the GC to collect caches that become too large, though in practice, the cache setting would need to be fairly large for an OOM to occur from them: https://issues.apache.org/jira/browse/SOLR-1513 On Thu, Feb 16, 2012 at 7:14 PM, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote: A couple of thoughts: We wound up doing a bunch of tuning on the Java garbage collection. However, the pattern we were seeing was periodic very extreme slowdowns, because we were then using the default garbage collector, which blocks when it has to do a major collection. This doesn't sound like your problem, but it's something to be aware of. One thing that could fit the pattern you describe would be Solr caches filling up and getting you too close to your JVM or memory limit. For example, if you have large documents, and have defined a large document cache, that might do it. I found it useful to point jconsole (free with the JDK) at my JVM, and watch the pattern of memory usage. If the troughs at the bottom of the GC cycles keep rising, you know you've got something that is continuing to grab more memory and not let go of it. Now that our JVM is running smoothly, we just see a sawtooth pattern, with the troughs approximately level. When the system is under load, the frequency of the wave rises. Try it and see what sort of pattern you're getting. -- Bryan -Original Message- From: Matthias Käppler [mailto:matth...@qype.com] Sent: Thursday, February 16, 2012 7:23 AM To: solr-user@lucene.apache.org Subject: Frequent garbage collections after a day of operation Hey everyone, we're running into some operational problems with our SOLR production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla SOLR 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from SOLR, up to 3 times increases on average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more like a brute force solution. The thing is, we don't know what's causing this and we don't have that much experience with Java stacks since we're for the most part a Rails company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else seeing this, or can you think of a reason for this? Most of our queries to SOLR involve the DismaxHandler and the spatial search query components. We don't use any custom request handlers so far. Thanks in advance, -Matthias -- Matthias Käppler Lead Developer API Mobile Qype GmbH Großer Burstah 50-52 20457 Hamburg Telephone: +49 (0)40 - 219 019 2 - 160 Skype: m_kaeppler Email: matth...@qype.com Managing Director: Ian Brotherston Amtsgericht Hamburg HRB 95913 This e-mail and its attachments may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail and its attachments. Any unauthorized copying, disclosure or distribution of this e-mail and its attachments is strictly forbidden. This notice also applies to future messages.