CachedSqlEntityProcessor's purpose
I am starting to look at Solr's Data Import Handler framework and am quite impressed with it so far. My question is about reducing the number of SQL queries issued to the database, which is how I came across this entity processor. In the following example:

<entity name="x" query="select * from x">
  <entity name="y" query="select * from y where xid=${x.id}" processor="CachedSqlEntityProcessor"/>
</entity>

I like the concept of having multiple entity blocks for clarity, but why wouldn't I (for DB efficiency) use the following as one entity's SQL statement: select * from X, Y where x.id = y.xid, and have two fields pointing at the X and Y columns? My main question, though, is how the CachedSqlEntityProcessor helps in this case, since I want to use multiple entity blocks for cleanliness. If I have 500,000 X records, how many SQL queries would get executed in the second entity block (y)? 500,000? If there is any more detailed information about the number of queries executed in different circumstances, the memory overhead, or the way data is brought from the database into Java, it would be much appreciated, as it's important for my application. Thanks in advance! Amit
Unknown field error using JDBC
Hello, I get an "unknown field" error when I'm indexing an Oracle DB. I've reduced the number of fields/columns in order to troubleshoot. If I change the uniqueKey to timestamp (for example) and create a dynamic field

<dynamicField name="*" type="text" indexed="true" stored="true"/>

the indexing works fine, except the id field is empty.

--- data-config.xml ---
...
<dataSource driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@host:port/service-name" user="user" password="pw" name="ds1"/>
...
<entity name="document" pk="PUBID" query="SELECT PUBID FROM UPLMAIN" dataSource="ds1">
  <field column="PUBID" name="id"/>
</entity>
...

--- schema.xml ---
...
<field name="id" type="text" indexed="true" stored="true" required="true"/>
...
<uniqueKey>id</uniqueKey>
...

--- ERROR message ---
2008-nov-25 12:25:25 org.apache.solr.handler.dataimport.SolrWriter upload
VARNING: Error creating document : SolrInputDocument[{PUBID=PUBID(1.0)={43392}}]
org.apache.solr.common.SolrException: ERROR:unknown field 'PUBID'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:274)
...

Has anyone had similar problems, or does anyone know how to solve this!? Any help is truly appreciated!! // Joel
Re: Using Solr for indexing emails
On Tue, 25 Nov 2008 03:59:31 +0200 Timo Sirainen [EMAIL PROTECTED] wrote:
> > would it be faster to say q=user:user AND highestuid:[* TO *] ?
> Now that I read again what fq really did, yes, sounds like you're right.

you may want to compare them both to see which one is better... I just went from memory :P

> > (and I guess you'd sort DESC and return 1 record only).
> No, I'd use the above for getting the highestuid value for all mailboxes (there should be only one record per mailbox; each mailbox has separate uid values and a separate highestuid value) so I can look at the returned highestuid values to see which mailboxes aren't fully indexed yet.

gotcha. It is an interesting use of SOLR, I must say... I for one am not used to having to deal with up-to-the-second update needs. good luck,
B
_
{Beto|Norberto|Numard} Meijome
"Never offend people with style when you can offend them with substance." - Sam Brown
Re: solr internationalization support
On Mon, Nov 24, 2008 at 7:56 PM, rameshgalla [EMAIL PROTECTED] wrote:
> 1) Which languages does Solr support out of the box, other than English?

Solr does not know about any languages. It will apply whatever analyzers you specify in the schema.xml for that field type.

> 2) What analyzers (stemmer, synonym, tokenizer etc.) does it provide for each language?

Quite a few. The complete list is at
http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html

> 3) Can we create our own analyzers for a language? (If possible, explain how.)

If the existing analyzers do not work well, then yes, you would need to create your own. I can't say how easy or difficult it will be because I've never written one myself yet. Some javadocs that may be of help:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/TokenFilter.html
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Tokenizer.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/BaseTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/BaseTokenFilterFactory.html
--
Regards, Shalin Shekhar Mangar.
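For reference, a minimal sketch of a custom filter plus the factory that schema.xml names, written against the Solr 1.3 / Lucene 2.4-era token API. The package and class names are made up for illustration, and the filter itself is deliberately trivial (it reverses each token):

package com.example.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

// Toy filter: reverses the text of every token that passes through it.
class ReverseTokenFilter extends TokenFilter {
  public ReverseTokenFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    // Pull the next token from the upstream tokenizer/filter; null means end of stream.
    Token t = input.next();
    if (t != null) {
      t.setTermText(new StringBuffer(t.termText()).reverse().toString());
    }
    return t;
  }
}

// The factory is what a field type's <filter class="..."/> element references.
public class ReverseTokenFilterFactory extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new ReverseTokenFilter(input);
  }
}

Drop the compiled classes into Solr's lib directory and wire the factory into an analyzer chain like any built-in filter, e.g. <filter class="com.example.analysis.ReverseTokenFilterFactory"/>.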
Re: Unknown field error using JDBC
which version of DIH are you using?

On Tue, Nov 25, 2008 at 5:24 PM, Joel Karlsson [EMAIL PROTECTED] wrote:
> Hello, I get an "unknown field" error when I'm indexing an Oracle DB. [...]
> Has anyone had similar problems, or does anyone know how to solve this!? Any help is truly appreciated!! // Joel

--
--Noble Paul
Re: Unknown field error using JDBC
I actually don't know which version I was using, but now I've upgraded to 1.3 and it works like a charm!! Thanks a lot!

2008/11/25 Noble Paul നോബിള് नोब्ळ् [EMAIL PROTECTED]:
> which version of DIH are you using? [...]
Re: Schema Design Guidance
Even if you go the 400,000-documents way, the size of the data and the number of unique tokens would remain the same. With your data size, you should think about sharding and distributed search.

Is the availability of a product a boolean value or the number of items? To make sure that you don't need to do very frequent updates, it will be better to use a boolean for availability. Even then, real-time updates in Solr are not possible and you will have to allow a reasonable delay for changes to take effect.

On Tue, Nov 25, 2008 at 4:40 AM, Vimal Jobanputra [EMAIL PROTECTED] wrote:
> Hi, and apologies in advance for the lengthy question! I'm looking to use Solr to power searching and browsing over a large set of product data stored in a relational DB. I'm wondering what the most appropriate schema design strategy is. A simplified view of the relational data is:
>
> Shop (~1,000 rows): Id*, Name
> Product (~300,000 rows): Id*, Name, Availability
> ProductFormat (~5 rows): Id*, Name
> Component (part of a product that may be sold separately) (~4,000,000 rows): Id*, Name
> ProductComponent (~4,000,000 rows): ProductId*, ComponentId*
> ShopProduct (~6,000,000 rows): ShopId*, ProductId*, ProductFormatId*, AvailableDate
> ShopProductPriceList (~15,000,000 rows): ShopId*, ProductId*, ProductFormatId*, Applicability (Component/Product)*, Type (Regular/SalePrice)*, Amount
>
> (* = logical primary key)
>
> This means:
> - availability of a product differs from shop to shop
> - the price of a product or component depends on the format, and also differs from shop to shop
>
> Textual searching is required over product component names, and filtering is required over Shops, Product Availability, Formats, and Prices.
>
> The simplest approach would be to flatten out the data completely (1 Solr document per ShopProduct and ShopProductComponent). This would result in ~80 million documents, which I'm guessing would need some form of sharding/distribution.
>
> An alternate approach would be to construct one document per Product, and *nest* the relational data via dynamic fields (and possibly plugins?). E.g. one document per Product; multi-value fields for ProductComponent and Shop; dynamic fields for Availability/Format, using ShopId as part of the field name. This approach would result in far fewer documents (400,000), but more complex queries. It would also require extending Solr/Lucene to search over ProductComponents and filter by price, which I'm not quite clear on as yet...
>
> Any guidance on which of the two general approaches (or others) to explore further? Thanks! Vim

--
Regards, Shalin Shekhar Mangar.
Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg
https://issues.apache.org/jira/secure/attachment/12394070/sslogo-solr-finder2.0.png
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg

On Sun, Nov 23, 2008 at 10:29 PM, Ryan McKinley [EMAIL PROTECTED] wrote:
> Please submit your preferences for the solr logo. For full voting details, see:
> http://wiki.apache.org/solr/LogoContest#Voting
> The eligible logos are: http://people.apache.org/~ryan/solr-logo-options.html
> Any and all members of the Solr community are encouraged to reply to this thread and list (up to) 5 ranked choices by listing the Jira attachment URLs. Votes will be assigned a point value based on rank. For each vote, 1st choice has a point value of 5, 5th place has a point value of 1, and the others follow a similar pattern.
> https://issues.apache.org/jira/secure/attachment/12345/yourfirstchoice.jpg
> https://issues.apache.org/jira/secure/attachment/34567/yoursecondchoice.jpg
> ...
> This poll will be open until Wednesday, November 26th, 2008 @ 11:59PM GMT. When the poll is complete, the Solr committers will tally the community preferences and take a final vote on the logo.
> A big thanks to everyone who submitted possible logos -- it's great to see so many good options.

--
Regards, Shalin Shekhar Mangar.
Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png https://issues.apache.org/jira/secure/attachment/12393936/logo_remake.jpg https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg
Re: Sorting and JVM heap size ....
On Tue, Nov 25, 2008 at 7:49 AM, souravm [EMAIL PROTECTED] wrote:
> 3. Another case is - if there are 2 search requests concurrently hitting the server, each with sorting on the same 20-character date field, then it would also need 2x2GB memory. So if I know that I need to support at least 4 concurrent search requests, I need to start the JVM with at least an 8 GB heap size.

This is a misunderstanding. Yonik said searchers, not searches. A single searcher handles all live search requests. When a commit/optimize happens, a new searcher is created, its caches are auto-warmed, and then it is swapped with the live searcher. It may be a bit more complicated under the hood, but that's pretty much how it works.

Considering that after a commit and during auto-warming, another searcher might have been created, which will have another field cache for each field you are sorting on, you'll need double the memory. The number of warming searchers can be controlled through the maxWarmingSearchers parameter in solrconfig.xml.

--
Regards, Shalin Shekhar Mangar.
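For reference, the settings involved live in solrconfig.xml; a sketch with illustrative values (not recommendations):

<!-- cap the number of searchers that may be warming at the same time -->
<maxWarmingSearchers>2</maxWarmingSearchers>

<!-- autowarmCount controls how many cached entries are re-executed
     against the new searcher before it goes live -->
<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>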
Re: Analyzing CSV phrase fields
The easiest solution would be to create the documents you send to Solr with multiple keywords fields... they will be separated by a positionIncrementGap, so a phrase query won't see "yankees" adjacent to "cleveland".

If you can't do that, then perhaps patch the PatternTokenizer to put a larger positionIncrement between groups. You would then need to follow it with another filter that tokenizes on whitespace or some other regex (which we currently don't have).

-Yonik

On Tue, Nov 25, 2008 at 2:10 AM, Neal Richter [EMAIL PROTECTED] wrote:
> Hey all, Very basic question.. I want to index fields of comma-separated values. Example documents:
>
> id: 1
> title: Football Teams
> keywords: philadelphia eagles, cleveland browns, new york jets
>
> id: 2
> title: Baseball Teams
> keywords: philadelphia phillies, new york yankees, cleveland indians
>
> A query of 'new york' should return the obvious documents, but a quoted phrase query of "yankees cleveland" should return nothing... meaning that a comma breaks phrases without fail.
>
> I've created a textCSV type in the schema.xml file and used the PatternTokenizerFactory to split on commas; from there analysis can proceed as normal via StopFilterFactory, LowerCaseFilter, RemoveDuplicatesTokenFilter:
>
> <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*" group="-1"/>
>
> Has anyone done this before? Can I somehow use an existing (or combination of) Analyzer? It seems as though I need to create a PhraseDelimiterFilter from the WordDelimiterFilter.. though I am sure there is a way to make an existing analyzer break things up the way I want.
>
> Thanks - Neal Richter
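A sketch of the multi-valued approach Yonik describes first, reusing the example data from the question (the field and type names are assumptions):

<!-- schema.xml: one value per team; the type's positionIncrementGap keeps values apart -->
<field name="keywords" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- the update message sends one <field> element per keyword -->
<add>
  <doc>
    <field name="id">2</field>
    <field name="title">Baseball Teams</field>
    <field name="keywords">philadelphia phillies</field>
    <field name="keywords">new york yankees</field>
    <field name="keywords">cleveland indians</field>
  </doc>
</add>

With positionIncrementGap=100 on the text field type, the last token of one value and the first token of the next end up 100 positions apart, so the phrase query "yankees cleveland" cannot match.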
Re: CachedSqlEntityProcessor's purpose
On Tue, Nov 25, 2008 at 1:52 PM, Amit Nithian [EMAIL PROTECTED] wrote:
> I like the concept of having multiple entity blocks for clarity, but why wouldn't I (for DB efficiency) use the following as one entity's SQL statement: select * from X, Y where x.id = y.xid, and have two fields pointing at the X and Y columns?

You can certainly do that. However, it is a problem when you need field X or Y to be multi-valued. You'd get repeated rows for that query and DataImportHandler will have no way to figure out what to put where. In the nested-entities approach, multiple values come from a nested entity, which can very easily be represented as a list. If you do not have multi-valued fields then you can go for that approach.

> My main question, though, is how the CachedSqlEntityProcessor helps in this case, since I want to use multiple entity blocks for cleanliness. If I have 500,000 X records, how many SQL queries would get executed in the second entity block (y)? 500,000?

For each row fetched from the parent entity, the query for its nested entity is executed after replacing the variables with known values. When the nested entity has few records in the database, it is more efficient to use CachedSqlEntityProcessor, which executes the query only once and keeps all the returned rows in memory. After that, for each row returned by the parent entity, the cached entity only needs to do a lookup in the cache, which is quite fast. Since all rows are stored in memory, you trade memory for the number of queries to the DB when you use CachedSqlEntityProcessor.

http://wiki.apache.org/solr/DataImportHandler#head-4465e39677ec06e4b14fd6a574434bac6e4d01e1

--
Regards, Shalin Shekhar Mangar.
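For completeness, the single-entity join the question proposes (fine as long as neither side is multi-valued) would look roughly like this in data-config.xml; the column-to-field mappings here are illustrative, not taken from the thread:

<entity name="xy" query="select x.id, x.name, y.val from x, y where x.id = y.xid">
  <field column="id" name="id"/>
  <field column="name" name="name"/>
  <field column="val" name="val"/>
</entity>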
Re: Using Solr for indexing emails
On Mon, Nov 24, 2008 at 11:51 PM, Timo Sirainen [EMAIL PROTECTED] wrote:
> DIH seems to be about Solr pulling data into it from an external source. That's not really practical with Dovecot since there's no central repository of any kind of data, so there's no way to know what has changed since the last pull.

Isn't your IMAP server the external data source? DIH can consume from any data store. Tools for consuming from databases and files have been written. I think it is possible to write one which consumes from IMAP.

--
Regards, Shalin Shekhar Mangar.
Re: port of Nutch CommonGrams to Solr for help with slow phrase queries
Hi Tom, I don't think anybody has worked on adding this to Solr yet. Do you mind opening a Jira issue?

On Tue, Nov 25, 2008 at 12:01 AM, Burton-West, Tom [EMAIL PROTECTED] wrote:
> Hello all, We are having problems with extremely slow phrase queries when the phrase query contains common words. We are reluctant to just use stop words due to various problems with false hits and some things becoming impossible to search with stop words turned on. (For example "to be or not to be", "the who", "man in the moon" vs "man on the moon", etc.)
>
> The approach to this problem used by Nutch looks promising. Has anyone ported the Nutch CommonGrams filter to Solr?
>
> "Construct n-grams for frequently occurring terms and phrases while indexing. Optimize phrase queries to use the n-grams. Single terms are still indexed too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
>
> Tom
> Tom Burton-West
> Information Retrieval Programmer
> Digital Library Production Services
> University of Michigan Library

--
Regards, Shalin Shekhar Mangar.
Re: CachedSqlEntityProcessor's purpose
Every row emitted by an outer entity results in a new SQL query on the inner entity (yes, 500,000 queries on the inner entity). So, if you wish to join multiple tables, then nested entities are the way to go.

CachedSqlEntityProcessor is meant to help you reduce the number of queries fired on sub-entities. If you fetch the entire table in one query (by using select * from y) and use a separate where attribute, the entire set of rows in y gets loaded into RAM. If you use it w/o the where attribute, it still ends up loading the entire table into memory (it is an unbounded cache). It can easily give you an OOM.

Do not use CachedSqlEntityProcessor just for tidying up. Use it if you wish to save time and you have a lot of RAM.

On Tue, Nov 25, 2008 at 1:52 PM, Amit Nithian [EMAIL PROTECTED] wrote:
> I am starting to look at Solr's Data Import Handler framework and am quite impressed with it so far. [...]

--
--Noble Paul
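A sketch of the where-attribute form Noble mentions, using the same toy tables as the question: the sub-entity's query runs once, the rows are cached keyed on the y-side column named on the left of the where attribute, and the x-side value on the right is looked up per parent row:

<entity name="x" query="select * from x">
  <entity name="y" query="select * from y"
          processor="CachedSqlEntityProcessor" where="xid=x.id"/>
</entity>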
RE: Sorting and JVM heap size ....
Hi Shalin,

Thanks for the clarifications. Could you please explain a bit more about how the new searcher can double the memory?

Based on your explanation, when a new set of documents gets committed, a new searcher is created. So what I understand is that this situation may occur only when an update/delete query and a search query run in parallel. Also, I am assuming that, like commit, optimization also happens only during update/delete queries.

Regards,
Sourav

From: Shalin Shekhar Mangar [EMAIL PROTECTED]
Sent: Tuesday, November 25, 2008 6:40 AM
To: solr-user@lucene.apache.org
Cc: souravm
Subject: Re: Sorting and JVM heap size

> This is a misunderstanding. Yonik said searchers, not searches. [...]
matching exact terms
This is probably severe user error, but I am curious about how to index docs so that this query works: "happy birthday" should return the doc with n_name:"Happy Birthday" before the doc with n_name:"Happy Birthday, Happy Birthday". As it is now, the latter appears first for a query of n_name:"happy birthday", the former second. It would be great to do this at query time instead of having to re-index, but I will if I have to! The n_* type is defined as:

<fieldtype name="name" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
Re: Sorting and JVM heap size ....
On Tue, Nov 25, 2008 at 9:37 PM, souravm [EMAIL PROTECTED] wrote:
> Could you please explain a bit more about how the new searcher can double the memory?

Take a look at slide 13 of Yonik's presentation, available at http://people.apache.org/~yonik/ApacheConEU2006/Solr.ppt

Each searcher in Solr maintains various caches for performance reasons. When a new one is created, its caches are empty. If one exposed this searcher to live requests, response times could be very long because a lot of disk accesses may be needed. Therefore, Solr warms the new searcher's caches by re-executing queries whose results had been cached on the old searcher's cache. If you sort on fields, the new searcher will create its own FieldCache for each field you sort on. At this time, both the old and the new searcher will have their field caches.

> Based on your explanation, when a new set of documents gets committed, a new searcher is created. So what I understand is that this situation may occur only when an update/delete query and a search query run in parallel.

Not during updates/deletes, but when you issue a commit or optimize command.

> Also, I am assuming that, like commit, optimization also happens only during update/delete queries.

Commit and optimize have to be called by you explicitly.

--
Regards, Shalin Shekhar Mangar.
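A rough back-of-the-envelope illustration of where the memory goes (all figures are assumptions for illustration, not measurements): Lucene's FieldCache for a string sort field holds an int per document plus an array of the unique term strings. For 10 million documents sorted on a mostly-unique 20-character field:

order array:   10,000,000 docs x 4 bytes                                  ~  40 MB
lookup array:  10,000,000 refs x 4 bytes                                  ~  40 MB
term strings:  10,000,000 x (~40 bytes object overhead + 20 chars x 2 B)  ~ 800 MB

That approaches 1 GB for a single sort field, and while a new searcher warms, the old and the new searcher each hold their own copy, which is exactly the doubling described above.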
Re: matching exact terms
On Nov 25, 2008, at 11:40 AM, Brian Whitman wrote:
> This is probably severe user error, but I am curious about how to index docs so that this query works: "happy birthday" should return the doc with n_name:"Happy Birthday" before the doc with n_name:"Happy Birthday, Happy Birthday". [...]

Hi Brian! What is the explain text when you turn on debugQuery=true? With the indexing scheme you have, "happy birthday, happy birthday" will match 4 terms while "happy birthday" matches only two. Two options come to mind (sorry, both require reindexing):

1. Add the remove-duplicates filter. This would have both documents match only two terms, and the fieldNorm should boost the shorter field above the longer one. However, removing the duplicates may make some other queries less relevant.

2. Add a copyField and index the name as a string or something without tokenization (use the KeywordTokenizerFactory), then query on both fields (dismax) and boost an exact match over the text match: name_with_tokens^1 name_no_tokens^3 (or something like that). A sketch of this follows below.

ryan
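A sketch of option 2 (the field and type names here are made up for illustration): copy the name into an untokenized field and boost it in a dismax query.

<!-- schema.xml -->
<fieldtype name="name_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<field name="n_name_exact" type="name_exact" indexed="true" stored="false"/>
<copyField source="n_name" dest="n_name_exact"/>

Querying through the dismax handler with qf=n_name^1 n_name_exact^3 then lets a document whose whole name is exactly "happy birthday" outrank one that merely contains the phrase twice.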
Re: CachedSqlEntityProcessor's purpose
Thanks for the responses. A few follow-ups:

1) It seems that the CachedSqlEntityProcessor performs the where clause in memory on the cache. Is this cache an in-memory RDBMS or maps?

2) In the example, there were two use cases: one like query="select * from y where xid=${x.id}" and another where it's query="select * from y" with a separate where="xid=x.id" attribute. Is there any difference in how CachedSqlEntityProcessor behaves? Does it know to strip off the WHERE clause and simply cache the "select * from y"?

What are some dataset sizes that have been tested using this framework, and what are some performance metrics?

Thanks again
Amit

On Tue, Nov 25, 2008 at 7:32 AM, Noble Paul നോബിള് नोब्ळ् [EMAIL PROTECTED] wrote:
> Every row emitted by an outer entity results in a new SQL query on the inner entity. [...]
Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png https://issues.apache.org/jira/secure/attachment/12394475/solr2_maho-vote.png
Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg

On Nov 25, 2008, at 9:05 AM, Marcus Stratmann wrote: [...]
Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg https://issues.apache.org/jira/secure/attachment/12394314/apache_soir_001.jpg https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
newbie question on SOLR distributed searches with many shards
I wasn't able to find examples/anything via Google, so thought I'd ask:

Say I want to implement a solution using distributed searches with many shards in SOLR 1.3.0. Also, say there are too many shards to pass in via the URL (dozens, hundreds, whatever). Is there a way to specify in solrconfig.xml (or elsewhere) a list of the shard URLs to use? I saw references to a shards.txt but no info on it. I also saw bits of info that suggested that there MIGHT be another way to do this.

Any info appreciated on doing this sort of distributed search. thx
Keyword extraction
Hi all,

Struggling with a question I recently got from a colleague: is it possible to extract keywords from indexed content? In my opinion it should be possible to find out which words the indexed content ranks highest on (Lucene or Solr), but I have no clue where to begin. Anyone have suggestions?

Best, Patrick
Re: Keyword extraction
Lots of approaches out there... the easiest off-the-shelf method would be to use the MoreLikeThisHandler and get the top interesting terms:
http://wiki.apache.org/solr/MoreLikeThisHandler

ryan

On Nov 25, 2008, at 2:09 PM, Plaatje, Patrick wrote:
> Hi all, Struggling with a question I recently got from a colleague: is it possible to extract keywords from indexed content? [...]
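A sketch of that approach on Solr 1.3 (the handler name and the content field are assumptions): register the handler in solrconfig.xml, then ask it for the interesting terms of the document whose keywords you want.

<!-- solrconfig.xml -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>

http://localhost:8983/solr/mlt?q=id:1&mlt.fl=content&mlt.mintf=1&mlt.mindf=1&mlt.interestingTerms=details

mlt.interestingTerms=details returns the top terms (with their relative weights) that make the matched document distinctive, which is essentially the keyword list you are after; use list instead of details if you don't need the weights.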
Re: Using Solr for indexing emails
On Tue, 2008-11-25 at 20:45 +0530, Shalin Shekhar Mangar wrote:
> Isn't your IMAP server the external data source? DIH can consume from any data store. Tools for consuming from databases and files have been written. I think it is possible to write one which consumes from IMAP.

Yes, but that would require going through all of every user's mailboxes to find out which ones have new non-indexed messages. The data isn't stored in any centralized database that would allow quickly returning all non-indexed messages. Instead, for each mailbox it would have to (at minimum) open and read two files. That won't really scale for large installations with a huge number of mailboxes.

(At some point I'll probably implement something that allows finding everyone's new messages more easily so that I can implement replication support, but for now that kind of a change would be way too much work.)
Spellcheck for phrase queries
Hi,

I am trying to implement spell-check functionality on a particular field. I need to do a complete-phrase spell check when the user enters multiple words. For example, if the user enters "great Hyat", the current implementation would suggest "great Hyatt", just correcting the word "Hyatt", but there will not be any record matching this suggestion. How do I implement a complete-phrase spell check, so that it suggests "grand Hyatt" instead of "great Hyatt"?

Any suggestions in this regard will be helpful.

Thanks,
Kalyan Manepalli
Stuck threads on Weblogic
Hello guys,

I am getting some stuck threads in my application when it connects to Solr. The stuck threads occur at regular intervals, in such a way that every 3 days the app is online it hangs up the entire cluster. I don't know if there's any direct relation to Solr, but I get the following exception on some sparse connections the application makes to Solr. Is there any known bug about Solr writing wrong responses?

Nov 25, 2008 6:14:35 PM BRST Error HTTP localhost cluster0 [ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)' WLS Kernel 1227644075142 BEA-101083 Connection failure.
java.net.ProtocolException: Didn't meet stated Content-Length, wrote: '259' bytes instead of stated: '258' bytes.
at weblogic.servlet.internal.ServletOutputStreamImpl.ensureContentLength(ServletOutputStreamImpl.java:410)
at weblogic.servlet.internal.ServletResponseImpl.ensureContentLength(ServletResponseImpl.java:1358)
at weblogic.servlet.internal.ServletResponseImpl.send(ServletResponseImpl.java:1400)
at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1375)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:200)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:172)

--
Alexander Ramos Jardim
Re: Unknown field error using JDBC
This sounds exactly like the same issue I had when going from 1.3 to 1.4... it sounds like DIH is trying to automagically figure out the columns :-\

- Jon

On Nov 25, 2008, at 6:37 AM, Joel Karlsson wrote:
> Hello, I get an "unknown field" error when I'm indexing an Oracle DB. [...]
Re: port of Nutch CommonGrams to Solr for help with slow phrase queries
On Mon, 24 Nov 2008 13:31:39 -0500 Burton-West, Tom [EMAIL PROTECTED] wrote:
> The approach to this problem used by Nutch looks promising. Has anyone ported the Nutch CommonGrams filter to Solr?
> "Construct n-grams for frequently occurring terms and phrases while indexing. Optimize phrase queries to use the n-grams. Single terms are still indexed too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html

Tom, I haven't used Nutch's implementation, but we used the current (1.3) implementation of n-grams and shingles to address exactly the same issue (a database of music albums and tracks). We didn't notice any severe performance hit, but:
- the data set isn't huge (ca. 1 MM docs).
- it is reindexed nightly via DIH from MS-SQL, so we can use a separate cache layer to lower the number of hits to Solr.

B
_
{Beto|Norberto|Numard} Meijome
"Truth has no special time of its own. Its hour is now -- always." - Albert Schweitzer
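For reference, a field type along those lines might look like the sketch below. This assumes a shingle filter factory is available in your build; if your Solr version doesn't ship solr.ShingleFilterFactory, Lucene's contrib ShingleFilter can be wrapped in a small custom factory:

<fieldtype name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldtype>

With outputUnigrams="true", single terms are still indexed alongside the two-word grams, mirroring what Nutch's CommonGrams does for common phrases.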
Re: port of Nutch CommonGrams to Solr for help with slow phrase queries
On Wed, 26 Nov 2008 10:08:03 +1100 Norberto Meijome [EMAIL PROTECTED] wrote:
> We didn't notice any severe performance hit, but:
> - the data set isn't huge (ca. 1 MM docs).
> - it is reindexed nightly via DIH from MS-SQL, so we can use a separate cache layer to lower the number of hits to Solr.

To make this clear - there was a noticeable hit when we removed stop words, but the nature of the beast forced our hand.

b
_
{Beto|Norberto|Numard} Meijome
"Peace can only be achieved by understanding." - Albert Einstein
Increased garbage with Solr 1.3?
We are moving from Solr 1.1 to 1.3, and have noticed that 1.3 is working the garbage collector a lot more. Has anyone else seen this? wunder
Re: Increased garbage with Solr 1.3?
On Tue, Nov 25, 2008 at 7:56 PM, Walter Underwood [EMAIL PROTECTED] wrote: We are moving from Solr 1.1 to 1.3, and have noticed that 1.3 is working the garbage collector a lot more. Has anyone else seen this? During indexing or searching? Indexing uses the SolrDocument class as an intermediate form, so that would cause some greater GC there (actually, there have been a ton of indexing related changes in Lucene too). Not too much comes to mind for searching though. -Yonik
Re: Increased garbage with Solr 1.3?
Searching. No facets, but fuzzy matching. --wunder On 11/25/08 5:08 PM, Yonik Seeley [EMAIL PROTECTED] wrote: On Tue, Nov 25, 2008 at 7:56 PM, Walter Underwood [EMAIL PROTECTED] wrote: We are moving from Solr 1.1 to 1.3, and have noticed that 1.3 is working the garbage collector a lot more. Has anyone else seen this? During indexing or searching? Indexing uses the SolrDocument class as an intermediate form, so that would cause some greater GC there (actually, there have been a ton of indexing related changes in Lucene too). Not too much comes to mind for searching though. -Yonik
copyField stored values question
Hello,

I am using copyField to send the raw name of an entity into different fields for indexing:

# schema.xml snippet
<field name="raw_name" type="string" indexed="false" stored="true"/>
<field name="indexed_name" type="some_custom_type" indexed="true" stored="true"/>
<field name="other_indexed_name" type="some_other_type" indexed="true" stored="true"/>

<copyField source="raw_name" dest="indexed_name"/>
<copyField source="raw_name" dest="other_indexed_name"/>

I set the indexed fields to be stored so that I could see what exactly my custom types' filters produce. The Analyzer utility in the Admin webapp seems to apply the filters properly. However, query results against this index return the original raw_name value for both of the indexed fields.

Is it the expected behavior that copyField targets with stored=true always store the source value they were given? If so, is there any way to store the post-filtered target value instead?

Thanks,
Michael Henson
[EMAIL PROTECTED]
Re: copyField stored values question
On Tue, Nov 25, 2008 at 9:24 PM, Michael Henson [EMAIL PROTECTED] wrote: I set the indexed fields to be stored so that I could see what exactly my custom types' filters produce. In the Analyzer utility in the Admin webapp seems to apply the filters properly. However, query results against this index return the original raw_name value for both of the indexed fields. Stored fields are never modified. The output from analyzers is used for indexing purposes only. -Yonik
Re: CachedSqlEntityProcessor's purpose
On Tue, Nov 25, 2008 at 11:35 PM, Amit Nithian [EMAIL PROTECTED] wrote:
> 1) It seems that the CachedSqlEntityProcessor performs the where clause in memory on the cache. Is this cache an in-memory RDBMS or maps?

It is a HashMap in memory.

> 2) In the example, there were two use cases: one like query="select * from y where xid=${x.id}" and another where it's query="select * from y" with a separate where="xid=x.id" attribute. Is there any difference in how CachedSqlEntityProcessor behaves? Does it know to strip off the WHERE clause and simply cache the "select * from y"? [...]

It fetches all the rows using the 'query' first. The where="xid=x.id" (note: no ${} here) is evaluated against the map: all the xid values are kept as keys, and for each row returned by the parent entity a lookup is done on the map after evaluating the value of x.id as ${x.id}. For subsequent requests it then looks up the cache instead of querying the database.

--
--Noble Paul
Re: newbie question on SOLR distributed searches with many shards
Anything that is passed as a request parameter can be put into the SearchHandler's defaults or invariants section. This is equivalent to passing the shard URLs in the request. Note, however, that you may need to set up a load balancer if a shard has more than one host.

On Wed, Nov 26, 2008 at 12:25 AM, Gerald De Conto [EMAIL PROTECTED] wrote:
> Say I want to implement a solution using distributed searches with many shards in SOLR 1.3.0... Is there a way to specify in solrconfig.xml (or elsewhere) a list of the shard URLs to use? [...]

--
--Noble Paul
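A sketch of what that looks like (the host names and handler name are illustrative):

<!-- solrconfig.xml on the node that aggregates results -->
<requestHandler name="/distrib" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">shard1:8983/solr,shard2:8983/solr,shard3:8983/solr</str>
  </lst>
</requestHandler>

Requests to /distrib then fan out to every listed shard without the client passing a huge shards parameter on each URL; put the entry under invariants instead of defaults if clients must not be able to override the list.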
Facet Query and Query
I am having some trouble utilizing facet queries. As I understand it, the facet query (fq) has better performance than the simple query (q). Here is an example:

http://localhost:8080/test_solr/select?q=*:*&facet=true&fq=state:CA&facet.mincount=1&facet.field=city&facet.field=sector&facet.limit=-1&sort=score+desc

-- facet by sector and city for the state of CA. Any idea how to optimize this query to avoid q=*:*?

Thanks,
Jae