Re: WordDelimiterFilterFactory + CamelCase query
On Thu, Nov 18, 2010 at 3:22 PM, Peter Karich peat...@yahoo.de wrote:

Hi, Please add preserveOriginal=1 to your WDF [1] definition and reindex (or just try with the analysis page). but it is already there!?

<filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="1" generateNumberParts="1" catenateAll="0" preserveOriginal="1"/>

Regards, Peter.

Peter, I recently had this issue, and I had to set splitOnCaseChange=0 to keep the word delimiter filter from doing what you describe. Can you try that and see if it helps?

- Ken
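In schema.xml terms, the suggestion above amounts to something like the following (a sketch only; the other attributes are copied from the definition quoted in the thread):

```xml
<!-- keep camel-cased terms whole: splitOnCaseChange="0" -->
<filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
        generateWordParts="1" generateNumberParts="1" catenateAll="0"
        preserveOriginal="1" splitOnCaseChange="0"/>
```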
Re: Reindex Solr Using Tomcat
On Thu, Nov 18, 2010 at 3:33 PM, Eric Martin e...@makethembite.com wrote:

Hi, I searched Google and the wiki to find out how I can force a full re-index of all of my content, and I came up with zilch. My goal is to be able to adjust the weight settings, re-index my entire database, and then search my site and view the results of my weight adjustments. I am using Tomcat 5.x and Solr 1.4.1. Weird how I couldn't find this info. I must have missed it. Anyone know where to find it? Eric

Eric, How you wish to re-index Solr determines which method to use. You can either use the UpdateHandler via a POST of an XML file [1], or you can use the DataImportHandler (DIH) [2]. Other means exist, but these two should be sufficient to get started. How did you import your initial index in the first place?

[1] http://wiki.apache.org/solr/UpdateXmlMessages
[2] http://wiki.apache.org/solr/DataImportHandler
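For the UpdateHandler route, a minimal XML update message might look like the following (the field names here are placeholders, not from this thread); it would be POSTed to /update, followed by a <commit/>:

```xml
<add>
  <doc>
    <field name="id">example-1</field>
    <field name="title">An example document</field>
  </doc>
</add>
```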
Re: Reindex Solr Using Tomcat
On Thu, Nov 18, 2010 at 3:42 PM, Eric Martin e...@makethembite.com wrote:

Ah, I am using an ApacheSolr module in Drupal and used Nutch to insert the data into the Solr index. When I was using Jetty I could just delete the data contents over sshd and then restart the service, forcing the reindex. Currently, the ApacheSolr module for Drupal allows for a 200-record re-index every cron run, but that is too slow for me. During implementation and testing I would prefer to re-index the entire database, as I have over 400k records. I appreciate your help. My mind was searching for a command on the CLI that would just tell Solr to reindex the entire database and be done with it.

Eric, From what I could find, this looks to be your best bet: http://drupal.org/node/267543.

- Ken
Re: How do I format this query with 2 search terms?
2010/11/17 Jón Helgi Jónsson jonjons...@gmail.com:

I'm using index-time boosting and need to specify every field I want to search (not use copy fields) or else the boosting won't work. This query with one search term works fine, and the boosts look good:

http://localhost:8983/solr/select/?q=companyName:foo +descriptionTxt:verslun&fl=*%20score&rows=10&start=0

However, if I have 2 words in the query and do it like this, boosting seems not to be working:

http://localhost:8983/solr/select/?q=companyName:foo+bar +descriptionTxt:foo+bar&fl=*%20score&rows=10&start=0

It's probably using the default search field for the second word, which has no boosting configured. How do I go about this? Thanks, Jon

Jon, You have a few options here, depending on what you want to achieve with your query:

1. If you're trying to do a phrase query, you simply need to ensure that your phrases are quoted. The default behavior in SOLR is to split the phrase into multiple chunks. If a word is not preceded with a field definition, then SOLR will automatically apply the word(s) as if you had specified the default field. So for your example, SOLR would parse your query into companyName:foo defaultField:bar descriptionTxt:foo defaultField:bar.

2. You can use the dismax query plugin instead of the standard query plugin. You simply configure the dismax section of your solrconfig.xml to your liking - you define which fields to search, apply any special boosts for your needs, etc. (http://wiki.apache.org/solr/DisMaxQParserPlugin) - and then you simply feed the query terms without naming your fields (i.e., q=foo+bar), along with telling SOLR to use dismax (i.e., qt=whatever_you_named_your_dismax_handler).

3. If phrase queries are not important to you, you can manually prefix each term in your query with the field you wish to search; for example, you would do companyName:foo companyName:bar descriptionTxt:foo descriptionTxt:bar.
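Applied to the query above, option 1 might look roughly like this (shown unescaped for readability; "foo bar" is assumed to be the intended phrase):

```
q=companyName:"foo bar" +descriptionTxt:"foo bar"&fl=* score&rows=10&start=0
```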
Whichever way you decide to go, the best thing that you can do to understand SOLR and how it's working in your environment is to append debugQuery=on to the end of your URL; this tells SOLR to output information about how it parsed your query, how long each component took to run, and some other useful debugging information. It has come in handy several times here at work when I wanted to know why SOLR did (or did not) return the results I expected. I hope this helps.

- Ken
Re: ranged and boolean query
On Wed, Nov 17, 2010 at 10:39 AM, Peter Blokland pe...@desk.nl wrote:

hi. i'm using solr and am trying to limit my resultset to documents that either have a publication date in the range * to now, or have no publication date set at all (the field is not present). however, using this:

(pubdate:[* TO NOW]) OR (NOT pubdate:*)

gives me only the documents in the range * to now (reversing the two clauses has no effect). using only NOT pubdate:* gives me the correct set of documents (those not having a pubdate). any reason the OR does not work in this case? ps: also tried it like this: pubdate:([* TO NOW] OR (NOT *)), which gives the same result. -- CUL8R, Peter. www.desk.nl --- Sent from my NetBSD-powered Talkie Toaster™

Peter, Instead of using NOT, try simply prefixing the field name with a minus sign. This tells SOLR to exclude documents that have any value in that field. Otherwise, the word NOT may be treated as a term, and would be applied against your default field (which may or may not affect your results). So instead of (pubdate:[* TO NOW]) OR (NOT pubdate:*), you would write (pubdate:[* TO NOW]) OR (-pubdate:*).

- Ken
Re: ranged and boolean query
On Wed, Nov 17, 2010 at 11:00 AM, Peter Blokland pe...@desk.nl wrote:

hi, On Wed, Nov 17, 2010 at 10:54:48AM -0500, Ken Stanley wrote: [suggestion to replace NOT pubdate:* with -pubdate:*] tried that, it gives me exactly the same result... I can't really figure out what's going on. -- CUL8R, Peter. www.desk.nl --- Sent from my NetBSD-powered Talkie Toaster™

If you append your URL with debugQuery=on, it will tell you how SOLR parsed your query. What does your schema look like? And what does the debug query look like?
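The thread ends here, but for archive readers hitting the same wall: a purely negative clause generally cannot match anything on its own in Lucene's query syntax, so a commonly used workaround (not suggested in this thread) is to anchor the negation to the match-all query:

```
pubdate:[* TO NOW] OR (*:* -pubdate:*)
```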
Re: DIH for multilingual index multiValued field?
On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan iori...@yahoo.com wrote:

For (1) you probably need to write a custom transformer. Something like:

public Object transformRow(Map<String, Object> row) {
    String language_code = (String) row.get("language_code");
    String text = (String) row.get("text");
    if ("en".equals(language_code))
        row.put("text_en", text);
    else if ("fr".equals(language_code))
        row.put("text_fr", text);
    return row;
}

For (2), it is doable with the regex transformer:

<field column="mailId" splitBy="," sourceColName="emailids"/>

The 'emailids' field in the table can be a comma-separated value. So it ends up giving out one or more email ids, and we expect 'mailId' to be a multivalued field in Solr. [1]

[1] http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

In my opinion, this is a bit of overkill. Since the DIH supports multiple entities, with no real limit on the SQL queries, I think that the easiest (and less involved) approach would be to create three entities for the languages the OP wishes to index:

<entity name="english" query="SELECT * FROM documents WHERE language_code='en'" transformer="RegexTransformer">
    <field name="text_en" column="text" />
    <field name="tags" column="tags" splitBy="," />
</entity>
<entity name="french" query="SELECT * FROM documents WHERE language_code='fr'" transformer="RegexTransformer">
    <field name="text_fr" column="text" />
    <field name="tags" column="tags" splitBy="," />
</entity>
<entity name="chinese" query="SELECT * FROM documents WHERE language_code='zh'" transformer="RegexTransformer">
    <field name="text_zh" column="text" />
    <field name="tags" column="tags" splitBy="," />
</entity>

But I admit that depending on the future growth of languages, as well as other factors (i.e., needing more specific logic, etc.), a programmatic approach might be warranted. I would recommend, however, that the database table be a little more normalized. Your definition for tags is quite limiting, and could be better served using a many-to-many relationship.
Something like the following might serve you well:

CREATE TABLE documents (
    id INT NOT NULL AUTO_INCREMENT,
    language_code CHAR(2),
    tags CHAR(30),
    text TEXT,
    PRIMARY KEY (id)
);

CREATE TABLE document_tags (
    id INT NOT NULL AUTO_INCREMENT,
    tag CHAR(30),
    PRIMARY KEY (id)
);

CREATE TABLE document_tag_lookup (
    document_id INT NOT NULL,
    tag_id INT NOT NULL,
    PRIMARY KEY (document_id, tag_id)
);

Then in the DIH, you simply nest a second entity to look up the zero or more tags that might be associated with your documents; take the english entity from above:

<entity name="english" query="SELECT * FROM documents WHERE language_code='en'" transformer="RegexTransformer">
    <field name="text_en" column="text" />
    <entity name="english_tags" query="SELECT * FROM document_tags dt INNER JOIN document_tag_lookup dtl ON (dtl.tag_id = dt.id AND dtl.document_id='${english.id}')">
        <field name="tags" column="tag" />
    </entity>
</entity>

This would allow for growth, and is easy to maintain. Additionally, if you wanted to implement a custom transformer of your own, you could. As an aside, as a sort of compromise, you could also use the ScriptTransformer [1] to create a JavaScript function that does your language logic and creates the necessary fields, without having to maintain any custom Java code.

[1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

- Ken
Re: DIH for multilingual index multiValued field?
On Sat, Nov 13, 2010 at 5:59 PM, Ken Stanley doh...@gmail.com wrote:

CREATE TABLE documents (
    id INT NOT NULL AUTO_INCREMENT,
    language_code CHAR(2),
    tags CHAR(30),
    text TEXT,
    PRIMARY KEY (id)
);

I apologize, but I couldn't leave the typo in my last post without a follow-up; it might cause confusion. I copied the OP's original table definition and forgot to remove the tags field. My proposed definition for the documents table should be:

CREATE TABLE documents (
    id INT NOT NULL AUTO_INCREMENT,
    language_code CHAR(2),
    text TEXT,
    PRIMARY KEY (id)
);

- Ken
Re: scheduling imports and heartbeats
On Tue, Nov 9, 2010 at 10:16 PM, Tri Nguyen tringuye...@yahoo.com wrote:

Hi, Can I configure Solr to schedule imports at a specified time (say once a day, once an hour, etc.)? Also, does Solr have some sort of heartbeat mechanism? Thanks, Tri

Tri, If you use the DataImportHandler (DIH), you can set up a dataimport.properties file that can be configured to import on intervals: http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example

As for a heartbeat, you can use the ping handler (the default is /admin/ping) to check the status of the servlet.

- Ken
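A common alternative for scheduling (my assumption, not stated in the thread) is an OS-level cron entry that simply triggers the DIH endpoint on the desired interval, e.g.:

```
# hypothetical crontab entry: run a DIH full-import every day at 02:00
0 2 * * * curl -s "http://localhost:8983/solr/dataimport?command=full-import" > /dev/null
```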
Re: Best practice for emailing this list?
On Wed, Nov 10, 2010 at 1:11 PM, robo - robom...@gmail.com wrote:

How do people email this list without getting spam filter problems?

It depends on which side of the spam filter you're referring to. To keep these emails out of my own spam filter, I added a rule to Gmail that says "Never send to spam". As for when I send emails, I make sure that I send them as plain text to avoid getting bounce-backs.

- Ken
Re: dynamically create unique key
On Tue, Nov 9, 2010 at 10:39 AM, Christopher Gross cogr...@gmail.com wrote:

I'm trying to use Solr to store information from a few different sources in one large index. I need to create a unique key for the Solr index that will be unique per document. If I have 3 systems, and they all have a document with id=1, then I need to create a uniqueId field in my schema that contains both the system name and that id, along the lines of: sysa1, sysb1, and sysc1. That way, each document will have a unique id. I added this to my schema.xml:

<copyField source="source" dest="uniqueId"/>
<copyField source="id" dest="uniqueId"/>

However, after trying to insert, I got this:

java.lang.Exception: ERROR: multiple values encountered for non multiValued copy field uniqueId: sysa

So instead of just appending to the uniqueId field, it treated it as multiValued. Does anyone have an idea on how I can make this work? Thanks! -- Chris

Chris, How you insert your documents into SOLR determines how to create your unique field. If you are POST'ing the data via HTTP, then you are responsible for building your unique id (i.e., your program/language would use string concatenation to add the unique id to the output before it gets to the update handler in SOLR). If you're using the DataImportHandler, then you can use the TemplateTransformer (http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer) to dynamically build your unique id at document insertion time. For example, we here at bizjournals use SOLR and the DataImportHandler to index our documents. Like you, we run the risk of two or more ids clashing, and thus overwriting a different type of document. As such, we take two or three different fields and combine them using the TemplateTransformer to generate a more unique id for each document we index. With respect to the multiValued option, that is used more for an array-like structure within a field.
For example, if you have a blog entry with multiple tag keywords, you would probably want a field in SOLR that can contain the various tag keywords for each blog entry; this is where multiValued comes in handy. I hope that this helps to clarify things for you. - Ken Stanley
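For the DIH route mentioned above, the TemplateTransformer approach might be sketched like this (the entity, column, and field names here are hypothetical):

```xml
<entity name="docs" transformer="TemplateTransformer"
        query="SELECT source, id, title FROM documents">
  <!-- concatenates the system name and id, e.g. "sysa1" -->
  <field column="uniqueId" template="${docs.source}${docs.id}" />
</entity>
```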
Re: dynamically create unique key
On Tue, Nov 9, 2010 at 10:53 AM, Christopher Gross cogr...@gmail.com wrote:

Thanks Ken. I'm using a script with Java/SolrJ to copy documents from their original locations into the Solr index. I wasn't sure if the copyField would help me, but from your answers it seems that I'll have to handle it on my own. That's fine -- it is definitely not hard to pass a new field myself. I was just thinking that there should be an easy way to have Solr build the unique field, since it was getting everything anyway. I was just confused as to why I was getting a multiValued error, since I was just trying to append to a field. I wasn't sure if I was missing something. Thanks again! -- Chris

Chris, I definitely understand your sentiment. The thing to keep in mind with SOLR is that it really has limited logic mechanisms; in fact, unless you're willing to use the DataImportHandler (DIH) and the ScriptTransformer, you really have no logic. The copyField directive in schema.xml is mainly used to help you easily copy the contents of one field into another so that it may be indexed in multiple ways; for example, you can index a string so that it is stored literally (i.e., "Hello World"), parsed using a whitespace tokenizer (i.e., "Hello", "World"), or parsed with an nGram tokenizer (i.e., "H", "He", "Hel", ...). This is beneficial because you don't have to explicitly define each possible instance in your data stream. You just define the field once, and SOLR is smart enough to copy it where it needs to go. Glad to have helped. :)

- Ken
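A sketch of that copyField idea in schema.xml (the field and type names are made up for illustration):

```xml
<field name="title" type="string" indexed="true" stored="true"/>
<field name="title_tokens" type="text_ws" indexed="true" stored="false"/>
<field name="title_ngrams" type="text_ngram" indexed="true" stored="false"/>
<!-- one incoming field, indexed three different ways -->
<copyField source="title" dest="title_tokens"/>
<copyField source="title" dest="title_ngrams"/>
```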
Re: spell check vs terms component
On Tue, Nov 9, 2010 at 1:02 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Tue, Nov 9, 2010 at 8:20 AM, bbarani bbar...@gmail.com wrote:

Hi, We are trying to implement an auto-suggest feature in our application. I would like to know the difference between the terms vs. spell check components. Both handlers seem to display almost the same output; can anyone let me know the difference, and also when to go for spell check and when to go for the terms component?

SpellCheckComponent is designed to operate on whole words and not partial words, so I don't know how well it will work for auto-suggest, if at all. As far as differences between SpellCheckComponent and TermsComponent are concerned, TermsComponent is a straight prefix match whereas SCC takes edit distance into account. Also, SCC can deal with phrases composed of multiple words and also gives back a collated suggestion. -- Regards, Shalin Shekhar Mangar.

An alternative to using the SpellCheckComponent and/or the TermsComponent would be the (Edge)NGram filter. Basically, this filter breaks words down into auto-suggest-friendly tokens (i.e., "Hello" = "H", "He", "Hel", "Hell", "Hello") that work great for auto-suggestion querying.

Here is an article from Lucid Imagination on using the ngram filter: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

Here is the SOLR wiki entry for the filter: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

- Ken Stanley
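A minimal fieldType along those lines might look like the following (the type name and gram sizes are arbitrary; only the index-side analyzer produces the edge n-grams, so queries match as plain prefixes):

```xml
<fieldType name="text_autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "hello" is indexed as h, he, hel, hell, hello -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```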
Re: Fixed value in dataimporthandler
On Mon, Nov 8, 2010 at 3:50 PM, Renato Wesenauer renato.wesena...@gmail.com wrote:

Hi Ahmet Arslan, I'm using this in schema.xml:

<field name="secao" type="cleannormalized_text" indexed="true" stored="true"/>
<field name="indativo" type="boolean" indexed="true" stored="true"/>

I'm using this in the dataimporthandler:

<field column="secao" xpath="/ROW/NomeSecaoMix" />
<field column="indativo" template="0" />

The indexing process works correctly, but something wrong is happening with the results of queries. All queries with some field with 2 words or more, plus the field indativo:true, return no results. Example queries:

1º) secao:accessories for cars AND indativo:true
2º) secao:accessories for cars AND indativo:false

The first query returns 0 results, but there are 40,000 documents indexed with these fields. The second query returns 300,000 documents, but 300,000 is the total of documents for the query secao:celular e telefonia; the correct number would be 260,000. Another example:

1º) secao:toys AND indativo:true
2º) secao:toys AND indativo:false

In this example, the two queries work correctly. The problem happens with values of 2 words or more, plus the indativo field. Do you know what can be happening? Thank you, Renato F. Wesenauer

Renato, Correct me if I'm wrong, but you have an entity that explicitly sets a false value for the indativo field. And when you query, is your intention to find the documents that were not indexed through that entity? The way that I read your question, you are expecting the indativo field to be true by default, but I do not see where you're explicitly stating that in your schema. The reason that I bring this up is - and I could be wrong - I would think that if you do not set a value in SOLR, then it doesn't exist (either in the schema, or during indexing). If you are expecting the entries where indativo was not explicitly set to false to come back as true, you might need to tweak your schema so that the field definition defaults to true.
Is it possible to try adding the default attribute to your field definition and reindexing, to see if that gives you what you're looking for?

- Ken Stanley

PS. If this came through twice, I apologize; I got a bounce-back saying my original reply was blocked, so I'm trying to re-send as plain text.
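Concretely, the suggestion amounts to something like this in schema.xml (a sketch; whether true is the right default depends on the data):

```xml
<!-- documents indexed without an explicit value get default="true" -->
<field name="indativo" type="boolean" indexed="true" stored="true" default="true"/>
```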
Re: Tomcat special character problem
On Sun, Nov 7, 2010 at 9:11 AM, Em mailformailingli...@yahoo.de wrote:

Hi List, I have an issue with my Solr environment in Tomcat. First: I am not very familiar with Tomcat, so it might be my fault and not Solr's. It cannot be a Solr-side configuration problem, since everything worked fine with my local Jetty servlet container. However, when I deploy into Tomcat, several special characters are shown in their UTF-8 representation. Example: göteburg will be displayed as

<str name="q">göteburg</str>

when it comes to search. I tried the following within my server.xml file:

<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="2" redirectPort="8443" URIEncoding="UTF-8" />

And restarted Tomcat afterwards. The problem only occurs when I try to search for something. It is no problem to index that data. Thank you for any help! Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1857648.html Sent from the Solr - User mailing list archive at Nabble.com.

That is definitely odd. When I tried copying göteburg and doing a manual query in my web browser, everything worked. How are you making the request to SOLR? When I viewed the properties/info of the results, my returned charset was UTF-8. Can you confirm similar for you? When I grepped for UTF-8 in both my SOLR and Tomcat configs, nothing stood out as a special configuration option.
Re: Tomcat special character problem
On Sun, Nov 7, 2010 at 9:34 AM, Em mailformailingli...@yahoo.de wrote:

Hi Ken, thank you for your quick answer! To make sure that no mistakes occur on my application's side, I send my requests with the form that is available at solr/admin/form.jsp. I changed almost nothing from the example configurations within the example package, except some auto-commit params. All the special characters within the results are displayed correctly, and so far they were also indexed correctly. The only problem is querying with special characters. I can confirm that the page is encoded in UTF-8 within my browser. Is there a possibility that Tomcat did not use the UTF-8 URIEncoding? Maybe I should say that Tomcat is behind an Apache httpd server and is mounted by a jk_mount. Thank you!

I am not familiar with using your type of setup, but a quick Google search suggested using a second connector on a different port. If you're using mod_jk, you can try setting JkOptions +ForwardURICompatUnparsed to see if that helps (http://markstechstuff.blogspot.com/2008/02/utf-8-problem-between-apache-and-tomcat.html). Sorry I couldn't have been more help. :)

- Ken
Re: querying multiple fields as one
On Thu, Nov 4, 2010 at 8:21 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

Hi all, having two fields named 'type' and 'cat' with identical type and options, but different values recorded, would it be possible to query them as if they were one field? For instance, q=type:electronics cat:electronics should return the same results as q=common:electronics. I know I could do it by defining a third field 'common' with copyFields from 'type' and 'cat' to 'common', but this wouldn't be feasible if you already have lots of documents in your index and don't want to reindex everything, would it? Any suggestions? Thanks in advance, Tommaso

Tommaso, If re-indexing is not feasible/preferred, you might try looking into creating a dismax handler, which should give you what you're looking for in your query: http://wiki.apache.org/solr/DisMaxQParserPlugin. The sample solrconfig.xml that comes with SOLR has a dismax parser that you can modify to your needs.

- Ken Stanley
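A dismax handler for this case might be sketched as follows in solrconfig.xml (the handler name is arbitrary; it would be selected with qt=common):

```xml
<requestHandler name="common" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- search both fields as if they were one -->
    <str name="qf">type cat</str>
  </lst>
</requestHandler>
```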
Re: Phrase Query Problem?
On Tue, Nov 2, 2010 at 8:19 AM, Erick Erickson erickerick...@gmail.com wrote:

That's not the response I get when I try your query, so I suspect something's not quite right with your test... But you could also try putting parentheses around the words, like mykeywords:(Compliance+With+Conduct+Standards). Best, Erick

I agree with Erick; your query string showed quotes, but your parsed query did not. Using quotes, or parentheses, would pretty much leave your query alone. There is one exception that I've found: if you use a stopword analyzer, any stop words are converted to ? in the parsed query. So if you absolutely need every single word to match, regardless, you cannot use a field type that uses the stop word analyzer. For example, I have two dynamic field definitions: df_text_* that does the default text transformations (including stop words), and df_text_exact_* that does nothing (field type is string). When I run the query df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America", the following is shown as my query/parsed query when debugQuery is on:

<str name="rawquerystring">df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America"</str>
<str name="querystring">df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank of America"</str>
<str name="parsedquery">df_text_exact_company_name:Bank of America PhraseQuery(df_text_company_name:"bank ? america")</str>
<str name="parsedquery_toString">df_text_exact_company_name:Bank of America df_text_company_name:"bank ? america"</str>

The difference is subtle, but important. If I were to query df_text_company_name:"Bank and America", I would still match "Bank of America". These are things that you should keep in mind when you are creating fields for your indices. A useful tool for seeing what SOLR does to your query terms is the Analysis tool found in the admin panel.
You can run an analysis on either a specific field or a field type, and you will see a breakdown, analyzer by analyzer, of any query that you put in, for the index side, the query side, or both. This is definitely useful when trying to determine why SOLR returns what it does.

- Ken
Highlighting and maxBooleanClauses limit
Is the behavior of highlighting in SOLR intended to be held to the same restrictions (maxBooleanClauses) as the query parser (even though the highlighting query is built internally)?

I am not a SOLR expert by any measure of the word, and as such, I just don't understand how two words on one field (as noted by the use of hl.fl=df_text_content + hl.requireFieldMatch=true + hl.usePhraseHighlighter=true) could somehow exceed the limits of both 1024 and 2048. I am concerned that even if I continue increasing maxBooleanClauses, I am not actually solving anything; in fact, my concern is that if I were to keep increasing this limit, I am in fact begging for problems later on down the road.

For the sake of completeness, here is the definition of the field I'm highlighting on (schema.xml):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms/synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
  </analyzer>
</fieldType>

<dynamicField name="df_text_*" type="text" indexed="true" stored="true" />

<solrQueryParser defaultOperator="OR" />

And here is my highlighter definition (solrconfig.xml):

<highlighting>
  <!-- Configure the standard fragmenter -->
  <!-- This could most likely be commented out in the default case -->
  <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
      <int name="hl.fragsize">255</int>
    </lst>
  </fragmenter>
  <!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">70</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float>
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
    </lst>
  </fragmenter>
  <!-- Configure the standard formatter -->
  <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
      <str name="hl.simple.pre"><![CDATA[<em>]]></str>
      <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
  </formatter>
</highlighting>

It is worth noting that I have not done anything (except formatting) to the highlighting configuration in solrconfig.xml. Any help, assistance, and/or guidance that can be provided would be greatly appreciated.

Thank you,
Ken Stanley

It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhiker's Guide to the Galaxy
Re: Highlighting and maxBooleanClauses limit
On Tue, Nov 2, 2010 at 11:26 AM, Koji Sekiguchi k...@r.email.ne.jp wrote:

(10/11/02 23:14), Ken Stanley wrote: I've noticed in the stack trace that this exception occurs when trying to build the query for the highlighting; I've confirmed this by copying the params and changing hl=true to hl=false. Unfortunately, when using debugQuery=on, I do not see any details on what is going on with the highlighting portion of the query (after artificially increasing maxBooleanClauses so the query will run). With all of that said, my question(s) to the list are: Is there a way to determine how exactly the highlighter is building its query (i.e., some sort of highlighting debug setting)?

Basically I think the highlighter uses the main query, but tries to rewrite it before highlighting.

Is the behavior of highlighting in SOLR intended to be held to the same restrictions (maxBooleanClauses) as the query parser (even though the highlighting query is built internally)?

I think so, because maxBooleanClauses is a static variable. I saw your stack trace and glanced at the highlighter source; my assumption is that the highlighter tried to rewrite (expand) your range queries into a boolean query, even though you set requireFieldMatch to true. Can you try to query without the range query? If the problem goes away, I think it is a highlighter bug. The highlighter should skip the range query when the user sets requireFieldMatch to true, because your range query is on another field. If so, please open a JIRA issue. Koji -- http://www.rondhuit.com/en/

Koji, that is most excellent. Thank you for pointing out that the range queries were causing the highlighter to exceed maxBooleanClauses. Once I removed them from my main query (and moved them into separate filter queries), SOLR and highlighting worked as I expected. Per your suggestion, I have opened a JIRA ticket (SOLR-2216) for this problem.
I am somewhat of a novice at Java, and I have not yet had the pleasure of getting the SOLR sources into my working environment, but I would be more than eager to assist in finding a solution - with maybe some mentoring from a more experienced developer. Anyway, thank you again; I am very excited to have a suitable workaround for the time being.

- Ken Stanley
Re: Phrase Query Problem?
On Mon, Nov 1, 2010 at 10:26 PM, Tod listac...@gmail.com wrote:

I have a number of fields I need to do an exact match on. I've defined them as 'string' in my schema.xml. I've noticed that I get back query results that don't have all of the words I'm using to search with. For example:

q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json

Should, with an exact match, return only one entry, but it returns five, some of which don't have any of the fields I've specified. I've tried this both with and without quotes. What could I be doing wrong? Thanks - Tod

Tod, Without knowing your exact field definition, my first guess would be your first boolean clause; because it is not quoted, SOLR typically transforms that type of query into something like (assuming your default search field is id): (mykeywords:Compliance id:With id:Conduct id:Standards). If you do (mykeywords:"Compliance With Conduct Standards") you might see different (better?) results. Otherwise, append debugQuery=on to your URL and you can see exactly how SOLR is parsing your query. If none of that helps, what is your field definition in your schema.xml?

- Ken
Re: indexing '-
On Sun, Oct 31, 2010 at 12:12 PM, PeterKerk vettepa...@hotmail.com wrote: I have a city named 's-Hertogenbosch. I want it to be indexed exactly like that, so 's-Hertogenbosch (without quotes). But now I get:

<lst name="city">
  <int name="hertogenbosch">1</int>
  <int name="s">1</int>
  <int name="shertogenbosch">1</int>
</lst>

What filter should I add/remove from my field definition? I already tried a new fieldtype with just this, but no luck:

<fieldType name="exacttext" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>

My schema.xml:

<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<field name="city" type="textTight" indexed="true" stored="true"/>

-- View this message in context: http://lucene.472066.n3.nabble.com/indexing-tp1816969p1816969.html Sent from the Solr - User mailing list archive at Nabble.com. For exact text, you should try using either the string type, or a type that only uses the KeywordTokenizer. Other field types may perform transformations on the text similar to what you are seeing. - Ken
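A minimal sketch of the KeywordTokenizer suggestion above (the type name and the reuse of the city field are illustrative assumptions; reindexing is still required after the change):

```xml
<fieldType name="exacttext" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- KeywordTokenizerFactory emits the whole field value as one token:
         no splitting on whitespace, the hyphen, or the leading apostrophe,
         so 's-Hertogenbosch survives intact. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="city" type="exacttext" indexed="true" stored="true"/>
```

Unlike the plain string type, this form still lets you append filters (e.g. a lowercase filter) later without giving up the single-token behavior.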
Re: If I want to move a core from one physical machine to another....
On Wed, Oct 27, 2010 at 6:12 PM, Ron Mayer r...@0ape.com wrote: If I want to move a core from one physical machine to another, is it as simple as just scp -r core5 otherserver:/path/on/other/server/ and then adding <core name="core5name" instanceDir="core5" /> to that other server's solr.xml file and restarting the server there? PS: Should I have been able to figure the answer to that out by RTFM somewhere? Ron, In our current environment I index all of our data on one machine, and to save time with replication, I use scp to copy the data directory over to our other servers. On the server that I copy from, I don't turn SOLR off, but on the servers that I copy to, I shut down tomcat; remove the data directory; mv in the data directory I scp'd from the source; and turn tomcat back on. I do it this way (especially with mv, versus cp) because it is the fastest way to get the data onto the other servers. And, as Gora pointed out, you need to make sure that your configuration files (specifically the schema.xml) match the source. - Ken
Re: If I want to move a core from one physical machine to another....
On Thu, Oct 28, 2010 at 8:07 AM, Ephraim Ofir ephra...@icq.com wrote: How is this better than replication? Ephraim Ofir It's not; for our needs here, we have not set up replication through SOLR. We are working through OOM problems/performance tuning first, then best practices second. I just wanted the OP to know that it can be done, and how we do it. :)
Re: Looking for Developers
On Thu, Oct 28, 2010 at 2:57 PM, Michael McCandless luc...@mikemccandless.com wrote: I don't think we should do this until it becomes a real problem. The number of job offers is tiny compared to dev emails, so far, as far as I can tell. Mike By the time that it becomes a real problem, it would be too late to get people to stop spamming the -user mailing list; no? - Ken
Re: How do I this in Solr?
On Tue, Oct 26, 2010 at 9:15 AM, Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com wrote: If I get your question right, you probably want to use the AND binary operator, as in samsung AND android AND GPS or +samsung +android +GPS N.b. For these queries you can also pass the q.op parameter in the request to temporarily change the default operator to AND; this has the same effect without having to build the query; i.e., you can just pass http://host:port/solr/select?q=samsung+android+gps&q.op=AND as the query string (along with any other params you need).
Re: ClassCastException Issue
On Mon, Oct 25, 2010 at 2:45 AM, Alex Matviychuk alex...@gmail.com wrote: Getting this when deploying to tomcat: [INFO][http-4443-exec-3][solr.schema.IndexSchema] readSchema():394 Reading Solr Schema [INFO][http-4443-exec-3][solr.schema.IndexSchema] readSchema():408 Schema name=tsadmin [ERROR][http-4443-exec-3][util.plugin.AbstractPluginLoader] log():139 java.lang.ClassCastException: org.apache.solr.schema.StrField cannot be cast to org.apache.solr.schema.FieldType at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:419) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:447) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:456) at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:520) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137) solr schema:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="tsadmin" version="1.2">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    ...
  </types>
  <fields>
    <field name="type" type="string" required="true"/>
    ...
  </fields>
</schema>

Any ideas? Thanks, Alex Matviychuk Alex, I've run into this issue myself, and it was because I tried to create a fieldType called string (like you). Rename string to something else and the exception should go away. - Ken
Re: DataImporter using pure solr add XML
On Mon, Oct 25, 2010 at 10:12 AM, Dario Rigolin dario.rigo...@comperio.it wrote: Looking at DataImporter I'm not sure if it's possible to import using a standard <add><doc>... xml document representing a document add operation. Generating <add><doc> is quite expensive in my application, and I have cached all those documents into a text column in a MySQL database. It would be easier for me to push all updated documents directly from the database instead of passing via multiple xml files posted in stream mode to Solr. Thank you. Dario. Dario, Technically nothing is stopping you from using the DIH to import your XML document(s). However, note that the <add><doc>...</doc></add> structure is not required. In fact, you can make up your own structure for the documents, so long as you configure the DIH to recognize them. At minimum, you should be able to use something to the effect of:

<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
  <entity name="some_unique_name_for_the_entity" rootEntity="false" dataSource="null"
          processor="FileListEntityProcessor" fileName="some_regex_matching_your_files.*\.xml$"
          baseDir="/path/to/xml/files"
          newerThan="${dataimporter.some_unique_name_for_the_entity.last_index_time}">
    <entity name="another_unique_entity_name" dataSource="some_unique_name_for_the_entity"
            processor="XPathEntityProcessor" url="${some_unique_name_for_the_entity.fileAbsolutePath}"
            forEach="/XMLROOT/CHILD_NODE" stream="true">
      <!-- An optional list of <field /> definitions if your XML schema does not match that of SOLR -->
    </entity>
  </entity>
</document>

The breakdown is as follows: The <dataSource /> defines the document encoding that SOLR should use for your XML files. The top-level <entity /> creates the list of files to parse (hence why the fileName attribute supports regex expressions). The dataSource attribute needs to be set to null here (I'm using 1.4.1, and AFAIK this is the same in 1.3 as well). The rootEntity="false" is important to tell SOLR that it should not try to define fields from this entity. 
The second-level <entity /> is where the documents found in the file list are processed and parsed. The dataSource attribute needs to be the name of the top-level <entity />. The url attribute is defined as the absolute path to the file generated by the top-level entity. The forEach is the key component here; this is the minimum XPath needed to iterate over your document structure. So, for example, you might have:

<XMLROOT>
  <CHILD_NODE>
    <field1>data</field1>
    <field2>more data</field2>
    ...
  </CHILD_NODE>
</XMLROOT>

Also note that, in my experience, case sensitivity matters when parsing your XPath instructions. I hope this helps! - Ken Stanley
Re: xpath processing
On Fri, Oct 22, 2010 at 11:52 PM, pghorp...@ucla.edu wrote: dataConfig dataSource name=myfilereader type=FileDataSource/ document entity name=f rootEntity=false dataSource=null processor=FileListEntityProcessor fileName=.*xml recursive=true baseDir=C:\data\sample_records\mods\starr entity name=x dataSource=myfilereader processor=XPathEntityProcessor url=${f.fileAbsolutePath} stream=false forEach=/mods transformer=DateFormatTransformer,RegexTransformer,TemplateTransformer field column=id template=${f.file}/ field column=collectionKey template=starr/ field column=collectionName template=starr/ field column=fileAbsolutePath template=${f.fileAbsolutePath}/ field column=fileName template=${f.file}/ field column=fileSize template=${f.fileSize}/ field column=fileLastModified template=${f.fileLastModified}/ field column=classification_keyword xpath=/mods/classification/ field column=accessCondition_keyword xpath=/mods/accessCondition/ field column=nameNamePart_s xpath=/mods/name/namepa...@type = 'date'] / /entity /entity /document /dataConfig The documentation says you don't need a dataSource for your XPathEntityProcessor entity; in my configuration, I have mine set to the name of the top-level FileListEntityProcessor. Everything else looks fine. Can you provide one record from your data? Also, are you getting any errors in your log? - Ken
Re: xpath processing
Parinita, In its simplest form, what does your entity definition for DIH look like; also, what does one record from your xml look like? We need more information before we can really be of any help. :) - Ken It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy On Fri, Oct 22, 2010 at 8:00 PM, pghorp...@ucla.edu wrote: Quoting pghorp...@ucla.edu: Can someone help me please? I am trying to import mods xml data in solr using the xml/http datasource This does not work with XPathEntityProcessor of the data import handler xpath=/mods/name/namepa...@type = 'date'] I actually have 143 records with type attribute as 'date' for element namePart. Thank you Parinita
Re: boosting injection
Andrea, Using the SOLR dismax query handler, you could set up queries like this to boost on fields of your choice. Basically, the q parameter would contain the query terms (without the field definitions), and the qf (Query Fields) parameter is what you use to define your boost(s): http://wiki.apache.org/solr/DisMaxQParserPlugin. A non-SOLR alternative would be to parse the query in whatever application is sending the queries to the SOLR instance and make the necessary transformations there. Regards, Ken It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy On Tue, Oct 19, 2010 at 8:48 AM, Andrea Gazzarini andrea.gazzar...@atcult.it wrote: Hi all, I have a client that is sending this query: q=title:history AND author:joyce Is it possible to transform this query at runtime in this way: q=title:history^10 AND author:joyce^5 ? Best regards, Andrea
Re: **SPAM** Re: boosting injection
Andrea, Another approach, aside from Markus' suggestion, would be to create your own handler that could intercept the query and perform whatever transformations you need at query time. However, that would require having Java knowledge (which I make no assumption of). Regards, Ken It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy On Tue, Oct 19, 2010 at 10:23 AM, Andrea Gazzarini andrea.gazzar...@atcult.it wrote: Hi Ken, thanks for your response... unfortunately it doesn't solve my problem. I cannot change the client behaviour, so the query must be a query and not only the query terms. In this scenario, it would be great, for example, if I could declare the boost in the schema field definition... but I think it's not possible, isn't it? Regards, Andrea
Re: Documents and Cores, take 2
Ron, In the past I've worked with SOLR for a product that required the ability to search - separately - for companies, people, business lists, and a combination of the previous three. In designing this in SOLR, I found that using a combination of explicit field definitions and dynamic fields ( http://wiki.apache.org/solr/SchemaXml#Dynamic_fields) gave me the best possible solution for the problem. In essence, I created explicit fields that would be shared among all document types: a unique id, a document type, an indexed date, a modified date, and maybe a couple of other fields that share traits with all document types (i.e., name, a market specific to our business, etc). The unique id was built as a string, prefixed with the document type and ending with the unique id from the database. The dynamic fields can be configured to be as flexible as you need, and in my experience I would strongly recommend documenting each type of dynamic field for each of your document types as a reference for your developers (and yourself). :) This allows us to build queries that are focused on specific document types, or that combine all of the types into a super search. For example, you could do something to the effect of: (docType:people) AND (df_firstName:John AND df_lastName:Hancock), (docType:companies) AND (df_BusinessName:Acme+Inc), or even ((df_firstName:John AND df_lastName:Hancock) OR (df_BusinessName:Acme+Inc)). I hope this helps! - Ken It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy On Tue, Oct 19, 2010 at 4:57 PM, Olson, Ron rol...@lbpc.com wrote: Hi all- I have a newbie design question about documents, especially with SQL databases. I am trying to set up Solr to go against a database that, for example, has items and people. 
The way I see it, and I don't know if this is right or not (thus the question), is that I see both as separate documents as an item may contain a list of parts, which the user may want to search, and, as part of the item, view the list of people who have ordered the item. Then there's the actual people, who the user might want to search to find a name and, consequently, what items they ordered. To me they are both top level things, with some overlap of fields. If I'm searching for people, I'm likely not going to be interested in the parts of the item, while if I'm searching for items the likelihood is that I may want to search for 42532 which is, in this instance, a SKU, and not get hits on the zip code section of the people. Does it make sense, then, to separate these two out as separate documents? I believe so because the documentation I've read suggests that a document should be analogous to a row in a table (in this case, very de-normalized). What is tripping me up is, as far as I can tell, you can have only one document type per index, and thus one document per core. So in this example, I have two cores, items and people. Is this correct? Should I embrace the idea of having many cores or am I supposed to have a single, unified index with all documents (which doesn't seem like Solr supports). The ultimate question comes down to the search interface. I don't necessarily want to have the user explicitly state which document they want to search; I'd like them to simply type 42532 and get documents from both cores, and then possibly allow for filtering results after the fact, not before. As I've only used the admin site so far (which is core-specific), does the client API allow for unified searching across all cores? Assuming it does, I'd think my idea of multiple-documents is okay, but I'd love to hear from people who actually know what they're doing. :) Thanks, Ron BTW: Sorry about the problem with the previous message; I didn't know about thread hijacking. 
DISCLAIMER: This electronic message, including any attachments, files or documents, is intended only for the addressee and may contain CONFIDENTIAL, PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended recipient, you are hereby notified that any use, disclosure, copying or distribution of this message or any of the information included in or with it is unauthorized and strictly prohibited. If you have received this message in error, please notify the sender immediately by reply e-mail and permanently delete and destroy this message and its attachments, along with any copies thereof. This message does not create any contractual obligation on behalf of the sender or Law Bulletin Publishing Company. Thank you.
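The shared-plus-dynamic-fields design Ken describes above can be sketched as a schema.xml fragment; every name and type below is illustrative, not a prescribed layout:

```xml
<fields>
  <!-- Explicit fields shared by every document type -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="docType" type="string" indexed="true" stored="true" required="true"/>
  <field name="indexedDate" type="date" indexed="true" stored="true"/>
  <field name="modifiedDate" type="date" indexed="true" stored="true"/>
  <!-- Type-specific fields (df_firstName, df_BusinessName, ...) match a dynamic pattern -->
  <dynamicField name="df_*" type="text" indexed="true" stored="true"/>
</fields>
<!-- The unique key carries the document type as a prefix, e.g. "people-12345" -->
<uniqueKey>id</uniqueKey>
```

A search scoped to one type then filters on docType, while the "super search" simply omits that clause, as in the example queries above.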
Re: SOLR DateTime and SortableLongField field type problems
Just following up to see if anybody might have some words of wisdom on the issue? Thank you, Ken It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy On Fri, Oct 15, 2010 at 6:42 PM, Ken Stanley doh...@gmail.com wrote: Hello all, I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to follow the advice from http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.htmlabout converting date fields to SortableLong fields for better memory efficiency. However, whenever I try to do this using the DateFormater, I get exceptions when indexing for every row that tries to create my sortable fields. In my schema.xml, I have the following definitions for the fieldType and dynamicField: fieldType name=sdate class=solr.SortableLongField indexed=true stored=false sortMissingLast=true omitNorms=true / dynamicField name=sort_date_* type=sdate stored=false indexed=true / In my dih.xml, I have the following definitions: dataConfig dataSource type=FileDataSource encoding=UTF-8 / entity name=xml_stories rootEntity=false dataSource=null processor=FileListEntityProcessor fileName=legacy_stories.*\.xml$ recursive=false baseDir=/usr/local/extracts newerThan=${dataimporter.xml_stories.last_index_time} entity name=stories pk=id dataSource=xml_stories processor=XPathEntityProcessor url=${xml_stories.fileAbsolutePath} forEach=/RECORDS/RECORD stream=true transformer=DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer onError=continue field column=_modified_date xpath=/RECORDS/RECORD/pr...@name='R_ModifiedTime']/PVAL / field column=modified_date sourceColName=_modified_date dateTimeFormat=-MM-dd'T'hh:mm:ss'Z' / field column=_df_date_published xpath=/RECORDS/RECORD/pr...@name='R_StoryDate']/PVAL / field column=df_date_published sourceColName=_df_date_published dateTimeFormat=-MM-dd'T'hh:mm:ss'Z' / field column=sort_date_modified 
sourceColName=modified_date dateTimeFormat=MMddhhmmss / field column=sort_date_published sourceColName=df_date_published dateTimeFormat=MMddhhmmss / /entity /entity /document /dataConfig The fields in question are in the formats: RECORDS RECORD PROP NAME=R_StoryDate PVAL2001-12-04T00:00:00Z/PVAL /PROP PROP NAME=R_ModifiedTime PVAL2001-12-04T19:38:01Z/PVAL /PROP /RECORD /RECORDS The exception that I am receiving is: Oct 15, 2010 6:23:24 PM org.apache.solr.handler.dataimport.DateFormatTransformer transformRow WARNING: Could not parse a Date field java.text.ParseException: Unparseable date: Wed Nov 28 21:39:05 EST 2007 at java.text.DateFormat.parse(DateFormat.java:337) at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89) at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) I know that it has to be the SortableLong fields, because if I remove just those two lines from my dih.xml, everything imports as I expect it to. Am I doing something wrong? Mis-using the SortableLong and/or DateTransformer? Is this not supported in my version of SOLR? 
I'm not very experienced with Java, so digging into the code would be a lost cause for me right now. I was hoping that somebody here might be able to help point me in the right/correct direction. It should be noted that the modified_date and df_date_published fields index just fine (so long as I do it as I've defined above). Thank you, - Ken
Re: SOLR DateTime and SortableLongField field type problems
On Mon, Oct 18, 2010 at 7:52 AM, Michael Sokolov soko...@ifactory.com wrote: I think if you look closely you'll find the date quoted in the Exception report doesn't match any of the declared formats in the schema. I would suggest, as a first step, hunting through your data to see where that date is coming from. -Mike [Note: RE-sending this because apparently in my sleepy stupor, I clicked the wrong Reply button and never sent this to the list (It's a Monday) :)] I've noticed that date anomaly as well, and I've discovered that it is one of the gotchas of the DIH: it seems to modify my date to that format. All of the dates in the data are in the correct yyyy-MM-dd'T'hh:mm:ss'Z' format. Once it is run through dateTimeFormat, I assume it is converted into a date object; trying to use that date object in any other form (i.e., using template, or even another dateTimeFormat) results in the exception I've described (displaying the date in the incorrect format). Thanks, Ken Stanley
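The round-trip gotcha described above can be sketched outside of Solr; this Python analogue (the DIH itself is Java, so this only illustrates the mechanism) shows why a parsed date's default string form no longer matches the pattern it was parsed with:

```python
from datetime import datetime

pattern = "%Y-%m-%dT%H:%M:%SZ"
d = datetime.strptime("2001-12-04T19:38:01Z", pattern)

# The parsed object's default string form is not in the original pattern.
print(str(d))

# Re-parsing that string with the same pattern fails, which is analogous to
# the "Unparseable date: Wed Nov 28 21:39:05 EST 2007" warning in the log:
# the second transformer sees the date object's string form, not the raw
# value from the source file.
try:
    datetime.strptime(str(d), pattern)
except ValueError as e:
    print("re-parse failed:", e)
```

The practical consequence is the one Ken reaches: apply dateTimeFormat at most once per source column, and derive any additional representations from the raw string.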
Re: problem on running fullimport
On Fri, Oct 15, 2010 at 7:42 AM, swapnil dubey swapnil.du...@gmail.com wrote: Hi, I am using the full import option with the data-config file as mentioned below:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql:///xxx" user="xxx" password="xx" />
  <document>
    <entity name="yyy" query="select studentName from test1">
      <field column="studentName" name="studentName" />
    </entity>
  </document>
</dataConfig>

On running the full-import option I am getting the error mentioned below. I had already included the dataimport.properties file in my conf directory. Help me to get the issue resolved:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">334</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="mode">debug</str>
  <null name="documents"/>
  <lst name="verbose-output">
    <lst name="entity:test1">
      <lst name="document#1">
        <str name="query">select studentName from test1</str>
        <str name="EXCEPTION">org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select studentName from test1 Processing Document # 1 ...

-- Regards Swapnil Dubey Swapnil, Everything looks fine, except that in your entity definition you forgot to define which data source you wish to use. So if you add dataSource="JdbcDataSource" to the entity, that should get rid of your exception. As a reminder, the DataImportHandler wiki ( http://wiki.apache.org/solr/DataImportHandler) on Apache's website is very helpful for learning how to use the DIH properly. It has helped me to have a printed copy beside me for easy and quick reference. - Ken
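A sketch of the data-config with Ken's suggestion applied - the data source is given an explicit name and the entity's dataSource attribute references it (driver, url, and credentials are the placeholders from the original mail):

```xml
<dataConfig>
  <dataSource name="JdbcDataSource" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql:///xxx" user="xxx" password="xx" />
  <document>
    <!-- dataSource here must match the name attribute of the dataSource above -->
    <entity name="yyy" dataSource="JdbcDataSource"
            query="select studentName from test1">
      <field column="studentName" name="studentName" />
    </entity>
  </document>
</dataConfig>
```

If the exception persists after this change, "Unable to execute query" can also point at a JDBC-side problem (credentials, table name), which the wiki page linked above covers.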
SOLR DateTime and SortableLongField field type problems
Hello all, I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to follow the advice from http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html about converting date fields to SortableLong fields for better memory efficiency. However, whenever I try to do this using the DateFormater, I get exceptions when indexing for every row that tries to create my sortable fields. In my schema.xml, I have the following definitions for the fieldType and dynamicField: fieldType name=sdate class=solr.SortableLongField indexed=true stored=false sortMissingLast=true omitNorms=true / dynamicField name=sort_date_* type=sdate stored=false indexed=true / In my dih.xml, I have the following definitions: dataConfig dataSource type=FileDataSource encoding=UTF-8 / entity name=xml_stories rootEntity=false dataSource=null processor=FileListEntityProcessor fileName=legacy_stories.*\.xml$ recursive=false baseDir=/usr/local/extracts newerThan=${dataimporter.xml_stories.last_index_time} entity name=stories pk=id dataSource=xml_stories processor=XPathEntityProcessor url=${xml_stories.fileAbsolutePath} forEach=/RECORDS/RECORD stream=true transformer=DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer onError=continue field column=_modified_date xpath=/RECORDS/RECORD/pr...@name='R_ModifiedTime']/PVAL / field column=modified_date sourceColName=_modified_date dateTimeFormat=-MM-dd'T'hh:mm:ss'Z' / field column=_df_date_published xpath=/RECORDS/RECORD/pr...@name='R_StoryDate']/PVAL / field column=df_date_published sourceColName=_df_date_published dateTimeFormat=-MM-dd'T'hh:mm:ss'Z' / field column=sort_date_modified sourceColName=modified_date dateTimeFormat=MMddhhmmss / field column=sort_date_published sourceColName=df_date_published dateTimeFormat=MMddhhmmss / /entity /entity /document /dataConfig The fields in question are in the formats: RECORDS RECORD PROP NAME=R_StoryDate PVAL2001-12-04T00:00:00Z/PVAL /PROP PROP NAME=R_ModifiedTime 
PVAL2001-12-04T19:38:01Z/PVAL /PROP /RECORD /RECORDS The exception that I am receiving is: Oct 15, 2010 6:23:24 PM org.apache.solr.handler.dataimport.DateFormatTransformer transformRow WARNING: Could not parse a Date field java.text.ParseException: Unparseable date: Wed Nov 28 21:39:05 EST 2007 at java.text.DateFormat.parse(DateFormat.java:337) at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89) at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) I know that it has to be the SortableLong fields, because if I remove just those two lines from my dih.xml, everything imports as I expect it to. Am I doing something wrong? Mis-using the SortableLong and/or DateTransformer? Is this not supported in my version of SOLR? I'm not very experienced with Java, so digging into the code would be a lost cause for me right now. I was hoping that somebody here might be able to help point me in the right/correct direction. It should be noted that the modified_date and df_date_published fields index just fine (so long as I do it as I've defined above). 
Thank you, - Ken It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy
Re: Searching Across Multiple Cores
Steve, Using shards is actually quite simple; it's just a matter of setting up your shards (via multiple cores, or multiple instances of SOLR) and then passing the shards parameter in the query string. The shards parameter is a comma-separated list of the servers/cores you wish to use together. So, let's try this using a fictitious example. You have two cores, one called main for your main data set of metadata and favorites for your user favorites meta data. You set up each schema accordingly, and you've indexed your data. When you want to do a query on both sets of data you would build your query appropriately, and then use the following URL (the host is assumed to be localhost for simplicity): http://localhost/solr/main/select?q=id:[*+TO+*]&shards=localhost/solr/main,localhost/solr/favorites&rows=100&start=0 I am personally investigating using this technique to tie together two cores that utilize different schemas; one schema will contain news articles, blogs, and similar types of data, while another schema will contain company-specific information, such as addresses, etc. If you're still having trouble after trying this, let me know and I'd be more than happy to share any findings that I come across. I hope that this helps to clear things up for you. :) - Ken It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy On Thu, Oct 14, 2010 at 4:25 AM, Lohrenz, Steven steven.lohr...@hmhpub.com wrote: Ken, I have been through that page many times. I could use Distributed search for what? The first scenario or the second? The question is: can I merge a set of results from the two cores/shards and only return results that exist in both (determined by the resourceId, which exists on both)? 
Cheers, Steve -Original Message- From: Ken Stanley [mailto:doh...@gmail.com] Sent: 13 October 2010 20:08 To: solr-user@lucene.apache.org Subject: Re: Searching Across Multiple Cores On Wed, Oct 13, 2010 at 2:11 PM, Lohrenz, Steven steven.lohr...@hmhpub.comwrote: Hi, I am trying to figure out if how I can accomplish the following: I have a fairly static and large set of resources I need to have indexed and searchable. Solr seems to be a perfect fit for that. In addition I need to have the ability for my users to add resources from the main data set to a 'Favourites' folder (which can include a few more tags added by them). The Favourites needs to be searchable in the same manner as the main data set, across all the same fields. My first thought was to have two separate schemas - the first for the main data set and its metadata - the second for the Favourites folder with all of the metadata from the main set copied over and then adding the additional fields. Then I thought that would probably waste quite a bit of space (the number of users is much larger than the number of main resources). So then I thought I could have the main data set with its metadata. Then there would be second one for the Favourites folder with the unique id from the first and the additional fields it needs (userId, grade, folder, tag). In addition, I would create another schema/core with all the fields from the other two and have a request handler defined on it that searches across the other 2 cores and returns the results through this core. This third core would have searches run against it where the results would expect to only be returned for a single user. For example, a user searches their Favourites folder for all the items with Foo. The result is only those items the user has added to their Favourites with Foo somewhere in their main data set metadata. Could this be made to work? What would the consequences be? Any alternative suggestions? 
Thanks,
Steve

Steve,

From your description, it really sounds like you could reap the benefits of using Distributed Search in SOLR: http://wiki.apache.org/solr/DistributedSearch

I hope that this helps.

- Ken
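The multi-core query from Ken's fictitious example above can also be built programmatically rather than by hand. A minimal sketch in Python (the host localhost and the core names main and favorites are the assumed values from the example, not anything mandated by Solr):

```python
from urllib.parse import urlencode

# Query parameters from the fictitious two-core example above.
# "shards" is a comma-separated list of the server/core pairs to search together.
params = {
    "q": "id:[* TO *]",                                        # match all documents
    "shards": "localhost/solr/main,localhost/solr/favorites",  # query both cores
    "rows": "100",
    "start": "0",
}

# The request is sent to any one core; the shards parameter fans it out
# to every listed core and the results come back merged.
url = "http://localhost/solr/main/select?" + urlencode(params)
print(url)
```

Issuing an HTTP GET against the resulting URL returns the merged result set; note that the shards being queried need compatible schemas for the fields involved, which is exactly the constraint Steve's follow-up question runs into.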
Re: Searching Across Multiple Cores
On Wed, Oct 13, 2010 at 2:11 PM, Lohrenz, Steven steven.lohr...@hmhpub.com wrote:

Hi,

I am trying to figure out how I can accomplish the following:

I have a fairly static and large set of resources I need to have indexed and searchable. Solr seems to be a perfect fit for that. In addition, I need my users to be able to add resources from the main data set to a 'Favourites' folder (which can include a few more tags added by them). The Favourites folder needs to be searchable in the same manner as the main data set, across all the same fields.

My first thought was to have two separate schemas: the first for the main data set and its metadata, and the second for the Favourites folder with all of the metadata from the main set copied over, plus the additional fields. Then I realized that would probably waste quite a bit of space (the number of users is much larger than the number of main resources).

So then I thought I could have the main data set with its metadata, and a second core for the Favourites folder containing the unique id from the first plus the additional fields it needs (userId, grade, folder, tag). In addition, I would create another schema/core with all the fields from the other two, and define a request handler on it that searches across the other two cores and returns the results through this core. This third core would handle searches whose results are expected to be returned for only a single user. For example, a user searches their Favourites folder for all the items with Foo; the result is only those items the user has added to their Favourites with Foo somewhere in their main data set metadata.

Could this be made to work? What would the consequences be? Any alternative suggestions?

Thanks,
Steve

Steve,

From your description, it really sounds like you could reap the benefits of using Distributed Search in SOLR: http://wiki.apache.org/solr/DistributedSearch

I hope that this helps.

- Ken
Re: searching while importing
On Wed, Oct 13, 2010 at 6:38 PM, Shawn Heisey s...@elyograg.org wrote:

If you are using the DataImportHandler, you will not be able to search new data until the full-import or delta-import is complete and the update is committed. When I do a full reindex, it takes about 5 hours, and until it is finished, I cannot search it.

This is not true; when I use the DIH to do a full-import, I (and my team) are still able to search the already-indexed data that exists. I have not tried to issue a manual commit in the middle of an import to see whether that makes data inserted up to that point searchable, but I would not expect that to work.

If you set the autoCommit properties maxDocs and maxTime to reasonable values, then once those limits are reached, I suspect that SOLR would commit and continue indexing; however, I have not had the chance to use those features in solrconfig.xml. If you need this kind of functionality, you may need to change your build system so that a full import clears the index manually and then does a series of delta-import batches.

The only time I've had an issue with being able to search while indexing is when my DIH had a misconfiguration that caused the import to finish without indexing anything, thus wiping out my data. Aside from that, I continually index and search at the same time almost every day (using 1.4.1).

On 10/13/2010 3:51 PM, Tri Nguyen wrote:

Hi,

Can I perform searches against the index while it is being imported? Does importing add 1 document at a time, or will solr make a temporary index and switch to that index when indexing is done?

Thanks,
Tri
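For reference, the autoCommit properties mentioned above live in the updateHandler section of solrconfig.xml. A minimal sketch of what such a configuration might look like (the threshold values here are illustrative placeholders, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Commit automatically once either threshold is reached,
       whichever comes first. -->
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- pending documents before an auto-commit -->
    <maxTime>60000</maxTime>  <!-- milliseconds before an auto-commit -->
  </autoCommit>
</updateHandler>
```

If the behavior Ken describes holds, each automatic commit during a long DIH run would make the documents indexed so far searchable incrementally, rather than only at the end of the import.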
Re: Solr PHP PECL Extension going to Stable Release - Wishing for Any New Features?
If you are using Solr via PHP and would like to see any new features in the extension, please feel free to send me a note.

I'm new to this list, but in seeing this thread - and using PHP SOLR - I wanted to make a suggestion that, while minor, I think would greatly improve the quality of the extension. (I'm basing this mostly off of SolrQuery since that's where I've encountered the issue, but this might be true elsewhere.)

Whenever a method is supposed to return an array (e.g., SolrQuery::getFields(), SolrQuery::getFacets(), etc.), if there is no data to return, a null is returned. I think that this should be normalized across the board to return an empty array.

First, the documentation is contradictory (http://us.php.net/manual/en/solrquery.getfields.php) in that the method signature says that it returns an array (not mixed), while the Return Values section says that it returns either an array or null. Secondly, returning an array under any circumstance provides more consistency and less logic; for example, let's say that I am looking for the fields (as-is in its current state):

<?php
// .. assume a proper set up
if ($solrquery->getFields() !== null) {
    foreach ($solrquery->getFields() as $field) {
        // Do something
    }
}
?>

This is a minor request, I know. But I feel that it would go a long way toward polishing the extension up for general consumption.

Thank you,
Ken Stanley

PS. I apologize if this request has come through the pipes already; as I've stated, I am new to this list, and I have yet to find any reference to my request. :)