Re: Fastest way to import big amount of documents in SolrCloud
If you build your index in Hadoop, read this (it is about Cloudera Search, but in my understanding it should also work with the Solr Hadoop contrib since 4.7): http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com wrote: Hi guys, What would you say is the fastest way to import data in SolrCloud? Our use case: each day, do a single import of a big number of documents. Should we use SolrJ/DataImportHandler/other? Or perhaps is there a bulk import feature in SOLR? I came upon this promising link: http://wiki.apache.org/solr/UpdateCSV Any idea how UpdateCSV compares performance-wise with SolrJ/DataImportHandler? If SolrJ, should we split the data in chunks and start multiple clients at once? In this way we could perhaps take advantage of the multitude of servers in the SolrCloud configuration? Either way, after the import is finished, should we do an optimize, a commit, or neither (http://wiki.solarium-project.org/index.php/V1:Optimize_command)? Any tips and tricks to perform this process the right way are gladly appreciated. Thanks, Costi
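Costi's chunks-plus-multiple-clients idea can be sketched like this. The `index_chunk` worker is a hypothetical stand-in for whatever client (SolrJ, UpdateCSV over HTTP, etc.) actually sends each batch; here it just counts documents so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(docs, size):
    # Split the document list into fixed-size batches.
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def index_chunk(batch):
    # Hypothetical worker: in a real setup this would POST the batch to
    # Solr (SolrJ, UpdateCSV, or plain HTTP). Here it only counts docs.
    return len(batch)

docs = [{"id": n} for n in range(10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    indexed = sum(pool.map(index_chunk, chunks(docs, 3)))
print(indexed)  # 10
```

Whether threads in one process or several independent client processes works better depends on where the bottleneck is (client CPU vs. Solr-side indexing), so treat the worker count as something to measure, not a recommendation.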
Re: timeAllowed is not honored
On Thu, 2014-05-01 at 23:38 +0200, Shawn Heisey wrote: I was surprised to read that fc uses less memory. I think that is an error in the documentation. Except for special cases, such as asking for all facet values on a high cardinality field, I would estimate that enum uses less memory than fc. - Toke Eskildsen, State and University Library, Denmark
Re: timeAllowed is not honored
On Thu, 2014-05-01 at 23:03 +0200, Aman Tandon wrote: So can you explain how enum is faster than the default. The fundamental difference is that enum iterates terms and counts how many of the documents associated with the terms are in the hits, while fc iterates all hits and updates a counter for the term associated with the document. A bit too simplified, we have enum: terms->docs, fc: hits->terms. enum wins when there are relatively few unique terms, and it is much less affected by index updates than fc. As Shawn says, you are best off testing. We are planning to move to SolrCloud with Solr 4.7.1, so will this 14 GB of RAM be sufficient, or should we increase it? Switching to SolrCloud does not change your fundamental memory requirements for searching. The merging part adds some overhead, but with a heap of 14GB, I would be surprised if that would require an increase. Consider using DocValues for facet fields with many unique values, for getting both speed and low memory usage at the cost of increased index size. - Toke Eskildsen, State and University Library, Denmark
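Toke's one-line summary (enum: terms->docs, fc: hits->terms) can be sketched with a toy model. This only illustrates the two counting directions; it is not Solr's implementation:

```python
# Toy model of Solr's two facet methods (not the real implementation):
# facet.method=enum walks the terms and intersects each term's posting
# list with the hit set; facet.method=fc walks the hits and increments
# a counter for each document's term.

def facet_enum(term_to_docs, hits):
    # enum: terms -> docs
    return {t: len(docs & hits) for t, docs in term_to_docs.items()}

def facet_fc(doc_to_term, hits):
    # fc: hits -> terms
    counts = {}
    for doc in hits:
        t = doc_to_term[doc]
        counts[t] = counts.get(t, 0) + 1
    return counts

index = {"red": {1, 2, 5}, "blue": {3, 4}}
doc_term = {1: "red", 2: "red", 3: "blue", 4: "blue", 5: "red"}
hits = {1, 3, 5}
assert facet_enum(index, hits) == facet_fc(doc_term, hits) == {"red": 2, "blue": 1}
```

The work for enum scales with the number of unique terms, and for fc with the number of hits, which is the intuition behind Toke's "enum wins when there are relatively few unique terms".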
Re: Block Join Score Highlighting
Mikhail Khludnev wrote: Hello, Score support is addressed at https://issues.apache.org/jira/browse/SOLR-5882. Highlighting is another story. Be aware of http://heliosearch.org/expand-block-join/; it might be somehow useful for your problem. Thx for the reply! My score question is answered with that. I already tried expanding based on that exact article. With expanding I might be able to search in the parent, filter children and also return the children based on the same filter query. However, this doesn't give me the most relevant child and certainly won't allow me to use the boost of that child in the score of the parent document. I am forced to search on the child level, as this allows me to use the unique boost of the child to influence the score. What I would need is to return snippets based on the search in the parent, but now snippets are based on the returned document. -- View this message in context: http://lucene.472066.n3.nabble.com/Block-Join-Score-Highlighting-tp4134045p4134273.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: XSLT Caching Warning
I have a few transforms that I need to do, but I set the cache lifetime very high. I'm just trying to rectify error messages that pop up. If it's something that I can ignore, then that's OK; I just wanted to be sure. Thanks! -- Chris On Thu, May 1, 2014 at 10:32 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: I think the key message here is: the simplistic XSLT caching mechanism is not appropriate for high load scenarios. As in, maybe this is not really a production-level component. One exception is given, and it is not just lifetime, it's also a single transform. Are you satisfying both of those conditions? If so, it's probably OK to just ignore the warning. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Fri, May 2, 2014 at 3:28 AM, Christopher Gross cogr...@gmail.com wrote: I get this warning when Solr (4.7.2) starts: WARN org.apache.solr.util.xslt.TransformerProvider - The TransformerProvider's simplistic XSLT caching mechanism is not appropriate for high load scenarios, unless a single XSLT transform is used and xsltCacheLifetimeSeconds is set to a sufficiently high value. The solrconfig.xml setting is: <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter"><int name="xsltCacheLifetimeSeconds">10</int></queryResponseWriter> Is there a different class that I should be using? Is there a higher number than 10 that will do the trick? Thanks! -- Chris
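If both of Alexandre's conditions hold (a single transform, high lifetime), the setting might look like this sketch in solrconfig.xml; 3600 is an arbitrary example value, not a recommended number:

```xml
<queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
  <!-- cache the compiled transform for an hour (example value) -->
  <int name="xsltCacheLifetimeSeconds">3600</int>
</queryResponseWriter>
```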
Export big extract from Solr to [My]SQL
Hi, I want to make extracts from my Solr to MySQL. Any tools around that can help me perform such a task? I find a lot about data import from SQL when googling, but nothing about export/extract. It is not all of the data in Solr I need to extract; it is only documents that fulfill a normal Solr query, but the number of documents fulfilling it will (potentially) be huge. Regards, Per Steffensen
Re: Export big extract from Solr to [My]SQL
Hi Per, basically I see three options:
* use a lot of memory to cope with huge result sets
* use result set paging
* SOLR 4.7 supports cursors (https://issues.apache.org/jira/browse/SOLR-5463)
Cheers, Siegfried Goeschl On 02.05.14 13:32, Per Steffensen wrote: Hi, I want to make extracts from my Solr to MySQL. Any tools around that can help me perform such a task? I find a lot about data import from SQL when googling, but nothing about export/extract. It is not all of the data in Solr I need to extract; it is only documents that fulfill a normal Solr query, but the number of documents fulfilling it will (potentially) be huge. Regards, Per Steffensen
Displaying ExternalFileField values in CSVResponse - Solr 4.6
Hi, We are using Solr 4.6 to index and search our ecommerce product details. We are using the ExternalFileField option to incorporate some ranking signals. The problem I am facing currently is that the values of ExternalFileField are not displayed in the CSVResponse of Solr. However, I am able to get the values for other response formats such as XML, JSON, Python etc. Can anyone please let me know if there is a way to display the values in CSVResponse. I don't prefer to use other response formats, as these responses are fatter than the CSV response and parsing them involves additional performance cost. If the required functionality is available in a later version of Solr, I will be able to upgrade to it. Any help would be great. Thanks, Sanjeev
PostingHighlighter complains about no offsets
I've been wanting to try out the PostingsHighlighter, so I added storeOffsetsWithPositions to my field definition, enabled the highlighter in solrconfig.xml, reindexed and tried it out. When I issue a query I'm getting this error: |field 'text' was indexed without offsets, cannot highlight java.lang.IllegalArgumentException: field 'text' was indexed without offsets, cannot highlight at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)| I've been trying to figure out why the field wouldn't have offsets indexed, but I just can't see it. Is there something in the analysis chain that could be stripping out offsets?
This is the field definition:

<field name="text" type="text_en" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" storeOffsetsWithPositions="true"/>

(Yes, I know PH doesn't require term vectors; I'm keeping them around for now while I experiment)

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- We are indexing mostly HTML so we need to ignore the tags -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" protected="protwords.txt"/>
    <!-- This deals with contractions -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"/>
    <!-- setting tokenSeparator="" solves issues with compound words and improves phrase search -->
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Re: What are the best practices on Multiple Language support in Solr Cloud ?
Hi Shamik, I don't have an answer for you, just a couple of comments. Why not use dynamic field definitions in the schema? As you say most of your fields are not analysed, you just add a language tag (_en, _fr, _de, ...) to the field when you index or query. Then you can add languages as you need without having to touch the schema. For fields that you do analyse (stop words or synonyms) you'll have to explicitly define a field type for them. My experience with docs that are in two or three main languages is that single core vs. multi-core has not been that critical; sharding and replication made a bigger difference for us. You could put English in one core and everything else in another. What we tried to do was just index stuff to the same field, that is, French and English getting indexed to the same contents or title field (we have our own tokenizer and filter chain, so we did actually analyse them differently), but we got into lots of problems with tf-idf, so I'd advise not doing that. The motivation was that we wanted multi-lingual results. Terry's approach here is much better, and as you thought it addresses the multi-lingual requirement, but I still don't think it totally addresses the tf-idf problem. So if you don't need multilingual results, don't go that route. I am curious to see what other people think. Niki
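Niki's dynamic-field suggestion might look like this in schema.xml (a sketch; the per-language field type names are assumptions, and languages with their own stop words or synonyms would each need their own type as noted above):

```xml
<!-- One dynamic field per language tag; documents then index
     title_en, title_fr, title_de, ... without schema changes. -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
<dynamicField name="*_de" type="text_de" indexed="true" stored="true"/>
```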
Roll up query with original facets
Hello All, I am having a query issue I cannot seem to find the correct answer for. I am searching against a list of items and returning facets for that list of items. I would like to group the result set on a field such as a “parentItemId”. parentItemId maps to other documents within the same core. I would like my query to return the documents that match parentItemId, but still return the facets of the original query. Is this possible with SOLR 4.3 that I am running? I can provide more details if needed, thanks! Darin
Re: PostingHighlighter complains about no offsets
I checked using the analysis admin page, and I believe there are offsets being generated (I assume start/end = offsets). So IDK; I am going to try reindexing again. Maybe I neglected to reload the config before I indexed last time. -Mike On 05/02/2014 09:34 AM, Michael Sokolov wrote: I've been wanting to try out the PostingsHighlighter, so I added storeOffsetsWithPositions to my field definition, enabled the highlighter in solrconfig.xml, reindexed and tried it out. When I issue a query I'm getting this error: |field 'text' was indexed without offsets, cannot highlight java.lang.IllegalArgumentException: field 'text' was indexed without offsets, cannot highlight at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)| I've been trying to figure out why the field wouldn't have offsets indexed, but I just can't see it. Is there something in the analysis chain that could be stripping out offsets?
This is the field definition: field name=text type=text_en indexed=true stored=true multiValued=false termVectors=true termPositions=true termOffsets=true storeOffsetsWithPositions=true / (Yes I know PH doesn't require term vectors; I'm keeping them around for now while I experiment) fieldType name=text_en class=solr.TextField positionIncrementGap=100 analyzer type=index !-- We are indexing mostly HTML so we need to ignore the tags -- charFilter class=solr.HTMLStripCharFilterFactory/ !--tokenizer class=solr.StandardTokenizerFactory/-- tokenizer class=solr.WhitespaceTokenizerFactory/ !-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -- filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory stemEnglishPossessive=1 protected=protwords.txt/ !-- This deals with contractions -- filter class=solr.SynonymFilterFactory synonyms=synonyms.txt expand=true ignoreCase=true/ filter class=solr.HunspellStemFilterFactory dictionary=en_US.dic affix=en_US.aff ignoreCase=true/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer analyzer type=query !--tokenizer class=solr.StandardTokenizerFactory/-- tokenizer class=solr.WhitespaceTokenizerFactory/ !-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -- filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory protected=protwords.txt/ !-- setting tokenSeparator= solves issues with compound words and improves phrase search -- filter class=solr.HunspellStemFilterFactory dictionary=en_US.dic affix=en_US.aff ignoreCase=true/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType
Re: Displaying ExternalFileField values in CSVResponse - Solr 4.6
Hi Sanjeev, Here is the relevant JIRA: https://issues.apache.org/jira/browse/SOLR-5423, which has fix versions 4.7.1, 4.8, 5.0. So I recommend using/downloading the latest 4.8.0 version. Ahmet On Friday, May 2, 2014 2:46 PM, Sanjeev Pragada sanje...@rediff.co.in wrote: Hi, We are using Solr 4.6 to index and search our ecommerce product details. We are using the ExternalFileField option to incorporate some ranking signals. The problem I am facing currently is that the values of ExternalFileField are not displayed in the CSVResponse of Solr. However, I am able to get the values for other response formats such as XML, JSON, Python etc. Can anyone please let me know if there is a way to display the values in CSVResponse. I don't prefer to use other response formats, as these responses are fatter than the CSV response and parsing them involves additional performance cost. If the required functionality is available in a later version of Solr, I will be able to upgrade to it. Any help would be great. Thanks, Sanjeev
Re: Block Join Score Highlighting
On Fri, May 2, 2014 at 2:34 PM, StrW_dev r.j.bamb...@structweb.nl wrote: Mikhail Khludnev wrote: Hello, Score support is addressed at https://issues.apache.org/jira/browse/SOLR-5882. Highlighting is another story. Be aware of http://heliosearch.org/expand-block-join/; it might be somehow useful for your problem. Thx for the reply! My score question is answered with that. but you forgot to vote for that issue! Regarding highlighting, unfortunately I've never worked with it. Hence, no quick help from my side. I already tried expanding based on that exact article. With expanding I might be able to search in the parent, filter children and also return the children based on the same filter query. However, this doesn't give me the most relevant child and certainly won't allow me to use the boost of that child in the score of the parent document. I am forced to search on the child level, as this allows me to use the unique boost of the child to influence the score. What I would need is to return snippets based on the search in the parent, but now snippets are based on the returned document. -- View this message in context: http://lucene.472066.n3.nabble.com/Block-Join-Score-Highlighting-tp4134045p4134273.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Export big extract from Solr to [My]SQL
The cursor-based deep paging in 4.7+ works very well and the performance on large extracts (for us, maybe up to 100K documents) is excellent, though it will obviously depend on the number and size of the fields that you need to pull. I wrote a Perl module to do the extractions from Solr without problems (and DBI takes care of writing to a database). I'm probably going to rewrite it in Python, since the final destination of many of our extracts is Tableau, which has a Python API for creating TDEs (Tableau data extracts). regards -Simon On Fri, May 2, 2014 at 7:43 AM, Siegfried Goeschl sgoes...@gmx.at wrote: Hi Per, basically I see three options: * use a lot of memory to cope with huge result sets * use result set paging * SOLR 4.7 supports cursors (https://issues.apache.org/jira/browse/SOLR-5463) Cheers, Siegfried Goeschl On 02.05.14 13:32, Per Steffensen wrote: Hi, I want to make extracts from my Solr to MySQL. Any tools around that can help me perform such a task? I find a lot about data import from SQL when googling, but nothing about export/extract. It is not all of the data in Solr I need to extract; it is only documents that fulfill a normal Solr query, but the number of documents fulfilling it will (potentially) be huge. Regards, Per Steffensen
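The cursorMark loop Simon describes can be sketched like this, assuming a `fetch(params)` callable that does the actual HTTP round trip to Solr and returns the decoded JSON response; only the cursor protocol itself is shown:

```python
# Sketch of cursor-based deep paging (SOLR-5463, Solr 4.7+). Start with
# cursorMark="*", pass back nextCursorMark on each request, and stop when
# the cursor stops changing. The sort must include the uniqueKey field.

def export_all(fetch, q="*:*", rows=500):
    docs, cursor = [], "*"
    while True:
        resp = fetch({"q": q, "sort": "id asc",
                      "rows": rows, "cursorMark": cursor})
        docs.extend(resp["response"]["docs"])
        nxt = resp["nextCursorMark"]
        if nxt == cursor:   # cursor unchanged -> no more results
            return docs
        cursor = nxt
```

Each returned batch can then be written out through whatever database layer you use (DBI in Simon's Perl case, or a DB-API driver in Python).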
Re: Searching for tokens does not return any results
bq: but this index was created using a Java program using Lucene interface Elaborating a bit on Koji's comment... The fact that you used Lucene to index the doc means that the analysis page is almost, but not quite entirely, useless on the indexing side. It's looking at your field definition in schema.xml and running your input stream through the indexing portion of your analysis chain constructed from the schema. What's actually in your index, though, was put there by raw Lucene. So your Lucene program _must_ create an analysis chain that is absolutely identical to what's in your schema for the admin/analysis page to be accurate. Quick test: go to your admin/schema browser page, or use the TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component) or Luke to examine the actual tokens in your field. My bet is that you'll see that the actual terms are not what you expect, and almost certainly not what the admin/analysis page shows on the index side. Keeping an independent Lucene program that puts data into your index with raw Lucene aligned with your schema is, as you can see, something of a problem. If at all possible, consider letting Solr do the indexing and sending it documents with SolrJ; here's a reference: https://cwiki.apache.org/confluence/display/solr/Using+SolrJ By the way, I want to compliment you on your post. You did all the right things:
- defined your problem clearly
- added the critical bit (index created with Lucene); this is especially relevant, I think
- illustrated the input and output
- told us what the problem was
- gave us the field definitions
- showed the results of some of your investigation
Best Erick On Thu, May 1, 2014 at 7:31 AM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi Yetkin, welcome! I think StandardAnalyzer of Lucene is the problem you are facing. Why don't you have another field using StandardAnalyzer and see how it tokenizes CRD_PROD on the Solr admin GUI?
I forgot one detail, but we can use Lucene's Analyzer in schema.xml, something like this:

<fieldType ...>
  <analyzer class="solr.StandardAnalyzer"/>
</fieldType>

Koji -- http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html (2014/05/01 23:04), Yetkin Ozkucur wrote: Hello everyone, I am new to SOLR and this is my first post to this list. I have been working on this problem for a couple of days. I tried everything I found in Google, but it looks like I am missing something. Here is my problem: I have a field called DBASE_LOCAT_NM_TEXT. It contains values like CRD_PROD. The goal is to be able to search this field either by putting in the exact string CRD_PROD or part of it (tokenized by _) like CRD or PROD. Currently, this query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD But this does not: q=DBASE_LOCAT_NM_TEXT:CRD I want to understand why the second query does not return any results. Here is how I configured the field:

<field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>

And here is how I configured the field type:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I am also using the analysis panel in the SOLR admin console. It shows this:

WT   CRD_PROD
WDF  CRD_PROD CRD PROD CRDPROD
SF   CRD_PROD CRD PROD CRDPROD
LCF  crd_prod crd prod crdprod
SKMF crd_prod crd prod crdprod
RDTF crd_prod crd prod crdprod

I am not sure if it is related or not, but this index was created by a Java program using the Lucene interface. It used StandardAnalyzer for writing, and the field was configured as tokenized, indexed and stored.
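The mismatch Erick and Koji describe can be illustrated with a toy model (this is not Lucene's or Solr's actual code, just the shape of the problem): the raw Lucene indexer used StandardAnalyzer, which typically keeps CRD_PROD as one lowercased token, while Solr's text_general query chain (WordDelimiterFilter) also searches the sub-words:

```python
# Toy illustration of an index/query analyzer mismatch.

def index_analyzer(text):
    # Stand-in for what actually went into the index: one token.
    return [text.lower()]                     # e.g. ["crd_prod"]

def query_analyzer(text):
    # Stand-in for the Solr-side chain: original token plus sub-words.
    t = text.lower()
    parts = t.split("_")
    return [t] + parts if len(parts) > 1 else [t]

indexed = set(index_analyzer("CRD_PROD"))
assert any(q in indexed for q in query_analyzer("CRD_PROD"))  # matches
assert not any(q in indexed for q in query_analyzer("CRD"))   # no match
```

The full-string query still matches because preserveOriginal keeps crd_prod among the query tokens, but crd alone has nothing to match against, which is exactly the reported symptom.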
Re: RE : Shards don't return documents in same order
Francois: Yes, there are several means to examine the raw terms in the index: the admin/schema-browser page, the TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component), and Luke. The schema-browser is all set up for you; it's easiest. The TermsComponent should be directly usable too; I believe it's configured by default in solrconfig.xml. Luke takes a bit of setup but is a great tool. Did you re-index from scratch on all shards? I presume your ordering is still not the same on all shards... the order I'd expect would be: mb20140410a mb20140410anew mb20140411a Best, Erick On Thu, May 1, 2014 at 8:27 AM, Francois Perron francois.per...@ticketmaster.com wrote: Hi Erick, thank you for your response. You are right, I changed alphaOnlySort to keep letters and numbers and to remove some articles (a, an, the). This is the field type definition:

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" replace="all" replacement="" pattern="(\b(a|an|the)\b|[^a-z,0-9])"/>
  </analyzer>
</fieldType>

Then I tested each name with the admin UI on each server, and these are the results:

server1: MB20140410A = mb20140410a, MB20140411A = mb20140411a, MB20140410A-New = mb20140410anew
server2: MB20140410A = mb20140410a, MB20140411A = mb20140411a, MB20140410A-New = mb20140410anew
server3: MB20140410A = mb20140410a, MB20140411A = mb20140411a, MB20140410A-New = mb20140410anew

Unfortunately, all results are identical, so is there a way to view the data actually indexed in these documents? Could it be a problem with a particular server? All configs are in ZooKeeper, so all cores should have the same config, right? Is there any way to force a replica to resynchronize? Regards, Francois.
From: Erick Erickson [erickerick...@gmail.com] Sent: April 30, 2014 16:36 To: solr-user@lucene.apache.org Subject: Re: Shards don't return documents in same order Hmmm, take a look at the admin/analysis page for these inputs for alphaOnlySort. If you're using the stock Solr distro, you're probably not considering the effects of PatternReplaceFilterFactory, which is removing all non-letters. So these three terms reduce to: mba, mba, mbanew. You can look at the actual indexed terms via the admin/schema-browser as well. That said, unless you transposed the order because you were concentrating on the numeric part, the doc with MB20140410A-New should always be sorting last. All of which is irrelevant if you're doing something else with alphaOnlySort, so please paste in the fieldType definition if you've changed it. What gets returned in the doc for _stored_ data is a verbatim copy, NOT the output of the analysis chain, which can be confusing. Oh, and Solr uses the internal Lucene doc ID to break ties, and docs on different replicas can have different internal Lucene doc IDs relative to each other as a result of merging, so that's something else to watch out for. Best, Erick On Wed, Apr 30, 2014 at 1:06 PM, Francois Perron francois.per...@ticketmaster.com wrote: Hi guys, I have a small SolrCloud setup (3 servers, 1 collection with 1 shard and 3 replicas). In my schema, I have an alphaOnlySort field with a copyField.
This is a part of my managed-schema:

<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_uid" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="event_id" type="string" indexed="true" stored="true"/>
<field name="event_name" type="text_general" indexed="true" stored="true"/>
<field name="event_name_sort" type="alphaOnlySort"/>

with the copyField:

<copyField source="event_name" dest="event_name_sort"/>

The problem is: I query my collection with a sort on my alphaOnlySort field, but on one of my servers the sort order is not the same. On servers 1 and 2 I have this result:

<doc><str name="event_name">MB20140410A</str></doc>
<doc><str name="event_name">MB20140410A-New</str></doc>
<doc><str name="event_name">MB20140411A</str></doc>

and on the third one, this:

<doc><str name="event_name">MB20140410A</str></doc>
<doc><str name="event_name">MB20140411A</str></doc>
<doc><str name="event_name">MB20140410A-New</str></doc>

The doc named MB20140411A should be at the end... Any idea? Regards
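Erick's point about the analysis chain can be checked outside Solr. This snippet reimplements Francois' modified alphaOnlySort chain (keyword tokenizer + lowercase + trim + the PatternReplaceFilter pattern from the thread) with Python's `re`, purely for illustration of what sort keys the three names produce:

```python
import re

# The pattern posted in the thread: strips standalone articles and any
# character that is not a-z, comma, or 0-9.
PATTERN = re.compile(r"(\b(a|an|the)\b|[^a-z,0-9])")

def sort_key(value):
    # KeywordTokenizer (whole value) + LowerCase + Trim + PatternReplace.
    return PATTERN.sub("", value.lower().strip())

names = ["MB20140410A", "MB20140411A", "MB20140410A-New"]
print(sorted(names, key=sort_key))
# MB20140410A-New sorts between the other two, since its key is
# mb20140410anew and "mb20140410a" < "mb20140410anew" < "mb20140411a".
```

This matches the order Erick expects, which suggests the odd server is sorting on stale or differently analyzed terms rather than on a different config.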
Re: Fastest way to import big amount of documents in SolrCloud
re: optimize after every import. This is not recommended in 4.x unless and until you have evidence that it really does help; reviews are very mixed, and it's been renamed "forceMerge" in 4.x just so people don't think "Of course I want to do this, who wouldn't?". bq: Doing a commit instead of optimize is usually bringing the master and slave nodes down This isn't expected unless you're committing far too frequently. I'd dis-recommend doing any commits except, possibly, a single commit after all the clients have finished indexing. But even that isn't necessary. In batch mode in SolrCloud, reasonable settings are: autocommit: 15 seconds WITH openSearcher=false; autosoftcommit: the interval it takes you to run all your indexing. Seems odd, but here's the background: http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Best, Erick On Thu, May 1, 2014 at 11:12 PM, Alexander Kanarsky kanarsky2...@gmail.com wrote: If you build your index in Hadoop, read this (it is about Cloudera Search, but in my understanding it should also work with the Solr Hadoop contrib since 4.7): http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com wrote: Hi guys, What would you say is the fastest way to import data in SolrCloud? Our use case: each day, do a single import of a big number of documents. Should we use SolrJ/DataImportHandler/other? Or perhaps is there a bulk import feature in SOLR? I came upon this promising link: http://wiki.apache.org/solr/UpdateCSV Any idea how UpdateCSV compares performance-wise with SolrJ/DataImportHandler? If SolrJ, should we split the data in chunks and start multiple clients at once? In this way we could perhaps take advantage of the multitude of servers in the SolrCloud configuration?
Either way, after the import is finished, should we do an optimize, a commit, or neither (http://wiki.solarium-project.org/index.php/V1:Optimize_command)? Any tips and tricks to perform this process the right way are gladly appreciated. Thanks, Costi
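Erick's batch-mode settings might look like this in solrconfig.xml (a sketch; the 3600000 ms soft commit interval is an arbitrary stand-in for however long the daily import actually takes):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>           <!-- hard commit every 15 seconds -->
    <openSearcher>false</openSearcher> <!-- truncate the tlog without opening a searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>3600000</maxTime>         <!-- example: roughly the length of the indexing run -->
  </autoSoftCommit>
</updateHandler>
```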
Re: Roll up query with original facets
I think this might be what you're looking for.. http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams Best, Erick On Fri, May 2, 2014 at 7:19 AM, Darin Amos dari...@gmail.com wrote: Hello All, I am having a query issue I cannot seem to find the correct answer for. I am searching against a list of items and returning facets for that list of items. I would like to group the result set on a field such as a “parentItemId”. parentItemId maps to other documents within the same core. I would like my query to return the documents that match parentItemId, but still return the facets of the original query. Is this possible with SOLR 4.3 that I am running? I can provide more details if needed, thanks! Darin
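On the wire, the tag/exclude localparams from that wiki page look roughly like this. Python is used here only to assemble the query string; the field names `parentItemId` and `color` are placeholders based on Darin's example, not known schema fields:

```python
from urllib.parse import urlencode

# Multi-select faceting sketch: the fq is tagged, and the facet.field
# excludes that tag, so the facet counts ignore the filter and reflect
# the original query.
params = [
    ("q", "shirts"),
    ("fq", "{!tag=pid}parentItemId:123"),
    ("facet", "true"),
    ("facet.field", "{!ex=pid}color"),
]
print(urlencode(params))
```

Whether this fully covers the roll-up-to-parent part of the question likely depends on combining it with grouping or a join, so treat it as the faceting half of the answer only.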
ANNOUNCE: Apache Solr Reference Guide for 4.8
The Lucene PMC is pleased to announce that there is a new version of the Solr Reference Guide available for Solr 4.8. The 396 page PDF serves as the definitive user's manual for Solr 4.8. It can be downloaded from the Apache mirror network: https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/ -Hoss
Can't use 2 highlighting components in the same solrconfig
Hoping someone can help me... I'm trying to use both the PostingsHighlighter and the FastVectorHighlighter in the same solrconfig (selection driven by different request handlers), but once I define 2 search components in the config, it always picks the Postings Highlighter (even if I never reference it in any request handler). Is this even possible to do? (I'm using 4.7.1). I think the culprit is some specific code in SolrCore.loadSearchComponents(), which specifically overwrites the highlighting component with the contents of the postingshighlight component - so the components map has 2 entries, but they both point to the same highlighting class (the PostingsHighlighter). It seems pretty deliberate (it only does it for the highlighter!), but wondering if there is some reason to allow only one version of the highlighter to be used. We're using 2 highlighters since the FVH is slow when creating snippets for a search result list (10-50 documents), so we turned to the PH (which is definitely faster, even though it doesn't keep phrases together, but that's a post for another day). But we like FVH for highlighting query terms in the full document, once the user clicks on a result. The plan is to use the PH in a search request handler, and the FVH in a document view request handler. Thanks.
RE: Searching for tokens does not return any results
Erick, Koji, Ahmet: Thank you all for your answers! I think I found the problem and I am on the right track to fix it. 1- As you suggested, the problem was in the Java code populating the index. The analyzer in the Java code had to be consistent with the one defined in SOLR. I was able to achieve my goal by creating a slightly customized analyzer. 2- Being able to see the tokens in the index was key to debugging the problem. I downloaded Luke (well, a tweaked version of it for Lucene 4.4) to be able to see tokens. I did not know SOLR had that terms component. That is a good tip too. Have a good weekend. Thanks, Yetkin -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, May 02, 2014 11:57 AM To: solr-user@lucene.apache.org Subject: Re: Searching for tokens does not return any results bq: but this index was created using a Java program using Lucene interface Elaborating a bit on Koji's comment... The fact that you used Lucene to index the doc means that the analysis page is almost, but not quite entirely, useless on the indexing side. It's looking at your field definition in schema.xml and running your input stream through the indexing portion of your analysis chain constructed from the schema. What's actually in your index, though, was put there by raw Lucene. So your Lucene program _must_ create an analysis chain that is absolutely identical to what's in your schema for the admin/analysis page to be accurate. Quick test: go to your admin/schema-browser page or use the TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component) or Luke to examine the actual tokens in your field. My bet is that you'll see that the actual terms are not what you expect and almost certainly not what the admin/analysis page shows on the index side. Keeping an independent Lucene program that puts data into your index with raw Lucene aligned with your schema is, as you can see, something of a problem.
If at all possible, consider letting Solr do the indexing and sending it documents with SolrJ; here's a reference: https://cwiki.apache.org/confluence/display/solr/Using+SolrJ By the way, I want to compliment you on your post. You did all the right things:
- defined your problem clearly
- added the critical bit (index created with Lucene), which is especially relevant, I think
- illustrated the input and output
- told us what the problem was
- gave us the field definitions
- showed the results of some of your investigation
Best, Erick On Thu, May 1, 2014 at 7:31 AM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi Yetkin, welcome! I think StandardAnalyzer of Lucene is the problem you are facing. Why don't you have another field using StandardAnalyzer and see how it tokenizes CRD_PROD on the Solr admin GUI? I forget the details, but we can use Lucene's Analyzer in schema.xml with something like this:

<fieldType ...>
  <analyzer class="solr.StandardAnalyzer"/>
</fieldType>

Koji -- http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html (2014/05/01 23:04), Yetkin Ozkucur wrote: Hello everyone, I am new to SOLR and this is my first post in this list. I have been working on this problem for a couple of days. I tried everything I found in Google, but it looks like I am missing something.
Here is my problem: I have a field called: DBASE_LOCAT_NM_TEXT It contains values like: CRD_PROD The goal is to be able to search this field either by putting the exact string CRD_PROD or part of it (tokenized by _) like CRD or PROD Currently: This query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD But this does not: q=DBASE_LOCAT_NM_TEXT:CRD I want to understand why the second query does not return any results. Here is how I configured the field:

<field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>

And here is how I configured the field type:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
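As a rough illustration of what the index-time chain above is meant to produce (this is a crude simplification written for this thread, not Solr's actual WordDelimiterFilterFactory): with generateWordParts=1, catenateWords=1, and preserveOriginal=1, CRD_PROD should be indexed as the sub-terms plus the original, which is why q=DBASE_LOCAT_NM_TEXT:CRD is expected to match once the indexing side really applies this chain:

```python
import re

def word_delimiter_sim(token: str):
    """Crude stand-in for WordDelimiterFilter with generateWordParts=1,
    catenateWords=1, preserveOriginal=1 (case-change splits ignored),
    followed by LowerCaseFilterFactory."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    out = list(parts)               # generateWordParts: CRD, PROD
    if len(parts) > 1:
        out.append("".join(parts))  # catenateWords: CRDPROD
    out.append(token)               # preserveOriginal: CRD_PROD
    # LowerCaseFilterFactory lowercases; dict.fromkeys dedupes in order
    return [t.lower() for t in dict.fromkeys(out)]

print(word_delimiter_sim("CRD_PROD"))  # ['crd', 'prod', 'crdprod', 'crd_prod']
```

If the raw-Lucene indexing code does not apply an equivalent chain, only the whole token ends up in the index, which matches the behavior observed in this thread.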
Re: Searching for tokens does not return any results
Glad to hear it! You shouldn't really have to customize the analyzer to get it to behave as it would if you just used Solr to ingest documents, just chain things together. That's what Solr does after all. Of course you may have special needs that are better served by more customization. TermsComponent is a useful tool. Note that you also get raw terms if you use the admin/schema-browser page, identify your field, and then click the show term info button. That technique is somewhat limited though. The schema-browser page is especially useful for very small indexes and/or test cases I'll admit. I do vaguely remember something not right with the schema-browser at one point though, so it might not work as I expect for 4.4 Best, Erick
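Since TermsComponent keeps coming up in this thread, a request against the stock /terms handler (host, core name, and prefix value are placeholders) to eyeball what is actually in the index might look like:

```
http://localhost:8983/solr/collection1/terms?terms.fl=DBASE_LOCAT_NM_TEXT&terms.prefix=crd&terms.limit=20
```

terms.fl, terms.prefix, and terms.limit are standard TermsComponent parameters; the response lists raw indexed terms with their document frequencies, which is exactly what is needed to check whether CRD was ever indexed as its own term.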
Spellchecking - looking for general advice
Hi, I was looking at spellcheck (Direct and FileBased) and testing what they can do. Direct works fine most of the time, but I'd like to find a solution for a few corner cases: 1) Having recruted and recruiter in the index, recruter should suggest the latter. Obviously the distance to the former is smaller, so it may be completely arbitrary, and perhaps must be handled on the application side rather than in Solr. 2) restraunt doesn't suggest restaurant - I assume the distance is too big for that. Those are a few examples of queries that spellcheck gets (according to my requirements) wrong. For now I am just looking at possible solutions and I'd need to come up with an initial concept to have something to show to users and get more feedback, likely with more cases to correct. I'd like to know if there are some tweaks to the spellcheck component I could make (or perhaps other ways of doing this with Solr), or am I forced to hardcode a list of all such corrections that go beyond what spellcheck can do? One solution I am considering is to put the list of those special cases into FileBasedSpellChecker (it seems to be more relaxed, and handles the restraunt case well) and fall back to Direct if this yields no results... though I am not sure yet how well that would work in practice if the list of misspelled words grew beyond the few I have now. It most likely wouldn't scale. Another possibility would be to analyze the list of queries our users issue that yield few results and check if there is a spellchecked version that improves that... but that seems to require a human to review the corrections. Yet another thing I was thinking about would be to pull the terms into a separate spellchecker (like aspell) and see if it does a better job or is more tweakable. That's a bit of an open-ended problem, so any advice is welcome. -- Maciej Dziardziel fied...@gmail.com
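The restraunt guess above can be checked with a quick sketch of plain Levenshtein distance (a simplification; DirectSolrSpellChecker actually uses Lucene's automaton-based DirectSpellChecker, whose maxEdits setting is capped at 2): the distance from restraunt to restaurant is 3, so that checker can never propose it, regardless of tuning.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("restraunt", "restaurant"))  # 3 -> beyond a maxEdits=2 checker
print(levenshtein("recruter", "recruiter"))    # 1
```

That makes the FileBasedSpellChecker-with-fallback idea reasonable for the far-off misspellings: as observed above, it handles the restraunt case, since it is not bound by the 2-edit automaton limit.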
Re: ANNOUNCE: Apache Solr Reference Guide for 4.8
Somebody should create an offline search interface for it. :-) Regards, Alex On 02/05/2014 11:53 pm, Chris Hostetter hoss...@apache.org wrote: The Lucene PMC is pleased to announce that there is a new version of the Solr Reference Guide available for Solr 4.8. The 396 page PDF serves as the definitive user's manual for Solr 4.8. It can be downloaded from the Apache mirror network: https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/ -Hoss