Re: Index
Every indexed document has to have a unique ID associated with it. You may do a search by ID, something like http://localhost:/solr/select?q=id:X. If you see a result, then the document has been indexed and is searchable. You might also want to check Luke (http://code.google.com/p/luke) to gain more insight into the index.

*Pranav Prakash*

On Fri, Jul 29, 2011 at 03:40, GAURAV PAREEK gauravpareek2...@gmail.com wrote:
Yes Nick, you are correct. How can you check whether it has been indexed by Solr, and is searchable?

On Fri, Jul 29, 2011 at 3:27 AM, Nicholas Chase nch...@earthlink.net wrote:
Do you mean, how can you check whether it has been indexed by Solr, and is searchable? Nick

On 7/28/2011 5:45 PM, GAURAV PAREEK wrote:
Hi All, how can we check that a particular file has not been indexed in Solr? Regards, Gaurav
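A minimal SolrJ sketch of the same check (assuming the default example port 8983 and a literal document ID of X; escape the ID if it contains query syntax characters):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CheckIndexed {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // numFound > 0 means the document was indexed and is searchable
            QueryResponse rsp = solr.query(new SolrQuery("id:X"));
            System.out.println(rsp.getResults().getNumFound() > 0
                    ? "indexed and searchable" : "not found");
        }
    }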
Solr 3.2.0 is not writing log
I'm using Solr 1.4 with Jetty for my site, and it writes logs to files in example/logs. Now I'm testing Solr 3.2.0 with Jetty on another server, but no log is written to that folder: example/logs is always empty. Do I need to do something to turn logging on? Any hint will be appreciated. Ruixiang
Re: convert date format at indexing time
Is there any suggestion on this? Thanks
Auto-Commit and failures / schema violations
Hello, we are running a large CMS with multiple customers and we are now going to use Solr for our search and indexing tasks. As we have a lot of users working simultaneously on the CMS, we decided not to commit our changes programmatically (we use StreamingUpdateSolrServer) on each add. Instead we are using the autocommit functions in solrconfig.xml. To be reliable, we write timestamp files on each add of a document to the StreamingUpdateSolrServer. (In case of a crash we could restart indexing from that timestamp.) Unfortunately we don't know how to be sure that the add was successful, as (for example) schema violations seem to be detected only on commit, which is too late, as the timestamp is usually already overwritten by then. So: are there any valid approaches to be sure that an add of a document has been processed successfully? Maybe: is it better to collect a list of documents to add and commit these, instead of using the auto-commit function? Thanks in advance for any help! Dirk Högemann
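One way to see per-document failures synchronously (a sketch, not a confirmed fix for this setup): add through a plain CommonsHttpSolrServer, which throws on the add call itself, and only then write the timestamp file. StreamingUpdateSolrServer reports failures asynchronously; overriding its handleError(Throwable) callback is the other place to hook in.

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SafeAdd {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("title", "example");
            try {
                solr.add(doc); // schema violations surface here, not at commit
                // safe to record the timestamp for this document now
            } catch (SolrServerException e) {
                // the add itself failed: log it and re-queue the document
            } catch (IOException e) {
                // transport problem: retry later
            }
        }
    }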
slow highlighting because of stemming
Dear all, I am quite new to Solr, but would like to ask your help. I am developing an application which should be able to highlight the results of a query. For this I am using the regex fragmenter:

<highlighting>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.pre"><![CDATA[<b>]]></str>
      <str name="hl.post"><![CDATA[</b>]]></str>
      <str name="hl.useFastVectorHighlighter">true</str>
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,300}[.?!]</str>
      <str name="hl.fl">dokumentum_syn_query</str>
    </lst>
  </fragmenter>
</highlighting>

The field is indexed with term vectors and offsets:

<field name="dokumentum_syn_query" type="huntext_syn" indexed="true" stored="true"
       multiValued="true" termVectors="on" termPositions="on" termOffsets="on"/>

<fieldType name="huntext_syn" class="solr.TextField" stored="true" indexed="true"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex" cache="alma"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex" cache="alma"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_query.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The highlighting works well, except that it's really slow. I realized that this is because the highlighter/fragmenter does stemming on all the result documents again. Could you please help me understand why this happens and how I should avoid it? (I thought that using FastVectorHighlighter would solve my problem, but it didn't.) Thanks in advance! Gyuri Orosz
Re: [WARNING] Index corruption and crashes in Apache Lucene Core / Apache Solr with Java 7
Hello, thanks for the warning, that's a pretty nasty bug. A patch was made for OpenJDK, if anybody is interested to try it out that would be great: http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/4e761e7e6e12 Regards, Sanne

2011/7/28 Uwe Schindler uschind...@apache.org:

Hello Apache Lucene & Apache Solr users, hello users of other Java-based Apache projects,

Oracle released Java 7 today. Unfortunately it contains hotspot compiler optimizations which miscompile some loops. This can affect code of several Apache projects. Sometimes JVMs only crash, but in several cases results calculated can be incorrect, leading to bugs in applications (see Hotspot bugs 7070134 [1], 7044738 [2], 7068051 [3]).

Apache Lucene Core and Apache Solr are two Apache projects which are affected by these bugs, namely all versions released until today. Solr users with the default configuration will have Java crashing with SIGSEGV as soon as they start to index documents, as one affected part is the well-known Porter stemmer (see LUCENE-3335 [4]). Other loops in Lucene may be miscompiled, too, leading to index corruption (especially on Lucene trunk with the pulsing codec; other loops may be affected, too - LUCENE-3346 [5]).

These problems were detected only 5 days before the official Java 7 release, so Oracle had no time to fix those bugs, affecting also many more applications. In response to our questions, they proposed to include the fixes in service release u2 (eventually in service release u1, see [6]). This means you cannot use Apache Lucene/Solr with Java 7 releases before Update 2! If you do, please don't open bug reports, it is not the committers' fault! At least disable loop optimizations using the -XX:-UseLoopPredicate JVM option to not risk index corruption.

Please note: Java 6 users are also affected if they use one of these JVM options, which are not enabled by default: -XX:+OptimizeStringConcat or -XX:+AggressiveOpts. It is strongly recommended not to use any hotspot optimization switches in any Java version without extensive testing!

In case you upgrade to Java 7, remember that you may have to reindex, as the Unicode version shipped with Java 7 changed and tokenization behaves differently (e.g. lowercasing). For more information, read JRE_VERSION_MIGRATION.txt in your distribution package!

On behalf of the Lucene project, Uwe

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7070134
[2] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738
[3] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7068051
[4] https://issues.apache.org/jira/browse/LUCENE-3335
[5] https://issues.apache.org/jira/browse/LUCENE-3346
[6] http://s.apache.org/StQ

Uwe Schindler, uschind...@apache.org, Apache Lucene PMC Member / Committer, Bremen, Germany, http://lucene.apache.org/
AUTO: Ryan J Minniear is out of the office. (returning 08/01/2011)
I am out of the office until 08/01/2011. I will respond to your message when I return. Please contact Robert Guthrie for any urgent issues. Note: this is an automated response to your message "Solr 3.2.0 is not writing log" sent on 7/29/11 2:08:07. This is the only notification you will receive while this person is away.
Updating opinion
Hello, I want some opinions on the updating process of my application. Users can edit their own data. This data will be validated and must be updated every 24 hours. I want to do this at night (0:00). Now let's say 50,000 documents are edited. The delta import will take ~20 minutes, so the indexing process is ready at 0:20. Some data depends on the day, so the index has wrong data for 20 minutes. Now I thought I could fix this problem this way: I do a delta import every hour without a commit. I do this 24 times, and at the end of the day I do a commit and optimize the index. Is this possible? Is it faster to do the updates in parts?
Query on multi valued field
Hi All, I have a specific requirement on the multi-valued field type. The requirement is as follows: there is a multivalued field in each document which can have multiple elements or a single element. For example, consider that the following are the documents matched for, say, q=*:*

DOC1:
<doc>
  <arr name="multi">
    <str>1</str>
  </arr>
</doc>

DOC2:
<doc>
  <arr name="multi">
    <str>1</str>
    <str>3</str>
    <str>4</str>
  </arr>
</doc>

DOC3:
<doc>
  <arr name="multi">
    <str>1</str>
    <str>2</str>
  </arr>
</doc>

The query is to get only those documents which have multiple elements for that multivalued field, i.e., docs 2 and 3 should be returned from the above set. Is there any way to achieve this? Awaiting reply, Thanks & Regards, Rajani
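One common workaround, sketched here on the assumption that you control indexing: store the number of values in a separate integer field (multi_count is a made-up name) when building each document, then filter with a range query such as q=multi_count:[2 TO *], which would return only DOC2 and DOC3.

    import java.util.List;
    import org.apache.solr.common.SolrInputDocument;

    public class MultiCountDoc {
        // record how many values "multi" carries so it can be queried later
        static SolrInputDocument build(String id, List<String> multiValues) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            for (String v : multiValues) {
                doc.addField("multi", v);
            }
            doc.addField("multi_count", multiValues.size());
            return doc;
        }
    }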
Combine XML data with DIH
I have a folder with XML files.

1.xml contains:

<id>http://www.site.com/1.html</id>
<link>http://www.othersite.com/2.html</link>
<content>bla1</content>

2.xml contains:

<id>http://www.othersite.com/2.html</id>
<content>bla2</content>

I want to create this document in Solr:

<id>http://www.site.com/1.html</id>
<content>bla2</content>

Can this be done with DIH? And how?
segments.gen file is not replicated
Dear list, is there a deeper logic behind why the segments.gen file is not replicated with Solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: Dealing with keyword stuffing
Cool. So I used SweetSpotSimilarity with default params and I see some improvements. However, I can still see some of the 'stuffed' documents coming up in the results. I feel that SweetSpotSimilarity alone is not enough. Going through http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I figured out that there are other things, pivoted length normalization and term frequency normalization, that need fine-tuning too. Should I create a custom Similarity class that overrides all the default behavior? I guess that should help me get more relevant results. Where should I begin with it? Please do not assume less obvious things, I am still learning!! :-)

*Pranav Prakash*

On Thu, Jul 28, 2011 at 17:03, Gora Mohanty g...@mimirtech.com wrote:
On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash pra...@gmail.com wrote: [...] I am not sure how to use SweetSpotSimilarity. I am googling on this, but any useful insights are so much appreciated.

Replace the existing DefaultSimilarity class in schema.xml (look towards the bottom of the file) with the SweetSpotSimilarity class, e.g., have a line like:

<similarity class="org.apache.lucene.search.SweetSpotSimilarity"/>

Regards, Gora
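A sketch of such a starting point, assuming the Lucene 3.x Similarity API: subclass DefaultSimilarity, override the factors you want to tame, and reference the class in schema.xml as above. The log-based tf below is an illustration of damping repeated terms, not a tuned recommendation.

    import org.apache.lucene.search.DefaultSimilarity;

    // Sublinear term frequency: keyword-stuffed documents gain less from
    // repeating a term than with the default sqrt(freq).
    public class StuffingResistantSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
        }
    }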
Re: Combine XML data with DIH
To make it easier, I included an example config:

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="file" rootEntity="false" dataSource="null"
            processor="FileListEntityProcessor" fileName="^.*\.xml$"
            recursive="false" baseDir="/srv/www/servers/crawler/files">
      <entity name="crawl" pk="id" dataSource="file"
              url="${file.fileAbsolutePath}" processor="XPathEntityProcessor"
              forEach="/doc" transformer="RegexTransformer">
        <field column="id" xpath="/doc/id"/>
        <field column="link" xpath="/doc/link"/>
        <field column="content" xpath="/doc/content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

O. Klein wrote: [...]
RE: Updating opinion
I would imagine if you're doing updates all day, the commit might take a long time. You could try it, though, and see if it works for you. Another option, which will use more disk space, is to replicate all your data to another core just after midnight. Then update the data all day long as you please (and commit) on the new core. At the stroke of midnight the next day, swap cores. This way you can control (nearly) the exact moment the new data becomes public. See http://wiki.apache.org/solr/CoreAdmin#SWAP

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: roySolr [mailto:royrutten1...@gmail.com]
Sent: Friday, July 29, 2011 5:36 AM
To: solr-user@lucene.apache.org
Subject: Updating opinion
[...]
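For reference, a sketch of triggering that swap from code (the core names live and ondeck are placeholders; any HTTP client works, since it is just the CoreAdmin URL from the wiki page above):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SwapCores {
        public static void main(String[] args) throws Exception {
            // SWAP atomically exchanges the names of the two cores
            URL url = new URL("http://localhost:8983/solr/admin/cores"
                    + "?action=SWAP&core=live&other=ondeck");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            System.out.println("SWAP returned HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }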
RE: Updating opinion
Although, now that I think more, you could probably get away with the commit-at-midnight option, provided it doesn't take much time to warm a new searcher. Another thing is that if you set a low merge factor you likely won't need to optimize. The optimize usually takes a lot longer than the commit, so you want to avoid doing one if you can. You still won't be able to guarantee the new documents are available right at the stroke of midnight, but you can probably usually be close. If you need to be precise, you'll probably want to use 2 cores.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Dyer, James
Sent: Friday, July 29, 2011 8:58 AM
To: solr-user@lucene.apache.org
Subject: RE: Updating opinion
[...]
RE: Index time boosting with DIH
Thanks for the answer. I want to share the configuration that worked for me (see the follow-up question at the end). The goal: boosting a document on the basis of a field value at index time. It took me some time to figure out that for row.get to work, I had to use the column name (the one in the select list), whereas for a put the field name (or pseudo field name) works.

<dataConfig>
  <dataSource .../>
  <script><![CDATA[
    function BoostDoc(row) {
      if (row.get('SOME_COLUMN') == 'someValue') {
        row.put('$docBoost', 20);
      }
      return row;
    }
  ]]></script>
  <document name="mydoc">
    <entity name="myentity" transformer="script:BoostDoc" query="select ...">
      <field column="SOME_COLUMN" name="someField"/>
      ...

A follow-up question: this is only working for non-wildcard queries for me (StandardRequestHandler as well as edismax). For wildcard queries a constant score is returned. Is there any way to get this setting working for wildcard queries as well?

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Thursday, 28 July 2011 12:37
To: solr-user@lucene.apache.org
Subject: Re: Index time boosting with DIH

On Thu, Jul 28, 2011 at 3:56 PM, Bürkle, David david.buer...@irix.ch wrote: Can someone point me to an example for using index time boosting with the DataImportHandler.

You can use the special flag variable $docBoost to add an index-time boost. http://wiki.apache.org/solr/DataImportHandler#Special_Commands

Regards, Shalin Shekhar Mangar.
Re: convert date format at indexing time
If you use DIH with the TikaEntityProcessor, you get the dates in a Solr-compatible format if you use the dates stored in the metadata.

<dataSource type="BinURLDataSource" name="bin"/>
<entity name="tika" processor="TikaEntityProcessor" url="${crawl.id}"
        dataSource="bin" onError="continue" format="text">
  <field column="created" meta="true" name="creation_date"/>
</entity>
combining xml and nutch index in solr
Hi, I have an XML file which has url, category, subcategory, and title kinds of details, and we crawl the urls in the XML using Nutch. Is there any way for us to merge both, so the schema will look like:

url
category
subcategory
title
crawl_data_summary_from_nutch
crawl_data_body_content_from_nutch

Any solution for this? Thanks, abhay
Re: Combine XML data with DIH
Hi, I have never done this with XML files, but you can have multiple data sources in the DIH config: http://wiki.apache.org/solr/DataImportHandler#multipleds abhay
Re: Combine XML data with DIH
Yeah, but how do I combine the two based on the value in link?
Solr Incremental Indexing
Hi, I need some help with a Solr incremental indexing approach. I have built my Solr index using the SolrJ API and now want to update the index whenever any change has been made in the database. My requirement is not to use DB triggers to call any update events. I want to update my index on the fly whenever my application updates any record in the database. Note: my indexing logic to get the required data from the DB is somewhat complex and involves many tables. Please suggest how I can proceed here. Thanks, Lateef
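A minimal sketch of that pattern with SolrJ, assuming the application can assemble the denormalized record itself after each database write (the field names and URL are placeholders):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IncrementalIndexer {
        private final SolrServer solr;

        public IncrementalIndexer(String url) throws Exception {
            this.solr = new CommonsHttpSolrServer(url);
        }

        // Call from the application code path that updates the database record
        public void onRecordUpdated(String id, String title) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);       // same unique key as the existing doc;
            doc.addField("title", title); // re-adding replaces the old version
            solr.add(doc);
            solr.commit(); // or rely on autoCommit in solrconfig.xml instead
        }
    }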
RE: embedded solrj doesn't refresh index
Thanks Marc. I guess I was not clear in my previous statement, so let me rephrase. I use DIH to import data into Solr and do the indexing. Everything works fine. I have another embedded Solr server pointing to the same index files, and I use embedded SolrJ to search the index. So the first Solr is for indexing purposes; it can be turned off once the indexing is done. However, the changes in the index files do not show up through embedded SolrJ; that is, once the new index is built, from embedded SolrJ I still get the old results. Only after I restart the embedded Solr server are the new changes reflected in SolrJ. The embedded SolrJ behaves as if there were a cache that it always consults first. Thanks. JB

-----Original Message-----
From: Marc Sturlese [mailto:marc.sturl...@gmail.com]
Sent: Friday, July 22, 2011 1:57 AM
To: solr-user@lucene.apache.org
Subject: RE: embedded solrj doesn't refresh index

Are you indexing with full import? In case yes, and the resultant index has a similar number of docs to the one you had before, try setting reopenReaders to false in solrconfig.xml. You have to send the commit, of course.
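One thing worth trying (a sketch, assuming both instances share the same index directory and the indexing instance has finished and released its write lock): issue a commit through the embedded server after each import, which makes Solr open a new searcher instead of serving from the cached one.

    import java.io.File;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.core.CoreContainer;

    public class RefreshEmbedded {
        public static void main(String[] args) throws Exception {
            String home = "/path/to/solr/home"; // placeholder
            CoreContainer container = new CoreContainer();
            container.load(home, new File(home, "solr.xml"));
            SolrServer solr = new EmbeddedSolrServer(container, "core1");
            // an empty commit forces a new searcher over the updated index
            solr.commit();
        }
    }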
dealing with so many different sorting options
As I'm using Solr more and more, I'm finding that I need to do searches and then order by new criteria. So I am constantly adding new fields into Solr and then reindexing everything. I want to know if adding all this data into Solr is the normal way to deal with sorting; I'm finding that I have almost a whole copy of my database in Solr. Should I be pulling all the data out of Solr and then sorting in my database? That solution seems like it would take too long. Could/should I just move to Solr as my primary store so I can query directly against it without having to reindex all the time? Right now we store about 50 million docs, but the size is growing pretty fast and it is a pain to reindex everything every time I add a new column to sort by.
Re: Exact match not the first result returned
I implemented both solutions Hoss suggested and was able to achieve the desired results. I would like to go with

defType=dismax qf=myname pf=myname_str^100 q=Frank

but that doesn't seem to work if I have a query like myname:Frank otherfield:something. So I think I will go with

q=+myname:Frank myname_str:Frank^100

Thanks for the help everyone! Brian Lamb

On Wed, Jul 27, 2011 at 10:55 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: With your solution, RECORD 1 does appear at the top but I think that's just
: blind luck more than anything else because RECORD 3 shows as having the same
: score. So what more can I do to push RECORD 1 up to the top? Ideally, I'd
: like all three records returned with RECORD 1 being the first listing.

With omitNorms, RECORD1 and RECORD3 have the same score because only the tf() matters, and both docs contain the term "frank" exactly twice. The reason RECORD1 isn't scoring higher, even though it contains (as you put it) a match for 'Frank' exactly, is that from a term perspective RECORD1 doesn't actually match myname:Frank exactly, because there are in fact other terms in that field, because it's multivalued.

One way to indicate that you *only* want documents where the entire field value matches your input (ie: RECORD1 but no other records) would be to use a StrField instead of a TextField, or an analyzer that doesn't split up tokens (ie: something using KeywordTokenizer). That way a query on myname:Frank would not match a document where you had indexed the value "Frank Stalone", but a query for myname:"Frank Stalone" would.

In your case, you don't want *only* the exact field value matches, you want them boosted, so you could do something like copyField myname into myname_str and then do...

q=+myname:Frank myname_str:Frank^100

...in which case a match on myname is required, but a match on myname_str will greatly increase the score.

dismax (and edismax) are really designed for situations like this...

defType=dismax qf=myname pf=myname_str^100 q=Frank

-Hoss
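For completeness, the dismax variant expressed through SolrJ (a sketch; the handler type and field names are the ones from this thread):

    import org.apache.solr.client.solrj.SolrQuery;

    public class ExactMatchBoost {
        static SolrQuery build(String userInput) {
            SolrQuery q = new SolrQuery(userInput); // e.g. "Frank"
            q.set("defType", "dismax");
            q.set("qf", "myname");          // match on the analyzed field
            q.set("pf", "myname_str^100");  // boost whole-value matches
            return q;
        }
    }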
Re: slow highlighting because of stemming
I'm not sure I would identify stemming as the culprit here. Do you have very large documents? If so, there is a patch for FVH committed to limit the number of phrases it looks at; see hl.phraseLimit, but this won't be available until 3.4 is released. You can also limit the amount of each document that is analyzed by the regular Highlighter using maxDocCharsToAnalyze (and maybe this applies to FVH? not sure). Using RegexFragmenter is also probably slower than something like SimpleFragmenter. There is work to implement faster highlighting for Solr/Lucene, but it depends on some basic changes to the search architecture, so it might be a while before that becomes available. See https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested in following that development. -Mike

On 07/29/2011 04:55 AM, Orosz György wrote: [...]
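At the Solr request level, that analysis cap is exposed as the hl.maxAnalyzedChars parameter (an assumption worth verifying against your version; maxDocCharsToAnalyze is the underlying Lucene method name). A sketch of setting it per query with SolrJ:

    import org.apache.solr.client.solrj.SolrQuery;

    public class HighlightTuning {
        static SolrQuery build(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.setHighlight(true);
            q.set("hl.fl", "dokumentum_syn_query");
            // cap how much of each stored value the highlighter re-analyzes
            q.set("hl.maxAnalyzedChars", "51200");
            return q;
        }
    }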
Error with Extracting PDF metadata
I am using Solr 3.3 and I am trying to extract and index metadata from PDF files. I am using the DataImportHandler with the TikaEntityProcessor to add the documents. Here are the fields as defined in my schema.xml file:

<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
<field name="imgName" type="string" indexed="false" stored="true" multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>

So I suppose the metadata information should be indexed and stored in fields prefixed with attr_. Here is how my data config file looks. It takes a source directory path from a database and passes it to a FileListEntityProcessor, which passes each of the PDF files found in the directory to the TikaEntityProcessor to extract and index the content.

<entity onError="skip" name="fileSourcePaths" rootEntity="false"
        dataSource="dbSource" fileName=".*pdf"
        query="select path from file_sources">
  <entity name="fileSource" processor="FileListEntityProcessor"
          transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}"
          recursive="true" rootEntity="false">
    <field name="link" column="fileAbsolutePath" thumbnail="true"/>
    <field name="imgName" column="imgName"/>
    <entity rootEntity="true" onError="abort" name="file"
            processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}"
            dataSource="fileSource" format="text">
      <field column="resourceName" name="title" meta="true"/>
      <field column="Creation-Date" name="date_published" meta="true"/>
      <field column="text" name="description"/>
    </entity>
  </entity>
</entity>

It extracts the description and Creation-Date just fine, but it doesn't seem to be extracting resourceName, so there is no title field for the documents when I query the index. This is weird because both Creation-Date and resourceName are metadata. Also, none of the other possible metadata was being stored under the attr_ fields. I came across some threads which said there are known problems with using Tika 0.8, so I downloaded Tika 0.9 and replaced 0.8 with it. I also upgraded pdfbox, jempbox and fontbox from 1.3 to 1.4. I tested one of the PDFs separately with just Tika to see what metadata is stored with the file. This is what I found:

Content-Length: 546459
Content-Type: application/pdf
Creation-Date: 2010-06-09T12:11:12Z
Last-Modified: 2010-06-09T14:53:38Z
created: Wed Jun 09 08:11:12 EDT 2010
creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows
producer: Antenna House PDF Output Library 2.6.0 (Windows)
resourceName: Argentina.pdf
trapped: False
xmpTPg:NPages: 2

As you can see, it does have resourceName metadata. I tried indexing again but I got the same result: Creation-Date extracts and indexes just fine, but not resourceName. Also, the rest of the attributes are not being indexed under the attr_ fields. What's going wrong?
Re: Auto-Commit and failures / schema violations
: sure that the add was successful, as (for example) schema violations
: seem to be detected on commit, which is therefore too late, as the

I have no idea what that statement means. If you are getting an error, can you be specific as to what type of error you are getting? (ie: what is returned to the client, and what do you see in the logs) -Hoss
Re: I can't pass the unit test when compiling from apache-solr-3.3.0-src
: I find that the junit test will always fail, and tell me 'BUILD FAILED',
: but if I type 'ant dist', I can get an apache-solr-3.3-SNAPSHOT.war
: with no warning.
:
: Is it a problem just for me?

Can you please be specific...

* which test(s) fail for you?
* what are the failures?

Any time a test fails, that info appears in the ant test output, and the full details of all tests are written to build/test-results. You can run "ant test-reports" from the solr directory to generate an HTML report of all the success/failure info. -Hoss
Re: Solr versioning policy
: 1. Is this the plan moving forward (to aim for a new minor release
: approximately every couple of months)?

The goal is to release minor versions more frequently as features and low-priority bug fixes are available. If there is a high-priority bug fix available, and no likelihood of a near-term minor release, then bug-fix releases (ie: 3.4.1) will be done (as has always been the case).

This new accelerated minor-feature release approach is possible because of the parallel development branches approach that was instituted a while back, but once those branches were created it took some time to get the test/build/release processes automated enough that devs felt comfortable releasing more frequently. There's no hard and fast rule about how often releases will happen. Anyone can step up and push for a release if they feel the features are ready.

: 2. Will minor version increases always be backwards compatible (so I could
: upgrade from 3.x to 3.y where y > x without having to update the
: schema/config or rebuild the indexes)?

That has always been the goal, yes. Sometimes the mechanism for dealing with new bugs/features requires making changes to config files, and when known this is noted in the Upgrading section of CHANGES.txt for the affected release. -Hoss
Re: omitNorms
: my field category (string) has omitNorms=true and omitTermFreqAndPositions=true.
: i have indexed all docs but when i do a search like:
: http://xxx:xxx/solr/select/?q=category:A&debugQuery=on
: i see there's normalization and idf and tf. Why? i can't understand the reason.

Those options ensure that that information isn't calculated and stored in your index, so they don't affect searches, but the debugging code still shows where the norms/tf (which don't exist for those fields) are part of the score calculation. You'll note that they are always 1 in this debug info, making them no-ops in the multiplication...

: 8.676225 = (MATCH) fieldWeight(category:A in 826), product of:
:   1.0 = tf(termFreq(category:A)=1)
:   8.676225 = idf(docFreq=6978, maxDocs=15049953)
:   1.0 = fieldNorm(field=category, doc=826)

-Hoss
Re: Display term frequency / phrase freqency for documents
: I'd like to expose the termFrequency / phraseFrequency to the end user in my
: application. For example I would like to be able to say "Your search term
: appears X times in this document."
:
: I can see these figures exposed via debugQuery=on, where I get output like ...
: Is there any way to expose these figures in XML nodes though? I could parse
: them from the debug output but that feels very hacky!

http://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured

-Hoss
Re: Disabling Coord on Solr queries
: I am looking for the simplest way to disable coord in Solr queries. I have
: found out Lucene allows this by construction of a BooleanQuery with
: disableCoord=true:
: public *BooleanQuery*(boolean disableCoord)
:
: Is there any way to activate this functionality directly from a Solr query?

Not that I know of, but if you'd like to open a jira issue, it would probably be fairly easy to add this to the LuceneQParser so you could do something like...

q={!lucene coord=false}my boolean query

-Hoss
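Until such a parser option exists, the Lucene-level construction looks like this (a sketch against the 3.x API; field and term values are placeholders, and passing true to the constructor disables the coord factor):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class NoCoordQuery {
        static BooleanQuery build() {
            BooleanQuery bq = new BooleanQuery(true); // true = disable coord
            bq.add(new TermQuery(new Term("body", "foo")), BooleanClause.Occur.SHOULD);
            bq.add(new TermQuery(new Term("body", "bar")), BooleanClause.Occur.SHOULD);
            return bq;
        }
    }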
Looking for a senior search engineer
Hi, sorry if this isn't the right place for this message, but it's a very specific role we're looking for and I'm not sure where else to find Solr experts! I was wondering if anyone would be interested, or knew anyone who would be interested, in working on goodreads.com's search. We're using Solr, and we'd like someone with experience doing Solr replication, faceted search, and more cool stuff. We run Ruby on Rails for the website. Potential applicants don't need to know Ruby or Rails, but they'd be expected to pick it up after starting. More info on our website: http://www.goodreads.com/about/us

Michael Economy, Director of Engineering, Goodreads Inc.
Re: dih fetching but not adding records to index
Quick question: if I want to load just the document with id=2, how would that work? I tried an XPath expression that works with XPath tools but not in Solr. How would I do this?

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="c:\temp"
            fileName="promotions.xml" recursive="false" rootEntity="false"
            dataSource="null">
      <entity name="x" processor="XPathEntityProcessor" forEach="/add/doc"
              url="${f.fileAbsolutePath}" pk="id">
        <field column="id" xpath="/add/doc/[id=2]/id"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Re: I found a sorting bug in solr/lucene
: According to that bug list, there are other characters that break the
: sorting function. Is there a list of safe characters I can use as a
: delimiter?

The safest field names to use (and the most efficient to parse when sorting) are things that follow the identifier semantics in Java (not including the $ character at the beginning)...

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierStart%28char%29
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html#isJavaIdentifierPart%28char%29

So sorts like "foo_bar_baz asc" will definitely work, and are heavily tested. I've just posted a patch to SOLR-2606 that should fix the "foo:bar asc" and "foo-bar asc" situations, but because of the function query sort parsing that happens first, they will always be slightly slower to parse. -Hoss