RE: java "GC overhead limit exceeded"
Hi, which version do you use? 1.4.1 is highly recommended, since previous versions contained some memory-usage bugs that could lead to memory leaks. I had this GC overhead limit in my setup as well; the only workaround that helped was a daily restart of all instances. With 1.4.1 this issue seems to be fixed. -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Tuesday, July 27, 2010 01:18 To: solr-user@lucene.apache.org Subject: java "GC overhead limit exceeded" I am now occasionally getting a Java "GC overhead limit exceeded" error in my Solr. This may or may not be related to recently adding much better (and more) warming queries. I can get it when trying a 'commit', after deleting all documents in my index, or in other cases. Anyone run into this, and have suggestions as to how to set my java options to eliminate it? I'm not sure this simply means that my heap size needs to be bigger; it seems to be something else. Any advice appreciated. Googling didn't get me much I trusted. Jonathan
Re: spell checking....
This is in solrconfig.xml (the XML markup was lost in the archive; only the element values remain): default solr.IndexBasedSpellChecker spell ./spellchecker 0.7 true true jarowinkler lowerfilt org.apache.lucene.search.spell.JaroWinklerDistance ./spellchecker true true textSpell. I added the following in the standard request handler: explicit default false false 1 spellcheck
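For reference, these values line up with the spell-check example on the Solr wiki, so the intended configuration was presumably close to the following sketch (element names taken from that wiki example, not from the lost original):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <float name="accuracy">0.7</float>
    <str name="buildOnCommit">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">jarowinkler</str>
    <str name="field">lowerfilt</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>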
Re: Querying throws java.util.ArrayList.RangeCheck
Do you have any custom code, or is this stock solr (and which version, and what is the request)? -Yonik http://www.lucidimagination.com On Tue, Jul 27, 2010 at 12:30 AM, Manepalli, Kalyan wrote: > Hi, > I am stuck at this weird problem during querying. While querying the solr > index I am getting the following error. > Index: 52, Size: 16 java.lang.IndexOutOfBoundsException: Index: 52, Size: 16 > at java.util.ArrayList.RangeCheck(ArrayList.java:547) at > java.util.ArrayList.get(ArrayList.java:322) at > org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288) at > org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217) at > org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at > org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at > org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at > org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at > > During debugging I found that the SolrIndexReader is trying to read a > document which doesn't exist in the index. > I tried optimizing the index and restarting the server but still no luck. > > Any help in resolving this issue will be appreciated. > > Thanks > Kalyan
Querying throws java.util.ArrayList.RangeCheck
Hi, I am stuck at this weird problem during querying. While querying the solr index I am getting the following error. Index: 52, Size: 16 java.lang.IndexOutOfBoundsException: Index: 52, Size: 16 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at During debugging I found that the SolrIndexReader is trying to read a document which doesn't exist in the index. I tried optimizing the index and restarting the server but still no luck. Any help in resolving this issue will be appreciated. Thanks Kalyan
Re: Design questions/Schema Help
I think the search log will require a lot of storage, which may make the index size unreasonably large if stored in Solr, and the aggregation results may not really fit the Lucene index structure. :) kiwi happy hacking ! On Tue, Jul 27, 2010 at 7:47 AM, Tommy Chheng wrote: > Alternatively, have you considered storing (or I should say indexing) the > search logs with Solr? > > This lets you text search across your search queries. You can perform time > range queries with solr as well. > > @tommychheng > Programmer and UC Irvine Graduate Student > Find a great grad school based on research interests: > http://gradschoolnow.com > > > > On 7/26/10 4:43 PM, Mark wrote: > >> We are thinking about using Cassandra to store our search logs. Can >> someone point me in the right direction/lend some guidance on design? I am >> new to Cassandra and I am having trouble wrapping my head around some of >> these new concepts. My brain keeps wanting to go back to a RDBMS design. >> >> We will be storing the user query, # of hits returned and their session >> id. We would like to be able to answer the following questions. >> >> - What is the n most popular queries and their counts within the last x >> (mins/hours/days/etc). Basically the most popular searches within a given >> time range. >> - What is the most popular query within the last x where hits = 0. Same as >> above but with an extra "where" clause >> - For session id x give me all their other queries >> - What are all the session ids that searched for 'foos' >> >> We accomplish the above functionality w/ MySQL using 2 tables. One for the >> raw search log information and the other to keep the aggregate/running >> counts of queries. >> >> Would this sort of ad-hoc querying be better implemented using Hadoop + >> Hive? If so, should I be storing all this information in Cassandra then >> using Hadoop to retrieve it? >> >> Thanks for your suggestions
StatsComponent and sint?
Man, what types of fields is StatsComponent actually known to work with? With an sint, it seems to have trouble if there are any documents with null values for the field. It appears to decide that a null/empty/blank value is -1325166535, and is thus the minimum value. At least if I'm interpreting what's going on right. Anyone run into this?
RE: java "GC overhead limit exceeded"
> Short answer: "GC overhead limit exceeded" means "out of memory". Aha, thanks. So the answer is just "raise your Xmx/heap size, you need more memory to do what you're doing", yeah? Jonathan
Is there a cache for a query?
I want a cache that caches the complete result of a query (all steps, including collapse, highlight and facet). I read http://wiki.apache.org/solr/SolrCaching, but can't find a global cache. Maybe I can use an external cache to store key-value pairs. Is there one in Solr?
Re: java "GC overhead limit exceeded"
On Mon, Jul 26, 2010 at 7:17 PM, Jonathan Rochkind wrote: > I am now occasionally getting a Java "GC overhead limit exceeded" error in > my Solr. This may or may not be related to recently adding much better (and > more) warming queries. When memory gets tight, the JVM kicks off a garbage collection to try and free more space (and it normally can free at least some). When only a little memory is freed, and GC keeps kicking in, it starts to eat up a majority of the CPU time and the JVM gives up with a "GC overhead limit exceeded" error. Short answer: "GC overhead limit exceeded" means "out of memory". -Yonik http://www.lucidimagination.com
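A minimal sketch of the usual remedies, assuming the stock Jetty start.jar from the Solr example (the heap sizes are illustrative, not from this thread):

# Give the JVM more headroom; size to your machine and index.
java -Xms512m -Xmx2g -jar start.jar

# HotSpot can also disable the check itself, though that usually just
# turns the error into a plain OutOfMemoryError later:
java -Xmx2g -XX:-UseGCOverheadLimit -jar start.jar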
Re: Updating fields in Solr
See below: On Mon, Jul 26, 2010 at 11:49 AM, Pramod Goyal wrote: > Hi, > I have a requirement where I need to keep updating certain fields in > the schema. My requirement is to change some of the fields or add some > values to a field (multi-value field). I understand that I can use Solr > update for this. If I am using Solr update, do I need to publish the entire > document again or do I just need to publish the updated fields? Again, in > case of update, can I add (append) new values to the existing fields? > updating a document does, indeed, require that you reindex the whole thing. There's no capability to just update a field. > > In my document most of the parts remain unchanged; however, a few fields > keep changing. Will it be costly to update the entire document just to > change a field? I was wondering if I should create two solr cores, one for > static content and another one for dynamic content. This way I can reduce > the time taken to update a document, but it introduces the complexity of > querying different cores and combining the result on the client side. > Do you care how costly it is? By that I mean what is your expected update rate, how big is your index, etc. If you're updating 1 document a day you don't care. If you're updating 100/sec, you care very much. In between it's an interesting question :). Multiple cores are a possibility, but you're right that's more complex. I'd really evaluate (by gathering statistics) whether you need to before trying it. > > Is there a way to configure solr so that the client can execute a single > query and solr internally executes multiple queries across different cores > and returns a single result? > I'll leave this one to someone else... Best Erick
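A sketch of the re-send-the-whole-document approach in SolrJ (1.4-era API; the URL, id and field names below are made up for illustration):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReindexExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // The full document must be rebuilt, unchanged fields included.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-42");            // unique key
    doc.addField("title", "unchanged title");
    doc.addField("tag", "existing-value");   // multi-valued field:
    doc.addField("tag", "appended-value");   // repeat addField to append
    server.add(doc);   // replaces the previous version with the same id
    server.commit();
  }
}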
Re: Total number of terms in an index?
: Sorry, like the subject, I mean the total number of terms. It's not stored anywhere, so the only way to fetch it is to actually iterate all of the terms and count them (that's why LukeRequestHandler is so slow to compute this particular value). If I remember right, someone mentioned at one point that flex would let you store data about stuff like this in your index as part of the segment writing, but frankly I'm still not sure how that will help -- because unless your index is fully optimized, you still have to iterate the terms in each segment to 'de-dup' them. -Hoss
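For reference, a sketch of that iterate-and-count in raw Lucene (2.9/3.x API, as bundled with Solr 1.4; the index path is a placeholder):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class TermCount {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true);
    TermEnum terms = reader.terms();  // merged enum over all segments
    long count = 0;
    while (terms.next()) count++;     // one step per unique term
    terms.close();
    reader.close();
    System.out.println("unique terms: " + count);
  }
}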
Re: spell checking....
It's almost impossible to analyze this kind of thing without seeing your schema and debug output. You might want to review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Mon, Jul 26, 2010 at 9:56 AM, satya swaroop wrote: > Hi all, > I am new to solr and was able to implement indexing of documents > by following the solr wiki. Now I am trying to add the spellchecking. I > followed the spellcheck component in the wiki but am not getting the suggested > spellings. I first built it with spellcheck.build=true,... > > Here I give you the example: > > http://localhost:8080/solr/spell?q=javs&spellcheck=true&spellcheck.collate=true > > Here the response should actually suggest "java" but didn't. > > Can anyone guide me about it? > I am using solr 1.4, tomcat on ubuntu > > Regards, > swarup
Re: Similar search regarding a result document
I need much more detailed information before I can make sense of your use case. Could you provide a sample? MoreLikeThis sounds in the right neighborhood, but I'm guessing. Best Erick On Mon, Jul 26, 2010 at 9:02 AM, wrote: > > Hi, > > I would like to implement a similar-search feature... but not relative to > the initial search query but relative to each result document. > > The structure of each doc is: > id > title > content > price > etc... > > Then we have a database of global search queries; I'm thinking of > integrating this in solr. > > I'm planning to implement this as a query of a query... but first I would > like to know if there is a built-in function in Solr for this? > > Thanks for your help.
Re: question about relevance
I'm having trouble getting my head around what you're trying to accomplish, so if this is off base you know why. But what it smells like is that you're trying to do database-ish things in a SOLR index, which is almost always the wrong approach. Is there a way to index redundant data with each document so all you have to do to get the "relevant" users is a simple query? Adding scores is also suspect... I don't see how that does predictable things. But I'm also failing completely to understand what a "relevant" user is. Not much help; if this is way off base, perhaps you could provide some additional use-cases? Best Erick On Mon, Jul 26, 2010 at 2:37 AM, Bharat Jain wrote: > Hello All, > > I have an index which stores multiple objects belonging to a user > > for e.g. > > -> Identifies user > object type e.g. userBasic or userAdv > > >> MAPS to userBasicInfoObject > > > > -> MAPS to userAdvInfoObject > > > > > > Now when I am doing some query I get multiple records mapping to java > objects (identified by objType) that belong to the same user. > > Now I want to show the relevant users at the top of the list. I am thinking > of adding the Lucene scores of different result documents to get the best > scores. Is this the correct approach to get the relevance of the user? > > Thanks > Bharat Jain
Solr crawls during replication
We have an index around 25-30G w/ 1 master and 5 slaves. We perform replication every 30 mins. During replication the disk I/O obviously shoots up on the slaves to the point where all requests routed to that slave take a really long time... sometimes to the point of timing out. Is there any logical or physical changes we could make to our architecture to overcome this problem? Thanks
Re: Design questions/Schema Help
Alternatively, have you considered storing (or I should say indexing) the search logs with Solr? This lets you text search across your search queries. You can perform time range queries with solr as well. @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 7/26/10 4:43 PM, Mark wrote: We are thinking about using Cassandra to store our search logs. Can someone point me in the right direction/lend some guidance on design? I am new to Cassandra and I am having trouble wrapping my head around some of these new concepts. My brain keeps wanting to go back to a RDBMS design. We will be storing the user query, # of hits returned and their session id. We would like to be able to answer the following questions. - What is the n most popular queries and their counts within the last x (mins/hours/days/etc). Basically the most popular searches within a given time range. - What is the most popular query within the last x where hits = 0. Same as above but with an extra "where" clause - For session id x give me all their other queries - What are all the session ids that searched for 'foos' We accomplish the above functionality w/ MySQL using 2 tables. One for the raw search log information and the other to keep the aggregate/running counts of queries. Would this sort of ad-hoc querying be better implemented using Hadoop + Hive? If so, should I be storing all this information in Cassandra then using Hadoop to retrieve it? Thanks for your suggestions
Re: Design questions/Schema Help
On 7/26/10 4:43 PM, Mark wrote: We are thinking about using Cassandra to store our search logs. Can someone point me in the right direction/lend some guidance on design? I am new to Cassandra and I am having trouble wrapping my head around some of these new concepts. My brain keeps wanting to go back to a RDBMS design. We will be storing the user query, # of hits returned and their session id. We would like to be able to answer the following questions. - What is the n most popular queries and their counts within the last x (mins/hours/days/etc). Basically the most popular searches within a given time range. - What is the most popular query within the last x where hits = 0. Same as above but with an extra "where" clause - For session id x give me all their other queries - What are all the session ids that searched for 'foos' We accomplish the above functionality w/ MySQL using 2 tables. One for the raw search log information and the other to keep the aggregate/running counts of queries. Would this sort of ad-hoc querying be better implemented using Hadoop + Hive? If so, should I be storing all this information in Cassandra then using Hadoop to retrieve it? Thanks for your suggestions Whoops wrong forum
Design questions/Schema Help
We are thinking about using Cassandra to store our search logs. Can someone point me in the right direction/lend some guidance on design? I am new to Cassandra and I am having trouble wrapping my head around some of these new concepts. My brain keeps wanting to go back to a RDBMS design. We will be storing the user query, # of hits returned and their session id. We would like to be able to answer the following questions. - What is the n most popular queries and their counts within the last x (mins/hours/days/etc). Basically the most popular searches within a given time range. - What is the most popular query within the last x where hits = 0. Same as above but with an extra "where" clause - For session id x give me all their other queries - What are all the session ids that searched for 'foos' We accomplish the above functionality w/ MySQL using 2 tables. One for the raw search log information and the other to keep the aggregate/running counts of queries. Would this sort of ad-hoc querying be better implemented using Hadoop + Hive? If so, should I be storing all this information in Cassandra then using Hadoop to retrieve it? Thanks for your suggestions
Re: NullPointerException with CURL, but not in browser
: However, when I'm trying this very URL with curl within my (perl) script, I : receive a NullPointerException: : CURL-COMMAND: curl -sL : http://localhost:8983/solr/select?indent=on&version=2.2&q=*&fq=ListId%3A881&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard it appears you aren't quoting the URL, so that first "&" character is causing the shell to think you are done with the command, and you want it to be backgrounded (although I'm not certain, since it depends on how you are having perl execute curl). I would suggest that you avoid exec/system calls to "curl" from Perl, and use an LWP::UserAgent instead. -Hoss
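A sketch of both fixes (the query string is shortened here; substitute the full URL):

# quote the URL so the shell doesn't treat '&' as a backgrounding operator
curl -sL 'http://localhost:8983/solr/select?q=*&fq=ListId%3A881&rows=0'

# or skip the external curl process entirely, from Perl:
use LWP::UserAgent;
my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://localhost:8983/solr/select?q=*&fq=ListId%3A881&rows=0');
print $res->content if $res->is_success;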
java "GC overhead limit exceeded"
I am now occasionally getting a Java "GC overhead limit exceeded" error in my Solr. This may or may not be related to recently adding much better (and more) warming queries. I can get it when trying a 'commit', after deleting all documents in my index, or in other cases. Anyone run into this, and have suggestions as to how to set my java options to eliminate it? I'm not sure this simply means that my heap size needs to be bigger; it seems to be something else. Any advice appreciated. Googling didn't get me much I trusted. Jonathan
Re: Total number of terms in an index?
Sorry, like the subject, I mean the total number of terms. On Mon, Jul 26, 2010 at 4:03 PM, Jason Rutherglen wrote: > What's the fastest way to obtain the total number of docs from the > index? (The Luke request handler takes a long time to load so I'm > looking for something else). >
Total number of terms in an index?
What's the fastest way to obtain the total number of docs from the index? (The Luke request handler takes a long time to load so I'm looking for something else).
NullPointerException with CURL, but not in browser
Hi *, I'd like to see how many documents I have in my index with a certain ListId, in this example ListId 881. http://localhost:8983/solr/select?indent=on&version=2.2&q=*&fq=ListId%3A881&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard In the browser, the output looks perfect, I indeed have 3 matching documents in the index: 0 4097 *,score on 0 * standard standard ListId:881 2.2 0 However, when I'm trying this very URL with curl within my (perl) script, I receive a NullPointerException: CURL-COMMAND: curl -sL http://localhost:8983/solr/select?indent=on&version=2.2&q=*&fq=ListId%3A881&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard Error 500 HTTP ERROR: 500null java.lang.NullPointerException at java.io.StringReader.<init>(StringReader.java:33) at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197) ... Grateful for any kind of help. cheers - MOPS
Re: How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?
Hi Savannah, I have just answered this question over on drupal.org. http://drupal.org/node/811062 Responses 5 and 11 will help you. On the solrconfig.xml side of things you will only really need Drupal's version. Although still in alpha, my Nutch module will help you out with integration: http://drupal.org/project/nutch Regards, David Stuart On 26 Jul 2010, at 21:37, Savannah Beckett wrote: > I am using the Drupal ApacheSolr module to integrate solr with drupal. I already > integrated solr with nutch. I already moved nutch's solrconfig.xml and > schema.xml to solr's example directory, and it works. I tried to append > Drupal's > ApacheSolr module's own solrconfig.xml and schema.xml into the same xml > files, > but I got the following error when I "java -jar start.jar": > > Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log > SEVERE: Exception during parsing file: > solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document > following the root element must be well-formed. > at > com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249) > at > com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284) > > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124) > at org.apache.solr.core.Config.<init>(Config.java:110) > at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:130) > at > org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134) > > at > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) > > Why? Does solrconfig.xml allow two <config> sections? Does schema.xml > allow two <schema> sections? > > Thanks.
How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?
I am using the Drupal ApacheSolr module to integrate solr with drupal. I already integrated solr with nutch. I already moved nutch's solrconfig.xml and schema.xml to solr's example directory, and it works. I tried to append Drupal's ApacheSolr module's own solrconfig.xml and schema.xml into the same xml files, but I got the following error when I "java -jar start.jar": Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log SEVERE: Exception during parsing file: solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed. at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249) at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284) at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124) at org.apache.solr.core.Config.<init>(Config.java:110) at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:130) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) Why? Does solrconfig.xml allow two <config> sections? Does schema.xml allow two <schema> sections? Thanks.
Solr 3.1 and ExtractingRequestHandler resulting in blank content
Hello all, I’m working on a project with Solr. I had 1.4.1 working OK using ExtractingRequestHandler except that it was crashing on some PDFs. I noticed that Tika bundled with 1.4.1 was 0.4, which was kind of old. I decided to try updating to 0.7 as per the directions here: http://wiki.apache.org/solr/ExtractingRequestHandler but it was giving me errors (I forget what they were specifically). Then I tried downloading Solr 3.1 from the source repository, which I noticed came with Tika 0.7. I figured this would be an easier route to get working. Now I’m testing with 3.1 and 0.7 and I’m noticing my documents are going into Solr OK, but they all have blank content (no document text stored in Solr). I did see that the default “text” field is not stored. Changing that to stored=true didn’t help. Changing to fmap.content=attr_content&uprefix=attr_content didn’t help either. I have attached all relevant info here. Please let me know if someone sees something I don’t (it’s entirely possible as I’m relatively new to Solr). [The attached schema.xml and solrconfig.xml survive in the archive only as bare element values, with all XML markup stripped, and are omitted here.]
Re: Solr Doc Lucene Doc !?
Ah okay, thx =) So the class "SolrInputDocument" is only for indexing a document and "SolrDocument" for search? When Solr indexes a document, the first step is to create a SolrInputDocument; then the class "DocumentBuilder" creates a Lucene Document in the function "Document toDocument(SolrInputDoc, Schema)"?!
Re: slave index is bigger than master index
: No I didn't. I thought you aren't supposed to run optimize on slaves. Well correct, you should make all changes to the master. : But it doesn't matter now, as I think it's fixed now. I just added a dummy : document on master, ran a commit call and then once that executed ran an : optimize call. This triggered snapshooter to replicate the index, which : somehow resulted in normal index size at slaves. My hunch: are you running on Windows? Windows filesystems have issues with trying to delete a file while processes still have the file handle open. Since Solr needs those "old" filehandles to continue serving requests while it opens up the "new" copy of the index, those files wind up left on disk. The *next* time a new index is opened, it tries to delete those files again, and then they succeed... http://wiki.apache.org/lucene-java/LuceneFAQ#Why_do_I_have_a_deletable_file_.28and_old_segment_files_remain.29_after_running_optimize.3F ...if you notice this situation happen again, check and see if you have a "deletable" file. -Hoss
Re: Can't find org.apache.solr.client.solrj.embedded
: where is a Jar, containing org.apache.solr.client.solrj.embedded? Classes in the embedded package are useless w/o the rest of the Solr internal "core" classes, so they are included directly in the apache-solr-core-1.4.1.jar. (I know... the directory structure doesn't make a lot of sense) : Also I can't find any other sources than : >http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/webapp/src/org/apache/solr/client/solrj/embedded/ : , which does not fit Solr 1.4. All the source code for Solr 1.4.1 is included in the 1.4.1 release artifacts (the tgz or zip files) .. if you want to find it in SVN it's located here... https://svn.apache.org/repos/asf/lucene/solr/tags/release-1.4.1/ -Hoss
Re: Solr Doc Lucene Doc !?
: I want to learn more about the technology. : : Is there a case where a SolrDoc is really created? Or is it in the code only for a : better understanding of the lucene and solr border? There is a real and actual class named "SolrDocument". It is a simpler object than Lucene's "Document" class because in Solr the details about the field types (stored, indexed, etc...) are handled by the schema, and are not distinct per Field instance. http://lucene.apache.org/solr/api/org/apache/solr/common/SolrDocument.html -Hoss
Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
Every so often I need to index new batches of scanned PDFs, and occasionally Adobe's OCR can't recognize the text in a couple of these documents. In these situations I would like to type a small amount of text onto the document and have it be extracted by Solr CELL. Adobe Pro 9 has a number of different ways to add text directly to a PDF file: *Typewriter *Sticky Note *Callout boxes *Text boxes I tried indexing documents with each of these text additions with Solr 1.4.1 + Solr CELL but can't extract the text in any of these boxes. If someone has modified their Solr CELL installation to use more recent versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can comment on whether newer versions can pull the text out of any of these various text boxes, I'd appreciate that very much. -Jon
Re: 2 type of docs in same schema?
I still assume that what you mean by "search queries data" is just some other form of document (in this case containing one search request per document). I'm not sure what you intend to do by that actually, but yes, indexing stays the same (you probably want to mark the field "type" as required so you don't forget to include it in your indexing program). 2010/7/26 > > Thanks for your answer! That's great. > > Now, to index search-query data, is there something special to do? Or does it > stay as usual? > > > -Original Message- > From: Geert-Jan Brits > To: solr-user@lucene.apache.org > Sent: Mon, Jul 26, 2010 4:57 pm > Subject: Re: 2 type of docs in same schema? > > > You can easily have different types of documents in 1 core: > > 1. define searchquery as a field (just as the others in your schema) > 2. define type as a field (this allows you to decide which type of > documents > to search for, e.g.: "type_normal" or "type_search") > > now searching on regular docs becomes: > q=title:some+title&fq=type:type_normal > > and searching for searchqueries becomes (I think this is what you want): > q=searchquery:bmw+car&fq=type:type_search > > Geert-Jan > > 2010/7/26 > > > > I need your expertise on this one... > > > > We would like to index every search query that is passed to our solr > engine > > (same core) > > > > Our docs format is like this (already in our schema): > > title > > content > > price > > category > > etc... > > > > Now how do we add "search queries" as a field in our schema? Note that the > > search queries won't have all the fields above. > > For example: > > q=bmw car > > q=car wheels > > q=moto honda > > etc... > > > > Should we run another core that only indexes search queries? or is there a > > way to do this with the same instance and same core? > > > > Thanks for your help
Re:Re: How to speed up solr search speed
Isn't it always one of these three? (from most likely to least likely, generally) Memory, Disk Speed, WebServer and its code, CPU. Memory and Disk are related, as swapping occurs between them. As long as memory is high enough, it becomes: Disk Speed, WebServer and its code, CPU. If the WebServer is configured to be as fast as is possible, THEN the CPU comes into play. So normally: 1/ Put enough memory in so it doesn't swap 2/ Buy the fastest damn disk/diskArrays/SolidState/HyperDrive RamDisk/RAIDed HyperDrive RamDisk that you can afford. 3/ Tune your webserver code. 1 GOOD *LAPTOP* with 8-16 gig of ram (with a 64bit OS), and a single, external SATA HyperDrive 64Gig RamDrive is SCREAMING, way beyond most single server boxes you'll pay to get hosting on. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Fri, 7/16/10, marship wrote: > From: marship > Subject: Re:Re: How to speed up solr search speed > To: solr-user@lucene.apache.org > Date: Friday, July 16, 2010, 11:26 AM > Hi. Peter. > > Thanks for replying. > > > >Hi Scott! > > > >> I am aware these cores on same server are > interfering with each other. > > > >Thats not good. Try to use only one core per CPU. With > more per CPU you > >won't have any benefits over the single-core version, I > think. > > I only have 2 servers, each CPU with 8 cores. Each server > has 6G memory. So I have 16 CPU cores in total. But I have 70 > solr cores so I have to use them on my 2 servers. Based on > my observation, even when the search is processing, the CPU > usage is not high. The memory usage is not high either. Each > solr (jetty) instance only consumes 40M-60M memory. My server > always has 2-3G memory available. > > > >> can solr use more memory to avoid disk operation > conflicts? > > > >Yes, only the memory you have on the machine of course. > Are you using > >tomcat or jetty? > > > > I am using jetty. > >> For my case, I don't think solr can work as fast > as 100-200ms on average. > > > >We have indices with a lot of entries, not as large as > yours, but in the > >range of X million, and have response times under > 100ms. > >What about testing only one core with 5-10 Mio docs? If > the response > >time isn't any better maybe you need a different field > config or sth. > >different is wrong? > > For the moment, I really don't know. I tried to use java > -server -jar start.jar to start jetty/solr. I saw when solr > starts, sometimes some core's search for a simple keyword like > "design" will take 70s; of course some only take 0-15ms. > From my aspect, I do believe it is the harddisk accessed by > these cores delaying each other. So finally some cores fall > behind. But the bad news for me is the solr distributed > search's speed is decided by the slowest one. > > > > > >> So should I add it or is the default (without it) > ok? > > > >Without is also okay -> solr uses the default. > >With 75 Mio docs it should be around 20 000 but I guess > there is sth. > >different wrong: maybe caching or field definition. > Could you post the > >latter one? > > > > Sorry. What are you asking me to post? > > > > > >Regards, > >Peter. > > > >> Hi. Peter. > >> I think I am not using faceting, highlighting ... > I read about them > >> but don't know how to work with them. I am using > the default "example", > >> just changed the indexed fields. > >> For my case, I don't think solr can work as fast > as 100-200ms on > >> average. I tried some keywords on only a single solr > instance. It > >> sometimes takes more than 20s. I just input 4 > keywords. I agree it is > >> keyword concerns. But the issue is it doesn't work > consistently. > >> > >> When 37 instances on the same server work at the same > time (when a > >> distributed search starts), it goes worse. I saw > some solr cores > >> execute very fast, 0ms, ~40ms, ~200ms. But more > solr cores executed at > >> ~2500ms, ~3500ms, ~6700ms, and about 5-10 solr > cores need more than > >> 17s. I have 70 cores running. And the search speed > depends on the > >> SLOWEST one. Even if 69 cores can run at 1ms but the > last one needs 50s, then > >> the distributed search speed is 50s. > >> I am aware these cores on the same server are > interfering with each other. > >> As I have lots of free memory, I want to know, > with that prerequisite, > >> can solr use more memory to avoid disk operation > conflicts? > >> > >> Thanks. > >> Regards. > >> Scott > >> > >> On 2010-07-15 17:19:57, "Peter Karich" wrote: > >>> How do your queries look like? Do you use > faceting, highlighting, ... ? > >>> Did you try to customize the cache? > >>> Setting the HashDocSet to "0.005 of all > documents" improves our > >>> search speed a lot. > >>> Did you optimize the index? > >>> > >>> 500ms seems to be slow for an 'average' > search. I am not an expert > >>> but without highlighting it should be faster > than 100ms or at least 200ms > >>> > >>> Regards,
Re: how to Protect data
If it's not the data that's being searched, you can always encode it before inserting it. You then have to either further encode it to base64 to make it printable before storing it, OR use a binary field. You probably could also set up an external process that cycles through every document in the index, encodes the fields in question and reinserts the document. The time and horsepower to do that might be better spent regenerating the index from scratch with the newly encoded documents. You might even be able to modify something in Solr/Lucene to do the encoding automatically using Java. Java must have encryption libraries like most other languages. I don't know solr/lucene well enough to say, but the data that's in the searchable columns must be visible as well, in some manner. I don't know how understandable it is after being tokenized. Someone else would have to comment on that. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Sun, 7/25/10, Girish Pandit wrote: > From: Girish Pandit > Subject: how to Protect data > To: solr-user@lucene.apache.org > Date: Sunday, July 25, 2010, 5:12 PM > Hi, > > I was asked about protecting data, meaning that the > search index data is stored in some indexed files, and > when you open those indexed files you can clearly see the > data, some texts, e.g. name, address, postal code > etc. > > Is there any way I can hide the data? Some kind of > data encoding so that no raw text data is visible at all. > > -Girish
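A minimal sketch of the encrypt-then-base64 idea, assuming Apache Commons Codec is on the classpath (the class, key handling and field use are illustrative; note this only protects stored values, since anything indexed for search still ends up as readable tokens):

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;

public class FieldCrypt {
  // key must be 16 bytes for AES-128
  public static String encrypt(String plain, byte[] key) throws Exception {
    Cipher c = Cipher.getInstance("AES");
    c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
    byte[] enc = c.doFinal(plain.getBytes("UTF-8"));
    // base64 makes the ciphertext printable, safe for a plain string field
    return new String(Base64.encodeBase64(enc), "UTF-8");
  }
}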
Re: 2 type of docs in same schema?
Thanks for your answer! That's great. Now, to index search-query data, is there something special to do? Or does it stay as usual? -Original Message- From: Geert-Jan Brits To: solr-user@lucene.apache.org Sent: Mon, Jul 26, 2010 4:57 pm Subject: Re: 2 type of docs in same schema? You can easily have different types of documents in 1 core: 1. define searchquery as a field (just as the others in your schema) 2. define type as a field (this allows you to decide which type of documents to search for, e.g.: "type_normal" or "type_search") now searching on regular docs becomes: q=title:some+title&fq=type:type_normal and searching for searchqueries becomes (I think this is what you want): q=searchquery:bmw+car&fq=type:type_search Geert-Jan 2010/7/26 > > I need your expertise on this one... > > We would like to index every search query that is passed to our solr engine > (same core) > > Our docs format is like this (already in our schema): > title > content > price > category > etc... > > Now how do we add "search queries" as a field in our schema? Note that the > search queries won't have all the fields above. > For example: > q=bmw car > q=car wheels > q=moto honda > etc... > > Should we run another core that only indexes search queries? or is there a > way to do this with the same instance and same core? > > Thanks for your help
Updating fields in Solr
Hi, I have a requirement where I need to keep updating certain fields in the schema. My requirement is to change some of the fields or add some values to a field (multi-value field). I understand that I can use Solr update for this. If I am using Solr update, do I need to publish the entire document again or do I just need to publish the updated fields? Again, in case of update, can I add (append) new values to the existing fields? In my document most of the parts remain unchanged; however, a few fields keep changing. Will it be costly to update the entire document just to change a field? I was wondering if I should create two solr cores, one for static content and another one for dynamic content. This way I can reduce the time taken to update a document, but it introduces the complexity of querying different cores and combining the result on the client side. Is there a way to configure solr so that the client can execute a single query and solr internally executes multiple queries across different cores and returns a single result?
Re: Solr Doc Lucene Doc !?
I want to learn more about the technology. Is there a case where a SolrDoc is really created? Or is it in the code only for a better understanding of the lucene and solr border?
RE: slave index is bigger than master index
As far as I know this is not needed; the optimized index is automatically replicated to the slaves. Therefore something seems to be really wrong with your setup. Maybe the slave index got corrupted for some reason? Did you try deleting the data dir + slave restart for a fresh replicated index? Maybe worth a try.. good luck -Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Monday, July 26, 2010 16:54 To: solr-user@lucene.apache.org Subject: Re: slave index is bigger than master index did you try an optimize on the slave too? > Yes I always run an optimize whenever I index on master. In fact I > just ran an optimize command an hour ago, but it didn't make any difference. >
Re: slave index is bigger than master index
No I didn't. I thought you aren't supposed to run optimize on slaves. Well, it doesn't matter now, as I think it's fixed. I just added a dummy document on master, ran a commit call, and then once that executed ran an optimize call. This triggered snapshooter to replicate the index, which somehow resulted in a normal index size at the slaves. I still don't get what exactly happened there, and will be investigating this. If I do find anything interesting, I will update this mailing list. Thanks for all your input anyways, -Muneeb
Re: 2 type of docs in same schema?
You can easily have different types of documents in 1 core: 1. define searchquery as a field (just as the others in your schema) 2. define type as a field (this allows you to decide which type of documents to search for, e.g.: "type_normal" or "type_search") now searching on regular docs becomes: q=title:some+title&fq=type:type_normal and searching for searchqueries becomes (I think this is what you want): q=searchquery:bmw+car&fq=type:type_search Geert-Jan 2010/7/26 > > I need your expertise on this one... > > We would like to index every search query that is passed to our solr engine > (same core) > > Our docs format is like this (already in our schema): > title > content > price > category > etc... > > Now how do we add "search queries" as a field in our schema? Note that the > search queries won't have all the fields above. > For example: > q=bmw car > q=car wheels > q=moto honda > etc... > > Should we run another core that only indexes search queries? or is there a > way to do this with the same instance and same core? > > Thanks for your help
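For reference, a sketch of the matching schema.xml additions (the field types assume the stock example schema; names follow the example above):

<field name="type" type="string" indexed="true" stored="true" required="true"/>
<field name="searchquery" type="text" indexed="true" stored="true"/>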
Re: slave index is bigger than master index
did you try an optimize on the slave too? > Yes I always run an optimize whenever I index on master. In fact I just ran > an optimize command an hour ago, but it didn't make any difference. >
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, did you try to write a custom DIH function (http://wiki.apache.org/solr/DIHCustomFunctions)? If not, I think this would be a solution. Just check whether "${prog.vip}" is an empty string or null. If so, you need to replace it with a value that can never match anything, so the vip field will always be empty for such queries. Maybe that helps? Hopefully, the variable resolver is able to resolve something like ${dih.functions.getReplacementIfNeeded(prog.vip)}. Kind regards, - Mitch Chantal Ackermann wrote: > > Hi, > > my use case is the following: > > In a sub-entity I request rows from a database for an input list of > strings: > >/* multivalued, not required */ > query="select SSC_VALUE from SSC_VALUE > where SSC_ATTRIBUTE_ID=1 > and SSC_VALUE in (${prog.vip})"> > > > > > The root entity is "prog" and it has an optional multivalued field > called "vip". When the list of "vip" values is empty, the SQL for the > sub-entity above throws an SQLException. (Working with Oracle which does > not allow an empty expression in the "in"-clause.) > > Two things: > (A) best would be not to run the query whenever ${prog.vip} is null or > empty. > (B) From the documentation, it is not clear that onError is only checked > in the transformer runs but not checked when the SQL for the entity > throws an exception. (Trunk version JdbcDataSource lines 250pp). > > IMHO, (A) is the better fix, and if so, (B) is the right decision. (If > (A) is not easily fixable, making (B) work would be helpful.) > > Looking through the code, I've realized that the replacement of the > variables is done in a very generic way. I've not yet seen an > appropriate way to check on those variables in order to stop the > processing of the entity if the variable is empty. > Is there a way to do this? Or maybe there is a completely different way > to get my use case working. Any help most appreciated! > > Thanks, > Chantal
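For reference, a sketch of what such a function could look like, assuming the 1.4-era Evaluator API (the class name, package and dummy value below are invented for illustration). Registered in data-config.xml:

<function name="vipOrDummy" class="my.pkg.EmptySafeEvaluator"/>
...
and SSC_VALUE in (${dih.functions.vipOrDummy(prog.vip)})

And the class:

package my.pkg;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Evaluator;

public class EmptySafeEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    // Resolve the variable; if it is null or empty, return a value that
    // can never match, so the Oracle in-clause stays syntactically valid.
    Object val = context.getVariableResolver().resolve(expression);
    String s = (val == null) ? "" : val.toString().trim();
    return s.length() == 0 ? "'__NO_MATCH__'" : s;
  }
}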
2 type of docs in same schema?
I need your expertise on this one... We would like to index every search query that is passed to our solr engine (same core). Our docs format is like this (already in our schema): title content price category etc... Now how do we add "search queries" as a field in our schema? Note that the search queries won't have all the fields above. For example: q=bmw car q=car wheels q=moto honda etc... Should we run another core that only indexes search queries? Or is there a way to do this with the same instance and same core? Thanks for your help
Re: Solr Doc Lucene Doc !?
DataImportHandler (DIH) is an add-on to Solr. It lets you import documents from a number of sources in a flexible way. The only connection DIH has to Lucene is that Solr uses Lucene as the index engine. When you work with Solr you naturally talk about Solr Documents, if you were working with Lucene natively (without Solr) you would talk about Lucene documents, but they are basically the same thing. Are you having a specific issue? Or are you just trying to learn more about the technology? If you are mostly trying to understand DIH, then you should think in terms of Solr and Solr documents. Understand that Lucene is working behind the scenes, but you really don't need to worry about that all that often.
Re: Problem with parsing date
I have just fixed it. The problem was related to operating system values; they were different from what Solr expected for the incoming data stream. Regards, Rafal Zawadzki On Mon, Jul 26, 2010 at 3:20 PM, Chantal Ackermann < chantal.ackerm...@btelligent.de> wrote: > On Mon, 2010-07-26 at 14:46 +0200, Rafal Bluszcz Zawadzki wrote: > > EEE, d MMM yyyy HH:mm:ss z > > not sure but you might want to try with an uppercase 'Z' for the > timezone (surrounded by single quotes, alternatively). The rest of your > pattern looks fine. But if you still run into problems try different > versions, like putting the comma in quotes etc. > > Cheers, > Chantal
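For reference, a locale-pinned parse of the offending date string illustrates why an OS-level setting can break this (a sketch; DIH constructs its SimpleDateFormat internally, so this only demonstrates the effect of the default locale):

import java.text.SimpleDateFormat;
import java.util.Locale;

public class DateCheck {
  public static void main(String[] args) throws Exception {
    // parses fine regardless of the platform default locale
    SimpleDateFormat f = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss z", Locale.ENGLISH);
    System.out.println(f.parse("Wed, 15 Jul 2009 08:23:34 GMT"));
    // with, say, a German default locale, new SimpleDateFormat(pattern)
    // without an explicit Locale would reject "Wed" (it expects "Mi")
  }
}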
Re: slave index is bigger than master index
I just checked my config file, and I do have the exact same values for the deletionPolicy tag as you attached in your email, so I don't really think it could be this.
Re: AW: slave index is bigger than master index
Yes I always run an optimize whenever I index on master. In fact I just ran an optimize command an hour ago, but it didn't make any difference.
spell checking....
Hi all, I am new to solr and was able to implement indexing of documents by following the solr wiki. Now I am trying to add the spellchecking. I followed the spellcheck component in the wiki but am not getting the suggested spellings. I first built it with spellcheck.build=true,... Here I give you the example: http://localhost:8080/solr/spell?q=javs&spellcheck=true&spellcheck.collate=true Here the response should actually suggest "java" but didn't. Can anyone guide me about it? I am using solr 1.4, tomcat on ubuntu Regards, swarup
AW: slave index is bigger than master index
Hi, are you calling optimize on the master to finally remove deleted documents and merge the index files? Once a day is recommended: http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations cheers -Original Message- From: Muneeb Ali [mailto:muneeba...@hotmail.com] Sent: Monday, July 26, 2010 15:37 To: solr-user@lucene.apache.org Subject: slave index is bigger than master index Hi, I am using Solr 1.4, with a master-slave setup. We have one master server and two slave servers. It was all working fine, but lately the solr slaves are behaving strangely. Particularly during replication of the index, the slave nodes die and always need a restart. Also the index size of the slave nodes is much bigger (336GB) than the master node index (i.e. only 86GB). I am guessing that it's not removing previous indices at the slave nodes when replicating? Has anyone faced similar issues? Any help would be highly appreciated. Thanks very much. -Muneeb
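A sketch of issuing that optimize against the master (host and port assumed from the stock example setup):

curl 'http://localhost:8983/solr/update?optimize=true'

# or, equivalently, post the XML command:
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'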
Re: slave index is bigger than master index
Hi, I think that you may be using a Lucene/Solr IndexDeletionPolicy that does not remove old commits (and you aren't propagating solr-config via replication). You can configure this feature in solrconfig.xml inside the <mainIndex> tag:

<deletionPolicy class="solr.SolrDeletionPolicy">
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>

I hope this can be helpful. Cheers, Tommaso 2010/7/26 Muneeb Ali > > Hi, > > I am using Solr 1.4, with a master-slave setup. We have one master > server and two slave servers. It was all working fine, but lately the solr > slaves > are behaving strangely. Particularly during replication of the index, the slave > nodes die and always need a restart. Also the index size of the slave nodes is > much bigger (336GB) than the master node index (i.e. only 86GB). > > I am guessing that it's not removing previous indices at the slave nodes when > replicating? Has anyone faced similar issues? > > Any help would be highly appreciated. > > Thanks very much. > > -Muneeb
slave index is bigger than master index
Hi, I am using Solr 1.4, with a master-slave setup. We have one master server and two slave servers. It was all working fine, but lately the solr slaves are behaving strangely. Particularly during replication of the index, the slave nodes die and always need a restart. Also the index size of the slave nodes is much bigger (336GB) than the master node index (i.e. only 86GB). I am guessing that it's not removing previous indices at the slave nodes when replicating? Has anyone faced similar issues? Any help would be highly appreciated. Thanks very much. -Muneeb
Re: Problem with parsing date
On Mon, 2010-07-26 at 14:46 +0200, Rafal Bluszcz Zawadzki wrote: > EEE, d MMM yyyy HH:mm:ss z not sure but you might want to try with an uppercase 'Z' for the timezone (surrounded by single quotes, alternatively). The rest of your pattern looks fine. But if you still run into problems try different versions, like putting the comma in quotes etc. Cheers, Chantal
Similar search regarding a result document
Hi, I would like to implement a similar-search feature... but not relative to the initial search query but relative to each result document. The structure of each doc is: id title content price etc... Then we have a database of global search queries; I'm thinking of integrating this in solr. I'm planning to implement this as a query of a query... but first I would like to know if there is a built-in function in Solr for this? Thanks for your help.
Re: 2 solr dataImport requests on a single core at the same time
Btw, I want to put all the requestHandlers (more than one) in one XML file and use that in my solrconfig.xml. I have used XInclude but it didn't work. Please suggest anything. Thanks, Prasad
Re: 2 solr dataImport requests on a single core at the same time
Thank you very much.
Re: Problem with parsing date
I am also using other dateFormat strings, in the same data handler, and they work. But not this one. And this data is fetched from an external source, so I don't have the possibility to modify it (well, theoretically I can save it, edit it, etc., but that is not the way). Why is this not working with SOLR? On Mon, Jul 26, 2010 at 2:37 PM, Li Li wrote: > I use a format like yyyy-MM-ddThh:mm:ssZ. It works > > 2010/7/26 Rafal Bluszcz Zawadzki : > > Hi, > > > > I am using Data Import Handler from Solr 1.4. > > > > Parts of my data-config.xml are: > > > > > >processor="XPathEntityProcessor" > >stream="false" > >forEach="/multistatus/response" > >url="/tmp/file.xml" > > > > transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer" > >> > > . > > > > > xpath="/multistatus/response/propstat/prop/getlastmodified" > > dateTimeFormat="EEE, d MMM yyyy HH:mm:ss z" /> > > > xpath="/multistatus/response/propstat/prop/creationdate" > > dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/> > > > > During full-import I got the message: > > > > WARNING: Error creating document : > > SolrInputDocument[{SearchableText=SearchableText(1.0)={phrase}, > > parentPaths=parentPaths(1.0)={/site}, > > review_state=review_state(1.0)={published}, created=created(1.0)={Sat Oct > 11 > > 14:38:27 CEST 2003}, UID=UID(1.0)={http://www.example.com:80/File-1563}, > > Title=Title(1.0)={This is only an example document}, > > portal_type=portal_type(1.0)={Document}, modified=modified(1.0)={Wed, 15 > Jul > > 2009 08:23:34 GMT}}] > > org.apache.solr.common.SolrException: Invalid Date String:'Wed, 15 Jul > 2009 > > 08:23:34 GMT' > > at org.apache.solr.schema.DateField.parseMath(DateField.java:163) > > at > org.apache.solr.schema.TrieDateField.createField(TrieDateField.java:171) > > at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94) > > at > > > org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246) > > at > > > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60) > > > > Which, as I understand, means that Solr / Java couldn't parse my date. > > > > In my xml file it looks like: > > Wed, 15 Jul 2009 08:23:34 GMT > > > > In my opinion the format "EEE, d MMM yyyy HH:mm:ss z" is correct, and what's more > > important, it was supposed to work with the same data a week ago :) > > > > Any idea will be appreciated. > > > > -- > > Rafal Zawadzki > > Backend developer
Re: Problem with Pdf, Sol 1.4.1 Cell
Hi, I think there is an open bug for it at: https://issues.apache.org/jira/browse/SOLR-1902

Using Solr 1.4.1 and upgrading the Tika libraries to a 0.8 snapshot, I also had to upgrade pdfbox, fontbox and jempbox to 1.2.1; I got no errors and it seems able to index PDFs (I can query them by id:doc1, for example), but it did not extract text or other metadata from them. Building a new Solr distribution from trunk (ant dist) and using the Tika 0.8 snapshot (with pdfbox, fontbox and jempbox 1.2.1), it seems to be working. My 2 cents, Tommaso

2010/7/23 Alessandro Benedetti
> Hi all,
> as I saw in this discussion [1], there were many issues with PDF indexing in
> Solr 1.4 due to the Tika library (version 0.4).
> In Solr 1.4.1 the Tika library is the same, so I guess the issues are the same.
> Could anyone who contributed to the previous thread help me in resolving
> these issues?
> I need a simple tutorial that could help me to upgrade Solr Cell!
>
> Something like this:
> 1) download tika core from trunk
> 2) create jar with maven dependencies
> 3) unjar Solr 1.4.1 and change the tika library
> 4) jar the patched Solr 1.4.1 and enjoy!
>
> [1]
> http://markmail.org/message/zbkplnzqho7mxwy3#query:+page:1+mid:gamcxdx34ayt6ccg+state:results
>
> Best regards
>
> --
> Benedetti Alessandro
> Personal Page: http://tigerbolt.altervista.org
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience - 1794 England
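When PDFs index "successfully" but yield no text, it can help to run the exact jar combination through Tika directly, outside Solr. A minimal sketch against the Tika 0.8 API (the PDF path is a placeholder):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaPdfCheck {
        public static void main(String[] args) throws Exception {
            InputStream in = new FileInputStream("/tmp/test.pdf"); // adjust path
            BodyContentHandler text = new BodyContentHandler();
            Metadata meta = new Metadata();
            // Same extraction path Solr Cell uses: detect type, parse, collect text.
            new AutoDetectParser().parse(in, text, meta, new ParseContext());
            System.out.println("title: " + meta.get(Metadata.TITLE));
            System.out.println(text.toString());
        }
    }

If nothing prints here either, the problem is in the Tika/PDFBox jar combination rather than in the Solr integration.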
Re: Problem with parsing date
I use a format like yyyy-MM-dd'T'hh:mm:ss'Z'. It works.

2010/7/26 Rafal Bluszcz Zawadzki :
> Hi,
>
> I am using the Data Import Handler from Solr 1.4.
>
> Parts of my data-config.xml are:
>
> <entity processor="XPathEntityProcessor"
>         stream="false"
>         forEach="/multistatus/response"
>         url="/tmp/file.xml"
>         transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
> ...
> <field column="modified"
>        xpath="/multistatus/response/propstat/prop/getlastmodified"
>        dateTimeFormat="EEE, d MMM yyyy HH:mm:ss z" />
> <field column="created"
>        xpath="/multistatus/response/propstat/prop/creationdate"
>        dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
>
> During full-import I got this message:
>
> WARNING: Error creating document :
> SolrInputDocument[{SearchableText=SearchableText(1.0)={phrase},
> parentPaths=parentPaths(1.0)={/site},
> review_state=review_state(1.0)={published}, created=created(1.0)={Sat Oct 11
> 14:38:27 CEST 2003}, UID=UID(1.0)={http://www.example.com:80/File-1563},
> Title=Title(1.0)={This is only an example document},
> portal_type=portal_type(1.0)={Document}, modified=modified(1.0)={Wed, 15 Jul
> 2009 08:23:34 GMT}}]
> org.apache.solr.common.SolrException: Invalid Date String:'Wed, 15 Jul 2009
> 08:23:34 GMT'
> at org.apache.solr.schema.DateField.parseMath(DateField.java:163)
> at org.apache.solr.schema.TrieDateField.createField(TrieDateField.java:171)
> at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
> at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
> at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
>
> Which, as I understand it, means that Solr / Java couldn't parse my date.
>
> In my xml file it looks like:
> <getlastmodified>Wed, 15 Jul 2009 08:23:34 GMT</getlastmodified>
>
> In my opinion the format "EEE, d MMM yyyy HH:mm:ss z" is correct and, more
> importantly, it worked with the same data a week ago :)
>
> Any idea will be appreciated.
>
> --
> Rafal Zawadzki
> Backend developer
Problem with parsing date
Hi,

I am using the Data Import Handler from Solr 1.4.

Parts of my data-config.xml are:

<entity processor="XPathEntityProcessor"
        stream="false"
        forEach="/multistatus/response"
        url="/tmp/file.xml"
        transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
...
<field column="modified"
       xpath="/multistatus/response/propstat/prop/getlastmodified"
       dateTimeFormat="EEE, d MMM yyyy HH:mm:ss z" />
<field column="created"
       xpath="/multistatus/response/propstat/prop/creationdate"
       dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>

During full-import I got this message:

WARNING: Error creating document :
SolrInputDocument[{SearchableText=SearchableText(1.0)={phrase},
parentPaths=parentPaths(1.0)={/site},
review_state=review_state(1.0)={published}, created=created(1.0)={Sat Oct 11
14:38:27 CEST 2003}, UID=UID(1.0)={http://www.example.com:80/File-1563},
Title=Title(1.0)={This is only an example document},
portal_type=portal_type(1.0)={Document}, modified=modified(1.0)={Wed, 15 Jul
2009 08:23:34 GMT}}]
org.apache.solr.common.SolrException: Invalid Date String:'Wed, 15 Jul 2009
08:23:34 GMT'
at org.apache.solr.schema.DateField.parseMath(DateField.java:163)
at org.apache.solr.schema.TrieDateField.createField(TrieDateField.java:171)
at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)

Which, as I understand it, means that Solr / Java couldn't parse my date.

In my xml file it looks like:
<getlastmodified>Wed, 15 Jul 2009 08:23:34 GMT</getlastmodified>

In my opinion the format "EEE, d MMM yyyy HH:mm:ss z" is correct and, more importantly, it worked with the same data a week ago :)

Any idea will be appreciated.

--
Rafal Zawadzki
Backend developer
Can't find org.apache.solr.client.solrj.embedded
Hello experts, where is the jar containing org.apache.solr.client.solrj.embedded? I can't find this package in 'apache-solr-solrj-1.4.[01].jar'. I also can't find any sources other than http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/webapp/src/org/apache/solr/client/solrj/embedded/ , which does not fit Solr 1.4. Any tips for a blind newbie? Uwe
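The embedded classes depend on Solr core, so, if memory serves, they ship in apache-solr-core-1.4.x.jar rather than in the solrj jar; a quick 'jar tf apache-solr-core-1.4.1.jar | grep embedded' should confirm. With that jar (and its dependencies) on the classpath, a minimal sketch of standing up an embedded server, assuming a standard solr home at the given path:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedExample {
        public static void main(String[] args) throws Exception {
            System.setProperty("solr.solr.home", "/path/to/solr/home"); // adjust
            // 1.4-style bootstrap: loads the core(s) from the solr home.
            CoreContainer.Initializer init = new CoreContainer.Initializer();
            CoreContainer container = init.initialize();
            EmbeddedSolrServer server = new EmbeddedSolrServer(container, "");
            System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound()
                    + " docs");
            container.shutdown();
        }
    }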
Re: Solr Doc Lucene Doc !?
... but in the code the talk is of SolrDocuments. These are higher-level docs, used to construct the Lucene doc that gets indexed!? And the wiki talks about "Build Solr documents by aggregating data from multiple columns and tables according to configuration": http://wiki.apache.org/solr/DataImportHandler?highlight=(dih) So it's a little bit confusing.
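The two statements fit together. Every update path, DIH included, first builds SolrInputDocuments, and Solr converts each one into a Lucene Document just before indexing; the conversion point is visible in stack traces elsewhere on this list, org.apache.solr.update.DocumentBuilder.toDocument. A minimal SolrJ sketch of the client-visible half (field names are placeholders):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrDocDemo {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // What DIH builds internally: a schema-aware SolrInputDocument...
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "hello");
            // ...which Solr turns into an org.apache.lucene.document.Document
            // (DocumentBuilder.toDocument) when the add is processed.
            solr.add(doc);
            solr.commit();
        }
    }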
DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi, my use case is the following: in a sub-entity I request rows from a database for an input list of strings, with a query of the form

select ... from ... where vip in (${prog.vip})

The root entity is "prog" and it has an optional (multivalued, not required) field called "vip". When the list of "vip" values is empty, the SQL for the sub-entity above throws an SQLException. (Working with Oracle, which does not allow an empty expression in the "in" clause.)

Two things: (A) best would be not to run the query whenever ${prog.vip} is null or empty. (B) From the documentation, it is not clear that onError is only checked while the transformers run, but not when the SQL for the entity throws an exception (trunk version, JdbcDataSource lines 250pp). IMHO (A) is the better fix, and if so, (B) is the right decision. (If (A) is not easily fixable, making (B) work would be helpful.)

Looking through the code, I've realized that the replacement of the variables is done in a very generic way. I've not yet seen an appropriate way to check those variables in order to stop the processing of the entity if the variable is empty. Is there a way to do this? Or maybe there is a completely different way to get my use case working. Any help most appreciated! Thanks, Chantal
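One possible workaround, sketched untested against the 1.4 DIH API: subclass SqlEntityProcessor and skip the sub-entity when the parent variable is empty. The variable name prog.vip is taken from above; how the resolver represents an empty list is an assumption here and may need adjusting:

    import java.util.List;
    import java.util.Map;
    import org.apache.solr.handler.dataimport.SqlEntityProcessor;

    public class SkipOnEmptyVipProcessor extends SqlEntityProcessor {
        @Override
        public Map<String, Object> nextRow() {
            // Resolve the parent entity's variable before any SQL runs.
            Object vip = context.getVariableResolver().resolve("prog.vip");
            boolean empty = vip == null
                    || (vip instanceof List && ((List<?>) vip).isEmpty())
                    || vip.toString().trim().length() == 0;
            if (empty) {
                return null; // null tells DIH this entity has no rows
            }
            return super.nextRow();
        }
    }

The sub-entity would then reference it with processor="SkipOnEmptyVipProcessor" (fully qualified class name).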
Re: Solr Doc Lucene Doc !?
Stockii, Solr's index is a Lucene index. Therefore, Solr documents are Lucene documents. Kind regards, - Mitch
Solr Doc Lucene Doc !?
Hello. I'm writing a little text about SOLR and LUCENE and their use of the DIH. What documents does DIH create and insert? The wiki talks about "Solr documents", but I thought that Solr uses Lucene for this, so that DIH creates Lucene documents, not Solr documents!? What exactly does DIH do? How can I easily find that out? How are documents managed internally in Solr? Is there a difference between Solr and Lucene docs? Can anyone give me a little overview of how DIH works? That would be great ;-) thx stockiii
Re: schema.xml
Hi, there are no required fields unless you mark specific fields as required. You can remove or add as many fields as you want. That is an example schema which shows how fields are configured.
Re: how to Protect data
Hi Girish, I am not aware of such a thing. But you could use a middleware to prevent certain fields from being retrieved, via the 'fl' parameter: http://wiki.apache.org/solr/CommonQueryParameters#fl

E.g. for your customers the query looks like q=hello&fl=title and for your admin the query looks like q=hello&fl=title,securedField (instead of a full-blown middleware you could try http://wiki.apache.org/solr/VelocityResponseWriter).

Another option is to store the data encrypted in a field which can be retrieved, AND additionally store it in cleartext in a second field which is only searchable but will not be returned.

Regards, Peter.

> Hi,
>
> I was asked about protecting data. The search index data is stored in
> index files, and when you open those files you can clearly see the
> data: some texts, e.g. name, address, postal code etc.
>
> Is there any way I can hide the data? Some kind of data encoding so
> that no raw text is visible at all?
>
> -Girish
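How such a middleware might enforce the field whitelist with SolrJ, as a minimal sketch (field names are placeholders, and this only helps if end users cannot reach Solr directly, since fl is otherwise client-controlled):

    import org.apache.solr.client.solrj.SolrQuery;

    public class FieldWhitelist {
        // Customer-facing query: only whitelisted fields come back.
        // The secured field stays indexed (searchable) but is never in fl.
        static SolrQuery customerQuery(String userInput) {
            SolrQuery q = new SolrQuery(userInput);
            q.setFields("title");
            return q;
        }

        // Admin query: the secured field is included.
        static SolrQuery adminQuery(String userInput) {
            SolrQuery q = new SolrQuery(userInput);
            q.setFields("title", "securedField");
            return q;
        }
    }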
Re: help with a schema design problem
Hi, I haven't read everything thoroughly, but have you considered creating a field for each of your (I think what you call) "party types"? So that you can query like "client:Pramod". You would then be able to facet on client and supplier.

Cheers, Chantal

On Fri, 2010-07-23 at 23:23 +0200, Geert-Jan Brits wrote:
> Multiple rows in the OP's example are combined to form 1 solr-document
> (e.g. rows 1 and 2 both have documentid=1).
> Because of this combining, it would match p_value from row 1 with p_type
> from row 2 (or vice versa).
>
> 2010/7/23 Nagelberg, Kallin
> > > When i search
> > > p_value:"Pramod" AND p_type:"Supplier"
> > > it would give me result as document 1. Which is incorrect, since in
> > > document 1 Pramod is a Client and not a Supplier.
> >
> > Would it? I would expect it to give you nothing.
> >
> > -Kal
> >
> > -----Original Message-----
> > From: Geert-Jan Brits [mailto:gbr...@gmail.com]
> > Sent: Friday, July 23, 2010 5:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: help with a schema design problem
> >
> > > Is there any way in solr to say p_value[someIndex]="pramod"
> > > and p_type[someIndex]="client"?
> > No, I'm 99% sure there is not.
> >
> > > One way would be to define a single field in the schema as
> > > p_value_type = "client pramod", i.e. combine the values from both
> > > fields and store them in a single field.
> > Yep, for the use case you mentioned that would definitely work.
> > Multivalued, of course, so it can contain "Supplier Raj" as well.
> >
> > 2010/7/23 Pramod Goyal
> > > In my case the document id is the unique key (each row is not a
> > > unique document). So a single document has multiple Party Values and
> > > Party Types. Hence I need to define both Party Value and Party Type
> > > as multi-valued. Is there any way in solr to say
> > > p_value[someIndex]="pramod" and p_type[someIndex]="client"?
> > > Is there any other way I can design my schema? I have some
> > > solutions, but none seems to be a good one. One way would be to
> > > define a single field in the schema as p_value_type = "client
> > > pramod", i.e. combine the values from both fields and store them in
> > > a single field.
> > >
> > > On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits wrote:
> > > > With the use case you specified, it should work to index each
> > > > "row" from your initial post as a separate document.
> > > > This way p_value and p_type are both single-valued and you get a
> > > > correct combination of p_value and p_type.
> > > >
> > > > However, this may not go so well with other use cases you have in
> > > > mind, e.g. requiring that no multiple results are returned with
> > > > the same document id.
> > > >
> > > > 2010/7/23 Pramod Goyal
> > > > > I want to do that. But if I understand correctly, solr would
> > > > > store the fields like this:
> > > > >
> > > > > p_value: "Pramod" "Raj"
> > > > > p_type: "Client" "Supplier"
> > > > >
> > > > > When I search
> > > > > p_value:"Pramod" AND p_type:"Supplier"
> > > > > it would give me result as document 1. Which is incorrect, since
> > > > > in document 1 Pramod is a Client and not a Supplier.
> > > > >
> > > > > On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> > > > > knagelb...@globeandmail.com> wrote:
> > > > > > I think you just want something like:
> > > > > >
> > > > > > p_value:"Pramod" AND p_type:"Supplier"
> > > > > >
> > > > > > no?
> > > > > > -Kallin Nagelberg
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > > > > > Sent: Friday, July 23, 2010 2:17 PM
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: help with a schema design problem
> > > > > >
> > > > > > Hi,
> > > > > > Let's say I have a table with 3 columns: document id, Party
> > > > > > Value and Party Type. In this table I have 3 rows. 1st row:
> > > > > > Document id 1, Party Value "Pramod", Party Type "Client". 2nd
> > > > > > row: Document id 1, Party Value "Raj", Party Type "Supplier".
> > > > > > 3rd row: Document id 2, Party Value "Pramod", Party Type
> > > > > > "Supplier". With this table, in SQL it is easy for me to find
> > > > > > all documents with Party Value "Pramod" and Party Type
> > > > > > "Client".
> > > > > >
> > > > > > I need to design a solr schema so that I can do the same in
> > > > > > Solr. If I create 2 fields in the solr schema, Party Value and
> > > > > > Party Type, both of them multi-valued, and query
> > > > > > +Pramod +Supplier, then solr will return me the first
> > > > > > document, even though in the first document Pramod is a client
> > > > > > and not a supplier.
> > > > > > Thanks,
> > > > > > Pramod Goyal
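A minimal SolrJ sketch of the combined-field workaround discussed above. The field name p_value_type comes from the thread; that schema.xml declares it as a multivalued text field is an assumption needed for the phrase query to behave as described:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CombinedFieldDemo {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            // Pair type and value inside one multivalued field so they stay aligned.
            doc.addField("p_value_type", "client pramod");
            doc.addField("p_value_type", "supplier raj");
            solr.add(doc);
            solr.commit();
            // The phrase query keeps the pairing: this prints 0, because in
            // document 1 Pramod is a client, not a supplier.
            System.out.println(solr.query(
                    new SolrQuery("p_value_type:\"supplier pramod\"")).getResults().getNumFound());
        }
    }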
Integration Problem
Hi everybody, I have been working with Solr for a while and have integrated it with Liferay 6.0.3, so every search request from Liferay is processed by Solr and its index. But I have to integrate another system, which offers me a webservice. The results of this webservice should appear among Solr's results without being in Solr's index. I tried to do that with a custom query handler and a custom response writer, and I am able to write into Solr's response message, but only into the response node of the XML message, not into the results node. So is there any solution for writing into the results node of Solr's XML message? Thanks in advance, best regards, joerg
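As far as I know, the <result> node proper can only hold documents that live in the index (the response writer resolves Lucene docids), so external hits are normally appended as a sibling section instead. One way to do that is a SearchComponent added after the query component. A sketch against the 1.4 API, untested; fetchFromWebService() is a hypothetical stand-in for the real webservice call, and the SolrInfoMBean methods at the bottom may differ in other versions:

    import java.io.IOException;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    public class WebServiceResultsComponent extends SearchComponent {
        @Override
        public void prepare(ResponseBuilder rb) throws IOException {}

        @Override
        public void process(ResponseBuilder rb) throws IOException {
            // Query the external system with the same q the user sent to Solr
            // and append the hits next to (not inside) the normal results node.
            SolrDocumentList external = fetchFromWebService(rb.req.getParams().get("q"));
            rb.rsp.add("external", external);
        }

        // Hypothetical webservice client; returns a stub document here.
        private SolrDocumentList fetchFromWebService(String q) {
            SolrDocumentList list = new SolrDocumentList();
            SolrDocument d = new SolrDocument();
            d.setField("title", "external hit for: " + q);
            list.add(d);
            return list;
        }

        @Override public String getDescription() { return "appends webservice results"; }
        @Override public String getSourceId() { return ""; }
        @Override public String getSource() { return ""; }
        @Override public String getVersion() { return "1.0"; }
    }

The component is registered in solrconfig.xml as a searchComponent and listed in the handler's components (or last-components) section; clients then read the extra "external" section of the response alongside the regular results.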