Re: useFastVectorHighlighter doesn't work
I had a schema defined as <field name="text" type="text" indexed="true" stored="false" termVectors="true" termPositions="true" termOffsets="true"/> You need to mark your text field as stored="true" to use hl.useFastVectorHighlighter=true.
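For reference, a minimal sketch of a working setup (the field name and the query below are only illustrative, not taken from the original schema):

<field name="text" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

http://localhost:8983/solr/select?q=text:apache&hl=true&hl.fl=text&hl.useFastVectorHighlighter=true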
Which time consuming processes are executed during Solr startup?
For example we know that cache warming is executed during startup. Are any other processes executed during Solr startup? Thank you, Ivan
Re: Is there any relationship between size of index and indexing performance?
Hi Ivan, It depends on the number of terms it has to load. If you index a small amount of data but store a large amount of data, then your index size may be big even though the actual number of terms is small. It is not directly proportional. Regards Aditya www.findbestopensource.com On Mon, May 28, 2012 at 3:00 PM, Ivan Hrytsyuk ihryts...@softserveinc.com wrote: Let's assume we are indexing 1GB of data. Does size of index have any impact on indexing performance? I.e. will we have any difference in case of empty index vs 50 GB index? Thank you, Ivan
boost date for fresh result query
Hi, I am facing a problem boosting on a date field. I have the following field in my schema: <field name="release_date" type="date" indexed="true" stored="true"/> (Solr version 3.4). I don't want to sort by date, but I want to give a 50 to 60% boost to those results which have the latest date... The following is the query: http://localhost:8083/solr/movie/select/?defType=dismax&q=titanic&fq=title,tags&qf=tags^2.0,title^1.0&mm=1&bf=recip(rord(release_date),1,1000,1000) but it seems to have no effect, because all the old-date results are still on top, or else I am unable to get the query syntax right. Is there anything missing in the query? Please help me make this work. thanks regards Jonty
Re: Is there any relationship between size of index and indexing performance?
Indexing performance is mostly about the number of docs, but when you are optimizing, a large index takes a bit more time. On Mon, May 28, 2012 at 12:48 PM, Aditya findbestopensou...@gmail.com wrote: Hi Ivan, It depends on number of terms it has to load. If you index less amount of data but store large amount of data then your index size may be big but actual terms may be less. It is not directly proportional. Regards Aditya www.findbestopensource.com On Mon, May 28, 2012 at 3:00 PM, Ivan Hrytsyuk ihryts...@softserveinc.com wrote: Let's assume we are indexing 1GB of data. Does size of index have any impact on indexing performance? I.e. will we have any difference in case of empty index vs 50 GB index? Thank you, Ivan -- Bilal Dadanlar fizy.com | Software Engineer
Re: UpdateRequestProcessor : flattened values
On Sun, May 27, 2012 at 11:54:02PM -0400, Jack Krupansky wrote: You can create your own update processor that gets control between the output of Tika and the indexing of the document. See: http://wiki.apache.org/solr/UpdateRequestProcessor Seems to be exactly what I was looking for, thanks a lot! I just started an (almost working) implementation, but I have one note. Let's get a field's valueS: Collection v = doc.getFieldValues("author"); (in my `processAdd(AddUpdateCommand cmd)`) and push a doc, say using: `curl -F content=@my.pdf -F literal.author=a -F literal.author=b -F literal.author=c d` Then `log.warn("author: " + v + " : " + v.size());` prints: WARN: author: [pdfauthor, a b c d] : 2 It's not (yet) a blocker in my personal case, but I fear it's important enough to be noted: using a custom UpdateRequestProcessor, the access to individual literal fields seems (currently) very limited, as they appear to be flattened. I'm quite sure there is already a hidden bug report about this somewhere. Other than that, and unless I hit some other unexpected issue, this way of customizing the request processor perfectly suits my needs. thanks!
indexing unstructured text (tweets)
Hi all. I am in the process of setting up Solr for my application, which is full-text search on a bunch of tweets from Twitter. I am afraid I am missing something. From the book I am reading, Apache Solr 3 Enterprise Search Server, it looks like Solr works with structured input, like XML or CSV, while I have the most wild and unstructured input ever (tweets). A section named Indexing documents with Solr Cell seems to address my problem, but also shows that before getting to Solr, I might need to use another Apache tool called Tika. Can anybody provide a brief explanation of the general picture? Can I index my tweets with Solr? Or do I also need to put Tika in my pipeline? Best regards, Giovanni Gherdovich
Re: indexing unstructured text (tweets)
Hi, You want to use Tika, if you have your data in some binary format, like pdf or excel. It extracts text from the binary for you. If you just want to index the text contents of tweets (including web links etc), using just off-the-shelf Solr is enough. You'll have to wrap your text input (per each tweet I would assume) into an xml or other supported structured format in accordance with the schema that you have defined. So at minimum, you would have two fields: a unique id of a document and its textual contents (a tweet). So design your schema first, create (e.g.) xml with the documents to add and post them onto SOLR. Dmitry On Mon, May 28, 2012 at 2:37 PM, Giovanni Gherdovich g.gherdov...@gmail.com wrote: Hi all. I am in the process of setting up Solr for my application, which is full text search on a bunch of tweets from twitter. I am afraid I am missing something. From the books I am reading, Apache Solr 3 Enterprise Search Server, it looks like Solr works with structured input, like XML or CVS, while I have the most wild and unstructured input ever (tweets). A section named Indexing documents with Solr Cell seems to address my problem, but also shows that before getting to Solr, I might need to use another Apache tool called Tika. Can anybody provide a brief explaination about the general picture? Can I index my tweets with Solr? Or do I need to put also Tika in my pipeline? Best regards, Giovanni Gherdovich -- Regards, Dmitry Kan
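As a small illustration of this approach (the field names, values, and the file name tweets.xml below are made up, not from an actual schema), each tweet becomes one document in Solr's XML update format, which can then be posted to the update handler with curl:

<add>
  <doc>
    <field name="id">tweet-0001</field>
    <field name="text">I bought some apples</field>
  </doc>
</add>

curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary @tweets.xml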
Re: indexing unstructured text (tweets)
Hey, I think you might be over-thinking this. Tweets are structured. You have the content (tweet), the user who tweeted it and various other meta data. So your 'document' might look like this:
<add>
  <doc>
    <field name="tweetId">ABCD1234</field>
    <field name="tweet">I bought some apples</field>
    <field name="user">JohnnyBoy</field>
  </doc>
</add>
To get this structure, you can use any programming language you're comfortable with and load it into Solr via various means. Obviously you can add more 'meta' fields that you get from twitter if you want as well. David On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote: Hi all. I am in the process of setting up Solr for my application, which is full text search on a bunch of tweets from twitter. I am afraid I am missing something. From the books I am reading, Apache Solr 3 Enterprise Search Server, it looks like Solr works with structured input, like XML or CVS, while I have the most wild and unstructured input ever (tweets). A section named Indexing documents with Solr Cell seems to address my problem, but also shows that before getting to Solr, I might need to use another Apache tool called Tika. Can anybody provide a brief explaination about the general picture? Can I index my tweets with Solr? Or do I need to put also Tika in my pipeline? Best regards, Giovanni Gherdovich
Re: indexing unstructured text (tweets)
Hello Dmitry and David, 2012/5/28 Dmitry Kan dmitry@gmail.com: [...] If you just want to index the text contents of tweets (including web links etc), using just off-the-shelf Solr is enough. You'll have to wrap your text input (per each tweet I would assume) into an xml [...] So design your schema first, create (e.g.) xml with the documents to add and post them onto SOLR. 2012/5/28 David Radunz da...@boxen.net: Hey, I think you might be over-thinking this. [...] So your 'document', might look like this: [...] Thank you for your feedback. I'll take it easy and do as you suggest. Cheers, Giovanni
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
It is a single node. I am trying to find out if the performance can be referenced. Regarding information on Solr with RankingAlgorithm, you can find all the information here: http://solr-ra.tgels.org On RankingAlgorithm: http://rankingalgorithm.tgels.org Regards, - NN On 5/27/2012 4:50 PM, Li Li wrote: yes, I am also interested in good performance with 2 billion docs. how many search nodes do you use? what's the average response time and qps ? another question: where can I find related paper or resources of your algorithm which explains the algorithm in detail? why it's better than google site(better than lucene is not very interested because lucene is not originally designed to provide search function like google)? On Mon, May 28, 2012 at 1:06 AM, Darren Govonidar...@ontrenet.com wrote: I think people on this list would be more interested in your approach to scaling 2 billion documents than modifying solr/lucene scoring (which is already top notch). So given that, can you share any references or otherwise substantiate good performance with 2 billion documents? Thanks. On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote: Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion docs. With RankingAlgorithm 1.4.3, using the parameters age=latestdocs=number feature, you can retrieve the NRT inserted documents in milliseconds from such a huge index improving query and faceting performance and using very little resources ... Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and the NRT insert performance with Solr 4.0 is about 70,000 docs / sec. RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 5/27/2012 7:32 AM, Darren Govoni wrote: Hi, Have you tested this with a billion documents? Darren On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 3.6 with RankingAlgorithm 1.4.2. This NRT supports now works with both RankingAlgorithm and Lucene. The insert/update performance should be about 5000 docs in about 490 ms with the MbArtists Index. RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over the earlier releases, supports the entire Lucene Query Syntax, ± and/or boolean queries and can scale to more than a billion documents. You can get more information about NRT performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. MbArtists index is the example index used in the Solr 1.4 Enterprise Book
Negative value in numFound
Hi I have an index of size 1 TB, which I prepared by setting up a background script to index records. The index was fine for the last 2 days, and I have not disturbed the process. Suddenly, when I queried the index I got this response, where the value of numFound is negative. Can anyone say why/how this occurs, and also the cure?
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">*:*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="-2125433775" start="0"/>
</response>
Regards Senthil Kumar M R -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative value in numFound
Hi! Can you please show your hardware parameters, the version of Solr that you're using, and your schema.xml file? thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986408.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I don't recall anyone being able to get acceptable performance with a single index that large with solr/lucene. The conventional wisdom is that parallel searching across cores (or shards in SolrCloud) is the best way to handle index sizes in the illions. So its of great interest how you did. Anyone else gotten an index(es) with billions of documents to perform well? I'm greatly interested in how. On Mon, 2012-05-28 at 05:12 -0700, Nagendra Nagarajayya wrote: It is a single node. I am trying to find out if the performance can be referenced. Regarding information on Solr with RankingAlgorithm, you can find all the information here: http://solr-ra.tgels.org On RankingAlgorithm: http://rankingalgorithm.tgels.org Regards, - NN On 5/27/2012 4:50 PM, Li Li wrote: yes, I am also interested in good performance with 2 billion docs. how many search nodes do you use? what's the average response time and qps ? another question: where can I find related paper or resources of your algorithm which explains the algorithm in detail? why it's better than google site(better than lucene is not very interested because lucene is not originally designed to provide search function like google)? On Mon, May 28, 2012 at 1:06 AM, Darren Govonidar...@ontrenet.com wrote: I think people on this list would be more interested in your approach to scaling 2 billion documents than modifying solr/lucene scoring (which is already top notch). So given that, can you share any references or otherwise substantiate good performance with 2 billion documents? Thanks. On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote: Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion docs. With RankingAlgorithm 1.4.3, using the parameters age=latestdocs=number feature, you can retrieve the NRT inserted documents in milliseconds from such a huge index improving query and faceting performance and using very little resources ... Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and the NRT insert performance with Solr 4.0 is about 70,000 docs / sec. RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 5/27/2012 7:32 AM, Darren Govoni wrote: Hi, Have you tested this with a billion documents? Darren On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 3.6 with RankingAlgorithm 1.4.2. This NRT supports now works with both RankingAlgorithm and Lucene. The insert/update performance should be about 5000 docs in about 490 ms with the MbArtists Index. RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over the earlier releases, supports the entire Lucene Query Syntax, ± and/or boolean queries and can scale to more than a billion documents. You can get more information about NRT performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. MbArtists index is the example index used in the Solr 1.4 Enterprise Book
Re: Negative value in numFound
The details are below Solr : 3.5 Using a Schema file with 53 fields and 8 fields indexed among them. OS : CentOS 5.4 64 Bit Java : 1.6.0 64 Bit Apache Tomcat : 7.0.22 Intel(R) Xeon(R) CPU L5518 @ 2.13GHz (16 Processors) /dev/mapper/index 5.9T 1.9T 4.0T 33% /Index Had around 2 Billion Records, when I queried it last time (2 days back) Do I need to run the checkIndex tool? Regards Senthil Kumar M R -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986421.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: indexing unstructured text (tweets)
Other obvious metadata from the Twitter API to index would be hashtags, user mentions (both the user id/screen name and user name), date/time, urls mentioned (expanded if a URL shortener is used), and possibly coordinates for spatial search. You would have to add all these fields and values yourself in your Solr input document. Tika can't help you there. Although, I imagine quite a few people have already done this quite a few times before, so maybe somebody could contribute their Twitter Solr schema. Anybody? -- Jack Krupansky -Original Message- From: David Radunz Sent: Monday, May 28, 2012 8:00 AM To: solr-user@lucene.apache.org Subject: Re: indexing unstructured text (tweets) Hey, I think you might be over-thinking this. Tweets are structured. You have the content (tweet), the user who tweeted it and various other meta data. So your 'document', might look like this: add doc field name=tweetIdABCD1234/field field name=tweetI bought some apples/field field name=userJohnnyBoy/field /doc /add To get this structure, you can use any programming language your comfortable with and load it into Solr via various means. Obviously you can add more 'meta' fields that you get from twitter if you want as well. David On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote: Hi all. I am in the process of setting up Solr for my application, which is full text search on a bunch of tweets from twitter. I am afraid I am missing something. From the books I am reading, Apache Solr 3 Enterprise Search Server, it looks like Solr works with structured input, like XML or CVS, while I have the most wild and unstructured input ever (tweets). A section named Indexing documents with Solr Cell seems to address my problem, but also shows that before getting to Solr, I might need to use another Apache tool called Tika. Can anybody provide a brief explaination about the general picture? Can I index my tweets with Solr? Or do I need to put also Tika in my pipeline? Best regards, Giovanni Gherdovich
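As a rough, hypothetical starting point for such a schema (the field names and types below are only a guess at what a tweet schema might contain, not a contributed one), the metadata above could be modeled in schema.xml along these lines:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="tweet_text" type="text_general" indexed="true" stored="true"/>
<field name="user" type="string" indexed="true" stored="true"/>
<field name="created_at" type="date" indexed="true" stored="true"/>
<field name="hashtags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="mentions" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="urls" type="string" indexed="true" stored="true" multiValued="true"/>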
Re: Accent Characters
Hi, Jack. First of all thank you for your help. Well, I tried again and then I realized that my problem is not really with Solr. I ran this query against Solr after starting it up with the command java -jar start.jar: http://localhost:8983/solr/coreFR/spell?q=content:pr%C3%A9senta&spellcheck=true&spellcheck.collate=true&rows=0&spellcheck.count=10 It gives me this result:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">31</int>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="présenta">
        <int name="numFound">10</int>
        <int name="startOffset">8</int>
        <int name="endOffset">16</int>
        <arr name="suggestion">
          <str>présente</str>
          <str>présent</str>
          <str>présenté</str>
          <str>présents</str>
          <str>présentant</str>
          <str>présentera</str>
          <str>présentait</str>
          <str>présentes</str>
          <str>présenter</str>
          <str>présentée</str>
        </arr>
      </lst>
      <str name="collation">content:présente</str>
    </lst>
  </lst>
</response>
And I ran exactly the same query after deploying solr.war in Tomcat 7. Here is my result:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">16</int>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="présenta">
        <int name="numFound">10</int>
        <int name="startOffset">8</int>
        <int name="endOffset">16</int>
        <arr name="suggestion">
          <str>present</str>
          <str>prbsent</str>
          <str>presentant</str>
          <str>presentait</str>
          <str>puisent</str>
          <str>pasent</str>
          <str>pensent</str>
          <str>posent</str>
          <str>dresent</str>
          <str>resenti</str>
        </arr>
      </lst>
      <str name="collation">content:present</str>
    </lst>
  </lst>
</response>
As my application is running under Tomcat, it means that I have some issue with Tomcat, but the weird thing is that I already googled for a fix and found out that we have to set a parameter in the server.xml Tomcat config file:
<Connector port="5443" protocol="HTTP/1.1" connectionTimeout="2" redirectPort="8443" URIEncoding="UTF-8" />
But it's not working, as you can see. I'm feeling a little stupid because it doesn't look like a big problem. For sure people around the world are using Solr with accented queries running under Tomcat properly! Thank you -- View this message in context: http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3986423.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: indexing unstructured text (tweets)
This is a bit old but provides good information for schema design- http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php Found this link as well- https://gist.github.com/702360 The types of the field may depend on the search requirements. Regards, Anuj On Mon, May 28, 2012 at 7:21 PM, Jack Krupansky j...@basetechnology.comwrote: Other obvious metadata from the Twitter API to index would be hashtags, user mentions (both the user id/screen name and user name), date/time, urls mentioned (expanded if a URL shortener is used), and possibly coordinates for spatial search. You would have to add all these fields and values yourself in your Solr input document. Tika can't help you there. Although, I imagine quite a few people have already done this quite a few times before, so maybe somebody could contribute their Twitter Solr schema. Anybody? -- Jack Krupansky -Original Message- From: David Radunz Sent: Monday, May 28, 2012 8:00 AM To: solr-user@lucene.apache.org Subject: Re: indexing unstructured text (tweets) Hey, I think you might be over-thinking this. Tweets are structured. You have the content (tweet), the user who tweeted it and various other meta data. So your 'document', might look like this: add doc field name=tweetIdABCD1234/**field field name=tweetI bought some apples/field field name=userJohnnyBoy/field /doc /add To get this structure, you can use any programming language your comfortable with and load it into Solr via various means. Obviously you can add more 'meta' fields that you get from twitter if you want as well. David On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote: Hi all. I am in the process of setting up Solr for my application, which is full text search on a bunch of tweets from twitter. I am afraid I am missing something. From the books I am reading, Apache Solr 3 Enterprise Search Server, it looks like Solr works with structured input, like XML or CVS, while I have the most wild and unstructured input ever (tweets). A section named Indexing documents with Solr Cell seems to address my problem, but also shows that before getting to Solr, I might need to use another Apache tool called Tika. Can anybody provide a brief explaination about the general picture? Can I index my tweets with Solr? Or do I need to put also Tika in my pipeline? Best regards, Giovanni Gherdovich
Re: indexing unstructured text (tweets)
Hello Jack, hi all, 2012/5/28 Jack Krupansky j...@basetechnology.com: Other obvious metadata from the Twitter API to index would be hashtags, user mentions (both the user id/screen name and user name), date/time, urls mentioned (expanded if a URL shortener is used), and possibly coordinates for spatial search. You raise good points here. Just to understand better how it works in Solr: say that we have a tweet that makes use of a hashtag and mentions another user. I don't know how this would actually appear coming from the Twitter Streaming API, and I am assuming that at least the tweet itself (excluding date/time and such) is just raw text, like Hey @alex1987, thank you for telling me how cool is #rubyonrails So: in order to make Solr understand that here we have a mention of a user (@alex1987) and a hashtag (#rubyonrails), I have to format it myself, include that info in my own XML schema, and preprocess the tweet in order to get to
<add>
  <doc>
    <field name="tweetId">ABCD1234</field>
    <field name="tweet_text">Hey @alex1987, thank you for telling me how cool is #rubyonrails</field>
    <field name="user">happyRubyist</field>
    <field name="mentions">alex1987</field>
    <field name="hashtags">rubyonrails</field>
  </doc>
</add>
Correct? I have to preprocess and make those fields explicit if I want them to be indexed as metadata, right? I am asking since I am new to Solr. Although, I imagine quite a few people have already done this quite a few times before, so maybe somebody could contribute their Twitter Solr schema. Anybody? Oh that would be nice :-) Cheers, Giovanni
Re: indexing unstructured text (tweets)
The Twitter API extracts hash tag and user mentions for you, in addition to giving you the full raw text. You'll have to read up on the Twitter API. -- Jack Krupansky -Original Message- From: Giovanni Gherdovich Sent: Monday, May 28, 2012 10:09 AM To: solr-user@lucene.apache.org Subject: Re: indexing unstructured text (tweets) Hello Jack, hi all, 2012/5/28 Jack Krupansky j...@basetechnology.com: Other obvious metadata from the Twitter API to index would be hashtags, user mentions (both the user id/screen name and user name), date/time, urls mentioned (expanded if a URL shortener is used), and possibly coordinates for spatial search. You rise good points here. Just to understand better how it works in Solr: say that we have a tweet that makes use of a hashtag and mentions another user. I don't know how this would actually appear coming from the Twitter Streaming API, and I am assuming that, at least the tweet itself (excluding date/time and stuff) , is just raw text, like Hey @alex1987, thank you for telling me how cool is #rubyonrails So: in order to make Solr understand that here we have a mention to a user (@alex1987) and a hashtag (#rubyonrails) I have to format it myself and include those info in my own XML schema, and preprocess that tweet in order to get to add doc field name=tweetIdABCD1234/field field name=tweet_textHey @alex1987, thank you for telling me how cool is #rubyonrails/field field name=userhappyRubyist/field field name=mentionsalex1987/field field name=hashtagsrubyonrails/field /doc /add Correct? I have to preprocess and explicit those fields, if I want them to be indexed as metadata, right? I am asking since I am new here to Solr. Although, I imagine quite a few people have already done this quite a few times before, so maybe somebody could contribute their Twitter Solr schema. Anybody? Oh that would be nice :-) Cheers, Giovanni
Re: Negative value in numFound
Hm... Have you any errors in logs? During search, during indexing? -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986426.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: UpdateRequestProcessor : flattened values
... the access to individual literal fields seems (currently) very limited as they appear to be flattened. That is a feature of SolrCell, to flatten multiple values for a non-multi-valued field into a string concatenation of the values. All you need to do is add multiValued="true" to the author field in your schema.xml:
<field name="author" type="text_general" indexed="true" stored="true"/>
becomes
<field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>
-- Jack Krupansky -Original Message- From: Raphaël Sent: Monday, May 28, 2012 7:17 AM To: solr-user@lucene.apache.org Subject: Re: UpdateRequestProcessor : flattened values On Sun, May 27, 2012 at 11:54:02PM -0400, Jack Krupansky wrote: You can create your own update processor that gets control between the output of Tika and the indexing of the document. See: http://wiki.apache.org/solr/UpdateRequestProcessor Seems to be exactly what I was looking for, thanks a lot! I just started an (almost working) implementation, but I have one note. Let's get a field's valueS: Collection v = doc.getFieldValues("author"); (in my `processAdd(AddUpdateCommand cmd)`) and push a doc, say using: `curl -F content=@my.pdf -F literal.author=a -F literal.author=b -F literal.author=c d` Then `log.warn("author: " + v + " : " + v.size());` prints: WARN: author: [pdfauthor, a b c d] : 2 It's not (yet) a blocker in my personal case, but I fear it's important enough to be noted: using a custom UpdateRequestProcessor, the access to individual literal fields seems (currently) very limited, as they appear to be flattened. I'm quite sure there is already a hidden bug report about this somewhere. Other than that, and unless I hit some other unexpected issue, this way of customizing the request processor perfectly suits my needs. thanks!
Re: indexing unstructured text (tweets)
Hello Jack and Anuj, 2012/5/28 Jack Krupansky j...@basetechnology.com: The Twitter API extracts hash tag and user mentions for you, in addition to giving you the full raw text. You'll have to read up on the Twitter API. That's what I thought just after hitting send on the message above ;-) I am pretty sure the Twitter API format maps very nicely to a suitable input format for Solr, if not even being already good for direct feeding into Solr. I am a bit unlucky here because I have been provided with only the raw text for about 1.5 million tweets, so I will have to write a few lines of code to restore at least user mentions, hashtags and URLs. 2012/5/28 Anuj Kumar anujs...@gmail.com: This is a bit old but provides good information for schema design- http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php Found this link as well- https://gist.github.com/702360 The types of the field may depend on the search requirements. Anuj, you provide very interesting links here, thanks, even though those kinds of specifics might already be present in the Twitter API docs. After I'm done with my first Solr setup, I might set up the whole pipeline (getting the Twitter feeds myself) on my machines, so that I can exploit the whole information content provided by Twitter. Cheers, Giovanni
Re: Negative value in numFound
Is this for a single-shard or multi-shard index? There is a 2^31-1 limit for a single Lucene index since document numbers are int (32-bit signed in Java) in Lucene, but with Solr shards you can have a multiple of that, based on number of shards. If you are multi-shard, maybe one of the shards grew too large. -- Jack Krupansky -Original Message- From: tosenthu Sent: Monday, May 28, 2012 8:15 AM To: solr-user@lucene.apache.org Subject: Negative value in numFound Hi I have a index of size 1 Tb.. And I prepared this by setting up a background script to index records. The index was fine last 2 days, and i have not disturbed the process. Suddenly when i queried the index i get this response, where the value of numFound is negative. Can any one say why/how this occurs and also the cure. response lst name=responseHeader int name=status0/int int name=QTime0/int lst name=params str name=indenton/str str name=start0/str str name=q*:*/str str name=version2.2/str str name=rows10/str /lst /lst result name=response numFound=-2125433775 start=0/ /response Regards Senthil Kumar M R -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398.html Sent from the Solr - User mailing list archive at Nabble.com.
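To illustrate the multi-shard route (host names, ports, and core names below are placeholders, not the poster's actual setup), several smaller cores can be queried together with Solr's distributed search, where each individual core stays well below the 2^31-1 document limit:

http://host1:8983/solr/core0/select?q=*:*&shards=host1:8983/solr/core0,host2:8983/solr/core1,host2:8983/solr/core2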
Re: indexing unstructured text (tweets)
Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user names and hash tags: http://saturnboy.com/2010/02/parsing-twitter-with-regexp/ -- Jack Krupansky -Original Message- From: Giovanni Gherdovich Sent: Monday, May 28, 2012 10:35 AM To: solr-user@lucene.apache.org Subject: Re: indexing unstructured text (tweets) Hello Jack and Anuj, 2012/5/28 Jack Krupansky j...@basetechnology.com: The Twitter API extracts hash tag and user mentions for you, in addition to giving you the full raw text. You'll have to read up on the Twitter API. That's what I thought just after hittind send on the message above ;-) I am pretty sure the Twitter API format maps very nicely to a suitable input format for Solr, if not even being already good for direct feeding into Solr. I am a bit unlucky here because I have been provided with only the raw text for about 1.5 million tweets; so I would have to write a few lines of code to restore at least user mentions, hashtags and URLs. 2012/5/28 Anuj Kumar anujs...@gmail.com: This is a bit old but provides good information for schema design- http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php Found this link as well- https://gist.github.com/702360 The types of the field may depend on the search requirements. Anuj you provide very interesting links here, thanks, even tho those kind of specifics might be already present in the twitter API doc. After I'll be done with my first Solr setup, I might setup the whole pipeline (getting the Twitter feeds myself) on my machines, so that I can exploit the whole information content provided by Twitter. Cheers, Giovanni
Re: indexing unstructured text (tweets)
2012/5/28 Jack Krupansky j...@basetechnology.com: Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user names and hash tags: http://saturnboy.com/2010/02/parsing-twitter-with-regexp/ Awesome! thank you very much Jack. GGhh
Re: indexing unstructured text (tweets)
On 28 May 2012 20:12, Jack Krupansky j...@basetechnology.com wrote: Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user names and hash tags: http://saturnboy.com/2010/02/parsing-twitter-with-regexp/ [...] One could also use the Solr DataImportHandler, and RegexTransformer to do the job: http://wiki.apache.org/solr/DataImportHandler#RegexTransformer Regards, Gora
Re: Negative value in numFound
There was an Out Of Memory error, but the indexing still kept going after that. -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986437.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative value in numFound
In some cases multi-shard architecture might significantly slow down the search process at this index size... By the way, how much RAM do you use? -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986438.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative value in numFound
Hi It is multicore, but even when I issued the sharded query I get this response: <result name="response" numFound="-390662429" start="0"> which is again a negative value. It might be that the total number of records reached 2147483647 (2^31-1), but is this limitation documented anywhere? What is the strategy to overcome this situation? My application is expected to hold 12 billion records, so please suggest a strategy for my situation. Regards Senthil Kumar M R -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986439.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative value in numFound
OOM is a problem. You need more RAM and more machines, and maybe more shards. -- Jack Krupansky -Original Message- From: tosenthu Sent: Monday, May 28, 2012 11:29 AM To: solr-user@lucene.apache.org Subject: Re: Negative value in numFound There was an Out Of Memory.. But still the indexing was happening further.. -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986437.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative value in numFound
numFound=-390662429 That suggests that you have at least two shards which each have 2G docs (2^31-1). How many shards do you have and how big do you think they should be in terms of number of documents? Are you being careful to distribute your update requests between shards so that no shard grows too large? That gets back to the preceding question. -- Jack Krupansky -Original Message- From: tosenthu Sent: Monday, May 28, 2012 11:34 AM To: solr-user@lucene.apache.org Subject: Re: Negative value in numFound Hi It is a multicore but when i searched the shards query even then i get this response result name=response numFound=-390662429 start=0 which is again a negative value. Might be the total number of records may be 2147483647 (2^31-1), But is this limitation documented anywhere. What is the strategy to over come this situation. Expectation of my application is to have 12 billion records. So please suggest me a strategy for my situation. Regards Senthil Kumar M R -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986439.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: UpdateRequestProcessor : flattened values
On Mon, May 28, 2012 at 10:30:03AM -0400, Jack Krupansky wrote: ... the access to individual literal fields seems (currently) very limited as they appear to be flattened. That is a feature of SolrCell, to flatten multiple values for a non-multi-valued field into a string concatenation of the values. All you need to do is add multiValued=true to the author field in your schema.xml: Indeed, it both works and makes sense, though I'd have thought the flattening happened later in the process. Again, thanks for your precious help.
Re: UpdateRequestProcessor : flattened values
And it might make sense to have a multi-value flattening attribute for Solr itself rather than in SolrCell. -- Jack Krupansky -Original Message- From: Raphaël Sent: Monday, May 28, 2012 12:56 PM To: solr-user@lucene.apache.org Subject: Re: UpdateRequestProcessor : flattened values On Mon, May 28, 2012 at 10:30:03AM -0400, Jack Krupansky wrote: ... the access to individual literal fields seems (currently) very limited as they appear to be flattened. That is s feature of SolrCell, to flatten multiple values for a non-multi-valued field into a string concatenation of the values. All you need to do is add multiValued=true to the author field in your schema.xml: Indeed it both works and makes sense though I'd have thought flattening had happen later in the process. Again, thank for your precious help.
Re: Negative value in numFound
The RAM is about 14.5 GB, allocated for Tomcat. I now have 2 shards, and I was under the impression I could handle it with a couple of shards. But in this case I need to have shards which can each only grow up to 2^31-1 records, and many such shards to support 12 billion records. I will try to have more cores and distribute updates between them. Then comes my next question: is there any configuration that makes a core reject updates based on the number of records? And is there a possibility to split an index into 2 or more based on a query? In any case my network will have 2 Solr servers participating in indexing and search. Probably I need to have at least 6 cores distributed across these machines to support 12 billion records. What do you say? -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986453.html Sent from the Solr - User mailing list archive at Nabble.com.
xpathentityprocessor not import all documents
I have XML files that I need to import into Solr. The XML looks like this:
<root>
  <doc>
    <id>1</id>
    <name>albert</name>
    <add>LA</add>
  </doc>
  <doc>
    <id>2</id>
    <name>john</name>
    <add>NY</add>
  </doc>
</root>
The XML file path is stored in an SQL database, so I have created the DataImportHandler config as below:
<dataConfig>
  <dataSource name="ds1" ... />
  <dataSource type="FileDataSource" name="FD" />
  <document name="Emp">
    <entity name="FilePath" query="SelectFilePathFromDB">
      <entity name="xmlEntity" onError="continue" rootEntity="true" processor="XPathEntityProcessor" forEach="/root/doc/" url="${FilePath.Path}" dataSource="FD">
        <field xpath="root/doc/id" column="id" />
        <field xpath="root/doc/name" column="name" />
        <field xpath="root/doc/add" column="add" />
      </entity>
    </entity>
  </document>
</dataConfig>
Now when I do a full import, it adds only one document, but when I tried removing the FilePath entity and giving a static path in url, all documents were imported properly. Where am I making a mistake? Sagar Joshi -- View this message in context: http://lucene.472066.n3.nabble.com/xpathentityprocessor-not-import-all-documents-tp3986441.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative value in numFound
I think 100 million documents is a realistic number for a single shard. Maybe 250 million depending on your data. But I would say that beyond that is being unrealistic. In some cases, even 50 million might be too much for a single shard, depending on the data and query usage. Sure, maybe depending on your data 2 billion documents might work, but I wouldn't bet on it. And even if you manage to index 500 million or more documents on a single shard, memory and performance for production query loads would be questionable. Query capacity also depends on things like number of faceted fields (i.e., the field cache), string field size, number of unique terms in each field, solr query cache, and highlighting of large fields. Not to mention wanting to have enough capacity so that the number of documents can grow over time. As an experiment, index 250 million documents in one shard and see how typical queries perform, and how much JVM memory you use and still have available. Make sure to try quite a few queries (using a script), especially if any fields are faceted or highlighted. Then you can decide whether you feel comfortable trying a larger shard size or if a smaller size is needed. -- Jack Krupansky -Original Message- From: tosenthu Sent: Monday, May 28, 2012 1:25 PM To: solr-user@lucene.apache.org Subject: Re: Negative value in numFound The RAM is about 14.5G. Allocated for Tomcat.. I have now 2 shards. But I was in an impression i can handle it with couple of Shards. But in this case i need to have shards which can only grow up 2^31-1 records and many such shards to support 12 Billion records. I will try to have more cores and distribute update between them. Then comes my next question. Is there a possibility to restrict by any configuration for a core to reject updates based on the number of records. And is there a possibility to split a index into 2 or more based on a query. Any how my network will have 2 SOLR servers to participate in indexing and search.. Probably i need to have at least 6 cores distributed across these machines to support 12 Billion Records. What is you say? -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986453.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative value in numFound
You went over the max limit for number of docs. On Monday, May 28, 2012, tosenthu wrote: Hi I have a index of size 1 Tb.. And I prepared this by setting up a background script to index records. The index was fine last 2 days, and i have not disturbed the process. Suddenly when i queried the index i get this response, where the value of numFound is negative. Can any one say why/how this occurs and also the cure. response lst name=responseHeader int name=status0/int int name=QTime0/int lst name=params str name=indenton/str str name=start0/str str name=q*:*/str str name=version2.2/str str name=rows10/str /lst /lst result name=response numFound=-2125433775 start=0/ /response Regards Senthil Kumar M R -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398.html Sent from the Solr - User mailing list archive at Nabble.com. -- Bill Bell billnb...@gmail.com cell 720-256-8076
suggestions developing a multi-version concurrency control (MVCC) mechanism
Hello all, For the first step of the distributed snapshot isolation system I'm developing for Solr, I'm going to need an MVCC mechanism as opposed to the single-version concurrency control mechanism already developed (the DistributedUpdateProcessor class). I'm trying to find the best way to develop this in Solr 4.x (trunk), so any help would be greatly appreciated! Essentially I need to be able to store multiple versions of a document so that when you look up a document with a given timestamp, you're given the correct version (anything with the same timestamp or older, not newer). The older versioned documents need to be stored in the index itself to ensure they are durable and can be manipulated as other Solr data can be. One way to do this is to store the old versioned Solr documents within the latest Solr document, but I'm not sure this is even possible? Alternatively, I could have the latest versioned document store the unique keys which point to the other, older documents. The problem with this is that it complicates things, having various partial objects which all combine into one logical document. Are there any suggestions as to the best way to develop this feature? Thank you in advance for any help you can spare! Nicholas
Re: xpathentityprocessor not import all documents
Try adding rootEntity=false to the FilePath entity. The DIH code ends up ignoring your rootEntity=true on the XPathEntityProcessor entity if the parent does not have rootEntity=false. I'm not sure if that is really correct, but that's the way the code is. -- Jack Krupansky -Original Message- From: Sagar Joshi Sent: Monday, May 28, 2012 11:47 AM To: solr-user@lucene.apache.org Subject: xpathentityprocessor not import all documents i have xml files need to import in solr, xml looks like below, root doc id1/id namealbert/name addLA/add /doc doc id2/id namejohn/name addNY/add /doc /root xml filepath is in sql database, so i have created dataimporthandler file as per below dataConfig dataSource name=ds1 ../ dataSource type=FileDataSource name=FD / document name=Emp entity name=FilePath query=SelectFilePathFromDB entity name=xmlEntity onError=continue rootEntity=true processor=XPathEntityProcessor forEach=/root/doc/ url=${FilePath.Path} dataSource=FD field xpath=root/doc/id column=id / field xpath=root/doc/name column=name / field xpath=root/doc/add column=add / /entity /entity /dataConfig now when i do full import, it will just add one document only, but when i tried with remove FilePath entity, and give static path in url its imported all documents properly. where i am making mistake ? Sagar Joshi -- View this message in context: http://lucene.472066.n3.nabble.com/xpathentityprocessor-not-import-all-documents-tp3986441.html Sent from the Solr - User mailing list archive at Nabble.com.
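Based on that suggestion, the relevant part of the config would then look roughly like this (only the rootEntity="false" attribute is added; everything else stays as in the original config):

<entity name="FilePath" rootEntity="false" query="SelectFilePathFromDB">
  <entity name="xmlEntity" onError="continue" rootEntity="true" processor="XPathEntityProcessor" forEach="/root/doc/" url="${FilePath.Path}" dataSource="FD">
    ...
  </entity>
</entity>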
Re: suggestions developing a multi-version concurrency control (MVCC) mechanism
You can use the document id and timestamp as a compound unique id. Then the search would also sort by id, then by timestamp. Result grouping might let you pick the most recent document from each of the sorted docs. On Mon, May 28, 2012 at 3:15 PM, Nicholas Ball nicholas.b...@nodelay.com wrote: Hello all, For the first step of the distributed snapshot isolation system I'm developing for Solr, I'm going to need to have a MVCC mechanism as opposed to the single-version concurrency control mechanism already developed (DistributedUpdateProcessor class). I'm trying to find the very best way to develop this into Solr 4.x (trunk) and so any help would be greatly appreciated! Essentially I need to be able to store multiple version of a document so that when you look up a document with a given timestamp, you're given the correct version (anything the same or older, not fresher). The older versioned documents need to be stored in the index itself to ensure they are durable and can be manipulated as other Solr data can be. One way to do this is to store the old versioned Solr documents within the latest Solr Document, but I'm not sure this is even possible? Alternatively, I could have the latest versioned Document store the unique keys which point to other older documents. The problem with this is that it complicates things having various partial objects which all combine as one logically document. Are there any suggestions as to the best way to develop this feature? Thank you in advance for any help you can spare! Nicholas -- Lance Norskog goks...@gmail.com
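A rough sketch of how that lookup might be expressed as a single query (the field names doc_id and version_ts are hypothetical, not from an existing schema): filter to versions at or before the requested timestamp, group on the logical document id, and keep only the newest version in each group:

http://localhost:8983/solr/select?q=*:*&fq=version_ts:[*+TO+2012-05-28T00:00:00Z]&group=true&group.field=doc_id&group.sort=version_ts+desc&group.limit=1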
Re: boost date for fresh result query
please suggest me I am stuck here.. On Mon, May 28, 2012 at 3:21 PM, Jonty Rhods jonty.rh...@gmail.com wrote: Hi I am facing problem to boost on date field. I have following field in schema field name=release_date type=date indexed=true stored=true/ solr version 3.4 I don't want to sort by date but want to give 50 to 60% boost those result which have latest date... following are the query : http://localhost:8083/solr/movie/select/?defType=dismaxq=titanicfq=title,tagsqf=tags ^2.0,title^1.0mm=1bf=recip(rord(release_date),1,1000,1000) but it seems the result has no effect because still all old date results are on top or I am unable to make query syntax. Is there anything missing in query.. Please help me to make the function possible.. thanks regards Jonty
Re: boost date for fresh result query
Add debugQuery=true to your query and look at the scores of the older vs. newer docs compared to the boost. Maybe the boost needs to be increased. -- Jack Krupansky -Original Message- From: Jonty Rhods Sent: Monday, May 28, 2012 5:51 AM To: solr-user@lucene.apache.org Subject: boost date for fresh result query Hi I am facing problem to boost on date field. I have following field in schema field name=release_date type=date indexed=true stored=true/ solr version 3.4 I don't want to sort by date but want to give 50 to 60% boost those result which have latest date... following are the query : http://localhost:8083/solr/movie/select/?defType=dismaxq=titanicfq=title,tagsqf=tags ^2.0,title^1.0mm=1bf=recip(rord(release_date),1,1000,1000) but it seems the result has no effect because still all old date results are on top or I am unable to make query syntax. Is there anything missing in query.. Please help me to make the function possible.. thanks regards Jonty
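For example (only a sketch; the boost function and its constants are illustrative, not a verified fix for this index), the request with debugQuery enabled and a millisecond-based date boost might look like:

http://localhost:8083/solr/movie/select/?defType=dismax&q=titanic&qf=tags^2.0,title^1.0&mm=1&bf=recip(ms(NOW,release_date),3.16e-11,1,1)&debugQuery=true

Here 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, so a document's boost drops to about half after one year; comparing the boost's score contribution in the debug output against the main relevance scores shows whether it needs to be scaled up.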
RE: useFastVectorHighlighter doesn't work
Hi, The reason why I use useFastVectorHighlighter is that I want to set stored=false, together with settings like termVectors=true, termPositions=true and termOffsets=true. If stored=true, what is the difference between the normal highlighter and the FastVectorHighlighter? What is the right situation for using the FastVectorHighlighter? Thanks! -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: May 28, 2012 16:40 To: solr-user@lucene.apache.org Subject: Re: useFastVectorHighlighter doesn't work I had a schema defined as <field name="text" type="text" indexed="true" stored="false" termVectors="true" termPositions="true" termOffsets="true"/> You need to mark your text field as stored="true" to use hl.useFastVectorHighlighter=true
Re: Negative value in numFound
... is this limitation documented anywhere... Kind of, but not very well, at least at the Lucene level. The Lucene File Formats page says Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. It also says that The first document added to an index is numbered zero. Since Java's Integer.MAX_VALUE is 2^31-1, that means the maximum number of documents in a single Lucene (or Solr) index is 2^31. See: http://lucene.apache.org/core/3_6_0/fileformats.html And the Lucene IndexSearcher API uses int for the document number and the number of documents in the index. See: http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/IndexSearcher.html There is a little discussion of the limit issue here: https://issues.apache.org/jira/browse/LUCENE-2420 I am not aware of any explicit mention of the single-index Lucene document limit at the Solr level. -- Jack Krupansky -Original Message- From: tosenthu Sent: Monday, May 28, 2012 11:34 AM To: solr-user@lucene.apache.org Subject: Re: Negative value in numFound Hi It is a multicore but when i searched the shards query even then i get this response result name=response numFound=-390662429 start=0 which is again a negative value. Might be the total number of records may be 2147483647 (2^31-1), But is this limitation documented anywhere. What is the strategy to over come this situation. Expectation of my application is to have 12 billion records. So please suggest me a strategy for my situation. Regards Senthil Kumar M R -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986439.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Accent Characters
The query seems fine - as far as the URL being UTF-8. It seems that the documents are not being passed to Solr with UTF-8 encoding. The document is not part of the URL. It is HTTP POST data. Try an explicit curl command to add a document and see if it is indexed with the accents. -- Jack Krupansky -Original Message- From: couto.vicente Sent: Monday, May 28, 2012 9:58 AM To: solr-user@lucene.apache.org Subject: Re: Accent Characters Hi, Jack. First of all thank you for your help. Well, I tried again then I realized that my problem is not really with solr. I did run this query against solr after start it up with the command java -jar start.jar: http://localhost:8983/solr/coreFR/spell?q=content:pr%C3%A9sentaspellcheck=truespellcheck.collate=truerows=0spellcheck.count=10 It gives me the result: ?xml version=1.0 encoding=UTF-8 ? response lst name=responseHeader int name=status0/int int name=QTime31/int /lst result name=response numFound=0 start=0 / lst name=spellcheck lst name=suggestions lst name=présenta int name=numFound10/int int name=startOffset8/int int name=endOffset16/int arr name=suggestion strprésente/str strprésent/str strprésenté/str strprésents/str strprésentant/str strprésentera/str strprésentait/str strprésentes/str strprésenter/str strprésentée/str /arr /lst str name=collationcontent:présente/str /lst /lst /response And I did run exactly the same query after deploy solr.war in tomcat 7. Here is my result: ?xml version=1.0 encoding=UTF-8 ? response lst name=responseHeader int name=status0/int int name=QTime16/int /lst result name=response numFound=0 start=0 / lst name=spellcheck lst name=suggestions lst name=présenta int name=numFound10/int int name=startOffset8/int int name=endOffset16/int arr name=suggestion strpresent/str strprbsent/str strpresentant/str strpresentait/str strpuisent/str strpasent/str strpensent/str strposent/str strdresent/str strresenti/str /arr /lst str name=collationcontent:present/str /lst /lst /response As my application is running under tomcat, it means that I have some issue with tomcat, but the weird stuff is that I already google it looking for a fix and find out that we have to set up a parameter into server.xml tomcat config file: Connector port=5443 protocol=HTTP/1.1 connectionTimeout=2 redirectPort=8443 URIEncoding=UTF-8 / But it's not working as you can see. I'm feeling a little stupid because it doesn't look like a big problem. For sure people around the world are using solr with accents queries running under tomcat properly! Thank you -- View this message in context: http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3986423.html Sent from the Solr - User mailing list archive at Nabble.com.
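For instance (a minimal sketch; the field name content comes from the query above, while id and the text itself are made-up test data), an explicit UTF-8 POST could look like:

curl "http://localhost:8983/solr/coreFR/update?commit=true" -H "Content-Type: text/xml; charset=utf-8" --data-binary '<add><doc><field name="id">test-1</field><field name="content">il va présenter le présent</field></doc></add>'

If the accents survive in the indexed document when posted this way, the problem lies in how the application or its container encodes the documents, not in Solr itself.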
Multicore Issue - Server Restart
Hello, We have one multicore webapp for every 50 cores; currently there are 3 multicore webapps with 150 cores distributed across them. When we restarted the server [Tomcat], we noticed that solr.xml was wiped out: we could not see any cores in webapp1 and webapp3, and only a few cores in webapp2. The solr.xml has persistent="true". Given this, what could have possibly happened? The solution was to add all the cores manually to solr.xml and restart the server, but I am unsure as to what caused this, and there is no indication of any issue in the logs either. Regards Suajtha
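For reference, a persistent multicore solr.xml in Solr 3.x looks roughly like this (the core names and instanceDir values below are placeholders, not the actual configuration); with persistent="true", Solr is expected to rewrite this file itself whenever cores are added or changed through the CoreAdmin API:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>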