I have a field that is defined using what I believe is a fairly standard text
fieldType. I have documents with the words 'evaluate', 'evaluating', and
'evaluation' in them. When I search on the whole word it obviously works;
if I search on 'eval' it finds nothing. However, for some reason, if I search
on
I don't use dismax, but I do something similar with a regular query. I have a
field defined in my schema.xml called 'dummy' (not sure why it's called that,
actually), but it defaults to 1 on every document indexed. So, say I want to
give a score bump to documents that have an image, I can do queries
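like the following (everything except the dummy field is made up):
q=+content:pizza +(dummy:1 OR has_image:true^5)
Every document matches the dummy:1 clause, so the boolean clause excludes
nothing, but documents that also match has_image:true^5 get the score bump.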
I believe that is not a setting; it's not telling you that you have 'optimize
turned on'. It's a state: your index is currently optimized. If you index a
new document or delete an existing document, and don't issue an optimize
command, then your index should show optimize=false.
I have a field defined as:
<field name="content" type="text" indexed="true" stored="false" termVectors="true" multiValued="true" />
where text is unmodified from the schema.xml example that came with Solr
1.4.1.
I have documents with some compound words indexed, words like Sandstone. And
in several cases
I tried setting catenateWords=1 on the Query analyzer and that didn't do
anything. I think what I need is to set my Index Analyzer to have
preserveOriginal=1 and then re-index everything. That will be a pain, so
I'll do a small test to make sure first. I'm really surprised
preserveOriginal=1 isn't
This is more speculation than direction; I don't currently use Field
Collapsing, but my take on it is that it returns the number of docs
collapsed. So instead of faceting, could you do a search returning DocID,
collapsing on DocID and sorting on date? Then the count of collapsed docs
*should* match
Thanks, Markus, for your patience in getting the response in, as well as for
the comments.
This is my Dev environment, I'm actually going to be setting up a new
master-slave configuration in a different environment today. I'll see if
it's environment specific or not. One thing I didn't mention, wasn't
But I can query Cassandra directly for the documents if I wanted/needed to?
And, when I need to re-index, I could read from Cassandra, index into Solr,
which will write back to Cassandra overwriting the existing document(s)?
Basically, the steps would be: index documents into Solr, which would
Ah. I see. That reduces its usefulness to me some. The multi-master aspect is
still a big draw of course. But I was hoping this added an integrated
persistence layer to Solr as well.
Is it just me or is Replication a POS? (Solr 1.4.1, Tomcat 6.0.32)
1) I had set my pollInterval to 60 seconds, but it appears to fire constantly,
so I set it to 5 minutes. Now I see in the Tomcat logs that it fires the
replication check anywhere from 2 minutes to 4 1/2 minutes apart, but never
anything
create a separate document for each book-bookshelf combination.
doc 1 = book 1,shelf 1
doc 2 = book 1,shelf 3
doc 3 = book 2,shelf 1
etc.
then your queries are q=book_id to get all bookshelves a given book is on,
or q=shelf_id to get all books on a given bookshelf.
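For example, with the docs above:
q=book_id:1 returns doc 1 and doc 2 (book 1 sits on shelves 1 and 3)
q=shelf_id:1 returns doc 1 and doc 3 (shelf 1 holds books 1 and 2)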
Biggest problem people face
I modified the subject to include Lucendra, in case anyone has heard of it by
that name.
My understanding is that the Master has done all the indexing, and that
replication is a series of file copies to a temp directory, then a move and
commit. The slave only gets hit with the effects of a commit: whatever
warming queries are in place run, and the caches get reset. Doing too many
commits
The recent Amazon outage exposed a weakness in our architecture. We could
really use a Master-Master redundancy. We already have Master to multiple
Slaves. I've looked at the various options of converting a Slave into a
Master, of having a Repeater (hybrid master/slave) become the Master etc.
But,
Master/slave replication does this out of the box, easily. Just set the slave
to update on optimize only. Then you can update the master as much as you
want. When you are ready to update the slave (the search instance), just
optimize the master. On the slave's next cycle check it will refresh.
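The master-side setting I mean is replicateAfter; a minimal sketch (handler
name as in the stock solrconfig.xml):
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>
With only 'optimize' listed (no 'commit'), the slave won't see changes until
you explicitly optimize the master.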
I have Replication set up with
<str name="pollInterval">00:00:60</str>
I assumed that meant it would poll the master for updates once a minute. But
my logs make it look like it is trying to sync up almost constantly. Below
is an example of my log from just 1 minute in time. Am I reading this wrong?
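For context, that line sits inside the slave section of the ReplicationHandler
config, something like this (masterUrl made up):
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>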
I'm using version 1.4.1. It appears that when several documents in a result
set have the same score, the secondary sort is by 'indexed_at' ascending.
Can this be altered in the config xml files? If I wanted the secondary sort
to be indexed_at descending for example, or by a different field, say
Indexing isn't a problem; it's just disk space, and space is cheap. But if
you do facets on all those price columns, that gets put into RAM, which isn't
as cheap or plentiful. Your cache buffers may get overloaded a lot and
performance will suffer.
2000 price columns seems like a lot; could the
Is NAME a product name? Why would it be multivalued? And why would it appear
on more than one document? Is each 'document' a package of products? And are
the pricing tiers on the package, not individual pieces?
So it sounds like you could, potentially, have a PriceListX column for each
user. As your
Is sort order when 'score' is the same a Lucene thing? Should I ask on the
Lucene forum?
Au contraire, I have almost 4 million documents, representing businesses in
the US, and having the score be the same is a very common occurrence.
It is quite clear from testing that if the score is the same, then it sorts
on indexed_at ascending. It seems silly to make me add a sort on every query.
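(The workaround is to append an explicit secondary sort to every request,
e.g. &sort=score+desc,indexed_at+desc, which is exactly what I'd like to
avoid.)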
Is a new DocID generated every time a doc with the same UniqueID is added to
the index? If so, then DocID must be incremental and would look like
indexed_at ascending. What I see (and why it's a problem for me) is the
following:
A search brings back the first 5 documents in a result set of, say, 60.
Some options to reduce the performance implications are:
replication... index your documents in one Solr instance, and query in a
different one. That way the users of the query side will not be as adversely
impacted by frequent changes, and you have better control over when change
occurs.
separate
I've tried several times to get an active account on
solr-...@lucene.apache.org and the mailing list won't send me a confirmation
email, and therefore won't let me post because I'm not confirmed. Could I
get someone that is a member of Solr-Dev to post either my original request
in this thread,
I have a huge need for a new field type. It would be a Poly field, similar to
Point or Payload. It would take 2 data elements and a search would return a
hit if the search term fell within the range of the elements. For example
let's say I have a document representing an Employment record. I may
True. And that's my temporary solution. But it's ugly code, even uglier
queries. I may have several such fields in a single query. A PolyField
solution would be so much more elegant and useful. I'm actually shocked more
people don't need/want something like it.
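(For the curious, the two-field workaround query looks something like
+salary_min:[* TO 50000] +salary_max:[50000 TO *]
with the search value plugged into both range clauses; field names are made
up, and with several such pairs in one query it gets ugly fast.)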
I use a lot of dynamic fields, so looking at my schema isn't a good way to
see all the field names that may be indexed across all documents. Is there a
way to query solr for that information? All field names that are indexed, or
stored? Possibly a count by field name? Is there any other metadata
That's exactly what I wanted, thanks. Any idea what
<long name="version">1294513299077</long>
refers to under the index section? I have 2 cores on one Tomcat instance,
and 1 on a second instance (different server) and all 3 have different
numbers for version, so I don't think it's the version of
Thanks guys. I read (and actually digested this time) how multivalued fields
work and now realize my question came from a 'structured language/dbms'
background. The multivalued field is stored basically as a single value with
extra spacing between terms (the positionIncrementGap previously
Is that it? Of all the strange, esoteric, little understood configuration
settings available in solrconfig.xml, the only thing that affects Index
Speed vs Query Speed is turning on/off the Query Cache and RamBufferSize?
And for the latter, why wouldn't RamBufferSize be the same for both...that
No, I have both, a single field (for free form text search), and individual
fields (for directed search). I already duplicate the data and that's not a
problem, disk space is cheap. What I wanted to know was whether it is best
to make the single field multiValued=true or not. That is, should my
In the wiki, in the book by Smiley and Pugh, and in the comments inside the
solrconfig.xml file itself, the various settings are always discussed in the
context of a blended-use Solr index. By that I mean, it is assumed you are
indexing and querying from the same Solr instance. However, if I have a
I have to support both general searches (free form text) and directed
searches (field:val field2:val). To do the general search I have a field
defined as:
<field name="content" type="text" indexed="true" stored="false" termVectors="true" multiValued="true" />
and several copyField commands like:
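(source field names here are made up)
<copyField source="name_t" dest="content"/>
<copyField source="description_t" dest="content"/>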
http://wiki.apache.org/solr/FrontPage Solr Wiki
http://wiki.apache.org/solr/FAQ Solr FAQ
http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1847195881/ref=sr_1_1?ie=UTF8&qid=1295018231&sr=8-1
A good book on Solr
And this forum you posted to
If this is a one-time cleanup, not something you need to do programmatically,
you could delete the index directory (<solrDir>/data/index). In my case I
have to stop Tomcat, delete .\index, and restart Tomcat. It is very fast and
starts me out with a fresh, empty index. I noticed you are multi-core,
A/ You have to update all the fields; if you leave one off, it won't be in
the document anymore. I have my 'persisted' data stored outside of Solr, so
on update I get the stored data, modify it, and update Solr with every field
(even if only one changed). You could also do a Query/Modify/Update
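A sketch of what that full re-add looks like (field names made up):
<add>
  <doc>
    <field name="id">123</field>
    <field name="name">new value</field>
    <field name="city">unchanged value, but still re-sent</field>
  </doc>
</add>
Leave 'city' out and it disappears from the indexed document.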
I have about 30 million documents and with the exception of the Unique ID,
Type and a couple of date fields, every document is made of dynamic fields.
Now, I only have maybe 1 in 5 being multi-value, but search and facet
performance doesn't look appreciably different from a fixed schema solution.
Yep, www.apache.org is down. They tick off the wikihackers too? :)
Hear hear! In the beginning of my journey with Solr/Lucene I couldn't have
done it without this site. Smiley and Pugh's book was useful, but this forum
was invaluable. I don't have as many questions now, but each new venture,
Geospatial searching, replication and redundancy, performance tuning,
Multi-word queries are the bread and butter of Solr/Lucene, so I'm not sure I
understand the complete issue here. For clarity, is 'abstract' the name of
your default text field, or is your query
q=abstract: mouse genome
? If the latter, my thought was: is it possible that the query is being
While we are on this subject...my company is kind of new to the whole open
source as a production tool concept. I can't push anything to production
that isn't labeled as 'release' or similar designation. So, 1.4.1 is what I
have right now. I can play with other versions but that's about it. I'm
Thanks Jan. I didn't know about 1.4.2; I'll give it a look. However, your link
is something I've already seen. I understand the different Solr versions; my
question was more about the process, and timeline, for the community to turn
the current trunk into a 'release'. From that link, and
Unfortunately the default operator is set to AND and I can't change that at
this time.
If I do (city:Chicago^10 OR Romantic OR View) it returns way too many
unwanted results.
If I do (city:Chicago^10 OR (Romantic AND View)) it returns fewer unwanted
results, but still a lot.
iorixxx's solution
I can't seem to find the right formula for this. I have a need to build a
query where one of the fields should boost the score, but not affect the
query if there isn't a match. For example, if I have documents with
restaurants, name, address, cuisine, description, etc. I want to search on,
say,
Jonathan, dismax is something I've been meaning to look into, and bq does
seem to fit the bill, although I'm worried about this line in the wiki:
'TODO: That latter part is deprecated behavior but still works. It can be
problematic so avoid it.'
It still seems to be the closest to what I want
I modified the text of this hopefully to make it clearer. I wasn't sure what
I was asking was coming across well. And I'm adding this comment in a
shameless attempt to boost my question back to the top for people to see.
Before I write a messy workaround, I just wanted to check with the community to
Doing a range search is straightforward. I have a fixed value in a document
field, I search on [x TO y] and if the fixed value is in the range requested
it gets a hit. But, what if I have data in a document where there is a min
value and a max value and my query is a fixed value and I want to get
Interesting wiki link; I hadn't seen that table before.
And to answer your specific question about indexed=true, stored=false: this
is most often done when you are using analyzers/tokenizers on your field.
The field is for search only; you would never retrieve its contents for
display. It may
I don't understand why you would want to show Sweden if it isn't in the
index; what will your UI do if the user selects Sweden?
However, one way to handle this would be to make a second document type.
Have a field called type or some such, and make the new document type be
'dummy' or 'system' or
Mem usage is over 400M; do you mean Tomcat mem size? If you don't give your
cache sizes enough room to grow, you will choke the performance. You should
adjust your Tomcat settings to let the cache grow to at least 1GB, or better
would be 2GB. You may also want to look into
Keep in mind that the <str name="id"> paradigm isn't completely useless; the
str is a data type (string), and it can be int, float, double, date, and
others. So, to not lose any information, you may want to do something like:
<id type="int">123</id>
<title type="str">xyz</title>
Which I agree makes more sense to
Quick tangent... I went to the link you provided, and the delete part makes
sense. But in the next tip, how to re-index after a schema change, what is
the point of step 5, 'Send an <optimize/> command'? Why do you need to
optimize an empty index? Or is my understanding of optimize incorrect?
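(For reference, the sequence in that tip boils down to posting
<delete><query>*:*</query></delete>
then <commit/>, then the <optimize/> in step 5 that I'm asking about.)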
Do you have any other analyzers or formatters involved? I use delimiters in
certain string fields all the time, usually a colon (:) or a slash (/), but
it should be the same for a period. I've never seen this behavior. But if you
have any kind of tokenizer or formatter involved beyond
fieldType
Could it be a case-sensitivity issue? 'The StrField type is not analyzed, but
indexed/stored verbatim' (from the schema comments). If you are looking for
ab.pqr but it is in fact ab.Pqr in the Solr document, it wouldn't find it.
A slightly different route to take, but one that should help test/refine a
semantic parser, is Wikipedia. They make available their entire corpus, or
any subset you define. The whole thing is like 14 terabytes, but you can get
smaller sets.
Those are at least 3 different questions. Easiest first: sorting.
Add &sort=ad_post_date+desc (or +asc) to sort on date, descending or
ascending.
Check out how Lucene scores by default:
http://www.supermind.org/blog/378/lucene-scoring-for-dummies
It might be close to what you want. The
Sounds like you want the CachedSqlEntityProcessor:
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
It lets you make one query that is cached locally and can be joined to by a
separate query.
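The wiki example is roughly (table names are placeholders):
<entity name="x" query="select * from x">
  <entity name="y" query="select * from y"
          processor="CachedSqlEntityProcessor" where="xid=x.id"/>
</entity>
The inner 'y' query runs once and is cached; each 'x' row joins against that
cache instead of issuing a new query.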
Chris, I agree, having the ability to make rows something like -1 to bring
back everything would be convenient. However, the 2-call approach
(q=blah&rows=0 followed by q=blah&rows=numFound) isn't that slow, and it does
give you more information up front. You can optimize your Array or List
sizes in
You don't give an indication of size: how large are the documents being
indexed, and how many of them are there? However, my opinion would be a
single index with an 'active' flag. In your queries you can use
FilterQueries (fq=) to optimize on just active if you wish, or just
inactive if that is
Alok,
I noticed you also posted to the SolrNet forum, and that's a better place
for this question. But basically, SolrNet is a wrapper around Solr
functionality. It lets you build your Solr interactions (Queries, Stats,
Facets, etc) and Inserts/Deletes using .Net objects.
The reading of a data
You are querying for 'branch' and trying to place it in 'skill'.
Also, you have name and column backwards; it should be:
<field column="id" name="id"/>
<field column="name" name="name"/>
<field column="city" name="city_t"/>
<field column="skill" name="skill_t"/>
That's exactly what I want. I was just searching the wiki using the wrong
terms.
Thanks!
We would really need to see more information, but some first things to look
for are:
Are your field definitions in schema.xml set to indexed=true (if you want to
search the field) and stored=true (if you want to see it in the returned
results)?
Is the case of the field names the same in schema.xml
The 'text' fieldType is not suitable for sorting. You need to create 2 new
fields and use the copyField directive in your schema to copy the data from
your TITLE and UPDBY fields into them at indexing time:
<field name="TITLE_sort" type="string" indexed="true" stored="true" />
<field name="UPDBY_sort" type="string" indexed="true" stored="true" />
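The copyField directives would then be along these lines:
<copyField source="TITLE" dest="TITLE_sort"/>
<copyField source="UPDBY" dest="UPDBY_sort"/>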
my feeling is that private fields in a public document will be the hardest
nut to crack, unless you have an intermediary layer that users call instead
of hitting your solr instance directly. If you front it with a web service
you could handle various authorization scenarios a little easier.
One way I've handled this, and it works only for some types of data, is to
put the searchable part of the sub-doc in a search field (indexed=true) and
put an XML or JSON representation of the sub-doc in a stored-only field.
Then, if the main doc is hit via search, I can grab the XML or JSON.
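The field pair is something like (names made up):
<field name="subdoc_search" type="text" indexed="true" stored="false"/>
<field name="subdoc_data" type="string" indexed="false" stored="true"/>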
Are you just trying to learn the tiny details of how Solr and DIH work? Is
this just an intellectual curiosity? Or are you having some specific problem
that you are trying to solve? If you have a problem, could you describe the
symptoms of the problem? I am using Solr, DIH, and several other
Short answer is no, there isn't a way. Solr doesn't have the concept of
'Update' to an indexed document. You need to add the full document (all
'columns') each time any one field changes. If doing that in your
DataImportHandler logic is difficult you may need to write a separate Update
Service
It may not be the data config. Do you have the fields in schema.xml that the
image data is going into set to multiValued=true?
Although, I would think the last image would be stored, not the first, but I
haven't really tested this.
If your concern is performance, faceting integers versus faceting strings, I
believe Lucene makes the difference negligible. Given that choice I'd go
with string. Now, if you need to keep an association between id and string,
you may want to facet a combined field, id:string, or some other
I'd try 2 things.
First, do a query
q=EMAIL_HEADER_FROM:test.de
and make sure some documents are found. If nothing is found, there is
nothing to delete.
Second, how are you testing to see if the document is deleted? The physical
data isn't removed from the index until you optimize, I believe.
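The delete itself would look like:
<delete><query>EMAIL_HEADER_FROM:test.de</query></delete>
followed by a <commit/>.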
Glad I could help. I also would think it was a very common issue. Personally
my schema is almost all dynamic fields. I have unique_id, content,
last_update_date and maybe one other field specifically defined, the rest
are all dynamic. This lets me accept an almost endless variety of document
In your schema.xml there is an element called
<defaultSearchField>content</defaultSearchField>
though it may be something other than 'content'. This field is the one
searched if you don't specify one in the query.
You can explicitly put something there with an add, or you can have a
copyField directive in
Not sure the processing would be any faster than just querying again, but in
your original result set, the first doc that has a field value matching a
top-10 facet will be the number 1 item if you fq on that facet value. So you
don't need to query it again. You would only need to query those
I believe they come back alphabetically sorted (not sure if this is
language-specific or not), so a quick way might be to change the name from
createdate to zz_createdate or something like that.
Generally, though, you should not be worried about order with XML; it's
usually a sign of a design
The Nabble.com page for Solr - User seems to be broken. I haven't seen an
update on it since early this morning. However I'm still getting email
notifications so people are seeing and responding to posts. I'm just
curious, are you just using email and responding to
solr-u...@lucene.apache.org? Or
For STRING_VALUE, I assume there is a property in the 'select *' results
called string_value? If so, I'm not sure why it wouldn't work. If not, then
that's why: it doesn't have anything to put there.
For ATTRIBUTE_NAME, is it possibly a case issue? You called it
'Attribute_Name' in your query,
DataImportHandler (DIH) is an add-on to Solr. It lets you import documents
from a number of sources in a flexible way. The only connection DIH has to
Lucene is that Solr uses Lucene as the index engine.
When you work with Solr you naturally talk about Solr Documents, if you were
working with
Parallel calls: simultaneously query for type:short&rows=10 and
type:extensive&rows=1 and merge your results. This would also let you
separate your short docs from your extensive docs into different Solr
instances if you wished... depending on your document architecture this could
speed up one
That just gives a count of documents by type. The use-case, I believe, is to
return from a search, 10 documents of type 'short' and 1 document of type
'extensive'.
Your environment may be different, but this is how I did it. (Apache Tomcat
on Windows 2008)
go to \program files\apache...\Tomcat\conf\catalina\localhost
rename solr.xml to search.xml
recycle Tomcat service
Oh, okay. Got it now. Unfortunately I don't believe Solr supplies a total
count of matching facet values. One way to do this, although performance may
suffer, is to set your limit to -1 and just get back everything; that will
give you the count. You may want to set mincount to 1 so you aren't flooded
with zero-count values.
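Something like this (facet field name made up):
facet=true&facet.field=category&facet.limit=-1&facet.mincount=1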
No one has done performance analysis? Or has a link to anywhere it's been
done?
Basically: what's the fastest way to get documents into Solr? So many options
are available:
1) file import (xml, csv) vs DIH vs POSTing
2) number of concurrent clients: 1 vs 10 vs 100 ...is there a
It may just be a mis-wording, but if you do a distinct on 'unique' IDs, the
count should be the same as response.numFound. But if you didn't mean
'unique', just a count of some field in the results, Rebecca is correct:
facets should do the job. Something like:
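facet=true&facet.field=some_field&facet.limit=-1&facet.mincount=1
(field name made up; the number of facet entries that come back is your
distinct count, and each entry carries its per-value count)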
If at all possible, I like to do any processing work up front and not deal
with extravagant queries. If your grid definitions don't change, or don't
change often, just assign a cell number to each 100-square grid. Then, in a
pre-processing step, assign the appropriate cell number to your document.
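At query time it then becomes a simple filter (field name made up):
fq=cell_id:1234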
I was curious if anyone has done work on finding what the optimal (or max)
number of client processes is for indexing. That is, if I have the ability
to spin up N processes that construct a POST to add/update a Solr document,
is there a point at which the number of clients posting
Thanks for all the suggestions! I'm absorbing them as quickly as I can.
Your example, though, doesn't show a different ContentType; it shows a
different sort order. That would be difficult to achieve in one call. Sounds
like your best bet is asynchronous (multi-threaded) calls, if your
architecture will allow for it.
A colleague mentioned that he knew of services where you pass some content
and it spits out some suggested Tags or Keywords that would be best suited
to associate with that content.
Does anyone know if there is a contrib to Solr or Lucene that does something
like this? Or a third party tool that
Sounds like you want the 'text' fieldType (or equivalent) and are using
'string' or 'lowercase'. Those must match exactly (well, case-insensitively
in the case of 'lowercase'). The TextField types (like 'text') do
tokenization, so matches will occur under many more conditions.
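For illustration, the difference is just the field type in the schema (field
names made up):
<field name="title_exact" type="string" indexed="true" stored="true"/>
<field name="title_text" type="text" indexed="true" stored="true"/>
The first matches only the complete, verbatim value; the second is tokenized
so individual words match.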
Yep, my schema does this all day long.
Frederico,
You should also pose your question on the SolrNet forum,
http://groups.google.com/group/solrnet?hl=en
Switching from GET to POST isn't a Solr issue, but a SolrNet issue.