Hey all,
I'm trying to create a Solr index for the 2010 Medline-baseline
(www.pubmed.gov, over 18 million XML documents). My goal is to be able
to retrieve single XML documents by their ID. Each document comes with a
unique ID, the PubMedID. So my schema (important portions) looks like this:
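The schema snippet itself was lost in the archive; a minimal sketch of what the important portions might look like, with the stored-XML field name ("xml_doc") being an assumption rather than anything from the original post:

```xml
<!-- sketch only: the actual schema was stripped from the message -->
<field name="PubMedID" type="string" indexed="true"  stored="true" required="true"/>
<field name="xml_doc"  type="string" indexed="false" stored="true"/>
<uniqueKey>PubMedID</uniqueKey>
```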
Am 16.11.2010 07:25, schrieb xu cheng:
hi all:
I configured an app with Solr to index documents, and some of the documents contain Chinese content.
I've configured the Apache Tomcat URIEncoding to be utf-8,
and I use curl to send the documents in XML format.
However, when I query th
Many thanks, Peter K. for posting up on the wiki - great!
Yes, fc = field cache. Field Collapsing is something very nice indeed,
but is entirely different.
As Erik mentions in the wiki post, using per-segment faceting can be a
huge boon to performance. It does require the latest Solr trunk build
Hi Peter,
First off, many thanks for putting together the NRT Wiki page!
This may have changed recently, but the NRT stuff - e.g. per-segment
commits etc. is for the latest Solr 4 trunk only.
If your setup uses the 3x Solr code branch, then there's a bit of work
to do to move to the new version.
hashing is not 100% guaranteed to produce unique values.
It's worth reading about and knowing about :-)
Dennis Gearon
Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a
better idea to learn from others' mistakes, so you do not have to make them
Wow, if all you want is to retrieve by ID, a database would be fine, even a NoSQL database.
Dennis Gearon
Also this
http://search-lucene.com/m/hBnHH1Q4NVb2
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
On 15. nov. 2010, at 23.21, Ahmet Arslan wrote:
>> I've got a document with a "type"
>> field. If the type is 1, I want to boost the
>> document's relevancy, but type=1 i
Retrieval by ID would only be one possible case; I'm still at the
beginning of the project, I imagine adding more fields for more
complicated queries in the future. I imagine a "where - like" query over
all the XML documents stored in a DBMS wouldn't be too performant ;)
And at a later stage
I reply to myself because I found the mistake. The Italian stopwords file
that I found on the Apache site contains a shell-style comment on the same
line as each stopword; the stopwords tokenizer is probably basic and doesn't
accept comments on the same line as the stopwords. I dropped them and now
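As an illustration, the difference would look something like this (a sketch; the exact comment character in the downloaded file may differ):

```text
di | preposition    <- stopword with a trailing comment on the same line: rejected
di                  <- stopword alone on its line: accepted
```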
Hi,
I have a field as below in my feed.
-00-00T00:00:00Z
I have configured the field as below in data-config.xml.
But after indexing, the field value becomes like this
0002-11-30T00:00:00Z
I want to have the value as '-00-00T00:00:00Z' after indexing also.
Could anyon
You haven't defined what you want to see, so it's hard
to help. What does "top" mean? The order you put it
into the index? Lexical sort? Frequency count?
Numerical ordering?
Why do you want to do this? Perhaps if you explained
your use case we'd be able to offer some alternatives.
Best
Erick
On
Several questions. Pardon me if they're obvious, but I've spent far
too much of my life overlooking the obvious...
1> Is it possible you're running out of disk? 40-50G could suck up
a lot of disk, especially when merging. You may need that much again
free when a merge occurs.
2> speaking of mer
I am capturing all the user-entered search terms in the database, along with the
number of times each search term is entered. Let us say
"laptop" has been entered 100 times.
"laptop bag" has been entered 80 times.
"laptop battery" has been entered 90 times.
I am using the terms component for the auto-suggest feature. If the
> I am capturing all the user entered search terms in to the
> database and the number of times the search term is
> entered.Let us say
>
> "laptop" has entered 100 times.
> "laptop bag" has entered 80 times.
> "laptop battery" has entered 90 times.
>
> I am using terms component for auto suggest
On 11/11/2010 3:24 PM, Dennis Gearon wrote:
I look forward to the answers to this one.
Well, it seems it was as easy as adding the CJKTokenizerFactory:
positionIncrementGap="100">
Once I did that and reindexed, I could search for both English and
Chinese using the default 'text' fi
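The schema fragment above lost most of its markup in the archive; a sketch of a CJK-capable field type from that era (the field type name "text_cjk" is an assumption):

```xml
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```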
We've recently added randomized testing for result grouping that
resulted in finding + fixing a number of bugs.
If you've been using this feature, you should move to the latest
trunk version.
I've also added a section at the bottom of the wiki page to list
current limitations.
http://wiki.apache
I've got a Solr multi core system and I'm trying to swap the cores
after a re-index via SolrJ using a separate HTTP Solr web server. My
application seems to be generating a URL that's not valid for my Solr
Tomcat installation but I can't see why or where it's getting its data
from.
Core swapping
I'm using Lucid Imagination installation kit for SOLR (the last one with SOLR
1.4).
I would like to use stopwords, and I installed the Italian version of the
file in LucidWorks/lucidworks/solr/conf/stopwords.txt.
Moreover the field where I want to clean stopwords is declared in schema.xml
as
CoreAdmin is handled by /solr/admin/cores/ and not by
/solr/CORENAME/admin/cores/
On Tuesday 16 November 2010 16:17:34 Shaun Campbell wrote:
> I've got a Solr multi core system and I'm trying to swap the cores
> after a re-index via SolrJ using a separate HTTP Solr web server. My
> application s
The key is that Solr handles merges by copying, and only after
the copy is complete does it delete the old index. So you'll need
at least 2x your final index size before you start, especially if you
optimize...
Here's a handy matrix of what you need in your index depending
upon what you want to do
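Erick's 2x rule of thumb above can be sketched as a quick back-of-the-envelope check (the function name and flat factor are illustrative, not an exact formula):

```python
def required_free_gb(index_gb: float, factor: float = 2.0) -> float:
    """Merges copy segments before deleting the old ones, so budget
    roughly `factor` times the final index size in free disk."""
    return index_gb * factor

# A 50 GB index may need on the order of 100 GB free during an optimize.
print(required_free_gb(50))
```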
Hi,
Is there any way to commit docs with unique key but with shared content ?
Example:
required="true" />
I have a lot of items with the same content but different codes.
Because this index is very large, is there any way to commit docs with
a unique id and code but with shared content, to have a
Hi
I've switched my app to now use an EmbeddedSolrServer. I'm doing an
index on my rebuild core and swapping cores at the end.
Unfortunately, without restarting my web app I can't see the newly
indexed data. I can see core swapping is working, and I can see the
data after indexing without restar
Thank you very much, I will have a read on your links.
The full-text-red-flag is exactly the thing why I'm testing this with
Solr. As was said before by Dennis, I could also use a database as long
as I don't need sophisticated query capabilities. To be honest, I don't
know the performance gap
I am using the solr cloud branch on 6 machines. I first load PubMed into
HBase, and then push the fields I care about to solr. Indexing from HBase to
solr takes about 18 minutes. Loading to hbase takes a little longer (2
hours?), but it only happens once so I haven't spent much time trying to
> Nov 14, 2010 2:41:46 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.IllegalArgumentException: Increment must be zero or
> greater: -2147483648
Hi John, this looks like a tokenizer/tokenstream bug.
What I think is happening is that clearAttributes() is not properly
called, so f
sorry, yes by inject I simply mean post. I was hoping to get away without
writing any 'native' Solr code to upload gzip files, but it sounds like
that is not possible. (The files that I'm uploading (aka posting) are CSV
formatted.)
I will poke around and see which solution ServletFilter/DataImportH
thanks for the explanation.
the results for the autocompletion are pretty good now, but we still have a
small problem.
When there are hits in the "edgytext2" fields, results which only have hits in
the "edgytext" field
should not be returned at all.
Example:
Query: "Martin Sco"
Current Resu
Thanks for the feedback, guys!
On 11/15/2010 10:14 AM, Dan Lynn wrote:
Hi,
I just finished reading on the wiki about deduplication and the
solr.UUIDField type. What I'd like to do is generate an ID for a
document by hashing a subset of its fields. One route I thought would
be to do this ahea
On Tue, Nov 16, 2010 at 5:31 AM, Dennis Gearon wrote:
> hashing is not 100% guaranteed to produce unique values.
But if you go to enough bits with a good hash function, you can get
the odds lower than the odds of something else changing the value like
cosmic rays flipping a bit on you.
-Yonik
ht
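Yonik's point about the odds can be made concrete with the standard birthday bound (a sketch; the function and the document count are illustrative):

```python
import math

def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday-bound approximation for n random hashes in a 2**bits
    space: p ~ 1 - exp(-n*(n-1) / (2 * 2**bits))."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))

# Even at 10**18 documents with a 128-bit hash, the probability of any
# collision stays well below 1%.
print(collision_probability(10**18))
```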
Hi Peter,
thanks for your response. I will dig into the sharding stuff asap :-)
This may have changed recently, but the NRT stuff - e.g. per-segment
commits etc. is for the latest Solr 4 trunk only.
Do I need to turn something 'on'?
Or do you know whether the NRT patches are documented some
it seems adding the '+' (required) operator to each term in a multi-term query
does the trick:
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+
ie: edgytext2:(+Martin +Sco)
-robert
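Robert's trick of marking every term required could be automated with a tiny helper (the function name is hypothetical, not from the thread):

```python
def require_all_terms(query: str, field: str = "edgytext2") -> str:
    """Prefix each whitespace-separated term with Lucene's required
    operator '+' so a multi-term query matches only documents
    containing all of the terms."""
    terms = " ".join("+" + t for t in query.split())
    return f"{field}:({terms})"

print(require_all_terms("Martin Sco"))  # edgytext2:(+Martin +Sco)
```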
On Nov 16, 2010, at 8:52 PM, Robert Gründler wrote:
> thanks for the explanation.
>
> the result
We intend to use schema.url for indexing documents. However, the remote urls
are secured and would need basic authentication to be able to access the
document.
The implementation with stream.file would mean downloading the files and
would cause duplication, whereas stream.body would have indexing perfo
I meant stream.url
Regards,
Jayendra
On Tue, Nov 16, 2010 at 5:37 PM, Jayendra Patil <
jayendra.patil@gmail.com> wrote:
> We intend to use schema.url for indexing documents. However, the remote
> urls are secured and would need basic authentication to be able access the
> document.
>
> The i
Can you explain a bit more? Because on the face
of it, "unique IDs but shared content" doesn't make
sense. The point of unique IDs is that they identify
documents uniquely.
A more usual setup is to have the content correspond
to the unique id. Is what you want really making your
"code" field multi
Hi (again)
I am looking at the spell checker options:
http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration
http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example
I am looking in my solrconfig.xml and I see one is already in use. I am kin
They're not mutually exclusive. Part of your index size is because you *store*
the full XML, which means that a verbatim copy of the raw data is placed in
the index, along with the searchable terms. Including the tags. This only makes
sense if you're going to return the original data to the user AN
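For illustration, the stored-vs-indexed distinction could look like this in a schema (field names here are hypothetical, not from the original post):

```xml
<!-- store the raw XML only if you need to return it to the user;
     index the extracted text only for searching -->
<field name="raw_xml" type="string" indexed="false" stored="true"/>
<field name="text"    type="text"   indexed="true"  stored="false"/>
```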
I had to deal with spellchecking today a bit. Make sure you are
performing the analysis step at index-time as such:
schema.xml:
.
multiValued="true"/>
From http://wiki.apache.org/solr/SpellCheckingAnalysis:
Use
> Hi (again)
>
>
>
> I am looking at the spell checker options:
>
>
>
> http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration
>
>
>
> http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example
>
>
>
> I am looking in my solrconfig.xml and I s
Thanks Dan! Few questions:
Use a copyField to divert your main text fields to the spell field and
then configure your spell checker to use the "spell" field to derive the
spelling index.
This will still keep my current copyfield for the same data, right?
I don't need to rebuild, just reindex.
" After thi
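The wiring described above might look like this sketch in schema.xml (the field and type names are assumptions):

```xml
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="text" dest="spell"/>
```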
hi:
the problem lay in the web server that interacts with the Solr server; after
some changes, it works now
thanks
2010/11/16 Peter Karich
> Am 16.11.2010 07:25, schrieb xu cheng:
>
> hi all:
>> I configure an app with solr to index documents
>> and there are some Chinese content in
See interjected responses below
On 11/16/2010 06:14 PM, Eric Martin wrote:
Thanks Dan! Few questions:
Use a copyField to divert your main text fields to the spell field and
then configure your spell checker to use the "spell" field to derive the
spelling index.
Right. A copyField just copies data from
Ah, I thought I was going nuts. Thanks for clarifying about the Wiki.
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, November 16, 2010 5:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checker
> Hi (again)
>
>
>
> I am looking at t
Hi:
Ok, I made the changes and have the spell checker build on optimize set to
true. So I guess now, I just reindex. I have to run to class now so I can't
check it for another 30 minutes. Cheers!
-Original Message-
From: Dan Lynn [mailto:d...@danlynn.com]
Sent: Tuesday, November 16, 2010
Good hash functions almost never have 'collisions' (duplicates), as they are
called, as long as you stay under a certain percentage of the address space
for the number of entries.
Read up on Wikipedia, but I believe that no hash function is much good above
50% of the address space it generates. Many ar
Hi, all.
I am using Solr running multiple indexes, by Flattening Data Into a
Single Index [1].
A type field in the schema stands for the type of document; it has the
following options: book, movie, music.
When querying, some types may have more result rows than others; for
example, we need 3 result ro
On Tue, Nov 16, 2010 at 9:05 PM, Dennis Gearon wrote:
> Read up on WikiPedia, but I believe that no Hash Function is much good above
> 50%
> of the address space it generates.
50% is way too high - collisions will happen before that.
But given that something like MD5 has 128 bits, that's 3.4e38,
Peter Wang writes:
replying to myself:
I found a PPT [1] about Solr; it calls this "Field Collapsing".
Will it be added to Solr 1.5? Unfortunately, I am using Solr 1.4.
For Solr 1.4, are there other solutions for such a task?
[1] http://lucene-eurocon.org/slides/Solr-15-and-Beyond_Yonik-Seely.pd
I have one question related to single-word tokens with a dismax query. In order
for documents to be found, I need to add quotes around the search query all the
time. This is quite hard for me to do since it is part of full-text search.
Here is my solr query and field type definition (Solr 1.4):
Hi,
I am facing an issue with copyFields in Solr. Here is what I am doing.
Schema:
I insert a document with, say, ID 100 and product "sampleproduct". When I
view the document in the Solr admin page I see the correct value for
the product_copy field (same as the product fi
hi all:
I configured a Solr application with a field of type text whose values look
like 123456, i.e. a string of numbers,
and I want Solr to sort results on this field.
However, when I use sort asc it works perfectly, but when I sort it with
desc, the application became unacceptabll
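This is only a guess from outside the thread, but sorting on a tokenized text field is unreliable in Lucene/Solr; the usual fix is to copy the value into an untokenized, sortable field. A sketch for a Solr 1.4-era schema (field names assumed):

```xml
<!-- copy the numeric string into a sortable integer field and sort on that -->
<field name="num_sort" type="sint" indexed="true" stored="false"/>
<copyField source="mytextfield" dest="num_sort"/>
```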
Nobody has ever reported seeing a collision 'in the wild' with MD5. It
is broken, but producing a collision takes a deliberate algorithm.
As to cosmic rays: it's a real problem. A recent Google paper reported
that some ram chips will have 1 bit error per gigabit per century, while
others have that much per hour. I've al