DIH full-import failure, no real error message

2010-11-16 Thread Erik Fäßler
Hey all, I'm trying to create a Solr index for the 2010 Medline-baseline (www.pubmed.gov, over 18 million XML documents). My goal is to be able to retrieve single XML documents by their ID. Each document comes with a unique ID, the PubMedID. So my schema (important portions) looks like this:

Re: encoding messy code

2010-11-16 Thread Peter Karich
On 16.11.2010 07:25, xu cheng wrote: hi all: I configured an app with solr to index documents, and there is some Chinese content in the documents. I've configured the apache tomcat URIEncoding to be utf-8 and I use the program curl to send the documents in xml format. however, when I query th

Re: Tuning Solr caches with high commit rates (NRT)

2010-11-16 Thread Peter Sturge
Many thanks, Peter K. for posting up on the wiki - great! Yes, fc = field cache. Field Collapsing is something very nice indeed, but is entirely different. As Erik mentions in the wiki post, using per-segment faceting can be a huge boon to performance. It does require the latest Solr trunk build

Re: Possibilities of (near) real time search with solr

2010-11-16 Thread Peter Sturge
Hi Peter, First off, many thanks for putting together the NRT Wiki page! This may have changed recently, but the NRT stuff - e.g. per-segment commits etc. is for the latest Solr 4 trunk only. If your setup uses the 3x Solr code branch, then there's a bit of work to do to move to the new version.

Re: hash uniqueKey generation?

2010-11-16 Thread Dennis Gearon
hashing is not 100% guaranteed to produce unique values. It's worth reading about and knowing about :-) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have t

Re: DIH full-import failure, no real error message

2010-11-16 Thread Dennis Gearon
Wow, if all you want is to retrieve by ID, a database would be fine, even a NoSQL database. Dennis Gearon

Re: Boosting on a document value

2010-11-16 Thread Jan Høydahl / Cominvent
Also this http://search-lucene.com/m/hBnHH1Q4NVb2 -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 15. nov. 2010, at 23.21, Ahmet Arslan wrote: >> I've got a document with a "type" >> field. If the type is 1, I want to boost the >> document's relevancy, but type=1 i

Re: DIH full-import failure, no real error message

2010-11-16 Thread Erik Fäßler
Retrieval by ID would only be one possible case; I'm still at the beginning of the project, I imagine adding more fields for more complicated queries in the future. I imagine a "where - like" query over all the XML documents stored in a DBMS wouldn't be too performant ;) And at a later stage

Re: stopwords file configuration

2010-11-16 Thread alendo
I'm replying to myself because I found the mistake. The Italian stopwords file that I found on the Apache site contains a shell-style comment on the same line as each stopword; the stopwords tokenizer is probably basic and doesn't accept comments on the same line as a stopword. I dropped them and now
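The cleanup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name strip_inline_comments is hypothetical, and "#" is assumed to be the comment marker because the message calls the comments shell style.

```python
def strip_inline_comments(lines, marker="#"):
    """Return bare stopwords, dropping trailing comments and empty lines.

    `marker` is an assumption: the message only says the comments
    are "shell style", which usually means a leading '#'.
    """
    cleaned = []
    for line in lines:
        # Keep only the text before the first comment marker.
        word = line.split(marker, 1)[0].strip()
        if word:
            cleaned.append(word)
    return cleaned


if __name__ == "__main__":
    sample = ["di   # of", "# a full-line comment", "", "a"]
    print(strip_inline_comments(sample))
```

Writing the returned words back out one per line gives a stopwords file with nothing but the words themselves.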

DateFormatTransformer issue with value 0000-00-00T00:00:00Z

2010-11-16 Thread Shanmugavel SRD
Hi, I have a field as below in my feed: 0000-00-00T00:00:00Z. I have configured the field as below in data-config.xml. But after indexing, the field value becomes 0002-11-30T00:00:00Z. I want the value to remain '0000-00-00T00:00:00Z' after indexing as well. Could anyon

Re: Term component sort is not working

2010-11-16 Thread Erick Erickson
You haven't defined what you want to see, so it's hard to help. What does "top" mean? The order you put it into the index? Lexical sort? Frequency count? Numerical ordering? Why do you want to do this? Perhaps if you explained your use case we'd be able to offer some alternatives. Best Erick On

Re: DIH full-import failure, no real error message

2010-11-16 Thread Erick Erickson
Several questions. Pardon me if they're obvious, but I've spent far too much of my life overlooking the obvious... 1> Is it possible you're running out of disk? 40-50G could suck up a lot of disk, especially when merging. You may need that much again free when a merge occurs. 2> speaking of mer

Re: Term component sort is not working

2010-11-16 Thread sivaprasad
I am capturing all the user-entered search terms in the database, along with the number of times each search term is entered. Let us say "laptop" has been entered 100 times, "laptop bag" has been entered 80 times, and "laptop battery" has been entered 90 times. I am using terms component for auto suggest feature.If the

Re: Term component sort is not working

2010-11-16 Thread Ahmet Arslan
> I am capturing all the user entered search terms in to the > database and the number of times the search term is > entered.Let us say > > "laptop" has entered 100 times. > "laptop bag" has entered 80 times. > "laptop battery" has entered 90 times. > > I am using terms component for auto suggest

Re: Retrieving indexed content containing multiple languages

2010-11-16 Thread Tod
On 11/11/2010 3:24 PM, Dennis Gearon wrote: I look forward to the answers to this one. Well, it seems it was as easy as adding the CJKTokenizerFactory: positionIncrementGap="100"> Once I did that and reindexed I could search for both English and Chinese using the default 'text' fi

result grouping / field collapsing changes

2010-11-16 Thread Yonik Seeley
We've recently added randomized testing for result grouping that resulted in finding + fixing a number of bugs. If you've been using this feature, you should move to the latest trunk version. I've also added a section at the bottom of the wiki page to list current limitations. http://wiki.apache

Core Swapping

2010-11-16 Thread Shaun Campbell
I've got a Solr multi core system and I'm trying to swap the cores after a re-index via SolrJ using a separate HTTP Solr web server.  My application seems to be generating a URL that's not valid for my Solr Tomcat installation but I can't see why or where it's getting its data from. Core swapping

stopwords file configuration

2010-11-16 Thread alendo
I'm using the Lucid Imagination installation kit for SOLR (the latest one with SOLR 1.4). I would like to use stopwords, and I installed the Italian version of the file in LucidWorks/lucidworks/solr/conf/stopwords.txt. Moreover the field where I want to clean stopwords is declared in schema.xml as

Re: Core Swapping

2010-11-16 Thread Markus Jelsma
CoreAdmin is handled by /solr/admin/cores/ and not by /solr/CORENAME/admin/cores/ On Tuesday 16 November 2010 16:17:34 Shaun Campbell wrote: > I've got a Solr multi core system and I'm trying to swap the cores > after a re-index via SolrJ using a separate HTTP Solr web server. My > application s
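To illustrate the point about where CoreAdmin lives, here is a minimal Python sketch that builds a SWAP request URL against /solr/admin/cores rather than a per-core path. The base URL and the core names "live" and "rebuild" are placeholders, not values from the thread; action, core and other are the CoreAdmin SWAP parameters.

```python
from urllib.parse import urlencode


def swap_url(solr_base, core, other):
    """Build a CoreAdmin SWAP URL under <solr_base>/admin/cores.

    `solr_base` is the Solr webapp root (for example .../solr),
    not the URL of an individual core.
    """
    params = urlencode({"action": "SWAP", "core": core, "other": other})
    return f"{solr_base.rstrip('/')}/admin/cores?{params}"


if __name__ == "__main__":
    # Host, port and core names are hypothetical.
    print(swap_url("http://localhost:8080/solr", "live", "rebuild"))
```

If a client is configured with a base URL that already ends in a core name, the generated request lands on /solr/CORENAME/admin/cores, which is the invalid form described above.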

Re: DIH full-import failure, no real error message

2010-11-16 Thread Erick Erickson
The key is that Solr handles merges by copying, and only after the copy is complete does it delete the old index. So you'll need at least 2x your final index size before you start, especially if you optimize... Here's a handy matrix of what you need in your index depending upon what you want to do

Unique ID with shared content field

2010-11-16 Thread Thyago
Hi, Is there any way to commit docs with unique key but with shared content ? Example: required="true" /> I have a lot of items with the same content but with different codes. Because this index is very large, is there any way to commit docs with unique id and code but with shared content to have a

EmbeddedSolrServer, Indexing and Core Swapping

2010-11-16 Thread Shaun Campbell
Hi I've switched my app to now use an EmbeddedSolrServer. I'm doing an index on my rebuild core and swapping cores at the end. Unfortunately, without restarting my web app I can't see the newly indexed data. I can see core swapping is working, and I can see the data after indexing without restar

Re: DIH full-import failure, no real error message

2010-11-16 Thread Erik Fäßler
Thank you very much, I will have a read on your links. The full-text-red-flag is exactly the thing why I'm testing this with Solr. As was said before by Dennis, I could also use a database as long as I don't need sophisticated query capabilities. To be honest, I don't know the performance gap

RE: DIH full-import failure, no real error message

2010-11-16 Thread Buttler, David
I am using the solr cloud branch on 6 machines. I first load PubMed into HBase, and then push the fields I care about to solr. Indexing from HBase to solr takes about 18 minutes. Loading to hbase takes a little longer (2 hours?), but it only happens once so I haven't spent much time trying to

Re: occasional exception

2010-11-16 Thread Robert Muir
> Nov 14, 2010 2:41:46 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.IllegalArgumentException: Increment must be zero or > greater: -2147483648 Hi John, this looks like a tokenizer/tokenstreams bug. what I think is happening is that clearAttributes() is not properly called, so f

Re: does solr support posting gzipped content?

2010-11-16 Thread danomano
sorry, yes by inject I simply mean post. I was hoping to get away without writing any 'native' solr code to upload gzip files, but it sounds like that is not possible. (The files that I'm uploading (aka posting) are CSV formatted.) I will poke around and see which solution ServletFilter/DataImportH

Re: EdgeNGram relevancy

2010-11-16 Thread Robert Gründler
thanks for the explanation. the results for the autocompletion are pretty good now, but we still have a small problem. When there are hits in the "edgytext2" fields, results which only have hits in the "edgytext" field should not be returned at all. Example: Query: "Martin Sco" Current Resu

Re: hash uniqueKey generation?

2010-11-16 Thread Dan Lynn
Thanks for the feedback, guys! On 11/15/2010 10:14 AM, Dan Lynn wrote: Hi, I just finished reading on the wiki about deduplication and the solr.UUIDField type. What I'd like to do is generate an ID for a document by hashing a subset of its fields. One route I thought would be to do this ahea
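The "do this ahead of time" route mentioned above can be sketched client-side in Python. This is an illustration, not Solr's own deduplication mechanism: the function name hashed_doc_id, the field names and the separator are all assumptions, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def hashed_doc_id(doc, fields, sep="\x1f"):
    """Derive a document ID from a chosen subset of fields.

    A separator that is unlikely to occur in the data keeps
    ("ab", "c") and ("a", "bc") from hashing to the same value.
    """
    parts = [str(doc.get(name, "")) for name in fields]
    return hashlib.sha256(sep.join(parts).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical document; only title and body feed the ID.
    doc = {"title": "Some title", "body": "Some text", "fetched": "2010-11-16"}
    print(hashed_doc_id(doc, ["title", "body"]))
```

Fields left out of the subset, such as the fetch date here, do not affect the ID, so two documents that differ only in those fields receive the same key.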

Re: hash uniqueKey generation?

2010-11-16 Thread Yonik Seeley
On Tue, Nov 16, 2010 at 5:31 AM, Dennis Gearon wrote: > hashing is not 100% guaranteed to produce unique values. But if you go to enough bits with a good hash function, you can get the odds lower than the odds of something else changing the value like cosmic rays flipping a bit on you. -Yonik ht

Re: Possibilities of (near) real time search with solr

2010-11-16 Thread Peter Karich
Hi Peter, thanks for your response. I will dig into the sharding stuff asap :-) This may have changed recently, but the NRT stuff - e.g. per-segment commits etc. is for the latest Solr 4 trunk only. Do I need to turn something 'on'? Or do you know whether the NRT patches are documented some

Re: EdgeNGram relevancy

2010-11-16 Thread Robert Gründler
it seems adding the '+' (required) operator to each term in a multi-term query does the trick: http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+ ie: edgytext2:(+Martin +Sco) -robert On Nov 16, 2010, at 8:52 PM, Robert Gründler wrote: > thanks for the explanation. > > the result
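The query shape shown above can be generated from user input with a small helper. This is a sketch only: require_all_terms is a hypothetical name, terms are split on whitespace, and no escaping of query-syntax special characters is attempted.

```python
def require_all_terms(field, user_input):
    """Prefix every whitespace-separated term with '+' (required).

    No escaping is done here; real input would need Lucene query
    special characters handled before being sent to Solr.
    """
    terms = user_input.split()
    return f"{field}:({' '.join('+' + term for term in terms)})"


if __name__ == "__main__":
    print(require_all_terms("edgytext2", "Martin Sco"))
```

With every term marked as required, a document that matches only one of the terms no longer qualifies for that field clause.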

basic authentication for schema.url

2010-11-16 Thread Jayendra Patil
We intend to use schema.url for indexing documents. However, the remote urls are secured and would need basic authentication to be able to access the document. The implementation with stream.file would mean downloading the files and would cause duplication, whereas stream.body would have indexing perfo

Re: basic authentication for schema.url

2010-11-16 Thread Jayendra Patil
I meant stream.url Regards, Jayendra On Tue, Nov 16, 2010 at 5:37 PM, Jayendra Patil < jayendra.patil@gmail.com> wrote: > We intend to use schema.url for indexing documents. However, the remote > urls are secured and would need basic authentication to be able access the > document. > > The i

Re: Unique ID with shared content field

2010-11-16 Thread Erick Erickson
Can you explain a bit more? Because on the face of it, "unique IDs but shared content" doesn't make sense. The point of unique IDs is that they identify documents uniquely. A more usual setup is to have the content correspond to the unique id. Is what you want really making your "code" field multi

Spell Checker

2010-11-16 Thread Eric Martin
Hi (again) I am looking at the spell checker options: http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example I am looking in my solrconfig.xml and I see one is already in use. I am kin

Re: DIH full-import failure, no real error message

2010-11-16 Thread Erick Erickson
They're not mutually exclusive. Part of your index size is because you *store* the full xml, which means that a verbatim copy of the raw data is placed in the index, along with the searchable terms. Including the tags. This only makes sense if you're going to return the original data to the user AN

Re: Spell Checker

2010-11-16 Thread Dan Lynn
I had to deal with spellchecking today a bit. Make sure you are performing the analysis step at index-time as such: schema.xml: . multiValued="true"/> From http://wiki.apache.org/solr/SpellCheckingAnalysis: Use

Re: Spell Checker

2010-11-16 Thread Markus Jelsma
> Hi (again) > > > > I am looking at the spell checker options: > > > > http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration > > > > http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example > > > > I am looking in my solrconfig.xml and I s

RE: Spell Checker

2010-11-16 Thread Eric Martin
Thanks Dan! Few questions: Use a <copyField/> to divert your main text fields to the spell field and then configure your spell checker to use the "spell" field to derive the spelling index. This will still keep my current copyfield for the same data, right? I don't need to rebuild, just reindex. " After thi

Re: encoding messy code

2010-11-16 Thread xu cheng
hi: the problem lies in the web server that interacts with the solr server. and after some transformation, it works now thanks 2010/11/16 Peter Karich > Am 16.11.2010 07:25, schrieb xu cheng: > > hi all: >> I configure an app with solr to index documents >> and there are some Chinese content in

Re: Spell Checker

2010-11-16 Thread Dan Lynn
See interjected responses below On 11/16/2010 06:14 PM, Eric Martin wrote: Thanks Dan! Few questions: Use a <copyField/> to divert your main text fields to the spell field and then configure your spell checker to use the "spell" field to derive the spelling index. Right. A copyField just copies data from

RE: Spell Checker

2010-11-16 Thread Eric Martin
Ah, I thought I was going nuts. Thanks for clarifying about the Wiki. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, November 16, 2010 5:11 PM To: solr-user@lucene.apache.org Subject: Re: Spell Checker > Hi (again) > > > > I am looking at t

RE: Spell Checker

2010-11-16 Thread Eric Martin
Hi: Ok, I made the changes and have the spell checker build on optimize set to true. So I guess now, I just reindex. I have to run to class now so I can't check it for another 30 minutes. Cheers! -Original Message- From: Dan Lynn [mailto:d...@danlynn.com] Sent: Tuesday, November 16, 2010

Re: hash uniqueKey generation?

2010-11-16 Thread Dennis Gearon
Good hash functions almost never have duplicates ('collisions', as they are called), as long as you stay under a certain percentage of the bits for the number of entries. Read up on WikiPedia, but I believe that no Hash Function is much good above 50% of the address space it generates. Many ar

How to limit result rows by field types?

2010-11-16 Thread Peter Wang
Hi, all. I am using solr running multiple indexes, by Flattening Data Into a Single Index [1]. A type field in the schema stands for the type of document; say it has the following options: book, movie, music. When querying, some types may have more result rows than others; for example, we need 3 result ro

Re: hash uniqueKey generation?

2010-11-16 Thread Yonik Seeley
On Tue, Nov 16, 2010 at 9:05 PM, Dennis Gearon wrote: > Read up on WikiPedia, but I believe that no Hash Function is much good above > 50% > of the address space it generates. 50% is way too high - collisions will happen before that. But given that something like MD5 has 128 bits, that's 3.4e38,
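The argument above, that collisions appear long before 50% of the address space is used but are still negligible at 128 bits, can be checked with the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)). The sketch below assumes an ideal, uniformly distributed hash; the 18 million figure is borrowed from the Medline document count mentioned elsewhere in this listing, purely as an example size.

```python
import math


def collision_probability(n_items, space_size):
    """Birthday-bound approximation of at least one collision.

    Assumes hash values are uniformly distributed over `space_size`
    possible outputs. expm1 keeps precision when the result is tiny.
    """
    exponent = n_items * (n_items - 1) / (2 * space_size)
    return -math.expm1(-exponent)


if __name__ == "__main__":
    space_128 = 2 ** 128  # roughly 3.4e38 possible values
    print(f"{float(space_128):.3e}")
    # Example size only: about 18 million documents.
    print(collision_probability(18_000_000, space_128))
    # The classic birthday case for comparison: 23 people, 365 days.
    print(collision_probability(23, 365))
```

The same formula shows why the 50% figure is far too generous: the collision probability reaches one half when the number of items is on the order of the square root of the space size, not half of it.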

Re: How to limit result rows by field types?

2010-11-16 Thread Peter Wang
Peter Wang writes: Replying to myself: I found a PPT [1] about solr that calls this "Field Collapsing". Will it be added in solr 1.5? Unfortunately, I am using Solr 1.4. For solr 1.4, are there other solutions for such a task? [1] http://lucene-eurocon.org/slides/Solr-15-and-Beyond_Yonik-Seely.pd

Must require quote with single word token query?

2010-11-16 Thread Chamnap Chhorn
I have a question about single-word tokens with a dismax query. For a match to be found, I need to put quotes around the search query every time. This is quite hard for me to do since it is part of full text search. Here is my solr query and field type definition (Solr 1.4):

Issue with copyField when updating document

2010-11-16 Thread Pramod Goyal
Hi, I am facing an issue with copyFields in Solr. Here is what I am doing. Schema: I insert a document with, say, ID as 100 and product as sampleproduct. When I view the document in the solr admin page I see the correct value for the product_copy field ( same as the product fi

sort desc and out of memory exception

2010-11-16 Thread xu cheng
hi all: I configured a solr application and there is a field of type text whose values look like 123456, that is, a string of digits, and I want solr to sort the results on this field. however, when I sort asc, it works perfectly, and when I sort desc, the application became unacceptabl

Re: hash uniqueKey generation?

2010-11-16 Thread Lance Norskog
Nobody has ever reported seeing a collision 'in the wild' with MD5. It is broken, but that takes an algorithm. As to cosmic rays: it's a real problem. A recent Google paper reported that some RAM chips will have 1 bit error per gigabit per century, while others have that much per hour. I've al