Re: Creating custom Filter / Tokenizer / Request Handler for integration of NER-Framework

2012-05-25 Thread Chris Hostetter

: Another problem (just discovered this): TokenizerFactories do not get
: resource handlers. So, you can't go read config or model files for
: your Tokenizer. TokenFilters do, so you can use the KeywordTokenizer

TokenizerFactory subclasses can implement ResourceLoaderAware and load any 
resources they want.
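
A minimal sketch of that pattern against the 3.x API (ModelTokenizerFactory,
ModelTokenizer, and ner-model.txt are hypothetical placeholders for your own
code and config files):

    import java.io.IOException;
    import java.io.Reader;
    import java.util.List;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.solr.analysis.BaseTokenizerFactory;
    import org.apache.solr.common.ResourceLoader;
    import org.apache.solr.util.plugin.ResourceLoaderAware;

    public class ModelTokenizerFactory extends BaseTokenizerFactory
        implements ResourceLoaderAware {
      private List<String> modelLines;

      public void inform(ResourceLoader loader) {
        try {
          // e.g. a model/config file placed in the core's conf/ directory
          modelLines = loader.getLines("ner-model.txt");
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public Tokenizer create(Reader input) {
        return new ModelTokenizer(input, modelLines);
      }
    }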


-Hoss


ExtendedDisMax Field Alias Question

2012-05-25 Thread Jamie Johnson
I was wondering if someone could explain whether the following is supported
with the current eDisMax field aliasing.

I have a field like person_name which exists in Solr, and we also have 2
other fields named person_first_name and person_last_name. I would
like to allow queries for person_name to be aliased to person_name,
person_first_name and person_last_name. Is this allowed, or does the
alias need to not appear in the list of fields it is aliased to? (I
remember seeing something about whether aliases to other aliases are
allowed.) I could obviously create a purely virtual field which aliases
all 3, but it would be nice if the parser could support this case.
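
For reference, edismax field aliasing is driven by per-field qf parameters,
so the setup Jamie describes would look something like the sketch below;
whether the alias may expand to a real field bearing its own name is exactly
the open question here:

    defType=edismax
    &f.person_name.qf=person_name person_first_name person_last_name
    &q=person_name:smith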


Re: Creating custom Filter / Tokenizer / Request Handler for integration of NER-Framework

2012-05-25 Thread Lance Norskog
Another problem (just discovered this): TokenizerFactories do not get
resource handlers. So, you can't go read config or model files for
your Tokenizer. TokenFilters do, so you can use the KeywordTokenizer
(make one big term) and do your work in a TokenFilter that gets the
whole thing.
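
A rough schema.xml sketch of that arrangement (the NER filter factory is a
hypothetical placeholder for your own TokenFilterFactory):

    <fieldType name="text_ner" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="com.example.NerAnnotationFilterFactory"/>
      </analyzer>
    </fieldType>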

On Thu, May 24, 2012 at 7:33 AM, Jan Høydahl  wrote:
> As Ahmet says, The Update Chain is probably the place to integrate such 
> document oriented processing.
> See http://www.cominvent.com/2011/04/04/solr-architecture-diagram/ for how it 
> integrates with Solr.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.facebook.com/Cominvent
> Solr Training - www.solrtraining.com
>
> On 24. mai 2012, at 14:04, Wunderlich, Tobias wrote:
>
>> Hey Guys,
>>
>> I have recently been working on a project to integrate a
>> Named-Entity-Recognition (NER) framework into an existing search platform
>> based on Solr. The platform uses ManifoldCF to automatically gather the
>> content from various repositories. The NER framework creates
>> annotations/metadata from given content, which I then want to integrate
>> into the search platform as metadata to use for faceting. Since MCF handles
>> all content gathering, I need a way to integrate the NER framework directly
>> into Solr. The goal is to get all annotations per document into a
>> multivalued field. My first thought was to create a custom filter which
>> just takes the content and gives back only the annotations. But as I
>> understand it, a filter only processes predetermined tokens, which is
>> useless for my purpose, since the NER framework needs to process the whole
>> content of a document. What about a custom Tokenizer? Would it be possible
>> to process the whole text and give back only the annotations as tokens? A
>> third thought was to manipulate the ExtractingRequestHandler (Solr Cell)
>> used by MCF to somehow add the annotations as metadata when the content
>> and metadata are distributed to the different fields.
>>
>> I hope my problem description is sufficient. Does anybody have any thoughts 
>> on that subject?
>>
>> Best regards,
>> Tobias
>



-- 
Lance Norskog
goks...@gmail.com
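
For the update-chain route Jan suggests above, a minimal solrconfig.xml
sketch (the NER processor factory is a hypothetical placeholder for your
own UpdateRequestProcessorFactory), selected via the update.chain request
parameter:

    <updateRequestProcessorChain name="ner">
      <processor class="com.example.NerUpdateProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>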


Solr boost relevancy

2012-05-25 Thread Gau
Consider a db of just names. Now if I use synonym expansion at query time, I
get a set of results.
(Background: I created a class which resets idf, tf, etc. all to 1, since
they don't matter to me anymore. What really matters is how closely the
query matches the given name.)

Currently I am getting all results with the same score (makes sense, since I
reset all the factors to 1), but how do I now rank depending on the
closeness of the match?

P.S.: the query is being expanded at query time to match all the documents
from the synonyms. I want to make sure that if I enter "Raj", I get Raj as
the topmost result and synonyms like "Raju" after that.
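
A sketch of the kind of flat-scoring class described above (Lucene/Solr 3.x
API; the class name is a placeholder), registered in schema.xml with
<similarity class="com.example.FlatSimilarity"/>:

    import org.apache.lucene.search.DefaultSimilarity;

    public class FlatSimilarity extends DefaultSimilarity {
      @Override
      public float tf(float freq) { return 1.0f; }   // ignore term frequency
      @Override
      public float idf(int docFreq, int numDocs) { return 1.0f; } // ignore rarity
    }

With tf/idf flattened like this, one way to get "closeness" back is to boost
the original, unexpanded form of the query term over its synonym expansions.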



Queries to solr being blocked

2012-05-25 Thread KPK
Hello

I just wanted to ask: are queries to the Solr index blocked during a delta
import?
I read on the wiki page that queries to Solr are not blocked during full
imports, but the page doesn't mention anything about delta imports. What
happens then?

I am currently facing a problem: my query takes a very long time to respond.
Currently I am scheduling a delta import every 1 min, as my DB size keeps
increasing every minute, and I suspect this is causing a performance issue.
I suspect the query is being made to the Solr index while the CRON job is
running in the background for the delta import. I am using
DataImportHandlerDeltaQuery via FullImport for this purpose.
Is this causing the delay in responding to the query, or is it something else?

Any help would be appreciated.

Thanks,
Kushal
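
For reference, the delta-via-full-import approach mentioned above is
usually wired up like this in data-config.xml (table and column names are
placeholders), and invoked with command=full-import&clean=false:

    <entity name="item" pk="ID"
        query="SELECT * FROM item
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR last_modified > '${dataimporter.last_index_time}'">
      ...
    </entity>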



Strange Error - org.apache.solr.response.XMLWriter.writePrim(XMLWriter.java:778)

2012-05-25 Thread Rohit
Hi,

 

I deleted some data from Solr; after the deletion I get truncated XML when
I run a q=*:* query, while in all other cases the queries execute fine. The
following error is shown in the log files:

May 25, 2012 7:10:36 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
        at org.apache.solr.response.XMLWriter.writePrim(XMLWriter.java:778)
        at org.apache.solr.response.XMLWriter.writeStr(XMLWriter.java:687)
        at org.apache.solr.schema.StrField.write(StrField.java:45)
        at org.apache.solr.schema.SchemaField.write(SchemaField.java:131)
        at org.apache.solr.response.XMLWriter.writeDoc(XMLWriter.java:370)
        at org.apache.solr.response.XMLWriter$3.writeDocs(XMLWriter.java:546)
        at org.apache.solr.response.XMLWriter.writeDocuments(XMLWriter.java:483)
        at org.apache.solr.response.XMLWriter.writeDocList(XMLWriter.java:520)
        at org.apache.solr.response.XMLWriter.writeVal(XMLWriter.java:583)
        at org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:132)

I gather that this is related to
https://issues.apache.org/jira/browse/SOLR-1903. Is there any other fix for
this problem? I am currently using Solr 3.6.

Regards,

Rohit


Re: What is the "docs" number in Solr explain query results for fieldnorm?

2012-05-25 Thread Yonik Seeley
On Fri, May 25, 2012 at 2:13 PM, Tom Burton-West  wrote:
> The explain (debugQuery) shows the following for fieldnorm:
>  0.625 = fieldNorm(field=ocr, doc=16624)
> What does the "doc=16624" mean?

It's the internal document id (i.e. it's debugging info and doesn't
affect scoring).

-Yonik
http://lucidimagination.com


Re: What is the "docs" number in Solr explain query results for fieldnorm?

2012-05-25 Thread Andrzej Bialecki

On 25/05/2012 20:13, Tom Burton-West wrote:

Hello all,

I am trying to understand the output of Solr explain for a one-word query.
I am querying on the "ocr" field with no stemming, synonyms, or stopwords,
and no query- or index-time boosting.

The query is "ocr:the"

The document (result below) which contains the two words "The Aeroplane"
scores higher than documents with 50 or more occurrences of the word "the".
Since the idf is the same, I am assuming this is a result of length norms.

The explain (debugQuery) shows the following for fieldNorm:
  0.625 = fieldNorm(field=ocr, doc=16624)
What does the "doc=16624" mean? It certainly cannot represent the length of
the field (as an integer), since there are only two terms in the field.
It can't represent the number of docs with the query term either (the idf
output shows the word "the" occurs in 16,219 docs).


Hi Tom,

This is an internal document number within a Lucene index. This number
is of no use at the level of the Solr APIs, because you can't use it to
actually do anything there. At the Lucene level (e.g. in Luke) you could
navigate to this number and, for example, retrieve the stored fields of
this document.

As shown in the Explanations, it can only be used to correlate the parts
of the query that matched the same document number.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-25 Thread Charles Riley
 "the encoding of the character used for alif (02BE) carries with it an
assigned property in the Unicode database of (Lm), putting it into the
category of 'Modifier_Letter'..."

Correction to what I put there:  02BC, rather.  The rest of that still
holds up; the data I'm looking at regarding properties can be found here:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
http://www.unicode.org/reports/tr44/#Property_Values
ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
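
Incidentally, the property is easy to confirm from Java, whose character
database tracks the UCD:

    // U+02BC MODIFIER LETTER APOSTROPHE has General_Category=Lm
    System.out.println(
        Character.getType(0x02BC) == Character.MODIFIER_LETTER); // prints true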

Charles


What is the "docs" number in Solr explain query results for fieldnorm?

2012-05-25 Thread Tom Burton-West
Hello all,

I am trying to understand the output of Solr explain for a one-word query.
I am querying on the "ocr" field with no stemming, synonyms, or stopwords,
and no query- or index-time boosting.

The query is "ocr:the"

The document (result below) which contains the two words "The Aeroplane"
scores higher than documents with 50 or more occurrences of the word "the".
Since the idf is the same, I am assuming this is a result of length norms.

The explain (debugQuery) shows the following for fieldNorm:
 0.625 = fieldNorm(field=ocr, doc=16624)
What does the "doc=16624" mean? It certainly cannot represent the length of
the field (as an integer), since there are only two terms in the field.
It can't represent the number of docs with the query term either (the idf
output shows the word "the" occurs in 16,219 docs).

I have appended below the explain scoring for a couple of documents with tf
50 and 67.


0.6798219
DF9199B7049F8DFE-220
DF9199B7049F8DFE
The Aeroplane


0.6798219 = (MATCH) fieldWeight(ocr:the in 16624), product of:
  1.0 = tf(termFreq(ocr:the)=1)
  1.087715 = idf(docFreq=16219, maxDocs=17707)
  0.625 = fieldNorm(field=ocr, doc=16624)


Tom Burton-West

-


0.42061833 = (MATCH) fieldWeight(ocr:the in 8396), product of:
  7.071068 = tf(termFreq(ocr:the)=50)
  1.087715 = idf(docFreq=16219, maxDocs=17707)
  0.0546875 = fieldNorm(field=ocr, doc=8396)




 

0.41734362 = (MATCH) fieldWeight(ocr:the in 2782), product of:
  8.185352 = tf(termFreq(ocr:the)=67)
  1.087715 = idf(docFreq=16219, maxDocs=17707)
  0.046875 = fieldNorm(field=ocr, doc=2782)
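
For reference, those explain lines multiply out as expected, and the norm
itself is just the default lengthNorm = 1/sqrt(numTerms), stored
byte-encoded (hence the precision loss), assuming the stock similarity with
no boosts:

    tf * idf * fieldNorm = 1.0 * 1.087715 * 0.625 = 0.6798219
    lengthNorm(2 terms)  = 1/sqrt(2) = 0.7071..., byte-encoded to 0.625
    tf = sqrt(termFreq): sqrt(50) = 7.071068, sqrt(67) = 8.185352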



Re: Accent Characters

2012-05-25 Thread Jack Krupansky
I tried your scenario with the Solr 3.6 example and it seemed to work fine 
and suggested an accented term for me.


Some possibilities:

1) Your term had an editing distance that was too high relative to any
accented correction. Check your term and count how many characters must be
changed to match an accented term. Case changes count as well. In the case
of a 4-character word, the maximum editing distance allowed (by default) is
2. Maybe you simply need to override the default for "accuracy"; e.g.,
&spellcheck.accuracy=0.35, compared to the default of 0.5.
2) Did you get some other suggestion when you expected the accented term?
If so, increase the spellcheck.count request parameter from 1 to 10 to see
the other suggestions.
3) You have some other schema/solrconfig changes that you haven't told us
about.


Try to reproduce your issue against a fresh copy of the Solr 3.6 example,
and then see how your actual configuration (that fails) differs from the
example.


Here's my test query and the spellcheck result:

http://localhost:8983/solr/spell?q=x%20Cafe%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true&spellcheck.count=10

  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="cafe">
        <int name="numFound">2</int>
        <int name="startOffset">2</int>
        <int name="endOffset">6</int>
        <arr name="suggestion">
          <str>café</str>
          <str>cofe</str>
        </arr>
      </lst>
      <str name="collation">x café y</str>
    </lst>
  </lst>


And here was my test doc:

curl http://localhost:8983/solr/update?commit=true -H "Content-Type:
text/xml" --data-binary '<add><doc><field name="id">doc-c1</field><field
name="content">Internet café - Café au lait - Viennese coffee house - Maid
café cofe</field></doc></add>'


Here is a test query that returns zero suggestions, because the editing
distance is greater than two (capital "C", unaccented character, and an
extra character at the end):


http://localhost:8983/solr/spell?q=x%20Cafex%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

But, by overriding the default "accuracy" of 0.5 and dropping it to 0.35, I 
can get the expected suggestion:


http://localhost:8983/solr/spell?q=x%20Cafex%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true&spellcheck.accuracy=0.35

-- Jack Krupansky

-Original Message- 
From: couto.vicente

Sent: Thursday, May 24, 2012 10:28 AM
To: solr-user@lucene.apache.org
Subject: Accent Characters

Hello All.
I'm a newbie with Solr, and I have seen this subject discussed a lot, but no
answer was satisfactory, or (probably) I don't know how to properly set up
the Solr environment.
I indexed documents in Solr with a French content field. I used the field
type "text_fr" that comes with the Solr schema.xml file.

My spellchecker is almost the same as the one that comes with solrconfig.xml:

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">content</str>
      <str name="spellcheckIndexDir">spellchecker</str>
    </lst>

When I try any search query, whether with accented words or not, I get the
results pretty fine.
But if I try the spell checking or even a facet query, it looks like Solr is
ignoring the words with accents.
I Googled a lot and could not find any satisfactory fix.

Can anyone help me out?

Thank you!





RE: Wildcard-Search Solr 3.5.0

2012-05-25 Thread spring
> I don't know the specific rules in these specific stemmers, 
> but generally a 
> "less aggressive" stemming (e.g., "plural-only") of 
> "paintings" would be 
> "painting", while a "more aggressive" stemming would be 
> "paint". For some 
> "aggressive" stemmers the stemmed word is not even a word.

Sounds logical :)

> It would be nice to have doc with some example words for each stemmer.

Absolutely!

Thx a lot!



Re: Why is Solr still shipped with Jetty 6 / switching to Jetty 8?

2012-05-25 Thread William Bell
Let's just wait until Solr 4.0 is out in a couple of months.

On Fri, May 25, 2012 at 9:06 AM, Maciej Lisiewski  wrote:
>
>> There is some discussion here:
>> https://issues.apache.org/jira/browse/SOLR-3159
>>
>
> I've seen it - it's one of the Jira tickets I was referring to: Jetty 8 is
> default for trunk now, but I have failed to find any info about using Jetty
> 8 with Solr 3.6.
>
> --
> Maciej Lisiewski



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Solr Performance

2012-05-25 Thread Jack Krupansky

Hmmm... what's going on here with email names and addresses???

My email client says "From: chris.a.mattm...@jpl.nasa.gov" for the name, but 
shows an email address of "csnsha...@gmail.com". Is this message from Chris 
A. Mattmann or not?!?


And in the actual email header I see this:
From: =?utf-8?b?Y2hyaXMuYS5tYXR0bWFubkBqcGwubmFzYS5nb3Y=?=

Very strange.

-- Jack Krupansky

-Original Message- 
From: chris.a.mattm...@jpl.nasa.gov

Sent: Friday, May 25, 2012 7:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Performance

Jack Krupansky  basetechnology.com> writes:



I vaguely recall some thread blocking issue with trying to parse too many
PDF files at one time in the same JVM.

Occasionally Tika (actually PDFBox) has been known to hang for some PDF
docs.

Do you have enough memory in the JVM? When the CPU is busy, is there much
memory available in the JVM? Maybe garbage collection is taking too much
of the CPU.




Hi Jack,

Thanks for your quick response. Yes, I hope I have enough JVM memory. Here
are the mem settings:

-Xms11g -Xmx11g -XX:MaxPermSize=2g

Is this a common issue seen for PDF extraction and indexing? Why am I not
able to do more than 1k documents per hour?

Thanks,
Surendra. 



Re: Wildcard-Search Solr 3.5.0

2012-05-25 Thread Jack Krupansky
I don't know the specific rules in these specific stemmers, but generally a 
"less aggressive" stemming (e.g., "plural-only") of "paintings" would be 
"painting", while a "more aggressive" stemming would be "paint". For some 
"aggressive" stemmers the stemmed word is not even a word.


It would be nice to have doc with some example words for each stemmer.

-- Jack Krupansky

-Original Message- 
From: spr...@gmx.eu

Sent: Friday, May 25, 2012 5:59 AM
To: solr-user@lucene.apache.org
Subject: RE: Wildcard-Search Solr 3.5.0

Oh, thx for the update! I didn't notice that Solr 3.6 has a text_de field
type. These two options... less / more aggressive. Aggressive in terms of
what?

Thank you!


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Freitag, 25. Mai 2012 03:25
To: solr-user@lucene.apache.org
Subject: Re: Wildcard-Search Solr 3.5.0

I tried it and it does appear to be the SnowballPorterFilterFactory that
normally does the accent folding but can't here because it is not
multi-term aware. I did notice that the text_de field type that comes in
the Solr 3.6 example schema handles your case fine. It uses the
GermanNormalizationFilterFactory to fold accented characters and is
multi-term aware. Any particular reason you're not using the stock text_de
field type? It also has three stemming options which might be sufficient
for your needs.

In any case, try to make your text_de field type closer to the stock
version, and try to use GermanNormalizationFilterFactory, and that may be
good enough for your situation.
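
For concreteness, the relevant part of a text_de-style analyzer chain with
the multi-term-aware normalizer, roughly as in the 3.6 example schema (the
stemmer line is where the less/more aggressive choice is made):

    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.GermanLightStemFilterFactory"/>
    </analyzer>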




Re: Why is Solr still shipped with Jetty 6 / switching to Jetty 8?

2012-05-25 Thread Maciej Lisiewski



There is some discussion here:
https://issues.apache.org/jira/browse/SOLR-3159



I've seen it - it's one of the Jira tickets I was referring to: Jetty 8 
is default for trunk now, but I have failed to find any info about using 
Jetty 8 with Solr 3.6.


--
Maciej Lisiewski


Re: Why is Solr still shipped with Jetty 6 / switching to Jetty 8?

2012-05-25 Thread Jack Krupansky

There is some discussion here:
https://issues.apache.org/jira/browse/SOLR-3159

-- Jack Krupansky

-Original Message- 
From: Maciej Lisiewski 
Sent: Friday, May 25, 2012 10:43 AM 
To: solr-user@lucene.apache.org 
Subject: Why is Solr still shipped with Jetty 6 / switching to Jetty 8? 

I have just noticed that Solr 3.6 still includes Jetty 6, which is no
longer maintained.
Not merely no longer developed: it has actually reached End of Life as of
26th January 2012 (
http://dev.eclipse.org/mhonarc/lists/jetty-announce/msg00026.html ) and
that means no bugfixes or security patches - for almost 4 months now.

Both Jetty 7.x and 8.x are currently considered stable, and there are
multiple tickets in Jira concerning an upgrade to either of the two, but
the work seems to be complete-ish only for 4.0, not for the latest stable
release, which came out months after the EoL announcement.

Does anyone have any experience with Solr 3.6 and Jetty 8? Will it work
out of the box, or should I expect all hell to break loose?



--
Maciej Lisiewski


Why is Solr still shipped with Jetty 6 / switching to Jetty 8?

2012-05-25 Thread Maciej Lisiewski
I have just noticed that Solr 3.6 still includes Jetty 6, which is no
longer maintained.
Not merely no longer developed: it has actually reached End of Life as of
26th January 2012 (
http://dev.eclipse.org/mhonarc/lists/jetty-announce/msg00026.html ) and
that means no bugfixes or security patches - for almost 4 months now.

Both Jetty 7.x and 8.x are currently considered stable, and there are
multiple tickets in Jira concerning an upgrade to either of the two, but
the work seems to be complete-ish only for 4.0, not for the latest stable
release, which came out months after the EoL announcement.

Does anyone have any experience with Solr 3.6 and Jetty 8? Will it work
out of the box, or should I expect all hell to break loose?



--
Maciej Lisiewski


Generating maven artifacts for 3.6.0 build - correct -Dversion to use?

2012-05-25 Thread Aaron Daubman
Greetings,

Following the directions here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/maven/README.maven

for building Lucene/Solr with Maven, what is the correct -Dversion to pass
in to get-maven-poms?

This seems set up for building -SNAPSHOT versions; however, I would like to
use Maven to build the 3.6.0 tag.

If I set the version to 3.6.0, however, this causes issues with Lucene,
which seems to really only want version 3.6 (no trailing .0) and even
causes the version-check test to fail.

What is the correct version to pass in to get-maven-poms for a 3.6.0
release build via Maven?

Thanks,
  Aaron


Re: Solr 4.0 Distributed Concurrency Control Mechanism?

2012-05-25 Thread Nicholas Ball

Hey all,

I have another question with regards to this thread.

Does anyone know the state of the rollback command in 4.0 and how it works
with both replicas (i.e. distributed rollbacks) and the snapshot isolation
implemented (i.e. are timestamps reverted?)? The relevant class seems to be
DistributedUpdateProcessor, but I'm not sure if I'm missing something. Has
this been implemented?

Cheers,
Nicholas

On Thu, 24 May 2012 09:53:23 -0600, Nicholas Ball
 wrote:
> Thanks for the link, will investigate further. At first glance, though, it
> looks as though it's not what we want to be going towards.
> Also note that it's not open-sourced (other than Solandra, which hasn't
> been updated in ages: https://github.com/tjake/Solandra).
>
> Rather than build on top of Cassandra, the new NRT + transaction log Solr
> features really make it more of a possibility to make Solr into a
> NoSQL-like system, and possibly with better transactional guarantees than
> NoSQL!
>
> Speaking to Yonik has given me more information on this. Currently, there
> is an optimistic lock-free mechanism on a per-document basis only, as for
> most, documents only live on a single logical shard. It essentially checks
> the _version_ you send in for a document against the latest version it has
> for the document.
>
> I propose an additional feature to this for those who want to have such
> guarantees spanning over multiple documents living on various shards. In my
> use-case, I have shards holding documents that point to other shards. In
> this case, an update would need to be an atomic transaction spanning over
> various documents on various shards. Would anyone object to having this
> functionality added to Solr if I were to contribute it?
>
> Many thanks,
> Nicholas
>
> On Thu, 24 May 2012 08:16:25 -0700, Walter Underwood
>  wrote:
>> You should take a look at what DataStax has already done with Solr and
>> Cassandra.
>>
>> http://www.datastax.com/dev/blog/cassandra-with-solr-integration-details
>>
>> wunder
>>
>> On May 24, 2012, at 7:50 AM, Nicholas Ball wrote:
>>
>>> Hey all,
>>>
>>> I've been working on a Solr set-up with some heavy customization (using
>>> the adminHandler as a way into the system) for a research project @
>>> Imperial College London; however, I now see there has been a substantial
>>> push towards NoSQL. For this, there needs to be some kind of optimistic
>>> fine-grained concurrency control on updates. As we have document
>>> versioning built into Lucene (and therefore Solr) this shouldn't be too
>>> difficult; however, the push has been more of a focus on single-core
>>> optimistic LOCKING.
>>>
>>> I would like to take this toward a multi-core (and multi-node)
>>> distributed optimistic lock-free mechanism. This gives us the ability to
>>> provide stronger guarantees than NoSQL wrt distributed transaction
>>> isolation, and as we can now do soft-commits, we can also provide
>>> specific version rollbacks
>>> (http://java.dzone.com/articles/exploring-transactional-0). Some more
>>> interesting reading on this topic: (read-)snapshot isolation
>>> (http://pages.cs.wisc.edu/~cs764-1/critique.pdf) and even stronger
>>> guarantees with a slight performance hit with write-snapshot isolation
>>> (http://www.fever.ch/usbkey_eurosys12/papers/p155-yabandehA.pdf). People
>>> are starting to realize that we don't have to sacrifice guarantees for
>>> better performance and scalability (like NoSQL) but rather relax them
>>> very minimally.
>>>
>>> What I need is for someone to shed some light on this feature and on
>>> Solr's future plans wrt it. Am I correct in thinking that a
>>> multiversion concurrency control (MVCC) locking mechanism now exists for
>>> a single core, or is it lock-free and multi-core?
>>>
>>> Many thanks,
>>> Nicholas Ball (aka incunix)
>> 
>> --
>> Walter Underwood
>> wun...@wunderwood.org
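
As a concrete illustration of the per-document check described above
(hedged, since the 4.0 semantics were still settling at the time of this
thread): an update carrying a _version_ value is rejected with a version
conflict when it no longer matches the latest version Solr holds for that
document (id and version values are placeholders):

    curl 'http://localhost:8983/solr/update?commit=true' \
      -H 'Content-Type: text/xml' --data-binary \
      '<add><doc>
         <field name="id">doc1</field>
         <field name="_version_">1402341234567890</field>
       </doc></add>'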


indexing documents from a git repository

2012-05-25 Thread Welty, Richard
I have a need to incrementally index documents (probably MS
Office/OpenOffice/PDF files) from a Git repository using Tika. I'm
expecting to run periodic pulls against the repository to find new and
updated docs.

does anyone have any experience and/or thoughts/suggestions that they'd like to 
share?

thanks,
  richard
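
One low-tech sketch of such a pull-and-post loop, assuming a pull plus an
ORIG_HEAD comparison is enough to enumerate changes (paths, URLs, and the
id scheme are placeholders, and deletions would still need separate
handling, e.g. via git diff --diff-filter=D plus delete-by-id):

    #!/bin/sh
    cd /data/repo && git pull
    # files added or modified by the pull
    git diff --name-only ORIG_HEAD HEAD | while read f; do
      curl "http://localhost:8983/solr/update/extract?literal.id=$f" \
           -F "file=@$f"
    done
    curl 'http://localhost:8983/solr/update?commit=true'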


Re: how can I specify the number of replications for each shard?

2012-05-25 Thread Mark Miller
I think we are going to add some more knobs, but currently it's done like this.

Say you want 3 shards, each with 3 replicas.

Start each node with the sys prop -DnumShards=3, and start 9 nodes in total.
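
Concretely, with the example Jetty setup (ports and paths are illustrative;
the first node runs embedded ZooKeeper here):

    # node 1
    java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=conf1 \
         -DzkRun -DnumShards=3 -jar start.jar
    # nodes 2-9
    java -DzkHost=localhost:9983 -DnumShards=3 -jar start.jar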

On May 24, 2012, at 11:42 PM, Vince Wei (jianwei) wrote:

> I am using Solr 4.0.
> 
> I want the number of replicas for each shard to be 3.
> 
> How can I do this?
> 
> 
> 
> Sincerely
> 
> Vince Wei
> 
> 
> 
> From: Vince Wei (jianwei) 
> Sent: 25 May 2012, 11:40
> To: 'solr-user@lucene.apache.org'
> Subject: how can I specify the number of replications for each shard?
> 
> 
> 
> Hi All,
> 
> 
> 
> how can I specify the number of replications for each shard?
> 
> Thanks!
> 
> 
> 
> 
> 
> Sincerely
> 
> Vince Wei
> 

- Mark Miller
lucidimagination.com
Re: Solr Performance

2012-05-25 Thread chris . a . mattmann
Jack Krupansky  basetechnology.com> writes:

> 
> I vaguely recall some thread blocking issue with trying to parse too many 
> PDF files at one time in the same JVM.
> 
> Occasionally Tika (actually PDFBox) has been known to hang for some PDF 
> docs.
> 
> Do you have enough memory in the JVM? When the CPU is busy, is there much 
> memory available in the JVM? Maybe garbage collection is taking too much of 
> the CPU.
> 


Hi Jack,

Thanks for your quick response. Yes, I hope I have enough JVM memory. Here
are the mem settings:

-Xms11g -Xmx11g -XX:MaxPermSize=2g

Is this a common issue seen for PDF extraction and indexing? Why am I not
able to do more than 1k documents per hour?

Thanks,
Surendra.



RE: Wildcard-Search Solr 3.5.0

2012-05-25 Thread spring
Oh, thx for the update! I didn't notice that Solr 3.6 has a text_de field
type. These two options... less / more aggressive. Aggressive in terms of
what?

Thank you!

> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com] 
> Sent: Freitag, 25. Mai 2012 03:25
> To: solr-user@lucene.apache.org
> Subject: Re: Wildcard-Search Solr 3.5.0
> 
> I tried it and it does appear to be the 
> SnowballPorterFilterFactory that 
> normally does the accent folding but can't here because it is 
> not multi-term 
> aware. I did notice that the text_de field type that comes in 
> the Solr 3.6 
> example schema handles your case fine. It uses the 
> GermanNormalizationFilterFactory to fold accented characters and is 
> multi-term aware. Any particular reason you're not using the 
> stock text_de 
> field type? It also has three stemming options which might be 
> sufficient for 
> your needs.
> 
> In any case, try to make your text_de field type closer to the stock 
> version, and try to use GermanNormalizationFilterFactory, and 
> that may be 
> good enough for your situation.



Re: terms component misleading results

2012-05-25 Thread Cam Bazz
Oh OK, I got it.

So if I update the document three times, does that mean I have 1
normal document and 2 marked for deletion?

Because the max difference was 1, no matter how many times you update.

I think I can manage the faceting to do what I need. I guess that will
be faster than making a real query, and extracting the full docs.

Best Regards,
-C.B.

On Fri, May 25, 2012 at 10:14 AM, Chris Hostetter
 wrote:
>
> : the terms count goes up by 1 for that specific term. For example, if I
> : have two documents in the index, each with tag="ccc", and if I update one
> : of the documents, the term frequency for ccc becomes 3. When I optimize
> : the index, it goes down again to the correct number (2).
>
> http://wiki.apache.org/solr/TermsComponent
>
>>> Retrieving terms in index order is very fast since the implementation
>>> directly uses Lucene's TermEnum to iterate over the term dictionary.
> ...
>>> The doc frequencies returned are the number of documents that match the
>>> term, including any documents that have been marked for deletion but
>>> not yet removed from the index.
>
> : Is there any way to get the exact term frequency?
>
> field faceting.
>
>
> -Hoss


Re: upgrade to 3.6

2012-05-25 Thread Cam Bazz
Hello,

I have tested, but was not able to replicate the problem.

(Basically I indexed a few documents with UTF-8 chars, then searched
for them, and they were found OK.)

The issue says, at 27/Apr/12 08:56:

> the fix is now committed to 3.6 branch

I just recently downloaded 3.6 - well, actually it seems I downloaded
it at 2012-04-27 19:27 GMT+2 (from the file stamp).

Does that mean that I was lucky?

Best,


On Fri, May 25, 2012 at 10:17 AM, Sami Siren  wrote:
> Hi,
>
> If you're using non-ASCII data with SolrJ, you might want to test that
> it works properly for you. See for example
> https://issues.apache.org/jira/browse/SOLR-3375
>
> --
>  Sami Siren
>
> On Fri, May 25, 2012 at 10:11 AM, Cam Bazz  wrote:
>> Hello,
>>
>> I have upgraded from 1.4 to 3.6 - it went quite smoothly, using the same
>> schema.xml
>>
>> I have done some testing, and I have not found any problems yet. Soon
>> I will migrate the production system to 3.6
>>
>> Any recommendations on this matter? Maybe I skipped something?
>>
>> Best Regards,
>> C.B.


Re: upgrade to 3.6

2012-05-25 Thread Sami Siren
Hi,

If you're using non-ASCII data with SolrJ, you might want to test that
it works properly for you. See for example
https://issues.apache.org/jira/browse/SOLR-3375

--
 Sami Siren

On Fri, May 25, 2012 at 10:11 AM, Cam Bazz  wrote:
> Hello,
>
> I have upgraded from 1.4 to 3.6 - it went quite smoothly, using the same
> schema.xml
>
> I have done some testing, and I have not found any problems yet. Soon
> I will migrate the production system to 3.6
>
> Any recommendations on this matter? Maybe I skipped something?
>
> Best Regards,
> C.B.


Re: terms component misleading results

2012-05-25 Thread Chris Hostetter

: the terms count goes up by 1 for that specific term. For example, if I
: have two documents in the index, each with tag="ccc", and if I update one
: of the documents, the term frequency for ccc becomes 3. When I optimize
: the index, it goes down again to the correct number (2).

http://wiki.apache.org/solr/TermsComponent

>> Retrieving terms in index order is very fast since the implementation 
>> directly uses Lucene's TermEnum to iterate over the term dictionary. 
...
>> The doc frequencies returned are the number of documents that match the 
>> term, including any documents that have been marked for deletion but 
>> not yet removed from the index. 

: Is there any way to get the exact term frequency?

field faceting.


-Hoss
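
E.g., something along these lines (assuming the tag field from the question
above); facet counts are computed against live documents only, so deleted
duplicates don't inflate them:

    http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=tag&facet.limit=-1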


upgrade to 3.6

2012-05-25 Thread Cam Bazz
Hello,

I have upgraded from 1.4 to 3.6 - it went quite smoothly, using the same
schema.xml

I have done some testing, and I have not found any problems yet. Soon
I will migrate the production system to 3.6

Any recommendations on this matter? Maybe I skipped something?

Best Regards,
C.B.


terms component misleading results

2012-05-25 Thread Cam Bazz
Hello,

I need to know the exact count of certain terms in the documents. I
noticed that when I update a document (only one field, for testing),
the terms count goes up by 1 for that specific term. For example, if I
have two documents in the index, each with tag="ccc", and if I update one
of the documents, the term frequency for ccc becomes 3. When I optimize
the index, it goes down again to the correct number (2).

Is there any way to get the exact term frequency?

Regular querying works well, but I don't quite understand why the
terms count is misleading.

Best Regards,
C.B.