RE: how can I specify the number of replications for each shard?

2012-05-24 Thread Vince Wei (jianwei)
I am using Solr 4.0.

I want the number of replications for each shard to be 3.

How can I do this?

 

Sincerely

Vince Wei

 

From: Vince Wei (jianwei) 
Sent: May 25, 2012 11:40
To: 'solr-user@lucene.apache.org'
Subject: how can I specify the number of replications for each shard?

 

Hi All,

 

how can I specify the number of replications for each shard?

Thanks!

 

 

Sincerely

Vince Wei



how can I specify the number of replications for each shard?

2012-05-24 Thread Vince Wei (jianwei)
Hi All,

 

how can I specify the number of replications for each shard?

Thanks!

 

 

Sincerely

Vince Wei



Re: getTransformer error

2012-05-24 Thread Chris Hostetter

: Anyone found a solution to the getTransformer error. I am getting the same
: error.

If you use Solr 3.6, with the example jetty and example configs, do you 
get the same error using the provided example XSL files?

http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=example.xsl
http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=example_rss.xsl
http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=example_atom.xsl

... I just tried those and had no problems.

: Caused by: java.io.IOException: Unable to initialize Templates
: 'example.xslt'

Just to be clear: is your file actually named "example.xslt" ? the example 
that comes with Solr is "example.xsl" (no "T")

Can you post that file?

Does it by any chance contain an xsl include (or xinclude)?  If so, see 
the note about SOLR-1656 in CHANGES.txt...

* SOLR-1656: XIncludes and other HREFs in XML files loaded by ResourceLoader
  are fixed to be resolved using the URI standard (RFC 2396). The system
  identifier is no longer a plain filename with path, it gets initialized
  using a custom URI scheme "solrres:". This scheme is resolved using a
  EntityResolver that utilizes ResourceLoader
  (org.apache.solr.common.util.SystemIdResolver). This makes all relative
  pathes in Solr's config files behave like expected. This change
  introduces some backwards breaks in the API: Some config classes
  (Config, SolrConfig, IndexSchema) were changed to take
  org.xml.sax.InputSource instead of InputStream. There may also be some
  backwards breaks in existing config files, it is recommended to check
  your config files / XSLTs and replace all XIncludes/HREFs that were
  hacked to use absolute paths to use relative ones. (uschindler)






-Hoss


Re: Throws Null Pointer Exception Even Query is Correct in solr

2012-05-24 Thread Chris Hostetter

: in sufficient amount .. But still its is throwing Null Pointer Exception in
: Tomcat and in Eclipse while debugging i had seen Error as "Error Executing
: Query" . Please give me suggestion for this.
: 
: Note: While the ids are below or equal to 99800 the Query is returning the
: Result

what exactly is the full stack trace? ... that's the minimum amount of 
info needed to make any meaningful guess as to what the source of 
an exception might be.



-Hoss


Re: Wildcard-Search Solr 3.5.0

2012-05-24 Thread Jack Krupansky
I tried it and it does appear to be the SnowballPorterFilterFactory that 
normally does the accent folding but can't here because it is not multi-term 
aware. I did notice that the text_de field type that comes in the Solr 3.6 
example schema handles your case fine. It uses the 
GermanNormalizationFilterFactory to fold accented characters and is 
multi-term aware. Any particular reason you're not using the stock text_de 
field type? It also has three stemming options which might be sufficient for 
your needs.


In any case, try to make your text_de field type closer to the stock 
version, and try to use GermanNormalizationFilterFactory, and that may be 
good enough for your situation.


-- Jack Krupansky

-Original Message- 
From: spr...@gmx.eu

Sent: Wednesday, May 23, 2012 10:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Wildcard-Search Solr 3.5.0


I'd guess that this is because SnowballPorterFilterFactory
does not implement MultiTermAwareComponent. Not sure, though.


Yes, I think this hinders the automagic multiterm awareness from doing its
job.
Could a custom analyzer chain with <analyzer type="multiterm"> help? Like
described (very, very briefly - too briefly...) here:
http://wiki.apache.org/solr/MultitermQueryAnalysis



Re: DIH using a connection pool

2012-05-24 Thread Lance Norskog
Yes, this is the right way for the DIH. You might find it easier to
write a separate local client that polls the DB and uploads changes.
The DIH is oriented toward longer batch jobs.
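
A rough sketch of such a polling client, using SolrJ plus plain JDBC; the
table, column, and Solr field names are placeholders, and scheduling and
error handling are left out:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DbPoller {
  // Run this roughly once a second from a scheduler, keeping the JDBC
  // connection (or a pool) open between calls instead of re-opening it.
  public static void pollOnce(Connection conn, SolrServer solr, Timestamp since)
      throws Exception {
    PreparedStatement ps = conn.prepareStatement(
        "SELECT id, title FROM item WHERE updated_at > ?");
    ps.setTimestamp(1, since);
    ResultSet rs = ps.executeQuery();
    List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", rs.getString("id"));
      doc.addField("title", rs.getString("title"));
      docs.add(doc);
    }
    rs.close();
    ps.close();
    if (!docs.isEmpty()) {
      solr.add(docs);   // let an autoCommit/commitWithin policy make them visible
    }
  }
}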

On Thu, May 24, 2012 at 7:29 AM, Esteban Donato
 wrote:
> Hi community,
>
> I am using Solr with DIH to index content from a DB.  The point is
> that I have to configure DIH to check changes in the DB very
> frequently (approx. 1 sec) to maintain the index almost up-to-date.  I
> noted that JDBCDataSource closes the DB connection after every
> execution which is not acceptable with this update rate.  Ideally I
> would need DIH using a connection pool.  Looking at DIH code and faq I
> noticed that I can configure a connection pool and expose it via jndi
> for the JDBCDataSource to use it.  My question is: is this the way to
> go for integrating a connection pool with DIH?
>
> Thanks
> Esteban



-- 
Lance Norskog
goks...@gmail.com


Re: Tips on creating a custom QueryCache?

2012-05-24 Thread Chris Hostetter

: 1) Any recommendations on which best to sub-class? I'm guessing, for this
: scenario with "rare" batch puts and no evictions, I'd be looking for get
: performance. This will also be on a box with many CPUs - so I wonder if the
: older LRUCache would be preferable?

i suspect you are correct ... the entire point of the other caches is 
dealing with faster replacement, so you really don't care.

You might even find it worthwhile to write your own 
"NoReplacementCache" from scratch backed by a HashMap (instead of the 
LinkedHashMap used in LRUCache)

: 2) Would I need to worry about "auto warming" at all? I'm still a little
: foggy on lifecycle of firstSearcher versus newSearcher (is firstSearcher
: really only ever called the first time the solr instanced is started?). In

essentially correct - technically it's when the SolrCore is created (which 
may not be exactly when the solr instance is started).

it's when the "first searcher" is created for the SolrCore, and there is 
no other previous searcher.

: any case, since the only time a commit would occur is when batch updates,
: re-indexing and re-optimizing occurs (once a day off-peak perhaps) I
: *think* I would always want to perform the same "static warming" rather
: than attempting to auto-warm from the old cache - does this make sense?

auto-warming should be faster than any static warming you could do, 
because the existing Query objects will already be in memory, no 
loading/parsing required.  That said: if you expect/want the list of 
queries to be evolvable (ie: add or remove a few queries w/o restarting / 
reloading the SolrCore) then you're correct: better to use static warming 
in newSearcher and skip autowarming completely -- but for your use case you 
shouldn't need both.

-Hoss


Re: Tips on creating a custom QueryCache?

2012-05-24 Thread Aaron Daubman
Hoss, brilliant as always - many thanks! =)

Subclassing the SolrCache class sounds like a good way to accomplish this.

Some questions:
1) Any recommendations on which best to sub-class? I'm guessing, for this
scenario with "rare" batch puts and no evictions, I'd be looking for get
performance. This will also be on a box with many CPUs - so I wonder if the
older LRUCache would be preferable?

2) Would I need to worry about "auto warming" at all? I'm still a little
foggy on lifecycle of firstSearcher versus newSearcher (is firstSearcher
really only ever called the first time the solr instanced is started?). In
any case, since the only time a commit would occur is when batch updates,
re-indexing and re-optimizing occurs (once a day off-peak perhaps) I
*think* I would always want to perform the same "static warming" rather
than attempting to auto-warm from the old cache - does this make sense?

Thanks again!
 Aaron

On Thu, May 24, 2012 at 7:38 PM, Chris Hostetter
wrote:

>
> Interesting problem,
>
> w/o making any changes to Solr, you could probably get this behavior by:
>  a) sizing your cache large enough.
>  b) using a firstSearcher that generates your N queries on startup
>  c) configure autowarming of 100%
>  d) ensure every query you send uses cache=false
>
>
> The tricky part being "d".
>
> But if you don't mind writing a little java, i think this should actually
> be fairly trivial to do w/o needing "d" at all...
>
> 1) subclass the existing SolrCache class of your choice.
> 2) in your subclass, make "put" be a No-Op if getState()==LIVE, else
> super.put(...)
>
> ...so during any warming phase (either static from
> firstSearcher/newSearcher, or because of autowarming) the cache will
> accept new objects, but once warming is done it will ignore requests to
> add new items (so it will never evict anything)
>
> Then all you need is a firstSearcher event listener that feeds you your N
> queries (model it after "QuerySenderListener" but have it read from
> whatever source you want instead of the solrconfig.xml)
>
> : The reason for this somewhat different approach to caching is that we may
> : get any number of odd queries throughout the day for which performance
> : isn't important, and we don't want any of these being added to the cache
> or
> : evicting other entries from the cache. We need to ensure high performance
> : for this pre-determined list of queries only (while still handling other
> : arbitrary queries, if not as quickly)
>
> FWIW: my de facto way of dealing with this in the past was to siloize my
> slave machines by use case.  For example, in one index: i had 1 master,
> which replicated to 2*N slaves, as well as a repeater.  The 2*N slaves
> were behind 2 diff load balancers (N even numbered machines and N odd
> numbered machines), and the two sets of slaves had diff static cache
> warming configs - even numbered machines warmed queries common to
> "browsing" categories, odd numbered machines warmed top-searches.  If the
> front end was doing an arbitrary search, it was routed to the load balancer
> for the odd-numbered slaves.  If the front end was doing a category
> browse, the query was routed to the even-numbered slaves.  Meanwhile: the
> "repeater" was replicating out to a bunch of smaller one-off boxes with
> cache configs by use case, ie: the data-warehouse and analytics team had
> their own slave they could run their own complex queries against.  the
> tools team had a dedicated slave that various internal tools would query
> via ajax to get metadata, etc...
>
> -Hoss
>


Re: Tips on creating a custom QueryCache?

2012-05-24 Thread Chris Hostetter

Interesting problem,

w/o making any changes to Solr, you could probably get this behavior by:
 a) sizing your cache large enough.
 b) using a firstSearcher that generates your N queries on startup
 c) configure autowarming of 100%
 d) ensure every query you send uses cache=false


The tricky part being "d".

But if you don't mind writing a little java, i think this should actually 
be fairly trivial to do w/o needing "d" at all...

1) subclass the existing SolrCache class of your choice.
2) in your subclass, make "put" be a No-Op if getState()==LIVE, else 
super.put(...)

...so during any warming phase (either static from 
firstSearcher/newSearcher, or because of autowarming) the cache will 
accept new objects, but once warming is done it will ignore requests to 
add new items (so it will never evict anything)
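
A minimal sketch of steps 1 and 2, assuming LRUCache as the base class (the
cache classes are generic in newer versions, so the exact override signature
may differ slightly):

import org.apache.solr.search.LRUCache;
import org.apache.solr.search.SolrCache;

public class NoEvictLRUCache extends LRUCache {
  @Override
  public Object put(Object key, Object value) {
    // Accept entries while warming (firstSearcher/newSearcher/autowarming),
    // but silently drop puts once the cache is LIVE, so nothing is ever
    // added -- and therefore nothing is ever evicted -- at query time.
    if (getState() == SolrCache.State.LIVE) {
      return null;
    }
    return super.put(key, value);
  }
}

You would then point the queryResultCache's class attribute in solrconfig.xml
at this class.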

Then all you need is a firstSearcher event listener that feeds you your N 
queries (model it after "QuerySenderListener" but have it read from 
whatever source you want instead of the solrconfig.xml)

: The reason for this somewhat different approach to caching is that we may
: get any number of odd queries throughout the day for which performance
: isn't important, and we don't want any of these being added to the cache or
: evicting other entries from the cache. We need to ensure high performance
: for this pre-determined list of queries only (while still handling other
: arbitrary queries, if not as quickly)

FWIW: my de facto way of dealing with this in the past was to siloize my 
slave machines by use case.  For example, in one index: i had 1 master, 
which replicated to 2*N slaves, as well as a repeater.  The 2*N slaves 
were behind 2 diff load balancers (N even numbered machines and N odd 
numbered machines), and the two sets of slaves had diff static cache 
warming configs - even numbered machines warmed queries common to 
"browsing" categories, odd numbered machines warmed top-searches.  If the 
front end was doing an arbitrary search, it was routed to the load balancer 
for the odd-numbered slaves.  If the front end was doing a category 
browse, the query was routed to the even-numbered slaves.  Meanwhile: the 
"repeater" was replicating out to a bunch of smaller one-off boxes with 
cache configs by use case, ie: the data-warehouse and analytics team had 
their own slave they could run their own complex queries against.  the 
tools team had a dedicated slave that various internal tools would query 
via ajax to get metadata, etc...

-Hoss


Re: First query to find meta data, second to search. How to group into one?

2012-05-24 Thread Chris Hostetter

: We are using mm=70% in solrconfig.xml
: We are using qf=title description
: We are not doing phrase query in "q"
: 
: In case of a multi-word search text, mostly the end results are the junk
: ones. Because the words, mentioned in search text, are written in different
: fields and in different contexts.
: For example searching for "water proof" (without double quotes) brings a
: record where title = "rose water" and description = "... no proof of
: contamination ..."

Did you consider using "pf" ? ... just specifying something 
like "pf=title^100 description^100" should help shove records like the 
example you gave to the bottom of the result set relative to records that 
actually contain the phrase "water proof" in a single field.

it won't *remove* these results, just promote other results, so it's not 
really comparable to what you are doing, but i still strongly suggest you 
consider it (it can even be complementary to what you are doing now, by 
ensuring that the top N you pick from the first results are really the 
"top" N).

:- We are firing first query to get top "n" results. We assume that first
:"n" results are mostly good results. "n" is dynamic within a predefined
:minimum and maximum value.
:- We are calculating frequency of category ids in these top results. We
:are not using facets because that gives count for all, relevant or
:irrelevant, results.
:- Based on category frequencies within top matching results we are
:trying to find a few most frequent categories by simple calculation. Now we
:are very confident that these categories are the ones which best suit to
:our query.

FWIW: I've done this before in a custom hierarchical faceting component 
(to adjust the order used in displaying category drill down options)
and i found it worked very well, but the key is picking a good N.  If i 
remember correctly, i went with a percentage of the total result size, 
maxed out at a fixed constant (which i also used as my docList window size, 
so getting those N docs was essentially free unless the user started 
drilling down deep in pagination).  But i also recall seeing a 
paper somewhere that talked about a similar idea and had an 
equation for finding a "cliff" in scores to identify where the "good" 
matches ended (the math confused me, but i think it was about looking at 
the delta in scores between successive documents compared to the delta of 
the last X docs? ... does this sound familiar to anybody else?)

:- Finally we are firing a second query with top categories, calculated
:above, in filter query (fq).

a) word of caution: when programmatically adding filters like this, make sure 
you give your users some visual feedback that it's happening, and some way 
to override the filter.  there is nothing more frustrating than having a 
search UI assume it knows what you want, and giving you no way to say 
"no really, i wanted what i asked for".  A classic annoying-as-hell 
example was Yahoo's yellow page search ~10 years ago.  if you typed in 
something that was the name of a "category" it would give you a listing of 
all businesses in that category in the city you specified.  Making it 
completely impossible to find a (furniture) store named "The Magazine" in 
Berkeley -- because your search would automatically be filtered to the 
category "Books & Magazines" with no way to break out.

b) instead of filtering, you might want to consider just adding boost 
queries on the top categories -- it won't remove results, so if that's 
really what you want, never mind, but it should have roughly the same 
effect on the first few pages of results, while people can still drill 
down to find those other documents if they wish.
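
A hedged SolrJ sketch combining both suggestions -- the "pf" boost mentioned
earlier plus "bq" boosts on the inferred categories instead of an fq filter;
the field and category names are just placeholders from this thread:

import org.apache.solr.client.solrj.SolrQuery;

public class CategoryBoostedQuery {
  public static SolrQuery build(String userQuery, String... topCategoryIds) {
    SolrQuery q = new SolrQuery(userQuery);
    q.set("defType", "edismax");
    q.set("qf", "title description");
    q.set("mm", "70%");
    // phrase boost: push docs containing the whole phrase in one field to the top
    q.set("pf", "title^100 description^100");
    // boost (rather than filter on) the categories picked from the first pass
    for (String catId : topCategoryIds) {
      q.add("bq", "category_id:" + catId + "^5");
    }
    return q;
  }
}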

: Does it require writing a plugin if I want to move above logic into Solr?
: Which component do I need to modify - QueryComponent?
: 
: Or is there any better or even equivalent method in Solr of doing this or
: similar thing?

you could subclass QueryComponent and use your subclass in place of 
QueryComponent, or you might consider just adding a new component in front 
of QueryComponent that does your initial query, looks at the results, and 
then modifies the filters and lets QueryComponent do its normal work.

I'm not sure which one would be easier.

-Hoss


Re: Merging two DocSets in solr

2012-05-24 Thread Chris Hostetter

:   I get two different DocSets from two different searchers. I need
: to merge them into one and get the facet counts from the merged
: docSets. How do I do it? Any pointers would be appreciated.

1) if you really mean "two different searchers" then you can not do this 
-- DocSets, and the docs they represent, are specific to a single 
searcher.  the same "docid" might refer to two completely different 
Documents in two different SolrIndexSearchers, so there is no way to relate 
them.

2) assuming your DocSets come from the *same* SolrIndexSearcher, then you 
have to define what you mean by "merge".  the DocSet API provides both  
intersection(DocSet) and union(DocSet) methods precisely for this purpose 
-- just pick your meaning.

3) once you have your "merged" DocSet, you can construct a SimpleFacets 
instance using that DocSet and get whatever facets you want.
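
A rough sketch of steps 2 and 3, assuming both DocSets came from the same
SolrIndexSearcher and a SimpleFacets constructor of the form
(request, docs, params); constructor and return-type details vary a bit
between Solr versions:

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SimpleFacets;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DocSet;

public class MergedFacets {
  public static NamedList<Object> facetsForMergedSets(SolrQueryRequest req,
      DocSet a, DocSet b, SolrParams params) throws Exception {
    // step 2: pick your meaning of "merge"
    DocSet merged = a.union(b);          // or: a.intersection(b);
    // step 3: compute facet counts over the merged DocSet
    SimpleFacets facets = new SimpleFacets(req, merged, params);
    return facets.getFacetCounts();
  }
}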



-Hoss


Re: configuring solr3.6 for a large intensive index only run

2012-05-24 Thread Shawn Heisey

On 5/23/2012 12:27 PM, Lance Norskog wrote:

If you want to suppress merging, set the 'mergeFactor' very high.
Perhaps 100. Note that Lucene opens many files (50? 100? 200?) for
each segment. You would have to set the 'ulimit' for file descriptors
to 'unlimited' or 'millions'.


My installation (Solr 3.5.0) creates 11 files per segment, and there is 
often a 12th file for deletes.  I have termvectors turned on for some of 
my fields.  If you aren't using termvectors at all, the last three files 
in my list are not created:


_26n_2.del  _26n.fdt  _26n.fdx  _26n.fnm  _26n.frq  _26n.nrm  _26n.prx  
_26n.tii  _26n.tis  _26n.tvd  _26n.tvf  _26n.tvx


I have yet to try 3.6, but I would imagine that it isn't a lot different 
than 3.5.  I use a fairly high mergeFactor of 35, and I am considering 
raising it even higher so that during normal operation there will never 
be a merge that's not under my control.  When I do a full index rebuild, 
there is so much data added that it will still do automatic merges.


Thanks,
Shawn



Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-24 Thread Charles Riley
True, no argument there as to usage.

I should have clarified that the encoding of the character used for alif
(02BE) carries with it an assigned property in the Unicode database of
(Lm), putting it into the category of 'Modifier_Letter', which contrasts
with the property (Sk), 'Modifier_Symbol', a property assigned to
characters that are more commonly used as diacritics.

I think the inclusion of characters into the filter factories was
determined based on these properties as assigned, though yes, there's often
a broader range of uses that each character is actually used for.

Charles


On Thu, May 24, 2012 at 1:41 PM, Naomi Dushay  wrote:

> The alif and ayn can also be used as diacritic-like characters in Korean;
>  this is a known practice.   But thanks anyway.
>
> On May 24, 2012, at 9:30 AM, Charles Riley wrote:
>
> Hi Naomi,
>
> I don't have a conclusive answer for you on this yet, but let me pick up
> on a few points.
>
> First, the apostrophe is probably being handled through ignoring
> punctuation in the ICUCollationKeyFilterFactory.
>
> Alif isn't a diacritic but a letter, and its character properties would be
> handled as such, apparently also outside the scope of what the folding
> filter factory does unless it's tailored.
>
> From the solrwiki, this looks like a helpful rule of thumb:
>
> "When To use a CharFilter vs a TokenFilter
>
> There are several pairs of CharFilters and TokenFilters that have related
> (ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical
> functionality (ie: PatternReplaceCharFilterFactory and
> PatternReplaceFilterFactory) and it may not always be obvious which is the
> best choice.
>
> The ultimate decision depends largely on what Tokenizer you are using, and
> whether you need to "out smart" it by preprocessing the stream of
> characters.
>
> For example, maybe you have a tokenizer such as StandardTokenizer and you
> are pretty happy with how it works overall, but you want to customize how
> some specific characters behave.
> In such a situation you could modify the rules and re-build your own
> tokenizer with javacc, but perhaps its easier to simply map some of the
> characters before tokenization with a CharFilter."
>
>
> Charles
>
> On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay wrote:
>
>> We are using the ICUFoldingFilterFactory with great success to fold
>> diacritics so searches with and without the diacritics get the same results.
>>
>> We recently discovered we have some Korean records that use an alif
>> diacritic instead of an apostrophe, and this diacritic is NOT getting
>> folded.   Has anyone experienced this for alif or ayn characters?   Do you
>> have a solution?
>>
>>
>> - Naomi
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "solrmarc-tech" group.
>> To post to this group, send email to solrmarc-t...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> solrmarc-tech+unsubscr...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/solrmarc-tech?hl=en.
>>
>>
>
>
> --
> *Charles L. Riley*
> *Catalog Librarian for Africana*
> *Sterling Memorial Library, Yale University*
> *<**zenodo...@gmail.com* *>*
> *203-432-7566*
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "solrmarc-tech" group.
> To post to this group, send email to solrmarc-t...@googlegroups.com.
> To unsubscribe from this group, send email to
> solrmarc-tech+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/solrmarc-tech?hl=en.
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "solrmarc-tech" group.
> To post to this group, send email to solrmarc-t...@googlegroups.com.
> To unsubscribe from this group, send email to
> solrmarc-tech+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/solrmarc-tech?hl=en.
>



-- 
*Charles L. Riley*
*Catalog Librarian for Africana*
*Sterling Memorial Library, Yale University*
*<**zenodo...@gmail.com* *>*
*203-432-7566*


Re: field compression in solr 3.6

2012-05-24 Thread Shawn Heisey

On 5/23/2012 2:48 PM, pramila_tha...@ontla.ola.org wrote:

Hi Everyone,

solr 3.6 does not seem to be honoring the field compress.

While merging the indexes the size of Index is very big.

Is there any other way to  handle this to keep compression functionality?


Compression support was removed from Solr.  I am not clear on the 
reasons, but there was probably a good one.  The wiki says it happened 
in 1.4.1.


http://wiki.apache.org/solr/SchemaXml#Data_Types

There seems to be a patch to put compression back in, implemented in a 
different way that is not compatible with fields compressed in the old 
way.  The patch has not been committed to any Solr version.


https://issues.apache.org/jira/browse/SOLR-752

Thanks,
Shawn



Re: how to reduce the result size to 2-3 lines and expand based on user interest

2012-05-24 Thread Ahmet Arslan
> Just wondering if you have any suggestions!!! The other
> thing I tried using
> following url and the results returned same way as they were
> (no trimming of
> description to 300 chars). not sure if it is because of
> config file
> settings.
> 
> 
> http://localhost:8983/solr/browse?&hl=true&hl.fl=DESCRIPTION&hl.maxAnalyzedChars=0&f.DESCRIPTION.hl.alternateField=DESCRIPTION&f.DESCRIPTION.hl.maxAlternateFieldLength=300

Do you have a field named DESCRIPTION? What happens when you add &wt=xml to 
your URL?

Why do you want to achieve this?
1-) You are worried about transferring very long text?
2-) Or just display purposes?

If it is just for display purposes you can sub-string the description field at the client side. 


Re: Minor typo: None-hex character in unicode escape sequence

2012-05-24 Thread Chris Hostetter

: I just happened to notice a typo when I mistyped a Unicode escape sequence in 
a query:

Thanks Jack, r1342363.

: Dismax doesn’t get the error since apparently it doesn’t recognize Unicode 
escape sequences.

correct .. dismax doesn't accept any escape sequence (but literal 
unicode characters should work fine)


-Hoss

Re: need to verify my understanding of default value of mm (minimum match) for edismax

2012-05-24 Thread Jack Krupansky
That's my understanding for releases of Solr before 4.0, that the default 
for MM is 100%. You can add a default value of MM in your query request 
handler in solrconfig.xml.
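
As a hedged illustration, the mm=3 override mentioned below can also be sent
per request with SolrJ (the query fields here are placeholders); putting mm
in the handler's defaults in solrconfig.xml has the same effect for every
request:

import org.apache.solr.client.solrj.SolrQuery;

public class MinimumMatchQuery {
  public static SolrQuery build() {
    SolrQuery q = new SolrQuery("singer sewing machine 9010");
    q.set("defType", "edismax");
    q.set("qf", "itemNo itemDesc");
    // require only 3 of the 4 terms to match instead of the 100% default
    q.set("mm", "3");
    return q;
  }
}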


-- Jack Krupansky

-Original Message- 
From: geeky2

Sent: Thursday, May 24, 2012 10:48 AM
To: solr-user@lucene.apache.org
Subject: need to verify my understanding of default value of mm (minimum 
match) for edismax


environment: solr 3.5
default operator is OR

i want to make sure i understand how the mm param (minimum match) works for
the edismax parser

http://wiki.apache.org/solr/ExtendedDisMax?highlight=%28dismax%29#mm_.28Minimum_.27Should.27_Match.29

it looks like the rule is 100% of the terms must match across the fields,
unless i override this with the mm=x param - do i have this right?

what i am seeing is a query that matches on:

q=singer sewing 9010

will fail if it is changed to:

q=singer sewing machine 9010

for the second query - if i add mm=3 - then it comes back with results

thank you


--
View this message in context: 
http://lucene.472066.n3.nabble.com/need-to-verify-my-understanding-of-default-value-of-mm-minimum-match-for-edismax-tp3985936.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-24 Thread Naomi Dushay
The alif and ayn can also be used as diacritic-like characters in Korean;  this 
is a known practice.   But thanks anyway.

On May 24, 2012, at 9:30 AM, Charles Riley wrote:

> Hi Naomi,
> 
> I don't have a conclusive answer for you on this yet, but let me pick up on a 
> few points.
> 
> First, the apostrophe is probably being handled through ignoring punctuation 
> in the ICUCollationKeyFilterFactory.  
> 
> Alif isn't a diacritic but a letter, and its character properties would be 
> handled as such, apparently also outside the scope of what the folding filter 
> factory does unless it's tailored.
> 
> From the solrwiki, this looks like a helpful rule of thumb:
> 
> "When To use a CharFilter vs a TokenFilter
> There are several pairs of CharFilters and TokenFilters that have related 
> (ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical 
> functionality (ie: PatternReplaceCharFilterFactory and 
> PatternReplaceFilterFactory) and it may not always be obvious which is the 
> best choice.
> 
> The ultimate decision depends largely on what Tokenizer you are using, and 
> whether you need to "out smart" it by preprocessing the stream of characters.
> 
> For example, maybe you have a tokenizer such as StandardTokenizer and you are 
> pretty happy with how it works overall, but you want to customize how some 
> specific characters behave.
> 
> In such a situation you could modify the rules and re-build your own 
> tokenizer with javacc, but perhaps its easier to simply map some of the 
> characters before tokenization with a CharFilter."
> 
> 
> Charles
> 
> On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay  wrote:
> We are using the ICUFoldingFilterFactory with great success to fold 
> diacritics so searches with and without the diacritics get the same results.
> 
> We recently discovered we have some Korean records that use an alif diacritic 
> instead of an apostrophe, and this diacritic is NOT getting folded.   Has 
> anyone experienced this for alif or ayn characters?   Do you have a solution?
> 
> 
> - Naomi
> 
> --
> You received this message because you are subscribed to the Google Groups 
> "solrmarc-tech" group.
> To post to this group, send email to solrmarc-t...@googlegroups.com.
> To unsubscribe from this group, send email to 
> solrmarc-tech+unsubscr...@googlegroups.com.
> For more options, visit this group at 
> http://groups.google.com/group/solrmarc-tech?hl=en.
> 
> 
> 
> 
> -- 
> Charles L. Riley
> Catalog Librarian for Africana
> Sterling Memorial Library, Yale University
> 
> 203-432-7566
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "solrmarc-tech" group.
> To post to this group, send email to solrmarc-t...@googlegroups.com.
> To unsubscribe from this group, send email to 
> solrmarc-tech+unsubscr...@googlegroups.com.
> For more options, visit this group at 
> http://groups.google.com/group/solrmarc-tech?hl=en.



Re: How many in the XML source file before indexing?

2012-05-24 Thread Yonik Seeley
On Thu, May 24, 2012 at 7:29 AM, Michael Kuhlmann  wrote:
> However, I doubt it. I've not been too deeply into the UpdateHandler yet,
> but I think it first needs to parse the complete XML file before it starts
> to index.

Solr's update handlers all stream (XML, JSON, CSV), reading and
indexing a document at a time from the input.

-Yonik
http://lucidimagination.com


Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-24 Thread Charles Riley
Hi Naomi,

I don't have a conclusive answer for you on this yet, but let me pick up on
a few points.

First, the apostrophe is probably being handled through ignoring
punctuation in the ICUCollationKeyFilterFactory.

Alif isn't a diacritic but a letter, and its character properties would be
handled as such, apparently also outside the scope of what the folding
filter factory does unless it's tailored.

From the solrwiki, this looks like a helpful rule of thumb:

"When To use a CharFilter vs a TokenFilter

There are several pairs of CharFilters and TokenFilters that have related
(ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical
functionality (ie: PatternReplaceCharFilterFactory and
PatternReplaceFilterFactory) and it may not always be obvious which is the
best choice.

The ultimate decision depends largely on what Tokenizer you are using, and
whether you need to "out smart" it by preprocessing the stream of
characters.

For example, maybe you have a tokenizer such as StandardTokenizer and you
are pretty happy with how it works overall, but you want to customize how
some specific characters behave.
In such a situation you could modify the rules and re-build your own
tokenizer with javacc, but perhaps its easier to simply map some of the
characters before tokenization with a CharFilter."


Charles

On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay  wrote:

> We are using the ICUFoldingFilterFactory with great success to fold
> diacritics so searches with and without the diacritics get the same results.
>
> We recently discovered we have some Korean records that use an alif
> diacritic instead of an apostrophe, and this diacritic is NOT getting
> folded.   Has anyone experienced this for alif or ayn characters?   Do you
> have a solution?
>
>
> - Naomi
>
> --
> You received this message because you are subscribed to the Google Groups
> "solrmarc-tech" group.
> To post to this group, send email to solrmarc-t...@googlegroups.com.
> To unsubscribe from this group, send email to
> solrmarc-tech+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/solrmarc-tech?hl=en.
>
>


-- 
*Charles L. Riley*
*Catalog Librarian for Africana*
*Sterling Memorial Library, Yale University*
*<**zenodo...@gmail.com* *>*
*203-432-7566*


Re: List of recommendation engines with solr

2012-05-24 Thread Trev
Have you heard of NG Data with their product called Lily?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-recommendation-engines-with-solr-tp3818917p3985922.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Performance

2012-05-24 Thread Jack Krupansky
I vaguely recall some thread blocking issue with trying to parse too many 
PDF files at one time in the same JVM.


Occasionally Tika (actually PDFBox) has been known to hang for some PDF 
docs.


Do you have enough memory in the JVM? When the CPU is busy, is there much 
memory available in the JVM? Maybe garbage collection is taking too much of 
the CPU.


-- Jack Krupansky

-Original Message- 
From: chris.a.mattm...@jpl.nasa.gov

Sent: Thursday, May 24, 2012 9:55 AM
To: solr-user@lucene.apache.org
Subject: Solr Performance

Hi Chris

First of all, thanks a lot - your earlier inputs for my document indexing
failures helped me a lot!

Now I am facing few performance issues with the indexing.
This is what I am doing-

- Read data from an excel sheet which essentially contains the path of the 
PDF

file to be indexed and a few literals that I have to add to the Solr update
request, which I can use as a filter query in Solr when I am
searching. [Category$Subcategory$pathTotheFile]

- My input sheet data may vary from a few thousand to up to 6 million lines.

- I am making a Set from these lines and dividing it into 4 chunks and 
spawning

4 threads which will prepare the Solr ContentStreamUpdateRequest request and
post it to solr.

- In this process I have these issues:

1. My system's CPU usage goes very high and the indexing is aborted.

2. If I have a "setAutoCommitWithin" it doesn't work (meaning that initially I
can find a few documents committed, after that nothing happens)

3. I have used StreamingUpdateSolrServer with queue size 20, and a thread count of
4.


4. My main aim is to boost the indexing rate [speed].

Can you suggest where I can tweak my routine?

Thanks in advance...

Surendra. 



Re: Arabic document analysis fails for HttpServer

2012-05-24 Thread Sami Siren
Hi,

there are some serious issues with encoding of the data sent to Solr
in the released 3.6.0 version of Solrj (HttpSolrServer), for example:
https://issues.apache.org/jira/browse/SOLR-3375

I believe your issue should already be fixed in the 3.6.0 branch. The
contents from that branch will eventually become solr 3.6.1.

For now I recommend you use the Commons version of the solr server (if
you need to be on a released version) or else just check out the fixed
version from the 3.6 branch.

--
 Sami Siren

On Thu, May 24, 2012 at 6:23 PM, Shane Perry  wrote:
> Hi,
>
> Upgrading from 3.5 to 3.6, the CommonsHttpSolrServer was deprecated in
> favor of HttpSolrServer.  After updating my code and running my unit
> tests, I have one test that fails.  Digging into it I found that
> FieldAnalysisRequest was returning zero tokens for the test string (which
> is Arabic).  I am at a loss for what steps to take next and would
> appreciate any direction that could be given.  I have included a unit
> test which demonstrates the behavior.
>
> My field's schema is:
>
>    
>      
>        
>        
>        
>        
>        
>      
>    
>
> I've also tried using the LegacyHTMLStripCharFilterFactory but with
> the same results.
>
> Thanks,
>
> Shane
>
>
> =
>
> public class TestServer {
>
>  private static final String HOST = "http://localhost:8080/junit-master";;
>  private static final String ARABIC_TEXT = "ﺐﻃﺎﻠﺒﻳ";
>  private static final String ARABIC_FIELD = "text";
>
>  @Test
>  public void testArabicCommonsHttpServer() throws Exception {
>    CommonsHttpSolrServer server = null;
>    try {
>      server = new CommonsHttpSolrServer(HOST);
>
>      server.setParser(new XMLResponseParser());
>    } catch (MalformedURLException ex) {
>    }
>
>    assertTrue(server != null);
>
>    List<String> tokens = analysis(analysis(server, ARABIC_FIELD, 
> ARABIC_TEXT));
>
>    assertTrue(!tokens.isEmpty());
>  }
>
>  @Test
>  public void testArabicHttpServer() throws Exception {
>    HttpSolrServer server = new HttpSolrServer(HOST);
>
>    server.setParser(new XMLResponseParser());
>
>    assertTrue(server != null);
>
>    List<String> tokens = analysis(analysis(server, ARABIC_FIELD, 
> ARABIC_TEXT));
>
>    assertTrue(!tokens.isEmpty());
>  }
>
>  private static FieldAnalysisResponse analysis(SolrServer server,
> String field, String text) {
>    FieldAnalysisResponse response = null;
>
>    FieldAnalysisRequest request = new FieldAnalysisRequest("/analysis/field").
>            addFieldName(field).
>            setFieldValue(text).
>            setQuery(text);
>
>    request.setMethod(METHOD.POST);
>
>    try {
>      response = request.process(server);
>    } catch (Exception ex) {
>    }
>
>    return response;
>  }
>
>  private static List<String> analysis(FieldAnalysisResponse response) {
>    List<String> token = new LinkedList<String>();
>
>    if (response == null) {
>      return token;
>    }
>
>    Iterator<Map.Entry<String, FieldAnalysisResponse.Analysis>> iterator = response.
>            getAllFieldNameAnalysis().iterator();
>    if (iterator.hasNext()) {
>      Map.Entry<String, FieldAnalysisResponse.Analysis> entry = iterator.next();
>      Iterator<AnalysisPhase> phaseIterator = 
> entry.getValue().getQueryPhases().
>              iterator();
>
>      List<TokenInfo> tokens = null;
>      while (phaseIterator.hasNext()) {
>        tokens = phaseIterator.next().getTokens(); // Only need the last one
>      }
>
>      for (TokenInfo ti : tokens) {
>        token.add(ti.getText());
>      }
>    }
>
>    return token;
>  }
> }


Re: Tips on creating a custom QueryCache?

2012-05-24 Thread Aaron Daubman
Thanks for the reply,

Do you have any pointers to relevant Docs or Examples that show how this
should be chained together?

Thanks again,
 Aaron

On Thu, May 24, 2012 at 3:03 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Perhaps this could be a custom SearchComponent that's run before the usual
> QueryComponent?
> This component would be responsible for loading queries, executing them,
> caching results, and for returning those results when these queries are
> encountered later on.
>
> Otis
>
> >
> > From: Aaron Daubman 
> >Subject: Tips on creating a custom QueryCache?
> >
> >Greetings,
> >
> >I'm looking for pointers on where to start when creating a
> >custom QueryCache.
> >Our usage patterns are possibly a bit unique, so let me explain the
> desired
> >use case:
> >
> >Our Solr index is read-only except for dedicated periods where it is
> >updated and re-optimized.
> >
> >On startup, I would like to create a specific QueryCache that would cache
> >the top ~20,000 (arbitrary but large) queries. This cache should never
> >evict entries, and, after the "warming process" to populate, should never
> >be added to either.
> >
> >The warming process would be to run through the (externally determined)
> >list of anticipated top X (say 20,000) queries and cache these results.
> >
> >This cache would then be used for the duration of the solr run-time (until
> >the period, perhaps daily, where the index is updated and re-optimized, at
> >which point the cache would be re-created)
> >
> >Where should I begin looking to implement such a cache?
> >
> >The reason for this somewhat different approach to caching is that we may
> >get any number of odd queries throughout the day for which performance
> >isn't important, and we don't want any of these being added to the cache
> or
> >evicting other entries from the cache. We need to ensure high performance
> >for this pre-determined list of queries only (while still handling other
> >arbitrary queries, if not as quickly)
> >
> >Thanks,
> >  Aaron
>


Re: Solr 4.0 Distributed Concurrency Control Mechanism?

2012-05-24 Thread Nicholas Ball

Thanks for the link, will investigate further. At the outset though, it
looks as though it's not what we want to be going towards.
Also note that it's not open-sourced (other than Solandra, which hasn't
been updated in ages: https://github.com/tjake/Solandra).

Rather than build on top of Cassandra, the new NRT + transaction log Solr
features really make it more of a possibility to make Solr into a
NoSQL-like system and possibly with better transactional guarantees than
NoSQL!

Speaking to yonik has given me more information on this. Currently, there
is an optimistic lock-free mechanism on a per-document basis only as for
most, documents only live on a single logical shard. It essentially checks
the _version_ you send in for a document against the latest version for the
document it has.
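
A hedged sketch of that per-document check from the client side, assuming the
_version_-based optimistic concurrency behaves as in later 4.x releases (an
update carrying a stale positive _version_ is rejected with a conflict error
instead of being applied):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class VersionedUpdate {
  public static void update(SolrServer server, String id, String newValue,
                            long lastSeenVersion) throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id);
    doc.addField("value", newValue);
    // positive _version_: only apply the update if the stored version still matches
    doc.addField("_version_", lastSeenVersion);
    try {
      server.add(doc);
    } catch (Exception conflict) {
      // another writer got there first: re-read the document and retry
    }
  }
}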

I propose an additional feature to this for those who want to have such
guarantees spanning over multiple documents living on various shards. In my
use-case, I have shards holding documents that point to other shards. In
this case, an update would need to be an atomic transaction spanning over
various documents on various shards. Would anyone object to having this
functionality added to Solr if I were to contribute it?

Many thanks,
Nicholas

On Thu, 24 May 2012 08:16:25 -0700, Walter Underwood
 wrote:
> You should take a look at what DataStax has already done with Solr and
> Cassandra.
> 
> http://www.datastax.com/dev/blog/cassandra-with-solr-integration-details
> 
> wunder
> 
> On May 24, 2012, at 7:50 AM, Nicholas Ball wrote:
> 
>> 
>> Hey all,
>> 
>> I've been working on a SOLR set up with some heavy customization (using
>> the adminHandler as a way into the system) for a research project @
>> Imperial College London, however I now see there has been a substantial
>> push towards a NoSQL.  For this, there needs to be some kind of
>> optimistic
>> fine-grained concurrency control on updates. As we have document
>> versioning
>> in-built into Lucene (and therefore Solr) this shouldn't be too
>> difficult,
>> however the push has been more of a focus on single core optimistic
>> LOCKING.
>> 
>> I would like to take this toward a multi-core (and multi-node)
>> distributed
>> optimistic lock-free mechanism. This is gives us the ability to provide
>> stronger guarantees than NoSQL wrt distributed transaction isolation
and
>> as
>> we can now do soft-commits, we can also provide specific version
>> rollbacks
>> (http://java.dzone.com/articles/exploring-transactional-0). Some more
>> interesting reading on this topic: (read-)snapshot isolation
>> (http://pages.cs.wisc.edu/~cs764-1/critique.pdf) and even stronger
>> guarantees with a slight performance hit with write-snapshot isolation
>> (http://www.fever.ch/usbkey_eurosys12/papers/p155-yabandehA.pdf).
People
>> are starting to realize that we don't have to sacrifice guarantees for
>> better performance and scalability (like NoSQL) but rather relax them
>> very
>> minimally.
>> 
>> What I need is for someone to shed some light on this feature and the
>> future plans of Solr wrt this is? Am I correct in thinking that a
>> multiversion concurrency control (MVCC) locking mechanism now exist for
a
>> single core or is it lock-free and multi-core?
>> 
>> Many thanks,
>> Nicholas Ball (aka incunix)
> 
> --
> Walter Underwood
> wun...@wunderwood.org


Re: Index-time field boost with DIH

2012-05-24 Thread Chamnap Chhorn
Thanks for your reply.

I need to boost at the document level and at the field level as well. Only
queries that match certain fields would get the boost.

In DIH, there is $docBoost (boost at the document level), but no documentation
about field-boost at all.

On Thu, May 24, 2012 at 10:32 PM, Walter Underwood wrote:

> If you want different boosts for different documents, then use the "boost"
> parameter in edismax. You can store the factor in a field, then use it to
> affect the score.
>
> If you store it in a field named "docboost", you could use this in an
> edismax config in your solrconfig.xml.
>
>   log(max(docboost,1))
>
> This will be multiplied into the score for each document. I use the max()
> function to avoid problems with zero and negative values.
>
> wunder
>
> On May 24, 2012, at 8:19 AM, Chamnap Chhorn wrote:
>
> > I need to do index-time field boosting because the client buy position
> > asset. Therefore, some document when matched are more important than
> > others. That's what index time boost does, right?
> >
> > On Thu, May 24, 2012 at 10:10 PM, Walter Underwood <
> wun...@wunderwood.org>wrote:
> >
> >> Why? Query-time boosting is fast and more flexible.
> >>
> >> wunder
> >> Search Guy, Netflix & Chegg
> >>
> >> On May 24, 2012, at 6:11 AM, Chamnap Chhorn wrote:
> >>
> >>> Anyone could help me? I really need index-time field-boosting.
> >>>
> >>> On Thu, May 24, 2012 at 4:21 PM, Chamnap Chhorn <
> chamnapchh...@gmail.com
> >>> wrote:
> >>>
>  Hi all,
> 
>  I want to do index-time boost field on DIH. Is there any way to do
> >> this? I
>  see on this documentation, there is only $docBoost. How about field
> >> boost?
>  Is it possible?
> 
>  Thanks
>  http://chamnap.github.com/
> 
> >>>
> >>>
> >>>
> >>> --
> >>> Chhorn Chamnap
> >>> http://chamnapchhorn.blogspot.com/
> >>
> >>
> >>
> >>
> >>
> >
> >
> > --
> > Chhorn Chamnap
> > http://chamnapchhorn.blogspot.com/
>
> --
> Walter Underwood
> wun...@wunderwood.org
>
>
>
>


-- 
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/


Re: Index-time field boost with DIH

2012-05-24 Thread Walter Underwood
If you want different boosts for different documents, then use the "boost" 
parameter in edismax. You can store the factor in a field, then use it to 
affect the score.

If you store it in a field named "docboost", you could use this in an edismax 
config in your solrconfig.xml.

   log(max(docboost,1))

This will be multiplied into the score for each document. I use the max() 
function to avoid problems with zero and negative values.
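
A hedged SolrJ version of the same idea, passing the boost function per
request instead of (or in addition to) configuring it in solrconfig.xml; the
query fields are placeholders:

import org.apache.solr.client.solrj.SolrQuery;

public class DocBoostQuery {
  public static SolrQuery build(String userQuery) {
    SolrQuery q = new SolrQuery(userQuery);
    q.set("defType", "edismax");
    q.set("qf", "title description");
    // multiplied into each document's score; max() guards against 0 and negatives
    q.set("boost", "log(max(docboost,1))");
    return q;
  }
}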

wunder

On May 24, 2012, at 8:19 AM, Chamnap Chhorn wrote:

> I need to do index-time field boosting because the client buy position
> asset. Therefore, some document when matched are more important than
> others. That's what index time boost does, right?
> 
> On Thu, May 24, 2012 at 10:10 PM, Walter Underwood 
> wrote:
> 
>> Why? Query-time boosting is fast and more flexible.
>> 
>> wunder
>> Search Guy, Netflix & Chegg
>> 
>> On May 24, 2012, at 6:11 AM, Chamnap Chhorn wrote:
>> 
>>> Anyone could help me? I really need index-time field-boosting.
>>> 
>>> On Thu, May 24, 2012 at 4:21 PM, Chamnap Chhorn >> wrote:
>>> 
 Hi all,
 
 I want to do index-time boost field on DIH. Is there any way to do
>> this? I
 see on this documentation, there is only $docBoost. How about field
>> boost?
 Is it possible?
 
 Thanks
 http://chamnap.github.com/
 
>>> 
>>> 
>>> 
>>> --
>>> Chhorn Chamnap
>>> http://chamnapchhorn.blogspot.com/
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> Chhorn Chamnap
> http://chamnapchhorn.blogspot.com/

--
Walter Underwood
wun...@wunderwood.org





Re: Dismax, explicit phrase queries not on qf

2012-05-24 Thread Jack Krupansky
The "pf" fields are used for implicit phrase queries to do "implicit phrase 
proximity boosting" and don't relate at all to explicit phrase queries. I 
don't think there is any way to control the fields for explicit phrase 
queries separate from non-phrase term queries.


-- Jack Krupansky

-Original Message- 
From: Markus Jelsma

Sent: Thursday, May 24, 2012 8:53 AM
To: solr-user@lucene.apache.org
Subject: Dismax, explicit phrase queries not on qf

Hi,

With (e)dismax explicit phrase queries are executed on the qf fields. The qf 
field, however, may contain field(s) we don't want a phrase query for. How 
can we tell the dismax query parser to only do phrase queries (explicit or 
not) on the fields listed in the pf parameter.


Thanks
Markus 



Re: how to reduce the result size to 2-3 lines and expand based on user interest

2012-05-24 Thread srini
Hi iorixxx,

Just wondering if you have any suggestions!!! The other thing I tried using
following url and the results returned same way as they were (no trimming of
description to 300 chars). not sure if it is because of config file
settings.


http://localhost:8983/solr/browse?&hl=true&hl.fl=DESCRIPTION&hl.maxAnalyzedChars=0&f.DESCRIPTION.hl.alternateField=DESCRIPTION&f.DESCRIPTION.hl.maxAlternateFieldLength=300

Thanks in Advance...
Srini

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-reduce-the-result-size-to-2-3-lines-and-expand-based-on-user-interest-tp3985692p3985945.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index-time field boost with DIH

2012-05-24 Thread Chamnap Chhorn
I need to do index-time field boosting because the client buys position
assets. Therefore, some documents, when matched, are more important than
others. That's what index-time boost does, right?

On Thu, May 24, 2012 at 10:10 PM, Walter Underwood wrote:

> Why? Query-time boosting is fast and more flexible.
>
> wunder
> Search Guy, Netflix & Chegg
>
> On May 24, 2012, at 6:11 AM, Chamnap Chhorn wrote:
>
> > Anyone could help me? I really need index-time field-boosting.
> >
> > On Thu, May 24, 2012 at 4:21 PM, Chamnap Chhorn  >wrote:
> >
> >> Hi all,
> >>
> >> I want to do index-time boost field on DIH. Is there any way to do
> this? I
> >> see on this documentation, there is only $docBoost. How about field
> boost?
> >> Is it possible?
> >>
> >> Thanks
> >> http://chamnap.github.com/
> >>
> >
> >
> >
> > --
> > Chhorn Chamnap
> > http://chamnapchhorn.blogspot.com/
>
>
>
>
>


-- 
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/


Re: Solr 4.0 Distributed Concurrency Control Mechanism?

2012-05-24 Thread Walter Underwood
You should take a look at what DataStax has already done with Solr and 
Cassandra.

http://www.datastax.com/dev/blog/cassandra-with-solr-integration-details

wunder

On May 24, 2012, at 7:50 AM, Nicholas Ball wrote:

> 
> Hey all,
> 
> I've been working on a SOLR set up with some heavy customization (using
> the adminHandler as a way into the system) for a research project @
> Imperial College London, however I now see there has been a substantial
> push towards a NoSQL.  For this, there needs to be some kind of optimistic
> fine-grained concurrency control on updates. As we have document versioning
> in-built into Lucene (and therefore Solr) this shouldn't be too difficult,
> however the push has been more of a focus on single core optimistic
> LOCKING.
> 
> I would like to take this toward a multi-core (and multi-node) distributed
> optimistic lock-free mechanism. This is gives us the ability to provide
> stronger guarantees than NoSQL wrt distributed transaction isolation and as
> we can now do soft-commits, we can also provide specific version rollbacks
> (http://java.dzone.com/articles/exploring-transactional-0). Some more
> interesting reading on this topic: (read-)snapshot isolation
> (http://pages.cs.wisc.edu/~cs764-1/critique.pdf) and even stronger
> guarantees with a slight performance hit with write-snapshot isolation
> (http://www.fever.ch/usbkey_eurosys12/papers/p155-yabandehA.pdf). People
> are starting to realize that we don't have to sacrifice guarantees for
> better performance and scalability (like NoSQL) but rather relax them very
> minimally.
> 
> What I need is for someone to shed some light on this feature and the
> future plans of Solr wrt this is? Am I correct in thinking that a
> multiversion concurrency control (MVCC) locking mechanism now exist for a
> single core or is it lock-free and multi-core?
> 
> Many thanks,
> Nicholas Ball (aka incunix)

--
Walter Underwood
wun...@wunderwood.org





Re: Index-time field boost with DIH

2012-05-24 Thread Walter Underwood
Why? Query-time boosting is fast and more flexible. 

wunder
Search Guy, Netflix & Chegg

On May 24, 2012, at 6:11 AM, Chamnap Chhorn wrote:

> Anyone could help me? I really need index-time field-boosting.
> 
> On Thu, May 24, 2012 at 4:21 PM, Chamnap Chhorn 
> wrote:
> 
>> Hi all,
>> 
>> I want to do index-time boost field on DIH. Is there any way to do this? I
>> see on this documentation, there is only $docBoost. How about field boost?
>> Is it possible?
>> 
>> Thanks
>> http://chamnap.github.com/
>> 
> 
> 
> 
> -- 
> Chhorn Chamnap
> http://chamnapchhorn.blogspot.com/






Solr 4.0 Distributed Concurrency Control Mechanism?

2012-05-24 Thread Nicholas Ball

Hey all,

I've been working on a SOLR set up with some heavy customization (using
the adminHandler as a way into the system) for a research project @
Imperial College London, however I now see there has been a substantial
push towards a NoSQL.  For this, there needs to be some kind of optimistic
fine-grained concurrency control on updates. As we have document versioning
in-built into Lucene (and therefore Solr) this shouldn't be too difficult,
however the push has been more of a focus on single core optimistic
LOCKING.

I would like to take this toward a multi-core (and multi-node) distributed
optimistic lock-free mechanism. This gives us the ability to provide
stronger guarantees than NoSQL wrt distributed transaction isolation and as
we can now do soft-commits, we can also provide specific version rollbacks
(http://java.dzone.com/articles/exploring-transactional-0). Some more
interesting reading on this topic: (read-)snapshot isolation
(http://pages.cs.wisc.edu/~cs764-1/critique.pdf) and even stronger
guarantees with a slight performance hit with write-snapshot isolation
(http://www.fever.ch/usbkey_eurosys12/papers/p155-yabandehA.pdf). People
are starting to realize that we don't have to sacrifice guarantees for
better performance and scalability (like NoSQL) but rather relax them very
minimally.

What I need is for someone to shed some light on this feature and the
future plans of Solr wrt this is? Am I correct in thinking that a
multiversion concurrency control (MVCC) locking mechanism now exists for a
single core, or is it lock-free and multi-core?

Many thanks,
Nicholas Ball (aka incunix)


need to verify my understanding of default value of mm (minimum match) for edismax

2012-05-24 Thread geeky2
environment: solr 3.5
default operator is OR

i want to make sure i understand how the mm param (minimum match) works for
the edismax parser

http://wiki.apache.org/solr/ExtendedDisMax?highlight=%28dismax%29#mm_.28Minimum_.27Should.27_Match.29

it looks like the rule is 100% of the terms must match across the fields,
unless i override this with the mm=x param - do i have this right?

what i am seeing is a query that matches on:

q=singer sewing 9010

will fail if it is changed to:

q=singer sewing machine 9010

for the second query - if i add mm=3 - then it comes back with results

thank you


--
View this message in context: 
http://lucene.472066.n3.nabble.com/need-to-verify-my-understanding-of-default-value-of-mm-minimum-match-for-edismax-tp3985936.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: problem on running fullimport

2012-05-24 Thread Dyer, James
On your <dataSource> tag, specify "batchSize" with a value that your 
db/driver allows.  The hardcoded default is 500.  If you set it to -1 it 
converts it to Integer.MIN_VALUE.  See 
http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource , 
which recommends using this -1 value in the case of errors.

When executing queries, JdbcDataSource calls Statement.setFetchSize(batchSize) 
.  Based on the java.sql.Statement api documentation, you can set this to 0 to 
retain your db/driver's default.  See 
http://docs.oracle.com/javase/6/docs/api/java/sql/Statement.html#setFetchSize%28int%29
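
For illustration only, this is roughly what those batchSize values translate
to in plain JDBC (connection details and the query are placeholders; the
Integer.MIN_VALUE trick is specific to MySQL's Connector/J, which only
streams rows one at a time with that setting on a forward-only, read-only
statement):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeDemo {
  public static void main(String[] args) throws Exception {
    Connection conn =
        DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass");
    Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                          ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);   // what DIH's batchSize="-1" becomes
    // stmt.setFetchSize(500);              // DIH's hardcoded default batchSize
    // stmt.setFetchSize(0);                // keep the driver's own default
    ResultSet rs = stmt.executeQuery("SELECT id, name FROM item");
    while (rs.next()) {
      // ... hand each row to the indexing code ...
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}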

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: pla [mailto:patrick.archib...@gmail.com] 
Sent: Thursday, May 24, 2012 9:10 AM
To: solr-user@lucene.apache.org
Subject: Re: problem on running fullimport

Thanks Alexey Serba. I encountered the *java.sql.SQLException: Illegal value
for setFetchSize()* error after upgrading one of my servers to MySQL version
5.5.22. 

PLA

--
View this message in context: 
http://lucene.472066.n3.nabble.com/problem-on-running-fullimport-tp1707206p3985924.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Creating custom Filter / Tokenizer / Request Handler for integration of NER-Framework

2012-05-24 Thread Jan Høydahl
As Ahmet says, The Update Chain is probably the place to integrate such 
document oriented processing.
See http://www.cominvent.com/2011/04/04/solr-architecture-diagram/ for how it 
integrates with Solr.
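
A rough solrconfig.xml sketch of such a chain; the NER processor factory class is a placeholder for whatever wrapper you write around your framework, and the parameter names are made up:

<updateRequestProcessorChain name="ner">
  <processor class="com.example.ner.NerAnnotationProcessorFactory">
    <str name="sourceField">content</str>
    <str name="targetField">annotations</str>  <!-- a multivalued field in schema.xml -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Point the handler that receives the documents (e.g. /update/extract) at it with update.chain=ner (older releases call the parameter update.processor).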

--
Jan Høydahl, search solution architect
Cominvent AS - www.facebook.com/Cominvent
Solr Training - www.solrtraining.com

On 24. mai 2012, at 14:04, Wunderlich, Tobias wrote:

> Hey Guys,
> 
> I am recently working on a project to integrate a 
> Named-Entity-Recognition-Framework (NER) in an existing searchplatform based 
> on Solr. The Platform uses ManifoldCF to automatically gather the content 
> from various repositories. The NER-Framework creates Annotations/Metadata 
> from given content which I then want to integrate into the search-platform as 
> metadata to use for faceting. Since MCF handles all content gathering, I need 
> a way to integrate the NER-Framework directly into Solr. The Goal is to get 
> all Annotations per document into a multivalued field.  My first thought was 
> to create a custom filter, which just takes the content and gives back only 
> the Annotations.  But as I understand it, a filter only processes 
> predetermined Tokens, which is useless for my purpose, since the 
> NER-Framework needs to process the whole content of a document. What about a 
> custom Tokenizer? Would it be possible to process the whole text and give 
> back only the Annotations as Tokens? A third thought was to manipulate the 
> ExtractRequestHandler (Solr Cell) used by MCF to somehow add the Annotations 
> as Metadata when the content and metadata is distributed to the different 
> fields.
> 
> I hope my problem description is sufficient. Does anybody have any thoughts 
> on that subject?
> 
> Best regards,
> Tobias



DIH using a connection pool

2012-05-24 Thread Esteban Donato
Hi community,

I am using Solr with DIH to index content from a DB.  The point is
that I have to configure DIH to check for changes in the DB very
frequently (approx. 1 sec) to keep the index almost up-to-date.  I
noted that JdbcDataSource closes the DB connection after every
execution, which is not acceptable at this update rate.  Ideally I
would need DIH to use a connection pool.  Looking at the DIH code and FAQ I
noticed that I can configure a connection pool and expose it via JNDI
for the JdbcDataSource to use.  My question is: is this the way to
go for integrating a connection pool with DIH?

Thanks
Esteban
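
For reference, the JNDI route looks roughly like this, assuming Tomcat, a MySQL pool, and a DIH version whose JdbcDataSource supports the jndiName attribute; all names and credentials are placeholders:

<!-- Tomcat context.xml: container-managed connection pool -->
<Resource name="jdbc/solrDih" auth="Container" type="javax.sql.DataSource"
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost/mydb"
          username="dbuser" password="dbpass"
          maxActive="8" maxIdle="4"/>

<!-- data-config.xml: point DIH at the pool instead of a raw JDBC URL -->
<dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/solrDih"/>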


Accent Characters

2012-05-24 Thread couto.vicente
Hello All.
I'm a newbie in Solr and I have seen this subject come up a lot, but no answer was
satisfactory, or (probably) I don't know how to properly set up the Solr
environment.
I indexed documents in Solr with a French content field. I used the field
type "text_fr" that comes with the solr schema.xml file.



My spellchecker is almost the same as the one that comes with solrconfig.xml:


  default
  content
  spellchecker
  
  


When I try any search query, with or without accented words, the
results are pretty fine.
But if I try spell checking or even a facet query, it looks like Solr is
ignoring the words with accents.
I Googled a lot but could not find any satisfactory fix.

Can anyone give me a help?

Thank you!
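
For readers: the stock spellcheck component that the three surviving values above (default, content, spellchecker) belong to looks roughly like the snippet below; this is a sketch of the example solrconfig.xml, not necessarily the poster's exact configuration:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">content</str>
    <str name="spellcheckIndexDir">spellchecker</str>
  </lst>
</searchComponent>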


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: List of recommendation engines with solr

2012-05-24 Thread Óscar Marín Miró
Hello Paul, Mahout is a machine learning [clustering & classification] and
recommendation library, with Hadoop integration.

So, the answer is yes, it qualifies as a recommender engine on its own (with
no other libs), scalable through Hadoop.

On Tue, Mar 13, 2012 at 9:23 AM, Paul Libbrecht  wrote:

>
> Just out of curiosity,
>
> does Mahout qualify as a recommender-engine, or is it rather a library for
> it with (potentially open-source) recommenders built on it, with a more
> specific purpose?
>
> The page:
>https://cwiki.apache.org/MAHOUT/powered-by-mahout.html
> does not seem to list many open-source tools (or maybe some are?).
>
> thanks in advance
>
> Paul
>
> Le 13 mars 2012 à 05:07, Rohan a écrit :
>
> > Hi Gora,
> >
> > Thanks a lot for your valuable comments, really appreciated.
> > Yeah , You got me correctly I am exactly  looking for "Mahout" as I am
>  using Java as my business layer with Apache solr.
> >
> > Thanks,
> > Rohan
> >
> > From: Gora Mohanty-3 [via Lucene] [mailto:
> ml-node+s472066n3819480...@n3.nabble.com]
> > Sent: Monday, March 12, 2012 8:28 PM
> > To: Rohan Ashok Kumbhar
> > Subject: Re: List of recommendation engines with solr
> >
> > On 12 March 2012 16:30, Rohan <[hidden
> email]> wrote:
> >> Hi All,
> >>
> >> I would require list of recs engine which can be integrated with solr
> and
> >> also suggest best one out of this.
> >>
> >> any comments would be appriciated!!
> >
> > What exactly do you mean by that? Why is integration with Solr
> > a requirement, and what do you expect to gain by such an integration?
> > "Best" also probably depends on the context of your requirements.
> >
> > There are a variety of open-source recommendation engines.
> > If you are looking at something from Apache, and in Java, Mahout
> > might be a good choice.
> >
> > Regards,
> > Gora
> >
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/List-of-recommendation-engines-with-solr-tp3818917p3821268.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Whether it's science, technology, personal experience, true love,
astrology, or gut feelings, each of us has confidence in something that we
will never fully comprehend.
 --Roy H. William


Re: shard distribution of multiple collections in SolrCloud

2012-05-24 Thread Tommaso Teofili
2012/5/24 Mark Miller 

> I don't think there is yet - my fault, did not realize - we should make
> one.
>
> I've been messing around with some early stuff, but I'm still unsure about
> some things. Might just put in something simple to start though.
>

sure, I'll take a look and try to help there.
Tommaso


>
>
> On May 24, 2012, at 4:39 AM, Tommaso Teofili wrote:
>
> > 2012/5/23 Mark Miller 
> >
> >> Yeah, currently you have to create the core on each node...we are
> working
> >> on a 'collections' api that will make this a simple one call operation.
> >>
> >
> > Mark, is there a Jira for that yet?
> > Tomamso
> >
> >
> >>
> >> We should have this soon.
> >>
> >> - Mark
> >>
> >> On May 23, 2012, at 2:36 PM, Daniel Brügge wrote:
> >>
> >>> Hi,
> >>>
> >>> i am creating several cores using the following script. I use this for
> >>> testing SolrCloud and to learn about the distribution of multiple
> >>> collections.
> >>>
> >>> max=500
>  for ((i=2; i<=$max; ++i )) ;
>  do
>    curl "
> 
> >>
> http://solrinstance1:8983/solr/admin/cores?action=CREATE&name=collection$i&collection=collection$i&collection.configName=myconfig
>  "
>  done
> >>>
> >>>
> >>> I've setup a SolrCloud with 2 shards which are each replicated by 2
> other
> >>> instances I start.
> >>>
> >>> When I first start the installation I have the default "collection1" in
> >>> place which is sharded over shard1 and shard2 with 2 leader nodes and 2
> >>> nodes which replicate the leaders.
> >>>
> >>> When I run this script above which calls the Coreadmin on one of the
> >>> shards, all the collections are created on only this shard without a
> >>> replica. So e.g.
> >>>
> >>>
> >>> "collection8":{"shard1":{"solrinstance1:8983_solr_collection8":{
> >>>   "shard":"shard1",
> >>>   "leader":"true",
> >>>   "state":"active",
> >>>   "core":"collection8",
> >>>   "collection":"collection8",
> >>>   "node_name":"solrinstance1:8983_solr",
> >>>
> >>>   "base_url":"http://solrinstance1:8983/solr"}}}
> >>>
> >>>
> >>> I always thought, that via zookeeper these collections are sharded and
> >>> replicated or do I need to call on each node the create core action?
> But
> >>> then I need to know about these nodes, right?
> >>>
> >>>
> >>> Thanks & regards
> >>>
> >>> Daniel
> >>
> >> - Mark Miller
> >> lucidimagination.com
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>


RE: List of recommendation engines with solr

2012-05-24 Thread Rohan
HI ,

Sorry, I have no idea as I never worked on this.

Thanks,
Rohan

From: Trev [via Lucene] [mailto:ml-node+s472066n3985922...@n3.nabble.com]
Sent: Thursday, May 24, 2012 7:37 PM
To: Rohan Ashok Kumbhar
Subject: Re: List of recommendation engines with solr

Have you heard of NG Data with their product called Lily?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-recommendation-engines-with-solr-tp3818917p3985927.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: shard distribution of multiple collections in SolrCloud

2012-05-24 Thread Mark Miller
I don't think there is yet - my fault, did not realize - we should make one. 

I've been messing around with some early stuff, but I'm still unsure about some 
things. Might just put in something simple to start though.


On May 24, 2012, at 4:39 AM, Tommaso Teofili wrote:

> 2012/5/23 Mark Miller 
> 
>> Yeah, currently you have to create the core on each node...we are working
>> on a 'collections' api that will make this a simple one call operation.
>> 
> 
> Mark, is there a Jira for that yet?
> Tomamso
> 
> 
>> 
>> We should have this soon.
>> 
>> - Mark
>> 
>> On May 23, 2012, at 2:36 PM, Daniel Brügge wrote:
>> 
>>> Hi,
>>> 
>>> i am creating several cores using the following script. I use this for
>>> testing SolrCloud and to learn about the distribution of multiple
>>> collections.
>>> 
>>> max=500
 for ((i=2; i<=$max; ++i )) ;
 do
   curl "
 
>> http://solrinstance1:8983/solr/admin/cores?action=CREATE&name=collection$i&collection=collection$i&collection.configName=myconfig
 "
 done
>>> 
>>> 
>>> I've setup a SolrCloud with 2 shards which are each replicated by 2 other
>>> instances I start.
>>> 
>>> When I first start the installation I have the default "collection1" in
>>> place which is sharded over shard1 and shard2 with 2 leader nodes and 2
>>> nodes which replicate the leaders.
>>> 
>>> When I run this script above which calls the Coreadmin on one of the
>>> shards, all the collections are created on only this shard without a
>>> replica. So e.g.
>>> 
>>> 
>>> "collection8":{"shard1":{"solrinstance1:8983_solr_collection8":{
>>>   "shard":"shard1",
>>>   "leader":"true",
>>>   "state":"active",
>>>   "core":"collection8",
>>>   "collection":"collection8",
>>>   "node_name":"solrinstance1:8983_solr",
>>> 
>>>   "base_url":"http://solrinstance1:8983/solr"}}}
>>> 
>>> 
>>> I always thought, that via zookeeper these collections are sharded and
>>> replicated or do I need to call on each node the create core action? But
>>> then I need to know about these nodes, right?
>>> 
>>> 
>>> Thanks & regards
>>> 
>>> Daniel
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 

- Mark Miller
lucidimagination.com













Re: problem on running fullimport

2012-05-24 Thread pla
Thanks Alexey Serba. I encountered the *java.sql.SQLException: Illegal value
for setFetchSize()* error after upgrading one of my servers to MySQL version
5.5.22. 

PLA

--
View this message in context: 
http://lucene.472066.n3.nabble.com/problem-on-running-fullimport-tp1707206p3985924.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Creating custom Filter / Tokenizer / Request Handler for integration of NER-Framework

2012-05-24 Thread Ahmet Arslan

> I am recently working on a project to integrate a
> Named-Entity-Recognition-Framework (NER) in an existing
> searchplatform based on Solr. The Platform uses ManifoldCF
> to automatically gather the content from various
> repositories. The NER-Framework creates Annotations/Metadata
> from given content which I then want to integrate into the
> search-platform as metadata to use for faceting. Since MCF
> handles all content gathering, I need a way to integrate the
> NER-Framework directly into Solr. The Goal is to get all
> Annotations per document into a multivalued field.  My
> first thought was to create a custom filter, which just
> takes the content and gives back only the Annotations. 
> But as I understand it, a filter only processes
> predetermined Tokens, which is useless for my purpose, since
> the NER-Framework needs to process the whole content of a
> document. What about a custom Tokenizer? Would it be
> possible to process the whole text and give back only the
> Annotations as Tokens? A third thought was to manipulate the
> ExtractRequestHandler (Solr Cell) used by MCF to somehow add
> the Annotations as Metadata when the content and metadata is
> distributed to the different fields.
> 
> I hope my problem description is sufficient. Does anybody
> have any thoughts on that subject?

UpdateRequestProcessor is more appropriate in this case. Like 
http://wiki.apache.org/solr/SolrUIMA


Re: Index-time field boost with DIH

2012-05-24 Thread Chamnap Chhorn
Could anyone help me? I really need index-time field boosting.

On Thu, May 24, 2012 at 4:21 PM, Chamnap Chhorn wrote:

> Hi all,
>
> I want to do index-time boost field on DIH. Is there any way to do this? I
> see on this documentation, there is only $docBoost. How about field boost?
> Is it possible?
>
> Thanks
> http://chamnap.github.com/
>



-- 
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/
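
For readers: the wiki's $docBoost only boosts the whole document, not a single field. A rough data-config.xml sketch of setting it from a script transformer (datasource, entity, query and boost value are all made up), which does not give per-field boosting:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb"/>
  <script><![CDATA[
    function boostRow(row) {
      row.put('$docBoost', 2.5);  // whole-document boost only
      return row;
    }
  ]]></script>
  <document>
    <entity name="book" query="select id, title from books"
            transformer="script:boostRow"/>
  </document>
</dataConfig>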


Dismax, explicit phrase queries not on qf

2012-05-24 Thread Markus Jelsma
Hi,

With (e)dismax, explicit phrase queries are executed on the qf fields. The qf
parameter, however, may contain field(s) we don't want a phrase query for. How can
we tell the dismax query parser to only run phrase queries (explicit or not) on
the fields listed in the pf parameter?

Thanks
Markus
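
For readers, the two parameters being contrasted, with made-up field names:

defType=edismax
qf=title^2 body exact_id   (query terms, and explicit "..." phrases, are matched against these)
pf=title^4 body            (the whole query is additionally applied as an implicit phrase here, for boosting)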


Creating custom Filter / Tokenizer / Request Handler for integration of NER-Framework

2012-05-24 Thread Wunderlich, Tobias
Hey Guys,

I am currently working on a project to integrate a
Named-Entity-Recognition framework (NER) into an existing search platform based on
Solr. The platform uses ManifoldCF to automatically gather the content from 
various repositories. The NER-Framework creates Annotations/Metadata from given 
content which I then want to integrate into the search-platform as metadata to 
use for faceting. Since MCF handles all content gathering, I need a way to 
integrate the NER-Framework directly into Solr. The Goal is to get all 
Annotations per document into a multivalued field.  My first thought was to 
create a custom filter, which just takes the content and gives back only the 
Annotations.  But as I understand it, a filter only processes predetermined 
Tokens, which is useless for my purpose, since the NER-Framework needs to 
process the whole content of a document. What about a custom Tokenizer? Would 
it be possible to process the whole text and give back only the Annotations as 
Tokens? A third thought was to manipulate the ExtractRequestHandler (Solr Cell) 
used by MCF to somehow add the Annotations as Metadata when the content and 
metadata is distributed to the different fields.

I hope my problem description is sufficient. Does anybody have any thoughts on 
that subject?

Best regards,
Tobias


Re: How many <doc> in the XML source file before indexing?

2012-05-24 Thread Bruno Mannina

humm... ok I will do the test as soon as I receive the database.

Thx a lot !

Le 24/05/2012 13:29, Michael Kuhlmann a écrit :

Just try it!

Maybe you're lucky, and it works with 80M docs. If each document takes 
100 k, then it only needs 8 GB memory for indexing.


However, I doubt it. I've not been too deeply into the UpdateHandler 
yet, but I think it first needs to parse the complete XML file before 
it starts to index.


But that worst thing that can happen is an OOM exception. And when you 
need to split the xml files, then you can split into smaller chunks as 
well.


Just a note: In Solr, you're always updating, even in the first 
indexation. There's no difference between updates and inserts.


Greetings,
Michael

Am 24.05.2012 12:37, schrieb Bruno Mannina:

In fact it's not for an update but only for the first indexation.

I mean, I will receive the full database with around 80M docs in some
XML files (one per country in the world).
 From these 80M docs I will generate right XML format for each doc. (I
don't need all fields from the source)

And as actually for my test (12 000 docs), I generate one file per doc,
there is no problem.
But with 80M docs I can't generate one file per doc.

It's for this reason I asked the max number of  in a file .

For the first time, if a country file fails, no problem, I will check it
and re-generate it.

Is it bad to create a file with 5M  ?


Le 24/05/2012 11:46, Michael Kuhlmann a écrit :

There is no hard limit for the maximum nunmber of documents per update.

It's only memory dependent. The smaller each document, and the more
memory Solr can acquire, the more documents can you send in one update.

However, I wouldn't pish it too jard anyway. If you can send, say, 100
documents per update, the you won't gain much if you send 200
documents instead, or even 1000. The number of requests don't count
that much.

And, if the update fails for some reason, then the whole request will
be ignored. If you had sent 1000 documents in an update, and one of
them had a field missing, for example, then it's hard to find out
which one.

Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of  ?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of .

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this 
kind of

things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of





that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must indexed 80M docs so I can't create one xml file by doc.

thanks,
Bruno



























RE: field "name" was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Markus Jelsma


 
 
-Original message-
> From:Michael McCandless 
> Sent: Thu 24-May-2012 13:15
> To: Markus Jelsma 
> Cc: solr-user@lucene.apache.org
> Subject: Re: field "name" was indexed without position data; cannot 
> run PhraseQuery (term=a)
> 
> I believe termPositions=false refers to the term vectors and not how
> the field is indexed (which is very confusing I think...).
> 
> I think you'll need to index a separate field disabling term freqs +
> positions than the field the queryparser can query?
> 
> But ... if all of this is to just do custom scoring ... can't you just
> set a custom similarity for the field and index it normally (with term
> freq + positions).

Yes, will do that.
Thanks!

> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Thu, May 24, 2012 at 6:47 AM, Markus Jelsma
>  wrote:
> > Thanks!
> >
> > How can we, in that case, omit term frequency for a qf field? I assume the 
> > way to go is to configure a custom flat term frequency similarity for that 
> > field. And how can it be that this error is not thrown with 
> > termPosition=false for that field but only with omitTermFreqAndPositions?
> >
> > Markus
> >
> >
> > -Original message-
> >> From:Michael McCandless 
> >> Sent: Thu 24-May-2012 12:26
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: field "name" was indexed without position data; 
> >> cannot run PhraseQuery (term=a)
> >>
> >> This behavior has changed.
> >>
> >> In 3.x, you silently got no results in such cases.
> >>
> >> In trunk, you get an exception notifying you that the query cannot run.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Thu, May 24, 2012 at 6:04 AM, Markus Jelsma
> >>  wrote:
> >> > Hi,
> >> >
> >> > What is the intended behaviour for explicit phrase queries on fields 
> >> > without position data? If a (e)dismax qf parameter included a field 
> >> > omitTermFreqAndPositions=true user explicit phrase queries throw the 
> >> > following error on trunk but not on the 3x branch.
> >> >
> >> > java.lang.IllegalStateException: field "name" was indexed without 
> >> > position data; cannot run PhraseQuery (term=a)
> >> >        at 
> >> > org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:274)
> >> >        at 
> >> > org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:160)
> >> >        at 
> >> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:589)
> >> >        at 
> >> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
> >> >        at 
> >> > org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1518)
> >> >        at 
> >> > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1265)
> >> >        at 
> >> > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:384)
> >> >        at 
> >> > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:411)
> >> >        at 
> >> > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
> >> >        at 
> >> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
> >> >        at 
> >> > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
> >> >        at 
> >> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
> >> >        at 
> >> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
> >> >        at 
> >> > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
> >> >        at 
> >> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
> >> >        at 
> >> > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
> >> >        at 
> >> > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
> >> >        at 
> >> > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
> >> >        at 
> >> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
> >> >        at 
> >> > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
> >> >        at 
> >> > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
> >> >        at 
> >> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
> >> >        at 
> >> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
> >> >        at 
> >> > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
> >> >        at 
> >> > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
> >> >        at org.eclipse.jetty.server.Server.handle(Server.java:351)
> >> >        at 

Re: solr error when querying.

2012-05-24 Thread Sami Siren
Ron,

Did you actually add a new xslt file there, or did you try to use the
example one? If the latter, I believe the filename is example.xsl, not
example.xslt.

--
 Sami Siren


On Wed, May 23, 2012 at 5:30 PM, watson  wrote:
> Here is my query:
> http://127.0.0.1:/solr/JOBS/select/??q=Apache&wt=xslt&tr=example.xslt
>
> The response I get is the following.  I have example.xslt in the /conf/xslt
> path.   What is wrong here?  Thanks!
>
>
> HTTP ERROR 500
>
> Problem accessing /solr/JOBS/select/. Reason:
>
>    getTransformer fails in getContentType
>
> java.lang.RuntimeException: getTransformer fails in getContentType
>        at
> org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:72)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:326)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
>        at
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:129)
>        at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:59)
>        at
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:122)
>        at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:110)
>        at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>        at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>        at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>        at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>        at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>        at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>        at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>        at org.mortbay.jetty.Server.handle(Server.java:326)
>        at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>        at
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>        at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>        at
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> Caused by: java.io.IOException: Unable to initialize Templates
> 'example.xslt'
>        at
> org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:117)
>        at
> org.apache.solr.util.xslt.TransformerProvider.getTransformer(TransformerProvider.java:77)
>        at
> org.apache.solr.response.XSLTResponseWriter.getTransformer(XSLTResponseWriter.java:130)
>        at
> org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:69)
>        ... 23 more
> Caused by: javax.xml.transform.TransformerConfigurationException: Could not
> compile stylesheet
>        at
> com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTemplates(Unknown
> Source)
>        at
> org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:110)
>        ... 26 more
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-error-when-querying-tp3985677.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How many <doc> in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

Just try it!

Maybe you're lucky, and it works with 80M docs. If each document takes 
100 k, then it only needs 8 GB memory for indexing.


However, I doubt it. I've not looked too deeply into the UpdateHandler
yet, but I think it first needs to parse the complete XML file before it
starts to index.


But the worst thing that can happen is an OOM exception. And when you
need to split the xml files, you can split them into smaller chunks as well.


Just a note: In Solr, you're always updating, even in the first 
indexation. There's no difference between updates and inserts.


Greetings,
Michael

Am 24.05.2012 12:37, schrieb Bruno Mannina:

In fact it's not for an update but only for the first indexation.

I mean, I will receive the full database with around 80M docs in some
XML files (one per country in the world).
 From these 80M docs I will generate right XML format for each doc. (I
don't need all fields from the source)

And as actually for my test (12 000 docs), I generate one file per doc,
there is no problem.
But with 80M docs I can't generate one file per doc.

It's for this reason I asked the max number of  in a file .

For the first time, if a country file fails, no problem, I will check it
and re-generate it.

Is it bad to create a file with 5M  ?


Le 24/05/2012 11:46, Michael Kuhlmann a écrit :

There is no hard limit for the maximum nunmber of documents per update.

It's only memory dependent. The smaller each document, and the more
memory Solr can acquire, the more documents can you send in one update.

However, I wouldn't pish it too jard anyway. If you can send, say, 100
documents per update, the you won't gain much if you send 200
documents instead, or even 1000. The number of requests don't count
that much.

And, if the update fails for some reason, then the whole request will
be ignored. If you had sent 1000 documents in an update, and one of
them had a field missing, for example, then it's hard to find out
which one.

Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of  ?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of .

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of





that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must indexed 80M docs so I can't create one xml file by doc.

thanks,
Bruno























Re: field "name" was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Michael McCandless
I believe termPositions=false refers to the term vectors and not how
the field is indexed (which is very confusing I think...).

I think you'll need to index a separate field, with term freqs +
positions disabled, alongside the field the queryparser can query?

But ... if all of this is to just do custom scoring ... can't you just
set a custom similarity for the field and index it normally (with term
freq + positions).
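
A schema.xml sketch of that per-field similarity route on trunk; the factory class is a placeholder for whatever flat-term-frequency similarity you write:

<fieldType name="text_flat_tf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <similarity class="com.example.FlatTfSimilarityFactory"/>
</fieldType>

<field name="name" type="text_flat_tf" indexed="true" stored="true"/>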

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 24, 2012 at 6:47 AM, Markus Jelsma
 wrote:
> Thanks!
>
> How can we, in that case, omit term frequency for a qf field? I assume the 
> way to go is to configure a custom flat term frequency similarity for that 
> field. And how can it be that this error is not thrown with 
> termPosition=false for that field but only with omitTermFreqAndPositions?
>
> Markus
>
>
> -Original message-
>> From:Michael McCandless 
>> Sent: Thu 24-May-2012 12:26
>> To: solr-user@lucene.apache.org
>> Subject: Re: field "name" was indexed without position data; 
>> cannot run PhraseQuery (term=a)
>>
>> This behavior has changed.
>>
>> In 3.x, you silently got no results in such cases.
>>
>> In trunk, you get an exception notifying you that the query cannot run.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, May 24, 2012 at 6:04 AM, Markus Jelsma
>>  wrote:
>> > Hi,
>> >
>> > What is the intended behaviour for explicit phrase queries on fields 
>> > without position data? If a (e)dismax qf parameter included a field 
>> > omitTermFreqAndPositions=true user explicit phrase queries throw the 
>> > following error on trunk but not on the 3x branch.
>> >
>> > java.lang.IllegalStateException: field "name" was indexed without position 
>> > data; cannot run PhraseQuery (term=a)
>> >        at 
>> > org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:274)
>> >        at 
>> > org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:160)
>> >        at 
>> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:589)
>> >        at 
>> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
>> >        at 
>> > org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1518)
>> >        at 
>> > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1265)
>> >        at 
>> > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:384)
>> >        at 
>> > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:411)
>> >        at 
>> > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
>> >        at 
>> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
>> >        at 
>> > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
>> >        at 
>> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
>> >        at 
>> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>> >        at 
>> > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>> >        at 
>> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>> >        at 
>> > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>> >        at 
>> > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>> >        at 
>> > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>> >        at 
>> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
>> >        at 
>> > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
>> >        at 
>> > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
>> >        at 
>> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
>> >        at 
>> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
>> >        at 
>> > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
>> >        at 
>> > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
>> >        at org.eclipse.jetty.server.Server.handle(Server.java:351)
>> >        at 
>> > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
>> >        at 
>> > org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
>> >        at 
>> > org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
>> >        at 
>> > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
>> >        at org.eclip

Re: Throws Null Pointer Exception Even Query is Correct in solr

2012-05-24 Thread Sami Siren
What version of solr (solrj) are you using?

--
 Sami SIren

On Thu, May 24, 2012 at 8:41 AM, in.abdul  wrote:
> Hi Dmitry ,
>
> There is no out of memory execution in solr ..
>            Thanks and Regards,
>        S SYED ABDUL KATHER
>
>
>
> On Thu, May 24, 2012 at 1:14 AM, Dmitry Kan [via Lucene] <
> ml-node+s472066n3985762...@n3.nabble.com> wrote:
>
>> do you also see out of memory exception in your tomcat logs? If so, try
>> setting the JVM's -Xmx to something reasonable.
>>
>> -- Dmitry
>>
>> On Wed, May 23, 2012 at 10:09 PM, in.abdul <[hidden 
>> email]>
>> wrote:
>>
>> > Sorry i missed the point i am already using Method.Post Only  .. Still i
>> > could not able to execute
>> >             Thanks and Regards,
>> >        S SYED ABDUL KATHER
>> >
>> >
>> >
>> > On Thu, May 24, 2012 at 12:19 AM, iorixxx [via Lucene] <
>> > [hidden email] >
>> wrote:
>> >
>> > > >     I have creteria where i am passing more than
>> > > > 10 ids in Query like
>> > > > q=(ROWINDEX:(1 2 3 4  )) using solrJ . i had increased
>> > > > the Max Boolean
>> > > > clause  = 10500 and i had increased the Max Header
>> > > > Size in tomcat also
>> > > > in sufficient amount .. But still its is throwing Null
>> > > > Pointer Exception in
>> > > > Tomcat and in Eclipse while debugging i had seen Error as
>> > > > "Error Executing
>> > > > Query" . Please give me suggestion for this.
>> > >
>> > >
>> > > If you are using GET method ( which is default) try POST method
>> instead.
>> > > See how to use it : http://search-lucene.com/m/34M4GTEIaD
>> > >
>> > >
>> >
>> >
>> > -
>> > THANKS AND REGARDS,
>> > SYED ABDUL KATHER
>> > --
>> > View this message in context:
>> >
>> http://lucene.472066.n3.nabble.com/Throws-Null-Pointer-Exception-Even-Query-is-Correct-in-solr-tp3985736p3985754.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>>
>>
>> --
>> Regards,
>>
>> Dmitry Kan
>>
>>
>>
>
>
> -
> THANKS AND REGARDS,
> SYED ABDUL KATHER
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Throws-Null-Pointer-Exception-Even-Query-is-Correct-in-solr-tp3985736p3985834.html
> Sent from the Solr - User mailing list archive at Nabble.com.
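
For reference, the GET-vs-POST switch being discussed is the optional method argument on SolrJ's query call; a minimal 3.x-style sketch (URL and field name are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PostQuerySketch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        // A very large boolean query; POST keeps it out of the URL and header size limits.
        SolrQuery q = new SolrQuery("ROWINDEX:(1 2 3 4)");
        server.query(q, SolrRequest.METHOD.POST);
    }
}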


RE: field "name" was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Markus Jelsma
Thanks!

How can we, in that case, omit term frequency for a qf field? I assume the way 
to go is to configure a custom flat term frequency similarity for that field. 
And how can it be that this error is not thrown with termPosition=false for 
that field but only with omitTermFreqAndPositions?

Markus
 
 
-Original message-
> From:Michael McCandless 
> Sent: Thu 24-May-2012 12:26
> To: solr-user@lucene.apache.org
> Subject: Re: field "name" was indexed without position data; cannot 
> run PhraseQuery (term=a)
> 
> This behavior has changed.
> 
> In 3.x, you silently got no results in such cases.
> 
> In trunk, you get an exception notifying you that the query cannot run.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Thu, May 24, 2012 at 6:04 AM, Markus Jelsma
>  wrote:
> > Hi,
> >
> > What is the intended behaviour for explicit phrase queries on fields 
> > without position data? If a (e)dismax qf parameter included a field 
> > omitTermFreqAndPositions=true user explicit phrase queries throw the 
> > following error on trunk but not on the 3x branch.
> >
> > java.lang.IllegalStateException: field "name" was indexed without position 
> > data; cannot run PhraseQuery (term=a)
> >        at 
> > org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:274)
> >        at 
> > org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:160)
> >        at 
> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:589)
> >        at 
> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
> >        at 
> > org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1518)
> >        at 
> > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1265)
> >        at 
> > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:384)
> >        at 
> > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:411)
> >        at 
> > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
> >        at 
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
> >        at 
> > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
> >        at 
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
> >        at 
> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
> >        at 
> > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
> >        at 
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
> >        at 
> > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
> >        at 
> > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
> >        at 
> > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
> >        at 
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
> >        at 
> > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
> >        at 
> > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
> >        at 
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
> >        at 
> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
> >        at 
> > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
> >        at 
> > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
> >        at org.eclipse.jetty.server.Server.handle(Server.java:351)
> >        at 
> > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
> >        at 
> > org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
> >        at 
> > org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
> >        at 
> > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
> >        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
> >        at 
> > org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
> >        at 
> > org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
> >        at 
> > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
> >        at 
> > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
> >        at 
> > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
> >        at java.lang.Thread.run(Thread.java:662)
>

Re: How many <doc> in the XML source file before indexing?

2012-05-24 Thread Bruno Mannina

In fact it's not for an update but only for the first indexation.

I mean, I will receive the full database with around 80M docs in some 
XML files (one per country in the world).
From these 80M docs I will generate the right XML format for each doc. (I 
don't need all fields from the source.)


And actually, for my test (12 000 docs), I generate one file per doc, and 
there is no problem.

But with 80M docs I can't generate one file per doc.

It's for this reason I asked about the max number of <doc> in a file.

For the first time, if a country file fails, no problem, I will check it 
and re-generate it.


Is it bad to create a file with 5M <doc>?
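
For reference, the update XML just repeats <doc> elements inside a single <add>; a tiny sketch with made-up fields (there is no hard limit on the count, only memory):

<add>
  <doc>
    <field name="id">EP0001</field>
    <field name="title">first document</field>
  </doc>
  <doc>
    <field name="id">EP0002</field>
    <field name="title">second document</field>
  </doc>
  <!-- ...more <doc> elements... -->
</add>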


Le 24/05/2012 11:46, Michael Kuhlmann a écrit :

There is no hard limit for the maximum nunmber of documents per update.

It's only memory dependent. The smaller each document, and the more 
memory Solr can acquire, the more documents can you send in one update.


However, I wouldn't pish it too jard anyway. If you can send, say, 100 
documents per update, the you won't gain much if you send 200 
documents instead, or even 1000. The number of requests don't count 
that much.


And, if the update fails for some reason, then the whole request will 
be ignored. If you had sent 1000 documents in an update, and one of 
them had a field missing, for example, then it's hard to find out 
which one.


Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of  ?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of .

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of





that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must indexed 80M docs so I can't create one xml file by doc.

thanks,
Bruno





















Re: field "name" was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Michael McCandless
This behavior has changed.

In 3.x, you silently got no results in such cases.

In trunk, you get an exception notifying you that the query cannot run.

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 24, 2012 at 6:04 AM, Markus Jelsma
 wrote:
> Hi,
>
> What is the intended behaviour for explicit phrase queries on fields without 
> position data? If a (e)dismax qf parameter included a field 
> omitTermFreqAndPositions=true user explicit phrase queries throw the 
> following error on trunk but not on the 3x branch.
>
> java.lang.IllegalStateException: field "name" was indexed without position 
> data; cannot run PhraseQuery (term=a)
>        at 
> org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:274)
>        at 
> org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:160)
>        at 
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:589)
>        at 
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1518)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1265)
>        at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:384)
>        at 
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:411)
>        at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
>        at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
>        at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>        at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>        at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>        at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>        at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>        at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>        at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
>        at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
>        at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
>        at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
>        at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
>        at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
>        at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
>        at org.eclipse.jetty.server.Server.handle(Server.java:351)
>        at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
>        at 
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
>        at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
>        at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
>        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
>        at 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
>        at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
>        at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
>        at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
>        at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
>        at java.lang.Thread.run(Thread.java:662)
>
>
> Thanks


field "name" was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Markus Jelsma
Hi,

What is the intended behaviour for explicit phrase queries on fields without 
position data? If an (e)dismax qf parameter includes a field with 
omitTermFreqAndPositions=true, user explicit phrase queries throw the following 
error on trunk but not on the 3x branch.

java.lang.IllegalStateException: field "name" was indexed without position 
data; cannot run PhraseQuery (term=a)
at 
org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:274)
at 
org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:160)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:589)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1518)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1265)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:384)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:411)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:351)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
at java.lang.Thread.run(Thread.java:662)


Thanks
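
For anyone trying to reproduce this, a request along these lines should trigger it, assuming the "name" field from the qf list was indexed with omitTermFreqAndPositions="true" in schema.xml (host, port and query terms are only illustrative):

  curl 'http://localhost:8983/solr/select?defType=edismax&qf=name&q=%22a+b%22'

The same query without the quotes should not build a PhraseQuery (unless pf is also set), which is why only explicit phrase queries are affected.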


Re: How many <doc> in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

"pish it too jard" - sounds funny. :)

I meant "push it too hard".

Am 24.05.2012 11:46, schrieb Michael Kuhlmann:

There is no hard limit for the maximum number of documents per update.

It's only memory dependent. The smaller each document, and the more
memory Solr can acquire, the more documents you can send in one update.

However, I wouldn't pish it too jard anyway. If you can send, say, 100
documents per update, then you won't gain much if you send 200 documents
instead, or even 1000. The number of requests doesn't count that much.

And, if the update fails for some reason, then the whole request will be
ignored. If you had sent 1000 documents in an update, and one of them
had a field missing, for example, then it's hard to find out which one.

Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of <doc>?

Can someone tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will also take a look to find the max number of <doc>.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
thing.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

<doc> ... </doc>

that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must index 80M docs so I can't create one xml file per doc.

thanks,
Bruno



















Re: How many <doc> in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

There is no hard limit for the maximum number of documents per update.

It's only memory dependent. The smaller each document, and the more 
memory Solr can acquire, the more documents you can send in one update.


However, I wouldn't pish it too jard anyway. If you can send, say, 100 
documents per update, then you won't gain much if you send 200 documents 
instead, or even 1000. The number of requests doesn't count that much.


And, if the update fails for some reason, then the whole request will be 
ignored. If you had sent 1000 documents in an update, and one of them 
had a field missing, for example, then it's hard to find out which one.


Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of <doc>?

Can someone tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will also take a look to find the max number of <doc>.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
thing.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

<doc> ... </doc>

that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must index 80M docs so I can't create one xml file per doc.

thanks,
Bruno
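
To illustrate the batching advice above: one common approach is to split the 80M documents into many files of, say, a few hundred to a few thousand <doc> entries each, and post them one file at a time (host, file names and batch size are only illustrative):

  # each batch_NNNN.xml holds one <add> element with up to ~1000 <doc> entries
  for f in batch_*.xml; do
    curl -s 'http://localhost:8983/solr/update' \
         -H 'Content-Type: text/xml; charset=utf-8' \
         --data-binary @"$f"
  done
  # commit once at the end instead of after every batch
  curl -s 'http://localhost:8983/solr/update' \
       -H 'Content-Type: text/xml; charset=utf-8' \
       --data-binary '<commit/>'

If one batch is rejected, only that file needs to be inspected and re-posted, which keeps the "which document broke the update" problem manageable.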

















Re: How many <doc> in the XML source file before indexing?

2012-05-24 Thread Bruno Mannina

I can't find my answer concerning the max number of <doc>?

Can someone tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will also take a look to find the max number of <doc>.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
thing.


paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

<doc> ... </doc>

that I can write in the xml source file before indexing? only one, 
10, 100, 1000, unlimited...?


I must index 80M docs so I can't create one xml file per doc.

thanks,
Bruno















Re: shard distribution of multiple collections in SolrCloud

2012-05-24 Thread Tommaso Teofili
2012/5/23 Mark Miller 

> Yeah, currently you have to create the core on each node...we are working
> on a 'collections' api that will make this a simple one call operation.
>

Mark, is there a Jira for that yet?
Tommaso


>
> We should have this soon.
>
> - Mark
>
> On May 23, 2012, at 2:36 PM, Daniel Brügge wrote:
>
> > Hi,
> >
> > I am creating several cores using the following script. I use this for
> > testing SolrCloud and to learn about the distribution of multiple
> > collections.
> >
> > max=500
> >> for ((i=2; i<=$max; ++i )) ;
> >> do
> >>curl "
> >>
> http://solrinstance1:8983/solr/admin/cores?action=CREATE&name=collection$i&collection=collection$i&collection.configName=myconfig
> >> "
> >> done
> >
> >
> > I've set up a SolrCloud with 2 shards, each of which is replicated by 2 other
> > instances I start.
> >
> > When I first start the installation I have the default "collection1" in
> > place which is sharded over shard1 and shard2 with 2 leader nodes and 2
> > nodes which replicate the leaders.
> >
> > When I run this script above which calls the Coreadmin on one of the
> > shards, all the collections are created on only this shard without a
> > replica. So e.g.
> >
> >
> > "collection8":{"shard1":{"solrinstance1:8983_solr_collection8":{
> >"shard":"shard1",
> >"leader":"true",
> >"state":"active",
> >"core":"collection8",
> >"collection":"collection8",
> >"node_name":"solrinstance1:8983_solr",
> >
> >"base_url":"http://solrinstance1:8983/solr"}}}
> >
> >
> > I always thought that these collections would be sharded and replicated via
> > ZooKeeper. Or do I need to call the create-core action on each node? But
> > then I need to know about these nodes, right?
> >
> >
> > Thanks & regards
> >
> > Daniel
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>
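
Until that collections API exists, "create the core on each node" in terms of the script above simply means issuing the same CREATE call against every node that should carry the collection, for example (hostnames are hypothetical):

  for host in solrinstance1 solrinstance2 solrinstance3 solrinstance4; do
    curl "http://$host:8983/solr/admin/cores?action=CREATE&name=collection2&collection=collection2&collection.configName=myconfig"
  done

ZooKeeper then registers each new core under the collection in the cluster state, but it does not create cores on the other nodes by itself.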


Re: shard distribution of multiple collections in SolrCloud

2012-05-24 Thread Daniel Brügge
Ok, thanks a lot, good to know.

BTW: The speed of creating a collection is not the fastest - at least here
on this server I use (approx. a second), but this is normal, right?

On Wed, May 23, 2012 at 9:28 PM, Mark Miller  wrote:

> Yeah, currently you have to create the core on each node...we are working
> on a 'collections' api that will make this a simple one call operation.
>
> We should have this soon.
>
> - Mark
>
> On May 23, 2012, at 2:36 PM, Daniel Brügge wrote:
>
> > Hi,
> >
> > I am creating several cores using the following script. I use this for
> > testing SolrCloud and to learn about the distribution of multiple
> > collections.
> >
> > max=500
> >> for ((i=2; i<=$max; ++i )) ;
> >> do
> >>curl "
> >>
> http://solrinstance1:8983/solr/admin/cores?action=CREATE&name=collection$i&collection=collection$i&collection.configName=myconfig
> >> "
> >> done
> >
> >
> > I've set up a SolrCloud with 2 shards, each of which is replicated by 2 other
> > instances I start.
> >
> > When I first start the installation I have the default "collection1" in
> > place which is sharded over shard1 and shard2 with 2 leader nodes and 2
> > nodes which replicate the leaders.
> >
> > When I run this script above which calls the Coreadmin on one of the
> > shards, all the collections are created on only this shard without a
> > replica. So e.g.
> >
> >
> > "collection8":{"shard1":{"solrinstance1:8983_solr_collection8":{
> >"shard":"shard1",
> >"leader":"true",
> >"state":"active",
> >"core":"collection8",
> >"collection":"collection8",
> >"node_name":"solrinstance1:8983_solr",
> >
> >"base_url":"http://solrinstance1:8983/solr"}}}
> >
> >
> > I always thought that these collections would be sharded and replicated via
> > ZooKeeper. Or do I need to call the create-core action on each node? But
> > then I need to know about these nodes, right?
> >
> >
> > Thanks & regards
> >
> > Daniel
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>


Re: How many <doc> in the XML source file before indexing?

2012-05-24 Thread Bruno Mannina

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will also take a look to find the max number of <doc>.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of thing.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

<doc> ... </doc>

that I can write in the xml source file before indexing? only one, 10, 100, 
1000, unlimited...?

I must index 80M docs so I can't create one xml file per doc.

thanks,
Bruno











Per Field Similarity

2012-05-24 Thread hemantverm...@gmail.com
Hi All

I have a scenario: suppose I want to use a new feature which is available in
trunk and is also available as a patch.
Should I apply the patch to the latest release version to use the new feature, or
use trunk directly?
Which approach would be better, and why?

Thanks in advance
Hemant

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Per-Field-Similarity-tp3985857.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How many <doc> in the XML source file before indexing?

2012-05-24 Thread Paul Libbrecht
Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of thing.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :

> Hi All,
> 
> Just a little question concerning the max number of
> 
> <doc> ... </doc>
> 
> that I can write in the xml source file before indexing? only one, 10, 100, 
> 1000, unlimited...?
> 
> I must index 80M docs so I can't create one xml file per doc.
> 
> thanks,
> Bruno
> 
> 
> 
> 



How many <doc> in the XML source file before indexing?

2012-05-24 Thread Bruno Mannina

Hi All,

Just a little question concerning the max number of

<doc> ... </doc>

that I can write in the xml source file before indexing? only one, 10, 
100, 1000, unlimited...?


I must index 80M docs so I can't create one xml file per doc.

thanks,
Bruno






Re: System requirements in my case?

2012-05-24 Thread Bruno Mannina

Thanks a lot for all this help!


Le 24/05/2012 09:12, Otis Gospodnetic a écrit :

Bruno,

You can use jconsole to see the size of the JVM heap, if that's what you are 
after.

Otis 


Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm





From: Bruno Mannina
To: solr-user@lucene.apache.org
Sent: Tuesday, May 22, 2012 8:43 AM
Subject: Re: System requirements in my case?

I installed a temp server at my university with 12 000 docs (Ubuntu + Solr
3.6.0).
Maybe I can get an idea of how much memory I need?

Q: How can I check the memory used?


Le 22/05/2012 13:14, findbestopensource a écrit :

Seems to be fine. Go ahead.

Before hosting, have you tried / tested your application in a local setup?
RAM usage is what matters in terms of Solr. Just benchmark your app for 100
000 documents, log the memory used, and calculate the RAM required for 80 000 000
documents.

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 2:36 PM, Bruno Mannina   wrote:


My choice: 
http://www.ovh.com/fr/serveurs_dedies/eg_best_of.xml

24 GB DDR3

Le 22/05/2012 10:26, findbestopensource a écrit :

Dedicated Server may not be required. If you want to cut down cost, then
prefer a shared server.

How much the RAM?

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 12:36 PM, Bruno Mannina wrote:

Dear Solr users,

My company would like to use solr to index around 80 000 000 documents
(XML files of around 5-10 KB each).
My program (robot) will connect to this Solr with Boolean queries.

Number of users: around 1000
Number of requests by user and by day: 300
Number of users by day: 30

I would like to subscribe to a host provider with this configuration:
- Dedicated Server
- Ubuntu
- Intel Xeon i7, 2x 2.66+ GHz, 12 GB RAM, 2 x 1500 GB disks
- Unlimited bandwidth
- Fixed IP

Do you think this configuration is enough?

Thanks for your info,
Sincerely
Bruno
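
As a concrete illustration of the jconsole suggestion in this thread (assuming the stock Jetty start.jar; the pid lookup, port and security flags are only examples):

  # local: attach jconsole directly to the running Solr JVM
  jconsole $(pgrep -f start.jar)

  # remote: expose JMX when starting Solr, then point jconsole at host:18983
  java -Dcom.sun.management.jmxremote \
       -Dcom.sun.management.jmxremote.port=18983 \
       -Dcom.sun.management.jmxremote.authenticate=false \
       -Dcom.sun.management.jmxremote.ssl=false \
       -jar start.jar

Watching the heap graphs while indexing a known number of documents gives a rough basis for extrapolating the memory needed for the full 80M.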










Re: System requirements in my case?

2012-05-24 Thread Otis Gospodnetic
Bruno,

You can use jconsole to see the size of the JVM heap, if that's what you are 
after.

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 



>
> From: Bruno Mannina 
>To: solr-user@lucene.apache.org 
>Sent: Tuesday, May 22, 2012 8:43 AM
>Subject: Re: System requirements in my case?
> 
>I installed a temp server at my university with 12 000 docs (Ubuntu + Solr 
>3.6.0).
>Maybe I can get an idea of how much memory I need?
>
>Q: How can I check the memory used?
>
>
>Le 22/05/2012 13:14, findbestopensource a écrit :
>> Seems to be fine. Go ahead.
>>
>> Before hosting, have you tried / tested your application in a local setup?
>> RAM usage is what matters in terms of Solr. Just benchmark your app for 100
>> 000 documents, log the memory used, and calculate the RAM required for 80 000 000
>> documents.
>>
>> Regards
>> Aditya
>> www.findbestopensource.com
>>
>>
>> On Tue, May 22, 2012 at 2:36 PM, Bruno Mannina  wrote:
>>
>>> My choice: 
>>> http://www.ovh.com/fr/serveurs_dedies/eg_best_of.xml
>>>
>>> 24 GB DDR3
>>>
>>> Le 22/05/2012 10:26, findbestopensource a écrit :
>>>
>>>   Dedicated Server may not be required. If you want to cut down cost, then
 prefer shared server.

 How much the RAM?

 Regards
 Aditya
 www.findbestopensource.com


 On Tue, May 22, 2012 at 12:36 PM, Bruno Mannina   wrote:

   Dear Solr users,
> My company would like to use solr to index around 80 000 000 documents
> (XML files of around 5-10 KB each).
> My program (robot) will connect to this solr with boolean requests.
>
> Number of users: around 1000
> Number of requests by user and by day: 300
> Number of users by day: 30
>
> I would like to subscribe to a host provider with this configuration:
> - Dedicated Server
> - Ubuntu
> - Intel Xeon i7, 2x 2.66+ GHz, 12 GB RAM, 2 x 1500 GB disks
> - Unlimited bandwidth
> - Fixed IP
>
> Do you think this configuration is enough?
>
> Thanks for your info,
> Sincerely
> Bruno
>
>
>
>
>
>

Re: Planning of future Solr setup

2012-05-24 Thread Otis Gospodnetic
Christian,

You don't mention SolrCloud explicitly and based on what you wrote I'm assuming 
you are thinking/planning on using the Solr 3.* setup for this.  I think that's 
the first thing to change - this is going to be a pain to manage if you use 
Solr 3.*.  You should immediately start looking at using SolrCloud for this.  
Once you have a look you will see how a number of your questions will quickly 
become non-questions. :)

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




>
> From: Christian von Wendt-Jensen 
>To: "solr-user@lucene.apache.org"  
>Sent: Wednesday, May 23, 2012 6:59 AM
>Subject: Planning of future Solr setup
> 
>Hi,
>
>I'm in the middle of planning a new Solr setup. The situation is this:
>- We currently have one document type with around 20 fields, indexed, not 
>stored, except for a few date fields
>- We currently have indexed 400M documents across 20+ shards.
>- The number of documents to be indexed is around 1M/day, and this number is 
>increasing.
>- The index files totals to around 750GB
>- Users will mostly search newly indexed documents (news), and therefore the 
>shards represent date ranges.
>- Each month or so, we add a new shard.
>
>
>In my planning, my goals are:
>- It should be very easy to add a new shard and bring it online. Maybe it 
>could even be fully automated.
>- It should be very easy to retire an (old) shard in order to reclaim the 
>hardware resources for newer documents.
>- It should be very easy to scale wide or high by adding more machines or more 
>CPU/RAM. The resources should be able to autobalance the shards for optimum 
>resource usage.
>- Rebalancing should be very fast.
>- The setup should support one writer and many readers of the same physical 
>index. This avoids replication and moving large files around. This again 
>supports fast rebalancing of hardware resources.
>- Clients should be notified about shards coming online or going offline.
>
>The goals require a kind of distributed configuration and notification system. 
>Here I imagine ZooKeeper comes into play.
>In order to make rebalancing very fast, the indexes should stay where they are, 
>and not be moved around. Instead, Solr instances on available resources should 
>be configured to point to the relevant shards. This requires SAN storage, I 
>imagine.
>
>
>Questions:
>1. What is best practice in regard to using a machine's resources: one Tomcat 
>instance per shard until memory and CPU are used up? Or rather one 
>Tomcat/multiple cores, where the Tomcat gets all the memory available on the machine?
>2. Would it be a good idea to mix master and slave cores in the same tomcat 
>instance or should a machine be dedicated to either master cores or slave 
>cores?
>3. What would be the best way to notify the slave cores about recent commits 
>by the masters, remembering that replication is disabled?
>4. In the one writer, many readers scenario, what happens when the writer 
>merges/updates segments? Will the index files be physically deleted/altered? 
>And how will the slaves react to that?
>5. Would it be advisable to use a SAN for sharing index files between readers 
>and writers (one writer)? Any best practices in this area? I imagine one large 
>share on the SAN that all "resources" can mount.
>
>
>
>
>
>
>Med venlig hilsen / Best Regards
>
>Christian von Wendt-Jensen
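
For orientation, the two-shard example from the SolrCloud wiki starts roughly like this (paths and ports are those of the stock example distribution; exact flags may differ between versions):

  # first node: upload the config to ZooKeeper (embedded via -DzkRun) and declare 2 shards
  cd example
  java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf \
       -DzkRun -DnumShards=2 -jar start.jar

  # additional nodes just point at the existing ZooKeeper instance
  cd example2
  java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Shard assignment, cluster state and client notification are then handled through ZooKeeper, which covers several of the questions above.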
>
>
>
>

Re: Tips on creating a custom QueryCache?

2012-05-24 Thread Otis Gospodnetic
Perhaps this could be a custom SearchComponent that's run before the usual 
QueryComponent?
This component would be responsible for loading queries, executing them, 
caching results, and for returning those results when these queries are 
encountered later on.

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 



>
> From: Aaron Daubman 
>To: solr-user@lucene.apache.org 
>Sent: Wednesday, May 23, 2012 12:00 PM
>Subject: Tips on creating a custom QueryCache?
> 
>Greetings,
>
>I'm looking for pointers on where to start when creating a
>custom QueryCache.
>Our usage patterns are possibly a bit unique, so let me explain the desired
>use case:
>
>Our Solr index is read-only except for dedicated periods where it is
>updated and re-optimized.
>
>On startup, I would like to create a specific QueryCache that would cache
>the top ~20,000 (arbitrary but large) queries. This cache should never
>evict entries, and, after the "warming process" populates it, should never
>be added to either.
>
>The warming process would be to run through the (externally determined)
>list of anticipated top X (say 20,000) queries and cache these results.
>
>This cache would then be used for the duration of the solr run-time (until
>the period, perhaps daily, where the index is updated and re-optimized, at
>which point the cache would be re-created)
>
>Where should I begin looking to implement such a cache?
>
>The reason for this somewhat different approach to caching is that we may
>get any number of odd queries throughout the day for which performance
>isn't important, and we don't want any of these being added to the cache or
>evicting other entries from the cache. We need to ensure high performance
>for this pre-determined list of queries only (while still handling other
>arbitrary queries, if not as quickly)
>
>Thanks,
>      Aaron
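
Whichever cache ends up holding the results, the warming pass itself can be as simple as replaying the pre-computed query list after each index update, for example (host, handler and file name are hypothetical; one query per line in top_queries.txt):

  while IFS= read -r q; do
    curl -s -G 'http://localhost:8983/solr/select' \
         --data-urlencode "q=$q" \
         --data-urlencode "rows=10" \
         -o /dev/null
  done < top_queries.txt

A custom component along the lines Otis describes would run the same list at startup and keep the results in its own never-evicted map.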
>
>
>

Re: configuring solr3.6 for a large intensive index only run

2012-05-24 Thread Otis Gospodnetic
Scott,

In addition to what Lance said, make sure your ramBufferSizeMB in 
solrconfig.xml is high. Try with 512MB or 1024MB.  The Solr/Lucene index 
segment merging visualization is one of my favourite reports in SPM for Solr.  
It's kind of "amazing" how much index size fluctuates!

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 



>
> From: Scott Preddy 
>To: solr-user@lucene.apache.org 
>Sent: Wednesday, May 23, 2012 2:19 PM
>Subject: configuring solr3.6 for a large intensive index only run
> 
>I am trying to do a very large insertion (about 68 million documents) into a
>Solr instance.
>
>Our schema is pretty simple. About 40 fields using these types:
>
>   
>      omitNorms="true"/>
>      positionIncrementGap="100">
>         
>            
>            
>         
>         
>            
>            
>         
>      
>      omitNorms="true" positionIncrementGap="0"/>
>   
>
>We are running solrj clients from a hadoop cluster, and are struggling with
>the merge process as time progresses.
>As the number of documents grows, merging will eventually hog everything.
>
>What we would really like to do is turn merging off and just do an index
>run with a sparse solrconfig and then
>start things back up with our runtime config which would kick off merging
>when it starts.
>
>Is there a way to do this?
>
>I came close to finding an answer in this post, but did not find out how to
>actually turn off merging.
>
>Post by Mike McCandless:
>http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>
>
>