Re: DIH - Insert another record After first load

2010-08-12 Thread Shalin Shekhar Mangar
On Thu, Aug 12, 2010 at 7:05 AM, Girish  wrote:

> Hi,
>
> I did a load of the data with DIH and the initial load is now complete. I want
> to add records dynamically as and when they are received.
>
> Use cases:
>
>   1. I did a load of 7MM records and now everything is working fine.
>   2. A new record is received, and now I want to add this new record into the
> indexed data. Here is the difference in the processing and the logic:
>  * The initial data load is done from an Oracle materialized view
>  * The new record is added into the tables from which the view is
> created, and is not available in the view yet
>  * Now I want to add this new record into the index. I have a Java
> bean loaded with the data, including the index column.
>  * I looked at the indexed file and it is all encoded.
>   3. How do I load the above Java bean into the index?
>
>
Take a look at SolrJ - http://wiki.apache.org/solr/Solrj
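
For example, a minimal SolrJ sketch (assuming a bean class whose fields are
annotated with @Field and a Solr instance at http://localhost:8983/solr --
both are placeholders, not from the original question) might look like:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class IndexOneBean {
        public static void main(String[] args) throws Exception {
            // point at the running Solr instance
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // hypothetical bean class with @Field-annotated members, populated from your data
            MyRecord bean = new MyRecord();
            server.addBean(bean);   // converts the bean into a SolrInputDocument and sends it
            server.commit();        // make the new record visible to searches
        }
    }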

-- 
Regards,
Shalin Shekhar Mangar.


Re: SOLR Query

2010-08-12 Thread Moiz Bhukhiya
I tried ap_address:(tom+cruise) and that worked. I am sure it's the same
problem as you suspected!

Thanks a lot, Erick (& users!), for your time.
Moiz

On Thu, Aug 12, 2010 at 8:51 PM, Erick Erickson wrote:

> You'll get a lot of insight into what's actually happening if you append
> &debugQuery=true to your queries, or check the "debug" checkbox
> in the solr admin page.
>
> But I suspect (and it's a guess since you haven't included your schema)
> that your problem is that you're mixing explicit and default fields.
> Something
> like "q=ap_address:Tom+Cruise", I think, gets parsed into something like
> ap_address:tom + default_field:cruise
>
> What happens if you try ap_address:(tom +cruise)?
>
> Best
> Erick
>
On Thu, Aug 12, 2010 at 7:19 PM, Moiz Bhukhiya wrote:
>
> > Hi there,
> >
> >
> > I have a problem querying SOLR for a specific field with a query string
> > that contains spaces. I added the following lines in schema.xml to add my
> > own defined fields. The fields are: ap_name, ap_address, ap_dob, ap_desg,
> > ap_sec.
> >
> > Since all these fields begin with ap_, I included the following
> > dynamicField.
> > 
> >
> >
> > I included this line to make a query against all fields instead of a
> > specific field.
> > 
> >
> > I added the following document in my index:
> >
> > 
> > 
> > 1
> > Tom Cruise
> > San Fransisco
> > 
> > 
> >
> > 1. When I query q=Tom+Cruise, I should get the above document since it is
> > available in "text", which is my default query field. [Works as expected]
> > 2. When I query q=ap_address:Tom, I should not get the above document
> > since Tom is not available in ap_address. [Works as expected]
> > 3. When I query q=ap_address:Tom+Cruise, I should not get the above
> > document BUT I GET IT. [Doesn't work as expected]
> >
> > Could anyone please explain what mistake I am making?
> >
> > Thanks a lot, appreciate any help!
> > Moiz
> >
>


DataImportHandler and SAXParseExceptions with Jetty

2010-08-12 Thread harrysmith

Win XP, Solr 1.4.1 out-of-the-box install, using Jetty. If I add a greater-than
or less-than sign (i.e. < or >) in any XML field and attempt to load or run from
the DataImport console, I receive a SAXParseException. Example follows:

If I don't have a 'less than' it works just fine. I know this must work,
because the examples given on the wiki show deltaQueries using a greater
than/less than compare.
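
One working approach, assuming the offending character sits inside an attribute
value such as a deltaQuery: XML-escape the comparison operators, writing &lt;
for < and &gt; for >, e.g. a hypothetical
deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'"
(the table and column names here are illustrative only).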


Relevant snippet from data-config.xml :



Stack trace received:
org.apache.solr.common.SolrException: FATAL: Could not create importer.
DataImporter config invalid
at
org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:121)
at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:222)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
Exception occurred while initializing context
at
org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:190)
at
org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:101)
at
org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
... 22 more
Caused by: org.xml.sax.SAXParseException: The value of attribute "query"
associated with an element type "null" must not contain the '<' character.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown
Source)
at
com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown
Source)
at
org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:178)
... 24 more

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-and-SAXParseExceptions-with-Jetty-tp1125898p1125898.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 8:29 PM, Chris Hostetter
wrote:

>
> It was a big part of the proposal regarding the creation of the 3x
> branch ... that index format compatibility between major versions would
> no longer be supported by silently converting on first write -- instead
> there would be a tool for explicit conversion...
>
>
> http://search.lucidimagination.com/search/document/c10057266d3471c6/proposal_about_version_api_relaxation
>
> http://search.lucidimagination.com/search/document/c494a78f1ec1bfb5/lucene_3_x_branch_created
>
>
>
Hoss, did you actually *read* these documents?


"We will only provide a conversion tool that can convert indexes from
the last "branch_3x" up to this trunk (4.0) release, so they can be
read later, but may not contain terms with all current analyzers, so
people need mostly reindexing. Older indexes will not be able to be
read natively without conversion first (with maybe loss of analyzer
compatibility)."



the fact that 4.0 can read 3.x indexes *at all* without a converter tool is
only because Mike McCandless went the extra mile.


i don't see anything suggesting we should support any tools for 2.x indexes!

-- 
Robert Muir
rcm...@gmail.com


Re: SOLR Query

2010-08-12 Thread Erick Erickson
You'll get a lot of insight into what's actually happening if you append
&debugQuery=true to your queries, or check the "debug" checkbox
in the solr admin page.

But I suspect (and it's a guess since you haven't included your schema)
that your problem is that you're mixing explicit and default fields.
Something
like "q=ap_address:Tom+Cruise", I think, gets parsed into something like
ap_address:tom + default_field:cruise

What happens if you try ap_address:(tom +cruise)?
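
(If the goal is to require the two words as an exact phrase in that field, a
quoted phrase query such as q=ap_address:"Tom Cruise" is another option to
try -- assuming the field's analyzer keeps both words as separate tokens.)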

Best
Erick

On Thu, Aug 12, 2010 at 7:19 PM, Moiz Bhukhiya wrote:

> Hi there,
>
>
> I have a problem querying SOLR for a specific field with a query string that
> contains spaces. I added the following lines in schema.xml to add my own
> defined fields. The fields are: ap_name, ap_address, ap_dob, ap_desg, ap_sec.
>
> Since all these fields begin with ap_, I included the following
> dynamicField.
> 
>
>
> I included this line to make a query against all fields instead of a
> specific field.
> 
>
> I added the following document in my index:
>
> 
> 
> 1
> Tom Cruise
> San Fransisco
> 
> 
>
> 1. When I query q=Tom+Cruise, I should get the above document since it is
> available in "text", which is my default query field. [Works as expected]
> 2. When I query q=ap_address:Tom, I should not get the above document since
> Tom is not available in ap_address. [Works as expected]
> 3. When I query q=ap_address:Tom+Cruise, I should not get the above
> document BUT I GET IT. [Doesn't work as expected]
>
> Could anyone please explain what mistake I am making?
>
> Thanks a lot, appreciate any help!
> Moiz
>


Re: Results from More then One Cors?

2010-08-12 Thread Erick Erickson
There is no information to go on here. Please review
http://wiki.apache.org/solr/UsingMailingLists

and add some more details...

Best
Erick

On Thu, Aug 12, 2010 at 2:09 PM, Jörg Agatz wrote:

> Hello users...
>
> I tried to get results from more than one core,
> but I don't know how.
>
> Maybe you have an idea.
>
> I need it in PHP.
>
> King
>


Re: indexing???

2010-08-12 Thread Erick Erickson
Can you provide more details? What is the error you're receiving?
What do you "think" is going on?

It might be helpful if you reviewed:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Thu, Aug 12, 2010 at 8:21 AM, satya swaroop  wrote:

> Hi all,
>   The indexing part of Solr is going well, but I got an error on indexing
> a single PDF file. When I searched for the error in the mailing list I
> found that the error was due to the copyright on that file. Can't we index
> a file which has copyright or other digital rights?
>
> regards,
>   satya
>


Re: Deleting with the DIH sometimes doesn't delete

2010-08-12 Thread Lance Norskog
Which version of Solr is this? How many documents are there in the
index? Etc. It is hard for us to help you without more details.


On Thu, Aug 12, 2010 at 8:32 AM, Qwerky  wrote:
>
> I'm doing deletes with the DIH but getting mixed results. Sometimes the
> documents get deleted, other times I can still find them in the index. What
> would prevent a doc from getting deleted?
>
> For example, I delete 594039 and get this in the logs;
>
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [DataImporter] Starting Delta
> Import
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [SolrWriter] Read
> productimportupdate.properties
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Starting delta
> collection.
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Running
> ModifiedRowKey() for Entity: item
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
> ModifiedRowKey for Entity: item rows obtained : 0
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
> DeletedRowKey for Entity: item rows obtained : 1
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
> parentDeltaQuery for Entity: item
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Deleting stale
> documents
> 2010-08-12 14:41:55,625 [Thread-210] INFO  [SolrWriter] Deleting document:
> 594039
> 2010-08-12 14:41:55,703 [Thread-210] INFO  [SolrDeletionPolicy] newest
> commit = 1281030128383
> 2010-08-12 14:41:55,718 [Thread-210] DEBUG [SolrIndexWriter] Opened Writer
> DirectUpdateHandler2
> 2010-08-12 14:41:55,718 [Thread-210] INFO  [DocBuilder] Delta Import
> completed successfully
> 2010-08-12 14:41:55,718 [Thread-210] INFO  [DocBuilder] Import completed
> successfully
> 2010-08-12 14:41:55,718 [Thread-210] INFO  [DirectUpdateHandler2] start
> commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
> 2010-08-12 14:42:08,562 [Thread-210] DEBUG [SolrIndexWriter] Closing Writer
> DirectUpdateHandler2
> 2010-08-12 14:42:10,437 [Thread-210] INFO  [SolrDeletionPolicy]
> SolrDeletionPolicy.onCommit: commits:num=2
>
> commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_8,version=1281030128383,generation=8,filenames=[_39.frq,
> _2i.fdx, _39.tis, _39.prx, _39.fnm, _2i.fdt, _39.tii, _39.nrm, segments_8]
>
> commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
> _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
> 2010-08-12 14:42:10,437 [Thread-210] INFO  [SolrDeletionPolicy] newest
> commit = 1281030128384
>
> ..this works fine; I can no longer find 594039 in the index. But a little
> later I delete a couple more (33252 and 105224) and get the following (I
> added two docs at the same time);
>
> 2010-08-12 15:27:42,828 [Thread-217] INFO  [DataImporter] Starting Delta
> Import
> 2010-08-12 15:27:42,828 [Thread-217] INFO  [SolrWriter] Read
> productimportupdate.properties
> 2010-08-12 15:27:42,828 [Thread-217] INFO  [DocBuilder] Starting delta
> collection.
> 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Running
> ModifiedRowKey() for Entity: item
> 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
> ModifiedRowKey for Entity: item rows obtained : 2
> 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
> DeletedRowKey for Entity: item rows obtained : 2
> 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
> parentDeltaQuery for Entity: item
> 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Deleting stale
> documents
> 2010-08-12 15:27:42,843 [Thread-217] INFO  [SolrWriter] Deleting document:
> 33252
> 2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrDeletionPolicy]
> SolrDeletionPolicy.onInit: commits:num=1
>
> commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
> _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
> 2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrDeletionPolicy] newest
> commit = 1281030128384
> 2010-08-12 15:27:42,906 [Thread-217] DEBUG [SolrIndexWriter] Opened Writer
> DirectUpdateHandler2
> 2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrWriter] Deleting document:
> 105224
> 2010-08-12 15:27:42,906 [Thread-217] INFO  [DocBuilder] Delta Import
> completed successfully
> 2010-08-12 15:27:42,906 [Thread-217] INFO  [DocBuilder] Import completed
> successfully
> 2010-08-12 15:27:42,906 [Thread-217] INFO  [DirectUpdateHandler2] start
> commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
> 2010-08-12 15:27:55,578 [Thread-217] DEBUG [SolrIndexWriter] Closing Writer
> DirectUpdateHandler2
> 2010-08-12 15:27:56,875 [Thread-217] INFO  [SolrDeletionPolicy]
> SolrDeletionPolicy.onCommit: commits:num=2
>
> commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
> _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
>
> commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_a

Re: Indexing and ExtractingRequestHandler

2010-08-12 Thread Lance Norskog
This is probably true about Luke. The trunk has a new Lucene format
and does not read any previous format.  The trunk is a busy code base.
The 3.1 branch is slated to be the next Solr release, and is probably
a better base for your testing. Best of all is to use the Solr 1.4.1
binary release.

On Wed, Aug 11, 2010 at 8:08 PM, Harry Hochheiser  wrote:
> Thanks.
>
> I've run the Tika command line to parse the Excel file, and I see
> contents in it that don't appear to be indexed. I've tried the path of
> using Tika to parse the Excel file and then using the extracting request
> handler to index the resulting text, and that doesn't work either.
>
> As far as Luke goes, I've built it from scratch. Still bombs. Is it
> possible that it's not compatible with Lucene builds based on trunk?
>
> thanks,
>
>
> -harry
>
> On Wed, Aug 11, 2010 at 6:48 PM, Jan Høydahl / Cominvent
>  wrote:
>> Hi,
>>
>> You can try the Tika command line to parse your Excel file; then you will see the 
>> exact textual output from it, which will be indexed into Solr, and thus 
>> inspect whether something is missing.
>>
>> Are you sure you use a version of Luke which supports your version of Lucene?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>>
>> On 11. aug. 2010, at 23.33, Harry Hochheiser wrote:
>>
>>> I'm trying to use Solr to index the contents of an Excel file, using
>>> the ExtractingRequestHandler (CSV handler won't work for me - I need
>>> to consider the whole spreadsheet as one document), and I'm running
>>> into some trouble.
>>>
>>> Is there any way to see what's going on during the indexing process?
>>> I'm concerned that I may be losing some terms, and I'd like to see if
>>> i can snoop on the terms that are added to the index as they go along.
>>> How might I do this?
>>>
>>> Barring that, how can I inspect the index post-fact?  I have tried to
>>> use luke to see what's in the index, but I get an error: "Unknown
>>> format version -10". Is it possible to get luke to work?
>>>
>>> My solr build is straight out of SVN.
>>>
>>> thanks,
>>>
>>> harry
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: hl.usePhraseHighlighter

2010-08-12 Thread Chris Hostetter

: Subject: hl.usePhraseHighlighter
: References: <1281125904548-1031951.p...@n3.nabble.com>
:  <960560.55971...@web52904.mail.re2.yahoo.com>
: In-Reply-To: <960560.55971...@web52904.mail.re2.yahoo.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking


-Hoss



Re: In multicore env, can I make it access core0 by default

2010-08-12 Thread Chris Hostetter

: In-Reply-To: 
: References: 
:  
: Subject: In multicore env, can I make it access core0 by default

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: PDF file

2010-08-12 Thread Chris Hostetter

: Subject: PDF file
: References: <20100729152139.321c4...@ibis>
:  
: In-Reply-To: 

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: Filter Performance in Solr 1.3

2010-08-12 Thread Lance Norskog
There was a major Lucene change in filter handling from Solr 1.3 to
Solr 1.4 (really, Lucene 2.4.1 to Lucene 2.9.2); filters are much, much
faster in 1.4. The filter is now consulted much earlier in the search
process, thus weeding out many more documents early.

It sounds like in Solr 1.3, you should only use filter queries for
queries with large document sets.

On Wed, Aug 11, 2010 at 12:21 PM, Bargar, Matthew B
 wrote:
> The search with the filter takes longer than a search for the same term
> but no filter after repeated searches, after the cache should have come
> into play. To be more specific, this happens on filters that exclude
> very few results from the overall set.
>
> For instance, type:video returns few results and as one would expect,
> returns much quicker than a search without that filter.
>
> -type:video, on the other hand returns a lot of results and excludes
> very few, and actually takes longer than a search without any filter at
> all.
>
> Is this what one might expect when using a filter that excludes few
> results, or does it still seem like something strange might be
> happening?
>
> Thanks,
> Matt
>
> -Original Message-
> From: Geert-Jan Brits [mailto:gbr...@gmail.com]
> Sent: Wednesday, August 11, 2010 2:55 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Filter Performance in Solr 1.3
>
> fq's are the preferred way to filter when the same filter is often used
> (since the filter set can be cached separately).
>
> as to your direct question:
>> My question is whether there is anything that can be done in 1.3 to
> help alleviate the problem, before upgrading to 1.4?
>
> I don't think so (perhaps some patches that I'm not aware of) .
>
> When are you seeing increased search time?
>
> Is it the first time the filter is used? If that's the case, that's
> logical since the filter needs to be built.
> (fq)-filters only show their strength (as said above) when you use them
> repeatedly.
>
> If, on the other hand, you're seeing slower response times with an
> fq-filter applied all the time than the same queries without the
> fq-filter, there must be something strange going on, since this really
> shouldn't happen in normal situations.
>
> Geert-Jan
>
>
>
>
>
> 2010/8/11 Bargar, Matthew B 
>
>> Hi there, I have a question about filter (fq) performance in Solr 1.3.
>> After doing some testing it seems as though adding a filter increases
>> search time. From what I've read here
>> http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/
>>
>> and here
>> http://www.lucidimagination.com/blog/2009/05/27/filtered-query-performance-increases-for-solr-14/
>>
>> it seems as though upgrading to 1.4 would solve this problem. My
>> question is whether there is anything that can be done in 1.3 to help
>> alleviate the problem, before upgrading to 1.4? It becomes an issue
>> because the majority of searches that are done on our site need some
>> content type excluded or filtered for. Does it make sense to use the
>> fq parameter in this way, or is there some better approach since
>> filters are almost always used?
>>
>> Thank you!
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Chris Hostetter

: Subject: Indexing large files using Solr Cell causes OutOfMemory error
: References: 
: In-Reply-To: 

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: index pdf files

2010-08-12 Thread Chris Hostetter

: Subject: index pdf files
: References: 
:  <4c63ed43.4030...@r.email.ne.jp>
:  
: In-Reply-To: 

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: Solr 1.4.1 and 3x: Grouping of query changes results

2010-08-12 Thread Chris Hostetter

: > Does not return document as expected:
: > id:1234 AND (-indexid:1 AND -indexid:2) AND -indexid:3
: > 
: > Has anyone else experienced this? The exact placement of the parens isn't
: > key, just adding a level of nesting changes the query results.
...
: I could be wrong but I think this has to do with Solr's lack of support for
: purely negative queries, try the following and see if it behaves correctly:
: 
: id:1234 AND (*:* AND -indexid:1 AND -indexid:2) AND -indexid:3

1) correct.  In general a purely negative query can't work -- queries must 
select something, it doesn't matter if they are nested in another query or 
not.

the query string "A AND (-B AND -C) AND -D" says that a document must 
match A and it must match "a query which does not match anything" and it 
must not match D ... it's thta middle clause that prevents anything from 
matching.

Solr does support purely negative queries if they are the "top level" 
query (ie: "q=-foo") but it doesn't rewrite nested sub queries (ie: "q=foo 
(-bar -baz)")

2) FWIW: setting aside the pure negative query aspect of this question, 
changing the grouping of clauses can always affect the results of a query 
-- this is because the grouping dictates the scoring (due to queryNorms 
and coord factors), so "A (B C (D E)) F" can produce results in a very 
different order than "A B C D E F" ... likewise "A C -B" will match 
different documents than "A (C -B)"  (the latter will match a document 
containing both A and B, the former will not)


-Hoss



Re: analysis tool vs. reality

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 8:07 PM, Chris Hostetter
wrote:

>
> : > You say it's bogus because the qp will divide on whitespace first --
> but
> : > you're assuming you know what query parser will be used ... the "field"
> : > query parser (to name one) doesn't split on whitespace first.  That's
> my
> : > point: analysis.jsp doesn't make any assumptions about what query
> parser
> : > *might* be used, it just tells you what your analyzers do with strings.
> : >
> :
> : you're right, we should just fix the bug that the queryparser tokenizes
> on
> : whitespace first. then analysis.jsp will be significantly less confusing.
>
> dude .. not trying to get into a holy war here
>
> actually I'm suggesting the practical solution: that we fix the primary
problem that makes it confusing.


> even if you change the Lucene QueryParser so that whitespace isn't a meta
> character it doesn't affect the underlying issue: analysis.jsp is agnostic
> about QueryParsers.


analysis.jsp isn't agnostic about queryparsers, it's ignorant of them, and
your default queryparser is actually a de-facto whitespace tokenizer; don't
try to sugarcoat it.

-- 
Robert Muir
rcm...@gmail.com


Re: DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor

2010-08-12 Thread Lance Norskog
Please add a JIRA issue for this.

On Wed, Aug 11, 2010 at 6:24 AM, Sascha Szott  wrote:
> Sorry, there was a mistake in the stack trace. The correct one is:
>
> SEVERE: Full Import failed
> org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
> value: /home/doe/foo is not a directory Processing Document # 3
>        at
> org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
>        at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
>        at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
>        at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
>        at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
>
> -Sascha
>
> On 11.08.2010 15:18, Sascha Szott wrote:
>>
>> Hi folks,
>>
>> why does FileListEntityProcessor ignore onError="continue" and abort
>> indexing if a directory or a file does not exist?
>>
>> I'm using both XPathEntityProcessor and FileListEntityProcessor with
>> onError set to continue. In case a directory or file is not present an
>> Exception is thrown and indexing is stopped immediately.
>>
>> Below you can find a stack trace that is generated in case the directory
>> /home/doe/foo does not exist:
>>
>> SEVERE: Full Import failed
>> org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
>> value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
>> at
>>
>> org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
>>
>> at
>>
>> org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
>>
>> at
>>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
>>
>> at
>>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
>>
>> at
>>
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
>>
>> at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
>> at
>>
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
>>
>> at
>>
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
>>
>> at
>>
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
>>
>>
>> How should I configure both processors so that missing directories and
>> files are ignored and the indexing process does not stop immediately?
>>
>> Best,
>> Sascha
>



-- 
Lance Norskog
goks...@gmail.com


Re: DIH and multivariable fields problems

2010-08-12 Thread Lance Norskog
Please add a JIRA issue for this.
https://issues.apache.org/jira/secure/BrowseProject.jspa

On Tue, Aug 10, 2010 at 6:59 PM, kenf_nc  wrote:
>
> Glad I could help. I also would think it was a very common issue. Personally
> my schema is almost all dynamic fields. I have unique_id, content,
> last_update_date and maybe one other field specifically defined, the rest
> are all dynamic. This lets me accept an almost endless variety of document
> types into the same schema.  So if I planned on using DIH I had to come up
> with a way, and stitching together solutions to a couple related issues got
> me to my script transform. Mine is more convoluted than the one I gave here,
> but obviously you got the gist of the idea.
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1081738.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: edismax pf2 and ps

2010-08-12 Thread Jayendra Patil
We pretty much had the same issue, and ended up customizing the ExtendedDismax
code.

In your case it's just a change of a single line:
addShingledPhraseQueries(query, normalClauses, phraseFields2, 2,
 tiebreaker, pslop);
to
addShingledPhraseQueries(query, normalClauses, phraseFields2, 2,
 tiebreaker, 0);
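
(For context, the request parameters involved would look something like
defType=edismax&qf=text&pf=text&ps=50&pf2=text -- the field name "text" and
the slop value are hypothetical here; the point is that the single ps value
currently applies to the pf2 shingles as well, which the change above
hardwires to 0.)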

Regards,
Jayendra


On Thu, Aug 12, 2010 at 1:04 PM, Ron Mayer  wrote:

> Short summary:
>
>   Is there any way I can specify that I want a lot
>   of phrase slop for the "pf" parameter, but none
>   at all for the "pf2" parameter?
>
> I find the 'pf' parameter with a pretty large 'ps' to do a very
> nice job for providing a modest boost to many documents that are
> quite well related to many queries in my system.
>
> In contrast, I find the 'pf2' parameter with zero 'ps' does
> extremely well at providing a high boost to documents that
> are often exactly what someone's searching for.
>
> Is there any way I can get both effects?
>
> Edismax's pf2 parameter is really nice for boosting exact phrases
> in queries like 'black jacket red cap white shoes'.   But as soon
> as even a little phrase slop (ps) is added, it seems like it starts
> boosting documents with red jackets and white caps just as much as
> those with black jackets and red caps.
>
> My gut feeling is that if I could have "pf" with a large phrase
> slop and the pf2 with zero phrase slop, it'd give me better overall
> results than any single phrase slop setting that gets applied to both.
>
> Is there any good way for me to test that?
>
>  Thanks,
>   Ron
>
>


Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-12 Thread Chris Hostetter
: >
: > That should still be true in the official 4.0 release (i really should
: > have said "When 4.0 can no longer read Solr 1.4 indexes"), ...
: > i haven't been following the details closely, but i suspect that tool
: > hasn't been written yet because there isn't much point until the full
: > details of the trunk index format are nailed down.

: This is news to me?
: 
: File formats are back-compatible between major versions. Version X.N should
: be able to read indexes generated by any version after and including version
: X-1.0, but may-or-may-not be able to read indexes generated by version
: X-2.N.

It was a big part of the proposal regarding the creation of the 3x 
branch ... that index format compatibility between major versions would 
no longer be supported by silently converting on first write -- instead 
there would be a tool for explicit conversion...

http://search.lucidimagination.com/search/document/c10057266d3471c6/proposal_about_version_api_relaxation
http://search.lucidimagination.com/search/document/c494a78f1ec1bfb5/lucene_3_x_branch_created



-Hoss



Re: Hierarchical faceting

2010-08-12 Thread Jayendra Patil
We were able to get hierarchical faceting working with a workaround
approach.

e.g. if you have Europe//Norway//Oslo as an entry

1. Create a new multivalued field with string type
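
   (e.g. a definition roughly like <field name="country_facet" type="string"
   indexed="true" stored="true" multiValued="true"/> -- the attributes shown
   are illustrative, assuming the field is named country_facet as in the
   queries below)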



2. Index the field for "Europe//Norway//Oslo" with values

0//Europe
1//Europe//Norway
2//Europe//Norway//Oslo

3. The facet can now be used in queries:

1st Level - Would return all entries @ 1st level e.g. 0//USA, 0//Europe

fq=

f.country_facet.facet.prefix=0//

facet.field=country_facet


2nd Level - Would return all entries @ second level in Europe
1//Europe//Norway, 1//Europe//Sweden

fq=country_facet:0//Europe

f.country_facet.facet.prefix=1//Europe

facet.field=country_facet



3rd Level - Would return 1//Europe//Norway entries

fq=country_facet:1//Europe//Norway

f.country_facet.facet.prefix=2//Europe//Norway

facet.field=country_facet

Increment the facet.prefix level by 1 so that you limit the facet results to
that prefix.
This also works for any depth.

Regards,
Jayendra


On Thu, Aug 12, 2010 at 6:01 PM, Mats Bolstad  wrote:

> Hey all,
>
> I am doing a search on hierarchical data, and I have a hard time
> getting my head around the following problem.
>
> I want a result as follows, in one single query only:
>
> USA (3)
> > California (2)
> > Arizona (1)
> Europe (4)
> > Norway (3)
> >> Oslo (3)
> > Sweden (1)
>
> How it looks in the XML/JSON response is not really important, this is
> more a presentation issue. I guess I could store the values "USA",
> "USA/California", "Europe/Norway/Oslo" as strings for each document,
> and do some JavaScript-ing to show the hierarchies appropriately. When
> a specific item in the facet is selected, for example "Norway", Solr
> could be queried with a filter query on "Europe/Norway*"?
>
> Does anyone have some experience they could please share with me?
>
> I have tried out SOLR-64, and it gives me the results I look for.
> However, I do not have the opportunity to use a patch in the
> production environment ...
>
> --
> Thanks,
> Mats Bolstad
>


Re: Multiple updatehandlers in solr, different autocommit settings

2010-08-12 Thread Chris Hostetter
: 
: I'm trying to set different autocommit settings to 2 separate request
: handlers...I would like a requesthandler to use an update handler and a
: second requesthandler use another update handler...
: 
: can I have more than one update handler in the same solrconfig?
: how can I configure a requesthandler to a specific updatehandler?

I don't believe it's possible to have more than one <updateHandler> in 
your solrconfig.xml.

The UpdateHandler is responsible for managing the IndexWriter, and only 
one of those can be open against a Lucene index at a time (i'm fairly 
certain)

One thing that's not clear from your question is whether you realize that 
there are no "transactions" in Solr ... the notion of a "commit" applies 
to all documents that have been added regardless of what RequestHandler 
they came from -- so having independent autoCommit settings for different 
RequestHandler instances wouldn't really give you anything ... when either 
handler triggers a commit, every document that has been added up until 
that point will be committed.

-Hoss



Re: analysis tool vs. reality

2010-08-12 Thread Chris Hostetter

: > You say it's bogus because the qp will divide on whitespace first -- but
: > you're assuming you know what query parser will be used ... the "field"
: > query parser (to name one) doesn't split on whitespace first.  That's my
: > point: analysis.jsp doesn't make any assumptions about what query parser
: > *might* be used, it just tells you what your analyzers do with strings.
: >
: 
: you're right, we should just fix the bug that the queryparser tokenizes on
: whitespace first. then analysis.jsp will be significantly less confusing.

dude .. not trying to get into a holy war here

even if you change the Lucene QueryParser so that whitespace isn't a meta 
character it doesn't affect the underlying issue: analysis.jsp is agnostic 
about QueryParsers.  Some other QParser the user uses might have other 
special behavior and if people don't understand the distinction between 
QueryParsing and analysis they can still be confused -- hell, even if the 
only QParser anyone ever uses is the Lucene QParser, and even if you get 
the QueryParser changed so that whitespace isn't a metacharacter, we 
are still going to be left with the fact that *other* characters (like '+' 
and '-' and '"' and '*' and ...) are metacharacters for that query parser, 
and have special meaning.

analysis.jsp isn't going to know about those, or do anything special for 
them -- so people can still be easily confused when analysis.jsp says 
one thing about how the string "+foo* -bar" gets analyzed, but that 
string as a query means something completely different.

Hence my point: leave arguments about QueryParser out of it -- how do we 
make the function of analysis.jsp more clear?


-Hoss



Re: Solrj ContentStreamUpdateRequest Slow

2010-08-12 Thread Chris Hostetter

: It returns in around a second.  When I execute the attached code it takes just
: over three minutes.  The optimal for me would be able get closer to the
: performance I'm seeing with curl using Solrj.

I think your problem may be that StreamingUpdateSolrServer buffers up 
commands and sends them in batches in a background thread.  If you want to 
send individual updates in real time (and time them) you should just use 
CommonsHttpSolrServer
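
For example, a minimal sketch of the timed, single-request path (the file
name, literal.id value, and Solr URL are placeholders, not from the original
code):

    import java.io.File;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class TimedExtract {
        public static void main(String[] args) throws Exception {
            // plain HTTP server: each request is sent immediately, no background batching
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("sample.pdf"));   // hypothetical file
            req.setParam("literal.id", "doc1");    // hypothetical unique key

            long start = System.currentTimeMillis();
            server.request(req);                   // blocks until Solr responds
            System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
        }
    }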


-Hoss



Re: analysis tool vs. reality

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 7:55 PM, Chris Hostetter
wrote:
>
>
> You say it's bogus because the qp will divide on whitespace first -- but
> you're assuming you know what query parser will be used ... the "field"
> query parser (to name one) doesn't split on whitespace first.  That's my
> point: analysis.jsp doesn't make any assumptions about what query parser
> *might* be used, it just tells you what your analyzers do with strings.
>

you're right, we should just fix the bug that the queryparser tokenizes on
whitespace first. then analysis.jsp will be significantly less confusing.


-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-12 Thread Chris Hostetter

: Furthermore, I would like to add its not just the highlight matches
: functionality that is horribly broken here, but the output of the analysis
: itself is misleading.
: 
: lets say i take 'textTight' from the example, and add the following synonym:
: 
: this is broken => broke
: 
: the query time analysis is wrong, as it clearly shows synonymfilter
: collapsing "this is broken" to broke, but in reality with the qp for that
: field, you are gonna get 3 separate tokenstreams and this will never
: actually happen (because the qp will divide it up on whitespace first)
: 
: So really the output from 'Query Analyzer' is completely bogus.

analysis.jsp is only intended to explain *analysis* ... it accurately 
tells you what the <analyzer> for the specified field (or 
fieldType) is going to produce given a hunk of text.

That is what it does, that is all that it does, that is all it has ever 
done, and all it has ever purported to do.

You say it's bogus because the qp will divide on whitespace first -- but 
you're assuming you know what query parser will be used ... the "field" 
query parser (to name one) doesn't split on whitespace first.  That's my 
point: analysis.jsp doesn't make any assumptions about what query parser 
*might* be used, it just tells you what your analyzers do with strings.

Saying the output of analysis.jsp is bogus because it doesn't take into 
account QueryParsing is like saying the output of stats.jsp is bogus 
because those are only the stats of the local solr instance on that 
machine, and it doesn't do distributed stats -- yeah that would be nice to 
have, but the stats.jsp never implies that's what it's giving you.

If there are ways we can make the purpose of analysis.jsp more obvious, 
and less misleading for people who don't understand the distinction 
between query parsing and analysis then i am all for it.  If you really 
believe getting rid of the "highlight" check box is going to help, then 
fine -- but i have yet to see any evidence that people who don't 
understand the relationship between query parsing and analysis are 
confused by the blue boxes.

what people seem to be confused by is when they see the same tokens 
ultimately produced by both the "index" analyzer and the "query" analyzer 
-- it doesn't matter if those tokens are in blue or not, if they see that 
the tokens in the "index" analyzer output are a superset of the tokens in 
the "query" analyzer output then they tend to assume that means searching 
for the string in the "query" box will match documents containing the 
string in the "index" text box.

Getting rid of the blue table cell is just going to make it harder to 
notice matching tokens in the output -- not reduce the confusion when 
those matching tokens exist in the output.

My question is: What can we do to make it more clear what the *purpose* of 
analysis.jsp is?  Is there verbiage we can add to the page to make it more 
obvious?

NOTE: I'm not just asking Robert, this is a question for the solr-user 
community as a whole.  I *know* what analysis.jsp is for, i've never been 
confused -- for people who have been confused in the past (or are still 
confused) please help us understand what type of changes we could make to 
the output of analysis.jsp to make its functionality more understandable.



-Hoss



SOLR Query

2010-08-12 Thread Moiz Bhukhiya
Hi there,


I have a problem querying SOLR for a specific field with a query string that
contains spaces. I added the following lines in schema.xml to add my own
defined fields. The fields are: ap_name, ap_address, ap_dob, ap_desg, ap_sec.

Since all these fields begin with ap_, I included the following
dynamicField.



I included this line to make a query against all fields instead of a
specific field.


I added the following document in my index:



1
Tom Cruise
San Fransisco



1. When I query q=Tom+Cruise, I should get the above document since it is
available in "text", which is my default query field. [Works as expected]
2. When I query q=ap_address:Tom, I should not get the above document since
Tom is not available in ap_address. [Works as expected]
3. When I query q=ap_address:Tom+Cruise, I should not get the above
document BUT I GET IT. [Doesn't work as expected]

Could anyone please explain what mistake I am making?

Thanks a lot, appreciate any help!
Moiz


Re: Duplicate a core

2010-08-12 Thread Chris Hostetter

: Is it possible to duplicate a core?  I want to have one core contain only
: documents within a certain date range (ex: 3 days old), and one core with
: all documents that have ever been in the first core.  The small core is then
: replicated to other servers which do "real-time" processing on it, but the
: "archive" core exists for longer term searching.

It's not something i've ever dealt with, but if i were going to pursue it 
i would investigate whether this works...

1) have three+ solr instances: "master", "archive" and one or more "query" 
   machines
2) index everything to core named "recent" on server "master"
3) configure the "query" machines to replicate "recent" from "master"
4) configure the "archive" machine to replicate "recent" from "master"
5) configure the "archive" machine to also have an "all" core
6) on some timed basis:
   - delete docs from "recent" on "master" that are *older* than X
   - delete docs from "recent" on "archive" that are *newer* than X
   - use the index merge command on "archive" to merge the "recent" 
 core into the "all" core


...i'm pretty sure that merge command will require that you shutdown both 
cores on archive during the merge, but that's a good idea anyway.
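
(The merge itself would presumably go through the CoreAdmin mergeindexes
command -- something along the lines of
http://archive:8983/solr/admin/cores?action=mergeindexes&core=all&indexDir=/path/to/recent/data/index
-- with the host name and paths obviously being placeholders.)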

if you need continuous searching of the "all" core to be available, then 
just setup that core on "archive" as a repeater and have some 
"archive-query" machines slaving off of it.


that should work.



-Hoss



Re: Data Import Handler Query

2010-08-12 Thread Manali Joshi
Thanks Alexey. That solved the issue. I am now able to get the information
for all images into the index.

On Thu, Aug 12, 2010 at 12:47 AM, Alexey Serba  wrote:

> Try to define image solr fields <-> db columns mapping explicitly in
> "image" entity, i.e.
>
> 
> 
>
>
> 
>
> See
> http://www.lucidimagination.com/search/document/c8f2ed065ee75651/dih_and_multivariable_fields_problems
>
> On Thu, Aug 12, 2010 at 2:30 AM, Manali Joshi 
> wrote:
> > I tried setting the schema fields that get the image data to
> > multiValued="true", but it still gets only the first image's data. It
> > doesn't have information about all the images.
> >
> >
> >
> >
> > On Wed, Aug 11, 2010 at 1:15 PM, kenf_nc 
> wrote:
> >
> >>
> >> It may not be the data config. Do you have the fields in schema.xml
> >> that the image data is going to populate set to multiValued="true"?
> >>
> >> Although, I would think the last image would be stored, not the first,
> but
> >> haven't really tested this.
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Query-tp1092010p1092917.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >
>


can searcher.getReader().getFieldNames() return only stored fields?

2010-08-12 Thread Gerald

Collection<String> myFL =
searcher.getReader().getFieldNames(IndexReader.FieldOption.ALL);

will return all fields in the index (i.e. indexed, stored, and
indexed+stored).

Collection<String> myFL =
searcher.getReader().getFieldNames(IndexReader.FieldOption.INDEXED);

likely returns all fields that are indexed (I haven't tried).

however, both of these can/will return fields that are not stored.  Is there
a parameter that I can use to only return fields that are stored?

there does not seem to be an IndexReader.FieldOption.STORED and I can't tell
if any of the others might work

any info helpful. thx
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/can-searcher-getReader-getFieldNames-return-only-stored-fields-tp1124178p1124178.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to extend the BinaryResponseWriter imposed by Solrj

2010-08-12 Thread Chris Hostetter

: I'm trying to extend the writer used by solrj
: (org.apache.solr.response.BinaryResponseWriter), i have declared it in
...
: I see that it is initialized, but when i try to set the 'wt' param to
: 'myWriter'
: 
: solrQuery.setParam("wt","myWriter"), nothing happen, it's still using the
: 'javabin' writer.

I'm not certain, but i don't think SolrJ respects a "wt" param set by the 
caller .. i think the "ResponseParser" dictates what "wt" param is sent to the 
server -- that's why javabin is the default and calling 
"server.setParser(new XMLResponseParser())" causes XML to be sent by the 
server (even if you don't set "wt=xml" in your SolrParams)

If you've customized the BinaryResponseWriter then presumably you've had 
to write a custom ResponseParser as well, correct? (otherwise how would 
you take advantage of your customizations to the format) ... so take a 
look at the existing ResponseParsers to see how they force the "wt" param 
and do the same thing in your custom ResponseParser.
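
A minimal sketch of that idea (assuming the custom writer is registered as
"myWriter" in solrconfig.xml and the wire format stays javabin-compatible --
both are assumptions, not from the original post):

    import org.apache.solr.client.solrj.impl.BinaryResponseParser;

    // Reuses the javabin parsing logic but advertises the custom wt value,
    // so SolrJ sends wt=myWriter instead of wt=javabin.
    public class MyWriterResponseParser extends BinaryResponseParser {
        @Override
        public String getWriterType() {
            return "myWriter";
        }
    }

    // usage: server.setParser(new MyWriterResponseParser());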

(Note: this is all mostly speculation on my part)

-Hoss



Re: Solr query result cache size and "expire" property

2010-08-12 Thread Chris Hostetter

: please help - how can I calculate queryresultcache size (how much RAM should
: be dedicated for that). I have 1,5 index size, 4 mio docs.
: QueryResultWindowSize is 20.
: Could I use "expire" property on the documents in this cache?

There is no "expire" property, items are automaticly removed from the 
cache if the cache gets full, and the entire cache is thrown out when a 
new searcher is loaded (that's the only time it would make sense to 
"expire" anything)

honestly: trial and error is typically the best bet for sizing your 
queryResultCache ... the size of your index is much less relevant than 
the types of queries you get.  If you typically only have 200 unique 
queries over and over again, and no one ever does any other queries, then 
any number above 200 is going to be essentially the same.

if you have 200 queries that get hit a *lot* and 100 other queries that 
get hit once or twice ever ... then something ~250 is probably a good idea 
... any more is probably just a waste of RAM, any less is probably a waste 
of CPU.



-Hoss



Re: Phrase search

2010-08-12 Thread Chris Hostetter

: I'm trying to match "Apple 2" but not "Apple2" using phrase search, this is 
why I have it quoted.

:  I was under the impression --when I use phrase search-- all the 
: analyzer magic would not apply, but it is!!!  Otherwise, how would I 
: search for a phrase?!

well .. yes ... even with phrase searches your query is analyzed.

the only difference is that with a quoted phrase search, the entire phrase 
is analyzed at one time -- when the input isn't quoted, the whitespace is 
evaluated by the QueryParser as markup just like quotes and +/-, 
etc... (unless it's escaped) and the individual words are analyzed 
independently.

: Using Google, when I search for "Windows 7" (with quotes), unlike Solr, 
: I don't get hits on "Window7".  I want to use catenateNumbers="1" which 
: I want it to take effect on other searches but no phrase searches.  Is 
: this possible ?

you need to elaborate more on what you do and don't want to match -- so 
far you've given one example of a query you want to execute, and a 
document you *don't* want to match that query, but not an example of what 
types of documents you *do* want to match that query -- you also haven't 
given examples of queries that you *do* want that example document to 
match.

i suspect that catenateNumbers="1" isn't actually your problem ... it 
sounds like you don't actually want WordDelimiterFilter doing the "split" 
at index time at all.

Forget the phrase queries for a second: the question to ask yourself is: 
when you index a document containing "Windows7" do you want a search for 
the word Windows to match that document?

If the answer is "no" then you probably don't want WordDelimiterFilter at 
all.



-Hoss



Hierarchical faceting

2010-08-12 Thread Mats Bolstad
Hey all,

I am doing a search on hierarchical data, and I have a hard time
getting my head around the following problem.

I want a result as follows, in one single query only:

USA (3)
> California (2)
> Arizona (1)
Europe (4)
> Norway (3)
>> Oslo (3)
> Sweden (1)

How it looks in the XML/JSON response is not really important, this is
more a presentation issue. I guess I could store the values "USA",
"USA/California", "Europe/Norway/Oslo" as strings for each document,
and do some JavaScript-ing to show the hierarchies appropriately. When
a specific item in the facet is selected, for example "Norway", Solr
could be queried with a filter query on "Europe/Norway*"?

Does anyone have some experience they could please share with me?

I have tried out SOLR-64, and it gives me the results I look for.
However, I do not have the opportunity to use a patch in the
production environment ...

--
Thanks,
Mats Bolstad


Re: clustering component

2010-08-12 Thread Matt Mitchell
Hey thanks Stanislaw! I'm going to try this against the current trunk
tonight and see what happens.

Matt

On Wed, Jul 28, 2010 at 8:41 AM, Stanislaw Osinski <
stanislaw.osin...@carrotsearch.com> wrote:

> > The patch should also work with trunk, but I haven't verified it yet.
> >
>
> I've just added a patch against solr trunk to
> https://issues.apache.org/jira/browse/SOLR-1804.
>
> S.
>


RE: Require some advice

2010-08-12 Thread Nagelberg, Kallin
Try this,

http://viewer.opencalais.com/

They have an open API for that data. With your text message of :

"John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard"
 
It gives back:

People: John Mayer Mumbai
Positions: body guard, car driver.

It's not perfect but it's not bad either..

Regards,
Kallin Nagelberg
-Original Message-
From: Michael Griffiths [mailto:mgriffi...@am-ind.com] 
Sent: Thursday, August 12, 2010 3:28 PM
To: solr-user@lucene.apache.org
Subject: RE: Require some advice

Solr is a search engine, not an entity extraction tool. 

While there are some decent open source entity extraction tools, they are 
focused on processing sentences and paragraphs. The structural differences in 
text messages mean you'd need to do a fair amount of work to get decent entity 
extraction.

That said, you may want to look into simple word/phrase matching if your domain 
is sufficiently small. Use RegEx to extract ZIP, use dictionaries to extract 
city/area, skills, and names. Much simpler and cheaper. 
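
For instance, a tiny Java sketch of the regex part (assuming Indian-style
6-digit PIN codes, as in the "411004" example below -- the pattern is
illustrative only):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ZipExtractor {
        // a standalone run of exactly six digits
        private static final Pattern PIN = Pattern.compile("\\b\\d{6}\\b");

        public static String extractZip(String sms) {
            Matcher m = PIN.matcher(sms);
            return m.find() ? m.group() : null;   // first match, or null if none
        }

        public static void main(String[] args) {
            System.out.println(extractZip(
                "John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard"));
            // prints: 411004
        }
    }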

-Original Message-
From: Pavan Gupta [mailto:pavan@gmail.com] 
Sent: Thursday, August 12, 2010 2:58 PM
To: solr-user@lucene.apache.org
Subject: Require some advice

Hi,
I am new to text search and mining and have been doing research on the different 
available products. My application requires reading an SMS message 
(unstructured) and finding entities such as person name, area, zip, city, 
and skills associated with the person. The SMS would be in the form of free text. The 
parsed data would be stored in a database and used by Solr to display results.
An SMS message could be in the following form:
"John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard"
We need to interpret in the following manner:
first name -> John
last name -> Mayer
city-> Mumbai
zip -> 411004
area->Juhu
skills -> car driver, body guard


1. Is Solr capable enough to handle this application, considering that the SMS 
message would be unstructured?
2. How is Solr/Lucene compared to other tools such as UIMA, GATE, CER 
(Stanford University), and LingPipe?
3. Is Solr only text search, or can it be used for information extraction?
4. Is it recommended to use Solr with other products such as UIMA and GATE?

There are companies that specialize in making meaning out of unstructured 
SMS messages. Do we have something similar in the open source world? Can we extend 
Solr for the same purpose?

Your reply would be appreciated.
Thanking you.
Regards,
Pavan


Re: General questions about distributed solr shards

2010-08-12 Thread Shawn Heisey

 On 8/11/2010 3:27 PM, JohnRodey wrote:

1) Is there any information on preferred maximum sizes for a single solr
index.  I've read some people say 10 million, some say 80 million, etc...
Is there any official recommendation or has anyone experimented with large
datasets into the tens of billions?

2) Is there any down side to running multiple solr shard instances on a
single machine rather than one shard instance with a larger index per
machine?  I would think that having 5 instances with 1/5 the index would
return results approx 5 times faster.

3) Say you have a solr configuration with multiple shards.  If you attempt
to query while one of the shards is down you will receive a HTTP 500 on the
client due to a connection refused on the server.  Is there a way to tell
the server to ignore this and return as many results as possible?  In other
words if you have 100 shards, it is possible that occasionally a process may
die, but I would still like to return results from the active shards.


1) It highly depends on what's in your index.  I'll let someone more 
qualified address this question in more detail.


2) Distributed search adds overhead.  It has to query the individual 
shards, send additional requests to gather the matching records, and 
then assemble the results.  If you create enough shards that you can fit 
all (or most) of each index in whatever RAM is left for the OS disk 
cache, you'll see a VERY significant boost in search speed by using 
shards.  If


3) There are a couple of patches that address this, but in the end, 
you'll be better served by setting up a replicated pair and using a load 
balancer.  I've got a distributed index with two machines per shard, the 
master and the slave.  The load balancer checks the ping status URL 
every 5 seconds to see whether each machine is up.  If one goes down, it 
is removed from the load balancer and everything keeps working.
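
For reference, a distributed request is just an ordinary query with the shard 
endpoints listed in the shards parameter. A rough SolrJ sketch (the host names 
are made up; each entry could just as well be a load-balancer VIP fronting a 
master/slave pair):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardQuerySketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://search-lb:8983/solr");
    SolrQuery q = new SolrQuery("title:lucene");
    // Comma-separated shard endpoints; if one host is down the whole request
    // fails, which is why each name here points at a VIP rather than a single box.
    q.set("shards", "shard1-vip:8983/solr,shard2-vip:8983/solr,shard3-vip:8983/solr");
    QueryResponse rsp = solr.query(q);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}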


Each of my shards is about 12.5GB in size and the VMs that access the 
data have 9GB total RAM.  I wish I had more memory!




Re: possible bug in sorting by Function?

2010-08-12 Thread solr-user

issue resolved. The problem was that solr.war was silently not being overwritten
by the new version.

will try to spend more time debugging before posting.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-bug-in-sorting-by-Function-tp1118235p1121349.html
Sent from the Solr - User mailing list archive at Nabble.com.


SOLR-788 - disributed More Like This

2010-08-12 Thread Shawn Heisey
 I tried some time ago to use SOLR-788.  Ultimately I was able to get 
both patch versions to apply (separately), but neither worked.  The 
suggestion I received when I commented on the issue was to download the 
specific release mentioned in the patch and then update, but the patch 
was created before the merge with Lucene, so I have no idea how to go 
about that.


Without a much better understanding of Solr internals and a bunch more 
time to learn Java, there's no way that I can work on it myself.  Is 
there anyone who has the time and inclination to get distributed MLT 
working with branch_3x?  A further goal would be to have it actually 
committed before release.


Thanks,
Shawn



RE: Require some advice

2010-08-12 Thread Michael Griffiths
Solr is a search engine, not an entity extraction tool. 

While there are some decent open source entity extraction tools, they are 
focused on processing sentences and paragraphs. The structural differences in 
text messages means you'd need to do a fair amount of work to get decent entity 
extraction.

That said, you may want to look into simple word/phrase matching if your domain 
is sufficiently small. Use RegEx to extract ZIP, use dictionaries to extract 
city/area, skills, and names. Much simpler and cheaper. 
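
As a rough illustration (toy dictionaries and a six-digit PIN-code regex, all 
made up for the example):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SmsFieldGuesser {
  public static void main(String[] args) {
    // In practice the dictionaries would come from gazetteer/skill files.
    Set<String> cities = new HashSet<String>(Arrays.asList("Mumbai", "Delhi", "Pune"));
    Set<String> skills = new HashSet<String>(Arrays.asList("car driver", "body guard", "cook"));
    Pattern zip = Pattern.compile("\\b\\d{6}\\b"); // Indian PIN codes are six digits

    String sms = "John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard";
    Matcher m = zip.matcher(sms);
    if (m.find()) System.out.println("zip    -> " + m.group());
    for (String c : cities) if (sms.contains(c)) System.out.println("city   -> " + c);
    for (String s : skills) if (sms.contains(s)) System.out.println("skills -> " + s);
  }
}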

-Original Message-
From: Pavan Gupta [mailto:pavan@gmail.com] 
Sent: Thursday, August 12, 2010 2:58 PM
To: solr-user@lucene.apache.org
Subject: Require some advice

Hi,
I am new to text search and mining and have been doing research for different 
available products. My application requires reading a SMS message
(unstructured) and finding out entities such as person name, area, zip , city 
and skills associated with the person. SMS would be in form of free text. The 
parsed data would be stored in database and used by Solr to display results.
An SMS message could be in the following form:
"John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard"
We need to interpret in the following manner:
first name -> John
last name -> Mayer
city-> Mumbai
zip -> 411004
area->Juhu
skills -> car driver, body guard


1. Is Solr capable enough to handle this application considering that SMS 
message would be unstructured.
2. How is Solr/Lucene as compared to other tools such as UIMA, GATE, CER 
(stanford university), Lingpipe?
3. Is Solr only text search or can be used for information extraction?
4. Is it recommended to use Solr with other products such as UIMA and GATE.

There are companies that are specialized in making meaning out of unstructured 
SMS messages. Do we have something similar in open source world? Can we extend 
Solr for the same purpose?

You reply would be appreciated.
Thanking you.
Regards,
Pavan


Re: possible bug in sorting by Function?

2010-08-12 Thread solr-user

problem could be related to some oddity in sum()??  some more examples:

note: Latitude and Longitude are fields of type=double

works:
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(1,1.0))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(Latitude,Latitude)%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(rad(Latitude))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1.0))%20asc

fails:
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1),sum(Latitude,1))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1.0),sum(Latitude,1.0))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(rad(Latitude),rad(Latitude))%20asc

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-bug-in-sorting-by-Function-tp1118235p1120017.html
Sent from the Solr - User mailing list archive at Nabble.com.


Free Webinar: Findability: Designing the Search Experience

2010-08-12 Thread Erik Hatcher

Here's perhaps the coolest webinar we've done to date, IMO :)

I attended Tyler's presentation at Lucene EuroCon* and thoroughly  
enjoyed it.  Search UI/UX is a fascinating topic to me, and really  
important to do well for the applications most of us are building.


I'm pleased to pass along the blurb below.  See you there!

Erik

* http://lucene-eurocon.org/sessions-track2-day2.html#3



Lucid Imagination presents a free webinar
Wednesday, August 18, 2010 10:00 AM PST / 1:00 PM EST / 19:00 CET
Sign up at http://www.eventsvc.com/lucidimagination/081810?trk=ap

You don't need billions of dollars or users to build a user-friendly  
search application. In fact, studies of how and why people search have  
revealed a set of principles that can  result in happy users who find  
what they're seeking with as little friction as possible -- and help  
you build a better, more successful search application.


Join special guest Tyler Tate, user experience designer at UK-based  
TwigKit Search, for a high-level discussion of key user interface  
strategies for search that can be leveraged with Lucene and Solr. The  
presentation covers:

* Ten things to know about designing the search experience
* When to assume users know what they’re looking for – and when not to
* Navigation/discovery techniques, such as faceted navigation, tag  
clouds, histograms and more
* Practical considerations in leveraging suggestions into search  
interactions


Lucid Imagination presents a free webinar
Wednesday, August 18, 2010 10:00 AM PST / 1:00 PM EST / 19:00 CET
Sign up at http://www.eventsvc.com/lucidimagination/081810?trk=ap

About the presenter: Tyler Tate is co-founder of TwigKit, a UK-based  
company focused on building truly usable interfaces for search. Tyler  
has led user experience design for enterprise applications from CMS to  
CRM, and is the creator of the popular 1KB CSS Grid. Tyler also  
organizes a monthly Enterprise Search Meetup in London, and blogs at  
blog.twigkit.com.


-
Join the Revolution!
Don't miss Lucene Revolution
Lucene & Solr User Conference
Boston | October 7-8 2010
http://lucenerevolution.org
-

This webinar is sponsored by Lucid Imagination, the commercial entity  
exclusively dedicated to Apache Lucene/Solr open source search  
technology. Our solutions can help you develop and deploy search  
solutions with confidence: SLA-based support subscriptions,  
professional training, best practices consulting, along with value-  
add software and free documentation and certified distributions of  
Lucene and Solr.


"Apache Lucene" and "Apache Solr" are trademarks of the Apache  
Software Foundation.

RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
I got the following error when I index some pdf files. I wonder if anyone has 
this issue before and how to fix it. Thanks so much in advance!

***



Error 500 

HTTP ERROR: 500org.apache.tika.exception.TikaException: 
Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.pdf.pdfpar...@44ffb2
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
***

-Original Message-
From: Stefan Moises [mailto:moi...@shoptimax.de] 
Sent: Thursday, August 12, 2010 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

Maybe this helps: 
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers,
Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]:
> Does anyone know if I need define fields in schema.xml for indexing pdf 
> files? If I need, please tell me how I can do it.
>
> I defined fields in schema.xml and created data-configuration file by using 
> xpath for xml files. Would you please tell me if I need do it for pdf files 
> and how I can do?
>
> Thanks so much for your help as always!
>
> -Original Message-
> From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
> Sent: Thursday, August 12, 2010 11:45 AM
> To: solr-user@lucene.apache.org
> Subject: Re: index pdf files
>
> To help you we need the description of your fields in your schema.xml and
> the query that you do when you search only a single word.
>
> Marco Martínez Bautista
> http://www.paradigmatecnologico.com
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón
> Tel.: 91 352 59 42
>
>
> 2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]
>
>
>> I wrote a simple java program to import a pdf file. I can get a result when
>> I do search *:* from admin page. I get nothing if I search a word. I wonder
>> if I did something wrong or miss set something.
>>
>> Here is part of result I get when do *:* search:
>> *
>> -
>> -
>>   Hristovski D
>>   
>> -
>>   application/pdf
>>   
>> -
>>   microarray analysis, literature-based discovery, semantic
>> predications, natural language processing
>>   
>> -
>>   Thu Aug 12 10:58:37 EDT 2010
>>   
>> -
>>   Combining Semantic Relations and DNA Microarray Data for Novel
>> Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
>> for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
>> Kastrin,2..

Require some advice

2010-08-12 Thread Pavan Gupta
Hi,
I am new to text search and mining and have been doing research for
different available products. My application requires reading a SMS message
(unstructured) and finding out entities such as person name, area, zip ,
city and skills associated with the person. SMS would be in form of free
text. The parsed data would be stored in database and used by Solr to
display results.
An SMS message could be in the following form:
"John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard"
We need to interpret in the following manner:
first name -> John
last name -> Mayer
city-> Mumbai
zip -> 411004
area->Juhu
skills -> car driver, body guard


1. Is Solr capable enough to handle this application considering that SMS
message would be unstructured.
2. How is Solr/Lucene as compared to other tools such as UIMA, GATE, CER
(stanford university), Lingpipe?
3. Is Solr only text search or can be used for information extraction?
4. Is it recommended to use Solr with other products such as UIMA and GATE.

There are companies that are specialized in making meaning out of
unstructured SMS messages. Do we have something similar in open source
world? Can we extend Solr for the same purpose?

You reply would be appreciated.
Thanking you.
Regards,
Pavan


XSL import/include relative to app server home directory...

2010-08-12 Thread Brian Johnson
Hello,

I'm customizing my XML response with the XSLTResponseWriter using
"&wt=xslt&tr=transform.xsl". Because I have a few use-cases to support, I
wanted to break up the common bits and import/include them from multiple top
level xslt files, but it appears that the base directory of the transform is
the directory the application was launched in.

Inside my "transform.xsl" I have this, for example

<xsl:include href="common/image-links.xsl"/>

which results in stack traces such as (copied only the relevant bits).

Caused by: java.io.IOException: Unable to initialize Templates 'transform.xsl'

Caused by: javax.xml.transform.TransformerException: Had IO Exception
with stylesheet file: common/image-links.xsl
Caused by: java.io.FileNotFoundException: C:\dev\jboss-5.1.0.GA
\bin\common\image-links.xsl

This appears to be caused by a lack of provided systemId on the StreamSource
of the xslt document I've requested. I've copied the relevant lines that I
believe are the root cause of the problem here for reference.

TransformFactory.getTemplates():line 105-6

final InputStream xsltStream = loader.openResource("xslt/" + filename);
result = tFactory.newTemplates(new StreamSource(xsltStream));


The "loader" variable is an instance of solr's ResourceLoader which has no
ability to provide the systemId to set on StreamSource to make relative
references work in the xslt. It seems like we need something along the lines
of

String systemId = loader.getResourceURL().toString() + "xslt/";
result = tFactory.newTemplates(new StreamSource(xsltStream, systemId));


I looked for a bug/patch and didn't see anything. Please let me know, if I
missed the patch or if there is another way to solve this problem (aside
from not using xsl:include or xsl:import).

Thanks in advance,

Brian

For reference...
http://onjava.com/pub/a/onjava/excerpt/java_xslt_ch5/index.html?page=5
https://jira.springframework.org/secure/attachment/10163/AbstractXsltView.patch
(similar
bug that was in spring)


Field getting tokenized prior to charFilter on select query

2010-08-12 Thread Andrew Chalupa

I'm attempting to make use of PatternReplaceCharFilterFactory, but am running 
into issues on both 1.4.1 ( I ported it) and on nightly (4.0-2010-07-27).  It 
seems that on a real query the charFilter isn't executed prior to the 
tokenizer. 

I modified the example configuration included in the distribution with the 
following fieldType in schema.xml and mapped a new field to it. 
    
    
      
        
        
        
        
        
        
      
        
    
    
    
I validated that the regex works properly outside of Solr using just Java.  The 
regex attempts to normalize single word characters around an '&' into something 
consistent for searching.  For example, it will turn "A & B Company" into "A&B 
Company".  The user can then search on "A&B", "A and B", or "A & B" and the 
proper result will be located.

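The replacement I'm after behaves roughly like this in plain Java (the pattern 
below is only an approximation of the one configured in my schema):

import java.util.regex.Pattern;

public class AmpersandNormalizer {
  // Collapse single word characters around '&': "A & B Company" -> "A&B Company".
  private static final Pattern P = Pattern.compile("\\b(\\w)\\s*&\\s*(\\w)\\b");

  public static void main(String[] args) {
    System.out.println(P.matcher("A & B Company").replaceAll("$1&$2")); // prints "A&B Company"
  }
}
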
However, when I import a document with "A & B Company" I can't ever locate it 
with "A & B" query.  It can be located with "A&B" query.  When I run 
analysis.jsp it works properly and it will match using any of the combinations.

So from this I concluded that it was being indexed properly, but for some 
reason the query wasn't applying the regex properly.  I hooked up a debugger 
and could see a difference in how the analyzer was applying the charFilter and 
how the query was applying the charFilter.  When the analyzer invoked 
PatternReplaceCharFilterFactory.create(CharStream) the entire field was 
provided in a single call.  When the query invoked 
PatternReplaceCharFilterFactory.create(CharStream) it invoked it 3 times with 3 
seperate tokens (A, &, B).  Because of this the regex won't ever locate the 
full string in the field.

I'm using the following encoded URL to perform the query.  
This works
http://localhost:8983/solr/select?q=name:%28a%26b%29

But this doesn't
http://localhost:8983/solr/select?q=name:%28a+%26+b%29

Why is the query parser tokenizing the name field prior to the charFilter 
getting a chance to perform processing? 

Re: possible bug in sorting by Function?

2010-08-12 Thread solr-user

small typo in last email:  second sum should have been hsin, but I notice
that the problem also occurs when I leave it as sum

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-bug-in-sorting-by-Function-tp1118235p1118260.html
Sent from the Solr - User mailing list archive at Nabble.com.


possible bug in sorting by Function?

2010-08-12 Thread solr-user

I was looking at the ability to sort by Function that was added to solr.

For the most part it seems to work.  However solr doesn't seem to like to
sort by certain functions. 

For example, this sum works:

http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(1,Latitude,Longitude,sum(Latitude,Longitude))
asc

but this hsin doesn't work:

http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(3959,rad(47.544594),rad(-122.38723),rad(Latitude),rad(Longitude))

and gives me a "Must declare sort field or function" error, pointing to a
line in QueryParsing.java.

Note that I did apply the SOLR-1297-2.patch supplied by Koji Sekiguchi but
it didn't seem to help.

I am using solr 903398 2010-01-26 20:21:09Z.

Any suggestions appreciated.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-bug-in-sorting-by-Function-tp1118235p1118235.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much. I got it work now. I really appreciate your help!
Xiaohui 

-Original Message-
From: Stefan Moises [mailto:moi...@shoptimax.de] 
Sent: Thursday, August 12, 2010 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

Maybe this helps: 
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers,
Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]:
> Does anyone know if I need define fields in schema.xml for indexing pdf 
> files? If I need, please tell me how I can do it.
>
> I defined fields in schema.xml and created data-configuration file by using 
> xpath for xml files. Would you please tell me if I need do it for pdf files 
> and how I can do?
>
> Thanks so much for your help as always!
>
> -Original Message-
> From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
> Sent: Thursday, August 12, 2010 11:45 AM
> To: solr-user@lucene.apache.org
> Subject: Re: index pdf files
>
> To help you we need the description of your fields in your schema.xml and
> the query that you do when you search only a single word.
>
> Marco Martínez Bautista
> http://www.paradigmatecnologico.com
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón
> Tel.: 91 352 59 42
>
>
> 2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]
>
>
>> I wrote a simple java program to import a pdf file. I can get a result when
>> I do search *:* from admin page. I get nothing if I search a word. I wonder
>> if I did something wrong or miss set something.
>>
>> Here is part of result I get when do *:* search:
>> *
>> -
>> -
>>   Hristovski D
>>   
>> -
>>   application/pdf
>>   
>> -
>>   microarray analysis, literature-based discovery, semantic
>> predications, natural language processing
>>   
>> -
>>   Thu Aug 12 10:58:37 EDT 2010
>>   
>> -
>>   Combining Semantic Relations and DNA Microarray Data for Novel
>> Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
>> for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
>> Kastrin,2...
>> *
>> Please help me out if anyone has experience with pdf files. I really
>> appreciate it!
>>
>> Thanks so much,
>>
>>
>>  
>

-- 
***
Stefan Moises
Senior Softwareentwickler

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***



Re: Solr Doc Lucene Doc !?

2010-08-12 Thread stockii

i'm writing a little thesis about this, and i need to know how solr uses
lucene, and in which way, for example when using dih and searching. so it's for my
better understanding ..  ;-)


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1118089.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help! I defined dynamic field in schema.xml as 
following:


But I wonder what I should put for .

I really appreciate your help!

-Original Message-
From: Stefan Moises [mailto:moi...@shoptimax.de] 
Sent: Thursday, August 12, 2010 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

Maybe this helps: 
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers,
Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]:
> Does anyone know if I need define fields in schema.xml for indexing pdf 
> files? If I need, please tell me how I can do it.
>
> I defined fields in schema.xml and created data-configuration file by using 
> xpath for xml files. Would you please tell me if I need do it for pdf files 
> and how I can do?
>
> Thanks so much for your help as always!
>
> -Original Message-
> From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
> Sent: Thursday, August 12, 2010 11:45 AM
> To: solr-user@lucene.apache.org
> Subject: Re: index pdf files
>
> To help you we need the description of your fields in your schema.xml and
> the query that you do when you search only a single word.
>
> Marco Martínez Bautista
> http://www.paradigmatecnologico.com
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón
> Tel.: 91 352 59 42
>
>
> 2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]
>
>
>> I wrote a simple java program to import a pdf file. I can get a result when
>> I do search *:* from admin page. I get nothing if I search a word. I wonder
>> if I did something wrong or miss set something.
>>
>> Here is part of result I get when do *:* search:
>> *
>> -
>> -
>>   Hristovski D
>>   
>> -
>>   application/pdf
>>   
>> -
>>   microarray analysis, literature-based discovery, semantic
>> predications, natural language processing
>>   
>> -
>>   Thu Aug 12 10:58:37 EDT 2010
>>   
>> -
>>   Combining Semantic Relations and DNA Microarray Data for Novel
>> Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
>> for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
>> Kastrin,2...
>> *
>> Please help me out if anyone has experience with pdf files. I really
>> appreciate it!
>>
>> Thanks so much,
>>
>>
>>  
>

-- 
***
Stefan Moises
Senior Softwareentwickler

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***



Results from More then One Cors?

2010-08-12 Thread Jörg Agatz
Hello users...

I tried to get results from more than one core,
but I don't know how.

Maybe you have an idea?

I need it in PHP.

King


Re: Solr Doc Lucene Doc !?

2010-08-12 Thread kenf_nc

Are you just trying to learn the tiny details of how Solr and DIH work? Is
this just an intellectual curiosity? Or are you having some specific problem
that you are trying to solve? If you have a problem, could you describe the
symptoms of the problem? I am using Solr, DIH, and several other related
technologies and have never needed to know the difference between a
SolrDocument and a LuceneDocument or how the UpdateHandler chains. So I'm
curious about what your ultimate goal is with these questions.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1117472.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: index pdf files

2010-08-12 Thread Stefan Moises
Maybe this helps: 
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2


Cheers,
Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]:

Does anyone know if I need define fields in schema.xml for indexing pdf files? 
If I need, please tell me how I can do it.

I defined fields in schema.xml and created data-configuration file by using 
xpath for xml files. Would you please tell me if I need do it for pdf files and 
how I can do?

Thanks so much for your help as always!

-Original Message-
From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
Sent: Thursday, August 12, 2010 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

To help you we need the description of your fields in your schema.xml and
the query that you do when you search only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]

   

I wrote a simple java program to import a pdf file. I can get a result when
I do search *:* from admin page. I get nothing if I search a word. I wonder
if I did something wrong or miss set something.

Here is part of result I get when do *:* search:
*
-
-
  Hristovski D
  
-
  application/pdf
  
-
  microarray analysis, literature-based discovery, semantic
predications, natural language processing
  
-
  Thu Aug 12 10:58:37 EDT 2010
  
-
  Combining Semantic Relations and DNA Microarray Data for Novel
Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
Kastrin,2...
*
Please help me out if anyone has experience with pdf files. I really
appreciate it!

Thanks so much,


 
   


--
***
Stefan Moises
Senior Softwareentwickler

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***



RE: Improve Query Time For Large Index

2010-08-12 Thread Burton-West, Tom
Hi Peter,

If hits aren't showing up, and you aren't getting any queryResultCache hits 
even with the exact query being repeated, something is very wrong.  I'd suggest 
first getting the query result cache working, and then moving on to look at 
other possible bottlenecks.  

What are your settings for queryResultWindowSize and queryResultMaxDocsCached?

Following up on Robert's point, you might also try to run a few queries in the 
admin interface with the debug flag on to see if the query parser is creating 
phrase queries (assuming you have queries like http://foo.bar.baz).  The 
debug/explain will indicate whether the parsed query is a PhraseQuery.
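
If it is easier to check from code, the same information is available through 
SolrJ; a small sketch (the core URL and query here are only placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParsedQueryCheck {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("\"http foo bar\"");
    q.set("debugQuery", "true");
    QueryResponse rsp = solr.query(q);
    // The "parsedquery" entry shows whether the parser built a PhraseQuery.
    System.out.println(rsp.getDebugMap().get("parsedquery"));
  }
}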

Tom



-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Thursday, August 12, 2010 5:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Improve Query Time For Large Index

Hi Tom,

I tried again with:
  

and even now the hitratio is still 0. What could be wrong with my setup?

('free -m' shows that the cache has over 2 GB free.)

Regards,
Peter.

> Hi Peter,
>
> Can you give a few more examples of slow queries?  
> Are they phrase queries? Boolean queries? prefix or wildcard queries?
> If one word queries are your slow queries, than CommonGrams won't help.  
> CommonGrams will only help with phrase queries.
>
> How are you using termvectors?  That may be slowing things down.  I don't 
> have experience with termvectors, so someone else on the list might speak to 
> that.
>
> When you say the query time for common terms stays slow, do you mean if you 
> re-issue the exact query, the second query is not faster?  That seems very 
> strange.  You might restart Solr, and send a first query (the first query 
> always takes a relatively long time.)  Then pick one of your slow queries and 
> send it 2 times.  The second time you send the query it should be much faster 
> due to the Solr caches and you should be able to see the cache hit in the 
> Solr admin panel.  If you send the exact query a second time (without enough 
> intervening queries to evict data from the cache, ) the Solr queryResultCache 
> should get hit and you should see a response time in the .01-5 millisecond 
> range.
>
> What settings are you using for your Solr caches?
>
> How much memory is on the machine?  If your bottleneck is disk i/o for 
> frequent terms, then you want to make sure you have enough memory for the OS 
> disk cache.  
>
> I assume that http is not in your stopwords.  CommonGrams will only help with 
> phrase queries
> CommonGrams was committed and is in Solr 1.4.  If you decide to use 
> CommonGrams you definitely need to re-index and you also need to use both the 
> index time filter and the query time filter.  Your index will be larger.
>
> 
> 
> 
> 
>
> 
> 
> 
> 
>
>
>
> Tom
> -Original Message-
> From: Peter Karich [mailto:peat...@yahoo.de] 
> Sent: Tuesday, August 10, 2010 3:32 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Improve Query Time For Large Index
>
> Hi Tom,
>
> my index is around 3GB large and I am using 2GB RAM for the JVM although
> a some more is available.
> If I am looking into the RAM usage while a slow query runs (via
> jvisualvm) I see that only 750MB of the JVM RAM is used.
>
>   
>> Can you give us some examples of the slow queries?
>> 
> for example the empty query solr/select?q=
> takes very long or solr/select?q=http
> where 'http' is the most common term
>
>   
>> Are you using stop words?  
>> 
> yes, a lot. I stored them into stopwords.txt
>
>   
>> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
>> 
> this looks interesting. I read through
> https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
> I only need to enable it via:
>
>  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"/>
>
> right? Do I need to reindex?
>
> Regards,
> Peter.
>
>   
>> Hi Peter,
>>
>> A few more details about your setup would help list members to answer your 
>> questions.
>> How large is your index?  
>> How much memory is on the machine and how much is allocated to the JVM?
>> Besides the Solr caches, Solr and Lucene depend on the operating system's 
>> disk caching for caching of postings lists.  So you need to leave some 
>> memory for the OS.  On the other hand if you are optimizing and refreshing 
>> every 10-15 minutes, that will invalidate all the caches, since an optimized 
>> index is essentially a set of new files.
>>
>> Can you give us some examples of the slow queries?  Are you using stop 
>> words?  
>>
>> If your slow queries are phrase queries, then you might try either adding 
>> the most frequent terms in your index to the stopwords list  or try 
>> CommonGrams and add them to the common words list.  (Details on CommonGrams 
>> here: 
>> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)
>>
>> Tom Burton-West
>>
>> -Original Message-
>> From: Peter Karich [mailto:peat...@yahoo.de] 
>> Sent: Tuesday, August 10, 2010

RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Does anyone know if I need to define fields in schema.xml for indexing pdf files? 
If I need to, please tell me how I can do it. 

I defined fields in schema.xml and created a data-configuration file by using 
xpath for xml files. Would you please tell me if I need to do it for pdf files and 
how I can do it?

Thanks so much for your help as always!

-Original Message-
From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com] 
Sent: Thursday, August 12, 2010 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

To help you we need the description of your fields in your schema.xml and
the query that you do when you search only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] 

> I wrote a simple java program to import a pdf file. I can get a result when
> I do search *:* from admin page. I get nothing if I search a word. I wonder
> if I did something wrong or miss set something.
>
> Here is part of result I get when do *:* search:
> *
> - 
> - 
>  Hristovski D
>  
> - 
>  application/pdf
>  
> - 
>  microarray analysis, literature-based discovery, semantic
> predications, natural language processing
>  
> - 
>  Thu Aug 12 10:58:37 EDT 2010
>  
> - 
>  Combining Semantic Relations and DNA Microarray Data for Novel
> Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
> for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
> Kastrin,2...
> *
> Please help me out if anyone has experience with pdf files. I really
> appreciate it!
>
> Thanks so much,
>
>


Re: how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread solr-user

no, once upgraded I wouldn't need to have an older solr read the indexes. 
I misunderstood the note.

thx
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-to-older-1-5-builds-instead-of-to-trunk-tp1113863p1115694.html
Sent from the Solr - User mailing list archive at Nabble.com.


edismax pf2 and ps

2010-08-12 Thread Ron Mayer
Short summary:

   Is there any way I can specify that I want a lot
   of phrase slop for the "pf" parameter, but none
   at all for the "pf2" parameter?

I find the 'pf' parameter with a pretty large 'ps' to do a very
nice job for providing a modest boost to many documents that are
quite well related to many queries in my system.

In contrast, I find the 'pf2' parameter with zero 'ps' does
extremely well at providing a high boost to documents that
are often exactly what someone's searching for.

Is there any way I can get both effects?

Edismax's pf2 parameter is really nice for boosting exact phrases
in queries like 'black jacket red cap white shoes'.   But as soon
as even a little phrase slop (ps) is added, it seems like it starts
boosting documents with red jackets and white caps just as much as
those with black jackets and red caps.

My gut feeling is that if I could have "pf" with a large phrase
slop and the pf2 with zero phrase slop, it'd give me better overall
results than any single phrase slop setting that gets applied to both.

Is there any good way for me to test that?
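
For concreteness, the request I have in mind looks like this in SolrJ 
("description" stands in for my real fields):

import org.apache.solr.client.solrj.SolrQuery;

public class Pf2PsExperiment {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("black jacket red cap white shoes");
    q.set("defType", "edismax");
    q.set("qf",  "description");
    q.set("pf",  "description");  // whole-query phrase boost, wants a large slop
    q.set("pf2", "description");  // word-pair boosts: "black jacket", "jacket red", "red cap", ...
    q.set("ps",  "50");           // as far as I can tell, this slop applies to pf and pf2 alike
    q.set("debugQuery", "true");  // parsedquery shows the slop attached to each phrase clause
    System.out.println(q);
  }
}

Right now the only comparison I can make is one run with ps=0 against one with 
ps=50, which trades the pf behaviour I want against the pf2 behaviour I want.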

  Thanks,
  Ron



Re: how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread Yonik Seeley
On Thu, Aug 12, 2010 at 12:24 PM, solr-user  wrote:
> Thanks Yonik but
> http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/CHANGES.txt
> says that the lucene index has changed

Right - but it will be able to read your older index.
Do you need Solr 1.4 to be able to read the new index once you upgrade?

-Yonik
http://www.lucidimagination.com


Re: Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
hi,

> 1) I assume you are doing batching interspersed with commits

as each file I crawl is article-level, each contains all the
sentences for the article, so they are naturally batched into about
500 documents per post in LCF.

I use auto-commit in Solr:

 50 
 90 
   

> 2) Why do you need sentence level Lucene docs?

that's an application specific need due to linguistic info needed on a
per-sentence
basis.

> 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be
> surprised if you a memory/connection leak their (or it is not
> releasing some resource explicitly)

I thought this could be the case too -- but if I replace the use of my custom
analysers and specify my fields are of type "text" instead (from standard
solrconfig.xml i.e. using solr-based analysers) then I get this kind of hanging
too -- at least it did when I didn't have any explicit GC settings... it does
take longer to replicate as my analysers/field types are more complex than
"text" field type.

i will try it again with the different GC settings tomorrow and post
the results.

> In general, we have NEVER had a problem in loading Solr.

i'm not sure if we would either, if we posted the documents as we created the
index.xml format...
but because we post 500+ documents at a time (one article file per LCF post) and
LCF can post these files quickly, i'm not sure if I need to try and slow down
the post rate!?

thanks for your replies,

bec :)

> On 8/12/10, Rebecca Watson  wrote:
>> sorry -- i used the term "documents" too loosely!
>>
>> 180k scientific articles with between 500-1000 sentences each
>> and we index sentence-level index documents
>> so i'm guessing about 100 million lucene index documents in total.
>>
>> an update on my progress:
>>
>> i used GC settings of:
>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
>>       -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
>> -XX:CMSInitiatingOccupancyFraction=70
>>
>> which allowed the indexing process to run to 11.5k articles and
>> for about 2hours before I got the same kind of hanging/unresponsive Solr
>> with
>> this as the tail of the solr logs:
>>
>> Before GC:
>> Statistics for BinaryTreeDictionary:
>> 
>> Total Free Space: 2416734
>> Max   Chunk Size: 2412032
>> Number of Blocks: 3
>> Av.  Block  Size: 805578
>> Tree      Height: 3
>> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480:
>> [CMS
>>
>> I also saw (in jconsole) that the number of threads rose from the
>> steady 32 used for the
>> 2 hours to 72 before Solr finally became unresponsive...
>>
>> i've got the following GC info params switched on (as many as i could
>> find!):
>> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>>       -XX:+PrintGCApplicationConcurrentTime 
>> -XX:+PrintGCApplicationStoppedTime
>>       -XX:PrintFLSStatistics=1
>>
>> with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875
>> million fairly small
>> docs per hour!! this produced an index of about 40GB to give you an
>> idea of index
>> size...
>>
>> because i've already got the documents in solr native xml format
>> i.e. one file per article each with ...
>> i.e. posting each set of sentence docs per article in every LCF file post...
>> this means that LCF can throw documents at Solr very fast and i think
>> i'm
>> breaking it GC-wise.
>>
>> i'm going to try adding in System.gc() calls to see if this runs ok
>> (albeit slower)...
>> otherwise i'm pretty much at a loss as to what could be causing this GC
>> issue/
>> solr hanging if it's not a GC issue...
>>
>> thanks :)
>>
>> bec
>>
>> On 12 August 2010 21:42, dc tech  wrote:
>>> I am a little confused - how did 180k documents become 100m index
>>> documents?
>>> We use have over 20 indices (for different content sets), one with 5m
>>> documents (about a couple of pages each) and another with 100k+ docs.
>>> We can index the 5m collection in a couple of days (limitation is in
>>> the source) which is 100k documents an hour without breaking a sweat.
>>>
>>>
>>>
>>> On 8/12/10, Rebecca Watson  wrote:
 Hi,

 When indexing large amounts of data I hit a problem whereby Solr
 becomes unresponsive
 and doesn't recover (even when left overnight!). I think i've hit some
 GC problems/tuning
 is required of GC and I wanted to know if anyone has ever hit this
 problem.
 I can replicate this error (albeit taking longer to do so) using
 Solr/Lucene analysers
 only so I thought other people might have hit this issue before over
 large data sets

 Background on my problem follows -- but I guess my main question is --
 can
 Solr
 become so overwhelmed by update posts that it becomes completely
 unresponsive??

 Right now I think the problem is that the java GC is hanging but I've
 been working
 on this all week and it took a while to figure out it might be
 GC-based / w

Re: how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread solr-user

Thanks Yonik but
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/CHANGES.txt
says that the lucene index has changed

"
Upgrading from Solr 1.4
--

* The Lucene index format has changed and as a result, once you upgrade, 
  previous versions of Solr will no longer be able to read your indices.
  In a master/slave configuration, all searchers/slaves should be upgraded
  before the master.  If the master were to be updated first, the older
  searchers would not be able to read the new index format."

not to mention that regression testing is a pain 

Is there any way to get a set of builds with versions prior to 3.x??
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-to-older-1-5-builds-instead-of-to-trunk-tp1113863p1114353.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Doc Lucene Doc !?

2010-08-12 Thread stockii

no help ? =( 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1114172.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread Yonik Seeley
Another option is the 3x branch - that should still be able to read
indexes from Solr 1.4/Lucene 2.9
I personally don't expect a 1.5 release to ever materialize.
There will eventually be a Lucene/Solr 3.1 release off of the 3x
branch, and a Lucene/Solr 4.0 release off of trunk.

-Yonik
http://www.lucidimagination.com

On Thu, Aug 12, 2010 at 11:59 AM, solr-user  wrote:
>
> please excuse this newbie question, but:
>
> I want to upgrade solr to a version but not to the latest version in the
> trunk (because there are so many changes that I would have to test against,
> and modify my custom classes for, and behavior changes, and deal with the
> lucene index change, etc)
>
> My thought was to try to look at versions that are post 903398 2010-01-26
> 20:21:09Z but pre the change in the lucene index.  Eventually picking up the
> version that had the features I wanted but with as few other changes as
> feasible.  I know I could probably apply a bunch of patches but some of the
> patches seem to rely on other patches which rely on other patches which rely
> on ...  It just seems easier to pick the version that has just the
> features/patches I want.
>
> I have no trouble seeing/using the trunk at
> http://svn.apache.org/repos/asf/lucene/dev/trunk/ but it only seems to have
> builds 984777 thru 984832
>
> So where would I find significantly older builds (ie like the one I am
> currently using - 903398)?
>
> I tried using svn on repository
> http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/ but get a
> "Repository moved permanently to
> '/viewc/lucene/solr/branches/branch-1.5-dev/' message.
>
> Any help would be great
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-to-update-solr-to-older-1-5-builds-instead-of-to-trunk-tp1113863p1113863.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Indexing Hanging during GC?

2010-08-12 Thread dc tech
1) I assume you are doing batching interspersed with commits
2) Why do you need sentence level Lucene docs?
3) Are your custom handlers/parsers a part of SOLR jvm? Would not be
surprised if you a memory/connection leak their (or it is not
releasing some resource explicitly)

In general, we have NEVER had a problem in loading Solr.
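
By batching interspersed with commits I mean something along these lines (a
SolrJ sketch with made-up batch and commit sizes):

import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedLoad {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 100000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", i);
      doc.addField("text", "sentence " + i);
      batch.add(doc);
      if (batch.size() == 500) {       // post in batches of a few hundred docs
        solr.add(batch);
        batch.clear();
      }
      if (i > 0 && i % 50000 == 0) {
        solr.commit();                 // commit every so often, not per batch
      }
    }
    if (!batch.isEmpty()) solr.add(batch);
    solr.commit();
  }
}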

On 8/12/10, Rebecca Watson  wrote:
> sorry -- i used the term "documents" too loosely!
>
> 180k scientific articles with between 500-1000 sentences each
> and we index sentence-level index documents
> so i'm guessing about 100 million lucene index documents in total.
>
> an update on my progress:
>
> i used GC settings of:
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
>   -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
> -XX:CMSInitiatingOccupancyFraction=70
>
> which allowed the indexing process to run to 11.5k articles and
> for about 2hours before I got the same kind of hanging/unresponsive Solr
> with
> this as the tail of the solr logs:
>
> Before GC:
> Statistics for BinaryTreeDictionary:
> 
> Total Free Space: 2416734
> Max   Chunk Size: 2412032
> Number of Blocks: 3
> Av.  Block  Size: 805578
> Tree  Height: 3
> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480:
> [CMS
>
> I also saw (in jconsole) that the number of threads rose from the
> steady 32 used for the
> 2 hours to 72 before Solr finally became unresponsive...
>
> i've got the following GC info params switched on (as many as i could
> find!):
> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>   -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
>   -XX:PrintFLSStatistics=1
>
> with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875
> million fairly small
> docs per hour!! this produced an index of about 40GB to give you an
> idea of index
> size...
>
> because i've already got the documents in solr native xml format
> i.e. one file per article each with ...
> i.e. posting each set of sentence docs per article in every LCF file post...
> this means that LCF can throw documents at Solr very fast and i think
> i'm
> breaking it GC-wise.
>
> i'm going to try adding in System.gc() calls to see if this runs ok
> (albeit slower)...
> otherwise i'm pretty much at a loss as to what could be causing this GC
> issue/
> solr hanging if it's not a GC issue...
>
> thanks :)
>
> bec
>
> On 12 August 2010 21:42, dc tech  wrote:
>> I am a little confused - how did 180k documents become 100m index
>> documents?
>> We use have over 20 indices (for different content sets), one with 5m
>> documents (about a couple of pages each) and another with 100k+ docs.
>> We can index the 5m collection in a couple of days (limitation is in
>> the source) which is 100k documents an hour without breaking a sweat.
>>
>>
>>
>> On 8/12/10, Rebecca Watson  wrote:
>>> Hi,
>>>
>>> When indexing large amounts of data I hit a problem whereby Solr
>>> becomes unresponsive
>>> and doesn't recover (even when left overnight!). I think i've hit some
>>> GC problems/tuning
>>> is required of GC and I wanted to know if anyone has ever hit this
>>> problem.
>>> I can replicate this error (albeit taking longer to do so) using
>>> Solr/Lucene analysers
>>> only so I thought other people might have hit this issue before over
>>> large data sets
>>>
>>> Background on my problem follows -- but I guess my main question is --
>>> can
>>> Solr
>>> become so overwhelmed by update posts that it becomes completely
>>> unresponsive??
>>>
>>> Right now I think the problem is that the java GC is hanging but I've
>>> been working
>>> on this all week and it took a while to figure out it might be
>>> GC-based / wasn't a
>>> direct result of my custom analysers so i'd appreciate any advice anyone
>>> has
>>> about indexing large document collections.
>>>
>>> I also have a second questions for those in the know -- do we have a
>>> chance
>>> of indexing/searching over our large dataset with what little hardware
>>> we already
>>> have available??
>>>
>>> thanks in advance :)
>>>
>>> bec
>>>
>>> a bit of background: ---
>>>
>>> I've got a large collection of articles we want to index/search over
>>> -- about 180k
>>> in total. Each article has say 500-1000 sentences and each sentence has
>>> about
>>> 15 fields, many of which are multi-valued and we store most fields as
>>> well
>>> for
>>> display/highlighting purposes. So I'd guess over 100 million index
>>> documents.
>>>
>>> In our small test collection of 700 articles this results in a single
>>> index
>>> of
>>> about 13GB.
>>>
>>> Our pipeline processes PDF files through to Solr native xml which we call
>>> "index.xml" files i.e. in ... format ready to post straight to
>>> Solr's
>>> update handler.
>>>
>>> We create the index.xml files as we pull in information from
>>> a few sources and creation of these files from their or

how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread solr-user

please excuse this newbie question, but:  

I want to upgrade Solr to a newer version, but not to the latest version in the
trunk (because there are so many changes that I would have to test against,
and modify my custom classes for, and behavior changes, and deal with the
lucene index change, etc)

My thought was to try to look at versions that are post 903398 2010-01-26
20:21:09Z but pre the change in the lucene index.  Eventually picking up the
version that had the features I wanted but with as few other changes as
feasible.  I know I could probably apply a bunch of patches but some of the
patches seem to rely on other patches which rely on other patches which rely
on ...  It just seems easier to pick the version that has just the
features/patches I want.

I have no trouble seeing/using the trunk at
http://svn.apache.org/repos/asf/lucene/dev/trunk/ but it only seems to have
builds 984777 thru 984832

So where would I find significantly older builds (ie like the one I am
currently using - 903398)?

I tried using svn on repository
http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/ but get a
"Repository moved permanently to
'/viewc/lucene/solr/branches/branch-1.5-dev/' message.

Any help would be great

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-to-older-1-5-builds-instead-of-to-trunk-tp1113863p1113863.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much. I didn't know how to make any changes in schema.xml for pdf 
files. I used solr default schema.xml. Please tell me what I need do in 
schema.xml.

The simple java program I use is following. I also attached that pdf file. I 
really appreciate your help!
*
import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class importPDF {
  public static void main(String[] args) {
    try {
      String fileName = "pub2009001.pdf";
      String solrId = "pub2009001.pdf";

      indexFilesSolrCell(fileName, solrId);

    } catch (Exception ex) {
      System.out.println(ex.toString());
    }
  }

  // Streams the PDF to the ExtractingRequestHandler (Solr Cell) and commits.
  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {
    String urlString = "http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf";
    SolrServer solr = new CommonsHttpSolrServer(urlString);

    ContentStreamUpdateRequest up
      = new ContentStreamUpdateRequest("/update/extract");

    up.addFile(new File(fileName));

    up.setParam("literal.id", solrId);
    up.setParam("uprefix", "attr_");             // unknown fields get the attr_ prefix
    up.setParam("fmap.content", "attr_content"); // extracted text goes to attr_content

    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solr.request(up);
  }
}


-Original Message-
From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com] 
Sent: Thursday, August 12, 2010 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

To help you we need the description of your fields in your schema.xml and
the query that you do when you search only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] 

> I wrote a simple java program to import a pdf file. I can get a result when
> I do search *:* from admin page. I get nothing if I search a word. I wonder
> if I did something wrong or miss set something.
>
> Here is part of result I get when do *:* search:
> *
> - 
> - 
>  Hristovski D
>  
> - 
>  application/pdf
>  
> - 
>  microarray analysis, literature-based discovery, semantic
> predications, natural language processing
>  
> - 
>  Thu Aug 12 10:58:37 EDT 2010
>  
> - 
>  Combining Semantic Relations and DNA Microarray Data for Novel
> Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
> for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
> Kastrin,2...
> *
> Please help me out if anyone has experience with pdf files. I really
> appreciate it!
>
> Thanks so much,
>
>


Re: index pdf files

2010-08-12 Thread Marco Martinez
To help you we need the description of your fields in your schema.xml and
the query that you do when you search only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] 

> I wrote a simple java program to import a pdf file. I can get a result when
> I do search *:* from admin page. I get nothing if I search a word. I wonder
> if I did something wrong or miss set something.
>
> Here is part of result I get when do *:* search:
> *
> - 
> - 
>  Hristovski D
>  
> - 
>  application/pdf
>  
> - 
>  microarray analysis, literature-based discovery, semantic
> predications, natural language processing
>  
> - 
>  Thu Aug 12 10:58:37 EDT 2010
>  
> - 
>  Combining Semantic Relations and DNA Microarray Data for Novel
> Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
> for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
> Kastrin,2...
> *
> Please help me out if anyone has experience with pdf files. I really
> appreciate it!
>
> Thanks so much,
>
>


index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
I wrote a simple java program to import a pdf file. I can get a result when I 
do a *:* search from the admin page. I get nothing if I search a word. I wonder if I 
did something wrong or missed setting something. 

Here is part of result I get when do *:* search:
*
- 
- 
  Hristovski D 
  
- 
  application/pdf 
  
- 
  microarray analysis, literature-based discovery, semantic predications, 
natural language processing 
  
- 
  Thu Aug 12 10:58:37 EDT 2010 
  
- 
  Combining Semantic Relations and DNA Microarray Data for Novel 
Hypotheses Generation Combining Semantic Relations and DNA Microarray Data for 
Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej 
Kastrin,2...
*
Please help me out if anyone has experience with pdf files. I really appreciate 
it!

Thanks so much,



Deleting with the DIH sometimes doesn't delete

2010-08-12 Thread Qwerky

I'm doing deletes with the DIH but getting mixed results. Sometimes the
documents get deleted, other times I can still find them in the index. What
would prevent a doc from getting deleted?
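
(For context: deletes in a DIH delta-import are driven by a deletedPkQuery on the
entity, roughly as sketched below. The table and column names here are invented
for illustration, not taken from the poster's data-config.xml.)

<entity name="item" pk="id"
        query="select id, name from item"
        deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'"
        deletedPkQuery="select id from item_deleted where deleted_at &gt; '${dataimporter.last_index_time}'">
  <field column="id" name="id"/>
  <field column="name" name="name"/>
</entity>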

For example, I delete 594039 and get this in the logs;

2010-08-12 14:41:55,625 [Thread-210] INFO  [DataImporter] Starting Delta
Import
2010-08-12 14:41:55,625 [Thread-210] INFO  [SolrWriter] Read
productimportupdate.properties
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Starting delta
collection.
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Running
ModifiedRowKey() for Entity: item
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
ModifiedRowKey for Entity: item rows obtained : 0
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
DeletedRowKey for Entity: item rows obtained : 1
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
parentDeltaQuery for Entity: item
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Deleting stale
documents 
2010-08-12 14:41:55,625 [Thread-210] INFO  [SolrWriter] Deleting document:
594039
2010-08-12 14:41:55,703 [Thread-210] INFO  [SolrDeletionPolicy] newest
commit = 1281030128383
2010-08-12 14:41:55,718 [Thread-210] DEBUG [SolrIndexWriter] Opened Writer
DirectUpdateHandler2
2010-08-12 14:41:55,718 [Thread-210] INFO  [DocBuilder] Delta Import
completed successfully
2010-08-12 14:41:55,718 [Thread-210] INFO  [DocBuilder] Import completed
successfully
2010-08-12 14:41:55,718 [Thread-210] INFO  [DirectUpdateHandler2] start
commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
2010-08-12 14:42:08,562 [Thread-210] DEBUG [SolrIndexWriter] Closing Writer
DirectUpdateHandler2
2010-08-12 14:42:10,437 [Thread-210] INFO  [SolrDeletionPolicy]
SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_8,version=1281030128383,generation=8,filenames=[_39.frq,
_2i.fdx, _39.tis, _39.prx, _39.fnm, _2i.fdt, _39.tii, _39.nrm, segments_8]

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
_3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
2010-08-12 14:42:10,437 [Thread-210] INFO  [SolrDeletionPolicy] newest
commit = 1281030128384

..this works fine; I can no longer find 594039 in the index. But a little
later I delete a couple more (33252 and 105224) and get the following (I
added two docs at the same time);

2010-08-12 15:27:42,828 [Thread-217] INFO  [DataImporter] Starting Delta
Import
2010-08-12 15:27:42,828 [Thread-217] INFO  [SolrWriter] Read
productimportupdate.properties
2010-08-12 15:27:42,828 [Thread-217] INFO  [DocBuilder] Starting delta
collection.
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Running
ModifiedRowKey() for Entity: item
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
ModifiedRowKey for Entity: item rows obtained : 2
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
DeletedRowKey for Entity: item rows obtained : 2
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
parentDeltaQuery for Entity: item
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Deleting stale
documents 
2010-08-12 15:27:42,843 [Thread-217] INFO  [SolrWriter] Deleting document:
33252
2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrDeletionPolicy]
SolrDeletionPolicy.onInit: commits:num=1

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
_3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrDeletionPolicy] newest
commit = 1281030128384
2010-08-12 15:27:42,906 [Thread-217] DEBUG [SolrIndexWriter] Opened Writer
DirectUpdateHandler2
2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrWriter] Deleting document:
105224
2010-08-12 15:27:42,906 [Thread-217] INFO  [DocBuilder] Delta Import
completed successfully
2010-08-12 15:27:42,906 [Thread-217] INFO  [DocBuilder] Import completed
successfully
2010-08-12 15:27:42,906 [Thread-217] INFO  [DirectUpdateHandler2] start
commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
2010-08-12 15:27:55,578 [Thread-217] DEBUG [SolrIndexWriter] Closing Writer
DirectUpdateHandler2
2010-08-12 15:27:56,875 [Thread-217] INFO  [SolrDeletionPolicy]
SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
_3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_a,version=1281030128385,generation=10,filenames=[_3c.tis,
_3c.fdt, _3c.fnm, _3c.nrm, _3c.tii, segments_a, _3c.fdx, _3c.prx, _3c.frq]
2010-08-12 15:27:56,875 [Thread-217] INFO  [SolrDeletionPolicy] newest
commit = 1281030128385
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Deleting-with-the-DIH-sometimes-doesn-t-delete-tp1113098p1113098.

Re: Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
sorry -- i used the term "documents" too loosely!

180k scientific articles with between 500 and 1000 sentences each,
and we index sentence-level index documents,
so I'm guessing about 100 million Lucene index documents in total.

an update on my progress:

i used GC settings of:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
-XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
-XX:CMSInitiatingOccupancyFraction=70

which allowed the indexing process to run to 11.5k articles and
for about 2 hours before I got the same kind of hanging/unresponsive Solr, with
this as the tail of the Solr logs:

Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 2416734
Max   Chunk Size: 2412032
Number of Blocks: 3
Av.  Block  Size: 805578
Tree  Height: 3
5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480: [CMS

I also saw (in jconsole) that the number of threads rose from the
steady 32 used for the
2 hours to 72 before Solr finally became unresponsive...

i've got the following GC info params switched on (as many as i could find!):
-XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
-XX:PrintFLSStatistics=1

with 11.5k articles in about 2 hours this was 11.5k * 500 / 2 = 2.875
million fairly small docs per hour!! This produced an index of about 40GB
to give you an idea of index size...

because i've already got the documents in solr native xml format
i.e. one file per article each with ...
i.e. posting each set of sentence docs per article in every LCF file post...
this means that LCF can throw documents at Solr very fast and i think i'm
breaking it GC-wise.

i'm going to try adding in System.gc() calls to see if this runs ok
(albeit slower)...
otherwise i'm pretty much at a loss as to what could be causing this GC issue/
solr hanging if it's not a GC issue...

thanks :)

bec

On 12 August 2010 21:42, dc tech  wrote:
> I am a little confused - how did 180k documents become 100m index documents?
> We have over 20 indices (for different content sets), one with 5m
> documents (about a couple of pages each) and another with 100k+ docs.
> We can index the 5m collection in a couple of days (limitation is in
> the source) which is 100k documents an hour without breaking a sweat.
>
>
>
> On 8/12/10, Rebecca Watson  wrote:
>> Hi,
>>
>> When indexing large amounts of data I hit a problem whereby Solr
>> becomes unresponsive
>> and doesn't recover (even when left overnight!). I think i've hit some
>> GC problems/tuning
>> is required of GC and I wanted to know if anyone has ever hit this problem.
>> I can replicate this error (albeit taking longer to do so) using
>> Solr/Lucene analysers
>> only so I thought other people might have hit this issue before over
>> large data sets
>>
>> Background on my problem follows -- but I guess my main question is -- can
>> Solr
>> become so overwhelmed by update posts that it becomes completely
>> unresponsive??
>>
>> Right now I think the problem is that the java GC is hanging but I've
>> been working
>> on this all week and it took a while to figure out it might be
>> GC-based / wasn't a
>> direct result of my custom analysers so i'd appreciate any advice anyone has
>> about indexing large document collections.
>>
>> I also have a second questions for those in the know -- do we have a chance
>> of indexing/searching over our large dataset with what little hardware
>> we already
>> have available??
>>
>> thanks in advance :)
>>
>> bec
>>
>> a bit of background: ---
>>
>> I've got a large collection of articles we want to index/search over
>> -- about 180k
>> in total. Each article has say 500-1000 sentences and each sentence has
>> about
>> 15 fields, many of which are multi-valued and we store most fields as well
>> for
>> display/highlighting purposes. So I'd guess over 100 million index
>> documents.
>>
>> In our small test collection of 700 articles this results in a single index
>> of
>> about 13GB.
>>
>> Our pipeline processes PDF files through to Solr native xml which we call
>> "index.xml" files i.e. in ... format ready to post straight to
>> Solr's
>> update handler.
>>
>> We create the index.xml files as we pull in information from
>> a few sources and creation of these files from their original PDF form is
>> farmed out across a grid and is quite time-consuming so we distribute this
>> process rather than creating index.xml files on the fly...
>>
>> We do a lot of linguistic processing and to enable search functionality
>> of our resulting terms requires analysers that split terms/ join terms
>> together
>> i.e. custom analysers that perform string operations and are quite
>> time-consuming/
>> have large overhead compared to most analysers (they take approx
>> 20-30% more time
>> and use twice as many short-lived objects than the "text" field 

Re: Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Gora Mohanty
On Thu, 12 Aug 2010 14:32:19 +0200
Lannig Carina  wrote:

> Hi,
> 
> I'm trying to index a txt-File (~150MB) using Solr Cell/Tika.
> The curl command aborts due to a java.lang.OutOfMemoryError.
[...]
> AFAIK Tika keeps the whole file in RAM and posts it as one single
> string to Solr. I'm using JVM-args: Xmx1024M and solr default
> config with
[...]

Do not know about Tika, but what is the size of your Solr index,
and the number of documents in it? Solr seems to need RAM, and
while we did not do real benchmarks then, even with a few tens of
thousands of documents, performance seemed to improve by allocating
2GB RAM. Besides, unless you are on a very tight budget, throwing a
few GB more RAM at the problem seems to be an easy, and not
very expensive, way out.

Regards,
Gora


Re: Solr branches

2010-08-12 Thread Tomasz Wegrzanowski
On 12 August 2010 13:46, Koji Sekiguchi  wrote:
> (10/08/12 21:06), Tomasz Wegrzanowski wrote:
>>
>> Hi,
>>
>> I'm having oome problems with solr. From random browsing
>> I'm getting an impression that a lot of memory fixes happened
>> recently in solr and lucene.
>>
>> Could you give me a quick summary how (un)stable are different
>> lucene / solr branches and how much improvement I can expect?
>
> Lucene/Solr have CHANGES.txt. You can refer to it to see
> how much Lucene/Solr get improved from previous release.

This is technically true, but I'm not sufficiently familiar with
solr/lucene development process to infer much about performance
and stability of different branches from it.


Re: Indexing Hanging during GC?

2010-08-12 Thread dc tech
I am a little confused - how did 180k documents become 100m index documents?
We have over 20 indices (for different content sets), one with 5m
documents (about a couple of pages each) and another with 100k+ docs.
We can index the 5m collection in a couple of days (limitation is in
the source) which is 100k documents an hour without breaking a sweat.



On 8/12/10, Rebecca Watson  wrote:
> Hi,
>
> When indexing large amounts of data I hit a problem whereby Solr
> becomes unresponsive
> and doesn't recover (even when left overnight!). I think i've hit some
> GC problems/tuning
> is required of GC and I wanted to know if anyone has ever hit this problem.
> I can replicate this error (albeit taking longer to do so) using
> Solr/Lucene analysers
> only so I thought other people might have hit this issue before over
> large data sets
>
> Background on my problem follows -- but I guess my main question is -- can
> Solr
> become so overwhelmed by update posts that it becomes completely
> unresponsive??
>
> Right now I think the problem is that the java GC is hanging but I've
> been working
> on this all week and it took a while to figure out it might be
> GC-based / wasn't a
> direct result of my custom analysers so i'd appreciate any advice anyone has
> about indexing large document collections.
>
> I also have a second questions for those in the know -- do we have a chance
> of indexing/searching over our large dataset with what little hardware
> we already
> have available??
>
> thanks in advance :)
>
> bec
>
> a bit of background: ---
>
> I've got a large collection of articles we want to index/search over
> -- about 180k
> in total. Each article has say 500-1000 sentences and each sentence has
> about
> 15 fields, many of which are multi-valued and we store most fields as well
> for
> display/highlighting purposes. So I'd guess over 100 million index
> documents.
>
> In our small test collection of 700 articles this results in a single index
> of
> about 13GB.
>
> Our pipeline processes PDF files through to Solr native xml which we call
> "index.xml" files i.e. in ... format ready to post straight to
> Solr's
> update handler.
>
> We create the index.xml files as we pull in information from
> a few sources and creation of these files from their original PDF form is
> farmed out across a grid and is quite time-consuming so we distribute this
> process rather than creating index.xml files on the fly...
>
> We do a lot of linguistic processing and to enable search functionality
> of our resulting terms requires analysers that split terms/ join terms
> together
> i.e. custom analysers that perform string operations and are quite
> time-consuming/
> have large overhead compared to most analysers (they take approx
> 20-30% more time
> and use twice as many short-lived objects than the "text" field type).
>
> Right now i'm working on my new Imac:
> quad-core 2.8 GHz intel Core i7
> 16 GB 1067 MHz DDR3 RAM
> 2TB hard-drive (about half free)
> Version 10.6.4 OSX
>
> Production environment:
> 2 linux boxes each with:
> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
> 16GB RAM
>
> I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
> right now).
>
> I setup Solr to use autocommit as we'll have several document collections /
> post
> to Solr from different data sets:
>
>  
> 
>   50 
>   90 
> 
>
> I also have
>   false
> 1024
> 10
> -
>
> *** First question:
> Has anyone else found that Solr hangs/becomes unresponsive after too
> many documents are indexed at once i.e. Solr can't keep up with the post
> rate?
>
> I've got LCF crawling my local test set (file system connection
> required only) and
> posting documents to Solr using 6GB of RAM. As I said above, these documents
> are in native Solr XML format () with one file per article so
> each
>  contains all the sentence-level documents for the article.
>
> With LCF I post about 2.5/3k articles (files) per hour -- so about
> 2.5k*500 /3600 =
> 350 docs per second post-rate -- is this normal/expected??
>
> Eventually, after about 3000 files (an hour or so) Solr starts to
> hang/becomes
> unresponsive and with Jconsole/GC logging I can see that the Old-Gen space
> is
> about 90% full and the following is the end of the solr log file-- where you
> can see GC has been called:
> --
> 3012.290: [GC Before GC:
> Statistics for BinaryTreeDictionary:
> 
> Total Free Space: 53349392
> Max   Chunk Size: 3200168
> Number of Blocks: 66
> Av.  Block  Size: 808324
> Tree  Height: 13
> Before GC:
> Statistics for BinaryTreeDictionary:
> 
> Total Free Space: 0
> Max   Chunk Size: 0
> Number of Blocks: 0
> Tree  Height: 0
> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
> 0.0769802 secs]3012.367: [CMS
> 

Re: Schema Definition Question

2010-08-12 Thread kenf_nc

One way I've handled this, and it works only for some types of data,
is to put the searchable part of the sub-doc in a search field
(indexed=true) and put an XML or JSON representation of the sub-doc in a
stored-only field. Then if the main doc is hit via search I can grab the XML
or JSON, convert it to an object graph and do whatever I want.

If you need to search on a variety of elements in the sub-doc this becomes
less useful an approach. But in some use-cases it worked for me.
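
A rough sketch of the two fields in schema.xml (the field names are made up for
illustration):

<!-- searchable part of the sub-document -->
<field name="subdoc_text" type="text" indexed="true" stored="false" multiValued="true"/>
<!-- raw xml/json blob of the sub-document, stored only for retrieval -->
<field name="subdoc_blob" type="string" indexed="false" stored="true" multiValued="true"/>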
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Schema-Definition-Question-tp1049966p1110159.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr branches

2010-08-12 Thread Koji Sekiguchi

(10/08/12 21:06), Tomasz Wegrzanowski wrote:

Hi,

I'm having oome problems with solr. From random browsing
I'm getting an impression that a lot of memory fixes happened
recently in solr and lucene.

Could you give me a quick summary how (un)stable are different
lucene / solr branches and how much improvement I can expect?

   

Lucene/Solr have CHANGES.txt. You can refer to it to see
how much Lucene/Solr get improved from previous release.

Koji

--
http://www.rondhuit.com/en/



Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Lannig Carina
Hi,

I'm trying to index a txt-File (~150MB) using Solr Cell/Tika.
The curl command aborts due to a java.lang.OutOfMemoryError.
*
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:215)
at java.lang.StringBuilder.toString(StringBuilder.java:430)
at org.apache.solr.handler.extraction.SolrContentHandler.newDocument(SolrContentHandler.java:124)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:119)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:125)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:237)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:619)
) that prevented it from fulfilling this request. Apache Tomcat/6.0.26
*

AFAIK Tika keeps the whole file in RAM and posts it as one single string to 
Solr.
I'm using JVM-args: Xmx1024M and solr default config with
*
  

false
32
10
...
  

  


   ...
*
Is there a chance to force Solr/Tika to flush the memory during indexing a file?
Increasing RAM in proportion to the size of the largest file to index does not
seem very nice.
Did I miss some configuration option or do I have to modify Java code? I just 
found http://osdir.com/ml/tika-dev.lucene.apache.org/2009-02/msg00020.html and 
I'm wondering if there is a solution yet.

Carina

indexing???

2010-08-12 Thread satya swaroop
Hi all,
   The indexing part of Solr is going well, but I got an error on indexing
a single PDF file. When I searched for the error in the mailing list, I found
that the error was due to the copyright on that file. Can't we index a file
which has copyright or any digital rights???

regards,
  satya


Re: Analysing SOLR logfiles

2010-08-12 Thread Peter Karich

I also wonder why there isn't a special tool which analyzes Solr
logfiles (e.g. parses QTime and the parameters q, fq, ...).

Because there are some other open source log analyzers out there:
http://yaala.org/ http://www.mrunix.net/webalizer/

Another free tool is newrelic.com (you will submit your query data to
this site, same as for google analytics). Setup is easy.

For traffic on our site which triggers the solr search we use piwik and
common queries can be extracted easily. Setup was done in 5 minutes.

Regards,
Peter.

> we've just started using awstats - as suggested by the solr 1.4 book.
>
> its open source!:
> http://awstats.sourceforge.net/
>
> On 12 August 2010 18:18, Jay Flattery  wrote:
>   
>> Thanks - splunk looks overkill.
>> We're extremely small scale - were hoping for something open source :-)
>>
>>
>> - Original Message 
>> From: Jan Høydahl / Cominvent 
>> To: solr-user@lucene.apache.org
>> Sent: Wed, August 11, 2010 11:14:37 PM
>> Subject: Re: Analysing SOLR logfiles
>>
>> Have a look at www.splunk.com
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>>
>> On 11. aug. 2010, at 19.34, Jay Flattery wrote:
>>
>> 
>>> Hi there,
>>>
>>>
>>> Just wondering what tools people use to analyse SOLR log files.
>>>
>>> We're looking to do things like extracting common queries, calculating
>>> averaging
>>>
>>>
>>> Qtime and hits, returning particularly slow/expensive queries, etc.
>>>
>>> Would prefer not to code something (completely) from scratch.
>>>
>>> Thanks!
>>>
>>>
>>>   



Solr branches

2010-08-12 Thread Tomasz Wegrzanowski
Hi,

I'm having oome problems with solr. From random browsing
I'm getting an impression that a lot of memory fixes happened
recently in solr and lucene.

Could you give me a quick summary how (un)stable are different
lucene / solr branches and how much improvement I can expect?


Re: Multiple Facet Dates

2010-08-12 Thread Raphaël Droz

On 05/08/2010 09:59, Raphaël Droz wrote:

Hi,
I saw this post : 
http://lucene.472066.n3.nabble.com/Multiple-Facet-Dates-td495480.html
I didn't see work in progress or plans about this feature on the list 
and bugtracker.


Has someone already created a patch, PoC, ... that I wouldn't have been
able to find?
From my naïve point of view the "usefulness" / "added code complexity"
ratio appears high.


My use-case is to provide, in one request :
- the results count for each one of several years (tag-based exclusion)
- the results count for each month of a given year
- the results count for each day of a given month and year)

I'm pretty sure someone here has already encountered the above, hasn't anyone?

After having understood :
"This parameter can be specified on a per field basis."
I created 3 more copy-fields, it's then obvious :

// the real constraint requested
fq={!tag=datefq}date
f.date.facet.date.start=2008-12-08T06:00:00Z
f.date.facet.date.end=2008-12-09T06:00:00Z
f.date.facet.date.gap=+1DAY

// three more field for the total
facet.date={!ex%3Ddatefq}date_for_year
facet.date={!ex%3Ddatefq}date_for_year_month
facet.date={!ex%3Ddatefq}date_for_year_month_day

// the count for all year without the constraint
f.date_for_year.facet.date.start=1970-01-01T06:00:00Z
f.date_for_year.facet.date.end=2011-01-01T06:00:00Z
f.date_for_year.facet.date.gap=+1YEAR

// the count for all month of the year requested (2008) without the 
constraint

f.date_for_year_month.facet.date.start=2008-01-01T06:00:00Z
f.date_for_year_month.facet.date.end=2008-12-31T06:00:00Z
f.date_for_year_month.facet.date.gap=+1MONTH

// idem for the days...

Thanks for your work !

Raph


Re: Improve Query Time For Large Index

2010-08-12 Thread Robert Muir
exactly!

On Thu, Aug 12, 2010 at 5:26 AM, Peter Karich  wrote:

> Hi Robert!
>
> >  Since the example given was "http" being slow, its worth mentioning that
> if
> > queries are "one word" urls [for example http://lucene.apache.org] these
> > will actually form slow phrase queries by default.
> >
>
> do you mean that http://lucene.apache.org will be split up into "http
> lucene apache org" and solr will perform a phrase query?
>
> Regards,
> Peter.
>



-- 
Robert Muir
rcm...@gmail.com


Re: Analysing SOLR logfiles

2010-08-12 Thread Rebecca Watson
we've just started using awstats - as suggested by the solr 1.4 book.

its open source!:
http://awstats.sourceforge.net/

On 12 August 2010 18:18, Jay Flattery  wrote:
> Thanks - splunk looks overkill.
> We're extremely small scale - were hoping for something open source :-)
>
>
> - Original Message 
> From: Jan Høydahl / Cominvent 
> To: solr-user@lucene.apache.org
> Sent: Wed, August 11, 2010 11:14:37 PM
> Subject: Re: Analysing SOLR logfiles
>
> Have a look at www.splunk.com
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 11. aug. 2010, at 19.34, Jay Flattery wrote:
>
>> Hi there,
>>
>>
>> Just wondering what tools people use to analyse SOLR log files.
>>
>> We're looking to do things like extracting common queries, calculating
>>averaging
>>
>>
>> Qtime and hits, returning particularly slow/expensive queries, etc.
>>
>> Would prefer not to code something (completely) from scratch.
>>
>> Thanks!
>>
>>
>>
>>
>
>
>
>
>


Re: Improve Query Time For Large Index

2010-08-12 Thread Peter Karich
Hi Tom!

> Hi Peter,
>
> Can you give a few more examples of slow queries?  
> Are they phrase queries? Boolean queries? prefix or wildcard queries?
>   

I am experimenting with one word queries only at the moment.

> If one word queries are your slow queries, than CommonGrams won't help.  
> CommonGrams will only help with phrase queries.
>   

hmmh, ok.

> How are you using termvectors? 
yes.

> That may be slowing things down.  I don't have experience with termvectors, 
> so someone else on the list might speak to that.
>   

ok. But for highlighting I'll need them to speed things up (a lot).


> When you say the query time for common terms stays slow, do you mean if you 
> re-issue the exact query, the second query is not faster?  That seems very 
> strange. 

Yes. Indeed. The queryResultCache has no hits at all. Strange.

>  You might restart Solr, and send a first query (the first query always takes 
> a relatively long time.)  Then pick one of your slow queries and send it 2 
> times.  The second time you send the query it should be much faster due to 
> the Solr caches and you should be able to see the cache hit in the Solr admin 
> panel.  If you send the exact query a second time (without enough intervening 
> queries to evict data from the cache, ) the Solr queryResultCache should get 
> hit and you should see a response time in the .01-5 millisecond range.
>   

That's not the case. The second query is only a few milliseconds
faster (but stays >2s). But I'm not sure what I am doing wrong. The
other 3 caches have a good hitratio but queryResultCache has 0. For
queryResultCache I am using:


But even if I double that it didn't make the hitratio > 0
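
(For reference, a queryResultCache entry in solrconfig.xml normally has this
shape; the sizes below are placeholders only:)

<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>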

> How much memory is on the machine?  If your bottleneck is disk i/o for 
> frequent terms, then you want to make sure you have enough memory for the OS 
> disk cache.  
>   

Yes, there should be enough memory for the OS-disc-cache.

> I assume that http is not in your stopwords.

exactly.


> CommonGrams will only help with phrase queries. CommonGrams was committed and 
> is in Solr 1.4.  If you decide to use CommonGrams you definitely need to 
> re-index and you also need to use both the index time filter and the query 
> time filter.  Your index will be larger.
>
> 
> 
> 
> 
>
> 
> 
> 
> 
>   

Thanks, I will try that, if I can solve the current issue :-)
And thanks for all your answers, I will try to experiment with my setup
in more detail now ...

Kind regards,
Peter.



> Subject: Re: Improve Query Time For Large Index
>
> Hi Tom,
>
> my index is around 3GB large and I am using 2GB RAM for the JVM although
> a some more is available.
> If I am looking into the RAM usage while a slow query runs (via
> jvisualvm) I see that only 750MB of the JVM RAM is used.
>
>   
>> Can you give us some examples of the slow queries?
>> 
> for example the empty query solr/select?q=
> takes very long or solr/select?q=http
> where 'http' is the most common term
>
>   
>> Are you using stop words?  
>> 
> yes, a lot. I stored them into stopwords.txt
>
>   
>> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
>> 
> this looks interesting. I read through
> https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
> I only need to enable it via:
>
>  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"/>
>
> right? Do I need to reindex?
>
> Regards,
> Peter.
>
>   
>> Hi Peter,
>>
>> A few more details about your setup would help list members to answer your 
>> questions.
>> How large is your index?  
>> How much memory is on the machine and how much is allocated to the JVM?
>> Besides the Solr caches, Solr and Lucene depend on the operating system's 
>> disk caching for caching of postings lists.  So you need to leave some 
>> memory for the OS.  On the other hand if you are optimizing and refreshing 
>> every 10-15 minutes, that will invalidate all the caches, since an optimized 
>> index is essentially a set of new files.
>>
>> Can you give us some examples of the slow queries?  Are you using stop 
>> words?  
>>
>> If your slow queries are phrase queries, then you might try either adding 
>> the most frequent terms in your index to the stopwords list  or try 
>> CommonGrams and add them to the common words list.  (Details on CommonGrams 
>> here: 
>> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)
>>
>> Tom Burton-West
>>
>> -Original Message-
>> From: Peter Karich [mailto:peat...@yahoo.de] 
>> Sent: Tuesday, August 10, 2010 9:54 AM
>> To: solr-user@lucene.apache.org
>> Subject: Improve Query Time For Large Index
>>
>> Hi,
>>
>> I have 5 Million small documents/tweets (=> ~3GB) and the slave index
>> replicates itself from master every 10-15 minutes, so the index is
>> optimized before querying. We are using solr 1.4.1 (patched with
>> SOLR-1624) via SolrJ.
>>
>> Now the search speed is slow >2s for common terms which hits more than 2
>> mio docs and acceptable for others: <0.5

Re: Improve Query Time For Large Index

2010-08-12 Thread Peter Karich
Hi Tom,

I tried again with:
  

and even now the hitratio is still 0. What could be wrong with my setup?

('free -m' shows that the cache has over 2 GB free.)

Regards,
Peter.

> Hi Peter,
>
> Can you give a few more examples of slow queries?  
> Are they phrase queries? Boolean queries? prefix or wildcard queries?
> If one word queries are your slow queries, than CommonGrams won't help.  
> CommonGrams will only help with phrase queries.
>
> How are you using termvectors?  That may be slowing things down.  I don't 
> have experience with termvectors, so someone else on the list might speak to 
> that.
>
> When you say the query time for common terms stays slow, do you mean if you 
> re-issue the exact query, the second query is not faster?  That seems very 
> strange.  You might restart Solr, and send a first query (the first query 
> always takes a relatively long time.)  Then pick one of your slow queries and 
> send it 2 times.  The second time you send the query it should be much faster 
> due to the Solr caches and you should be able to see the cache hit in the 
> Solr admin panel.  If you send the exact query a second time (without enough 
> intervening queries to evict data from the cache, ) the Solr queryResultCache 
> should get hit and you should see a response time in the .01-5 millisecond 
> range.
>
> What settings are you using for your Solr caches?
>
> How much memory is on the machine?  If your bottleneck is disk i/o for 
> frequent terms, then you want to make sure you have enough memory for the OS 
> disk cache.  
>
> I assume that http is not in your stopwords.  CommonGrams will only help with 
> phrase queries
> CommonGrams was committed and is in Solr 1.4.  If you decide to use 
> CommonGrams you definitely need to re-index and you also need to use both the 
> index time filter and the query time filter.  Your index will be larger.
>
> 
> 
> 
> 
>
> 
> 
> 
> 
>
>
>
> Tom
> -Original Message-
> From: Peter Karich [mailto:peat...@yahoo.de] 
> Sent: Tuesday, August 10, 2010 3:32 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Improve Query Time For Large Index
>
> Hi Tom,
>
> my index is around 3GB large and I am using 2GB RAM for the JVM although
> a some more is available.
> If I am looking into the RAM usage while a slow query runs (via
> jvisualvm) I see that only 750MB of the JVM RAM is used.
>
>   
>> Can you give us some examples of the slow queries?
>> 
> for example the empty query solr/select?q=
> takes very long or solr/select?q=http
> where 'http' is the most common term
>
>   
>> Are you using stop words?  
>> 
> yes, a lot. I stored them into stopwords.txt
>
>   
>> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
>> 
> this looks interesting. I read through
> https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
> I only need to enable it via:
>
>  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"/>
>
> right? Do I need to reindex?
>
> Regards,
> Peter.
>
>   
>> Hi Peter,
>>
>> A few more details about your setup would help list members to answer your 
>> questions.
>> How large is your index?  
>> How much memory is on the machine and how much is allocated to the JVM?
>> Besides the Solr caches, Solr and Lucene depend on the operating system's 
>> disk caching for caching of postings lists.  So you need to leave some 
>> memory for the OS.  On the other hand if you are optimizing and refreshing 
>> every 10-15 minutes, that will invalidate all the caches, since an optimized 
>> index is essentially a set of new files.
>>
>> Can you give us some examples of the slow queries?  Are you using stop 
>> words?  
>>
>> If your slow queries are phrase queries, then you might try either adding 
>> the most frequent terms in your index to the stopwords list  or try 
>> CommonGrams and add them to the common words list.  (Details on CommonGrams 
>> here: 
>> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)
>>
>> Tom Burton-West
>>
>> -Original Message-
>> From: Peter Karich [mailto:peat...@yahoo.de] 
>> Sent: Tuesday, August 10, 2010 9:54 AM
>> To: solr-user@lucene.apache.org
>> Subject: Improve Query Time For Large Index
>>
>> Hi,
>>
>> I have 5 Million small documents/tweets (=> ~3GB) and the slave index
>> replicates itself from master every 10-15 minutes, so the index is
>> optimized before querying. We are using solr 1.4.1 (patched with
>> SOLR-1624) via SolrJ.
>>
>> Now the search speed is slow >2s for common terms which hits more than 2
>> mio docs and acceptable for others: <0.5s. For those numbers I don't use
>> highlighting or facets. I am using the following schema [1] and from
>> luke handler I know that numTerms =~20 mio. The query for common terms
>> stays slow if I retry again and again (no cache improvements).
>>
>> How can I improve the query time for the common terms without using
>> Distributed Search [2] ?
>>
>> Regards,
>> Peter.
>>
>>
>> [1]
>> > required="t

Re: Improve Query Time For Large Index

2010-08-12 Thread Peter Karich
Hi Robert!

>  Since the example given was "http" being slow, its worth mentioning that if
> queries are "one word" urls [for example http://lucene.apache.org] these
> will actually form slow phrase queries by default.
>   

do you mean that http://lucene.apache.org will be split up into "http
lucene apache org" and solr will perform a phrase query?

Regards,
Peter.


Re: Analysing SOLR logfiles

2010-08-12 Thread Jay Flattery
Thanks - splunk looks overkill.
We're extremely small scale - we're hoping for something open source :-)


- Original Message 
From: Jan Høydahl / Cominvent 
To: solr-user@lucene.apache.org
Sent: Wed, August 11, 2010 11:14:37 PM
Subject: Re: Analysing SOLR logfiles

Have a look at www.splunk.com

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 19.34, Jay Flattery wrote:

> Hi there,
> 
> 
> Just wondering what tools people use to analyse SOLR log files.
> 
> We're looking to do things like extracting common queries, calculating 
>averaging 
>
> 
> Qtime and hits, returning particularly slow/expensive queries, etc.
> 
> Would prefer not to code something (completely) from scratch.
> 
> Thanks!
> 
> 
> 
> 






Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
Hi,

When indexing large amounts of data I hit a problem whereby Solr
becomes unresponsive
and doesn't recover (even when left overnight!). I think i've hit some
GC problems/tuning
is required of GC and I wanted to know if anyone has ever hit this problem.
I can replicate this error (albeit taking longer to do so) using
Solr/Lucene analysers
only so I thought other people might have hit this issue before over
large data sets

Background on my problem follows -- but I guess my main question is -- can Solr
become so overwhelmed by update posts that it becomes completely unresponsive??

Right now I think the problem is that the java GC is hanging but I've
been working
on this all week and it took a while to figure out it might be
GC-based / wasn't a
direct result of my custom analysers so i'd appreciate any advice anyone has
about indexing large document collections.

I also have a second question for those in the know -- do we have a chance
of indexing/searching over our large dataset with what little hardware
we already
have available??

thanks in advance :)

bec

a bit of background: ---

I've got a large collection of articles we want to index/search over
-- about 180k
in total. Each article has say 500-1000 sentences and each sentence has about
15 fields, many of which are multi-valued and we store most fields as well for
display/highlighting purposes. So I'd guess over 100 million index documents.

In our small test collection of 700 articles this results in a single index of
about 13GB.

Our pipeline processes PDF files through to Solr native xml which we call
"index.xml" files i.e. in ... format ready to post straight to Solr's
update handler.

We create the index.xml files as we pull in information from
a few sources and creation of these files from their original PDF form is
farmed out across a grid and is quite time-consuming so we distribute this
process rather than creating index.xml files on the fly...

We do a lot of linguistic processing and to enable search functionality
of our resulting terms requires analysers that split terms/ join terms together
i.e. custom analysers that perform string operations and are quite
time-consuming/
have large overhead compared to most analysers (they take approx
20-30% more time
and use twice as many short-lived objects than the "text" field type).

Right now i'm working on my new Imac:
quad-core 2.8 GHz intel Core i7
16 GB 1067 MHz DDR3 RAM
2TB hard-drive (about half free)
Version 10.6.4 OSX

Production environment:
2 linux boxes each with:
8-core Intel(R) Xeon(R) CPU @ 2.00GHz
16GB RAM

I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
right now).

I setup Solr to use autocommit as we'll have several document collections / post
to Solr from different data sets:

 

  50 
  90 


I also have
  false
1024
10
-
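
(The tags of the block above were lost in the archiving; for reference, an
autoCommit section in solrconfig.xml has the following shape -- the values here
are placeholders, not the ones actually used:)

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>50000</maxDocs>   <!-- placeholder -->
    <maxTime>900000</maxTime>  <!-- placeholder, in milliseconds -->
  </autoCommit>
</updateHandler>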

*** First question:
Has anyone else found that Solr hangs/becomes unresponsive after too
many documents are indexed at once i.e. Solr can't keep up with the post rate?

I've got LCF crawling my local test set (file system connection
required only) and
posting documents to Solr using 6GB of RAM. As I said above, these documents
are in native Solr XML format () with one file per article so each
 contains all the sentence-level documents for the article.

With LCF I post about 2.5/3k articles (files) per hour -- so about
2.5k*500 /3600 =
350 docs per second post-rate -- is this normal/expected??

Eventually, after about 3000 files (an hour or so) Solr starts to hang/becomes
unresponsive and with Jconsole/GC logging I can see that the Old-Gen space is
about 90% full and the following is the end of the solr log file-- where you
can see GC has been called:
--
3012.290: [GC Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 53349392
Max   Chunk Size: 3200168
Number of Blocks: 66
Av.  Block  Size: 808324
Tree  Height: 13
Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 0
Max   Chunk Size: 0
Number of Blocks: 0
Tree  Height: 0
3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
0.0769802 secs]3012.367: [CMS
--

I can replicate this with Solr using "text" field types in place of
those that use my
custom analysers -- whereby Solr takes longer to become unresponsive (about
3 hours / 13k docs) but there is the same kind of GC message at the end
 of the log file / Jconsole shows that the Old-Gen space was almost full so was
due for a collection sweep.

I don't use any special GC settings but found an article here:
http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

that suggests using particular GC settings for Solr -- I will try
these but thought
someone else could suggest anoth

Re: Data Import Handler Query

2010-08-12 Thread Alexey Serba
Try to define the image Solr fields <-> DB columns mapping explicitly in
the "image" entity, i.e.







See 
http://www.lucidimagination.com/search/document/c8f2ed065ee75651/dih_and_multivariable_fields_problems
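
A minimal sketch of such an explicit mapping (the original snippet was stripped
in the archive, so the entity, column and Solr field names below are assumptions
for illustration):

<entity name="image"
        query="select image_url, image_caption from images where item_id='${item.id}'">
  <field column="image_url" name="image_url"/>
  <field column="image_caption" name="image_caption"/>
</entity>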

On Thu, Aug 12, 2010 at 2:30 AM, Manali Joshi  wrote:
> I tried making the schema fields that get the image data to
> multiValued="true". But it still gets only the first image data. It doesn't
> have information about all the images.
>
>
>
>
> On Wed, Aug 11, 2010 at 1:15 PM, kenf_nc  wrote:
>
>>
>> It may not be the data config. Do you have the fields in the schema.xml
>> that
>> the image data is going to set to be multiValued="true"?
>>
>> Although, I would think the last image would be stored, not the first, but
>> haven't really tested this.
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Query-tp1092010p1092917.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>


Re: DIH transformer script size limitations with Jetty?

2010-08-12 Thread Shalin Shekhar Mangar
On Thu, Aug 12, 2010 at 5:42 AM, harrysmith  wrote:

>
> To follow up on my own question, it appears this is only an issue when
> using
> the DataImport console debugging tools. It looks like when submitting the
> debugging request, the data-config.xml is sent via a GET request, which
> would fail.  However, using the exact same data-config.xml via a
> full-import
> operation (ie not a dry run debug), it looks like the request is sent POST
> and the import works fine.
>

You are right. In debug mode, the data-config is sent as a GET request. Can
you open a Jira issue?

-- 
Regards,
Shalin Shekhar Mangar.