Re: How to delete documents from SOLR index using DIH

2010-08-25 Thread Pawan Darira
Thanks Erick. Your solution does make sense. Actually, I wanted to know how to
delete via query or unique id through DIH.

Is there any specific query to be mentioned in data-config.xml? Also, is
there any separate command, like "full-import" or "delta-import", for deleting
documents from the index?
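
For what it's worth (a sketch only; please verify against your Solr version),
DIH does have delete hooks that live in data-config.xml rather than in a
separate command: deletedPkQuery runs during delta-import to find removed rows,
preImportDeleteQuery can replace the default *:* cleanup before a full-import,
and a transformer can emit the special $deleteDocById / $deleteDocByQuery
fields. Roughly, with placeholder table and column names:

<entity name="item" pk="id"
        query="SELECT id, name FROM item"
        deltaQuery="SELECT id FROM item WHERE last_modified > '${dataimporter.last_index_time}'"
        deletedPkQuery="SELECT id FROM item WHERE deleted = 1">
  <field column="id" name="id"/>
  <field column="name" name="name"/>
</entity>

With something like this, deletes are picked up by the normal delta-import
command (/dataimport?command=delta-import); as far as I know there is no
dedicated "delete" command.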



On Thu, Aug 26, 2010 at 12:03 AM, Erick Erickson wrote:

> I'm not sure what you mean here. You can delete via query or unique id. But
> DIH really isn't relevant here.
>
> If you've defined a unique key, simply re-adding any changed documents will
> delete the old one and insert the new document.
>
> If this makes no sense, could you explain what the underlying problem
> you're
> trying to solve is?
>
> HTH
> Erick
>
> On Tue, Aug 24, 2010 at 8:56 PM, Pawan Darira wrote:
>
> > Hi
> >
> > I am using the data import handler to build my index. How can I delete
> > documents from my index using DIH?
> >
> > --
> > Thanks,
> > Pawan Darira
> >
>



-- 
Thanks,
Pawan Darira


JVM GC is very frequent.

2010-08-25 Thread Chengyang
We have about 500 million documents indexed. The index size is about 10 GB.
Running on a 32-bit box. During the pressure testing, we monitored that the JVM
GC is very frequent, about once every 5 minutes. Are there any tips for tuning this?
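
One general tip (an assumption, not specific to this setup): turn on GC logging
and fix the heap size so the collector's behavior becomes visible, for example:

java -Xms1g -Xmx1g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar start.jar

A 32-bit JVM also caps the heap at roughly 2 GB, which leaves little headroom
for an index of this size, so a 64-bit JVM with a larger heap is often part of
the fix.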


Re: Duplicating a Solr Doc

2010-08-25 Thread Max Lynch
It seems like this is a way to accomplish what I was looking for:
// Load the Solr home and core configuration (embedded Solr 1.4.1).
CoreContainer coreContainer = new CoreContainer();
File home = new File("/home/max/packages/test/apache-solr-1.4.1/example/solr");
File f = new File(home, "solr.xml");
coreContainer.load("/home/max/packages/test/apache-solr-1.4.1/example/solr", f);

// Get the core and build a DocumentBuilder from its schema.
SolrCore core = coreContainer.getCore("newsblog");
IndexSchema schema = core.getSchema();
DocumentBuilder builder = new DocumentBuilder(schema);

// get a Lucene Doc
// Document d = ...

// Copy the stored fields from the Lucene Document into a SolrDocument.
SolrDocument solrDocument = new SolrDocument();
builder.loadStoredFields(solrDocument, d);
logger.debug("Loaded stored date: " + solrDocument.getFieldValue("date_added_solr"));

However, one thing that scares me is the warning message I get from the
CoreContainer:
 [java] Aug 25, 2010 10:25:23 PM org.apache.solr.update.SolrIndexWriter
finalize
 [java] SEVERE: SolrIndexWriter was not closed prior to finalize(),
indicates a bug -- POSSIBLE RESOURCE LEAK!!!

I'm not sure what exactly triggers that but it's a result of the code I
posted above.
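
In case it helps, a minimal sketch of explicit cleanup for the code above
(assuming Solr 1.4.x semantics; untested): getCore() takes a reference that
should be released, and the container should be shut down when you are done,
otherwise the underlying SolrIndexWriter is only cleaned up in finalize().

// Release resources explicitly once finished with the core.
core.close();              // releases the reference obtained from getCore()
coreContainer.shutdown();  // closes all cores and their index writers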

On Wed, Aug 25, 2010 at 10:49 PM, Max Lynch  wrote:

> Right now I am doing some processing on my Solr index using Lucene Java.
> Basically, I loop through the index in Java and do some extra processing of
> each document (processing that is too intensive to do during indexing).
>
> However, when I try to update the document in solr with new fields (using
> SolrJ), the document either loses fields I don't explicitly set, or if I
> have Solr-specific fields such as a solr "date" field type, I am not able to
> copy the value as I can't read the value from Java.
>
> Is there a way to add a field to a solr document without having to
> re-create the document?  If not, how can I read the value of a Solr date in
> java?  Document.get("date_field") returns null even though the value shows
> up when I access it through solr.  If I could read this value I could just
> copy the fields from the Lucene Document to a SolrInputDocument.
>
> Thanks.
>


Re: Delete by query issue

2010-08-25 Thread Max Lynch
Thanks Lance.  I'll give that a try going forward.

On Wed, Aug 25, 2010 at 9:59 PM, Lance Norskog  wrote:

> Here's the problem: the standard Solr parser is a little weird about
> negative queries. The way to make this work is to say
>*:* AND -field:[* TO *]
>
> This means "select everything AND only these documents without a value
> in the field".
>
> On Wed, Aug 25, 2010 at 7:55 PM, Max Lynch  wrote:
> > I was trying to filter out all documents that HAVE that field.  I was
> trying
> > to delete any documents where that field had empty values.
> >
> > I just found a way to do it, but I did a range query on a string date in
> the
> > Lucene DateTools format and it worked, so I'm satisfied.  However, I
> believe
> > it worked because all of my documents have values for that field.
> >
> > Oh well.
> >
> > -max
> >
> > On Wed, Aug 25, 2010 at 9:45 PM, scott chu (朱炎詹) wrote:
> >
> >> Excuse me, what's the hyphen before  the field name 'date_added_solr'?
> Is
> >> this some kind of new query format that I didn't know?
> >>
> >> -date_added_solr:[* TO *]'
> >>
> >> - Original Message -
> >> From: "Max Lynch" 
> >> To: 
> >> Sent: Thursday, August 26, 2010 6:12 AM
> >> Subject: Delete by query issue
> >>
> >>
> >> > Hi,
> >> > I am trying to delete all documents that have null values for a
> certain
> >> > field.  To that effect I can see all of the documents I want to delete
> by
> >> > doing this query:
> >> > -date_added_solr:[* TO *]
> >> >
> >> > This returns about 32,000 documents.
> >> >
> >> > However, when I try to put that into a curl call, no documents get
> >> deleted:
> >> > curl http://localhost:8985/solr/newsblog/update?commit=true -H
> >> > "Content-Type: text/xml" --data-binary
> >> > '<delete><query>-date_added_solr:[* TO *]</query></delete>'
> >> >
> >> > Solr responds with:
> >> > <response>
> >> > <lst name="responseHeader"><int name="status">0</int><int name="QTime">364</int></lst>
> >> > </response>
> >> >
> >> > But nothing happens, even if I explicitly issue a commit afterward.
> >> >
> >> > Any ideas?
> >> >
> >> > Thanks.
> >> >
> >>
> >>
> >>
> >>
> 
> >>
> >>
> >>
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Duplicating a Solr Doc

2010-08-25 Thread Max Lynch
Right now I am doing some processing on my Solr index using Lucene Java.
Basically, I loop through the index in Java and do some extra processing of
each document (processing that is too intensive to do during indexing).

However, when I try to update the document in solr with new fields (using
SolrJ), the document either loses fields I don't explicitly set, or if I
have Solr-specific fields such as a solr "date" field type, I am not able to
copy the value as I can't read the value from Java.

Is there a way to add a field to a solr document without having to re-create
the document?  If not, how can I read the value of a Solr date in java?
Document.get("date_field") returns null even though the value shows up when
I access it through solr.  If I could read this value I could just copy the
fields from the Lucene Document to a SolrInputDocument.

Thanks.


Re: Delete by query issue

2010-08-25 Thread Lance Norskog
Here's the problem: the standard Solr parser is a little weird about
negative queries. The way to make this work is to say
*:* AND -field:[* TO *]

This means "select everything AND only these documents without a value
in the field".

On Wed, Aug 25, 2010 at 7:55 PM, Max Lynch  wrote:
> I was trying to filter out all documents that HAVE that field.  I was trying
> to delete any documents where that field had empty values.
>
> I just found a way to do it, but I did a range query on a string date in the
> Lucene DateTools format and it worked, so I'm satisfied.  However, I believe
> it worked because all of my documents have values for that field.
>
> Oh well.
>
> -max
>
> On Wed, Aug 25, 2010 at 9:45 PM, scott chu (朱炎詹) 
> wrote:
>
>> Excuse me, what's the hyphen before  the field name 'date_added_solr'? Is
>> this some kind of new query format that I didn't know?
>>
>> -date_added_solr:[* TO *]'
>>
>> - Original Message -
>> From: "Max Lynch" 
>> To: 
>> Sent: Thursday, August 26, 2010 6:12 AM
>> Subject: Delete by query issue
>>
>>
>> > Hi,
>> > I am trying to delete all documents that have null values for a certain
>> > field.  To that effect I can see all of the documents I want to delete by
>> > doing this query:
>> > -date_added_solr:[* TO *]
>> >
>> > This returns about 32,000 documents.
>> >
>> > However, when I try to put that into a curl call, no documents get
>> deleted:
>> > curl http://localhost:8985/solr/newsblog/update?commit=true -H
>> > "Content-Type: text/xml" --data-binary
>> > '<delete><query>-date_added_solr:[* TO *]</query></delete>'
>> >
>> > Solr responds with:
>> > <response>
>> > <lst name="responseHeader"><int name="status">0</int><int name="QTime">364</int></lst>
>> > </response>
>> >
>> > But nothing happens, even if I explicitly issue a commit afterward.
>> >
>> > Any ideas?
>> >
>> > Thanks.
>> >
>>
>>
>>
>> 
>>
>>
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Is there any stress test tool for testing Solr?

2010-08-25 Thread Amit Nithian
I recommend JMeter. We use that to do load testing on a search server. Of
course, you have to provide a reasonable set of queries as input... if you
don't have any, then a reasonable estimation based on your expected traffic
should suffice. JMeter can be used for other load testing too.

Be careful though.. as silly as this may sound.. do NOT just issue random
queries, because that won't exercise your caches... We had a load test that
killed our servers because our caches kept getting blown out. Of course, the
traffic being generated was purely random and was not representative of
real-world traffic, which usually has more predictable behavior.

hope that helps!
Amit

On Wed, Aug 25, 2010 at 7:50 PM, scott chu (朱炎詹) wrote:

> We're currently building a Solr index with over 1.2 million documents. I
> want to do a good stress test of it. Does anyone know if there's an
> appropriate stress test tool for Solr? Or any good suggestions?
>
> Best Regards,
>
> Scott
>


Re: Restricting HTML search?

2010-08-25 Thread Lance Norskog
Cool!  I did not know that Tika had a thorough & careful HTML parser.

On Wed, Aug 25, 2010 at 7:49 PM, Ken Krugler
 wrote:
> Actually TagSoup's reason for existence is to clean up all of the messy HTML
> that's out in the wild.
>
> Tika's HTML parser wraps this, and uses it to generate the stream of SAX
> events that it then consumes and turns into a normalized XHTML 1.0-compliant
> data stream.
>
> -- Ken
>
> On Aug 25, 2010, at 7:22pm, Lance Norskog wrote:
>
>> This assumes that the HTML is good quality. I don't know exactly what
>> your use case is. If you're crawling the web you will find some very
>> screwed-up HTML.
>>
>> On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
>>  wrote:
>>>
>>> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>>>
 Wouldn't the usage of NekoHTML (as an XML-parser) and XPath be
 safer?
 I guess it all depends on the "quality" of the source document.
>>>
>>> If you're processing HTML then you definitely want to use something like
>>> NekoHTML or TagSoup.
>>>
>>> Note that Tika uses TagSoup and makes it easy to do special processing of
>>> specific elements - you give it a content handler that gets fed a stream
>>> of
>>> cleaned-up HTML elements.
>>>
>>> -- Ken
>>>
 Le 25-août-10 à 02:09, Lance Norskog a écrit :

> I would do this with regular expressions. There is a Pattern Analyzer
> and a Tokenizer which do regular expression-based text chopping. (I'm
> not sure how to make them do what you want). A more precise tool is
> the RegexTransformer in the DataImportHandler.
>
> Lance
>
> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>  wrote:
>>
>> I'm quite new to SOLR and wondering if the following is possible: in
>> addition to normal full text search, my users want to have the option to
>> search only HTML heading innertext, i.e. content inside of <h1>, <h2>,
>> or <h3> tags.

>>>
>>> 
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c   w e b   m i n i n g
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
> 
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Delete by query issue

2010-08-25 Thread Max Lynch
I was trying to filter out all documents that HAVE that field.  I was trying
to delete any documents where that field had empty values.

I just found a way to do it, but I did a range query on a string date in the
Lucene DateTools format and it worked, so I'm satisfied.  However, I believe
it worked because all of my documents have values for that field.

Oh well.

-max

On Wed, Aug 25, 2010 at 9:45 PM, scott chu (朱炎詹) wrote:

> Excuse me, what's the hyphen before  the field name 'date_added_solr'? Is
> this some kind of new query format that I didn't know?
>
> -date_added_solr:[* TO *]'
>
> - Original Message -
> From: "Max Lynch" 
> To: 
> Sent: Thursday, August 26, 2010 6:12 AM
> Subject: Delete by query issue
>
>
> > Hi,
> > I am trying to delete all documents that have null values for a certain
> > field.  To that effect I can see all of the documents I want to delete by
> > doing this query:
> > -date_added_solr:[* TO *]
> >
> > This returns about 32,000 documents.
> >
> > However, when I try to put that into a curl call, no documents get
> deleted:
> > curl http://localhost:8985/solr/newsblog/update?commit=true -H
> > "Content-Type: text/xml" --data-binary
> > '<delete><query>-date_added_solr:[* TO *]</query></delete>'
> >
> > Solr responds with:
> > <response>
> > <lst name="responseHeader"><int name="status">0</int><int name="QTime">364</int></lst>
> > </response>
> >
> > But nothing happens, even if I explicitly issue a commit afterward.
> >
> > Any ideas?
> >
> > Thanks.
> >
>
>
>
> 
>
>
>
>


Re: Solr searching performance issues, using large documents (now 1MB documents)

2010-08-25 Thread Yonik Seeley
On Wed, Aug 25, 2010 at 2:34 PM, Peter Spam  wrote:
> This is a very small number of documents (7000), so I am surprised Solr is 
> having such a hard time with it!!
>
> I do facet on 3 terms.
>
> Subsequent "hello" searches are faster, but still well over a second.  This 
> is a very fast Mac Pro, with 6GB of RAM.

Search apps often need tweaking for best performance.
We probably need to determine if you are IO bound (because the index
is large enough that there are many disk seeks) or if you are CPU
bound (possible, depending on the faceting).

Perhaps one easy thing to start with is to add debugQuery=true and
report the timings of the different components it gives.
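
For example (an illustrative URL, using the query from this thread):

http://localhost:8983/solr/select?q=hello&debugQuery=true

The "timing" section of the debug output shows how long each search component
(query, faceting, highlighting, etc.) took.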

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Is there any stress test tool for testing Solr?

2010-08-25 Thread 朱炎詹
We're currently building a Solr index with over 1.2 million documents. I
want to do a good stress test of it. Does anyone know if there's an
appropriate stress test tool for Solr? Or any good suggestions?


Best Regards,

Scott 



Re: Increasing Logging of Delta Queries

2010-08-25 Thread Lance Norskog
There is a LogTransformer that logs data instead of adding to the document:

http://www.lucidimagination.com/search/document/CDRG_ch06_6.4.7.3?q=logging
transformer

http://wiki.apache.org/solr/DataImportHandler#LogTransformer
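
Roughly, it is attached to an entity in data-config.xml like this (entity name,
query and template are placeholders):

<entity name="item" transformer="LogTransformer"
        query="SELECT id, name FROM item"
        logTemplate="Imported row: ${item.id}" logLevel="info">
  <field column="id" name="id"/>
  <field column="name" name="name"/>
</entity>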

On Wed, Aug 25, 2010 at 12:35 PM, Vladimir Sutskever
 wrote:
> Hi All,
>
> Is there a way to increase the debugging level of SOLR delta query imports?
> I would like to see the records that have been "picked up" by SOLR spit out to
> standard output or a log file.
>
>
> Thank You!
>
>
> Kind regards,
>
> Vladimir Sutskever
> Investment Bank - Technology
> JPMorgan Chase, Inc.
>
>
>
> This email is confidential and subject to important disclaimers and
> conditions including on offers for the purchase or sale of
> securities, accuracy and completeness of information, viruses,
> confidentiality, legal privilege, and legal entity disclaimers,
> available at http://www.jpmorgan.com/pages/disclosures/email.



-- 
Lance Norskog
goks...@gmail.com


Re: Restricting HTML search?

2010-08-25 Thread Ken Krugler
Actually TagSoup's reason for existence is to clean up all of the  
messy HTML that's out in the wild.


Tika's HTML parser wraps this, and uses it to generate the stream of
SAX events that it then consumes and turns into a normalized XHTML
1.0-compliant data stream.


-- Ken

On Aug 25, 2010, at 7:22pm, Lance Norskog wrote:


This assumes that the HTML is good quality. I don't know exactly what
your use case is. If you're crawling the web you will find some very
screwed-up HTML.

On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
 wrote:


On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:

Wouldn't the usage of NekoHTML (as an XML-parser) and XPath
be safer?

I guess it all depends on the "quality" of the source document.


If you're processing HTML then you definitely want to use something  
like

NekoHTML or TagSoup.

Note that Tika uses TagSoup and makes it easy to do special  
processing of
specific elements - you give it a content handler that gets fed a  
stream of

cleaned-up HTML elements.

-- Ken


Le 25-août-10 à 02:09, Lance Norskog a écrit :

I would do this with regular expressions. There is a Pattern  
Analyzer
and a Tokenizer which do regular expression-based text chopping.  
(I'm

not sure how to make them do what you want). A more precise tool is
the RegexTransformer in the DataImportHandler.

Lance

On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
 wrote:


I'm quite new to SOLR and wondering if the following is possible: in
addition to normal full text search, my users want to have the option to
search only HTML heading innertext, i.e. content inside of <h1>, <h2>,
or <h3> tags.





Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g









--
Lance Norskog
goks...@gmail.com



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Solr searching performance issues, using large documents (now 1MB documents)

2010-08-25 Thread Lance Norskog
How much disk space is used by the index?

If you run the Lucene CheckIndex program, how many terms etc. does it report?

When you do the first facet query, how much does the memory in use grow?

Are you storing the text fields, or only indexing? Do you fetch the
facets only, or do you also fetch the document contents?
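
For reference, CheckIndex can be run from the command line roughly like this
(adjust the jar and index paths to your installation):

java -cp /path/to/lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index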

On Wed, Aug 25, 2010 at 11:34 AM, Peter Spam  wrote:
> This is a very small number of documents (7000), so I am surprised Solr is 
> having such a hard time with it!!
>
> I do facet on 3 terms.
>
> Subsequent "hello" searches are faster, but still well over a second.  This 
> is a very fast Mac Pro, with 6GB of RAM.
>
>
> Thanks,
> Peter
>
> On Aug 25, 2010, at 9:52 AM, Yonik Seeley wrote:
>
>> On Wed, Aug 25, 2010 at 11:29 AM, Peter Spam  wrote:
>>> So, I went through all the effort to break my documents into max 1 MB 
>>> chunks, and searching for hello still takes over 40 seconds (searching 
>>> across 7433 documents):
>>>
>>>        8 results (41980 ms)
>>>
>>> What is going on???  (scroll down for my config).
>>
>> Are you still faceting on that query also?
>> Breaking your docs into many chunks means inflating the doc count and
>> will make faceting slower.
>> Also, first-time faceting (as with sorting) is slow... did you try
>> another query after  "hello" (and without a commit happening
>> inbetween) to see if it was faster?
>>
>> -Yonik
>> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Delete by query issue

2010-08-25 Thread 朱炎詹
Excuse me, what's the hyphen before  the field name 'date_added_solr'? Is this 
some kind of new query format that I didn't know?

-date_added_solr:[* TO *]'

- Original Message - 
From: "Max Lynch" 
To: 
Sent: Thursday, August 26, 2010 6:12 AM
Subject: Delete by query issue


> Hi,
> I am trying to delete all documents that have null values for a certain
> field.  To that effect I can see all of the documents I want to delete by
> doing this query:
> -date_added_solr:[* TO *]
> 
> This returns about 32,000 documents.
> 
> However, when I try to put that into a curl call, no documents get deleted:
> curl http://localhost:8985/solr/newsblog/update?commit=true -H
> "Content-Type: text/xml" --data-binary
> '<delete><query>-date_added_solr:[* TO *]</query></delete>'
> 
> Solr responds with:
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">364</int></lst>
> </response>
> 
> But nothing happens, even if I explicitly issue a commit afterward.
> 
> Any ideas?
> 
> Thanks.
>








How to set custom fields for SolrSearchBean Query in Nutch?

2010-08-25 Thread Savannah Beckett
I am using SolrSearchBean inside my custom parse filter in Nutch 1.1. My
Solr/Nutch setup is working. I have Nutch crawl and index into Solr, and I am
able to search the Solr index from my Solr admin page. My Solr schema is
completely different from the one in Nutch. When I try to query my Solr index
using SolrSearchBean, it somehow always treats my query as using fields like
content, site, url, etc., but my Solr index has none of those fields. Of
course, there is an exception complaining that it cannot execute the query.


How do I make SolrSearchBean use my Solr setup's fields instead of the Nutch
ones? Thanks.


  

Re: Distinct values versus schema change?

2010-08-25 Thread Lance Norskog
What you want is something called 'field collapsing'. This is a Solr
implementation that (at a high level) gives you one of these documents
and a report of how many more match the query. Collapsing multiple
product styles/colors/sizes to one consumer-visible product is a
common use case for this. Another use case is the 'More results from
this site' that you get from a Google search. This feature is slowly
being added  into the trunk.

Many people opt to fetch a few hundred records and do the "collapse"
themselves in the app. There are ways to optimize pulling those
records.
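
A rough client-side sketch of that approach with SolrJ (untested; the Solr URL
and the query are made up, and "itmid" is the field from the post below):

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CollapseByItemId {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Pull a few hundred rows for the category and keep the first doc per itmid.
    SolrQuery q = new SolrQuery("cgyname:\"Girls Costumes\"");  // illustrative query
    q.setRows(300);
    QueryResponse rsp = server.query(q);

    Map<Object, SolrDocument> collapsed = new LinkedHashMap<Object, SolrDocument>();
    for (SolrDocument doc : rsp.getResults()) {
      Object itemId = doc.getFieldValue("itmid");
      if (!collapsed.containsKey(itemId)) {
        collapsed.put(itemId, doc);  // first (best-ranked) document per item wins
      }
    }
    System.out.println("Distinct items: " + collapsed.size());
  }
}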

On Wed, Aug 25, 2010 at 10:50 AM, Willie Whitehead  wrote:
> Hi,
>
> I'm having a problem where a Solr query on all items in one category
> is returning duplicated items when an item appears in more than one
> subcategory. My schema involves a document for each item's subcategory
> instance. I know this is not correct.
>
> I'm not sure if I ever tried multiple values on subcategories. (Before
> the latest changes to the schema, I was only getting the first
> subcategory instance and I had a problem with parentcgyid.) Could you
> review the 3 results from 1 item below and advise how I can return
> only Distinct values for the itmid field?
>
> I think it's best that I change the schema to support multiple values.
> I'm currently already using faceting for the subcategories. Do I have
> to use it for this purpose also, or should I move forward to improve
> my schema configuration?
>
> 
> 14440
> Girl Costume
> GIRLCOSTUME
> 14440-GIRLCOSTUME
> Girl Costume Girl Child
> 9.99
> 1400
> Girls Costumes
> 8.99
> girl-costume-for-child-GIRLCOSTUME
> girls+costumes
> occupational
> -
> 
> L
> M
> S
> 
> In Stock
> Occupational|14440
> 
> -
> 
> 14150
> Girl Costume
> GIRLCOSTUME
> 14150-GIRLCOSTUME
> Girl Costume Girl Child
> 9.99
> 1400
> Girls Costumes
> 8.99
> girl-costume-for-child-GIRLCOSTUME
> girls+costumes
> classic
> -
> 
> L
> M
> S
> 
> In Stock
> Classic|14150
> 
> -
> 
> 14010
> Girl Costume
> GIRLCOSTUME
> 14010-GIRLCOSTUME
> Girl Costume Girl Child
> 9.99
> 1400
> Girls Costumes
> 8.99
> girl-costume-for-child-GIRLCOSTUME
> girls+costumes
> 50s+costumes
> -
> 
> L
> M
> S
> 
> In Stock
> 50's Costumes|14010
> 
>
>
> Thanks!
>



-- 
Lance Norskog
goks...@gmail.com


Re: Regd WSTX EOFException

2010-08-25 Thread Lance Norskog
Does this happen when you are indexing with many threads at once?
There are reports of sockets blocking and timing out in during
multi-threaded indexing.

On Wed, Aug 25, 2010 at 6:40 AM, Yonik Seeley
 wrote:
> On Wed, Aug 25, 2010 at 6:41 AM, Pooja Verlani  
> wrote:
>> Hi,
>> Sometimes while indexing to solr, I am getting  the following exception.
>> "com.ctc.wstx.exc.WstxEOFException: Unexpected end of input block in end tag"
>> I think its some configuration issue. Kindly suggest.
>>
>> I have a solr working with Tomcat 6
>
> Sounds like the input is sometimes being truncated (or corrupted) when
> it's sent to solr.
> What client are you using?
>
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
>



-- 
Lance Norskog
goks...@gmail.com


Re: Restricting HTML search?

2010-08-25 Thread Lance Norskog
This assumes that the HTML is good quality. I don't know exactly what
your use case is. If you're crawling the web you will find some very
screwed-up HTML.

On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
 wrote:
>
> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>
>> Wouldn't the usage of NekoHTML (as an XML-parser) and XPath be safer?
>> I guess it all depends on the "quality" of the source document.
>
> If you're processing HTML then you definitely want to use something like
> NekoHTML or TagSoup.
>
> Note that Tika uses TagSoup and makes it easy to do special processing of
> specific elements - you give it a content handler that gets fed a stream of
> cleaned-up HTML elements.
>
> -- Ken
>
>> Le 25-août-10 à 02:09, Lance Norskog a écrit :
>>
>>> I would do this with regular expressions. There is a Pattern Analyzer
>>> and a Tokenizer which do regular expression-based text chopping. (I'm
>>> not sure how to make them do what you want). A more precise tool is
>>> the RegexTransformer in the DataImportHandler.
>>>
>>> Lance
>>>
>>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>>>  wrote:

 I'm quite new to SOLR and wondering if the following is possible: in
 addition to normal full text search, my users want to have the option to
 search only HTML heading innertext, i.e. content inside of <h1>, <h2>,
 or <h3> tags.
>>
>
> 
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: SolrJ addField with Reader

2010-08-25 Thread Lance Norskog
There are a couple of options here. Solr can fetch text from a file or
from HTTP given a URL. Look at the stream.file and stream.url
parameters. You can use these from EmbeddedSolr.

Also, there are 'ContentStream' objects in the SolrJ API which you can
also use. Look at
http://lucene.apache.org/solr/api/org/apache/solr/common/util/ContentStreamBase.FileStream.html.
The unit tests have a few examples of how to use it.
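
A rough SolrJ sketch along those lines, based on the 1.4-era wiki examples
(untested; the handler path, file name and literal.id parameter are assumptions,
and /update/extract requires the ExtractingRequestHandler to be configured):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class StreamFileToSolr {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Stream the file's contents to Solr instead of building one huge String.
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("big-document.txt"));
    req.setParam("literal.id", "doc1");  // unique key for the resulting document
    server.request(req);
    server.commit();
  }
}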

Lance

On Wed, Aug 25, 2010 at 12:43 AM, Shalin Shekhar Mangar
 wrote:
> On Tue, Aug 24, 2010 at 10:37 AM, Bojan Vukojevic wrote:
>
>> I am using SolrJ with embedded  Solr server and some documents have a lot
>> of
>> text. Solr will be running on a small device with very limited memory. In
>> my
>> tests I cannot process more than 3MB of text (in a body) with 64MB heap.
>> According to Java there is about 30MB free memory before I call server.add
>> and with 5MB of text it runs out of memory.
>>
>> Is there a way around this?
>>
>> Is there a plan to enhance SolrJ to allow a reader to be passed in instead
>> of a string?
>>
>>
> Can you please open a Jira issue?
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Create a new index while Solr is running

2010-08-25 Thread 朱炎詹
Take a look at the Multicore feature, in particular the SWAP, CREATE & MERGE
actions.


Eric Pugh's "Solr 1.4 Enterprise Search Server" book has a good explanation.

Scott

- Original Message - 
From: "mraible" 

To: 
Sent: Thursday, August 26, 2010 6:31 AM
Subject: Create a new index while Solr is running




We're starting to use Solr for our application. The data that we'll be
indexing will change often and not accumulate over time. This means that 
we

want to blow away our index and re-create it every hour or so. What's the
easier way to do this while Solr is running and not give users a "no data
found" while we're doing it? In other words, keep the existing index in
place until the new one is done being created. I searched the docs a bit,
but couldn't find the answer I was looking for.

Thanks,

Matt
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Create-a-new-index-while-Solr-is-running-tp1342556p1342556.html

Sent from the Solr - User mailing list archive at Nabble.com.












Re: Create a new index while Solr is running

2010-08-25 Thread Ron Mayer
mraible wrote:
> We're starting to use Solr for our application. The data that we'll be
> indexing will change often and not accumulate over time. This means that we
> want to blow away our index and re-create it every hour or so. What's the
> easier way to do this while Solr is running and not give users a "no data
> found" while we're doing it? In other words, keep the existing index in
> place until the new one is done being created. I searched the docs a bit,
> but couldn't find the answer I was looking for. 

The "Multi-core" feature seems to work pretty well for me with a similar
case where I re-build indexes while a system's still live...
   http://wiki.apache.org/solr/CoreAdmin
in particular you might be interested in the "SWAP" core command
on that page seems to be what you want.
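
For reference, the swap itself is just a CoreAdmin call, roughly like this
(core names are illustrative):

curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=rebuild'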

The one thing I didn't figure out yet is that while my new index
is building, and it decides to merge segments, everything (even
connecting to the admin page) on the other core is annoyingly slow.
Not sure if the machine's just too I/O constrained, or if something
else is happening. I guess a common solution is to build the new
index on a different machine and use replication to move it over.


Oh - or wouldn't everything just magically work if you do your deletes
and adds and make sure you don't "commit" until all the adds were done?


Create a new index while Solr is running

2010-08-25 Thread mraible

We're starting to use Solr for our application. The data that we'll be
indexing will change often and not accumulate over time. This means that we
want to blow away our index and re-create it every hour or so. What's the
easier way to do this while Solr is running and not give users a "no data
found" while we're doing it? In other words, keep the existing index in
place until the new one is done being created. I searched the docs a bit,
but couldn't find the answer I was looking for. 

Thanks,

Matt
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Create-a-new-index-while-Solr-is-running-tp1342556p1342556.html
Sent from the Solr - User mailing list archive at Nabble.com.


Delete by query issue

2010-08-25 Thread Max Lynch
Hi,
I am trying to delete all documents that have null values for a certain
field.  To that effect I can see all of the documents I want to delete by
doing this query:
-date_added_solr:[* TO *]

This returns about 32,000 documents.

However, when I try to put that into a curl call, no documents get deleted:
curl http://localhost:8985/solr/newsblog/update?commit=true -H
"Content-Type: text/xml" --data-binary
'<delete><query>-date_added_solr:[* TO *]</query></delete>'

Solr responds with:
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">364</int></lst>
</response>

But nothing happens, even if I explicitly issue a commit afterward.

Any ideas?

Thanks.


Re: how to deal with virtual collection in solr?

2010-08-25 Thread Jan Høydahl / Cominvent
> 1. Currently we use Verity and have more than 20 collections, each collection 
> has an index for public items and an index for private items. So there are 
> virtual collections which point to each collection and a virtual collection 
> which points to all. For example, we have AA and BB collections.
> 
> AA virtual collection --> (AA index for public items and AA index for private 
> items).
> BB virtual collection --> (BB index for public items and BB index for private 
> items).
> All virtual collection --> (AA index for public items and AA index for 
> private items, BB index for public items and BB index for private items).
> 
> Would you please tell me what I should do for this if I use Solr?

There are multiple ways to solve this, depending on the nature of your 
collections. If they have somewhat different schemas, a natural choice would be 
to make multiple cores: AA-private, AA-public, BB-private, BB-public. Now you 
can query them individually or in combinations through the shards parameter. 
From next Solr version you can use virtual collections for the shard parameter, 
e.g. &shards=AA,BB etc. (See 
http://wiki.apache.org/solr/SolrCloud#Distributed_Requests)
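
As an illustration (host and core names are made up), a query over both AA cores
with plain distributed search could look like:

http://localhost:8983/solr/AA-public/select?q=foo&shards=localhost:8983/solr/AA-public,localhost:8983/solr/AA-private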

If all your content is (roughly) the same kind of data, you could also solve 
your virtual collection issue through a "collection" field in your schema, and 
simply select collection through filters: &fq=collection:AA. You could even 
write a Search Component which translates a &collection= parameter in the 
request into the correct filters if you want to hide this implementation to the 
front ends.

> 2. Our project has files in different formats that I need to index, for
> example XML files, PDF files and text files. Is it possible for Solr to
> return a search result across all of them?

Sure. PDF and text files can be indexed through the ExtractingRequestHandler. 
XML can be indexed from XMLUpdateHandler or DataImportHandler. Solr uses Apache 
Tika internally to extract text from PDFs and other rich document formats.

> 
> 3. I got an error when I index PDF files which are version 1.5 or 1.6. Would
> you please tell me if there is a patch to fix it?

How did you try to index these PDFs? What version of Solr are you using? 
Exactly what error message did you get?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com



Re: Slow facet sorting - lex vs count

2010-08-25 Thread Yonik Seeley
On Wed, Aug 25, 2010 at 7:22 AM, Eric Grobler  wrote:
> Hi Solr experts,
>
> There is a huge difference doing facet sorting on lex vs count
> The strange thing is that count sorting is fast when setting a small limit.
> I realize I can do sorting in the client, but I am just curious why this is.
>
> FAST - 16ms
> facet.field=city
> f.city.facet.limit=5000
> f.city.facet.sort=lex
>
> FAST - 20 ms
> facet.field=city
> f.city.facet.limit=50
> f.city.facet.sort=count
>
> SLOW - over 1 second
> facet.field=city
> f.city.facet.limit=5000
> f.city.facet.sort=count

FYI, I just tried my own single-valued faceting test:
10M documents, query matches 1M docs, faceting on a field that has
100,000 unique values:

facet.limit=100 -> 35ms
facet.limit=5000 -> 44ms
facet.limit=5 -> 100ms

The times are reported via QTime (i.e. they do not include the time to
write out the response to the client).
Maybe you're running into memory issues because of the size of the
BoundedTreeSet, response size, etc, and garbage collection is taking
up a lot of time?

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


RE: how to deal with virtual collection in solr?

2010-08-25 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thank you for letting me know. Does Autonomy still support Verity search 
engine? 


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, August 25, 2010 3:41 PM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr? 

On Aug 25, 2010, at 12:18 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

> I just started to investigate Solr several weeks ago. Our current project 
> uses Verity search engine which is commercial product and the company is out 
> of business. 


Verity is not out of business. They were acquired by Autonomy.

wunder
--
Walter Underwood





Re: how to deal with virtual collection in solr?

2010-08-25 Thread Walter Underwood
On Aug 25, 2010, at 12:18 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

> I just started to investigate Solr several weeks ago. Our current project 
> uses Verity search engine which is commercial product and the company is out 
> of business. 


Verity is not out of business. They were acquired by Autonomy.

wunder
--
Walter Underwood





Increasing Logging of Delta Queries

2010-08-25 Thread Vladimir Sutskever
Hi All,

Is there a way to increase the debugging level of SOLR delta query imports?
I would like to see the records that have been "picked up" by SOLR spit out to
standard output or a log file.


Thank You!


Kind regards,

Vladimir Sutskever
Investment Bank - Technology
JPMorgan Chase, Inc.



This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.  

Re: Slow facet sorting - lex vs count

2010-08-25 Thread Yonik Seeley
On Wed, Aug 25, 2010 at 2:50 PM, Yonik Seeley
 wrote:
> On Wed, Aug 25, 2010 at 10:55 AM, Eric Grobler
>  wrote:
>> Thanks for the technical explanation.
>> I will in general try to use lex and sort by count in the client if there
>> are not too many rows.
>
> I just developed a patch that may help this scenario:
> https://issues.apache.org/jira/browse/SOLR-2089
>
> If you have the ability to try out trunk, I'd be interested in the results.

Oh, wait, this will only help with multiValued fields... re-reading
your description, it looks like your field is single valued?  I guess
the time is then taken in the priority queue (actually a
BoundedTreeSet) that keeps the top 5000 results sorted.  Still it's
surprising that it's taking that long.

Was the 1 sec obtained from QTime in the Solr header, or measured externally?
If measured externally, what was the QTime as Solr reported it?

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8


how to deal with virtual collection in solr?

2010-08-25 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Hello,

I just started to investigate Solr several weeks ago. Our current project uses
the Verity search engine, which is a commercial product, and the company is out
of business. I am trying to evaluate whether Solr can meet our requirements. I
have the following questions.

1. Currently we use Verity and have more than 20 collections, each collection 
has an index for public items and an index for private items. So there are 
virtual collections which point to each collection and a virtual collection 
which points to all. For example, we have AA and BB collections.

AA virtual collection --> (AA index for public items and AA index for private 
items).
BB virtual collection --> (BB index for public items and BB index for private 
items).
All virtual collection --> (AA index for public items and AA index for private 
items, BB index for public items and BB index for private items).

Would you please tell me what I should do for this if I use Solr?

2. Our project has files in different formats that I need to index, for example
XML files, PDF files and text files. Is it possible for Solr to return a search
result across all of them?

3. I got an error when I index PDF files which are version 1.5 or 1.6. Would you
please tell me if there is a patch to fix it?

Thanks so much in advance,


Re: Slow facet sorting - lex vs count

2010-08-25 Thread Yonik Seeley
On Wed, Aug 25, 2010 at 10:55 AM, Eric Grobler
 wrote:
> Thanks for the technical explanation.
> I will in general try to use lex and sort by count in the client if there
> are not too many rows.

I just developed a patch that may help this scenario:
https://issues.apache.org/jira/browse/SOLR-2089

If you have the ability to try out trunk, I'd be interested in the results.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Solr searching performance issues, using large documents (now 1MB documents)

2010-08-25 Thread Peter Spam
This is a very small number of documents (7000), so I am surprised Solr is 
having such a hard time with it!!

I do facet on 3 terms.

Subsequent "hello" searches are faster, but still well over a second.  This is 
a very fast Mac Pro, with 6GB of RAM.


Thanks,
Peter

On Aug 25, 2010, at 9:52 AM, Yonik Seeley wrote:

> On Wed, Aug 25, 2010 at 11:29 AM, Peter Spam  wrote:
>> So, I went through all the effort to break my documents into max 1 MB 
>> chunks, and searching for hello still takes over 40 seconds (searching 
>> across 7433 documents):
>> 
>>8 results (41980 ms)
>> 
>> What is going on???  (scroll down for my config).
> 
> Are you still faceting on that query also?
> Breaking your docs into many chunks means inflating the doc count and
> will make faceting slower.
> Also, first-time faceting (as with sorting) is slow... did you try
> another query after  "hello" (and without a commit happening
> inbetween) to see if it was faster?
> 
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8



Re: How to delete documents from SOLR index using DIH

2010-08-25 Thread Erick Erickson
I'm not sure what you mean here. You can delete via query or unique id. But
DIH really isn't relevant here.

If you've defined a unique key, simply re-adding any changed documents will
delete the old one and insert the new document.

If this makes no sense, could you explain what the underlying problem you're
trying to solve is?

HTH
Erick

On Tue, Aug 24, 2010 at 8:56 PM, Pawan Darira wrote:

> Hi
>
> I am using the data import handler to build my index. How can I delete
> documents from my index using DIH?
>
> --
> Thanks,
> Pawan Darira
>


Distinct values versus schema change?

2010-08-25 Thread Willie Whitehead
Hi,

I'm having a problem where a Solr query on all items in one category
is returning duplicated items when an item appears in more than one
subcategory. My schema involves a document for each item's subcategory
instance. I know this is not correct.

I'm not sure if I ever tried multiple values on subcategories. (Before
the latest changes to the schema, I was only getting the first
subcategory instance and I had a problem with parentcgyid.) Could you
review the 3 results from 1 item below and advise how I can return
only Distinct values for the itmid field?

I think it's best that I change the schema to support multiple values.
I'm currently already using faceting for the subcategories. Do I have
to use it for this purpose also, or should I move forward to improve
my schema configuration?


14440
Girl Costume
GIRLCOSTUME
14440-GIRLCOSTUME
Girl Costume Girl Child
9.99
1400
Girls Costumes
8.99
girl-costume-for-child-GIRLCOSTUME
girls+costumes
occupational
-

L
M
S

In Stock
Occupational|14440

-

14150
Girl Costume
GIRLCOSTUME
14150-GIRLCOSTUME
Girl Costume Girl Child
9.99
1400
Girls Costumes
8.99
girl-costume-for-child-GIRLCOSTUME
girls+costumes
classic
-

L
M
S

In Stock
Classic|14150

-

14010
Girl Costume
GIRLCOSTUME
14010-GIRLCOSTUME
Girl Costume Girl Child
9.99
1400
Girls Costumes
8.99
girl-costume-for-child-GIRLCOSTUME
girls+costumes
50s+costumes
-

L
M
S

In Stock
50's Costumes|14010



Thanks!


Re: Solr searching performance issues, using large documents (now 1MB documents)

2010-08-25 Thread Yonik Seeley
On Wed, Aug 25, 2010 at 11:29 AM, Peter Spam  wrote:
> So, I went through all the effort to break my documents into max 1 MB chunks, 
> and searching for hello still takes over 40 seconds (searching across 7433 
> documents):
>
>        8 results (41980 ms)
>
> What is going on???  (scroll down for my config).

Are you still faceting on that query also?
Breaking your docs into many chunks means inflating the doc count and
will make faceting slower.
Also, first-time faceting (as with sorting) is slow... did you try
another query after  "hello" (and without a commit happening
inbetween) to see if it was faster?

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Solr searching performance issues, using large documents (now 1MB documents)

2010-08-25 Thread Peter Spam
So, I went through all the effort to break my documents into max 1 MB chunks, 
and searching for hello still takes over 40 seconds (searching across 7433 
documents):

8 results (41980 ms)

What is going on???  (scroll down for my config).


-Peter
 
On Aug 16, 2010, at 3:59 PM, Markus Jelsma wrote:

> I've no idea if it's possible, but I'd at least try to return an ArrayList of
> rows instead of just a single row. And if it doesn't work, which is probably
> the case, how about filing an issue in Jira?
> 
>  
> 
> Reading the docs on the matter, I think it should be (made) possible to
> return multiple rows in an ArrayList.
>  
> -Original message-
> From: Peter Spam 
> Sent: Tue 17-08-2010 00:47
> To: solr-user@lucene.apache.org; 
> Subject: Re: Solr searching performance issues, using large documents
> 
> Still stuck on this - any hints on how to write the JavaScript to split a 
> document?  Thanks!
> 
> 
> -Pete
> 
> On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:
> 
>> You may have to write your own javascript to read in the giant field
>> and split it up.
>> 
>> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam  wrote:
>>> I've read through the DataImportHandler page a few times, and still can't 
>>> figure out how to separate a large document into smaller documents.  Any 
>>> hints? :-)  Thanks!
>>> 
>>> -Peter
>>> 
>>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>>> 
 Spanning won't work- you would have to make overlapping mini-documents
 if you want to support this.
 
 I don't know how big the chunks should be- you'll have to experiment.
 
 Lance
 
 On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam  wrote:
> What would happen if the search query phrase spanned separate document 
> chunks?
> 
> Also, what would the optimal size of chunks be?
> 
> Thanks!
> 
> 
> -Peter
> 
> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
> 
>> Not that I know of.
>> 
>> The DataImportHandler has the ability to create multiple documents
>> from one input stream. It is possible to create a DIH file that reads
>> large log files and splits each one into N documents, with the file
>> name as a common field. The DIH wiki page tells you in general how to
>> make a DIH file.
>> 
>> http://wiki.apache.org/solr/DataImportHandler
>> 
>> From this, you should be able to make a DIH file that puts log files
>> in as separate documents. As to splitting files up into
>> mini-documents, you might have to write a bit of Javascript to achieve
>> this. There is no data structure or software that implements
>> structured documents.
>> 
>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam  wrote:
>>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>>> 
>>> 
>>> -Peter
>>> 
>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>> 
 Ah! You're not just highlighting, you're snippetizing. This makes it 
 easier.
 
 Highlighting does not stream- it pulls the entire stored contents into
 one string and then pulls out the snippet.  If you want this to be
 fast, you have to split up the text into small pieces and only
 snippetize from the most relevant text. So, separate documents with a
 common group id for the document it came from. You might have to do 2
 queries to achieve what you want, but the second query for the same
 query will be blindingly fast. Often <1ms.
 
 Good luck!
 
 Lance
 
 On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam  wrote:
> However, I do need to search the entire document, or else the 
> highlighting will sometimes be blank :-(
> Thanks!
> 
> - Peter
> 
> ps. sorry for the many responses - I'm rushing around trying to get 
> this working.
> 
> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
> 
>> Correction - it went from 17 seconds to 10 seconds - I was changing 
>> the hl.regex.maxAnalyzedChars the first time.
>> Thanks!
>> 
>> -Peter
>> 
>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>> 
>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>> 
 did you already try other values for hl.maxAnalyzedChars=2147483647
>>> 
>>> Yes, I tried dropping it down to 21, but it didn't have much of an 
>>> impact (one search I just tried went from 17 seconds to 15.8 
>>> seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>> 
 ? Also regular expression highlighting is more expensive, I think.
 What does the 'fuzzy' variable mean? If you use this to query via
 "~someTerm" instead "someTerm"
 then you should try the trunk o

Re: Slow facet sorting - lex vs count

2010-08-25 Thread Eric Grobler
Hi Yonik,

Thanks for the technical explanation.
I will in general try to use lex and sort by count in the client if there
are not too many rows.

Have a nice day.

Regards
ericz


On Wed, Aug 25, 2010 at 4:41 PM, Yonik Seeley wrote:

> On Wed, Aug 25, 2010 at 10:07 AM, Eric Grobler
>  wrote:
> > I use Solr 1.41
> > There are 14000 cities in the index.
> > The type is just a simple string: <fieldType name="string"
> > class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
> > The facet method is fc.
> >
> > You are right I do not need 5000 cities, I was just surprised to see this
> > big difference, there are places where I do need to sort count and return
> > about 500 items.
> >
> > If Solr was also slow in locating the highest count city it would be less
> > surprising.
> > In other words, if I set the limit to 1, then solr returns Berlin as the
> > city with the highest count within 3ms which seems to indicate that the
> > facet is internally sorted by count.
> > However, the speed regresses linearly, 30ms for 10, 300ms for 1000 etc.
>
> The priority queue collecting values will be larger of course, but in
> this specific instance I bet most of  the time is being taken up in
> converting from term number to term value.  Here's a snippet of a
> comment from the implementation:
>  *   To further save memory, the terms (the actual string values) are
> not all stored in
>  *   memory, but a TermIndex is used to convert term numbers to term values
> only
>  *   for the terms needed after faceting has completed.  Only every
> 128th term value
>  *   is stored, along with it's corresponding term number, and this is
> used as an
>  *   index to find the closest term and iterate until the desired
> number is hit (very
>  *   much like Lucene's own internal term index).
>
> This is something that Lucene has improved in trunk, and that solr can
> make improvements to also.
> Besides optimizations, we could also implement options to store all
> values and eliminate the need to read the index to do the ord->string
> conversions.
>
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
>
>
> > Regards
> > Eric
> >
> > On Wed, Aug 25, 2010 at 3:28 PM, Yonik Seeley <
> yo...@lucidimagination.com>
> > wrote:
> >>
> >> On Wed, Aug 25, 2010 at 7:22 AM, Eric Grobler <
> impalah...@googlemail.com>
> >> wrote:
> >> > There is a huge difference doing facet sorting on lex vs count
> >> > The strange thing is that count sorting is fast when setting a small
> >> > limit.
> >> > I realize I can do sorting in the client, but I am just curious why
> this
> >> > is.
> >>
> >> There are a lot of optimizations to make things fast for the common
> >> case - and setting a really high limit makes some of those
> >> ineffective.  Hopefully you don't really need to return the top 5000
> >> cities?
> >> What version of Solr is this? What faceting method is used? Is this a
> >> multi-valued field?  How many unique values are in the city field?
> >> How many docs in the index?
> >>
> >> -Yonik
> >> http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
> >>
> >>
> >> > FAST - 16ms
> >> > facet.field=city
> >> > f.city.facet.limit=5000
> >> > f.city.facet.sort=lex
> >> >
> >> > FAST - 20 ms
> >> > facet.field=city
> >> > f.city.facet.limit=50
> >> > f.city.facet.sort=count
> >> >
> >> > SLOW - over 1 second
> >> > facet.field=city
> >> > f.city.facet.limit=5000
> >> > f.city.facet.sort=count
> >> >
> >> > Regards
> >> > ericz
> >> >
> >
> >
>


Re: Slow facet sorting - lex vs count

2010-08-25 Thread Yonik Seeley
On Wed, Aug 25, 2010 at 10:07 AM, Eric Grobler
 wrote:
> I use Solr 1.41
> There are 14000 cities in the index.
> The type is just a simple string: <fieldType name="string"
> class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
> The facet method is fc.
>
> You are right I do not need 5000 cities, I was just surprised to see this
> big difference, there are places where I do need to sort count and return
> about 500 items.
>
> If Solr was also slow in locating the highest count city it would be less
> surprising.
> In other words, if I set the limit to 1, then solr returns Berlin as the
> city with the highest count within 3ms which seems to indicate that the
> facet is internally sorted by count.
> However, the speed regresses linearly, 30ms for 10, 300ms for 1000 etc.

The priority queue collecting values will be larger of course, but in
this specific instance I bet most of  the time is being taken up in
converting from term number to term value.  Here's a snippet of a
comment from the implementation:
 *   To further save memory, the terms (the actual string values) are
not all stored in
 *   memory, but a TermIndex is used to convert term numbers to term values only
 *   for the terms needed after faceting has completed.  Only every
128th term value
 *   is stored, along with it's corresponding term number, and this is
used as an
 *   index to find the closest term and iterate until the desired
number is hit (very
 *   much like Lucene's own internal term index).

This is something that Lucene has improved in trunk, and that solr can
make improvements to also.
Besides optimizations, we could also implement options to store all
values and eliminate the need to read the index to do the ord->string
conversions.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


> Regards
> Eric
>
> On Wed, Aug 25, 2010 at 3:28 PM, Yonik Seeley 
> wrote:
>>
>> On Wed, Aug 25, 2010 at 7:22 AM, Eric Grobler 
>> wrote:
>> > There is a huge difference doing facet sorting on lex vs count
>> > The strange thing is that count sorting is fast when setting a small
>> > limit.
>> > I realize I can do sorting in the client, but I am just curious why this
>> > is.
>>
>> There are a lot of optimizations to make things fast for the common
>> case - and setting a really high limit makes some of those
>> ineffective.  Hopefully you don't really need to return the top 5000
>> cities?
>> What version of Solr is this? What faceting method is used? Is this a
>> multi-valued field?  How many unique values are in the city field?
>> How many docs in the index?
>>
>> -Yonik
>> http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
>>
>>
>> > FAST - 16ms
>> > facet.field=city
>> > f.city.facet.limit=5000
>> > f.city.facet.sort=lex
>> >
>> > FAST - 20 ms
>> > facet.field=city
>> > f.city.facet.limit=50
>> > f.city.facet.sort=count
>> >
>> > SLOW - over 1 second
>> > facet.field=city
>> > f.city.facet.limit=5000
>> > f.city.facet.sort=count
>> >
>> > Regards
>> > ericz
>> >
>
>


Re: Slow facet sorting - lex vs count

2010-08-25 Thread Eric Grobler
Hi Yonik,

Thanks for your response.

I use Solr 1.4.1.
There are 14000 cities in the index.
The type is just a simple string: <fieldType name="string"
class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
The facet method is fc.

You are right I do not need 5000 cities, I was just surprised to see this
big difference, there are places where I do need to sort count and return
about 500 items.

If Solr was also slow in locating the highest count city it would be less
surprising.
In other words, if I set the limit to 1, then solr returns Berlin as the
city with the highest count within 3ms which seems to indicate that the
facet is internally sorted by count.
However, the speed regresses linearly, 30ms for 10, 300ms for 1000 etc.

Regards
Eric

On Wed, Aug 25, 2010 at 3:28 PM, Yonik Seeley wrote:

> On Wed, Aug 25, 2010 at 7:22 AM, Eric Grobler 
> wrote:
> > There is a huge difference doing facet sorting on lex vs count
> > The strange thing is that count sorting is fast when setting a small
> limit.
> > I realize I can do sorting in the client, but I am just curious why this
> is.
>
> There are a lot of optimizations to make things fast for the common
> case - and setting a really high limit makes some of those
> ineffective.  Hopefully you don't really need to return the top 5000
> cities?
> What version of Solr is this? What faceting method is used? Is this a
> multi-valued field?  How many unique values are in the city field?
> How many docs in the index?
>
> -Yonik
> http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
>
>
> > FAST - 16ms
> > facet.field=city
> > f.city.facet.limit=5000
> > f.city.facet.sort=lex
> >
> > FAST - 20 ms
> > facet.field=city
> > f.city.facet.limit=50
> > f.city.facet.sort=count
> >
> > SLOW - over 1 second
> > facet.field=city
> > f.city.facet.limit=5000
> > f.city.facet.sort=count
> >
> > Regards
> > ericz
> >
>


Re: Solr search speed very low

2010-08-25 Thread Geert-Jan Brits
have a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters to
see how that works.
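
A minimal field type along the lines Marco suggests below might look like this
in schema.xml (the type name is illustrative; "itemname" is the field from the
original post):

<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="itemname" type="text_ws" indexed="true" stored="true"/>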

2010/8/25 Marco Martinez 

> You should use the tokenizer solr.WhitespaceTokenizerFactory in your field
> type to get your terms indexed. Once you have indexed the data, you don't
> need to use the * in your queries; that is a heavy query for Solr.
>
> Marco Martínez Bautista
> http://www.paradigmatecnologico.com
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón
> Tel.: 91 352 59 42
>
>
> 2010/8/25 Andrey Sapegin 
>
> > Dear ladies and gentlemen.
> >
> > I'm newbie with Solr, I didn't find an aswer in wiki, so I'm writing
> here.
> >
> > I'm analysing Solr performance and have 1 problem. *Search time is about
> > 7-10 seconds per query.*
> >
> > I have a *.csv 5Gb-database with about 15 fields and 1 key field (record
> > number). I uploaded it to Solr without any problem using curl. This
> database
> > contains information about books and I'm intrested in keyword search
> using
> > one of the fields (not a key field). I mean that if I search, for
> example,
> > for word "Hello", I expect response with sentences containing "Hello":
> > "Hello all"
> > "Hello World"
> > "I say Hello to all"
> > etc.
> >
> > I tested it from console using time command and curl:
> >
> > /usr/bin/time -o test_results/time_solr -a curl "
> >
> http://localhost:8983/solr/select/?q=itemname:*$query*&version=2.2&start=0&rows=10&indent=on
> "
> > -6 2>&1 >> test_results/response_solr
> >
> > So, my query is *itemname:*$query**. 'Itemname' - is the name of field.
> > $query - is a bash variable containing only 1 word. All works fine.
> > *But unfortunately, search time is about 7-10 seconds per query.* For
> > example, Sphinx spent only about 0.3 second per query.
> > If I use only $query, without stars (*), I receive answer pretty fast,
> but
> > only exact matches.
> > And I want to see any sentence containing my $query in the response.
> Thats
> > why I'm using stars.
> >
> > NOW THE QUESTION.
> > Is my query syntax correct (*field:*word**) for keyword search)? Why
> > response time is so big? Can I reduce search time?
> >
> > Thank You in advance,
> > Kind Regards,
> >
> > Andrey Sapegin,
> > Software Developer,
> >
> > Unister GmbH
> > Barfußgässchen 11 | 04109 Leipzig
> >
> > andrey.sape...@unister-gmbh.de 
> > www.unister.de 
> >
> >
>


Re: Restricting HTML search?

2010-08-25 Thread Ken Krugler


On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:

Wouldn't the usage of NekoHTML (as an XML parser) and XPath be
safer?

I guess it all depends on the "quality" of the source document.


If you're processing HTML then you definitely want to use something  
like NekoHTML or TagSoup.


Note that Tika uses TagSoup and makes it easy to do special processing  
of specific elements - you give it a content handler that gets fed a  
stream of cleaned-up HTML elements.
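
A rough sketch of that approach, here with TagSoup used directly (class, field
and method names are only illustrative); it collects the text found inside
h1/h2/h3 elements so it can be indexed into a separate heading-only field:

import java.io.StringReader;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingExtractor {

    /** Returns the text found inside h1/h2/h3 elements of (possibly messy) HTML. */
    public static String extractHeadings(String html) throws Exception {
        final StringBuilder headings = new StringBuilder();

        Parser parser = new Parser();      // TagSoup's SAX parser, tolerant of broken HTML
        parser.setContentHandler(new DefaultHandler() {
            private int inHeading = 0;     // > 0 while inside an h1/h2/h3 element

            public void startElement(String uri, String local, String name, Attributes atts) {
                if (local.matches("h[1-3]")) inHeading++;
            }
            public void endElement(String uri, String local, String name) {
                if (local.matches("h[1-3]") && inHeading > 0) inHeading--;
            }
            public void characters(char[] ch, int start, int length) {
                if (inHeading > 0) headings.append(ch, start, length).append(' ');
            }
        });
        parser.parse(new InputSource(new StringReader(html)));

        return headings.toString().trim();
    }
}

The returned string can then go into its own indexed field (say, heading_text)
next to the full-text field, so a "headings only" search is simply a query
against that field.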


-- Ken


On 25 Aug 2010 at 02:09, Lance Norskog wrote:


I would do this with regular expressions. There is a Pattern Analyzer
and a Tokenizer which do regular expression-based text chopping. (I'm
not sure how to make them do what you want). A more precise tool is
the RegexTransformer in the DataImportHandler.

Lance

On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
 wrote:

I'm quite new to SOLR and wondering if the following is possible: in
addition to normal full text search, my users want to have the  
option to
search only HTML heading innertext, i.e. content inside of <h1>, <h2>, or <h3>
tags.





Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Regd WSTX EOFException

2010-08-25 Thread Yonik Seeley
On Wed, Aug 25, 2010 at 6:41 AM, Pooja Verlani  wrote:
> Hi,
> Sometimes while indexing to solr, I am getting  the following exception.
> "com.ctc.wstx.exc.WstxEOFException: Unexpected end of input block in end tag"
> I think its some configuration issue. Kindly suggest.
>
> I have a solr working with Tomcat 6

Sounds like the input is sometimes being truncated (or corrupted) when
it's sent to solr.
What client are you using?

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Slow facet sorting - lex vs count

2010-08-25 Thread Yonik Seeley
On Wed, Aug 25, 2010 at 7:22 AM, Eric Grobler  wrote:
> There is a huge difference doing facet sorting on lex vs count
> The strange thing is that count sorting is fast when setting a small limit.
> I realize I can do sorting in the client, but I am just curious why this is.

There are a lot of optimizations to make things fast for the common
case - and setting a really high limit makes some of those
ineffective.  Hopefully you don't really need to return the top 5000
cities?
What version of Solr is this? What faceting method is used? Is this a
multi-valued field?  How many unique values are in the city field?
How many docs in the index?

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8


> FAST - 16ms
> facet.field=city
> f.city.facet.limit=5000
> f.city.facet.sort=lex
>
> FAST - 20 ms
> facet.field=city
> f.city.facet.limit=50
> f.city.facet.sort=count
>
> SLOW - over 1 second
> facet.field=city
> f.city.facet.limit=5000
> f.city.facet.sort=count
>
> Regards
> ericz
>


Slow facet sorting - lex vs count

2010-08-25 Thread Eric Grobler
Hi Solr experts,

There is a huge difference doing facet sorting on lex vs count
The strange thing is that count sorting is fast when setting a small limit.
I realize I can do sorting in the client, but I am just curious why this is.

FAST - 16ms
facet.field=city
f.city.facet.limit=5000
f.city.facet.sort=lex

FAST - 20 ms
facet.field=city
f.city.facet.limit=50
f.city.facet.sort=count

SLOW - over 1 second
facet.field=city
f.city.facet.limit=5000
f.city.facet.sort=count

Regards
ericz


Re: Solr search speed very low

2010-08-25 Thread Marco Martinez
You should use the tokenizer solr.WhitespaceTokenizerFactory in your field
type so that the individual terms get indexed. Once you have indexed the data
that way, you don't need to use the * in your queries; a leading-wildcard
query is very heavy for Solr.
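
To illustrate: once itemname is tokenized (WhitespaceTokenizerFactory, ideally
with a lower-case filter so "Hello" and "hello" match), the same search becomes
a plain term query. A minimal SolrJ sketch with an illustrative URL:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class KeywordSearch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // With the field analyzed into tokens, itemname:hello matches "Hello all",
        // "Hello World", "I say Hello to all", ... without the leading/trailing
        // wildcards of itemname:*hello*, which force Solr to scan the whole term
        // dictionary.
        SolrQuery query = new SolrQuery("itemname:hello");
        query.setRows(10);

        QueryResponse rsp = server.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("itemname"));
        }
    }
}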

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/25 Andrey Sapegin 

> Dear ladies and gentlemen.
>
> I'm newbie with Solr, I didn't find an aswer in wiki, so I'm writing here.
>
> I'm analysing Solr performance and have 1 problem. *Search time is about
> 7-10 seconds per query.*
>
> I have a *.csv 5Gb-database with about 15 fields and 1 key field (record
> number). I uploaded it to Solr without any problem using curl. This database
> contains information about books and I'm intrested in keyword search using
> one of the fields (not a key field). I mean that if I search, for example,
> for word "Hello", I expect response with sentences containing "Hello":
> "Hello all"
> "Hello World"
> "I say Hello to all"
> etc.
>
> I tested it from console using time command and curl:
>
> /usr/bin/time -o test_results/time_solr -a curl "
> http://localhost:8983/solr/select/?q=itemname:*$query*&version=2.2&start=0&rows=10&indent=on";
> -6 2>&1 >> test_results/response_solr
>
> So, my query is *itemname:*$query**. 'Itemname' - is the name of field.
> $query - is a bash variable containing only 1 word. All works fine.
> *But unfortunately, search time is about 7-10 seconds per query.* For
> example, Sphinx spent only about 0.3 second per query.
> If I use only $query, without stars (*), I receive answer pretty fast, but
> only exact matches.
> And I want to see any sentence containing my $query in the response. Thats
> why I'm using stars.
>
> NOW THE QUESTION.
> Is my query syntax correct (*field:*word**) for keyword search)? Why
> response time is so big? Can I reduce search time?
>
> Thank You in advance,
> Kind Regards,
>
> Andrey Sapegin,
> Software Developer,
>
> Unister GmbH
> Barfußgässchen 11 | 04109 Leipzig
>
> andrey.sape...@unister-gmbh.de 
> www.unister.de 
>
>


Regd WSTX EOFException

2010-08-25 Thread Pooja Verlani
Hi,
Sometimes while indexing to Solr, I am getting the following exception:
"com.ctc.wstx.exc.WstxEOFException: Unexpected end of input block in end tag"
I think it's some configuration issue. Kindly suggest.

I have Solr working with Tomcat 6.

Thanks
Pooja


Solr search speed very low

2010-08-25 Thread Andrey Sapegin

Dear ladies and gentlemen.

I'm a newbie with Solr and I didn't find an answer in the wiki, so I'm writing here.

I'm analysing Solr performance and have one problem. *Search time is about
7-10 seconds per query.*


I have a 5 GB *.csv database with about 15 fields and one key field (record
number). I uploaded it to Solr without any problem using curl. This
database contains information about books, and I'm interested in keyword
search using one of the fields (not the key field). I mean that if I
search, for example, for the word "Hello", I expect a response with sentences
containing "Hello":

"Hello all"
"Hello World"
"I say Hello to all"
etc.

I tested it from console using time command and curl:

/usr/bin/time -o test_results/time_solr -a curl 
"http://localhost:8983/solr/select/?q=itemname:*$query*&version=2.2&start=0&rows=10&indent=on"; 
-6 2>&1 >> test_results/response_solr


So, my query is itemname:*$query*. 'itemname' is the name of the field, and
$query is a bash variable containing only one word. All works fine.
*But unfortunately, search time is about 7-10 seconds per query.* For
comparison, Sphinx spent only about 0.3 seconds per query.
If I use only $query, without the stars (*), I receive an answer pretty fast,
but only exact matches.
And I want to see any sentence containing my $query in the response.
That's why I'm using the stars.


NOW THE QUESTION.
Is my query syntax (field:*word*) correct for keyword search? Why is the
response time so big? Can I reduce the search time?


Thank You in advance,
Kind Regards,

Andrey Sapegin,
Software Developer,

Unister GmbH
Barfußgässchen 11 | 04109 Leipzig

andrey.sape...@unister-gmbh.de 
www.unister.de 



Re: SolrException log

2010-08-25 Thread Tommaso Teofili
Hi again Bastian,

2010/8/23 Bastian Spitzer 

>  I dont seem to find a decent documentation on how those  parameters
> actually work.
>
> this is the default, example block:
>
>
> <deletionPolicy class="solr.SolrDeletionPolicy">
>   <str name="maxCommitsToKeep">1</str>
>   <str name="maxOptimizedCommitsToKeep">0</str>
> </deletionPolicy>
>
>
> so do i have to increase the maxCommitsToKeep to a value of 2 when i add a
> maxCommitAge Parameter? Or will 1 still be enough?


I would advise raising the number of commit points kept to a reasonable value,
considering your indexing (and commit) and searching frequencies. In fact,
keeping too many commit points wastes disk space, but keeping "enough" should
prevent your issue.
I would do some tests with small values of maxCommitsToKeep (no more than
10/20) and with maxCommitAge set to one of the proposed values (30MINUTES or
1DAY) and see what happens.


> Do i have to
> call optimize more than once a day when i add maxOptimizedCommitsToKeep
> with a value of 1?


> can some1 please explain how this is supposed to work?
>

This (SolrDeletionPolicy) is an implementation of Lucene's IndexDeletionPolicy
class that decides which index commit points get deleted.
As you can see from the code, when a new commit arrives the list of current
commit points is retrieved and only the ones that respect maxCommitAge are
kept; the others are discarded.
If you have an IndexSearcher/Reader/Writer open on a just-discarded commit
point you will eventually hit that issue, but since you are not running on an
NFS-like file system I am not sure this is the case here; still, my advice
stands, and some testing with maxCommitAge and maxCommitsToKeep should
clarify it.
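
To make the mechanism concrete, here is a stripped-down illustration against
the Lucene 2.9 interface that Solr 1.4 uses (this is not Solr's actual
SolrDeletionPolicy, just the contract it builds on): a policy that keeps only
the newest N commit points and deletes the rest.

import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

public class KeepLastNDeletionPolicy implements IndexDeletionPolicy {

    private final int commitsToKeep;

    public KeepLastNDeletionPolicy(int commitsToKeep) {
        this.commitsToKeep = commitsToKeep;
    }

    public void onInit(List commits) throws IOException {
        onCommit(commits);
    }

    public void onCommit(List commits) throws IOException {
        // Lucene passes the commit points sorted oldest first; the last entry is
        // the commit that just happened and must never be deleted.
        for (int i = 0; i < commits.size() - commitsToKeep; i++) {
            ((IndexCommit) commits.get(i)).delete();
        }
    }
}

On a shared file system, a searcher still holding files from a commit point
that such a policy has just deleted is the kind of situation that produces the
"read past EOF" errors described earlier.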
My 2 cents, have a nice day.
Tommaso



>
> -Ursprüngliche Nachricht-
> Von: Bastian Spitzer [mailto:bspit...@magix.net]
> Gesendet: Montag, 23. August 2010 16:40
> An: solr-user@lucene.apache.org
> Betreff: Re: SolrException log
>
> Hi Tommaso,
>
> Thanks for your Reply. The Solr Files are on local disk, on a reiserfs.
> I'll try to set a Deletion Policy and report back if that solved the
> problem, thank you for the hint.
>
> cheers,
> Bastian
>
> -Ursprüngliche Nachricht-
> Von: Tommaso Teofili [mailto:tommaso.teof...@gmail.com]
> Gesendet: Montag, 23. August 2010 15:31
> An: solr-user@lucene.apache.org
> Betreff: Re: SolrException log
>
> Hi Bastian,
> this seems to be related to IO and file deletion (optimization compacts and
> removes index files), are you running Solr on NFS or a distributed file
> system?
> You could set a propert IndexDeletionPolicy (SolrDeletionPolicy) in
> solrconfig.xml to handle this.
> My 2 cents,
> Tommaso
>
> 2010/8/11 Bastian Spitzer 
>
> > Hi,
> >
> > we are using solr 1.4.1 in a master-slave setup with replication,
> > requests are loadbalanced to both instances. this is just working
> > fine, but the slave behaves strange sometimes with a "SolrException
> > log" (trace below). We are using 1.4.1 for weeks now, and this has
> > happened only a few times so far, and it only occured on the Slave.
> > The Problem seemed to be gone when we added a cron-job to send a
> > periodic <optimize/> (once a day) to the master, but today it did
> > happen again. The Index contains 55 files right now, after optimize
> > there are only 10. So it seems its a problem when the index is spread
> > among a lot files. The Slave wont ever recover once this Exception
> > shows up, the only thing that helps is a restart.
> >
> > Is this a known issue? Only workaround would be to track the
> > commit-counts and send additional <optimize/> requests after a certain
> > amount of commits, but id prefer solving this problem rather than
> > building a workaround..
> >
> > Any hints/thoughts on this issue are verry much appreciated, thanks in
> > advance for your help.
> >
> > cheers Bastian.
> >
> > Aug 11, 2010 4:51:58 PM org.apache.solr.core.SolrCore execute
> > INFO: [] webapp=/solr path=/select
> > params={fl=media_id,keyword_1004&sort=priority_1000+desc,+score+desc&indent=off&start=0&q=mandant_id:1000+AND+partner_id:1000+AND+active_1000:true+AND+cat_id_path_1000:7231/7258*+AND+language_id:1004&rows=24&version=2.2}
> > status=500 QTime=2
> > Aug 11, 2010 4:51:58 PM org.apache.solr.common.SolrException log
> > SEVERE: java.io.IOException: read past EOF
> >   at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
> >   at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> >   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
> >   at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:112)
> >   at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:461)
> >   at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
> >   at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
> >   at org.apache.lucene.search.FieldCacheIm

solrCloud zookeepr related excpetions

2010-08-25 Thread Yatir Ben Shlomo
Hi, I am running a ZooKeeper ensemble of 3 instances and have set up SolrCloud
to work with it (2 masters, 2 slaves); on each master machine I have 2 shards
(4 shards in total).
On one of the masters I keep noticing ZooKeeper-related exceptions which I
can't understand:
One appears to be a timeout in ClientCnxn.java:906.
The other is java.lang.IllegalArgumentException: Path cannot be null
(PathUtils.java:45).

Here are my logs (I set the log level to FINE on the zookeeper package).

Can anyone identify the issue?



FINE: Reading reply sessionid:0x12a97312613010b, packet:: clientPath:null 
serverPath:null finished:false header:: -8,101  replyHeader:: -8,-1,0  
request:: 
30064776552,v{'/collections},v{},v{'/collections/ENPwl/shards/ENPWL1,'/collections/ENPwl/shards/ENPWL4,'/collections/ENPwl/shards/ENPWL2,'/collections,'/collections/ENPwl/shards/ENPWL3,'/collections/ENPwlMaster/shards/ENPWLMaster_3,'/collections/ENPwlMaster/shards/ENPWLMaster_4,'/live_nodes,'/collections/ENPwlMaster/shards/ENPWLMaster_1,'/collections/ENPwlMaster/shards/ENPWLMaster_2}
  response:: null
Aug 25, 2010 5:18:19 AM org.apache.log4j.Category debug
FINE: Reading reply sessionid:0x12a97312613010b, packet:: clientPath:null 
serverPath:null finished:false header:: 540,8  replyHeader:: 540,-1,0  
request:: '/collections,F  response:: v{'ENPwl,'ENPwlMaster}
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader 
updateCloudState
INFO: Cloud state update for ZooKeeper already scheduled
Aug 25, 2010 5:18:19 AM org.apache.log4j.Category error
SEVERE: Error while calling watcher
java.lang.IllegalArgumentException: Path cannot be null
at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
at 
org.apache.zookeeper.ZooKeeper.getChildren(zookeeper:ZooKeeper.java):1196)
at 
org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:200)
at 
org.apache.solr.common.cloud.ZkStateReader$5.process(ZkStateReader.java:315)
at 
org.apache.zookeeper.ClientCnxn$EventThread.run(zookeeper:ClientCnxn.java):425)
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader$4 process
INFO: Detected a shard change under ShardId:ENPWL3 in collection:ENPwl
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader 
updateCloudState
INFO: Cloud state update for ZooKeeper already scheduled
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader$4 process
INFO: Detected a shard change under ShardId:ENPWL4 in collection:ENPwl
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader 
updateCloudState
INFO: Cloud state update for ZooKeeper already scheduled
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader$4 process
INFO: Detected a shard change under ShardId:ENPWL1 in collection:ENPwl
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader 
updateCloudState
INFO: Cloud state update for ZooKeeper already scheduled
Aug 25, 2010 5:18:19 AM org.apache.solr.cloud.ZkController$2 process
INFO: Updating live nodes:org.apache.solr.common.cloud.solrzkcli...@55308275
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader 
updateCloudState
INFO: Updating live nodes from ZooKeeper...
Aug 25, 2010 5:18:19 AM org.apache.log4j.Category debug
FINE: Reading reply sessionid:0x12a97312613010b, packet:: clientPath:null 
serverPath:null finished:false header:: 541,8  replyHeader:: 541,-1,0  
request:: '/live_nodes,F  response:: 
v{'ob1078.nydc1.outbrain.com:8983_solr2,'ob1078.nydc1.outbrain.com:8983_solr1,'ob1061.nydc1.outbrain.com:8983_solr2,'ob1062.nydc1.outbrain.com:8983_solr1,'ob1062.nydc1.outbrain.com:8983_solr2,'ob1061.nydc1.outbrain.com:8983_solr1,'ob1077.nydc1.outbrain.com:8983_solr2,'ob1077.nydc1.outbrain.com:8983_solr1}
Aug 25, 2010 5:18:19 AM org.apache.log4j.Category error
SEVERE: Error while calling watcher
java.lang.IllegalArgumentException: Path cannot be null
at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
at 
org.apache.zookeeper.ZooKeeper.getChildren(zookeeper:ZooKeeper.java):1196)
at 
org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:200)
at org.apache.solr.cloud.ZkController$2.process(ZkController.java:321)
at 
org.apache.zookeeper.ClientCnxn$EventThread.run(zookeeper:ClientCnxn.java):425)
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ConnectionManager process
INFO: Watcher org.apache.solr.common.cloud.connectionmana...@339bb448 
name:ZooKeeperConnection Watcher:zook1:2181,zook2:2181,zook3:2181 got event 
WatchedEvent: Server state change. New state: Disconnected path:null type:None
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader$4 process
INFO: Detected a shard change under ShardId:ENPWLMaster_1 in 
collection:ENPwlMaster
Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader 
updateCloudState
INFO: Cloud state update for ZooKeeper already scheduled
Aug 25, 2010 5:18:19 AM org.apa

Re: SolrJ addField with Reader

2010-08-25 Thread Shalin Shekhar Mangar
On Tue, Aug 24, 2010 at 10:37 AM, Bojan Vukojevic wrote:

> I am using SolrJ with embedded  Solr server and some documents have a lot
> of
> text. Solr will be running on a small device with very limited memory. In
> my
> tests I cannot process more than 3MB of text (in a body) with 64MB heap.
> According to Java there is about 30MB free memory before I call server.add
> and with 5MB of text it runs out of memory.
>
> Is there a way around this?
>
> Is there a plan to enhance SolrJ to allow a reader to be passed in instead
> of a string?
>
>
Can you please open a Jira issue?
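
In the meantime, one possible workaround is to stream the document from disk
instead of building the whole body as a String before calling server.add(). A
rough, untested sketch, assuming a standalone Solr with the
ExtractingRequestHandler mapped to /update/extract (with EmbeddedSolrServer
everything still shares one heap, so this mainly avoids holding the full text
in your own code); the file path and field names are only illustrative:

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class StreamingIndexSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Stream the file instead of reading it into a String and calling
        // doc.addField("body", hugeString); extraction then happens inside Solr.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("/data/big-document.txt"));   // illustrative path
        req.setParam("literal.id", "doc-1");               // unique key as a literal
        req.setParam("fmap.content", "body");              // map extracted text to "body"
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
    }
}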

-- 
Regards,
Shalin Shekhar Mangar.


Re: reduce the content???

2010-08-25 Thread Shalin Shekhar Mangar
On Wed, Aug 25, 2010 at 12:51 PM, satya swaroop  wrote:

> Hi all,
>  i indexed nearly 100 java pdf files which are of large size(min 1MB).
> The solr is showing the results with the entire content that it indexed
> which is taking time to show the results.. cant we reduce the content it
> shows or can i just have the file names and ids instead of the entire
> content in the results
>
>
Change the fields in your schema to stored="false" so that the content in
those fields is indexed but not returned. Alternately, you can choose to
limit the fields to be returned in the response using the "fl" parameter.
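
For the second option, a small SolrJ sketch (server URL and field names are
only illustrative) that asks for just the id and file name instead of the
whole extracted content:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class FieldListSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("java");
        query.setFields("id", "filename");   // same as &fl=id,filename on the URL
        query.setRows(10);

        QueryResponse rsp = server.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id") + " -> "
                    + doc.getFieldValue("filename"));
        }
    }
}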

-- 
Regards,
Shalin Shekhar Mangar.


Re: 'Error 404: missing core name in path ' in adminconsole

2010-08-25 Thread Robert Naczinski
Thanks for your help.

I bound de.lvm.services.logging.PerformanceLoggingFilter in web.xml
and mapped it to /admin/*.
It works fine with EmbeddedSolr. I get a NullPointerException in some links
under admin/index.jsp, but I will solve that problem.

Robert

2010/8/25 Chris Hostetter :
>
> : we use in our application to the JEE EmbeddedSolrServer. It works very
> : well. Now I wanted to create the admin JSPs. For that I have copied
> : the JSPs from webroot Solr example. When I try to access
> : ...admin/index.jsp , I get 'Error 404: missing core name in path'
>
> just copying JSPs isn't enough to make the Solr admin interface magically
> work in an app that uses EmbeddedSolr -- EmbeddedSolr is really not
> designed to be used this way at all; that's the trade-off of using Embedded
> vs. just running Solr as an app.
>
> In particular, the SolrDispatchFilter is responsible for intercepting all
> HTTP requests to Solr, and sets up a lot of pre-conditions that the JSPs
> depend on -- it's also what executes the various Handlers that
> make many of the admin URLs work.
>
> If you really insist on pursuing this route, i suggest you start by
> looking at the existing Solr web.xml
>
>
> -Hoss
>
> --
> http://lucenerevolution.org/  ...  October 7-8, Boston
> http://bit.ly/stump-hoss      ...  Stump The Chump!
>
>


reduce the content???

2010-08-25 Thread satya swaroop
Hi all,
  I indexed nearly 100 Java PDF files, each of them fairly large (at least 1 MB).
Solr is showing the results with the entire content that it indexed,
which takes time to return. Can we reduce the content it shows, or can I
just have the file names and ids instead of the entire content in the
results?

Regards,
satya