Re: Indexing gets significantly slower after every batch commit

2015-05-22 Thread Siegfried Goeschl
Hi Angel,

a while ago I had issues with a VMware VM - somehow snapshots were created 
regularly, which dragged down the machine. So I think it is a good idea to 
baseline the performance on a physical box before moving to VMs, production boxes 
or whatever is thrown at you

Cheers,

Siegfried Goeschl

> On 22 May 2015, at 11:15, Angel Todorov  wrote:
> 
> Thanks for the feedback guys. What i am going to try now is deploying my
> SOLR server on a physical machine with more RAM, and checking out this
> scenario there. I have some suspicion it could well be a hypervisor issue,
> but let's see. Just for the record - I've noticed those issues on a Win
> 2008R2 VM with 8 GB of RAM and 2 cores.
> 
> I don't see anything strange in the logs. One thing that I need to change,
> though, is the verbosity of logs in the console - looks like by default
> SOLR outputs text in the log for every single document that's indexed, as
> well as for every query that's executed.
> 
> Angel
> 
> 
> On Fri, May 22, 2015 at 1:03 AM, Erick Erickson 
> wrote:
> 
>> bq: Which is logical as index growth and time needed to put something
>> to it is log(n)
>> 
>> Not really. Solr indexes to segments, each segment is a fully
>> consistent "mini index".
>> When a segment gets flushed to disk, a new one is started. Of course
>> there'll be a
>> _little bit_ of added overhead, but it shouldn't be all that noticeable.
>> 
>> Furthermore, they're "append only". In the past, when I've indexed the
>> Wiki example, my indexing actually got faster.
>> 
>> So on the surface this sounds very strange to me. Are you seeing
>> anything at all in the
>> Solr logs that's suspicious?
>> 
>> Best,
>> Erick
>> 
>> On Thu, May 21, 2015 at 12:22 PM, Sergey Shvets 
>> wrote:
>>> Hi Angel
>>> 
>>> We also noticed that kind of performance degradation in our workloads.
>>> 
>>> Which is logical, as the index grows and the time needed to put something
>>> into it is log(n)
>>> 
>>> 
>>> 
>>> On Thursday, 21 May 2015, Angel Todorov wrote:
>>> 
>>>> hi Shawn,
>>>> 
>>>> Thanks a bunch for your feedback. I've played with the heap size, but I
>>>> don't see any improvement. Even if i index, say , a million docs, and
>> the
>>>> throughput is about 300 docs per sec, and then I shut down solr
>> completely
>>>> - after I start indexing again, the throughput is dropping below 300.
>>>> 
>>>> I should probably experiment with sharding those documents to multiple
>> SOLR
>>>> cores - that should help, I guess. I am talking about something like
>> this:
>>>> 
>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>>>> 
>>>> Thanks,
>>>> Angel
>>>> 
>>>> 
>>>> On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey wrote:
>>>> 
>>>>> On 5/21/2015 2:07 AM, Angel Todorov wrote:
>>>>>> I'm crawling a file system folder and indexing 10 million docs, and
>> I
>>>> am
>>>>>> adding them in batches of 5000, committing every 50 000 docs. The
>>>>> problem I
>>>>>> am facing is that after each commit, the documents per sec that are
>>>>> indexed
>>>>>> gets less and less.
>>>>>> 
>>>>>> If I do not commit at all, I can index those docs very quickly, and
>>>> then
>>>>> I
>>>>>> commit once at the end, but once i start indexing docs _after_ that
>>>> (for
>>>>>> example new files get added to the folder), indexing is also slowing
>>>>> down a
>>>>>> lot.
>>>>>> 
>>>>>> Is it normal that the SOLR indexing speed depends on the number of
>>>>>> documents that are _already_ indexed? I think it shouldn't matter
>> if i
>>>>>> start from scratch or I index a document in a core that already has
>> a
>>>>>> couple of million docs. Looks like SOLR is either doing something
>> in a
>>>>>> linear fashion, or there is some magic config parameter that I am
>> not
>>>>> aware
>>>>>> of.
>>>>>> 
>>>>>> I've read all perf docs, and I've tried changing mergeFactor,
>>>>>> autowarmCounts, and the buffer sizes - to no avail.
>>>>>> 
>>>>>> I am using SOLR 5.1
>>>>> 
>>>>> Have you changed the heap size?  If you use the bin/solr script to
>> start
>>>>> it and don't change the heap size with the -m option or another
>> method,
>>>>> Solr 5.1 runs with a default size of 512MB, which is *very* small.
>>>>> 
>>>>> I bet you are running into problems with frequent and then ultimately
>>>>> constant garbage collection, as Java attempts to free up enough memory
>>>>> to allow the program to continue running.  If that is what is
>> happening,
>>>>> then eventually you will see an OutOfMemoryError exception.  The
>>>>> solution is to increase the heap size.  I would probably start with at
>>>>> least 4G for 10 million docs.
>>>>> 
>>>>> Thanks,
>>>>> Shawn
>>>>> 
>>>>> 
>>>> 
>> 
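
For anyone hitting the same slowdown, here is a minimal SolrJ sketch of the
batching pattern discussed in this thread - add documents in batches, hold the
explicit commit until the very end (or rely on autoCommit in solrconfig.xml),
and give Solr a reasonably sized heap (e.g. bin/solr start -m 4g, as Shawn
suggests). The core URL, batch size and field names below are illustrative
assumptions, not taken from the original posts:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // Core URL and batch size are illustrative assumptions.
            SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore");
            List<SolrInputDocument> batch = new ArrayList<>();

            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title", "document " + i);
                batch.add(doc);

                if (batch.size() == 5000) {
                    solr.add(batch);   // send the batch, but do NOT commit here
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();             // one explicit commit at the very end
            solr.close();
        }
    }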



Re: New article on ZK "Poison Packet"

2015-05-10 Thread Siegfried Goeschl
Cool stuff - thanks for sharing

Siegfried Goeschl

> On 09 May 2015, at 08:43, steve  wrote:
> 
> While very technical and unusual, a very interesting view of the world of 
> Linux and ZooKeeper Clusters...
> http://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/
>  



Re: Indexing PDF and MS Office files

2015-04-16 Thread Siegfried Goeschl

Hi Vijay,

I know this road too well :-)

For PDF you can fall back to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)

If you start command line tools from your JVM please have a look at 
commons-exec :-)
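
As a rough illustration of the commons-exec route, a sketch along these lines
can shell out to XPDF's pdftotext as a fallback extractor. The binary name,
arguments and timeout are assumptions you would adapt to your environment:

    import org.apache.commons.exec.CommandLine;
    import org.apache.commons.exec.DefaultExecutor;
    import org.apache.commons.exec.ExecuteWatchdog;
    import org.apache.commons.exec.PumpStreamHandler;

    import java.io.ByteArrayOutputStream;

    public class PdfToTextFallback {

        // Extracts plain text from a PDF by calling the external pdftotext binary.
        public static String extract(String pdfPath) throws Exception {
            CommandLine cmd = new CommandLine("pdftotext");
            cmd.addArgument("-enc");
            cmd.addArgument("UTF-8");
            cmd.addArgument(pdfPath);
            cmd.addArgument("-");                       // "-" sends the text to stdout

            ByteArrayOutputStream stdout = new ByteArrayOutputStream();
            DefaultExecutor executor = new DefaultExecutor();
            executor.setStreamHandler(new PumpStreamHandler(stdout));
            executor.setWatchdog(new ExecuteWatchdog(60000));  // kill the process after 60s
            executor.execute(cmd);                      // throws ExecuteException on failure

            return stdout.toString("UTF-8");
        }
    }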


Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never 
ever successfully parse all real-world PDFs, and cater for that fact in your 
requirements :-)


On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser as well as
SolrCell's ExtractRequestHandler. The majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing with either a PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks & Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson  wrote:


There's quite a discussion here:
https://issues.apache.org/jira/browse/SOLR-7137

But I personally am not a huge fan of pushing all the work onto Solr; in a
production environment the Solr server is then responsible for indexing,
parsing the docs through Tika, perhaps searching etc. This doesn't scale all
that well.

So an alternative is to use SolrJ with Tika, which is totally independent
of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick
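
A minimal sketch of the client-side approach described above - parse with Tika
in your own JVM and send plain fields to Solr via SolrJ. It is modelled loosely
on the blog post, but simplified; the core URL, field names and the fixed id
are assumptions:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class TikaSolrJIndexer {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
            Metadata metadata = new Metadata();

            try (InputStream in = new FileInputStream("/path/to/document.pdf")) {
                parser.parse(in, handler, metadata, new ParseContext());
            }

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", metadata.get("title"));
            doc.addField("content", handler.toString());

            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            solr.add(doc);
            solr.commit();
            solr.shutdown();
        }
    }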

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
 wrote:

Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's AutoParser
and PDF functionality is working fine. However, the error with some MS
Office Word documents still persists.

The error message is "java.lang.IllegalArgumentException: This paragraph is
not the first one in the table", which eventually results in "Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser".

Upon some reading, it looks like it's a bug with Tika 1.5 that seems to have
been fixed with Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ).

I am new to Solr / Tika and hence wondering whether I can change the Tika
library alone to v1.6 without impacting any of the libraries within Solr
4.10.2? Please let me know your response and how to get around this issue.

Many thanks in advance.

Thanks & Regards
Vijay


On 15 April 2015 at 05:14, Shyam R  wrote:


Vijay,

You could try different Excel files with different formats to rule out
whether the issue is with the Tika version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes 
wrote:


Perhaps the PDF is protected and the content can not be extracted?

I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may
not support some/all Office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:


Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
<vijaya.bhoomire...@whishworks.com> wrote:

Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server configuration
that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.* in the
Solr Query console, metadata information is displayed properly. However,
the PDF content field is empty. This is happening for all PDF files I have
tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
be the PDF file, content is not being displayed.

MS Office files - For some office files, everything works perfect and the
extracted content is visible in the query console. However, for others, I
see the below error message during the indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*

I am using SolrJ to index the documents and below is the code snippet
related to indexing. Pleas

Re: Measuring QPS

2015-04-06 Thread Siegfried Goeschl
Hi Walter,

sort of shameless plug - I ran into similar issues and wrote a JMeter SLA 
Reporting Backend - https://github.com/sgoeschl/jmeter-sla-report

* It reads the CSV/XML JMeter report file and sorts the response times in 
logarithmic buckets 
* the XML processor uses a Stax parser to handle huge JTL files (exceeding 1 GB)
* it also caters for merging JTL files when running multiple JMeter instances

Cheers,

Siegfried Goeschl



> On 06 Apr 2015, at 22:57, Walter Underwood  wrote:
> 
> The load testing is the easiest part.
> 
> We use JMeter to replay the prod logs. We start about a hundred threads and 
> use ConstantThroughputTimer to control the traffic level. JMeter tends to 
> fall over with too much data graphing, so we run it headless. Then we post 
> process with JMeter Plugins to get percentiles.
> 
> The complicated part of the servlet filter was getting it configured in 
> Tomcat. The code itself is not too bad.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> On Apr 6, 2015, at 1:49 PM, Siegfried Goeschl  wrote:
> 
>> The good-sounding thing - you can do that easily with JMeter running the GUI 
>> or the command-line
>> 
>> Cheers,
>> 
>> Siegfried Goeschl
>> 
>>> On 06 Apr 2015, at 21:35, Davis, Daniel (NIH/NLM) [C] 
>>>  wrote:
>>> 
>>> This sounds really good:
>>> 
>>> "For load testing, we replay production logs to test that we meet the SLA 
>>> at a given traffic level."
>>> 
>>> The rest sounds complicated.   Ah well, that's the job.
>>> 
>>> -Original Message-
>>> From: Walter Underwood [mailto:wun...@wunderwood.org] 
>>> Sent: Monday, April 06, 2015 2:48 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Measuring QPS
>>> 
>>> We built a servlet request filter that is configured in front of the Solr 
>>> servlets. It reports response times to metricsd, using the Codahale library.
>>> 
>>> That gives us counts, rates, and response time metrics. We mostly look at 
>>> percentiles, because averages are thrown off by outliers. Average is just 
>>> the wrong metric for a one-sided distribution like response times.
>>> 
>>> We use Graphite to display the 95th percentile response time for each 
>>> request handler. We use Tattle for alerting on those metrics.
>>> 
>>> We also use New Relic for a different look at the performance. It is good 
>>> at tracking from the front end through to Solr.
>>> 
>>> For load testing, we replay production logs to test that we meet the SLA at 
>>> a given traffic level.
>>> 
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> On Apr 6, 2015, at 11:31 AM, Davis, Daniel (NIH/NLM) [C] 
>>>  wrote:
>>> 
>>>> OK,
>>>> 
>>>> I have a lot of chutzpah posting that here ;)The other guys answering 
>>>> the questions can probably explain it better.
>>>> I love showing off, however, so please forgive me.
>>>> 
>>>> -Original Message-
>>>> From: Davis, Daniel (NIH/NLM) [C]
>>>> Sent: Monday, April 06, 2015 2:25 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: RE: Measuring QPS
>>>> 
>>>> Its very common to do autocomplete based on popular queries/titles over 
>>>> some sliding time window.   Some enterprise search systems even apply age 
>>>> weighting so that they don't need to re-index but continuously add to the 
>>>> index.   This way, they can do autocomplete based on what's popular these 
>>>> days.
>>>> 
>>>> We use relevance/field boosts/phrase matching etc. to get the best guess 
>>>> about what results they want to see.   This is similar - we use relevance, 
>>>> field boosting to guess what users want to search for.   Zipf's law 
>>>> applies to searches as well as results.
>>>> 
>>>> -Original Message-
>>>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>>>> Sent: Monday, April 06, 2015 2:17 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Measuring QPS
>>>> 
>>>> Hi Daniel,
>>>> 
>>>> interesting - I never thought of autocompletion but for keeping track 
>>>> of user behaviour :-)
>
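
For reference, a minimal sketch of the kind of servlet request filter described
above, using the Codahale/Dropwizard Metrics Timer. The class and metric names
are made up, and wiring up a reporter (Graphite, metricsd, ...) plus the web.xml
or Tomcat configuration is left out:

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    import javax.servlet.*;
    import java.io.IOException;

    public class RequestTimingFilter implements Filter {

        private final MetricRegistry registry = new MetricRegistry();
        private Timer requests;

        @Override
        public void init(FilterConfig filterConfig) {
            requests = registry.timer("solr.requests");
        }

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            Timer.Context context = requests.time();
            try {
                chain.doFilter(request, response);  // let the Solr servlet handle the request
            } finally {
                context.stop();  // records elapsed time; the Timer keeps counts, rates and percentiles
            }
        }

        @Override
        public void destroy() {
        }
    }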

Re: Measuring QPS

2015-04-06 Thread Siegfried Goeschl
The good-sounding thing - you can do that easily with JMeter running the GUI or 
the command-line

Cheers,

Siegfried Goeschl

> On 06 Apr 2015, at 21:35, Davis, Daniel (NIH/NLM) [C]  
> wrote:
> 
> This sounds really good:
> 
> "For load testing, we replay production logs to test that we meet the SLA at 
> a given traffic level."
> 
> The rest sounds complicated.   Ah well, that's the job.
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Monday, April 06, 2015 2:48 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Measuring QPS
> 
> We built a servlet request filter that is configured in front of the Solr 
> servlets. It reports response times to metricsd, using the Codahale library.
> 
> That gives us counts, rates, and response time metrics. We mostly look at 
> percentiles, because averages are thrown off by outliers. Average is just the 
> wrong metric for a one-sided distribution like response times.
> 
> We use Graphite to display the 95th percentile response time for each request 
> handler. We use Tattle for alerting on those metrics.
> 
> We also use New Relic for a different look at the performance. It is good at 
> tracking from the front end through to Solr.
> 
> For load testing, we replay production logs to test that we meet the SLA at a 
> given traffic level.
> 
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> On Apr 6, 2015, at 11:31 AM, Davis, Daniel (NIH/NLM) [C] 
>  wrote:
> 
>> OK,
>> 
>> I have a lot of chutzpah posting that here ;)The other guys answering 
>> the questions can probably explain it better.
>> I love showing off, however, so please forgive me.
>> 
>> -Original Message-
>> From: Davis, Daniel (NIH/NLM) [C]
>> Sent: Monday, April 06, 2015 2:25 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Measuring QPS
>> 
>> Its very common to do autocomplete based on popular queries/titles over some 
>> sliding time window.   Some enterprise search systems even apply age 
>> weighting so that they don't need to re-index but continuously add to the 
>> index.   This way, they can do autocomplete based on what's popular these 
>> days.
>> 
>> We use relevance/field boosts/phrase matching etc. to get the best guess 
>> about what results they want to see.   This is similar - we use relevance, 
>> field boosting to guess what users want to search for.   Zipf's law applies 
>> to searches as well as results.
>> 
>> -Original Message-
>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>> Sent: Monday, April 06, 2015 2:17 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Measuring QPS
>> 
>> Hi Daniel,
>> 
>> interesting - I never thought of autocompletion but for keeping track 
>> of user behaviour :-)
>> 
>> * the numbers are helpful for the online advertisement team to sell 
>> campaigns
>> * it is used for sanity checks - sensible queries returning no results 
>> or returning too many results
>> 
>> Cheers,
>> 
>> Siegfried Goeschl
>> 
>>> On 06 Apr 2015, at 20:04, Davis, Daniel (NIH/NLM) [C] 
>>>  wrote:
>>> 
>>> Siegfried,
>>> 
>>> It is early days as yet.   I don't think we need a code drop.   AFAIK, none 
>>> of our current Solr applications autocomplete the search box based on 
>>> popular query/title keywords.   We have other applications that do that, 
>>> but they don't use Solr.
>>> 
>>> Thanks again,
>>> 
>>> Dan
>>> 
>>> -Original Message-
>>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>>> Sent: Monday, April 06, 2015 1:42 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Measuring QPS
>>> 
>>> Hi Dan,
>>> 
>>> at willhaben.at (customer of mine) two SOLR components were written 
>>> for SOLR 3 and ported to SOLR 4
>>> 
>>> 1) SlowQueryLog which dumps long-running search requests into a log 
>>> file
>>> 
>>> 2) Most Frequent Search Terms allowing to query & filter the most 
>>> frequent user search terms over the browser
>>> 
>>> Some notes along the line
>>> 
>>> 
>>> * For both components I have the "GO" to open source them but I never 
>>> had enough time to do that (shame on me) - see
>>> https://issues.apache.org/jira/browse/SOLR-4056
>>> 
>>> * The Most

Re: Measuring QPS

2015-04-06 Thread Siegfried Goeschl
Appreciated :-)

Siegfried Goeschl

> On 06 Apr 2015, at 20:31, Davis, Daniel (NIH/NLM) [C]  
> wrote:
> 
> OK,
> 
> I have a lot of chutzpah posting that here ;)The other guys answering the 
> questions can probably explain it better.
> I love showing off, however, so please forgive me.
> 
> -Original Message-
> From: Davis, Daniel (NIH/NLM) [C] 
> Sent: Monday, April 06, 2015 2:25 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Measuring QPS
> 
> Its very common to do autocomplete based on popular queries/titles over some 
> sliding time window.   Some enterprise search systems even apply age 
> weighting so that they don't need to re-index but continuously add to the 
> index.   This way, they can do autocomplete based on what's popular these 
> days.
> 
> We use relevance/field boosts/phrase matching etc. to get the best guess 
> about what results they want to see.   This is similar - we use relevance, 
> field boosting to guess what users want to search for.   Zipf's law applies 
> to searches as well as results.
> 
> -Original Message-
> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
> Sent: Monday, April 06, 2015 2:17 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Measuring QPS
> 
> Hi Daniel,
> 
> interesting - I never thought of autocompletion but for keeping track of user 
> behaviour :-)
> 
> * the numbers are helpful for the online advertisement team to sell campaigns
> * it is used for sanity checks - sensible queries returning no results or 
> returning too many results
> 
> Cheers,
> 
> Siegfried Goeschl
> 
>> On 06 Apr 2015, at 20:04, Davis, Daniel (NIH/NLM) [C]  
>> wrote:
>> 
>> Siegfried,
>> 
>> It is early days as yet.   I don't think we need a code drop.   AFAIK, none 
>> of our current Solr applications autocomplete the search box based on 
>> popular query/title keywords.   We have other applications that do that, but 
>> they don't use Solr.
>> 
>> Thanks again,
>> 
>> Dan
>> 
>> -Original Message-
>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>> Sent: Monday, April 06, 2015 1:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Measuring QPS
>> 
>> Hi Dan,
>> 
>> at willhaben.at (customer of mine) two SOLR components were written 
>> for SOLR 3 and ported to SOLR 4
>> 
>> 1) SlowQueryLog which dumps long-running search requests into a log 
>> file
>> 
>> 2) Most Frequent Search Terms allowing to query & filter the most 
>> frequent user search terms over the browser
>> 
>> Some notes along the line
>> 
>> 
>> * For both components I have the “GO" to open source them but I never 
>> had enough time to do that (shame on me) - see
>> https://issues.apache.org/jira/browse/SOLR-4056
>> 
>> * The Most Frequent Search Term component actually mimics a SOLR 
>> server you feed the user search terms so this might be a better 
>> solution in the long run. But this requires to have a separate SOLR 
>> core & ingest  plus GUI (check out SILK or ELK) - in other words more 
>> moving parts in production :-)
>> 
>> * If there is sufficient interest I can make a code drop on GitHub
>> 
>> Cheers,
>> 
>> Siegfried Goeschl
>> 
>> 
>> 
>>> On 06 Apr 2015, at 16:25, Davis, Daniel (NIH/NLM) [C] 
>>>  wrote:
>>> 
>>> Siegfried,
>>> 
>>> This is a wonderful find.   The second presentation is a nice write-up of a 
>>> large number of free tools.   The first presentation prompts a question - 
>>> did you add custom request handlers/code to automate determination of best 
>>> user search terms?   Did any of your custom work end-up in Solr?
>>> 
>>> Thank you so much,
>>> 
>>> Dan
>>> 
>>> P.S. - your first presentation takes me back to seeing "Angrif der 
>>> Klonkrieger" in Berlin after a conference - Hayden Christensen was less 
>>> annoying in German, because my wife and I don't speak German ;)   I haven't 
>>> thought of that in a while.
>>> 
>>> -Original Message-
>>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>>> Sent: Saturday, April 04, 2015 4:54 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Measuring QPS
>>> 
>>> Hi Dan,
>>> 
>>> I’m using JavaMelody for my SOLR production servers - gives you the 
>>> relevant HTTP stats (what’s happening now & historical data) 

Re: Measuring QPS

2015-04-06 Thread Siegfried Goeschl
Hi Daniel,

interesting - I never thought of autocompletion as a way of keeping track of user 
behaviour :-)

* the numbers are helpful for the online advertisement team to sell campaigns
* it is used for sanity checks - sensible queries returning no results or 
returning too many results

Cheers,

Siegfried Goeschl

> On 06 Apr 2015, at 20:04, Davis, Daniel (NIH/NLM) [C]  
> wrote:
> 
> Siegfried,
> 
> It is early days as yet.   I don't think we need a code drop.   AFAIK, none 
> of our current Solr applications autocomplete the search box based on popular 
> query/title keywords.   We have other applications that do that, but they 
> don't use Solr.
> 
> Thanks again,
> 
> Dan
> 
> -Original Message-
> From: Siegfried Goeschl [mailto:sgoes...@gmx.at] 
> Sent: Monday, April 06, 2015 1:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Measuring QPS
> 
> Hi Dan,
> 
> at willhaben.at (customer of mine) two SOLR components were written for SOLR 
> 3 and ported to SOLR 4
> 
> 1) SlowQueryLog which dumps long-running search requests into a log file
> 
> 2) Most Frequent Search Terms allowing to query & filter the most frequent 
> user search terms over the browser
> 
> Some notes along the line
> 
> 
> * For both components I have the “GO" to open source them but I never had 
> enough time to do that (shame on me) - see 
> https://issues.apache.org/jira/browse/SOLR-4056
> 
> * The Most Frequent Search Term component actually mimics a SOLR server you 
> feed the user search terms so this might be a better solution in the long 
> run. But this requires to have a separate SOLR core & ingest  plus GUI (check 
> out SILK or ELK) - in other words more moving parts in production :-)
> 
> * If there is sufficient interest I can make a code drop on GitHub 
> 
> Cheers,
> 
> Siegfried Goeschl
> 
> 
> 
>> On 06 Apr 2015, at 16:25, Davis, Daniel (NIH/NLM) [C]  
>> wrote:
>> 
>> Siegfried,
>> 
>> This is a wonderful find.   The second presentation is a nice write-up of a 
>> large number of free tools.   The first presentation prompts a question - 
>> did you add custom request handlers/code to automate determination of best 
>> user search terms?   Did any of your custom work end-up in Solr?
>> 
>> Thank you so much,
>> 
>> Dan
>> 
>> P.S. - your first presentation takes me back to seeing "Angrif der 
>> Klonkrieger" in Berlin after a conference - Hayden Christensen was less 
>> annoying in German, because my wife and I don't speak German ;)   I haven't 
>> thought of that in a while.
>> 
>> -Original Message-
>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>> Sent: Saturday, April 04, 2015 4:54 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Measuring QPS
>> 
>> Hi Dan,
>> 
>> I’m using JavaMelody for my SOLR production servers - gives you the 
>> relevant HTTP stats (what’s happening now & historical data) plus JVM 
>> monitoring as additional benefit. The servers are deployed on Tomcat 
>> so I’m of little help regarding Jetty - having said that
>> 
>> * you need two Jars (javamelody & robin)
>> * tinker with web.xml
>> 
>> Here are two of my presentations mentioning JavaMelody (plus some 
>> other stuff)
>> 
>> http://people.apache.org/~sgoeschl/presentations/solr-from-development
>> -to-production-20121210.pdf 
>> <http://people.apache.org/~sgoeschl/presentations/solr-from-developmen
>> t-to-production-20121210.pdf> 
>> http://people.apache.org/~sgoeschl/presentations/jsug-2015/jee-perform
>> ance-monitoring.pdf 
>> <http://people.apache.org/~sgoeschl/presentations/jsug-2015/jee-perfor
>> mance-monitoring.pdf>
>> 
>> Cheers,
>> 
>> Siegfried Goeschl
>> 
>>> On 03 Apr 2015, at 17:53, Shawn Heisey  wrote:
>>> 
>>> On 4/3/2015 9:37 AM, Davis, Daniel (NIH/NLM) [C] wrote:
>>>> I wanted to gather QPS for our production Solr instances, but I was 
>>>> surprised that the Admin UI did not contain this information.   We are 
>>>> running a mix of versions, but mostly 4.10 at this point.   We are not 
>>>> using SolrCloud at present; that's part of why I'm checking - I want to 
>>>> validate the size of our existing setup and what sort of SolrCloud setup 
>>>> would be needed to centralize several of them.
>>>> 
>>>> What is the best way to gather QPS information?
>>>> 
>>>> What is the best way to add information like this to the Admin UI, if I 
>>>> decide to take that step?
>>> 
>>> As of Solr 4.1 (three years ago), request rate information is 
>>> available in the admin UI and via JMX.  In the admin UI, choose a 
>>> core from the dropdown, click on Plugins/Stats, then QUERYHANDLER, 
>>> and open the handler you wish to examine.  You have 
>>> avgRequestsPerSecond, which is calculated for the entire runtime of 
>>> the SolrCore, as well as 5minRateReqsPerSecond and 
>>> 15minRateReqsPerSecond, which are far more useful pieces of information.
>>> 
>>> https://issues.apache.org/jira/browse/SOLR-1972
>>> 
>>> Thanks,
>>> Shawn
>>> 
>> 
> 



Re: Measuring QPS

2015-04-06 Thread Siegfried Goeschl
Hi Dan,

at willhaben.at (customer of mine) two SOLR components were written for SOLR 3 
and ported to SOLR 4

1) SlowQueryLog which dumps long-running search requests into a log file

2) Most Frequent Search Terms, allowing you to query & filter the most frequent 
user search terms from the browser

Some notes along the line


* For both components I have the "GO" to open source them but I never had 
enough time to do that (shame on me) - see 
https://issues.apache.org/jira/browse/SOLR-4056

* The Most Frequent Search Term component actually mimics a SOLR server that 
you feed the user search terms into, so this might be a better solution in the 
long run. But this requires having a separate SOLR core & ingest process plus a 
GUI (check out SILK or ELK) - in other words more moving parts in production :-)

* If there is sufficient interest I can make a code drop on GitHub 

Cheers,

Siegfried Goeschl



> On 06 Apr 2015, at 16:25, Davis, Daniel (NIH/NLM) [C]  
> wrote:
> 
> Siegfried,
> 
> This is a wonderful find.   The second presentation is a nice write-up of a 
> large number of free tools.   The first presentation prompts a question - did 
> you add custom request handlers/code to automate determination of best user 
> search terms?   Did any of your custom work end-up in Solr?
> 
> Thank you so much,
> 
> Dan
> 
> P.S. - your first presentation takes me back to seeing "Angrif der 
> Klonkrieger" in Berlin after a conference - Hayden Christensen was less 
> annoying in German, because my wife and I don't speak German ;)   I haven't 
> thought of that in a while.
> 
> -Original Message-
> From: Siegfried Goeschl [mailto:sgoes...@gmx.at] 
> Sent: Saturday, April 04, 2015 4:54 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Measuring QPS
> 
> Hi Dan,
> 
> I’m using JavaMelody for my SOLR production servers - gives you the relevant 
> HTTP stats (what’s happening now & historical data) plus JVM monitoring as 
> additional benefit. The servers are deployed on Tomcat so I’m of little help 
> regarding Jetty - having said that
> 
> * you need two Jars (javamelody & robin)
> * tinker with web.xml
> 
> Here are two of my presentations mentioning JavaMelody (plus some other stuff)
> 
> http://people.apache.org/~sgoeschl/presentations/solr-from-development-to-production-20121210.pdf
>  
> <http://people.apache.org/~sgoeschl/presentations/solr-from-development-to-production-20121210.pdf>
> http://people.apache.org/~sgoeschl/presentations/jsug-2015/jee-performance-monitoring.pdf
>  
> <http://people.apache.org/~sgoeschl/presentations/jsug-2015/jee-performance-monitoring.pdf>
>  
> 
> Cheers,
> 
> Siegfried Goeschl
> 
>> On 03 Apr 2015, at 17:53, Shawn Heisey  wrote:
>> 
>> On 4/3/2015 9:37 AM, Davis, Daniel (NIH/NLM) [C] wrote:
>>> I wanted to gather QPS for our production Solr instances, but I was 
>>> surprised that the Admin UI did not contain this information.   We are 
>>> running a mix of versions, but mostly 4.10 at this point.   We are not 
>>> using SolrCloud at present; that's part of why I'm checking - I want to 
>>> validate the size of our existing setup and what sort of SolrCloud setup 
>>> would be needed to centralize several of them.
>>> 
>>> What is the best way to gather QPS information?
>>> 
>>> What is the best way to add information like this to the Admin UI, if I 
>>> decide to take that step?
>> 
>> As of Solr 4.1 (three years ago), request rate information is 
>> available in the admin UI and via JMX.  In the admin UI, choose a core 
>> from the dropdown, click on Plugins/Stats, then QUERYHANDLER, and open 
>> the handler you wish to examine.  You have avgRequestsPerSecond, which 
>> is calculated for the entire runtime of the SolrCore, as well as 
>> 5minRateReqsPerSecond and 15minRateReqsPerSecond, which are far more 
>> useful pieces of information.
>> 
>> https://issues.apache.org/jira/browse/SOLR-1972
>> 
>> Thanks,
>> Shawn
>> 
> 



Re: Measuring QPS

2015-04-04 Thread Siegfried Goeschl
Hi Dan,

I’m using JavaMelody for my SOLR production servers - gives you the relevant 
HTTP stats (what’s happening now & historical data) plus JVM monitoring as 
additional benefit. The servers are deployed on Tomcat so I’m of little help 
regarding Jetty - having said that

* you need two Jars (javamelody & robin)
* tinker with web.xml
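
The web.xml tinkering boils down to registering the JavaMelody monitoring
filter; a minimal fragment (filter name and URL pattern are just examples,
check the JavaMelody documentation for your version) looks roughly like this:

    <filter>
      <filter-name>javamelody</filter-name>
      <filter-class>net.bull.javamelody.MonitoringFilter</filter-class>
    </filter>
    <filter-mapping>
      <filter-name>javamelody</filter-name>
      <url-pattern>/*</url-pattern>
    </filter-mapping>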

Here are two of my presentations mentioning JavaMelody (plus some other stuff)

http://people.apache.org/~sgoeschl/presentations/solr-from-development-to-production-20121210.pdf
http://people.apache.org/~sgoeschl/presentations/jsug-2015/jee-performance-monitoring.pdf
 

Cheers,

Siegfried Goeschl

> On 03 Apr 2015, at 17:53, Shawn Heisey  wrote:
> 
> On 4/3/2015 9:37 AM, Davis, Daniel (NIH/NLM) [C] wrote:
>> I wanted to gather QPS for our production Solr instances, but I was 
>> surprised that the Admin UI did not contain this information.   We are 
>> running a mix of versions, but mostly 4.10 at this point.   We are not using 
>> SolrCloud at present; that's part of why I'm checking - I want to validate 
>> the size of our existing setup and what sort of SolrCloud setup would be 
>> needed to centralize several of them.
>> 
>> What is the best way to gather QPS information?
>> 
>> What is the best way to add information like this to the Admin UI, if I 
>> decide to take that step?
> 
> As of Solr 4.1 (three years ago), request rate information is available
> in the admin UI and via JMX.  In the admin UI, choose a core from the
> dropdown, click on Plugins/Stats, then QUERYHANDLER, and open the
> handler you wish to examine.  You have avgRequestsPerSecond, which is
> calculated for the entire runtime of the SolrCore, as well as
> 5minRateReqsPerSecond and 15minRateReqsPerSecond, which are far more
> useful pieces of information.
> 
> https://issues.apache.org/jira/browse/SOLR-1972
> 
> Thanks,
> Shawn
> 
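
A quick way to pull the numbers Shawn mentions without clicking through the
admin UI is the mbeans handler. A rough sketch, assuming a local Solr 4.x
instance and a core named collection1 (adjust host, port and core name):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class QpsStatsDump {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/collection1/admin/mbeans"
                    + "?stats=true&cat=QUERYHANDLER&wt=json");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // The JSON contains avgRequestsPerSecond, 5minRateReqsPerSecond, ...
                    System.out.println(line);
                }
            }
        }
    }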



Re: Trending functionality in Solr

2015-02-09 Thread Siegfried Goeschl

Hi folks,

I implemented something similar but never got around to contributing it - 
see https://issues.apache.org/jira/browse/SOLR-4056


The code was initially for SOLR3 but was recently ported to SOLR4

* capturing the most frequent search terms per core
* supports ad-hoc queries
* CSV export

If you are interested we could team up and make a proper SOLR 
contribution :-)


Cheers,

Siegfried Goeschl


On 08.02.15 05:26, S.L wrote:

Folks,

Is there a way to implement trending functionality using Solr, i.e. to get,
via a query, the most searched terms in the past hour or so? If the most
searched terms are not possible, is it possible to at least get the results
for the last 100 terms?

Thanks





Re: OutOfMemoryError for PDF document upload into Solr

2015-01-16 Thread Siegfried Goeschl

Hi Dan,

neat idea - made a mental note :-)

That brings us back to the point that in complex setups you should not 
do the document pre-processing directly in SOLR but have an import 
process which can safely crash when processing a 4GB PDF file


Cheers,

Siegfried Goeschl

On 16.01.15 05:02, Dan Davis wrote:

Why re-write all the document conversion in Java ;)  Tika is very slow.   5
GB PDF is very big.

If you have a lot of PDF like that try pdftotext in HTML and UTF-8 output
mode.   The HTML mode captures some meta-data that would otherwise be lost.


If you need to go faster still, you can  also write some stuff linked
directly against poppler library.

Before you jump down my throat about Tika being slow - I wrote a PDF
indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
setjmp/longjmp.   But fast...



On Thu, Jan 15, 2015 at 1:54 PM,  wrote:


Siegfried and Michael Thank you for your replies and help.

-Original Message-
From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
Sent: Thursday, January 15, 2015 3:45 AM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemoryError for PDF document upload into Solr

Hi Ganesh,

you can increase the heap size but parsing a 4 GB PDF document will very
likely consume A LOT OF memory - I think you need to check if that large
PDF can be parsed at all :-)

Cheers,

Siegfried Goeschl

On 14.01.15 18:04, Michael Della Bitta wrote:

Yep, you'll have to increase the heap size for your Tomcat container.

http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/11200277628550959
3336/posts>
w: appinions.com <http://www.appinions.com/>

On Wed, Jan 14, 2015 at 12:00 PM,  wrote:


Hello,

Can someone pass on the hints to get around following error? Is there
any Heap Size parameter I can set in Tomcat or in Solr webApp that
gets deployed in Solr?

I am running Solr webapp inside Tomcat on my local machine which has
RAM of 12 GB. I have PDF document which is 4 GB max in size that
needs to be loaded into Solr




Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
  at java.util.AbstractCollection.toArray(Unknown Source)
  at java.util.ArrayList.(Unknown Source)
  at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
  at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
  at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
  at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
  at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
  at org.apache.catalina.connector.CoyoteAdapter.service(Coyo

Re: OutOfMemoryError for PDF document upload into Solr

2015-01-15 Thread Siegfried Goeschl

Hi Ganesh,

you can increase the heap size but parsing a 4 GB PDF document will very 
likely consume A LOT OF memory - I think you need to check if that large 
PDF can be parsed at all :-)


Cheers,

Siegfried Goeschl

On 14.01.15 18:04, Michael Della Bitta wrote:

Yep, you'll have to increase the heap size for your Tomcat container.

http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Wed, Jan 14, 2015 at 12:00 PM,  wrote:


Hello,

Can someone pass on the hints to get around following error? Is there any
Heap Size parameter I can set in Tomcat or in Solr webApp that gets
deployed in Solr?

I am running Solr webapp inside Tomcat on my local machine which has RAM
of 12 GB. I have PDF document which is 4 GB max in size that needs to be
loaded into Solr




Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
 at java.util.AbstractCollection.toArray(Unknown Source)
 at java.util.ArrayList.(Unknown Source)
 at
org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
 at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
 at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
 at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
 at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
 at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
 at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
 at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
 at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
 at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
 at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
 at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
 at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
 at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
 at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
 at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
 at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
 at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
 at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
 at
org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
 at
org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)


Thanks
Ganesh








Re: Slow queries

2014-12-08 Thread Siegfried Goeschl
Hi,

using Jetty is the recommended approach while using Tomcat is not recommended 
(unless you are a Tomcat shop). 

But any discussion comes back to the original question - why is it slow now? 
Are you I/O-bound, are you CPU-bound, how many documents are committed/deleted 
over time, do you have expensive SOLR queries, what is your server code 
doing - many questions and even more answers to that - in other words nobody 
can help you when the basic work is not done. And when you know your 
application performance-wise you probably also know the solution :-)

Cheers,

Siegfried Goeschl


> On 08 Dec 2014, at 11:00, melb  wrote:
> 
> Thanks for the answer.
> A dedicated box would be a great solution but I will wait for that solution,
> as I have restricted resources.
> Can the Optimize action improve performance?
> Can using the default servlet engine Jetty be harmful for the performance?
> Should I use an independent Tomcat engine?
> 
> rgds,
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Slow-queries-tp4172032p4173092.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Slow queries

2014-12-02 Thread Siegfried Goeschl
It might be a good idea to

* move SOLR to a dedicated box :-)
* load your SOLR server with 20.000.000 documents (the estimated number of 
documents after three years) and do performance testing & tuning

Afterwards you have some hard facts about hardware sizing and expected 
performance for the next three years :-)

Cheers,

Siegfried Goeschl

> On 02 Dec 2014, at 10:02, melb  wrote:
> 
> Yes, performance degraded over time. I can raise the memory but I can't
> do it every time and the volume will keep growing.
> Is it better to put solr on a dedicated machine?
> Is there anything else that can be done to the solr instance, for example
> dividing the collection?
> 
> rgds,
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Slow-queries-tp4172032p4172039.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Slow queries

2014-12-02 Thread Siegfried Goeschl
If your performance was fine but degraded over time it might be 
easier to check / increase the memory to have better disk caching.


Cheers,

Siegfried Goeschl


On 02.12.14 09:27, melb wrote:

Hi,

I have a solr collection with 16 million documents, growing daily with
1 documents.
Recently it is becoming slow to answer my requests (several seconds),
especially when I use multi-word queries.
I am running solr on a machine with 32G RAM, but a heavily used one.

What are my options to optimize the collection and speed up querying it?
Is it normal with this volume of data? Is sharding a good solution?

regards,





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Slow-queries-tp4172032.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: AW: AW: slorj -> httpclient 4, but we already have httpclient 3 in use

2014-09-19 Thread Siegfried Goeschl

Lucky you :-)

Siegfried Goeschl

On 19.09.14 07:31, Clemens Wyss DEV wrote:

I'd like to mention, that substituting the httpcore.jar  with the latest (4.3) 
"sufficed"...

-Ursprüngliche Nachricht-
Von: Guido Medina [mailto:guido.med...@temetra.com]
Gesendet: Donnerstag, 18. September 2014 18:20
An: solr-user@lucene.apache.org
Betreff: Re: AW: slorj -> httpclient 4, but we already have httpclient 3 in use

SolrJ client after 4.8 I think requires HTTP client 4.3.x so why not just start 
there as base version?

Guido.

On 18/09/14 16:49, Siegfried Goeschl wrote:

AFAIK even the different minor versions are source/binary compatible
so you might need to tinker with the right "version" to get your
server running

Cheers,

Siegfried Goeschl

On 18.09.14 17:45, Guido Medina wrote:

Hi Clemens,

If you are going thru the effort of migrating from SolrJ 3 to 4 and
HTTP client 3 to 4 make sure you do it using HTTP client 4.3.x
(Latest is
4.3.5) since there are deprecations and stuff from 3.x to 4.0.x, to
4.1.x, to ..., to 4.3.x

It will be painful but it is better do it one time and not later
needed to do it again. I was on a similar situation (well my company)
and I had to suffer such migration (not my company but myself since
I'm the one that keeps all those things up to date)

Best regards,

Guido.

On 18/09/14 16:14, Clemens Wyss DEV wrote:

I guess you are right ;)

-Ursprüngliche Nachricht-
Von: Siegfried Goeschl [mailto:sgoes...@gmx.at]
Gesendet: Donnerstag, 18. September 2014 16:38
An: solr-user@lucene.apache.org
Betreff: Re: slorj -> httpclient 4, but we already have httpclient 3
in use

Hi Clemens,

I think you need to upgrade your framework

* AFAIK is httpclient 3 & 4 uses the same package names - which is
slightly unfortunate
* assuming that they are using the same package name it is
non-deterministic which httpclient library is loaded - might work on
your local box but not on the production server or might change to a
change in the project

Cheers,

Siegfried Goeschl


On 18.09.14 15:08, Clemens Wyss DEV wrote:

I doing initial steps with solrj which is based on httpclient 4.
Unfortunately parts of our framework are based on httpclient 3.
So when I instantiate an HttpSolrServer I run into:

java.lang.VerifyError: Cannot inherit from final class ...
 at
org.apache.http.impl.client.DefaultHttpClient.createHttpParams(Defa
ultHttpClient.java:157)


 at
org.apache.http.impl.client.AbstractHttpClient.getParams(AbstractHt
tpClient.java:447)


 at
org.apache.solr.client.solrj.impl.HttpClientUtil.setFollowRedirects
(Ht
tpClientUtil.java:255)
...

Can these be run side-by-side at all?











Re: AW: slorj -> httpclient 4, but we already have httpclient 3 in use

2014-09-18 Thread Siegfried Goeschl
AFAIK even the different minor versions are source/binary compatible so 
you might need to tinker with the right "version" to get your server running


Cheers,

Siegfried Goeschl

On 18.09.14 17:45, Guido Medina wrote:

Hi Clemens,

If you are going thru the effort of migrating from SolrJ 3 to 4 and HTTP
client 3 to 4 make sure you do it using HTTP client 4.3.x (Latest is
4.3.5) since there are deprecations and stuff from 3.x to 4.0.x, to
4.1.x, to ..., to 4.3.x

It will be painful but it is better do it one time and not later needed
to do it again. I was on a similar situation (well my company) and I had
to suffer such migration (not my company but myself since I'm the one
that keeps all those things up to date)

Best regards,

Guido.

On 18/09/14 16:14, Clemens Wyss DEV wrote:

I guess you are right ;)

-Ursprüngliche Nachricht-
Von: Siegfried Goeschl [mailto:sgoes...@gmx.at]
Gesendet: Donnerstag, 18. September 2014 16:38
An: solr-user@lucene.apache.org
Betreff: Re: slorj -> httpclient 4, but we already have httpclient 3
in use

Hi Clemens,

I think you need to upgrade your framework

* AFAIK is httpclient 3 & 4 uses the same package names - which is
slightly unfortunate
* assuming that they are using the same package name it is
non-deterministic which httpclient library is loaded - might work on
your local box but not on the production server or might change to a
change in the project

Cheers,

Siegfried Goeschl


On 18.09.14 15:08, Clemens Wyss DEV wrote:

I doing initial steps with solrj which is based on httpclient 4.
Unfortunately parts of our framework are based on httpclient 3.
So when I instantiate an HttpSolrServer I run into:

java.lang.VerifyError: Cannot inherit from final class ...
at
org.apache.http.impl.client.DefaultHttpClient.createHttpParams(DefaultHttpClient.java:157)

at
org.apache.http.impl.client.AbstractHttpClient.getParams(AbstractHttpClient.java:447)

at
org.apache.solr.client.solrj.impl.HttpClientUtil.setFollowRedirects(Ht
tpClientUtil.java:255)
...

Can these be run side-by-side at all?







Re: slorj -> httpclient 4, but we already have httpclient 3 in use

2014-09-18 Thread Siegfried Goeschl

Hi Clemens,

I think you need to upgrade your framework

* AFAIK httpclient 3 & 4 use the same package names - which is 
slightly unfortunate
* assuming that they are using the same package name, it is 
non-deterministic which httpclient library is loaded - it might work on 
your local box but not on the production server, or might change due to a 
change in the project


Cheers,

Siegfried Goeschl


On 18.09.14 15:08, Clemens Wyss DEV wrote:

I doing initial steps with solrj which is based on httpclient 4. Unfortunately 
parts of our framework are based on httpclient 3.
So when I instantiate an HttpSolrServer I run into:

java.lang.VerifyError: Cannot inherit from final class
...
at 
org.apache.http.impl.client.DefaultHttpClient.createHttpParams(DefaultHttpClient.java:157)
at 
org.apache.http.impl.client.AbstractHttpClient.getParams(AbstractHttpClient.java:447)
at 
org.apache.solr.client.solrj.impl.HttpClientUtil.setFollowRedirects(HttpClientUtil.java:255)
...

Can these be run side-by-side at all?





Re: Mongo DB Users

2014-09-16 Thread Siegfried Goeschl

remove please

On 16.09.14 15:42, Karolina Dobromiła Jeleń wrote:

remove please

On Tue, Sep 16, 2014 at 9:35 AM, Amey Patil  wrote:


Remove.

On Tue, Sep 16, 2014 at 12:58 PM, Joan  wrote:


Remove please

2014-09-16 6:59 GMT+02:00 Patti Kelroe-Cooke :


Remove

Kind regards
Patti

On Mon, Sep 15, 2014 at 5:35 PM, Aaron Susan 
wrote:


Hi,

I am here to inform you that we are having a contact list of *Mongo

DB

Users *would you be interested in it?

Data Field’s Consist Of: Name, Job Title, Verified Phone Number,

Verified

Email Address, Company Name & Address Employee Size, Revenue size,

SIC

Code, Industry Type etc.,

We also provide other technology users as well depends on your

requirement.


For Example:


*Red Hat *

*Terra data *

*Net-app *

*NuoDB*

*MongoHQ ** and many more*


We also provide IT Decision Makers, Sales and Marketing Decision

Makers,

C-level Titles and other titles as per your requirement.

Please review and let me know your interest if you are looking for

above

mentioned users list or other contacts list for your campaigns.

Waiting for a positive response!

Thanks

*Aaron Susan*
Data Specialist

If you are not the right person, feel free to forward this email to

the

right person in your organization. To opt out response Remove















Re: external indexer for Solr Cloud

2014-09-01 Thread Siegfried Goeschl

Hi folks,

we are using Apache Camel but could use Spring Integration with the 
option to upgrade to Apache BatchEE or Spring Batch later on - 
especially Tika document extraction can kill your server due to CPU 
consumption, memory usage and plain memory leaks


AFAIK Doug Turnbull also improved the Camel Solr Integration

http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/99739

Cheers,

Siegfried Goeschl

On 01.09.14 18:05, Jack Krupansky wrote:

Packaging SolrCell in the same manner, with parallel threads and able to
talk to multiple SolrCloud servers in parallel would have a lot of the
same benefits as well.

And maybe there could be some more generic Java framework for indexing
as well, that "external indexers" in general could use.

-- Jack Krupansky

-Original Message- From: Shawn Heisey
Sent: Monday, September 1, 2014 11:42 AM
To: solr-user@lucene.apache.org
Subject: Re: external indexer for Solr Cloud

On 9/1/2014 7:19 AM, Jack Krupansky wrote:

It would be great to have a "standalone DIH" that runs as a separate
server and then sends standard Solr update requests to a Solr cluster.


This has been discussed, and I thought we had an issue in Jira, but I
can't find it.

A completely standalone DIH app would be REALLY nice.  I already know
that the JDBC ResultSet is not the bottleneck for indexing, at least for
me.  I once built a simple single-threaded SolrJ application that pulls
data from JDBC and indexes it in Solr.  It works in batches, typically
500 or 1000 docs at a time.  When I comment out the "solr.add(docs)"
line (so input object manipulation, casting, and building of the
SolrInputDocument objects is still happening), it can read and
manipulate our entire database (99.8 million documents) in about 20
minutes, but if I leave that in, it takes many hours.

The bottleneck is that each DIH has only a single thread indexing to
Solr.  I've theorized that it should be *relatively* easy for me to
write an application that pulls records off the JDBC ResultSet with
multiple threads (say 10-20), have each thread figure out which shard
its document lands on, and send it there with SolrJ.  It might even be
possible for the threads to collect several documents for each shard
before indexing them in the same request.

As with most multithreaded apps, the hard part is figuring out all the
thread synchronization, making absolutely certain that thread timing is
perfect without unnecessary delays.  If I can figure out a generic
approach (with a few configurable bells and whistles available), it
might be something suitable for inclusion in the project, followed with
improvements by all the smart people in our community.

Thanks,
Shawn
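
To make the idea above a bit more concrete, here is a rough multi-threaded
sketch using CloudSolrServer (SolrJ 4.x), which can route each document to the
correct shard leader. The ZooKeeper hosts, collection name, thread count, batch
size and the JDBC producer are all placeholder assumptions:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class ParallelIndexer {
        public static void main(String[] args) throws Exception {
            final CloudSolrServer solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
            solr.setDefaultCollection("collection1");

            final BlockingQueue<SolrInputDocument> queue = new LinkedBlockingQueue<>(10000);
            ExecutorService pool = Executors.newFixedThreadPool(10);

            for (int i = 0; i < 10; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        List<SolrInputDocument> batch = new ArrayList<>();
                        try {
                            SolrInputDocument doc;
                            // drain the queue; give up after the producer has been idle for 5s
                            while ((doc = queue.poll(5, TimeUnit.SECONDS)) != null) {
                                batch.add(doc);
                                if (batch.size() >= 500) {
                                    solr.add(batch);   // batch is sent to the cluster
                                    batch.clear();
                                }
                            }
                            if (!batch.isEmpty()) solr.add(batch);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }

            // A producer thread would read the JDBC ResultSet here, build
            // SolrInputDocuments and queue.put(doc) them for the workers.

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            solr.commit();
            solr.shutdown();
        }
    }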




Re: SOLR Performance benchmarking

2014-07-13 Thread Siegfried Goeschl
Hi Rashi,

abnormal behaviour depends on your data, system and workload - I have seen 
abnormal behaviour at customer sites and it turned out to be a miracle that 
the customer had no serious problems before :-)

* running out of sockets - you might need to check if you have enough sockets 
(system quota) and that the sockets are closed properly (mostly a 
Windows/networking issue - CLOSED_WAIT)
* understand your test setup - usually a test box is much smaller in terms of 
CPU/memory than your production box
** you might be forced to tweak your test configuration (e.g. production SOLR 
cache configuration can overwhelm a small server)
* understand your work-load 
** if you have long-running queries within your performance tests they tend to 
bring down your server under high load and your “abnormal” condition looks very 
normal in hindsight 
** spot your long-running queries, optimise them, re-run your tests
** check your cache warming and how fast you start your load injector threads

Cheers,

Siegfried Goeschl


On 13 Jul 2014, at 09:53, rashi gandhi  wrote:

> Hi,
> 
> I am using SolrMeter for load/stress testing solr performance.
> Tomcat is configured with default "maxThreads" (i.e. 200).
> 
> I set Intended Request per min in SolrMeter to 1500 and performed testing.
> 
> I found that sometimes it works with this much load on solr but sometimes
> it gives error "Server Refused Connection" in solr.
> On getting this error, i increased maxThreads to some higher value, and
> then it works again.
> 
> I would like to know why solr is behaving abnormally, initially when it was
> working with maxThreads=200.
> 
> Please provide me some pointers.



Re: SOLR: getting documents in the given order

2014-06-03 Thread Siegfried Goeschl
Assuming that you just want to sort - have you tried using

sort=id desc

Cheers,

Siegfried Goeschl

On 04 Jun 2014, at 06:19, sachinpkale  wrote:

> I have a following field in SOLR schema.
> 
> 
>  required="false" multiValued="false"/>
> 
> If I issue following query:
> 
> id:(1234 OR 2345 OR 3456)
> 
> SOLR does not return the documents in that order. It is giving document with
> id 3456, then with 1234 and then with 2345.
> 
> How do I get it in the same order as in the query?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SOLR-getting-documents-in-the-given-order-tp4139722.html
> Sent from the Solr - User mailing list archive at Nabble.com.



iText hitting infinite loop - Was Re: pdfs

2014-06-02 Thread Siegfried Goeschl

Hi folks,

Brian was so kind and sent me the troublesome PDF document

I gave it a try with PDFBox directly in order to extract the text 
(PDFBox is used by Tika to extract the textual content of a PDF document)


* hitting an infinite loop with PDFBox 1.8.3
* no problems with PDFBox 1.8.4 & 1.8.5
* PDFBox 1.8.4 is part of Apache Tika 1.5 (see 
http://www.apache.org/dist/tika/CHANGES-1.5.txt)
* Apache SOLR 4.8 uses Tika 1.5 (see 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)


In short the problem with this particular PDF is solved by

* Apache PDFBox 1.8.4 onwards
* Apache Tika 1.5
* Apache SOLR 4.8

Cheers,

Siegfried Goeschl



On 26.05.14 18:20, Erick Erickson wrote:

Brian:

Yeah, if you can share the PDF that would be great. Parsing via Tika should
not bring down Solr, although I suppose there could be something in Tika
that is pathologically bad.

You could also try using Tika itself in SolrJ and indexing from a client. That
might let you
1> more gracefully handle this without shutting down Solr
2> use different versions of Tika.

Personally I like offloading the document parsing to clients anyway since it
lessens the load on the Solr server and scales much better, but YMMV.

It's not actually very difficult, here's a skeleton (rip out the DB parts)
http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick
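
[A minimal sketch of that client-side approach (not the article's code): Tika 
parses the file in the client JVM and plain SolrJ sends the extracted text, so 
a pathological PDF can only take down the feeding process, never Solr itself. 
The URL and the field names are placeholders for whatever your schema uses.]

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class ClientSideExtractor {

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        Tika tika = new Tika();

        File pdf = new File(args[0]);

        // Parsing happens here, in the client process.
        String text = tika.parseToString(pdf);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pdf.getName());
        doc.addField("content", text);

        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}

[Wrapping the parseToString() call in a worker thread with a timeout, or in a 
separate process as suggested elsewhere in this thread, also gives you a way to 
kill a hanging extraction without restarting anything on the Solr side.]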

On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl  wrote:

Sorry typo :- can you send me the PDF by email directly :-)

Siegfried Goeschl

On 25 May 2014, at 10:06, Siegfried Goeschl  wrote:


Hi Brian,

can you send me the email? I would like to play around :-)

Have you opened a JIRA for PdfBox? If not I willl open one if I can reproduce 
the issue …

Thanks in advance

Siegfried Goeschl


On 25 May 2014, at 04:18, Brian McDowell  wrote:


Our feeding (indexing) tool halts because Solr becomes unresponsive after
getting some really bad pdfs. There are levels of pdf "badness." Some just
will not parse and that's fine, but others are more problematic in that our
Operations team has to restart Solr because it just hangs and accepts no
more documents. I actually have identified a pdf that will bring down Solr
every time. Does anyone think that doing pre-validation using the pdfbox
jar will work? Or, will trying to validate just hang as well? Any help is
appreciated.


On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky wrote:


Yeah, I recall running into infinite loop issues with PDFBox in Solr years
ago. They keep fixing these issues, but they keep popping up again. Sigh.

-- Jack Krupansky

-Original Message- From: Siegfried Goeschl
Sent: Thursday, May 22, 2014 4:35 AM
To: solr-user@lucene.apache.org
Subject: Re: pdfs


Hi folks,

for a small customer project I'm running SOLR with embedded Tikka.

* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes
excessive CPU usage - requires SOLR restart but happens only once
withing 400.000 documents (PDF, Word, ect) but is seems a little bit
erratic since I was never able to track the problem back to a particular
PDF document

Having said that we wire SOLR with Nagios to get an alarm when CPU
consumption goes through the roof

If you doing really serious stuff I would recommend
* moving the document extraction stuff out of SOLR
* provide monitoring and recovery and stuck document extractions
** killing worker threads
** using external processed and kill them when spinning out of control

Cheers,

Siegfried Goeschl

On 22.05.14 06:46, Jack Krupansky wrote:


Yeah, PDF extraction has always been at least somewhat problematic. It
has improved over the years, but still not likely to be perfect.

That said, I'm not aware of any specific PDF extraction issue that would
bring down Solr - as opposed to causing a 500 status with an exception
in PDF extraction, with the exception of memory usage. Some PDF
documents, especially those which are graphic-intense can require a lot
of memory. The rest of Solr could be adversely affected if all available
JVM heap is consumed. The solution is to give the JVM more heap space.

So, what is your specific symptom?

-- Jack Krupansky

-Original Message- From: Brian McDowell
Sent: Thursday, May 22, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: pdfs

Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be manually restarted. We are
using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
problem because the release notes associated with the new tika version and
also the new pdfbox indicate fixes for pdf issues. It didn't work and now
this issue is causing us to reevaluate using Solr. Any help on this matter
would be greatly appreciated. Thank you!












Re: ExtractingRequestHandler indexing zip files

2014-05-27 Thread Siegfried Goeschl
Hi Sergio,

you either do the extraction on the caller side (which is probably a good idea 
since you off-load the SOLR server) or extend the ExtractingRequestHandler

Cheers,

Siegfried Goeschl

On 27 May 2014, at 10:37, marotosg  wrote:

> Hi,
> 
> Thanks for your answer Alexandre.
> I have zip files with only one document inside per zip file. These documents
> are mainly pdf,xml,html.
> 
> I tried to index "tini.txt.gz" file which is located in the trunk to be used
> by extraction tests
> \trunk\solr\contrib\extraction\src\test-files\extraction\tini.txt.gz
> 
> I get the same issue only the name of the file inside "tini.txt.gz gets
> indexed as content. That means ExtractRequesthandler can open the file
> because it's getting the name inside but for some reason is not reading the
> content.
> 
> Any suggestions?
> 
> Thanks
> Sergio
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip-files-tp4138172p4138255.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: SolrCloud Nodes autoSoftCommit and (temporary) missing documents

2014-05-25 Thread Siegfried Goeschl
Hi folks,

I think that the timestamp should be rounded down to a minute (or whatever) to 
avoid trashing the filter query cache

Cheers,

Siegfried Goeschl
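
[A small illustration of why the rounding matters, using the timestamp field 
from the thread: a filter containing raw NOW produces a different string every 
millisecond, so each request creates a fresh filterCache entry, while Solr date 
math rounding keeps the string identical for a whole minute and the cached 
filter gets reused. A sketch with a SolrJ client, URL as a placeholder.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class RoundedNowFilter {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        // fq=timestamp:[* TO NOW-30SECOND]        -> new cache entry per request
        // fq=timestamp:[* TO NOW/MINUTE-1MINUTE]  -> one cache entry per minute
        q.addFilterQuery("timestamp:[* TO NOW/MINUTE-1MINUTE]");

        System.out.println(solr.query(q).getResults().getNumFound());
        solr.shutdown();
    }
}

[The same rounded expression works just as well in the appends section of the 
request handler shown in the quoted reply below.]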

On 25 May 2014, at 18:19, Steve McKay  wrote:

> Solr can add the filter for you:
> 
> <lst name="appends">
>   <str name="fq">timestamp:[* TO NOW-30SECOND]</str>
> </lst>
> 
> 
> Increasing soft commit frequency isn't a bad idea, though. I'd probably do 
> both. :)
> 
> On May 23, 2014, at 6:51 PM, Michael Tracey  wrote:
> 
>> Hey all,
>> 
>> I've got a number of nodes (Solr 4.4 Cloud) that I'm balancing with HaProxy 
>> for queries.  I'm indexing pretty much constantly, and have autoCommit and 
>> autoSoftCommit on for Near Realtime Searching.  All works nicely, except 
>> that occasionally the auto-commit cycles are far enough off that one node 
>> will return a document that another node doesn't.  I don't want to have to 
>> add something like this: timestamp:[* TO NOW-30MINUTE] to every query to 
>> make sure that all the nodes have the record.  Ideas? autoSoftCommit more 
>> often?
>> 
>>  
>>  10 
>>  720 
>>  false 
>> 
>> 
>>  
>>  3 
>>  5000
>>  
>> 
>> Thanks,
>> 
>> M.
> 



Re: pdfs

2014-05-25 Thread Siegfried Goeschl
Sorry typo :- can you send me the PDF by email directly :-)

Siegfried Goeschl

On 25 May 2014, at 10:06, Siegfried Goeschl  wrote:

> Hi Brian,
> 
> can you send me the email? I would like to play around :-)
> 
> Have you opened a JIRA for PdfBox? If not I willl open one if I can reproduce 
> the issue … 
> 
> Thanks in advance
> 
> Siegfried Goeschl
> 
> 
> On 25 May 2014, at 04:18, Brian McDowell  wrote:
> 
>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>> will not parse and that's fine, but others are more problematic in that our
>> Operations team has to restart Solr because it just hangs and accepts no
>> more documents. I actually have identified a pdf that will bring down Solr
>> every time. Does anyone think that doing pre-validation using the pdfbox
>> jar will work? Or, will trying to validate just hang as well? Any help is
>> appreciated.
>> 
>> 
>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky 
>> wrote:
>> 
>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Siegfried Goeschl
>>> Sent: Thursday, May 22, 2014 4:35 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: pdfs
>>> 
>>> 
>>> Hi folks,
>>> 
>>> for a small customer project I'm running SOLR with embedded Tikka.
>>> 
>>> * memory consumption is an issue but can be handled
>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>> excessive CPU usage - requires SOLR restart but happens only once
>>> withing 400.000 documents (PDF, Word, ect) but is seems a little bit
>>> erratic since I was never able to track the problem back to a particular
>>> PDF document
>>> 
>>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>>> consumption goes through the roof
>>> 
>>> If you doing really serious stuff I would recommend
>>> * moving the document extraction stuff out of SOLR
>>> * provide monitoring and recovery and stuck document extractions
>>> ** killing worker threads
>>> ** using external processed and kill them when spinning out of control
>>> 
>>> Cheers,
>>> 
>>> Siegfried Goeschl
>>> 
>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>> 
>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>> has improved over the years, but still not likely to be perfect.
>>>> 
>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>> bring down Solr - as opposed to causing a 500 status with an exception
>>>> in PDF extraction, with the exception of memory usage. Some PDF
>>>> documents, especially those which are graphic-intense can require a lot
>>>> of memory. The rest of Solr could be adversely affected if all available
>>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>>> 
>>>> So, what is your specific symptom?
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> -Original Message- From: Brian McDowell
>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: pdfs
>>>> 
>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>>> Solr completely so that it actually needs to be manually restarted. We are
>>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>> problem because the release notes associated with the new tika version and
>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>> would be greatly appreciated. Thank you!
>>>> 
>>> 
>>> 
> 



Re: pdfs

2014-05-25 Thread Siegfried Goeschl
Hi Brian,

can you send me the email? I would like to play around :-)

Have you opened a JIRA for PdfBox? If not I will open one if I can reproduce 
the issue … 

Thanks in advance

Siegfried Goeschl


On 25 May 2014, at 04:18, Brian McDowell  wrote:

> Our feeding (indexing) tool halts because Solr becomes unresponsive after
> getting some really bad pdfs. There are levels of pdf "badness." Some just
> will not parse and that's fine, but others are more problematic in that our
> Operations team has to restart Solr because it just hangs and accepts no
> more documents. I actually have identified a pdf that will bring down Solr
> every time. Does anyone think that doing pre-validation using the pdfbox
> jar will work? Or, will trying to validate just hang as well? Any help is
> appreciated.
> 
> 
> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky 
> wrote:
> 
>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Siegfried Goeschl
>> Sent: Thursday, May 22, 2014 4:35 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: pdfs
>> 
>> 
>> Hi folks,
>> 
>> for a small customer project I'm running SOLR with embedded Tikka.
>> 
>> * memory consumption is an issue but can be handled
>> * there is an issue with PDFBox hitting an infinite loop which causes
>> excessive CPU usage - requires SOLR restart but happens only once
>> withing 400.000 documents (PDF, Word, ect) but is seems a little bit
>> erratic since I was never able to track the problem back to a particular
>> PDF document
>> 
>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>> consumption goes through the roof
>> 
>> If you doing really serious stuff I would recommend
>> * moving the document extraction stuff out of SOLR
>> * provide monitoring and recovery and stuck document extractions
>> ** killing worker threads
>> ** using external processed and kill them when spinning out of control
>> 
>> Cheers,
>> 
>> Siegfried Goeschl
>> 
>> On 22.05.14 06:46, Jack Krupansky wrote:
>> 
>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>> has improved over the years, but still not likely to be perfect.
>>> 
>>> That said, I'm not aware of any specific PDF extraction issue that would
>>> bring down Solr - as opposed to causing a 500 status with an exception
>>> in PDF extraction, with the exception of memory usage. Some PDF
>>> documents, especially those which are graphic-intense can require a lot
>>> of memory. The rest of Solr could be adversely affected if all available
>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>> 
>>> So, what is your specific symptom?
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Brian McDowell
>>> Sent: Thursday, May 22, 2014 12:24 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: pdfs
>>> 
>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>> Solr completely so that it actually needs to be manually restarted. We are
>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>> problem because the release notes associated with the new tika version and
>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>> would be greatly appreciated. Thank you!
>>> 
>> 
>> 



Re: pdfs

2014-05-22 Thread Siegfried Goeschl

Hi folks,

for a small customer project I'm running SOLR with embedded Tika.

* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes 
excessive CPU usage - requires SOLR restart but happens only once 
within 400,000 documents (PDF, Word, etc.) but it seems a little bit 
erratic since I was never able to track the problem back to a particular 
PDF document


Having said that we wire SOLR with Nagios to get an alarm when CPU 
consumption goes through the roof


If you're doing really serious stuff I would recommend
* moving the document extraction stuff out of SOLR
* providing monitoring and recovery for stuck document extractions
** killing worker threads
** using external processes and killing them when they spin out of control

Cheers,

Siegfried Goeschl

On 22.05.14 06:46, Jack Krupansky wrote:

Yeah, PDF extraction has always been at least somewhat problematic. It
has improved over the years, but still not likely to be perfect.

That said, I'm not aware of any specific PDF extraction issue that would
bring down Solr - as opposed to causing a 500 status with an exception
in PDF extraction, with the exception of memory usage. Some PDF
documents, especially those which are graphic-intense can require a lot
of memory. The rest of Solr could be adversely affected if all available
JVM heap is consumed. The solution is to give the JVM more heap space.

So, what is your specific symptom?

-- Jack Krupansky

-Original Message- From: Brian McDowell
Sent: Thursday, May 22, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: pdfs

Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be manually restarted. We are
using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
problem because the release notes associated with the new tika version and
also the new pdfbox indicate fixes for pdf issues. It didn't work and now
this issue is causing us to reevaluate using Solr. Any help on this matter
would be greatly appreciated. Thank you!




Re: Indexing PDF in Apache Solr 4.8.0 - Problem.

2014-05-12 Thread Siegfried Goeschl
Hi Vignesh,

can you check your SOLR Server Log?! Not all PDF documents on this planet can 
be processed using Tika :-)

Cheers,

Siegfried Goeschl

On 07 May 2014, at 09:40, vignesh  wrote:

> Dear Team,
>  
> I am Vignesh  using the latest version 4.8.0 Apache Solr and am 
> Indexing my PDF but getting an error and have posted that below for your 
> reference. Kindly guide me to solve this error.
>  
> D:\IPCB\solr>java -Durl=http://localhost:8082/solr/ipcb/update/extract 
> -Dparams=
> literal.id=herald060214_001 -Dtype=application/pdf -jar post.jar 
> "D:/IPCB/ipcbpd
> f/herald060214_001.pdf"
> SimplePostTool version 1.5
> Posting files to base url 
> http://localhost:8082/solr/ipcb/update/extract?literal
> .id=herald060214_001 using content-type application/pdf..
> POSTing file herald060214_001.pdf
> SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error
> SimplePostTool: WARNING: IOException while reading response: 
> java.io.IOException
> : Server returned HTTP response code: 500 for URL: 
> http://localhost:8082/solr/ip
> cb/update/extract?literal.id=herald060214_001
> 1 files indexed.
> COMMITting Solr index changes to 
> http://localhost:8082/solr/ipcb/update/extract?
> literal.id=herald060214_001..
> SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error 
> for u
> rl 
> http://localhost:8082/solr/ipcb/update/extract?literal.id=herald060214_001&co
> mmit=true
> Time spent: 0:00:00.062
>  
>  
>  
> Thanks & Regards.
> Vignesh.V
>  
> 
> Ninestars Information Technologies Limited.,
> 72, Greams Road, Thousand Lights, Chennai - 600 006. India.
> Landline : +91 44 2829 4226 / 36 / 56   X: 144
> www.ninestars.in
>  
> 
> 



Re: Export big extract from Solr to [My]SQL

2014-05-02 Thread Siegfried Goeschl

Hi Per,

basically I see three options

* use a lot of memory to cope with huge result sets
* use result set paging
* SOLR 4.7 supports cursors 
(https://issues.apache.org/jira/browse/SOLR-5463)


Cheers,

Siegfried Goeschl
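
[For the cursor option, a minimal SolrJ 4.7+ sketch of walking the complete 
result set page by page and handing each page to whatever writes the MySQL 
rows. The URL, the query and the uniqueKey field name "id" are assumptions.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.SortClause;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorExport {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("some_field:some_value");
        q.setRows(1000);
        // Cursors require a sort that ends on the uniqueKey field.
        q.setSort(SortClause.asc("id"));

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                // Write the row to MySQL here, e.g. via a JDBC batch insert.
                System.out.println(doc.getFieldValue("id"));
            }
            String next = rsp.getNextCursorMark();
            // When the cursor stops moving, every matching document has been read.
            done = cursorMark.equals(next);
            cursorMark = next;
        }
        solr.shutdown();
    }
}

[Unlike start/rows paging, memory use stays flat no matter how deep the export 
goes, which is exactly the problem with huge result sets.]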

On 02.05.14 13:32, Per Steffensen wrote:

Hi

I want to make extracts from my Solr to MySQL. Any tools around that can
help med perform such a task? I find a lot about data-import from SQL
when googling, but nothing about export/extract. It is not all of the
data in Solr I need to extract. It is only documents that full fill a
normal Solr query, but the number of documents fulfilling it will
(potentially) be huge.

Regards, Per Steffensen




Re: Having trouble with German compound words in Solr 4.7

2014-04-24 Thread Siegfried Goeschl

Hi Alistair,

it seems that there are many ways to skin the cat so I describe the 
approach I used with SOLR 3.6 :-)


* Using a patched DictionaryCompoundWordTokenFilterFactory in the 
"index" phase - so the german compound noun "Leinenhose" (linen 
trousers) would be indexed in addition to "Leinen" & "Hose". Afterwards 
the three tokens go through stemming.


* One hint which might be useful - I only split words which I consider 
proper german compound nouns. E.g. if your indexed text contains the 
token "schwarzkleid" I would NOT split it since it is NOT a proper noun 
- the proper noun would be "Schwarzkleid" - please note that even 
"Schwarzkleid" is not a proper german noun anyway :-)


* I use a custom dictionary for splitting consisting of 7.000 entries 
which contains a lot of customer-specific entries


I do not tinker with DictionaryCompoundWordTokenFilterFactory in the 
"query" phase of the field so the following queries would work with the 
indexed word "Leinenhose"


* "leinenhosen"
* "leinenhose"
* "leinen hose"
* "leinen hosen"

Cheers,

Siegfried Goeschl



On 22.04.14 12:13, Alistair wrote:

I've managed to solve this (in a quite hacky sort of way) by using filter
queries and the edismax queryparser.

I added in my solrconfig.xml the following parameters:

 <str name="defType">edismax</str>
 <str name="mm">75%</str>

Then when searching for multiple keywords (for example: schwarzkleid wenz,
where wenz is a german brand name), I use the first keyword as a query and
anything after that I add as a filterquery. So my final query looks
something like this:


fl=id&sort=popular+desc&indent=on&q=keywords:'schwarzkleide'+&wt=json&fq={!edismax}+keywords:'wenz'&fq=deleted:0

My compound splitter filter splits schwarzkleide correctly and it is parsed
as edismax with mm=75%, then the filterqueries are added, for keywords they
are also parsed as edismax. The returned result is all the black dresses
from 'Wenz'.

If anybody has a better solution to what I've posted I would be more than
happy to read up on it as I'm quite new to Solr and I think my way is a bit
convoluted to be honest.

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4132478.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Siegfried Goeschl
Hi Alistair,

quick email before getting my plane - I worked with similar requirements in the 
past and tuning SOLR can be tricky

* are you hitting the same SOLR query handler (application versus manual 
checking)?
* turn on debugging for your application SOLR queries so you see what query is 
actually executed
* one thing I always do for prototyping is setting up the Solritas GUI using 
the same query handler as the application server

Cheers,

Siegfried Goeschl


On 18 Apr 2014, at 06:06, Alistair  wrote:

> Hey Jack,
> 
> thanks for the reply. I added autoGeneratePhraseQueries="true" to the
> fieldType and now it's giving me even more results! I'm not sure if the
> debug of my query will be helpful but I'll paste it just in case someone
> might have an idea. This produces 113524 results, whereas if I manually
> enter the query as keyword:schwarz AND keyword:kleid I only get 20283
> results (which is the correct one). 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4131973.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: No route to host

2014-04-09 Thread Siegfried Goeschl
Hi folks,

the URL looks wrong (misconfigured)

http://:8080/solr/collection1

Cheers,

Siegfried Goeschl

On 09 Apr 2014, at 14:28, Rallavagu  wrote:

> All,
> 
> I see the following error in the log file. The host that it is trying to find 
> is itself. Wondering if anybody experienced this before or any other info 
> would helpful. Thanks.
> 
> 709703139 [http-bio-8080-exec-43] ERROR 
> org.apache.solr.update.SolrCmdDistributor  – 
> org.apache.solr.client.solrj.SolrServerException: IOException occured when 
> talking to server at: http://:8080/solr/collection1
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:503)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>   at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:293)
>   at 
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:212)
>   at 
> org.apache.solr.update.SolrCmdDistributor.distribCommit(SolrCmdDistributor.java:181)
>   at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1260)
>   at 
> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
>   at 
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
>   at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>   at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>   at 
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>   at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
>   at 
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
>   at 
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>   at 
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.net.NoRouteToHostException: No route to host
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at 
> org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
>   at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
>   at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
>   at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
>   at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
>   at 
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>   at 
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>   at 
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:393)



Re: Anyone going to ApacheCon in Denver next week?

2014-04-06 Thread Siegfried Goeschl
Hi folks,

I’m already here and would love to join :-)

Cheers,

Siegfried Goeschl


On 05 Apr 2014, at 20:43, Doug Turnbull  
wrote:

> I'll be there. I'd love to meet up. Let me know!
> 
> Sent from my Windows Phone From: William Bell
> Sent: 4/5/2014 10:40 PM
> To: solr-user@lucene.apache.org
> Subject: Anyone going to ApacheCon in Denver next week?
> Thoughts on getting together for breakfast? a little Solr meet up?
> 
> 
> 
> -- 
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076



Re: Apache Solr.

2014-02-03 Thread Siegfried Goeschl

Hi Vignesh,

a few keywords for further investigations

* Solr Data Import Handler
* Apache Tika
* Apache PDFBox

Cheers,

Siegfried Goeschl
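
[Beyond those pointers, a minimal SolrJ sketch of pushing a single PDF through 
the ExtractingRequestHandler (the /update/extract endpoint), assuming a 4.x-era 
SolrJ client; the URL, file name and literal.id value are placeholders.]

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PdfPoster {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("sample.pdf"), "application/pdf");
        req.setParam("literal.id", "sample-pdf-1");

        solr.request(req);   // Tika/PDFBox extraction happens inside Solr
        solr.commit();
        solr.shutdown();
    }
}

[If a document fails, the stack trace in the Solr server log (not the client 
output) usually names the underlying Tika or PDFBox problem.]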

On 03.02.14 09:15, vignesh wrote:

Hi Team,



 I am Vignesh, am using Apache Solr 3.6 and able to Index
XML file and now trying to Index PDF file and not able to index .Can you
give me the steps to carry out PDF indexing it will be very useful. Kindly
guide me through this process.





Thanks & Regards.

Vignesh.V




Ninestars Information Technologies Limited.,

72, Greams Road, Thousand Lights, Chennai - 600 006. India.

Landline : +91 44 2829 4226 / 36 / 56   X: 144

www.ninestars.in









Re: Why do people want to deploy to Tomcat?

2013-11-12 Thread Siegfried Goeschl

Hi Alex,

in my case

* ignorance that Tomcat is not fully supported
* Tomcat configuration and operations know-how inhouse
* could migrate to Jetty but need approved change request to do so

Cheers,

Siegfried Goeschl

On 12.11.13 04:54, Alexandre Rafalovitch wrote:

Hello,

I keep seeing here and on Stack Overflow people trying to deploy Solr to
Tomcat. We don't usually ask why, just help when where we can.

But the question happens often enough that I am curious. What is the actual
business case. Is that because Tomcat is well known? Is it because other
apps are running under Tomcat and it is ops' requirement? Is it because
Tomcat gives something - to Solr - that Jetty does not?

It might be useful to know. Especially, since Solr team is considering
making the server part into a black box component. What use cases will that
break?

So, if somebody runs Solr under Tomcat (or needed to and gave up), let's
use this thread to collect this knowledge.

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)



Re: how to debug my own analyzer in solr

2013-10-21 Thread Siegfried Goeschl

Thread Dump and/or Remote Debugging?!

Cheers,

Siegfried Goeschl

On 21.10.13 11:58, Mingzhu Gao wrote:

More information about this , the custom analyzer just implement
"createComponents" of Analyzer.

And my configure in schema.xml is just something like :


  



 From the log I cannot see any error information , however , when I want to
analysis or add document data , it always hang there .

Any way to debug or narrow down the problem ?

Thanks in advance .

-Mingz

On 10/21/13 4:35 PM, "Mingzhu Gao"  wrote:


Dear solr expert ,

I would like to write my own analyser ( Chinese analyser ) and integrate
them into solr as solr plugin .

From the log information , the custom analyzer can be loaded into solr
successfully.  I define my fieldType with this custom analyzer.

Now the problem is that ,  when I try this analyzer from
http://localhost:8983/solr/#/collection1/analysis , click the analysis ,
then choose my FieldType , then input some text .
After I click "Analyse Value" button , the solr hang there , I cannot get
any result or response in a few minutes.

I also try to add  some data by "curl
http://localhost:8983/solr/update?commit=true -H "Content-Type: text/xml"
, or by "post.sh" in exampledocs folder ,
The same issue , the solr hang there , no result and not response .

Can anybody give me some suggestions on how to debug solr to work with my
own custom analyzer ?

By the way , I write a java program to call my custom analyzer , the
result is okay , for example , the following code can work well .
==
Analyzer analyzer = new MyAnalyzer() ;

TokenStream ts = analyzer.tokenStream() ;

CharTermAttribute ta = ts.getAttribute(CharTermAttribute.class);

ts.reset();

while (ts.incrementToken()){

System.out.println(ta.toString());

}

=


Thanks,

-Mingz







Re: solr 4.4 config trouble

2013-09-30 Thread Siegfried Goeschl
Hi Marc,

what exactly is not working - no obvious problems in the logs as far as I can see

Cheers,

Siegfried Goeschl

Am 30.09.2013 um 11:44 schrieb Marc des Garets :

> Hi,
> 
> I'm running solr in tomcat. I am trying to upgrade to solr 4.4 but I can't 
> get it to work. If someone can point me at what I'm doing wrong.
> 
> tomcat context:
>  crossContext="true">
>  value="/opt/solr4.4/solr_address" override="true" />
> 
> 
> 
> core.properties:
> name=address
> collection=address
> coreNodeName=address
> dataDir=/opt/indexes4.1/address
> 
> 
> solr.xml:
> 
> 
> 
> ${host:}
> 8080
> solr_address
> ${zkClientTimeout:15000}
> false
> 
> 
>  class="HttpShardHandlerFactory">
> ${socketTimeout:0}
> ${connTimeout:0}
> 
> 
> 
> 
> In solrconfig.xml I have:
> 4.1
> 
> /opt/indexes4.1/address
> 
> 
> And the log4j logs in catalina.out:
> ...
> INFO: Deploying configuration descriptor solr_address.xml
> 0 [main] INFO org.apache.solr.servlet.SolrDispatchFilter – 
> SolrDispatchFilter.init()
> 24 [main] INFO org.apache.solr.core.SolrResourceLoader – Using JNDI 
> solr.home: /opt/solr4.4/solr_address
> 26 [main] INFO org.apache.solr.core.SolrResourceLoader – new 
> SolrResourceLoader for directory: '/opt/solr4.4/solr_address/'
> 176 [main] INFO org.apache.solr.core.ConfigSolr – Loading container 
> configuration from /opt/solr4.4/solr_address/solr.xml
> 272 [main] INFO org.apache.solr.core.SolrCoreDiscoverer – Looking for cores 
> in /opt/solr4.4/solr_address
> 276 [main] INFO org.apache.solr.core.SolrCoreDiscoverer – Looking for cores 
> in /opt/solr4.4/solr_address/conf
> 276 [main] INFO org.apache.solr.core.SolrCoreDiscoverer – Looking for cores 
> in /opt/solr4.4/solr_address/conf/xslt
> 277 [main] INFO org.apache.solr.core.SolrCoreDiscoverer – Looking for cores 
> in /opt/solr4.4/solr_address/conf/lang
> 278 [main] INFO org.apache.solr.core.SolrCoreDiscoverer – Looking for cores 
> in /opt/solr4.4/solr_address/conf/velocity
> 283 [main] INFO org.apache.solr.core.CoreContainer – New CoreContainer 
> 991552899
> 284 [main] INFO org.apache.solr.core.CoreContainer – Loading cores into 
> CoreContainer [instanceDir=/opt/solr4.4/solr_address/]
> 301 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting socketTimeout to: 0
> 301 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting urlScheme to: http://
> 301 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting connTimeout to: 0
> 302 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting maxConnectionsPerHost to: 20
> 302 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting corePoolSize to: 0
> 303 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting maximumPoolSize to: 2147483647
> 303 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting maxThreadIdleTime to: 5
> 303 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting sizeOfQueue to: -1
> 303 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory – 
> Setting fairnessPolicy to: false
> 320 [main] INFO org.apache.solr.client.solrj.impl.HttpClientUtil – Creating 
> new http client, 
> config:maxConnectionsPerHost=20&maxConnections=1&socketTimeout=0&connTimeout=0&retry=false
> 420 [main] INFO org.apache.solr.logging.LogWatcher – Registering Log Listener 
> [Log4j (org.slf4j.impl.Log4jLoggerFactory)]
> 422 [main] INFO org.apache.solr.core.ZkContainer – Zookeeper 
> client=192.168.10.206:2181
> 429 [main] INFO org.apache.solr.client.solrj.impl.HttpClientUtil – Creating 
> new http client, 
> config:maxConnections=500&maxConnectionsPerHost=16&socketTimeout=0&connTimeout=0
> 487 [main] INFO org.apache.solr.common.cloud.ConnectionManager – Waiting for 
> client to connect to ZooKeeper
> 540 [main-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – 
> Watcher org.apache.solr.common.cloud.ConnectionManager@7dc21ece 
> name:ZooKeeperConnection Watcher:192.168.10.206:2181 got event WatchedEvent 
> state:SyncConnected type:None path:null path:null type:None
> 541 [main] INFO org.apache.solr.common.cloud.ConnectionManager – Client is 
> connected to ZooKeeper
> 562 [main] INFO org.apache.solr.common.cloud.SolrZkClient – makePath: 
> /overseer/queue
> 578 [main] INFO org.apache.solr.common.cloud.SolrZkClient – makePath: 
> /overseer/collection-queue-work
> 591 [main] INFO org.apache.solr.common.cloud.SolrZkClient – makePath: 
> /live_nodes
> 59

Re: how to suppress result

2008-04-07 Thread Siegfried Goeschl

Hi Evgeniy

+) delete the documents if you really don't need them
+) create a field "ignored" and build an appropriate query to exclude 
the documents where 'ignored' is true


Cheers,

Siegfried Goeschl
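
[For the first option, a sketch with a current SolrJ client, assuming the flat 
file holds one uniqueKey value per line (URL and file name are placeholders); 
for the second option you would instead index an "ignored" boolean field and 
add fq=-ignored:true to every search.]

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteIdsFromFile {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        List<String> batch = new ArrayList<String>(1000);
        BufferedReader in = new BufferedReader(new FileReader("exclude-ids.txt"));
        String id;
        while ((id = in.readLine()) != null) {
            batch.add(id.trim());
            if (batch.size() == 1000) {
                solr.deleteById(batch);   // delete in batches, not one ID at a time
                batch.clear();
            }
        }
        in.close();
        if (!batch.isEmpty()) {
            solr.deleteById(batch);
        }
        solr.commit();
        solr.shutdown();
    }
}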

Evgeniy Strokin wrote:

Hello,.. I have odd problem.
I use Solr for regular search, and it works fine for my task, but my client has 
a list of IDs in a flat separate file (he could have huge amount of the IDs, up 
to 1M) and he wants to exclude those IDs from result of the search.
What is the right way to do this?

Any thoughts are greatly appreciated.
Thank you
Gene


  


Re: Can We append a field to the response that is not in the index but computed at runtime.

2008-03-31 Thread Siegfried Goeschl

Hi folks,

I had to solve a similar problem with SOLR 1.2 and used a custom 
org.apache.solr.request.QueryResponseWriter - you can trigger your 
custom response writer using SOLR admin but it is not an elegant 
solution (I think the XMLWriter is a final class therefore some 
copy&waste code)


Cheers,

Siegfried Goeschl



Umar Shah wrote:

On Mon, Mar 31, 2008 at 7:38 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

  

Two approaches:
1. make a map and add it to the response:
  rb.rsp.add( "mystuff", mymap );




I tried using both  Map/ NamedList

it appends to the results
I have to attach each document with corresponding field.


  

2. Augment the documents with a field value -- this is a bit more
complex and runs the risk of name collisions with fields in your
documents.  You can pull the docList out from the response and add
fields to each document.



this seems more appropriate,
I'm okay, to resolve name collision , how do I add the  field.. any specific
methods to do that?


  

If #1 works, go with that...

ryan



On Mar 31, 2008, at 9:51 AM, Umar Shah wrote:



thanks ryan for the reply.

I have looked at the prepare and process methods in
SearchComponents(Query,
Filter etc).
I'm using all the default components to prepare and then process the
reults.
and then prepare a custom field after iterating through all the
documents in
the result set. After having created this field for each document
how do I
add corresponding custom field to each document in the response set.


On Mon, Mar 31, 2008 at 6:25 PM, Ryan McKinley <[EMAIL PROTECTED]>
wrote:

  

Without writing any custom code, no.

If you write a "SearchComponent"
http://wiki.apache.org/solr/SearchComponent
-- you can programatically change the response at runtime.

ryan



On Mar 28, 2008, at 3:38 AM, Umar Shah wrote:



Hi,

I wanted to know whether we can append a field (Fdyn say) to each
doc in the
returned set
Fdyn is computed as some complex function of the fields stored in
the index
during the runtime in SOLR.



-umar
  




  


Re: Solr interprets UTF-8 as ISO-8859-1

2008-03-31 Thread Siegfried Goeschl

Hi Daniel,

the following topic might help (at least it did the trick for me using 
german characters)


http://wiki.apache.org/solr/FAQ - Why don't International Characters Work?

So I wrote the following servlet (taken from Wiki/mailing list)

import org.apache.solr.servlet.SolrDispatchFilter;

import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import java.io.IOException;

/**
 * A work around that the URL parameters are encoded using UTF-8 but no
 * character encoding is defined. So enforce UTF-8 to make it work with
 * German characters.
 */
public class CdpSolrDispatchFilter extends SolrDispatchFilter {

  public void doFilter(ServletRequest request, ServletResponse response,
      FilterChain chain) throws IOException, ServletException {

    String encoding = request.getCharacterEncoding();
    if (null == encoding) {
      // Set your default encoding here
      request.setCharacterEncoding("UTF-8");
    } else {
      request.setCharacterEncoding(encoding);
    }

    super.doFilter(request, response, chain);
  }
}

Cheers,

Siegfried Goeschl

Daniel Löfquist wrote:

Hello,

We're building a webapplication that uses Solr for searching and I've
come upon a problem that I can't seem to get my head around.

We have a servlet that accepts input via XML-RPC and based on that input
constructs the correct URL to perform a search with the Solr-servlet.

I know that the call to Solr (the URL) from our servlet looks like this
(which is what it should look like):

http://myserver:8080/solrproducts/select/?q=all_SV:ljusblå+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25 



But Solr reports the input-fields (the GET-variables in the URL) as:

INFO: /select/
fl=id,artno,title_SV,titleSort_SV,description_SV,&sort=titleSort_SV+asc,id+asc&start=0&q=all_SV:ljusblå+status:online&q.op=AND&rows=25 



which is all fine except where it says "ljusblå". Apparently Solr is
interpreting the UTF-8 string "ljusblå" as ISO-8859-1 and thus creates
this garbage that makes the search return 0 when it should in reality
return 3 hits.

All other searches that don't use special characters work 100% fine.

I'm new to Solr so I'm not sure what I'm doing wrong here. Can anybody
help me out and point me in the direction of a solution?

Sincerely,

Daniel Löfquist





Re: Combining SOLR and JAMon to monitor query execution times from a browser

2007-11-28 Thread Siegfried Goeschl

Hi Noberto,

JAMon is all about aggregating statistical data and displaying the 
information in a web browser - the main beauty is that it is easy to 
define what you are monitoring such as querying domain objects per customer.


Cheers,

Siegfried Goeschl
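
[Not the closed-source filter itself, but a minimal sketch of the same idea, 
assuming JAMon on the classpath; the "customer" request parameter and the label 
format are made up for the example. The real implementation extends JAMon's 
ready-to-use ServletFilter instead of starting from scratch.]

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import com.jamonapi.Monitor;
import com.jamonapi.MonitorFactory;

public class QueryTimeMonitorFilter implements Filter {

    public void init(FilterConfig config) throws ServletException { }

    public void doFilter(ServletRequest request, ServletResponse response,
            FilterChain chain) throws IOException, ServletException {
        String customer = request.getParameter("customer");
        String label = "solr." + ((HttpServletRequest) request).getRequestURI()
                + "." + (customer != null ? customer : "unknown");

        // One JAMon monitor per label: hits, average/min/max execution times
        // and their distribution show up on the JAMon admin pages.
        Monitor mon = MonitorFactory.start(label);
        try {
            chain.doFilter(request, response);
        } finally {
            mon.stop();
        }
    }

    public void destroy() { }
}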

Norberto Meijome wrote:

On Tue, 27 Nov 2007 18:18:16 +0100
Siegfried Goeschl <[EMAIL PROTECTED]> wrote:

  

Hi folks,

working on a closed source project for an IP concerned company is not 
always fun ... we combined SOLR with JAMon 
(http://jamonapi.sourceforge.net/) to keep an eye of the query times and 
this might be of general interest


+) JAMon comes with a ready-to-use ServletFilter
+) we extended this implementation to keep track for queries issued by a 
customer and the requested domain objects, e.g. "artist", "album", "track"
+) this allows us to keep track of the execution times and their 
distribution to find quickly long running queries without having access 
to the access.log from a web browser
+) a small presentation can be found at 
http://people.apache.org/~sgoeschl/presentations/jamon-20070717.pdf

+) if it is of general I can rewrite the code as contribution



Thanks Siegfried,

I am further interested in  plugging this information into something like Nagios , Cacti , Zenoss , bigsister , Openview or your monitoring system of choice, but I haven't had much time to look into this yet. How does JAMon compare to JMX ( http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/) ? 


cheers,
B

_
{Beto|Norberto|Numard} Meijome

There are no stupid questions, but there are a LOT of inquisitive idiots.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


  


Combining SOLR and JAMon to monitor query execution times from a browser

2007-11-27 Thread Siegfried Goeschl

Hi folks,

working on a closed source project for an IP concerned company is not 
always fun ... we combined SOLR with JAMon 
(http://jamonapi.sourceforge.net/) to keep an eye on the query times and 
this might be of general interest


+) JAMon comes with a ready-to-use ServletFilter
+) we extended this implementation to keep track of queries issued by a 
customer and the requested domain objects, e.g. "artist", "album", "track"
+) this allows us to keep track of the execution times and their 
distribution, and to quickly find long-running queries from a web browser 
without having access to the access.log
+) a small presentation can be found at 
http://people.apache.org/~sgoeschl/presentations/jamon-20070717.pdf

+) if it is of general interest I can rewrite the code as a contribution

Cheers,

Siegfried Goeschl


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Siegfried Goeschl

Hi Kevin,

I'm also a newbie but some thoughts along the line ...

+) for evaluating SOLR we used a less exotic setup for data import based 
on Pnuts (a JVM based scripting language) ... :-) ... but Groovy would 
do as well if you feel at home with Java.


+) my colleague just finished a database import service running within 
the servlet container to avoid writing out the data to the file system 
and transmitting it over HTTP.


+) I think there were some discussions regarding a generic database 
importer but nothing I'm aware of



Cheers,

Siegfried Goeschl

Kevin Holmes wrote:

I inherited an existing (working) solr indexing script that runs like
this:

 


Python script queries the mysql DB then calls bash script

Bash script performs a curl POST submit to solr

 


We're injecting about 1000 records / minute (constantly), frequently
pushing the edge of our CPU / RAM limitations.

 


I'm in the process of building a Perl script to use DBI and
lwp::simple::post that will perform this all from a single script
(instead of 3).

 


Two specific questions

1: Does anyone have a clever (or better) way to perform this process
efficiently?

 


2: Is there a way to inject into solr without using POST / curl / http?

 


Admittedly, I'm no solr expert - I'm starting from someone else's setup,
trying to reverse-engineer my way out.  Any input would be greatly
appreciated.


  


Re: Need question to configure Log4j for solr

2007-07-13 Thread Siegfried Goeschl

Hi Ken,

and we stopped using Resin's support for daily rolling log files since 
it blocks the server for 20 minutes when rotating a 20 GB logfile - 
please don't ask what we are doing with the daily 20 GB ... :-(


Cheers,

Siegfried Goeschl

Ken Krugler wrote:

: the troubles comes when you integrate third-party stuff depending on
: log4j (as I currently do). Having said this you have a strong point 
when

: looking at http://www.qos.ch/logging/classloader.jsp

there have been several discussions baout changing the logger used by 
Solr

... the best summation i can give to these discussions is:

  * JDK logging is universal
  * using any other logging framework would add a dependency without
adding functionality


The one issue I ran into was with daily rolling log files - maybe I 
missed it, but I didn't find that functionality in the JDK logging 
package, however it is in log4j.


I'm not advocating a change, just noting this. We worked around it by 
leveraging Resin's support for wrapping a logger (set up for daily 
rolling log files) around a webapp.


-- Ken


Re: Need question to configure Log4j for solr

2007-07-12 Thread Siegfried Goeschl

Hi Erik,

the trouble comes when you integrate third-party stuff depending on 
log4j (as I currently do). Having said this you have a strong point when 
looking at http://www.qos.ch/logging/classloader.jsp


Cheers,

Siegfried Goeschl

Erik Hatcher wrote:


On Jul 12, 2007, at 9:03 AM, Siegfried Goeschl wrote:
would be using commons-logging an improvement? It is a common 
requirement to hook up different logging infrastructure ..


My personal take on it is *adding* a dependency to keep functionality 
the same isn't an improvement.  JDK logging, while not with as many 
bells and whistles as Commons Logging, log4j, etc, is plenty good enough 
and keeps us away from many of logging JARmageddon headaches.


I'm not against a logging change should others have different opinions 
with a strong case of improvement.


Erik





Re: Need question to configure Log4j for solr

2007-07-12 Thread Siegfried Goeschl

Hi folks,

would using commons-logging be an improvement? It is a common 
requirement to hook up a different logging infrastructure ...


Cheers,

Siegfried Goeschl

Erik Hatcher wrote:


On Jul 11, 2007, at 9:07 PM, solruser wrote:
How do I configure solr to use log4j logging. I am able to configure 
tomcat
5.5.23 to use log4j. But I could not get solr to use log4j. I have 3 
context

of solr running in tomcat which refers to war file in commons.


Solr uses standard JDK logging.  I'm sure it could be bridged to log4j 
somehow, but rather I'd recommend you just configure JDK logging how 
you'd like.


Erik





Re: How to use bit fields to narrow a search

2007-06-26 Thread Siegfried Goeschl

Hi Yonik,

looks interesting - I'll give it a try 

Cheers,

Siegfried Goeschl

Yonik Seeley wrote:

On 6/26/07, Siegfried Goeschl <[EMAIL PROTECTED]> wrote:

Hi folks,

I'm currently evaluating SOLR to implement fulltext search and within 8
hours I have my content imported and able to benchmark the queries ... 
:-)


As a beginner with Lucence/SOLR I have a problem where to add my
"special features" - little bit overloaded with "Lucene in Action" and
SOLR over the weekend ...

Some background ...

+) I have 4 millions document indexed
+) each document has 3 long variables (stored but not indexed)
representing a 64 bit mask each
+) I have to filter the Hits based on the bit mask using BIT AND with
application supplied parameters

Any suggestions/ideas where to add this processing within SOLR ...


Due to the nature of an inverted index, it could actually be more
efficient to store the bits separately.  You could also then use Solr
w/o any custom java code.

Index a field called bits, which just contains the bit numbers set,
separated by whitespace.
At query time, use filters on the required bit numbers:
q=foo&fq=bits:12&fq=bits:45

-Yonik
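
[A small sketch of that "store the bit numbers, not the mask" idea, assuming a 
multiValued integer field named bits (as in the example above) and a SolrJ 
client; no custom Solr-side code is involved.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrInputDocument;

public class BitMaskAsTokens {

    // Index time: turn a 64-bit mask into individual bit-number values
    // for the multiValued "bits" field.
    static void addBits(SolrInputDocument doc, long mask) {
        for (int bit = 0; bit < 64; bit++) {
            if ((mask & (1L << bit)) != 0) {
                doc.addField("bits", bit);
            }
        }
    }

    // Query time: one filter query per required bit, so the inverted index
    // does the AND instead of the application post-filtering the hits.
    static SolrQuery requireBits(String userQuery, int... requiredBits) {
        SolrQuery q = new SolrQuery(userQuery);
        for (int bit : requiredBits) {
            q.addFilterQuery("bits:" + bit);
        }
        return q;
    }

    public static void main(String[] args) {
        SolrInputDocument doc = new SolrInputDocument();
        addBits(doc, 10L);   // binary 1010 -> sets bits 1 and 3
        System.out.println(doc);
        System.out.println(requireBits("foo", 1, 3));
    }
}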




How to use bit fields to narrow a search

2007-06-26 Thread Siegfried Goeschl

Hi folks,

I'm currently evaluating SOLR to implement fulltext search and within 8
hours I have my content imported and able to benchmark the queries ... :-)

As a beginner with Lucene/SOLR I have a problem where to add my
"special features" - a little bit overloaded with "Lucene in Action" and 
SOLR over the weekend ...


Some background ...

+) I have 4 million documents indexed
+) each document has 3 long variables (stored but not indexed)
representing a 64 bit mask each
+) I have to filter the Hits based on the bit mask using BIT AND with
application supplied parameters

Any suggestions/ideas where to add this processing within SOLR ...

Thanks in advance

Siegfried Goeschl