RE: LowerCaseFilterFactory burns CPU

2015-07-09 Thread Reitzel, Charles
Combining under new subject to reflect new question.

Took a quick look at both the LowerCaseFilter and the Java implementation it
uses.  A perfect hash would be much faster and, since LowerCaseFilter does not
consider locale, applicable.

ICUFoldingFilter is a somewhat different animal.   But I take your point: for
an indexed/searchable field (vs. stored/returned) that may contain accented
characters from a wider variety of locales, it makes a lot of sense.  It seems
like a single filter would perform the tasks that we use 3 filters to do.

Have you ever looked at the ICU internals?   Are they fairly efficient wrt 
character attributes and folding?   I used ICU C++ libs a while back and they 
were never a bottleneck, but that doesn't mean they wouldn't be in this context 
or that the Java libs have the same performance characteristics.

We use the following for US-only content (which may be a common use case):

<fieldType name="text_search" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
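
For comparison, a sketch of the collapsed chain mentioned above, with
LowerCaseFilterFactory and ASCIIFoldingFilterFactory replaced by the single
ICUFoldingFilterFactory (assumes the ICU contrib jars are on the classpath;
the field type name is illustrative):

<fieldType name="text_search_icu" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>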

Thus my interest in the question.

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Thursday, July 09, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Do I really need copyField when my app can do the copy?

I don't know what the CPU usage is like compared to LCF, but I use 
ICUFoldingFilterFactory instead.  This does several things in one pass, 
including lowercasing (which it calls case folding), and it is aware of all the
characters in Unicode.

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory

The ICU classes require additional jars to be loaded into Solr before they will 
work.

Thanks,
Shawn

-Original Message-
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] 
Sent: Thursday, July 09, 2015 9:47 AM
To: solr-user@lucene.apache.org
Subject: RE: LowerCaseFilterFactory burns CPU

That should be fixable.   In a past life, I generated a perfect hash to fold 
case for Unicode in a locale-neutral manner and it was very fast.   If I 
remember right, there are only about 2500 Unicode characters that can be case 
folded at all.  So the generated, collision-free hash function was very small 
and fast and the lookup table was small.

I used Bob Jenkins' tool suite for a C application.
http://burtleburtle.net/bob/hash/perfect.html

But there are a number of other open source tools available.   Bob Jenkins 
currently recommends this one by Botelho and Ziviani: 
http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf


-Original Message-
From: Nir Barel [mailto:ni...@checkpoint.com] 
Sent: Thursday, July 09, 2015 4:35 AM
To: solr-user@lucene.apache.org
Subject: RE: Do I really need copyField when my app can do the copy?

Hi,

I want to add a question regarding copyField and LowerCaseFilterFactory. We
notice that LowerCaseFilterFactory takes a huge part of the CPU ( via profiling )
for the text field. Can we avoid it or improve that implementation? ( keeping
the case-insensitive search )

Best Regards,
Nir Barel 



Re: Do I really need copyField when my app can do the copy?

2015-07-09 Thread Shawn Heisey
On 7/9/2015 2:35 AM, Nir Barel wrote:
 I want to add a question regarding copyField and LowerCaseFilterFactory.
 We notice that LowerCaseFilterFactory takes a huge part of the CPU ( via
 profiling ) for the text field.
 Can we avoid it or improve that implementation? ( keeping the case-insensitive
 search )

I don't know what the CPU usage is like compared to LCF, but I use
ICUFoldingFilterFactory instead.  This does several things in one pass,
including lowercasing (which it calls case folding), and it is aware of
all the characters in Unicode.

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory

The ICU classes require additional jars to be loaded into Solr before
they will work.

Thanks,
Shawn



RE: Do I really need copyField when my app can do the copy?

2015-07-09 Thread Reitzel, Charles


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Thursday, July 09, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Do I really need copyField when my app can do the copy?

On 7/9/2015 2:35 AM, Nir Barel wrote:
 I want to add a question regarding copyField and
 LowerCaseFilterFactory. We notice that LowerCaseFilterFactory takes a
 huge part of the CPU ( via profiling ) for the text field. Can we avoid
 it or improve that implementation? ( keeping the case-insensitive
 search )

I don't know what the CPU usage is like compared to LCF, but I use 
ICUFoldingFilterFactory instead.  This does several things in one pass, 
including lowercasing (which it calls case folding), and it is aware of all the
characters in Unicode.

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory

The ICU classes require additional jars to be loaded into Solr before they will 
work.

Thanks,
Shawn




[JOB] Financial search engine company AlphaSense is looking for Search Engineers

2015-07-09 Thread Dmitry Kan
Company: AlphaSense https://www.alpha-sense.com/
Position: Search Engineer

AlphaSense is a one-stop financial search engine for financial research
analysts all around the world.

AlphaSense is looking for Search Engineers experienced with Lucene / Solr
and search architectures in general. Positions are open in Helsinki (
http://www.visitfinland.com/helsinki/).

Daily routine topics for our search team:

1. Sharding
2. Commit vs query performance
3. Performance benchmarking
4. Custom query syntax, lucene / solr grammars
5. Relevancy
6. Query optimization
7. Search system monitoring: cache, RAM, throughput etc
8. Automatic deployment
9. Internal tool development

We have evolved the system through a series of Solr releases starting from
1.4 to 4.10.

Requirements:

1. Core Java + web services
2. Understanding of distributed search engine architecture
3. Java concurrency
4. Understanding of performance issues and their solutions
5. Clean and beautiful code + design patterns

Our search team members are active in the open source search scene; in
particular we support and develop the Luke toolbox
(https://github.com/dmitrykey/luke), participate in search / OS conferences
(Lucene Revolution, ApacheCon, Berlin Buzzwords), and review books on Solr.

Send your CV over and let's have a chat.

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


RE: LowerCaseFilterFactory burns CPU

2015-07-09 Thread Reitzel, Charles
That should be fixable.   In a past life, I generated a perfect hash to fold 
case for Unicode in a locale-neutral manner and it was very fast.   If I 
remember right, there are only about 2500 Unicode characters that can be case 
folded at all.  So the generated, collision-free hash function was very small 
and fast and the lookup table was small.

I used Bob Jenkins' tool suite for a C application.
http://burtleburtle.net/bob/hash/perfect.html

But there are a number of other open source tools available.   Bob Jenkins 
currently recommends this one by Botelho and Ziviani: 
http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf
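
To make the idea concrete, a minimal, hypothetical Java sketch of table-driven
case folding (not Lucene's actual code; a real generator would emit the table
from the Unicode CaseFolding.txt data and replace the HashMap with a perfect
hash):

import java.util.HashMap;
import java.util.Map;

final class SimpleCaseFolder {
    // Precomputed fold table; only code points that actually fold get an entry.
    private static final Map<Integer, Integer> FOLD = new HashMap<Integer, Integer>();
    static {
        for (int cp = 'A'; cp <= 'Z'; cp++) {
            FOLD.put(cp, cp + 32);    // ASCII A-Z -> a-z
        }
        FOLD.put(0x00C0, 0x00E0);     // LATIN CAPITAL LETTER A WITH GRAVE -> small
    }

    // Unmapped code points pass through unchanged.
    static int fold(int codePoint) {
        Integer mapped = FOLD.get(codePoint);
        return mapped != null ? mapped : codePoint;
    }
}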


-Original Message-
From: Nir Barel [mailto:ni...@checkpoint.com] 
Sent: Thursday, July 09, 2015 4:35 AM
To: solr-user@lucene.apache.org
Subject: RE: Do I really need copyField when my app can do the copy?

Hi,

I want to add a question regarding copyField and LowerCaseFilterFactory. We
notice that LowerCaseFilterFactory takes a huge part of the CPU ( via profiling )
for the text field. Can we avoid it or improve that implementation? ( keeping
the case-insensitive search )

Best Regards,
Nir Barel 


-Original Message-
From: Petersen, Robert [mailto:robert.peter...@rakuten.com]
Sent: Thursday, July 09, 2015 1:59 AM
To: solr-user@lucene.apache.org
Subject: RE: Do I really need copyField when my app can do the copy?

Perhaps some people, like those using DIH to feed their index, might not
have that luxury, and copyField is the better way for them.  If you have an
application you can do it either way.  I have done both ways in different
situations.

Robi

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Wednesday, July 08, 2015 3:38 PM
To: solr-user@lucene.apache.org
Subject: Do I really need copyField when my app can do the copy?

Hi Everyone,

What good is the use of copyField in Solr's schema.xml if my application can do
the copy into the designated field?  Having my application do so helps me
simplify the schema.xml maintenance task, thus my motivation.

Thanks

Steve




RE: Spell checking the synonym list?

2015-07-09 Thread Dyer, James
Ryan,

If you use index-time synonyms on the spellcheck field, this will give you what 
you want.

For instance, if the document has "lawyer" and you index both terms
"lawyer,attorney", then the spellchecker will see that "atorney" is 1 edit
away from an indexed term and will suggest "attorney".

You'll need to have the same synonyms set up against the query field, but you 
have the option of making these query-time synonyms if you prefer.
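
A sketch of what this could look like in schema.xml (field type and file names
illustrative; the index analyzer expands synonyms so both terms get indexed,
assuming a synonyms.txt line such as "lawyer,attorney"):

<fieldType name="text_spell" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>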

James Dyer
Ingram Content Group

-Original Message-
From: Ryan Yacyshyn [mailto:ryan.yacys...@gmail.com] 
Sent: Thursday, July 09, 2015 2:28 AM
To: solr-user@lucene.apache.org
Subject: Spell checking the synonym list?

Hi all,

I'm wondering if it's possible to have spell checking performed on terms in
the synonym list?

For example, let's say I have documents with the word "lawyer" in them and
I add "lawyer, attorney" in the synonyms.txt file. Then a query is made for
the word "atorney". Is there any way to provide spell checking on this?

Thanks,
Ryan


RE: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-09 Thread Allison, Timothy B.
Concur on both points.  You can also use PDFBox's app ExtractText with 
-startPage and -endPage parameters: 
https://pdfbox.apache.org/1.8/commandline.html#extractText 
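
For example, a hypothetical invocation that skips page 1 (jar version
illustrative):

java -jar pdfbox-app-1.8.9.jar ExtractText -startPage 2 input.pdf output.txt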

-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Thursday, July 09, 2015 3:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Can I instruct the Tika Entity Processor to skip the first page 
using the DIH?

On 08/07/2015 20:39, Allison, Timothy B. wrote:
 Unfortunately, no.  We can't even do that now with straight Tika.  I
 imagine this is for pdf files?  If you'd like to add this as a
 feature, please submit a ticket over on Tika.

Another alternative is to pre-process the PDF files to remove the first 
page. I've used the command line version of PDFtk for this kind of thing 
in the past: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

I'd also recommend using Tika outside Solr rather than via the DIH: 
certain nasty PDFs can kill Tika, which then can kill Solr.

Charlie

 -Original Message- From: Paden [mailto:rumsey...@gmail.com]
 Sent: Wednesday, July 08, 2015 12:14 PM To:
 solr-user@lucene.apache.org Subject: Can I instruct the Tika Entity
 Processor to skip the first page using the DIH?

 Hello, I'm using the DIH to import some files from one of my local
 directories. However, every single one of these files has the same
 first page. So I want to skip that first page in order to optimize
 search.

 Can this be accomplished by an instruction within the
 dataimporthandler or, if not, how could you do this?



 -- View this message in context:
 http://lucene.472066.n3.nabble.com/Can-I-instruct-the-Tika-Entity-Processor-to-skip-the-first-page-using-the-DIH-tp4216373.html


Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: SolrJ/Tika custom indexer not indexing CERTAIN .doc text?

2015-07-09 Thread Paden
Haha no need to reinvent wheels. Especially when you don't know java. Just a
prototype anyway.

I made a very strong assumption that it was pulling the text as blank
because I would copy the EXACT same text from one file in the file system
and put it into another file under a different name, but instead of it showing
as

{
Author: Some author
text: blank
}

it would show as

{
Author: Some author
text: text that should have shown up in the other file but appeared as
blank
}

But I'm more familiar with Solr now than I was about 4 weeks ago, so I'll
run that debugger and see if I can find something that's a problem. I just
find it weird that it was ONLY .doc files, and when I put it into another
.doc it actually pulled. Thanks for the post and let me know if there's any
new info I should know.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-Tika-custom-indexer-not-indexing-CERTAIN-doc-text-tp4216541p4216576.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data from cataloggroup and catalogentry cores

2015-07-09 Thread Erick Erickson
You can try using the shards parameter. The problem will be, though,
that the score calculations may not really be comparable...
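
For example, a hypothetical single request (host, port and query illustrative):

http://localhost:8983/solr/CatalogEntry/select?q=*:*&shards=localhost:8983/solr/CatalogEntry,localhost:8983/solr/CatalogGroup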

Best,
Erick

On Thu, Jul 9, 2015 at 3:40 AM, santosh sidnal sidnal.sant...@gmail.com wrote:
 Hi All,

 Is there a way to get combined data from 2 different cores together in a
 single call?


 Like data from both CatalogEntry and CatalogGroup cores in a single call
 to Solr.



 --
 Regards,
 Santosh Sidnal


Re: SolrJ/Tika custom indexer not indexing CERTAIN .doc text?

2015-07-09 Thread Erick Erickson
Wow, that code looks familiar ;)...

Anyway, what have you tried?
bq: It would pull it but when I got the results in Solr it would look
blank

How do you know this? Do _some_ docs have text in Solr but some
don't or are all of your text fields blank? In this case I suspect
you're not storing the data.

What I'd do is isolate just the one file and look at the processing in
the debugger to see if any text is extracted. Then I'd look at the doc
in Word (or whatever) to ensure that there _is_ text in it. Then...

Perhaps the program is swallowing the error. Perhaps the file is
malformed and isn't being analyzed appropriately. Perhaps the
file isn't there at all.

And sending one doc to Solr at a time isn't very efficient, but perhaps
some of your files are so big that it's better that way.

Best,
Erick

On Thu, Jul 9, 2015 at 6:36 AM, Paden rumsey...@gmail.com wrote:
 I posted the code anyway just forgot to get rid of that line in the post.
 Sorry



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SolrJ-Tika-custom-indexer-not-indexing-CERTAIN-doc-text-tp4216541p4216542.html
 Sent from the Solr - User mailing list archive at Nabble.com.


How to determine cache setting in Solr Search Instance

2015-07-09 Thread wwang525
Hi All,

I did a load test with a total of 800 requests (at 40 concurrent requests
per second) to be executed against Solr index with 14 M records. Performance
was good (< 1 second), especially after a short period of time of the test.
BTW, the second round of load test was even better.

The local machine has about 15 GB of free memory during the load test.

I observed the following from the stats page:

(1) documentCache reached the configured size for documentCache with a hit
ratio of 0.66
(2) filterCache has 2519 hits with a hit ratio of 0.63. The size is 1465
(less than a configured size: 16384)
(3) queryResultCache has a hit ratio of 0
(4) fieldValueCache has a hit ratio of 0

The following are the cache configuration in solrconfig.xml

 <documentCache class="solr.LRUCache"
                size="16384"
                initialSize="512"
                autowarmCount="0"/>


 <filterCache class="solr.LRUCache"
              size="16384"
              initialSize="4096"
              autowarmCount="256"/>


 <queryResultCache class="solr.LRUCache"
                   size="16384"
                   initialSize="4096"
                   autowarmCount="256"/>

It looks like I need to increase the size of documentCache. The hit ratio of
zero for queryResultCache and fieldValueCache was surprising. Is it
possible that this is due to randomly generated requests?

What are some guidelines for tuning the cache parameters?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-determine-cache-setting-in-Solr-Search-Instance-tp4216562.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to determine cache setting in Solr Search Instance

2015-07-09 Thread Erick Erickson
It's actually unlikely that increasing the documentCache will
help materially. It's primarily so various components won't
have to fetch the documents off disk for a _single_ request.
I've heard some anecdotal evidence that it helps in some situations,
but that's been rare in my experience.

Your filterCache is a bomb waiting to go off.
Each filterCache entry can be up to maxDoc/8 bytes long, so if
you don't re-open searchers (thus causing it to flush), that cache
could eventually grow to on the order of 28G (14 M docs / 8 ≈ 1.75 MB per
entry, times 16384 entries ≈ 28 GB). The autowarm count is also quite
high. I'd start by knocking these back a _lot_. Also, look at why
your hit ratio is this low. If you have a lot of fq clauses that you _know_
aren't going to be re-used, use {!cache=false}. If you are using
NOW, that's an anti-pattern, see:
http://lucidworks.com/blog/date-math-now-and-filter-queries/
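
As an illustration (field names hypothetical), a one-off filter can be marked
uncached, and date math can be rounded so the cache entry is actually reusable:

fq={!cache=false}user_id:12345
fq=timestamp:[NOW/DAY-7DAYS TO NOW/DAY]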

queryResultCache is largely used for paging; having 0 hits
is a good thing for stress tests since it ensures that you're not
getting false stats because of this. But there's no good reason
to have this at a 16K max. It will be significantly smaller than
the filterCache, but still a waste. Again, the autowarm count is
very high. Taking it back to 32 or so is where I'd start.

bq: especially after a short period of time of the test.

right, this is due to filling up the lower-level caches. If you haven't
restarted your system then the autowarming will fill them up. But you
can also specify newSearcher and firstSearcher queries to do the same
thing at far less expense than executing 256 queries every time you
open a new searcher.
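
A sketch of such a warming entry in solrconfig.xml (the query itself is
illustrative):

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="fq">category:popular</str>
    </lst>
  </arr>
</listener>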

In summary, your settings are far too high for most people, I'd start
by examining the usage rather than making them bigger.

Best,
Erick

On Thu, Jul 9, 2015 at 8:48 AM, wwang525 wwang...@gmail.com wrote:
 Hi All,

 I did a load test with a total of 800 requests (at 40 concurrent requests
 per second) to be executed against Solr index with 14 M records. Performance
 was good (< 1 second), especially after a short period of time of the test.
 BTW, the second round of load test was even better.

 The local machine has about 15 GB of free memory during the load test.

 I observed the following from the stats page:

 (1) documentCache reached the configured size for documentCache with a hit
 ratio of 0.66
 (2) filterCache has 2519 hits with a hit ratio of 0.63. The size is 1465
 (less than a configured size: 16384)
 (3) queryResultCache has a hit ratio of 0
 (4) fieldValueCache has a hit ratio of 0

 The following are the cache configuration in solrconfig.xml

  <documentCache class="solr.LRUCache"
                 size="16384"
                 initialSize="512"
                 autowarmCount="0"/>


  <filterCache class="solr.LRUCache"
               size="16384"
               initialSize="4096"
               autowarmCount="256"/>


  <queryResultCache class="solr.LRUCache"
                    size="16384"
                    initialSize="4096"
                    autowarmCount="256"/>

 It looks like I need to increase the size of documentCache. The hit ratio of
 zero for queryResultCache and fieldValueCache was surprising. Is it
 possible that this is due to randomly generated requests?

 What are some guidelines for tuning the cache parameters?

 Thanks



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-determine-cache-setting-in-Solr-Search-Instance-tp4216562.html
 Sent from the Solr - User mailing list archive at Nabble.com.


RE: Problem with distributed search using grouping and highlighting

2015-07-09 Thread Cario, Elaine
Rich,

I've run into various problems with group.query and highlighting.  You noted 
one below (SOLR-5046), and there is also SOLR-6712, which might be related to 
what you are experiencing.  Still waiting for that patch to be reviewed...

-Original Message-
From: Rich Hume [mailto:rh...@identifix.com] 
Sent: Monday, June 08, 2015 2:23 PM
To: solr-user@lucene.apache.org
Subject: Problem with distributed search using grouping and highlighting

I am currently using Solr 4.5.1.  In the hopes of seeing better query 
performance, I have sharded an index and I am trying to use the shards 
parameter along with grouping and highlighting.  I am not currently using Solr 
cloud.

I got past an earlier problem by adding a second sort parameter (as described
in JIRA SOLR-5046).  Unfortunately, I have found nothing related to my latest
index out of bounds problem.  I do not believe that JIRA SOLR-5709 is related,
since my unique keys are in fact unique across the shards.

If anyone can point out something that I am doing wrong it would be greatly 
appreciated.

Thanks,
Rich

I am seeing the following error, the parameters I am passing are below the 
stack trace.

null:java.lang.ArrayIndexOutOfBoundsException: 35
 at 
org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:185)
 at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:317)
 at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
 at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)


Here are the parameters I am passing:

group=true&group.offset=0&group.limit=10&group.field=DeDup
group.query=DocumentTypes:34&group.query=DocumentTypes:35&group.query=DocumentTypes:32
shards=localhost:8983/solr/IX1,localhost:8983/solr/IX2
fq=+DocumentTypes:(34 35 32)
defType=edismax&qf=csTitle^100 csContent&q=any matching search string
start=0&rows=10
fl=PageNumber,FilePath,DocumentGUID,ResultDisplayContent,DocumentTypes
sort=score desc,DocumentGUID asc
hl=on
hl.fl=csTitle,csContent




Re: SolrJ/Tika custom indexer not indexing CERTAIN .doc text?

2015-07-09 Thread Erick Erickson
I rather doubt that it's a Solr issue. Text is text after all. If
some docs display text, then it's probably a matter of not
getting the text in the first place.

My _guess_ is that you're not getting any text at all from
the document. Either the document isn't being found
or it's not a form that Tika expects (perhaps the file's extension has
been changed and it's really a LibreOffice file). Or Tika has a bug.
Or your database doesn't have a value for TextContentURL. Or...

So, since you know the name of the file in question, what I'd do is
print out what text you get from it to try to put in the Solr doc and go
from there.

Best,
Erick

On Thu, Jul 9, 2015 at 9:59 AM, Paden rumsey...@gmail.com wrote:
 Haha no need to reinvent wheels. Especially when you don't know java. Just a
 prototype anyway.

 I made a very strong assumption that it was pulling the text as blank
 because I would copy the EXACT same text from one file in the file system
 and put it into another file under a different name, but instead of it showing
 as

 {
 Author: Some author
 text: blank
 }

 it would show as

 {
 Author: Some author
 text: text that should have shown up in the other file but appeared as
 blank
 }

 But I'm more familiar with Solr now than I was about 4 weeks ago, so I'll
 run that debugger and see if I can find something that's a problem. I just
 find it weird that it was ONLY .doc files, and when I put it into another
 .doc it actually pulled. Thanks for the post and let me know if there's any
 new info I should know.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SolrJ-Tika-custom-indexer-not-indexing-CERTAIN-doc-text-tp4216541p4216576.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Solr Grouping - sorting groups based on the sum of the scores of the documents within each group

2015-07-09 Thread Emilio Borraz
Hi, I'm having a similar use case and am still looking for a solution. I have
posted a question about it on Stack Overflow (
http://stackoverflow.com/questions/31281640/sum-field-and-sort-on-solr ).

Did you solve it ?

Regards.

-- 

Emilio Borraz

*Back-end Developer*

emilio.bor...@sonatasmx.com


Re: How to determine cache setting in Solr Search Instance

2015-07-09 Thread wwang525
Hi,

The real production requests will not be randomly generated, and a lot of
requests will be repeated. I think the performance will be better due to the
repeated requests. In addition, I am sure the configuration will need to be
adjusted once the application is in production.

For the time being, I can drop the size of filterCache to 4096 or 2048 since
it is now only 1465 in the stats page.

I forgot to mention that the size I saw in the stats page for documentCache
is already 16384 after the test, and this is the configured size in
solrconfig.xml. This is why I was asking if I need to raise the number in
the configuration.

Is there any issue, or will there be any performance improvement, if I raise
the size for documentCache?

Thanks






--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-determine-cache-setting-in-Solr-Search-Instance-tp4216562p4216591.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: LogTransformer

2015-07-09 Thread Jagdish Vasani
One thing I noted is that you need to give the full package name when
specifying the transformer.
For example, I have added the below:
<entity transformer="org.apache.solr.handler.dataimport.LogTransformer" ...>

Hope this will help you.

Thanks,
Jagdish
-Original Message-
From: Midas A [mailto:test.mi...@gmail.com] 
Sent: Friday, July 10, 2015 11:08 AM
To: solr-user@lucene.apache.org
Subject: LogTransformer

I want to log the query running through DIH. Should I use LogTransformer to do
that?


<entity transformer="LogTransformer" logTemplate="Query: ${products.query}"
  logLevel="info" name="products" pk="product_id" query="SELECT p.product_id

In the log I am getting the text "Query:" but not the query variable.

My Solr version: 4.2


Please correct me on what is wrong, or suggest other ways to do this.

Regards,
Abhishek


LogTransformer

2015-07-09 Thread Midas A
I want to log the query running through DIH. Should I use LogTransformer
to do that?


<entity transformer="LogTransformer" logTemplate="Query: ${products.query}"
  logLevel="info" name="products" pk="product_id" query="SELECT p.product_id

In the log I am getting the text "Query:" but not the query variable.

My Solr version: 4.2


Please correct me on what is wrong, or suggest other ways to do this.

Regards,
Abhishek


Get content in response from ExtractingRequestHandler

2015-07-09 Thread trung.ht
Hi everyone,

I use Solr to index and search in office files (docx, pptx, ...). To reduce
the size of the Solr index, I do not store the content of the file in Solr;
however, now my customer wants to preview the content of the file.

I have read the documentation of ExtractingRequestHandler, but it seems that to
return content in the response from Solr, the only option is to
set extractOnly=true, but in that case Solr would not index the file.

My question is: is there any way for Solr to extract the content with Tika,
index the content (without storing it), and then give me the content in the
response?

Thanks in advance, and sorry if my explanation is confusing.

Trung.


RE: Do I really need copyField when my app can do the copy?

2015-07-09 Thread Nir Barel
Hi,

I want to add a question regarding copyField and LowerCaseFilterFactory.
We notice that LowerCaseFilterFactory takes a huge part of the CPU ( via
profiling ) for the text field.
Can we avoid it or improve that implementation? ( keeping the case-insensitive
search )

Best Regards,
Nir Barel 


-Original Message-
From: Petersen, Robert [mailto:robert.peter...@rakuten.com] 
Sent: Thursday, July 09, 2015 1:59 AM
To: solr-user@lucene.apache.org
Subject: RE: Do I really need copyField when my app can do the copy?

Perhaps some people, like those using DIH to feed their index, might not
have that luxury, and copyField is the better way for them.  If you have an
application you can do it either way.  I have done both ways in different
situations.

Robi

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Wednesday, July 08, 2015 3:38 PM
To: solr-user@lucene.apache.org
Subject: Do I really need copyField when my app can do the copy?

Hi Everyone,

What good is the use of copyField in Solr's schema.xml if my application can do
the copy into the designated field?  Having my application do so helps me
simplify the schema.xml maintenance task, thus my motivation.

Thanks

Steve



Data from cataloggroup and catalogentry cores

2015-07-09 Thread santosh sidnal
Hi All,

Is there a way to get combined data from 2 different cores together in a
single call?


Like data from both CatalogEntry and CatalogGroup cores in a single call
to Solr.



-- 
Regards,
Santosh Sidnal


Spell checking the synonym list?

2015-07-09 Thread Ryan Yacyshyn
Hi all,

I'm wondering if it's possible to have spell checking performed on terms in
the synonym list?

For example, let's say I have documents with the word "lawyer" in them and
I add "lawyer, attorney" in the synonyms.txt file. Then a query is made for
the word "atorney". Is there any way to provide spell checking on this?

Thanks,
Ryan


Re: Protwords in solr spellchecker

2015-07-09 Thread davidphilip cherian
The best bet is to use solr.StopFilterFactory.
Have all such words added to stopwords.txt and add this filter to your
analyzer.

Reference links
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-StopFilter

HTH


On Thu, Jul 9, 2015 at 11:50 AM, Kamal Kishore Aggarwal 
kkroyal@gmail.com wrote:

 Hi Team,

 I am currently working with Java-1.7, Solr-4.8.1 with Tomcat 7. Is there
 any feature by which I can prevent the following words from appearing in spell
 suggestions?

 For example: somebody searches for "sexe"; I do not want to show him "sex" as
 the spell suggestion via Solr. How can I stop these types of keywords from being
 shown in suggestions?

 Any help is appreciated.


 Regards
 Kamal Kishore
 Solr Beginner



Re: Lost connection to Zookeeper

2015-07-09 Thread Eirik Hungnes
Hi,

We are facing the same issues on our setup: 3 ZK nodes, 1 shard, 10
collections, 1 replica, v. 5.0.0, default startup params.
Solr servers: 2-core CPU, 7 GB memory
Index size: 28 GB, 3 GB heap

This setup was running on v. 4.6 before upgrading to 5, without any of these
errors. The timeout seems to happen randomly and only to 1 of the replicas
(fortunately) at a time. Joe: did you get anywhere with the perf hints?
If not, any other tips appreciated.

null:org.apache.solr.common.SolrException: CLUSTERSTATUS the collection
time out:180s
at
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:630)
at
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:582)
at
org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:932)
at
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:256)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
at
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:736)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)

- Eirik


On Fri, Jun 5, 2015 at 15:58, Joseph Obernberger
j...@lovehorsepower.com wrote:

 Thank you Shawn!  Yes - it is now a Solr 5.1.0 cloud on 27 nodes and we
 use the startup scripts.  The current index size is 3.0T - about 115G
 per node - index is stored in HDFS which is spread across those 27 nodes
 and about (a guess) - 256 spindles.  Each node has 26G of HDFS cache
 (MaxDirectMemorySize) allocated to Solr.  Zookeeper storage is on local
 disk.  Solr and HDFS run on the same machines. Each node is connected to
 a switch over 1G Ethernet, but the backplane is 40G.
 Do you think the clusterstatus and the zookeeper timeouts are related to
 performance issues talking to HDFS?

 The JVM parameters are:
 -
 -DSTOP.KEY=solrrocks
 -DSTOP.PORT=8100
 -Dhost=helios
 -Djava.net.preferIPv4Stack=true
 -Djetty.port=9100
 -DnumShards=27
 -Dsolr.clustering.enabled=true
 -Dsolr.install.dir=/opt/solr
 -Dsolr.lock.type=hdfs
 -Dsolr.solr.home=/opt/solr/server/solr
 -Duser.timezone=UTC
 -DzkClientTimeout=15000
 -DzkHost=eris.querymasters.com:2181,daphnis.querymasters.com:2181,
 triton.querymasters.com:2181,oberon.querymasters.com:2181,
 portia.querymasters.com:2181,puck.querymasters.com:2181/solr5

 -XX:+CMSParallelRemarkEnabled
 -XX:+CMSScavengeBeforeRemark
 -XX:+ParallelRefProcEnabled
 -XX:+PrintGCApplicationStoppedTime
 -XX:+PrintGCDateStamps
 -XX:+PrintGCDetails
 -XX:+PrintGCTimeStamps
 -XX:+PrintHeapAtGC
 -XX:+PrintTenuringDistribution
 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:+UseConcMarkSweepGC
 -XX:+UseLargePages

Re: Solr cache when using custom scoring

2015-07-09 Thread amid
Mikhail,

We've now overridden the equals & hashCode of the custom query to use this new
param as well, and it works like a charm.

Thanks a lot,
Ami



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-cache-when-using-custom-scoring-tp4216419p4216496.html
Sent from the Solr - User mailing list archive at Nabble.com.


Protwords in solr spellchecker

2015-07-09 Thread Kamal Kishore Aggarwal
Hi Team,

I am currently working with Java-1.7, Solr-4.8.1 with Tomcat 7. Is there
any feature by which I can prevent the following words from appearing in spell
suggestions?

For example: somebody searches for "sexe"; I do not want to show him "sex" as
the spell suggestion via Solr. How can I stop these types of keywords from being
shown in suggestions?

Any help is appreciated.


Regards
Kamal Kishore
Solr Beginner


Re: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-09 Thread Charlie Hull

On 08/07/2015 20:39, Allison, Timothy B. wrote:

Unfortunately, no.  We can't even do that now with straight Tika.  I
imagine this is for pdf files?  If you'd like to add this as a
feature, please submit a ticket over on Tika.


Another alternative is to pre-process the PDF files to remove the first 
page. I've used the command line version of PDFtk for this kind of thing 
in the past: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/


I'd also recommend using Tika outside Solr rather than via the DIH: 
certain nasty PDFs can kill Tika, which then can kill Solr.


Charlie


-Original Message- From: Paden [mailto:rumsey...@gmail.com]
Sent: Wednesday, July 08, 2015 12:14 PM To:
solr-user@lucene.apache.org Subject: Can I instruct the Tika Entity
Processor to skip the first page using the DIH?

Hello, I'm using the DIH to import some files from one of my local
directories. However, every single one of these files has the same
first page. So I want to skip that first page in order to optimize
search.

Can this be accomplished by an instruction within the
dataimporthandler or, if not, how could you do this?



-- View this message in context:
http://lucene.472066.n3.nabble.com/Can-I-instruct-the-Tika-Entity-Processor-to-skip-the-first-page-using-the-DIH-tp4216373.html



Sent from the Solr - User mailing list archive at Nabble.com.





--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Ranking based on term position

2015-07-09 Thread JACK
Hi Li Li,

I am experiencing the same problem. Can you explain in a little more detail?
Where do I change these methods?
I am using Solr 5.0.0. And how do I query this? Is there any change while
querying?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Ranking-based-on-term-position-tp979271p4216522.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Too many Soft commits and opening searchers realtime

2015-07-09 Thread Alessandro Benedetti
Cool! So actually you were not using the default you defined in the
solrconfig, but it was loaded from a java environment property set to be
30000 ms?

Cheers

2015-07-09 4:21 GMT+01:00 Summer Shire shiresum...@gmail.com:

 Yonik, Mikhail, Alessandro

 After a lot of digging around and isolation, all you guys were right. I was
 using a property-based value and there was one place where it was 30 secs, and
 that was overriding my main props.

 Also, Yonik, thanks for the explanation on the realtime searcher. I wasn't
 sure if the maxWarmingSearchers error I was getting also had something to do
 with it.

 Thanks a lot

  On Jul 8, 2015, at 5:28 AM, Yonik Seeley ysee...@gmail.com wrote:
 
  A realtime searcher is necessary for internal bookkeeping / uses if a
  normal searcher isn't opened on a commit.
  This searcher doesn't have caches and hence doesn't carry the weight
  that a normal searcher would.  It's also invisible to clients (it
  doesn't change the view of the index for normal searches).
 
  Your hard autocommit at 8 minutes with openSearcher=false will trigger
  a realtime searcher to open every 8 minutes along with the hard
  commit.
 
  -Yonik
 
 
  On Tue, Jul 7, 2015 at 5:29 PM, Summer Shire shiresum...@gmail.com
 wrote:
  HI All,
 
  Can someone help me understand the following behavior.
  I have the following maxTimes on hard and soft commits
 
  yet I see a lot of Opening Searchers in the log
  org.apache.solr.search.SolrIndexSearcher - Opening Searcher@1656a258[main]
 realtime
  also I see a soft commit happening almost every 30 secs
  org.apache.solr.update.UpdateHandler - start
 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
   <autoCommit>
     <maxTime>480000</maxTime>
     <openSearcher>false</openSearcher>
   </autoCommit>
 
   <autoSoftCommit>
     <maxTime>180000</maxTime>
   </autoSoftCommit>
  I tried disabling softCommit by setting maxTime to -1.
  On startup solrCore recognized it and logged Soft AutoCommit: disabled
  but I could still see softCommit=true
  org.apache.solr.update.UpdateHandler - start
 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
   <autoSoftCommit>
     <maxTime>-1</maxTime>
   </autoSoftCommit>
 
  Thanks,
  Summer




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Do I really need copyField when my app can do the copy?

2015-07-09 Thread Alessandro Benedetti
Let me answer in line :

2015-07-09 9:35 GMT+01:00 Nir Barel ni...@checkpoint.com:

 Hi,

 I want to add a question regarding copyField and LowerCaseFilterFactory.
 We notice that LowerCaseFilterFactory takes a huge part of the CPU ( via
 profiling ) for the text field.
 Can we avoid it or improve that implementation? ( keeping the case-insensitive
 search )


If you want case-insensitive search you DO need the lower case
token filter, and I suggest using it.
Or possibly a tokenizer that already produces lowercase tokens ( for example,
the lower case tokenizer ).

To answer your second question, of course you can improve the lower case
token filter implementation!
I never checked it; I think it is already good, but if you believe you can
improve it, I encourage you to do so!

Cheers



 Best Regards,
 Nir Barel


 -Original Message-
 From: Petersen, Robert [mailto:robert.peter...@rakuten.com]
 Sent: Thursday, July 09, 2015 1:59 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Do I really need copyField when my app can do the copy?

 Perhaps some people, like those using DIH to feed their index, might
 not have that luxury, and copyField is the better way for them.  If you have
 an application you can do it either way.  I have done both ways in
 different situations.

 Robi

 -Original Message-
 From: Steven White [mailto:swhite4...@gmail.com]
 Sent: Wednesday, July 08, 2015 3:38 PM
 To: solr-user@lucene.apache.org
 Subject: Do I really need copyField when my app can do the copy?

 Hi Everyone,

 What good is the use of copyField in Solr's schema.xml if my application
 can do the copy into the designated field?  Having my application do so helps
 me simplify the schema.xml maintenance task, thus my motivation.

 Thanks

 Steve




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


RE: Spell checking the synonym list?

2015-07-09 Thread Reitzel, Charles
One of the uses of synonyms is to replace a mis-spelled query term with a 
correctly spelled value.

The 2 sided synonym file format allows you to control which values survive 
into the actual query.

lawyer, attorney, ambulance chaser, atorney, lowyor => lawyer, attorney

I am not aware, however, of any integration between synonym processing and a 
spellcheck dictionary.   Makes sense, though.   But I think additional metadata 
would be required, per dictionary entry, to govern synonym processing.   Thus, 
building the dictionary would not be a transparent/automatic process.

https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-SynonymFilter


-Original Message-
From: Ryan Yacyshyn [mailto:ryan.yacys...@gmail.com] 
Sent: Thursday, July 09, 2015 3:28 AM
To: solr-user@lucene.apache.org
Subject: Spell checking the synonym list?

Hi all,

I'm wondering if it's possible to have spell checking performed on terms in the 
synonym list?

For example, let's say I have documents with the word "lawyer" in them and I
add "lawyer, attorney" in the synonyms.txt file. Then a query is made for the
word "atorney". Is there any way to provide spell checking on this?

Thanks,
Ryan



Restore index API does not work in solr 5.1.0 ?

2015-07-09 Thread dinesh naik
Hi all,

How can we restore the index in Solr 5.1.0 ?

We did following:

1:- Started Solr Cloud from:

bin/solr start -e cloud -noprompt



2:- posted some documents to solr from examples folder using :

java -Dc=gettingstarted -jar post.jar *.xml



3:- Backed up the Index using:

http://localhost:8983/solr/gettingstarted/replication?command=backup



4:- Deleted 1 document using:

http://localhost:8983/solr/gettingstarted/update?stream.body=<delete><query>id:IW-02</query></delete>&commit=true



5:- restored the index using:

http://localhost:8983/solr/gettingstarted/replication?command=restore



The restore works fine with the same steps on 5.2, but not on 5.1.

Is there any other way to restore index in Solr 5.1.0?

-- 
Best Regards,
Dinesh Naik


SolrJ/Tika custom indexer not indexing CERTAIN .doc text?

2015-07-09 Thread Paden
Hello, 

I've been working to get a search engine up and running for a little while
now. I'm using Solr to index from both a database and a file system.
However, I'm using the filepath contained inside the database to find the
file in the filesystem and then merge the metadata in the DB and the
file system. I pretty much figured out I had two options: I could use the
DIH or I could create my own custom indexer in Java. I got pretty far on the
indexer, almost complete actually. But I defaulted to the DIH because it
indexed all the files I had at the time well.

Now I'm taking the project to the next stage of development and I'm worried
that the larger PDFs that I have to index might just kill Tika/Solr,
thereby stopping me in my tracks. So I want to have that custom indexer as a
backup. As I said, I got pretty far with the custom indexer but I encountered
one problem at the end. Tika wouldn't index the text of all the .doc
files. It would pull it, but when I got the results in Solr it would look
blank:

{
Author: Some name
text:
}

Some context: I got these files from a .zip that was given to me by another
department, so they were all sitting in a single file system. After trying a
few things I finally created a NEW .doc and copied the text from another
.doc file in the system to see if that would work. And it did. So it's not
that it wasn't indexing the text of .doc files. It was just THOSE .doc's I
was given in the .zip. I didn't request another zip with fresh files because
that would mean jumping through some hoops, but I wonder if I should. Now I
haven't posted the code because I don't feel like this is really a code issue.
I feel like it might be some bizarre file issue. I've posted the code below,
but really I was just wondering whether or not anyone has run into this
particular brand of problem before and how they solved it. I'm using a Linux
file system, so there's that, and ALL data except for the text comes from the
database. That means author, id, ...etc. all comes from the database. That's
why I could get the "Some name" author above in the response.


import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.pdfbox.pdmodel.PDDocument;

/* Tika jars need to be retrieved online */
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;

public class TikaSqlIndexer {

    private SolrClient server = new HttpSolrClient("http://localhost:8983/solr/Testcore3");
    private long _start = System.currentTimeMillis();
    private AutoDetectParser autoParser;
    private PDFParser pdfParser;
    private int _totalTika = 0;
    private int _totalSql = 0;

    private Collection<SolrInputDocument> _docs = new ArrayList<SolrInputDocument>();

    public static void main(String[] args) {
        try {
            TikaSqlIndexer idxer = new TikaSqlIndexer("http://localhost:8983/solr/Testcore3");
            //idxer.Index();
            idxer.doTikaDocuments(new File("/home/paden/Documents/LWP_Files/BIGDATA"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private TikaSqlIndexer(String url) throws IOException, SolrServerException {
        // creates a channel with the Solr server
        server = new HttpSolrClient(url);
        //server.setParser(new XMLResponseParser());
        autoParser = new AutoDetectParser();
        pdfParser = new PDFParser();
    }

    private void Index() throws SQLException, SolrServerException {
        Connection con = null;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            log("Driver Loaded ..");
            String URL = "jdbc:mysql://localhost:3306/EDMS_Metadata";

Re: SolrJ/Tika custom indexer not indexing CERTAIN .doc text?

2015-07-09 Thread Paden
I posted the code anyway just forgot to get rid of that line in the post.
Sorry



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-Tika-custom-indexer-not-indexing-CERTAIN-doc-text-tp4216541p4216542.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Windows Version

2015-07-09 Thread Allan Elkowitz
Thanks for all your help.  I decided to switch to Ubuntu Linux.

Allan Elkowitz
elkow...@alumni.caltech.edu


On Wednesday, July 8, 2015 1:44 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 7/7/2015 10:43 AM, Allan Elkowitz wrote:
 So I am a newbie at Solr and am having trouble getting the examples working 
 on Windows 7.
 I downloaded and unzipped the distribution and have been able to get Solr up 
 and running.  I can access the admin page.  However, when I try to follow the 
 instructions for loading the examples I find that there is a file that I am 
 supposed to have called post.jar which I cannot find in the directory 
 specified, exampledocs.  There is a file called post in another directory 
 but it does not seem to be a .jar file.
 Two questions:
 1.  Has this been addressed on some site that I am not yet aware of?
 2.  What am I missing here?

The post.jar file is in example\exampledocs in the Solr 5.2.1 download.

The bin\post file is a shell script for Linux/UNIX systems that offers
easier access to the SimplePostTool class included in the solr-core jar.
 Unfortunately, no Windows equivalent (post.cmd) exists yet.
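
For example (collection name illustrative), the jar can still be run directly
on Windows from the example\exampledocs directory:

java -Dc=gettingstarted -jar post.jar *.xml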

If you're getting the impression that Windows is a second-class citizen
around here, you are not really wrong.  A typical Solr user has found
that the free operating systems offer better performance and stability,
with the added advantage that they don't have to pay Microsoft a pile of
money in order to get useful work done.

Windows, especially the server operating systems, is a perfectly good
platform, but it's not free.

Thanks,
Shawn



  

Re: How to determine cache setting in Solr Search Instance

2015-07-09 Thread Erick Erickson
I'd examine the filter queries used to see whether they make sense as well.
You really have to re-tune after you start getting real user queries though
as anything you generate won't reflect reality. I'd start _much_ smaller, 512
or 1024 and work _up_ with real data.

Raising the document cache limit is not where I'd start. Given that the docs
will probably be held in MMapDirectory space (i.e., O/S system memory)
eventually anyway, all you're _really_ saving is the occasional decompression.

Best,
Erick

On Thu, Jul 9, 2015 at 10:58 AM, wwang525 wwang...@gmail.com wrote:
 Hi,

 The real production requests will not be randomly generated, and a lot of
 requests will be repeated. I think the performance will be better due to the
 repeated requests. In addition, I am sure the configuration will need to be
 adjusted once the application is in production.

 For the time being, I can drop the size of filterCache to 4096 or 2048 since
 it is now only 1465 in the stats page.

 I forgot to mention that the size I saw in the stats page for documentCache
 is already 16384 after the test, and this is the configured size in
 solrconfig.xml. This is why I was asking if I need to raise the number in
 the configuration.

 Is there any issue, or will there be any performance improvement, if I raise
 the size for documentCache?

 Thanks






 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-determine-cache-setting-in-Solr-Search-Instance-tp4216562p4216591.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to determine cache setting in Solr Search Instance

2015-07-09 Thread Shawn Heisey
On 7/9/2015 9:48 AM, wwang525 wrote:
 I did a load test with a total of 800 requests (at 40 concurrent requests
 per second) to be executed against Solr index with 14 M records. Performance
 was good (< 1 second), especially after a short period of time of the test.
 BTW, the second round of load test was even better.

 The local machine has about 15 GB of free memory during the load test.

 I observed the following from the stats page:

 (1) documentCache reached the configured size for documentCache with a hit
 ratio of 0.66
 (2) filterCache has 2519 hits with a hit ratio of 0.63. The size is 1465
 (less than a configured size: 16384)
 (3) queryResultCache has a hit ratio of 0
 (4) fieldValueCache has a hit ratio of 0

 The following are the cache configuration in solrconfig.xml

  <documentCache class="solr.LRUCache"
                 size="16384"
                 initialSize="512"
                 autowarmCount="0"/>


  <filterCache class="solr.LRUCache"
               size="16384"
               initialSize="4096"
               autowarmCount="256"/>


  <queryResultCache class="solr.LRUCache"
                    size="16384"
                    initialSize="4096"
                    autowarmCount="256"/>

 It looks like I need to increase the size of documentCache. The hit ratio of
 zero for queryResultCache and fieldValueCache was surprising. Is it
 possible that this is due to randomly generated requests?

I would say that a hitrate of 0.66 for your documentCache is pretty
good.  You can increase the size to try and make it better, but that
will use more java heap memory.  Note that I am not specifically talking
about the 15GB of free memory that you mentioned -- that's the operating
system, not Java.  Depending on what exactly is happening with Java's
memory management, that additional heap memory *might* come out of the
15GB, but it might not.

Cache sizes of 16384 are VERY large.  Your filterCache could actually be
dropped quite a bit, because it only has 1465 entries.  The
autowarmCount settings are a little high.  Executing 256 filters (which
would be required on every single commit) can be slow.

You may want to use FastLRUCache instead of LRUCache for documentCache
and filterCache, because your hitrate is good.  LRUCache is better when
the hitrate is lower.
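
For example, a sketch of a scaled-down configuration (the class name is real;
sizes and autowarm counts are illustrative, not a recommendation from this
thread):

<filterCache class="solr.FastLRUCache"
             size="2048"
             initialSize="512"
             autowarmCount="32"/>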

The fieldValueCache is an internal cache that Solr and Lucene use and
manage.  I don't know much about it.

If you don't send the same query more than once, then your
queryResultCache will have a hitrate of zero.

Thanks,
Shawn



Class loading problem from ${solr.solr.home}/lib in 5.2.1

2015-07-09 Thread Shawn Heisey
I was having a problem in a 4.x version of Solr and wanted to check
5.2.1 to see if it still had the same problem, so I copied my fieldType
into a 5.2.1 example schema.  My fieldType uses some ICU analysis
classes, so I also put the contrib jars into server/solr/lib.

I ran into a problem similar to SOLR-4852.

https://issues.apache.org/jira/browse/SOLR-4852

The Solr log shows the icu jars being loaded twice, and I got a class
not found exception for the ICU classes.

If I change the schema from the short factory class name
(solr.ICUTokenizerFactory) to the full class name
(org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory), then
it works.

I'm guessing that the root of the problem is the fact that the jars are
loaded twice ... but I cannot see any reason for them to be loaded
twice.  They are not mentioned in any <lib> directives in
solrconfig.xml, and there is no sharedLib declaration in solr.xml.  The
core that I created was using the techproducts sample config.

Should I open an issue for this problem?  I can provide a very clear
step-by-step minimal process for showing the problem with a stock 5.2.1
download.

Thanks,
Shawn