Problem of facet on 170M documents

2013-11-02 Thread Mingfeng Yang
I have an index with 170M documents, and two of the fields for each doc are
source and url. I want to know the top 500 most frequent urls from the
Video source.

So I did a facet query with
fq=source:Video&facet=true&facet.field=url&facet.limit=500, and the
matching documents number about 9 million.

The Solr cluster is hosted on two EC2 instances, each with 4 CPUs and 32GB of
memory; 16GB is allocated for the Java heap. There are 4 master shards on one
machine and 4 replicas on the other machine, connected together via ZooKeeper.

Whenever I issue the query above, the response just takes too long and the
client times out. Sometimes, when the end user is impatient, he/she may wait
a few seconds for the results, then kill the connection, and then issue the
same query again and again. The server then has to deal with multiple such
heavy queries simultaneously and gets so busy that we see a no server
hosting shard error, probably due to lost communication between the Solr
nodes and ZooKeeper.

Is there any way to deal with such a problem?

Thanks,
Ming


Re: Problem of facet on 170M documents

2013-11-02 Thread Sascha SZOTT
Hi Ming,

which Solr version are you using? In case you use one of the latest
versions (4.5 or above), try the new parameter facet.threads with a
reasonable value (4 to 8 gave me a massive performance speedup when
working with large facets, i.e. nTerms > 10^7).
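
For example, adding it to the facet request from the first mail would look
like this (an illustrative URL; the value of 4 is just a starting point):

  q=*:*&fq=source:Video&facet=true&facet.field=url&facet.limit=500&facet.threads=4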

-Sascha


Mingfeng Yang wrote:
 I have an index with 170M documents, and two of the fields for each
 doc is source and url.  And I want to know the top 500 most
 frequent urls from Video source.
 
 So I did a facet with 
 fq=source:Video&facet=true&facet.field=url&facet.limit=500, and
 the matching documents are about 9 millions.
 
 The solr cluster is hosted on two ec2 instances each with 4 cpu, and
 32G memory. 16G is allocated tfor java heap.  4 master shards on one
 machine, and 4 replica on another machine. Connected together via
 zookeeper.
 
 Whenever I did the query above, the response is just taking too long
 and the client will get timed out. Sometimes,  when the end user is
 impatient, so he/she may wait for a few second for the results, and
 then kill the connection, and then issue the same query again and
 again.  Then the server will have to deal with multiple such heavy
 queries simultaneously and being so busy that we got no server
 hosting shard error, probably due to lost communication between solr
 node and zookeeper.
 
 Is there any way to deal with such problem?
 
 Thanks, Ming
 


Re: Store Solr OpenBitSets In Solr Indexes

2013-11-02 Thread David Philip
Oh fine. The caution point was useful for me.
Yes, I wanted to do something similar to filter queries. It is not an XY
problem. I am simply trying to implement something as described below.

I have [non-clinical] group sets in the system, and I want to build a bitset
based on the documents belonging to each group and save it.
Then, while searching, I want to retrieve the corresponding bitset from the
Solr engine for the matched documents and execute a logical XOR. [Am I clear
with the problem explanation now?]


So what I am looking for is: if I have to retrieve a bitset instance from
the Solr search engine for the matched documents, how can I get it?
And how do I save the bit mapping for the documents belonging to a particular
group, to enable the XOR operation?

Thanks - David
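
For reference, a minimal sketch of what storing a group's bitset in a binary
field might look like (the SolrBitSets field name is from the earlier mail;
the binary field type, the byte packing, and keying the bits on stable group
member IDs instead of internal Lucene doc IDs are assumptions):

    import java.nio.ByteBuffer;
    import org.apache.lucene.util.OpenBitSet;
    import org.apache.solr.common.SolrInputDocument;

    // Build the group's bitset, keyed on your own stable IDs.
    OpenBitSet bits = new OpenBitSet();
    bits.set(0);
    bits.set(1000);

    // Pack the underlying long[] words into a byte[] for a binary field,
    // e.g. <field name="SolrBitSets" type="binary" indexed="false" stored="true"/>
    long[] words = bits.getBits();
    ByteBuffer buf = ByteBuffer.allocate(words.length * 8);
    for (long word : words) {
        buf.putLong(word);
    }

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("SolrBitSets", buf.array());

    // At query time, read the stored bytes back, unpack them into a long[],
    // wrap them with new OpenBitSet(longs, numWords), and XOR with another set.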

On Fri, Nov 1, 2013 at 5:05 PM, Erick Erickson erickerick...@gmail.com wrote:

 Why are you saving this? Because if the bitset you're saving
 has anything to do with, say, filter queries, it's probably useless.

 The internal bitsets are often based on the internal Lucene doc ID,
 which will change when segment merges happen, thus the caution.

 Otherwise, there's the binary type you can probably use. It's not very
 efficient since I believe it uses base-64 encoding under the covers
 though...

 Is this an XY problem?

 Best,
 Erick


 On Wed, Oct 30, 2013 at 8:06 AM, David Philip
  davidphilipshe...@gmail.com wrote:

  Hi All,
 
  What should be the field type if I have to save solr's open bit set value
  within solr document object and retrieve it later for search?
 
OpenBitSet bits = new OpenBitSet();
 
bits.set(0);
bits.set(1000);
 
doc.addField(SolrBitSets, bits);
 
 
  What should be the field type of  SolrBitSets?
 
  Thanks
 



Re: Background merge errors with Solr 4.4.0 on Optimize call

2013-11-02 Thread Erick Erickson
See: https://issues.apache.org/jira/browse/SOLR-5418

Thanks Matthew and Robert! I'll see if I can get to this this weekend.


On Wed, Oct 30, 2013 at 7:45 AM, Erick Erickson erickerick...@gmail.com wrote:

 Robert:

 Thanks. I'm on my way out the door, so I'll have to put up a JIRA with
 your patch later if it hasn't been done already

 Erick


 On Tue, Oct 29, 2013 at 10:14 PM, Robert Muir rcm...@gmail.com wrote:

 I think it's a bug, but that's just my opinion. I sent a patch to dev@
 for thoughts.

 On Tue, Oct 29, 2013 at 6:09 PM, Erick Erickson erickerick...@gmail.com
 wrote:
  Hmmm, so you're saying that merging indexes where a field
  has been removed isn't handled. So you have some documents
  that do have a what field, but your schema doesn't have it,
  is that true?
 
  It _seems_ like you could get by by putting the _what_ field back
  into your schema, just not sending any data to it in new docs.
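
  For illustration, a minimal re-declaration of the removed field in
  schema.xml might look like this (the text_general type is only an
  assumption):

    <field name="what" type="text_general" indexed="true" stored="false"/>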
 
  I'll let others who understand merging better than me chime in on
  whether this is a case that should be handled or a bug. I pinged the
  dev list to see what the opinion is
 
  Best,
  Erick
 
 
  On Mon, Oct 28, 2013 at 6:39 PM, Matthew Shapiro m...@mshapiro.net
 wrote:
 
  Sorry for reposting after I just sent in a reply, but I just looked at the
  error trace more closely and noticed:
 
 
 1. Caused by: java.lang.IllegalArgumentException: no such field what
 
 
  The 'what' field was removed by request of the customer, as they wanted
  the logic behind what gets queried in the what field to be on the code side
  instead of the solr side (for easier changing without having to re-index
  everything; I didn't feel strongly either way and since they are paying me,
  I took it out).
 
  This makes me wonder if it's crashing while merging because a field that
  used to be there is now gone.  However, this seems odd to me, as Solr
  doesn't even let me delete the old data and instead is leaving my
  collection in an extremely bad state, where the only remedy I can think
  of is to nuke the index at the filesystem level.
 
  If this is indeed the cause of the crash, is the only way to delete a
  field to completely empty your index first?
 
 
  On Mon, Oct 28, 2013 at 6:34 PM, Matthew Shapiro m...@mshapiro.net
 wrote:
 
   Thanks for your response.
  
   You were right, solr is logging to the catalina.out file for tomcat. When
   I click the optimize button in solr's admin interface the following logs
   are written: http://apaste.info/laup

   About JVM memory, solr's admin interface is listing JVM memory at 3.1%
   (221.7MB is dark grey, 512.56MB light grey and 6.99GB total).
  
  
   On Mon, Oct 28, 2013 at 6:29 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
  
   For Tomcat, the Solr log output often goes into catalina.out
   by default, so the output might be there. You can
   configure Solr to send the logs almost anywhere you
   please, but without some specific setup
   on your part the log output just goes to the default
   for the servlet container.
  
   I took a quick glance at the code but since the merges
   are happening in the background, there's not much
   context for where that error is thrown.
  
   How much memory is there for the JVM? I'm grasping
   at straws a bit...
  
   Erick
  
  
   On Sun, Oct 27, 2013 at 9:54 PM, Matthew Shapiro m...@mshapiro.net
  wrote:
  
    I am working on implementing Solr as the search backend for our web
    system.  So far things have been going well, but today I made some
    schema changes and now things have broken.
   
I updated the schema.xml file and reloaded the core (via the admin
interface).  No errors were reported in the logs.
   
    I then pushed 100 records to be indexed.  A call to Commit afterwards
    seemed fine, however my next call for Optimize caused the following
    errors:
   
java.io.IOException: background merge hit exception:
_2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into
 _37
[maxNumSegments=1]
   
null:java.io.IOException: background merge hit exception:
_2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into
 _37
[maxNumSegments=1]
   
   
    Unfortunately, googling for background merge hit exception came up
    with 2 things: a corrupt index or not enough free space.  The host
    machine that's hosting solr has 227 out of 229GB free (according to
    df -h), so that's not it.
   
   
I then ran CheckIndex on the index, and got the following results:
http://apaste.info/gmGU
   
   
    As someone who is new to solr and lucene, as far as I can tell this
    means my index is fine, so I am at a loss. I'm somewhat sure that I
    could probably delete my data directory and rebuild it, but I am more
    interested in finding out why it is having issues, what is the best
    way to fix it, and what is the best way to prevent it from happening
    when this goes into production.
   
   
Does anyone have any advice that may 

Re: Custom Plugin exception : Plugin init failure for [schema.xml]

2013-11-02 Thread Parvin Gasimzade
Hi Shawn,

Thank you for your answer. I have solved the problem.

The problem is that in our code the constructor of TurkishFilterFactory was
declared protected. That works without problems on the 3.x versions of Solr,
but gives the exception I mentioned here on the 4.x versions. By analyzing
the stack trace I saw that it throws an InstantiationException, and making
the constructor public solved the problem.
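
For reference, a minimal sketch of the fixed factory (the Map-args constructor
shown here assumes the Solr/Lucene 4.4+ TokenFilterFactory API, and
TurkishFilter stands in for our custom filter class):

    import java.util.Map;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.util.TokenFilterFactory;

    public class TurkishFilterFactory extends TokenFilterFactory {

        // Must be public: Solr instantiates factories reflectively, and the
        // protected constructor caused the failure described above.
        public TurkishFilterFactory(Map<String, String> args) {
            super(args);
        }

        @Override
        public TokenStream create(TokenStream input) {
            return new TurkishFilter(input);
        }
    }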


On Fri, Nov 1, 2013 at 6:34 PM, Shawn Heisey s...@elyograg.org wrote:

 On 11/1/2013 4:18 AM, Parvin Gasimzade wrote:
  I have a problem with custom plugin development in solr 4.x versions. I
  have developed a custom filter and am trying to install it, but I get the
  following exception.

 Later you indicated that you can use it with Solr 3.x without any problem.

 Did you recompile your custom plugin against the Solr jars from the new
 version?  There was a *huge* amount of java class refactoring that went
 into the 4.0 version as compared to any 3.x version, and that continues
 with each new 4.x release.

 I would bet that if you tried that recompile, it would fail due to
 errors and/or warnings, which you'll need to fix.  There might also be
 operational problems that the compiler doesn't find, due to changes in
 how the underlying APIs get used.

 Thanks,
 Shawn




Writing a Solr custom analyzer to post content to Stanbol {was: Need additional data processing in Data Import Handler prior to indexing}

2013-11-02 Thread Dileepa Jayakody
Hi All,

I went through possible solutions for my requirement of triggering a
Stanbol enhancement during Solr indexing, and I have simplified the
requirement.

I only need to process the field named content with the Stanbol
enhancement to extract Person and Organization entities.
So I think it will be easier to make the Stanbol request while indexing the
content field, after the data is imported (from DIH).

I think the best solution will be to write a custom Analyzer to process the
content and post it to Stanbol.
In the analyzer I also need to process the Stanbol enhancement response.
The response should be processed as a new document to index and store the
identified Person and Organization entities in a field called
extractedEntities.

So my current idea is as follows:

in the schema.xml

<copyField source="content" dest="stanbolRequest"/>

<field name="stanbolRequest" type="stanbolRequestType" indexed="true"
       stored="true" docValues="true" required="false"/>

<fieldType name="stanbolRequestType" class="solr.TextField">
  <analyzer class="MyCustomAnalyzer"/>
</fieldType>

In the MyCustomAnalyzer class the content will be posted to and enhanced
by Stanbol. The Person and Organization entities in the response should
be indexed into the Solr field extractedEntities.
Am I going down the correct path for my requirement? Please share your ideas.
Appreciate any relevant pointers to samples/documentation.
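
For comparison, here is a rough skeleton of the custom Update Request
Processor route suggested further down in this thread (StanbolEnhancementProcessor
is only a placeholder name, extractedEntities is the field described above,
and the actual HTTP call to Stanbol is left out):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class StanbolEnhancementProcessorFactory extends UpdateRequestProcessorFactory {

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                SolrQueryResponse rsp, UpdateRequestProcessor next) {
            return new StanbolEnhancementProcessor(next);
        }

        static class StanbolEnhancementProcessor extends UpdateRequestProcessor {

            StanbolEnhancementProcessor(UpdateRequestProcessor next) {
                super(next);
            }

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object content = doc.getFieldValue("content");
                if (content != null) {
                    // Send the content to Stanbol, parse the enhancement response,
                    // and add each identified entity to the extractedEntities field.
                    for (String entity : callStanbol(content.toString())) {
                        doc.addField("extractedEntities", entity);
                    }
                }
                super.processAdd(cmd); // continue with the rest of the chain
            }

            // Placeholder for the actual HTTP request to the Stanbol enhancer.
            private List<String> callStanbol(String text) {
                return Collections.emptyList();
            }
        }
    }

The factory would then be registered in an updateRequestProcessorChain in
solrconfig.xml and referenced from the indexing request handler.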

Thanks,
Dileea

On Wed, Oct 30, 2013 at 11:26 AM, Dileepa Jayakody 
dileepajayak...@gmail.com wrote:

 Thanks guys for your ideas.

 I will go through them and come back with questions.

 Regards,
 Dileepa


 On Wed, Oct 30, 2013 at 7:00 AM, Erick Erickson 
  erickerick...@gmail.com wrote:

 Third time tonight I've been able to paste this link

 Also, you can consider just moving to SolrJ and
 taking DIH out of the process, see:
 http://searchhub.org/2012/02/14/indexing-with-solrj/

 Whichever approach fits your needs of course.

 Best,
 Erick


 On Tue, Oct 29, 2013 at 7:15 PM, Alexandre Rafalovitch
  arafa...@gmail.com wrote:

  It's also possible to combine an Update Request Processor with DIH. That
  way, if a debug entry needs to be inserted, it could go through the same
  Stanbol process.

  Just define a processing chain in the DIH handler and write a custom URP to
  call out to the Stanbol web service. You have access to the full record in a
  URP, so you can add/delete/change the fields at will.
 
  Regards,
 Alex.
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality of nature that keeps events from happening all at
  once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)
 
 
  On Wed, Oct 30, 2013 at 4:09 AM, Michael Della Bitta 
  michael.della.bi...@appinions.com wrote:
 
   Hi Dileepa,
  
   You can write your own Transformers in Java. If it doesn't make sense to
   run Stanbol calls in a Transformer, maybe setting up a web service that
   grabs a record out of MySQL, sends the data to Stanbol, and displays the
   results could be used in conjunction with HttpDataSource rather than
   JdbcDataSource.
  
   http://wiki.apache.org/solr/DIHCustomTransformer
  
  
 
 http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2FHTTP_Datasource
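
   A bare-bones sketch of such a custom transformer might look like this
   (StanbolTransformer and the extractedEntities column are placeholders,
   and the Stanbol HTTP call itself is omitted):

       import java.util.Map;
       import org.apache.solr.handler.dataimport.Context;
       import org.apache.solr.handler.dataimport.Transformer;

       public class StanbolTransformer extends Transformer {
           @Override
           public Object transformRow(Map<String, Object> row, Context context) {
               Object content = row.get("content");
               if (content != null) {
                   // Send the content to Stanbol and add the extracted entities
                   // to the row before it is indexed (HTTP call omitted here).
                   row.put("extractedEntities", enhance(content.toString()));
               }
               return row;
           }

           // Placeholder for the call to the Stanbol enhancer.
           private String enhance(String text) {
               return text;
           }
       }

   It would then be referenced on the DIH entity, e.g.
   transformer="com.example.StanbolTransformer" in db-data-config.xml.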
  
   Michael Della Bitta
  
   Applications Developer
  
   o: +1 646 532 3062  | c: +1 917 477 7906
  
   appinions inc.
  
   “The Science of Influence Marketing”
  
   18 East 41st Street
  
   New York, NY 10017
  
   t: @appinions https://twitter.com/Appinions | g+:
   plus.google.com/appinions
  
 
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
   
   w: appinions.com http://www.appinions.com/
  
  
   On Tue, Oct 29, 2013 at 4:47 PM, Dileepa Jayakody 
   dileepajayak...@gmail.com
wrote:
  
Hi All,
   
 I'm a newbie to Solr, and I have a requirement to import data from a
 mysql database; enhance the imported content to identify Persons
 mentioned and index it as a separate field in Solr along with the other
 fields defined for the original db query.
   
 I'm using Apache Stanbol [1] for the content enhancement requirement.
 I can get enhancement results for 'Person' type data in the content as
 the enhancement result.
   
 The data flow will be:
 mysql-db -> Solr data-import handler -> Stanbol enhancer -> Solr index
   
 For the above requirement I need to perform additional processing at the
 data-import handler prior to indexing, to send a request to Stanbol and
 process the enhancement response. I found some related examples on
 modifying the mysql data import handler to customize the query results in
 db-data-config.xml by using a transformer script.
 As per my requirement, in the data-import-handler I need to send a request
 to Stanbol and process the response prior to indexing. But I'm not sure if
 this can be achieved using a simple javascript.
   
Is there any other better way 

Re: unable to load core after cluster restart

2013-11-02 Thread kaustubh147
Hi Shawn,

One thing I forgot to mention here is that the same setup (with no bootstrap)
is working fine in our QA1 environment. I did not have the bootstrap option
from the start; I added it thinking it would solve the problem.

Nonetheless I followed Shawn's instructions wherever they differed from my
old approach:
1. I moved my zkHost from the JVM options to solr.xml and added a chroot to it
(see the snippet below)
2. removed the bootstrap option
3. created collections with the URL template suggested (I had tried that
earlier too)
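
For reference, the zkHost entry with a chroot in a 4.4+-style solr.xml looks
roughly like this (the host names and the /solr chroot are placeholders):

    <solr>
      <solrcloud>
        <!-- other solrcloud settings stay as they are -->
        <str name="zkHost">zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr</str>
      </solrcloud>
    </solr>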

None of it worked for me... I am seeing the same errors. I am adding some
more logs from before and after the error occurs:


-

INFO  - 2013-11-02 17:40:40.427;
org.apache.solr.update.DefaultSolrCoreState; closing IndexWriter with
IndexWriterCloser
INFO  - 2013-11-02 17:40:40.428; org.apache.solr.core.SolrCore; [xyz]
Closing main searcher on request.
INFO  - 2013-11-02 17:40:40.431;
org.apache.solr.core.CachingDirectoryFactory; Closing
NRTCachingDirectoryFactory - 1 directories currently being tracked
INFO  - 2013-11-02 17:40:40.432;
org.apache.solr.core.CachingDirectoryFactory; looking to close
/mnt/emc/App_name/data-UAT-refresh/SolrCloud/SolrHome2/solr/xyz/data
[CachedDir<<refCount=0;path=/mnt/emc/App_name/data-UAT-refresh/SolrCloud/SolrHome2/solr/xyz/data;done=false>>]
INFO  - 2013-11-02 17:40:40.432;
org.apache.solr.core.CachingDirectoryFactory; Closing directory:
/mnt/emc/App_name/data-UAT-refresh/SolrCloud/SolrHome2/solr/xyz/data
ERROR - 2013-11-02 17:40:40.433; org.apache.solr.core.CoreContainer; Unable
to create core: xyz
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:834)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:625)
at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:256)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:555)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1477)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1589)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:821)
... 13 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain
timed out:
NativeFSLock@/mnt/emc/App_name/data-UAT-refresh/SolrCloud/SolrHome2/solr/xyz/data/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:695)
at 
org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
at 
org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
at
org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:267)
at
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:110)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1440)
... 15 more
ERROR - 2013-11-02 17:40:40.443; org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException: Unable to create core: xyz
at
org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:934)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:566)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:834)