Large import making solr unresponsive

2012-12-16 Thread Brent Mills
This is an issue we've only been running into lately, so I'm not sure what to 
make of it.  We have two cores on a Solr machine right now: one of them holds 
about 10k documents, the other about 1.5 million.  None of the documents are 
very large, only about 30 short attributes each.  We also have about 10 
requests/sec hitting the smaller core and fewer on the larger one.  Whenever we 
do a full import on the smaller one everything is fine: the response times stay 
the same during the whole 30 seconds it takes to run the indexer, and the CPU 
also stays fairly low.

When we run a full import on the larger one, the response times on both Solr 
cores tank from about 10ms to over 8 seconds.  We have a 4-core machine (a VM), 
and I've noticed one CPU core stays pegged the entire time, which is 
understandable since the DIH, as I understand it, is single-threaded.  From what 
I can tell there is no disk, network, or memory pressure (8 GB) either, and the 
other CPU cores do virtually nothing.  Also, the responses from Solr still come 
back with a QTime under 10ms.  My best guess at this point is that Tomcat is 
having issues when that single CPU core gets pegged, but I'm at a loss as to how 
to pin this down to a Tomcat issue or to something odd that Solr is doing.

Has anyone run into this before or have ideas about what might be happening?
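
For reference, the import is kicked off and monitored with the usual 
DataImportHandler commands, something like the following (host, port, and core 
name are placeholders, and this assumes the handler is registered at the 
default /dataimport path):

  http://localhost:8983/solr/bigcore/dataimport?command=full-import
  http://localhost:8983/solr/bigcore/dataimport?command=status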


RE: problem adding new fields in DIH

2012-07-11 Thread Brent Mills
Thanks for the explanation and the bug report, Robert!

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, July 09, 2012 3:18 PM
To: solr-user@lucene.apache.org
Subject: Re: problem adding new fields in DIH

Thanks again for reporting this Brent. I opened a JIRA issue:
https://issues.apache.org/jira/browse/SOLR-3610

On Mon, Jul 9, 2012 at 3:36 PM, Brent Mills  wrote:
> We're having an issue when we add or change a field in the db-data-config.xml 
> and schema.xml files in Solr.  Basically, whenever I add something new to 
> index, I add it to the database, then to the data config, then add the field 
> to the schema, reload the core, and do a full import.  This worked fine until 
> we upgraded to an iteration of 4.0 (we are currently on the 4.0 alpha).  Now, 
> sometimes when we go through this process, Solr throws errors about the field 
> not being found.  The only way to fix it is to restart Tomcat, and everything 
> immediately starts working fine again.
>
> The interesting thing is that this is only a problem if the database is 
> returning a value for that field, and only in the documents that have a value. 
> The field shows up in the schema browser in Solr; it just has no data in it. 
> If I completely remove it from the database but leave it in the schema and 
> data-config files, there is no issue.  Also of note, this is happening on two 
> different machines.
>
> Here's the trace
>
> SEVERE: Exception while solr commit.
> java.lang.IllegalArgumentException: no such field test
> at org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49)
> at org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52)
> at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94)
> at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
> at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
> at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
> at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
> at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
> at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480)
> at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
> at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554)
> at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547)
> at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683)
> at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663)
> at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414)
> at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82)
> at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
> at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919)
> at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
> at org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107)
> at org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304)
> at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256)
> at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
> at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399)
> at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380)
>



-- 
lucidimagination.com


problem adding new fields in DIH

2012-07-09 Thread Brent Mills
We're having an issue when we add or change a field in the db-data-config.xml 
and schema.xml files in Solr.  Basically, whenever I add something new to index, 
I add it to the database, then to the data config, then add the field to the 
schema, reload the core, and do a full import.  This worked fine until we 
upgraded to an iteration of 4.0 (we are currently on the 4.0 alpha).  Now, 
sometimes when we go through this process, Solr throws errors about the field 
not being found.  The only way to fix it is to restart Tomcat, and everything
immediately starts working fine again.
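
For context, the change in question is roughly of the following shape (the 
table, query, and field type here are placeholders; only the field name "test" 
is taken from the trace below).  In db-data-config.xml the new column is added 
to the entity:

  <entity name="item" query="SELECT id, name, test FROM items">
    <field column="test" name="test" />
  </entity>

and the matching field is added to schema.xml:

  <field name="test" type="string" indexed="true" stored="true" />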

The interesting thing is that this is only a problem if the database is 
returning a value for that field, and only in the documents that have a value.  
The field shows up in the schema browser in Solr; it just has no data in it.  
If I completely remove it from the database but leave it in the schema and 
data-config files, there is no issue.  Also of note, this is happening on two 
different machines.

Here's the trace

SEVERE: Exception while solr commit.
java.lang.IllegalArgumentException: no such field test
at org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49)
at org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94)
at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480)
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554)
at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414)
at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82)
at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919)
at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
at org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107)
at org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380)



RE: Nested CachedSqlEntityProcessor running for each entity row with Solr 3.6?

2012-05-10 Thread Brent Mills
Hi James,

I just pulled down the newest nightly build of 4.0, and it solves an issue I had 
been having with Solr ignoring the caching of the child entities.  It was 
basically opening a new connection for each iteration even though everything 
was specified correctly.  This was present in my previous build of 4.0, so it 
looks like you fixed it with one of those patches.  Thanks for all your work on 
the DIH; the caching improvements are a big help with some of the things we 
will be rolling out in production soon.

-Brent

-Original Message-
From: Dyer, James [mailto:james.d...@ingrambook.com] 
Sent: Monday, May 07, 2012 1:47 PM
To: solr-user@lucene.apache.org
Cc: Brent Mills; dye.kel...@gmail.com; keithn...@dswinc.com
Subject: RE: Nested CachedSqlEntityProcessor running for each entity row with 
Solr 3.6?

Dear Kellen, Brent & Keith,

There are now fixes available for two cache-related bugs that unfortunately made 
their way into the 3.6.0 release.  These were addressed in these two JIRA issues, 
which have been committed to the 3.6 branch (as of today):
- https://issues.apache.org/jira/browse/SOLR-3430
- https://issues.apache.org/jira/browse/SOLR-3360
These problems were also affecting Trunk/4.x, with both fixes being committed to 
Trunk under SOLR-3430.

Should Solr 3.6.1 be released, these fixes will become generally available at 
that time.  They will also be part of the 4.0 release, which the development 
community hopes to ship later this year.

In the meantime, I am hoping each of you can test these fixes with your 
installation.  The best way to do this is to get a fresh SVN checkout of the 
3.6.1 branch 
(http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/), switch 
to the "solr" directory, then run "ant dist".  I believe you need Ant 1.8 to 
build.

If you are unable to build it yourself, I put an *unofficial* snapshot of the 
DIH jar here:
http://people.apache.org/~jdyer/unofficial/apache-solr-dataimporthandler-3.6.1-SNAPSHOT-r1335176.jar

Please let me know whether this solves your problems with DIH caching, giving 
you the functionality you had with 3.5 and prior.  Your feedback is greatly 
appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: not interesting [mailto:dye.kel...@gmail.com]
Sent: Monday, May 07, 2012 9:43 AM
To: solr-user@lucene.apache.org
Subject: Nested CachedSqlEntityProcessor running for each entity row with Solr 
3.6?

I just upgraded from Solr 3.4 to Solr 3.6; I'm using the same data-import.xml 
for both versions. The import functioned properly with 3.4.

I'm using a nested entity to fetch authors associated with each document, and 
I'm using CachedSqlEntityProcessor to avoid hitting the DB an unreasonable 
number of times. However, when indexing, Solr indexes very slowly and appears 
to be fetching all authors in the DB for each document. The index should be 
~500 MB; I aborted the indexing when it reached ~6 GB.  If I comment out the 
nested author entity below, Solr will index normally.

Am I missing something obvious or is this a bug?
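
The entity XML was stripped by the list archive, so here is a rough sketch of 
the shape being described (table and column names are placeholders, not the 
actual configuration):

  <entity name="document" query="SELECT id, title FROM documents">
    <entity name="author" processor="CachedSqlEntityProcessor"
            query="SELECT doc_id, author_name FROM authors"
            where="doc_id=document.id">
      <field column="author_name" name="author" />
    </entity>
  </entity>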

Also posted at SO if you prefer to answer there:
http://stackoverflow.com/questions/10482484/nested-cachedsqlentityprocessor-running-for-each-entity-row-with-solr-3-6

Kellen


Upgrading to 3.6 broke cachedsqlentityprocessor

2012-04-30 Thread Brent Mills
I've read some things in JIRA about the new caching functionality in the DIH, 
but I wouldn't think it should break the old behavior.  It doesn't look as 
though any errors are being thrown; it's just ignoring the caching part and 
opening a ton of connections.  Also, I cannot find any documentation on the new 
functionality, so I'm not sure which syntax is valid and which isn't.  Here is 
my entity that worked in 3.1 but no
longer works in 3.6:
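
The entity XML itself was stripped by the list archive; what follows is only a 
representative sketch (placeholder table and column names, not the original 
entity) of a pre-3.6 cached child entity:

  <entity name="address" processor="CachedSqlEntityProcessor"
          query="SELECT user_id, city FROM addresses"
          where="user_id=user.id">
    <field column="city" name="city" />
  </entity>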


RE: Sharing dih "dictionaries"

2011-12-06 Thread Brent Mills
You're totally correct.  There's actually a link on the DIH page now that 
wasn't there when I read it a long time ago.  I'm really looking forward to 
4.0; it's got a ton of great new features.  Thanks for the links!

-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
Sent: Monday, December 05, 2011 10:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Sharing dih "dictionaries"

It looks like https://issues.apache.org/jira/browse/SOLR-2382 or even 
https://issues.apache.org/jira/browse/SOLR-2613.
I guess by using SOLR-2382 you can specify your own SortedMapBackedCache 
subclass which is able to share your Dictionary.
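
If I read SOLR-2382 correctly (an assumption on my part, with made-up class, 
table, and column names), the shared cache would be wired in through the 
entity's cacheImpl attribute, roughly:

  <entity name="color" processor="SqlEntityProcessor"
          query="SELECT code, description FROM codes"
          cacheImpl="com.example.SharedDictionaryCache"
          cacheKey="code" cacheLookup="product.color_code" />

where com.example.SharedDictionaryCache would be the custom SortedMapBackedCache 
subclass mentioned above.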

Regards

On Tue, Dec 6, 2011 at 12:26 AM, Brent Mills  wrote:

> I'm not really sure how to title this, but here's what I'm trying to do.
>
> I have a query that creates a rather large dictionary of codes that are 
> shared across multiple fields of a base entity.  I'm using the 
> CachedSqlEntityProcessor, but I was curious whether there is a way to join it 
> multiple times to the base entity so I can avoid having to reload it for each 
> column join.
>
> Ex: [entity configuration XML stripped by the list archive]
>
> Kind of a simplified example, but in this case the dictionary query has to be 
> run three times to join three different columns.  It would be nice if I could 
> load the data set once as an entity and specify how to join it in code 
> without requiring a separate SQL query.  Any ideas?
>



--
Sincerely yours
Mikhail Khludnev
Developer
Grid Dynamics
tel. 1-415-738-8644
Skype: mkhludnev
<http://www.griddynamics.com>
 


Sharing dih "dictionaries"

2011-12-05 Thread Brent Mills
I'm not really sure how to title this, but here's what I'm trying to do.

I have a query that creates a rather large dictionary of codes that are shared 
across multiple fields of a base entity.  I'm using the CachedSqlEntityProcessor, 
but I was curious whether there is a way to join it multiple times to the base 
entity so I can avoid having to reload it for each column join.

Ex:
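The XML was stripped by the list archive; as a representative sketch (with 
placeholder table and column names), the current setup joins the same dictionary 
query through three separate cached child entities, one per code column:

  <entity name="product"
          query="SELECT id, color_code, size_code, status_code FROM products">
    <entity name="color" processor="CachedSqlEntityProcessor"
            query="SELECT code, description AS color FROM codes"
            where="code=product.color_code" />
    <entity name="size" processor="CachedSqlEntityProcessor"
            query="SELECT code, description AS size FROM codes"
            where="code=product.size_code" />
    <entity name="status" processor="CachedSqlEntityProcessor"
            query="SELECT code, description AS status FROM codes"
            where="code=product.status_code" />
  </entity>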

Kind of a simplified example, but in this case the dictionary query has to be 
run three times to join three different columns.  It would be nice if I could 
load the data set once as an entity and specify how to join it in code without 
requiring a separate SQL query.  Any ideas?