Hi,

I think I've found the main bottleneck in massively parallel ingestion.

I'm working with the latest GitHub snapshot.

In org.fcrepo.server.storage.DefaultDOManager, the getIngestWriter method 
should not be synchronized: there seems to be only a single instance of that 
class for the server, so every ingest thread ends up waiting on the same lock. 
The internal objects of this class seem to be either correctly synchronized 
(pid generation) or recreated on each call (inside Translator, a new 
DODeserializer is created, and the same happens inside the Validator).
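
To illustrate the point above, here is a minimal, self-contained sketch. It is 
not Fedora's code: the Manager class, its two ingest methods and the "demo:" 
pids are hypothetical stand-ins. It only shows the general effect of a 
synchronized method on a single shared instance, compared with an unlocked 
method whose state is either thread-safe on its own or created per call.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class IngestLockSketch {

    // Hypothetical stand-in for a manager that exists once per server.
    // This is NOT Fedora's DefaultDOManager; all names here are made up.
    static class Manager {
        private final AtomicInteger nextPid = new AtomicInteger(); // shared, thread-safe on its own
        private final AtomicInteger active = new AtomicInteger();  // callers currently inside doIngest
        private final AtomicInteger peak = new AtomicInteger();    // highest value "active" ever reached

        // Variant 1: method-level lock on the single shared instance.
        synchronized String ingestSynchronized(String foxml) throws InterruptedException {
            return doIngest(foxml);
        }

        // Variant 2: no method-level lock; safety relies on the state used below.
        String ingestUnsynchronized(String foxml) throws InterruptedException {
            return doIngest(foxml);
        }

        private String doIngest(String foxml) throws InterruptedException {
            peak.accumulateAndGet(active.incrementAndGet(), Math::max);
            try {
                // Fresh object per call, never shared between threads.
                StringBuilder perCallParser = new StringBuilder(foxml);
                perCallParser.reverse();
                Thread.sleep(10); // simulated parsing/validation cost
                return "demo:" + nextPid.incrementAndGet();
            } finally {
                active.decrementAndGet();
            }
        }
    }

    // Runs `tasks` ingests on `threads` workers.
    // Returns { peak concurrency inside doIngest, number of distinct pids }.
    static int[] run(boolean useLock, int threads, int tasks) throws Exception {
        Manager manager = new Manager();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                String doc = "<foxml id='" + i + "'/>";
                futures.add(pool.submit(() -> useLock
                        ? manager.ingestSynchronized(doc)
                        : manager.ingestUnsynchronized(doc)));
            }
            Set<String> pids = new HashSet<>();
            for (Future<String> f : futures) {
                pids.add(f.get());
            }
            return new int[] { manager.peak.get(), pids.size() };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] locked = run(true, 4, 8);
        int[] unlocked = run(false, 4, 8);
        System.out.println("synchronized:   peak=" + locked[0] + " distinct pids=" + locked[1]);
        System.out.println("unsynchronized: peak=" + unlocked[0] + " distinct pids=" + unlocked[1]);
    }
}
```

With the lock, the peak is always 1, whatever the number of worker threads. 
Without it, several threads can be inside the method at once, and the pids 
stay distinct because the counter is atomic. Whether removing the lock is safe 
in the real class depends on every piece of state it touches, which this 
sketch does not verify.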

I've tested with FOXML ingestion and now almost all the CPUs are used. I have 
not dug deeper to check that none of the inserted objects is corrupted, but 
after a quick look, it seems OK. I guess the same kind of patch could also 
apply to object deletion.

If one of you who understands that part better could have a look, it seems it 
would be a nice patch, not too hard to test, with great performance 
improvements.

Regards,

Nicolas HERVE


On 28/09/2012 11:25, Nicolas Hervé wrote:
Hi,

indeed, it seems we have exactly the same configuration (millions of DOs with 
some metadata and external content) on almost the same hardware. I have not 
yet identified the bottleneck in the massively parallel ingestion process, 
but I strongly suspect a synchronized portion of code somewhere in the chain. I 
hope Edwin can say more about this :-)

For querying the dc fields, indexes have to be created in the MySQL schema, and 
the SQL queries are far from optimal. Currently I have only patched for my own 
purposes (my data model / my queries), and I bypass some portions of code in 
the following classes:

org.fcrepo.server.search.FieldSearchSQLImpl
org.fcrepo.server.search.FieldSearchResultSQLImpl

I'm really new to Fedora Commons but, from what I understand, this SQL part is 
quite old. Changing it for optimization purposes could imply behaviour 
changes for other people. That's why I don't think simple patches could do the 
job. It would need a complete refactoring, which could only be done with a 
global view of the different ways these classes are used in the different 
contexts where Fedora instances are running.

Feel free to contact me to discuss this more precisely.

Regards,

Nicolas HERVE
+33 1 49 83 21 66 (GMT + 2)

On 27/09/2012 18:23, Jason V wrote:
Hi Nicolas,

My name is Jason Varghese and I'm a senior developer at the New York Public 
Library. I think you are doing work similar to what I am presently doing based 
on reading some of your posts.

We have a relatively large scale Fedora implementation here.  We've had all the 
hardware in place for some time and are in the process of migrating from a 
large homegrown repository to a Fedora based platform.  We have a single Fedora 
ingest machine and 3 Fedora readers.  The ingest machine alone is 4 x 6 core 
processors w/ 128GB RAM.  I'm in the process of generating about 1 million+ 
digital objects and attaching to each DO all the metadata (as managed content 
datastreams) and all the digital assets (as external content datastreams).  The 
digital assets currently are about 183 TB of content (this is replicated at two 
sites).  I wrote a multithreaded Java client to accomplish the Fedora 
ingest/DO generation, and I use the Mediashelf REST API client for 
connectivity to Fedora. I was able to successfully ingest tens of thousands of 
digital objects, but really need to ensure this process performs optimally and 
scales for millions of objects. What bottlenecks were you able to identify when 
running your multithreaded ingest process?  Look forward to learning/sharing 
experiences from this process with you and the community and possibly 
collaborating.  Thanks

Jason Varghese
NYPL








_______________________________________________
Fedora-commons-users mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users


_______________________________________________
Fedora-commons-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers
