I should add that plug-in LoaderInterface implementations are not documented, because I still consider them experimental. However, a plug-in is simple enough to configure via LOADER_CLASSNAME, and fairly simple to build by looking at the existing FileLoader.java and Loader.java source code, as well as the more exotic DelimitedDataLoader.java class. All three of these classes extend AbstractLoader, which does most of the heavy lifting.

-- Mike

On 2010-01-21 11:57, Michael Blakeley wrote:
Paul,

You can use the
CONTENT_FACTORY_CLASSNAME=com.marklogic.recordloader.xcc.XccModuleContentFactory
trick to perform arbitrary processing in XQuery. Whether you call that
"pre-processing" or not is up to you :-).

In your particular situation, though, an XccContentFactory may not be
enough. First, to handle SGML you will probably need to write a schema
that defines the empty elements in your SGML DTD
(http://developer.marklogic.com/pubs/4.1/books/dev_guide.pdf - chapter 4).

But SGML is complex, so that may not be enough. Once or twice I've had
to write a dedicated LoaderInterface implementation or a Loader subclass to handle
special circumstances. Your requirements sound like they are headed in
that direction, especially if you need to heuristically detect whether a
given document is XML, SGML, or other markup.
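
As a rough illustration of the heuristic detection mentioned above, here is a minimal Java sketch. It does not use any RecordLoader API: the class name MarkupSniffer, the Kind enum, and the rules themselves (a NUL byte means binary, a DOCTYPE without an XML declaration means SGML) are assumptions made for the example, not something RecordLoader provides.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper: classifies the first bytes of a file.
final class MarkupSniffer {
    enum Kind { BINARY, XML, SGML, TEXT }

    // A DOCTYPE declaration naming a root element, matched case-insensitively.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE\\s+[^\\s>]+", Pattern.CASE_INSENSITIVE);

    // A start tag at the very beginning of the (trimmed) content.
    private static final Pattern ROOT_TAG =
            Pattern.compile("<[A-Za-z_][^<>]*>");

    static Kind sniff(byte[] head) {
        // Assumption: any NUL byte marks the file as binary.
        // This also rejects UTF-16 text, which a real loader may want to accept.
        for (byte b : head) {
            if (b == 0) {
                return Kind.BINARY;
            }
        }
        // ISO-8859-1 maps every byte to a character, so decoding never fails
        // even when the declared encoding is wrong. A byte-order mark is not handled.
        String text = new String(head, StandardCharsets.ISO_8859_1).trim();
        if (text.startsWith("<?xml")) {
            return Kind.XML;
        }
        if (DOCTYPE.matcher(text).find()) {
            // Assumption: a DOCTYPE with no XML declaration is treated as SGML.
            return Kind.SGML;
        }
        if (ROOT_TAG.matcher(text).lookingAt()) {
            return Kind.XML;
        }
        return Kind.TEXT;
    }

    public static void main(String[] args) {
        byte[] sample = "<!DOCTYPE doc SYSTEM \"doc.dtd\"><doc>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(sample));
    }
}
```

A custom loader could call something like sniff() on the first few kilobytes of each file and branch on the result. The rules are deliberately crude and would need tuning against real content.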

Here's one point that you should consider carefully: do you really want
all your SGML to end up as XML in the empty namespace? As I read your
email, that's what you have specified. But in my opinion it would be
more robust to map each SGML DTD to a new XML schema, with its own
namespace. Otherwise you face the problem that elements from two
different SGML DTDs may use the same XML element QName in radically
different ways.
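
One possible way to pick a namespace per DTD is sketched below in Java. It looks up the DOCTYPE public identifier in a table and falls back to the empty namespace. The class name DtdNamespaces, the example public identifier, and the namespace URI are all hypothetical, and documents that use only a SYSTEM identifier are not covered.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps a DOCTYPE public identifier to a namespace URI.
final class DtdNamespaces {
    // Captures the public identifier from <!DOCTYPE root PUBLIC "id" ...>.
    private static final Pattern PUBLIC_ID = Pattern.compile(
            "<!DOCTYPE\\s+\\S+\\s+PUBLIC\\s+\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the mapped namespace, or "" (the empty namespace) when the
    // document has no PUBLIC identifier or the identifier is not in the table.
    static String namespaceFor(String documentHead, Map<String, String> byPublicId) {
        Matcher m = PUBLIC_ID.matcher(documentHead);
        if (m.find()) {
            return byPublicId.getOrDefault(m.group(1), "");
        }
        return "";
    }

    public static void main(String[] args) {
        // Illustrative table; real entries would come from the DTDs in use.
        Map<String, String> table = Map.of(
                "-//EXAMPLE//DTD Article//EN", "http://example.com/ns/article");
        String head = "<!DOCTYPE article PUBLIC \"-//EXAMPLE//DTD Article//EN\">";
        System.out.println(namespaceFor(head, table));
    }
}
```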

-- Mike

On 2010-01-21 09:47, Lewon, Paul (GPMS) wrote:
Hi all,

Here's my situation:


   *   I have a mixture of xml, sgml, and simple mark-up, and can't rely on the 
file extensions to tell me what's what.
   *   I have a working RecordLoader install, a MarkLogic database, and a file 
staging area where folks are collecting the content.
   *   The process of collecting the files, and creating new files is ongoing; 
I have no idea how many files we'll have, but it will almost certainly be 
several hundred thousand.
   *   Many files have sgml doctype declarations, but not all. Some few have 
xml doctype declarations.
   *   Most are not-UTF8. But some declare themselves as UTF8 in the doctype 
declaration even though they aren't.


What I'd like to do:


   *   I need to ingest it all into the MarkLogic database.
   *   I want to clear the database and repeat the ingest on a regular (TBD) 
basis.
   *   I do not want to pre-process the content prior to RecordLoader if at all 
possible.
   *   I want to handle according to the following logic:

   *   Handle incoming content as non-UTF8.
   *   If incoming files are binary, do not load.
   *   If the incoming file has a doctype definition, xml or sgml, handle it by 
converting to xml, removing problematic processing instructions, and 
pre-empting MarkLogic from turning SGML singletons into nested XML nodes (via 
default stack-level repair) by instead turning them into properly tagged XML 
singletons.

          (That is, I want
              <date><year year="2006"><month month="1"><day day="1"></date>
          to become
              <date><year year="2006" /><month month="1" /><day day="1" /></date>
          and not
              <date><year year="2006"><month month="1"><day day="1"></day></month></year></date>)

   *   Unless incoming content declares its namespace, load all content to the 
empty namespace.
   *   Else if the incoming file has a top level node, treat as xml.
   *   Else, ingest as text.
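
The singleton conversion described in the list above can be sketched in Java as follows. This is only an illustration under stated assumptions: the set of empty element names is supplied by hand (in practice it would come from the SGML DTD), the class name SgmlEmptyElements is hypothetical, names are compared case-sensitively, and attribute values containing ">" are not handled. It is plain string rewriting, not a RecordLoader or MarkLogic API.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites SGML-style empty start tags as XML singletons.
final class SgmlEmptyElements {
    // Matches a start tag that is not already self-closed.
    // Group 1 is the element name, group 2 the attribute text (may be null).
    private static final Pattern START_TAG = Pattern.compile(
            "<([A-Za-z][A-Za-z0-9._-]*)(\\s[^<>]*?)?\\s*(?<!/)>");

    static String closeEmptyElements(String sgml, Set<String> emptyNames) {
        Matcher m = START_TAG.matcher(sgml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            String attrs = m.group(2) == null ? "" : m.group(2);
            // Only elements declared empty are rewritten; other tags pass through.
            String tag = emptyNames.contains(name)
                    ? "<" + name + attrs + "/>"
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input =
                "<date><year year=\"2006\"><month month=\"1\"><day day=\"1\"></date>";
        System.out.println(closeEmptyElements(input, Set.of("year", "month", "day")));
    }
}
```

Because the rewrite happens before the content reaches the server, the default repair never sees an unclosed singleton. Whether this counts as pre-processing depends on where the code runs, for example inside a custom loader rather than as a separate pass over the staging area.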


And finally, the question:

Can I use RecordLoader to do this, and without pre-processing? I'm having a 
hard time wrapping my head around the processing paradigm of CONTENT_FACTORY 
via RecordLoader. Is xquery-based content handling via CONTENT_FACTORY going to 
fire after the MarkLogic Server has already handled incoming SGML singletons? 
And if this is possible, does it put too much burden on RecordLoader (i.e. is 
it scalable and repeatable on a regular basis)?

Thank you,
Paul


Paul Lewon
Production Technology, Global Production & Manufacturing Services
Cengage Learning
27500 Drake Rd. Farmington Hills, MI  48331

*: paul.le...@cengage.com | www.cengage.com





_______________________________________________
General mailing list
General@developer.marklogic.com
http://xqzone.com/mailman/listinfo/general