Re: [MarkLogic Dev General] Binary Document Ingestion in MP4 and MP3 format

Geert Josten Thu, 20 Jul 2017 02:33:57 -0700

Hi Pavan,

If you need to store both the binary itself and its meta info plus textual 
contents, there are two general approaches:

- put the meta info and textual contents in document properties
- store them separately as normal documents that reference the database URI 
of the actual binary
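
To make the second approach a bit more concrete, here is a minimal Python 
sketch of what such a separate "sidecar" document could look like. It only 
builds the XML locally and does not talk to MarkLogic. The element names 
(binary-meta, binary-ref), the ".meta.xml" URI suffix and the example URI are 
all assumptions made up for illustration, not MarkLogic conventions.

```python
import xml.etree.ElementTree as ET


def sidecar_uri(binary_uri):
    """Derive a URI for the companion document (hypothetical naming scheme)."""
    return binary_uri + ".meta.xml"


def build_sidecar(binary_uri, meta):
    """Build a small XML document that points back at the binary's URI."""
    root = ET.Element("binary-meta")  # hypothetical root element
    # The reference that ties this document to the actual binary.
    ET.SubElement(root, "binary-ref").text = binary_uri
    # One child element per piece of meta info.
    for name, value in meta.items():
        ET.SubElement(root, name).text = str(value)
    return root


doc = build_sidecar(
    "/media/talk.mp4",  # hypothetical database URI of the binary
    {"content-type": "video/mp4", "size": 1048576},
)
print(sidecar_uri("/media/talk.mp4"))
print(ET.tostring(doc, encoding="unicode"))
```

A search hit on the sidecar document can then be resolved to the binary by 
reading the binary-ref element.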

MarkLogic 9 also allows storing simple key/value pairs in hidden document 
metadata, which is more efficient than document properties or separate docs, 
but it is probably too limited for this use case.

You can store transcripts of videos, including timestamps, as XML. That works 
for both the two-document and the document-properties approach.
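
As a rough illustration of the transcript idea, the Python sketch below builds 
a timestamped transcript as XML and looks up when a phrase was spoken. It 
assumes you already have the spoken text and timings from some external 
speech-to-text step. The element and attribute names (transcript, segment, 
start, end) and the sample sentences are hypothetical, and the lookup is a 
plain in-memory substring match, not a MarkLogic search.

```python
import xml.etree.ElementTree as ET


def build_transcript(segments):
    """Build a transcript element from (start_sec, end_sec, text) tuples."""
    root = ET.Element("transcript")  # hypothetical element names throughout
    for start, end, text in segments:
        seg = ET.SubElement(
            root, "segment", {"start": f"{start:.1f}", "end": f"{end:.1f}"}
        )
        seg.text = text
    return root


def find_spoken(transcript, phrase):
    """Return (start_sec, text) for each segment containing the phrase."""
    needle = phrase.lower()
    return [
        (float(seg.get("start")), seg.text)
        for seg in transcript.iter("segment")
        if needle in (seg.text or "").lower()
    ]


transcript = build_transcript([
    (0.0, 4.5, "Welcome to the product demo."),
    (4.5, 9.0, "First we load the binary documents."),
])
xml_text = ET.tostring(transcript, encoding="unicode")
hits = find_spoken(transcript, "binary documents")
print(xml_text)
print(hits)
```

Keeping one element per spoken segment means a match on the text can be mapped 
straight back to the start time stored on that element.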

Document properties allow storing complete XML fragments, and they are 
associated with the same database URI as the actual document (in this case the 
binary data). They are included in indexing automatically. You just need to 
indicate that you would like to include properties fragments in searching and 
faceting.

There are out-of-the-box CPF pipelines for Document Filtering. There is one 
that saves the result in doc properties, and one that saves the result in a 
separate doc. It should be possible to enable those via the Admin UI.

Kind regards,
Geert

From: GUPTA Pavan <[email protected]>
Date: Thursday, July 20, 2017 at 11:07 AM
To: MarkLogic Developer Discussion <[email protected]>, 
Geert Josten <[email protected]>
Subject: RE: [MarkLogic Dev General] Binary Document Ingestion in MP4 and MP3 
format

Hello Geert,

Thanks for the information. I would also like to know how I can store the 
content (meaning the spoken words) of a video and find the time at which it 
was spoken, the way we load the content of a document file into metadata.
Is there a CPF pipeline I need to apply, or can you suggest a library?

Thanks In Advance!


Regards,
Pavan

From: [email protected] On Behalf Of Geert Josten
Sent: Thursday, July 20, 2017 2:27 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Binary Document Ingestion in MP4 and MP3 
format

Hi Pavan,

You can apply xdmp:document-filter to many binary formats, including mp3 and 
mp4. It will extract meta information like file size and content MIME type, as 
well as, for instance, document properties from office documents and EXIF tags 
from images. It will also attempt to extract actual text, but that only works 
if such text is inside the file in a machine-readable form. For example, text 
contained inside images or video streams will not be captured. This includes 
images embedded in office docs, image PDFs, and also captions and subtitles on 
images and videos. You would need an OCR kind of solution for that.

Kind regards,
Geert

From: <[email protected]> on behalf of GUPTA Pavan <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Thursday, July 20, 2017 at 9:19 AM
To: <[email protected]>
Subject: [MarkLogic Dev General] Binary Document Ingestion in MP4 and MP3 format

Hi Team,

I am trying to ingest .mp4 and .mp3 files and make them searchable. I have 
read that these files are treated as binary files.

I have also seen how to make binary files searchable, and I have done it for 
.doc, .ppt, .pdf, etc., but could not do it for .mp4 or .mp3.

In short, I want to make these files searchable.

Can you please direct me on how to achieve this, and tell me whether I need to 
enable or set up any Content Processing Framework pipeline for it?

Thanks In Advance!


Regards,
Pavan
