Hi Ankur,

The built-in pipeline `Document Filtering (Properties)` should be able to 
handle those. Just add them to the domain you’d like to use. Here is the 
section of the CPF guide on how to do that using the Admin UI: 
http://docs.marklogic.com/guide/cpf/domains#id_40535

For your reference, these are the formats supported by xdmp:document-filter: 
http://docs.marklogic.com/guide/search-dev/binary-document-metadata#id_68368

Kind regards,
Geert

From: MEHROTRA Ankur <ankur.mehro...@soprasteria.com>
Date: Monday, May 29, 2017 at 12:57 PM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>, 
Geert Josten <geert.jos...@marklogic.com>
Cc: GUPTA Pavan <pavan.gu...@soprasteria.com>, SHARMA Archana 
<archana.sha...@soprasteria.com>, MarkLogic Developer Discussion 
<general@developer.marklogic.com>
Subject: RE: [MarkLogic Dev General] Clarification :- Binary Document Search

Hi Geert,

Is there an option to configure the built-in CPF pipelines for MP3/video 
files?

Thanks in advance,
Ankur Mehrotra

From: MEHROTRA Ankur
Sent: Monday, May 29, 2017 1:50 PM
To: MarkLogic Developer Discussion
Cc: GUPTA Pavan; SHARMA Archana
Subject: Re: [MarkLogic Dev General] Clarification :- Binary Document Search


Thanks a ton for such a useful response.

Thanks,
Ankur Mehrotra

________________________________
From: general-boun...@developer.marklogic.com on behalf of Geert Josten 
<geert.jos...@marklogic.com>
Sent: Monday, May 29, 2017 1:02:48 PM
To: MarkLogic Developer Discussion
Cc: GUPTA Pavan; SHARMA Archana
Subject: Re: [MarkLogic Dev General] Clarification :- Binary Document Search

Hi Ankur,

That is kind of by design. MarkLogic does not search binaries directly. 
Instead, you can apply xdmp:document-filter (which uses a built-in third-party 
library) to extract text and metadata from about 200 different formats. The 
result is XHTML, which can be saved in document properties or as a separate 
document. The built-in CPF Conversion pipelines expose this as 
`Document Filtering (Properties)` and `Document Filtering (XHTML)`. These are 
not enabled by default.

MarkLogic also comes with functions like xdmp:pdf-convert and 
xdmp:word-convert. These usually yield better results, but work for very 
specific formats only. The built-in CPF Conversion pipelines that are enabled 
by default (Conversion Processing, DocBook Conversion, HTML Conversion, MS 
Office Conversion, PDF Conversion) make use of these, attempt to further 
enhance the results, and convert them into DocBook XML format. These always 
store results as separate documents.

The simplest solution might be to use `Document Filtering (Properties)` 
instead, and switch your search to run over properties instead of over 
documents. Searching over properties can have a performance impact, though 
(an extra join between document and properties fragments), and it makes a 
combined search over binaries and non-binaries more difficult (potential need 
for fragment-scope queries and such).
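
To make the properties-scoped search a bit more concrete, here is a minimal 
sketch that builds a Search API combined query with the `fragment-scope` 
option set to `properties`. The class and method names are hypothetical, and 
it assumes the filtered XHTML really is stored in the document properties (as 
with `Document Filtering (Properties)`); treat it as an illustration rather 
than a tested configuration.

```java
// Hypothetical helper (not part of the MarkLogic Java Client API): builds a
// combined query that asks the Search API to match against properties
// fragments instead of document fragments.
final class PropertiesScopeQuery {

    // Escape the characters that would break XML element content.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Wrap the search text and a fragment-scope option in one combined query.
    static String combinedQuery(String criteria) {
        return "<search xmlns=\"http://marklogic.com/appservices/search\">"
             + "<qtext>" + escapeXml(criteria) + "</qtext>"
             + "<options><fragment-scope>properties</fragment-scope></options>"
             + "</search>";
    }
}
```

With the Java Client API, a string like this could probably be passed along 
as `queryMgr.newRawCombinedQueryDefinition(new StringHandle(xml).withFormat(Format.XML))` 
in place of the string query definition. The URIs returned should then be 
those of the documents that own the matching properties, but please verify 
that against your own setup.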

You could also just take the URIs returned from your current search, and use 
string manipulation on each URI to get the link to the original binary. If 
memory serves me right, it is always the original URI with something like 
‘.xml’ or ‘.xhtml’ appended to it.
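
As a rough sketch of that string manipulation in Java: the helper below 
strips a trailing ‘.xhtml’ or ‘.xml’ from a result URI. The class name, the 
method name, and the exact suffix list are assumptions for illustration only; 
check the URIs that conversion actually produces in your database before 
relying on it.

```java
// Hypothetical helper (not part of the MarkLogic Java Client API): maps the
// URI of a generated document back to a candidate URI for the original binary.
final class BinaryUriResolver {

    // Assumed suffixes, taken from the recollection above; adjust them to
    // match the URIs you actually see in your database.
    private static final String[] GENERATED_SUFFIXES = { ".xhtml", ".xml" };

    // Returns the URI without the generated suffix, or the URI unchanged
    // when none of the assumed suffixes is present.
    static String originalUri(String generatedUri) {
        for (String suffix : GENERATED_SUFFIXES) {
            if (generatedUri.endsWith(suffix)
                    && generatedUri.length() > suffix.length()) {
                return generatedUri.substring(
                        0, generatedUri.length() - suffix.length());
            }
        }
        return generatedUri;
    }
}
```

In your result loop this would be called as 
`BinaryUriResolver.originalUri(result.getUri())`. Note that it would also 
strip the suffix from documents that were ingested as XML in the first place, 
so it is probably worth confirming that the resulting URI exists before 
showing it on screen.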

Kind regards,
Geert

From: general-boun...@developer.marklogic.com on behalf of MEHROTRA Ankur 
<ankur.mehro...@soprasteria.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Monday, May 29, 2017 at 8:41 AM
To: general@developer.marklogic.com
Cc: GUPTA Pavan <pavan.gu...@soprasteria.com>, SHARMA Archana 
<archana.sha...@soprasteria.com>
Subject: Re: [MarkLogic Dev General] Clarification :- Binary Document Search

Any update on this?

From: MEHROTRA Ankur
Sent: Thursday, May 25, 2017 5:36 PM
To: general@developer.marklogic.com
Cc: GUPTA Pavan; SHARMA Archana
Subject: Clarification :- Binary Document Search

Hi Team,

I have gone through 'https://docs.marklogic.com/... to set up the pipeline 
that makes binary documents searchable. I can see that .xml and .xhtml 
documents are generated from each ingested file (for instance 
.doc/.docx/.pdf). When I search using a Java Client API search query, the 
matches come from the generated XML document rather than from the ingested 
file, so the response contains the URI of the generated XML document. 
However, I need the URI of the original document, because that is what I have 
to show on screen. How can I achieve this?

We use the code below, which returns the converted document's URI (for 
example the .xml file), but I need the ingested document's URI.


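import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.io.SearchHandle;
import com.marklogic.client.query.MatchDocumentSummary;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.StringQueryDefinition;

DatabaseClient client = DatabaseClientFactory.newClient(Config.host,
        Config.port, Config.user, Config.password, Config.authType);

// create a manager for searching
QueryManager queryMgr = client.newQueryManager();

// build a string query for the word "text"
StringQueryDefinition query = queryMgr.newStringDefinition();
query.setCriteria("text");

// run the search and collect the results
SearchHandle resultsHandle = new SearchHandle();
queryMgr.search(query, resultsHandle);

// print the URI of each match (currently the generated .xml document)
MatchDocumentSummary[] results = resultsHandle.getMatchResults();
for (MatchDocumentSummary result : results) {
    System.out.println(result.getUri());
}

// release the client when done
client.release();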


Thanks and regards,
Ankur Mehrotra
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
