Hi Ankur,

The built-in pipeline `Document Filtering (Properties)` should be able to handle those. Just add it to the domain you’d like to use. Here is the section of the CPF guide on how to do that using the Admin UI: http://docs.marklogic.com/guide/cpf/domains#id_40535
For your reference, these are the formats supported by xdmp:document-filter: http://docs.marklogic.com/guide/search-dev/binary-document-metadata#id_68368

Kind regards,
Geert

From: MEHROTRA Ankur <ankur.mehro...@soprasteria.com>
Date: Monday, May 29, 2017 at 12:57 PM
To: MarkLogic Developer Discussion <general@developer.marklogic.com>, Geert Josten <geert.jos...@marklogic.com>
Cc: GUPTA Pavan <pavan.gu...@soprasteria.com>, SHARMA Archana <archana.sha...@soprasteria.com>, MarkLogic Developer Discussion <general@developer.marklogic.com>
Subject: RE: [MarkLogic Dev General] Clarification :- Binary Document Search

Hi Geert,

Can we have an option for configuring built-in CPF pipelines for MP3/Video files?

Thanks in advance,
Ankur Mehrotra

From: MEHROTRA Ankur
Sent: Monday, May 29, 2017 1:50 PM
To: MarkLogic Developer Discussion
Cc: GUPTA Pavan; SHARMA Archana
Subject: Re: [MarkLogic Dev General] Clarification :- Binary Document Search

Thanks a ton for such a useful response.

Thanks,
Ankur Mehrotra

________________________________
From: general-boun...@developer.marklogic.com on behalf of Geert Josten <geert.jos...@marklogic.com>
Sent: Monday, May 29, 2017 1:02:48 PM
To: MarkLogic Developer Discussion
Cc: GUPTA Pavan; SHARMA Archana
Subject: Re: [MarkLogic Dev General] Clarification :- Binary Document Search

Hi Ankur,

That is kind of by design. MarkLogic does not search binaries directly. Instead you can apply xdmp:document-filter (which uses a built-in 3rd party library) to scrape about 200 different formats for text and metadata.
The result is XHTML, and can be saved in document properties or as separate documents. This is represented in the built-in CPF Conversion pipelines as `Document Filtering (Properties)` and `Document Filtering (XHTML)`. These are not enabled by default.

MarkLogic also comes with functions like xdmp:pdf-convert and xdmp:word-convert. These usually yield better results, but work for very specific formats only. The built-in CPF Conversion pipelines that are enabled by default (Conversion Processing, DocBook Conversion, HTML Conversion, MS Office Conversion, PDF Conversion) make use of these, attempt to further enhance the results, and convert them into DocBook XML format. These always store results as separate documents.

The simplest solution might be to use `Document Filtering (Properties)` instead, and switch your search to run over properties instead of over documents. Note that searching over properties can have a performance impact (an extra join between document and properties fragments), and makes combined search over binaries and non-binaries more difficult (potential need for fragment-scope queries and such).

You could also just take the URIs returned from your current search, and string-manipulate each URI to get the link to the original binary. If memory serves me right, it is always the original URI plus something like ‘.xml’ or ‘.xhtml’ appended to it.
Kind regards,
Geert

From: general-boun...@developer.marklogic.com on behalf of MEHROTRA Ankur <ankur.mehro...@soprasteria.com>
Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
Date: Monday, May 29, 2017 at 8:41 AM
To: "general@developer.marklogic.com" <general@developer.marklogic.com>
Cc: GUPTA Pavan <pavan.gu...@soprasteria.com>, SHARMA Archana <archana.sha...@soprasteria.com>
Subject: Re: [MarkLogic Dev General] Clarification :- Binary Document Search

Any update on this?

From: MEHROTRA Ankur
Sent: Thursday, May 25, 2017 5:36 PM
To: 'general@developer.marklogic.com'
Cc: GUPTA Pavan; SHARMA Archana
Subject: Clarification :- Binary Document Search

Hi Team,

I have gone through the 'https://docs.marklogic.com/... to set up the pipeline that makes binary documents searchable. I can see that .xml and .xhtml documents are generated from each ingested file (for instance .doc/.docx/.pdf).

When I search using a Java Client API search query, the matches come from the generated XML file rather than from the ingested file, so the response contains the URI of the generated XML file. However, I need the URI of the main document file, because that is what I have to show on screen. How can I achieve this?

We have used the code below, which returns the converted document's URI (for example the .xml file), but I need the ingested document's URI.
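    import com.marklogic.client.DatabaseClient;
    import com.marklogic.client.DatabaseClientFactory;
    import com.marklogic.client.io.SearchHandle;
    import com.marklogic.client.query.MatchDocumentSummary;
    import com.marklogic.client.query.QueryManager;
    import com.marklogic.client.query.StringQueryDefinition;

    // connect using the settings from our own Config class
    DatabaseClient client = DatabaseClientFactory.newClient(Config.host, Config.port,
            Config.user, Config.password, Config.authType);

    // create a manager for searching
    QueryManager queryMgr = client.newQueryManager();

    // search for the word "text"
    StringQueryDefinition query = queryMgr.newStringDefinition();
    query.setCriteria("text");

    SearchHandle resultsHandle = new SearchHandle();
    queryMgr.search(query, resultsHandle);

    // print the URI of every matching document (currently the converted one)
    MatchDocumentSummary[] results = resultsHandle.getMatchResults();
    for (MatchDocumentSummary result : results) {
        System.out.println(result.getUri());
    }

Thanks and regards,
Ankur Mehrotra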
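
Geert's suggestion of deriving the original URI from the converted document's URI could be sketched in Java roughly as follows. This is only an illustration under the assumption, taken from his "if memory serves me right" remark, that the converted URI is the original URI with '.xml' or '.xhtml' appended. The class name OriginalUri, the method fromConverted, and the suffix list are hypothetical, and the naming pattern should be checked against the URIs that CPF actually produces in your database before relying on it.

```java
public class OriginalUri {

    // Assumed suffixes, based on the recollection in the reply above;
    // verify against the URIs generated in your own database.
    private static final String[] CONVERTED_SUFFIXES = { ".xhtml", ".xml" };

    /**
     * Strips one assumed conversion suffix from the URI.
     * Returns the URI unchanged if no known suffix is present.
     */
    public static String fromConverted(String convertedUri) {
        for (String suffix : CONVERTED_SUFFIXES) {
            if (convertedUri.endsWith(suffix)
                    && convertedUri.length() > suffix.length()) {
                return convertedUri.substring(
                        0, convertedUri.length() - suffix.length());
            }
        }
        return convertedUri;
    }
}
```

In the search loop above, the value of result.getUri() would be passed through OriginalUri.fromConverted(...) before it is shown on screen. If the generated URIs turn out to follow a different pattern, only this one method needs to change.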