Hi,
I'm not sure whether you are aware of the following resources; they might help:
https://oakutils.appspot.com/generate/index
https://www.aemstuff.com/blogs/feb/aemindexcheatsheat.html
https://experienceleague.adobe.com/docs/experience-manager-65/assets/JCR_query_cheatsheet-v1.1.pdf
These were written for the Adobe AEM product, but I find them useful even
outside of AEM.
And here is an example index definition:
{
  "/oak:index/acmeAsset-1": {
    "compatVersion": 2,
    "type": "lucene",
    "tags": ["asset"],
    "async": ["async", "nrt"],
    "includedPaths": ["/content/dam"],
    "jcr:primaryType": "oak:QueryIndexDefinition",
    "evaluatePathRestrictions": true,
    "maxFieldLength": 100000,
    "aggregates": {
      "jcr:primaryType": "nt:unstructured",
      "dam:Asset": {
        "jcr:primaryType": "nt:unstructured",
        "include0": {
          "path": "jcr:content",
          "jcr:primaryType": "nt:unstructured"
        },
        "include1": {
          "path": "jcr:content/metadata",
          "jcr:primaryType": "nt:unstructured"
        },
        "include2": {
          "path": "jcr:content/metadata/*",
          "jcr:primaryType": "nt:unstructured"
        },
        "include3": {
          "path": "jcr:content/renditions",
          "jcr:primaryType": "nt:unstructured"
        },
        "include4": {
          "path": "jcr:content/renditions/original",
          "jcr:primaryType": "nt:unstructured"
        },
        "include5": {
          "path": "jcr:content/renditions/original/jcr:content",
          "jcr:primaryType": "nt:unstructured"
        },
        "include6": {
          "path": "jcr:content/comments",
          "jcr:primaryType": "nt:unstructured"
        },
        "include7": {
          "path": "jcr:content/comments/*",
          "jcr:primaryType": "nt:unstructured"
        },
        "include8": {
          "path": "jcr:content/data/master",
          "jcr:primaryType": "nt:unstructured"
        },
        "include9": {
          "path": "jcr:content/usages",
          "jcr:primaryType": "nt:unstructured"
        },
        "include10": {
          "path": "jcr:content/renditions/text.txt/jcr:content",
          "jcr:primaryType": "nt:unstructured"
        }
      }
    },
    "facets": {
      "jcr:primaryType": "nt:unstructured",
      "topChildren": "100",
      "secure": "insecure"
    },
    "indexRules": {
      "jcr:primaryType": "nt:unstructured",
      "dam:Asset": {
        "jcr:primaryType": "nt:unstructured",
        "properties": {
          "jcr:primaryType": "nt:unstructured",
          "jcrLastModified": {
            "ordered": true,
            "name": "jcr:content/jcr:lastModified",
            "propertyIndex": true,
            "jcr:primaryType": "nt:unstructured",
            "type": "Date"
          },
          "jcrTitle": {
            "useInSpellcheck": true,
            "useInSuggest": true,
            "nodeScopeIndex": true,
            "name": "jcr:content/jcr:title",
            "propertyIndex": true,
            "boost": 2.0,
            "jcr:primaryType": "nt:unstructured"
          },
          "jcrDescription": {
            "nodeScopeIndex": true,
            "useInSpellcheck": true,
            "name": "jcr:content/jcr:description",
            "propertyIndex": true,
            "jcr:primaryType": "nt:unstructured",
            "useInSuggest": true
          },
          "jcrCreated": {
            "ordered": true,
            "name": "jcr:created",
            "propertyIndex": true,
            "jcr:primaryType": "nt:unstructured",
            "type": "Date"
          },
          "nodeName": {
            "nodeScopeIndex": true,
            "name": ":nodeName",
            "jcr:primaryType": "nt:unstructured",
            "useInSuggest": true
          }
        }
      }
    }
  }
}
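For anyone building such a definition through the JCR API instead of importing
JSON: in a definition like the one above, each nested JSON object corresponds
to a child node, and each scalar or array value to a property on that node.
The sketch below uses only the JDK. It walks a small hand-written excerpt of
the definition, held in nested maps, and lists the nodes and properties it
implies. The class name IndexDefinitionOutline, its methods, and the map-based
representation are assumptions made for illustration; they are not part of Oak.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexDefinitionOutline {

    // Recursively lists nodes (nested maps) and properties (all other values).
    static void outline(String path, Map<String, Object> node, List<String> out) {
        out.add("node " + path);
        for (Map.Entry<String, Object> e : node.entrySet()) {
            Object value = e.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                outline(path + "/" + e.getKey(), child, out);
            } else {
                out.add("  property " + path + "/@" + e.getKey() + " = " + value);
            }
        }
    }

    // Hypothetical helper: a hand-written excerpt of the definition above.
    // The jcr:primaryType of the child nodes is left out to keep it short.
    static Map<String, Object> sampleDefinition() {
        Map<String, Object> jcrTitle = new LinkedHashMap<>();
        jcrTitle.put("name", "jcr:content/jcr:title");
        jcrTitle.put("propertyIndex", true);
        jcrTitle.put("nodeScopeIndex", true);

        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("jcrTitle", jcrTitle);

        Map<String, Object> damAsset = new LinkedHashMap<>();
        damAsset.put("properties", properties);

        Map<String, Object> indexRules = new LinkedHashMap<>();
        indexRules.put("dam:Asset", damAsset);

        Map<String, Object> index = new LinkedHashMap<>();
        index.put("jcr:primaryType", "oak:QueryIndexDefinition");
        index.put("type", "lucene");
        index.put("compatVersion", 2);
        index.put("indexRules", indexRules);
        return index;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        outline("/oak:index/acmeAsset-1", sampleDefinition(), out);
        out.forEach(System.out::println);
    }
}
```

Each "node" line would become an addNode call and each "property" line a
setProperty call on that node, which is the same pattern as the hand-written
index code in the quoted message below.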
I wonder whether you would get more answers on stackoverflow.com nowadays? I'm
not sure...
Regards,
Thomas
From: Raffaele Gambelli <[email protected]>
Date: Wednesday, 11 September 2024 at 18:58
To: [email protected] <[email protected]>
Subject: Re: Indexing a binary and searching with contains, help request
Forgive me, but I am really asking for your help, I beg you... This issue is
driving me crazy. I searched for similar code in the Oak projects without
finding anything, and incredibly there is nothing similar on the web.
Is it possible that there is not a soul among you developers willing to help?
What good is this mailing list if none of those who carry on this beautiful
project ever take action?
Cordiali saluti / Best regards,
Raffaele Gambelli
Senior Java Developer
E [email protected]
[CEGEKA] Via Ettore Cristoni, 84
IT-40033 Bologna (IT), Italy
T +39 02 2544271
http://www.cegeka.com/
Confidentiality Notice
The information contained in this email is confidential. If you realize that
you are not the intended recipient of this email, please report the error to
the sender and delete the message immediately. Improper use of confidential
information may result in penalties.
Personal Data Protection
We inform you that your data will be processed by Cegeka in compliance with the
applicable legal provisions (Legislative Decree 196/2003 and EU Regulation
679/2016). For further details, you can consult our privacy notices at
https://www.cegeka.com/it/informazioni-sulla-privacy
________________________________
From: Raffaele Gambelli <[email protected]>
Sent: Wednesday, September 11, 2024 12:51 PM
To: [email protected] <[email protected]>
Subject: Indexing a binary and searching with contains, help request
Good morning,
I would like to ask for your help in understanding where I am going wrong in
building a working example in which I populate a repository with binary data,
index it, and run a contains query.
I have logging set to TRACE and can see the indexing working; when I execute
the query, however, I always get 0 results.
The repository is backed by a NodeStore (ns below), and I create it this way:
LuceneIndexProvider provider = new LuceneIndexProvider();
Oak oak = new Oak(ns)
        .with((QueryIndexProvider) provider)
        .with((Observer) provider)
        .with(new LuceneIndexEditorProvider());
repository = new Jcr(oak).createRepository();
Then I populate it in this way:
Node node = rootNode.addNode("node" + i, "nt:unstructured");
byte[] data = ("testo" + i).getBytes();
ByteArrayInputStream bais = new ByteArrayInputStream(data);
Binary binary = session.getValueFactory()
        .createBinary(bais);
try {
    node.setProperty("binaryData", binary);
} finally {
    binary.dispose();
}
node.setProperty("jcr:mimeType", "text/plain");
Then I define the index this way:
Node root = session.getRootNode();
Node oakIndex = root.getNode("oak:index");
Node index = oakIndex.addNode("contentTextIndex", "oak:QueryIndexDefinition");
index.setProperty("type", "lucene");
index.setProperty("async", (String[]) null);
Node indexRules = index.addNode("indexRules", "nt:unstructured");
Node ntBase = indexRules.addNode("nt:base", "nt:unstructured");
Node properties = ntBase.addNode("properties", "nt:unstructured");
Node binaryDataProperty = properties.addNode("binaryData", "nt:unstructured");
binaryDataProperty.setProperty("name", propertyName);
binaryDataProperty.setProperty("propertyIndex", true);
binaryDataProperty.setProperty("analyzed", true);
Node jcrMimeTypeProperty = properties.addNode("jcr:mimeType");
jcrMimeTypeProperty.setProperty("name", "jcr:mimeType");
jcrMimeTypeProperty.setProperty("propertyIndex", true);
jcrMimeTypeProperty.setProperty("analyzed", true);
Then I search in this way:
String sql2QueryString =
        "SELECT * FROM [nt:base] WHERE CONTAINS([binaryData], 'testo')";
Query sql2Query = queryManager.createQuery(sql2QueryString, Query.JCR_SQL2);
QueryResult result = sql2Query.execute();
and I read the results in this way:
NodeIterator nodes = result.getNodes();
while (nodes.hasNext()) {
    Node node = nodes.nextNode();
    log.info("Path: " + node.getPath());
    counter++;
}
log.info("Found {} results", counter);
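The search statement shown earlier embeds the term directly in a string
literal. As a small illustration, using only the JDK, the sketch below
assembles the same JCR-SQL2 statement from a node type, a property name, and a
term, doubling any single quote in the term so that the literal stays well
formed. The class ContainsQuery and its methods are hypothetical names
introduced here, and the sketch only covers SQL-2 string-literal quoting, not
the full-text search syntax inside the term.

```java
public class ContainsQuery {

    // Wraps a value in single quotes for a JCR-SQL2 string literal,
    // doubling any single quote it contains.
    static String quoteLiteral(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Builds: SELECT * FROM [nodeType] WHERE CONTAINS([property], 'term')
    static String containsQuery(String nodeType, String property, String term) {
        return "SELECT * FROM [" + nodeType + "] WHERE CONTAINS([" + property + "], "
                + quoteLiteral(term) + ")";
    }

    public static void main(String[] args) {
        // Prints the same statement as the hand-written one in this message.
        System.out.println(containsQuery("nt:base", "binaryData", "testo"));
    }
}
```

The resulting string can be passed to queryManager.createQuery(...,
Query.JCR_SQL2) exactly as in the code above.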
I'm using Oak 1.68.0 with tika-core and tika-parsers-standard-package 2.9.2.
In the logs I can see the indexing and the text extraction happening correctly;
if you want, I can attach a full log.
Thank you very much for your help.
Cordiali saluti / Best regards,
Raffaele Gambelli