Luke has some capabilities to look at the index at a low level; perhaps that could give you some pointers. I think you can pull the older branch from here:
https://github.com/DmitryKey/luke
or: https://code.google.com/archive/p/luke/

NOTE: This is not a part of Lucene but an independent project, so it won't have the same labels.

Best,
Erick

On Tue, Jan 2, 2018 at 2:06 AM, Dawid Weiss <dawid.we...@gmail.com> wrote:
> Ok. I think you should look at the Java API -- this will give you more
> clarity about what is actually stored in the index and how to extract
> it. The thing (I think) you're missing is that an inverted index
> points in the "other" direction (from a given value to all documents
> that contained it). So unless you "store" that value with the document
> as a stored field, you'll have to "uninvert" the index yourself.
>
> Dawid
>
> On Tue, Jan 2, 2018 at 10:05 AM, Chetan Mehrotra
> <chetan.mehro...@gmail.com> wrote:
>>> Only stored fields are kept for each document. If you need to dump
>>> internal data structures (terms, positions, offsets, payloads, you
>>> name it) you'll need to dive into the API and traverse all segments,
>>> then dump the above (and note that document IDs are per-segment and
>>> will have to be somehow consolidated back to your document IDs).
>>
>> Okie. So this would require a deeper understanding of the index
>> format; I will have a look. To start with, I was just looking for a
>> way to dump the indexed field names per document and nothing more:
>>
>> /foo/bar|status, lastModified
>> /foo/baz|status, type
>>
>> Here the path is the stored field (primary key) and the rest are the
>> sorted field names. Such a file can then be generated for both
>> indexes, and a diff can be done after sorting.
>>
>>> I don't quite understand the motive here -- the indexes should behave
>>> identically regardless of the order of input documents; what's the
>>> point of dumping all this information?
>>
>> This is because of the way the indexing logic is given access to the
>> node hierarchy. I will try to provide a brief explanation.
>>
>> Jackrabbit Oak provides hierarchical storage in a tree form, where
>> subtrees can be of a specific type.
>>
>> /content/dam/assets/december/banner.png
>>   - jcr:primaryType = "app:Asset"
>>   + jcr:content
>>     - jcr:primaryType = "app:AssetContent"
>>     + metadata
>>       - status = "published"
>>       - jcr:lastModified = "2009-10-9T21:52:31"
>>       - app:tags = ["properties:orientation/landscape",
>>         "marketing:interest/product"]
>>       - comment = "Image for december launch"
>>       - jcr:title = "December Banner"
>>     + xmpMM:History
>>       + 1
>>         - softwareAgent = "Adobe Photoshop"
>>         - author = "David"
>>     + renditions (nt:folder)
>>       + original (nt:file)
>>         + jcr:content
>>           - jcr:data = ...
>>
>> To access this content, Oak provides a NodeStore/NodeState API [1]
>> which gives a way to access the children. The default indexing logic
>> uses this API to read the content to be indexed, applying index rules
>> that allow content to be indexed via relative paths. For example, an
>> index rule for nodes of type app:Asset would create a Lucene field
>> "status" which maps to jcr:content/metadata/@status.
>>
>> This mode of access proved to be slow over remote storage like
>> MongoDB, especially for the full-reindexing case. So we implemented a
>> newer approach where all content is dumped into a flat file (one node
>> per line), the file is sorted, and a NodeState implementation is
>> layered over it. This changes the way relative paths work, so there
>> may be some potential bugs in the newer implementation.
>>
>> Hence we need to validate that indexing via the new API produces the
>> same index as the stable API. In both cases the index would have a
>> document for "/content/dam/assets/december/banner.png", but if the
>> newer implementation had a bug it may not have indexed the "status"
>> field.
>>
>> So I am looking for a way to map all field names for a given
>> document. The actual indexed content would be the same as long as
>> both indexes have the "status" field indexed, so we only need to
>> validate field names per document.
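The per-document dump of field names Chetan asks for can be sketched along the lines Dawid suggests: walk every segment, iterate the inverted fields, and "uninvert" the postings back to documents, using the leaf's docBase to consolidate per-segment IDs. This is an untested sketch against the Lucene 7-era Java API; the stored primary-key field is assumed here to be literally named "path" (adjust to whatever the Oak index actually stores), and only inverted text fields show up this way -- points and doc-values-only fields would need separate handling.

```java
import java.nio.file.Paths;
import java.util.*;

import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;

public class IndexedFieldNameDumper {

    // Pure formatting helper: "path|field1, field2" with sorted field names.
    static String formatLine(String path, SortedSet<String> fieldNames) {
        return path + "|" + String.join(", ", fieldNames);
    }

    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
            // index-wide docId -> sorted set of field names seen for that doc
            Map<Integer, SortedSet<String>> fieldsPerDoc = new TreeMap<>();
            for (LeafReaderContext ctx : reader.leaves()) {
                LeafReader leaf = ctx.reader();
                for (FieldInfo fi : leaf.getFieldInfos()) {
                    Terms terms = leaf.terms(fi.name);
                    if (terms == null) continue; // field not inverted in this segment
                    TermsEnum termsEnum = terms.iterator();
                    PostingsEnum postings = null;
                    while (termsEnum.next() != null) {
                        postings = termsEnum.postings(postings, PostingsEnum.NONE);
                        int doc;
                        while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                            // ctx.docBase maps the per-segment id to an index-wide id
                            fieldsPerDoc.computeIfAbsent(ctx.docBase + doc,
                                                         k -> new TreeSet<>())
                                        .add(fi.name);
                        }
                    }
                }
            }
            for (Map.Entry<Integer, SortedSet<String>> e : fieldsPerDoc.entrySet()) {
                // resolve the stored primary key for each document
                String path = reader.document(e.getKey()).get("path");
                System.out.println(formatLine(path, e.getValue()));
            }
        }
    }
}
```

Run with lucene-core on the classpath against each index directory; the output is one "path|field1, field2" line per document, already in the diffable format sketched earlier in the thread.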
>> Something like
>>
>> Thanks for reading all this, if you have read this far :)
>>
>> Chetan Mehrotra
>>
>> [1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeState.java
>>
>> On Tue, Jan 2, 2018 at 2:10 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>>> Only stored fields are kept for each document. If you need to dump
>>> internal data structures (terms, positions, offsets, payloads, you
>>> name it) you'll need to dive into the API and traverse all segments,
>>> then dump the above (and note that document IDs are per-segment and
>>> will have to be somehow consolidated back to your document IDs).
>>>
>>> I don't quite understand the motive here -- the indexes should behave
>>> identically regardless of the order of input documents; what's the
>>> point of dumping all this information?
>>>
>>> Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
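Once such a dump file exists for both indexes, the "diff can be done post sorting" step Chetan mentions can be plain Unix diff, or a small stand-alone program like this sketch (the "path|fields" line format and the two-file invocation are assumptions carried over from the thread, not an existing tool):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class DumpDiff {

    // Parse "path|field1, field2" lines into path -> field-list string.
    static SortedMap<String, String> load(List<String> lines) {
        SortedMap<String, String> docs = new TreeMap<>();
        for (String line : lines) {
            int sep = line.indexOf('|');
            if (sep < 0) continue; // skip malformed lines
            docs.put(line.substring(0, sep), line.substring(sep + 1));
        }
        return docs;
    }

    // Report documents missing from one side or with differing field names.
    static List<String> diff(SortedMap<String, String> a, SortedMap<String, String> b) {
        List<String> problems = new ArrayList<>();
        SortedSet<String> allPaths = new TreeSet<>(a.keySet());
        allPaths.addAll(b.keySet());
        for (String path : allPaths) {
            String fa = a.get(path), fb = b.get(path);
            if (fa == null)      problems.add(path + ": missing in first index");
            else if (fb == null) problems.add(path + ": missing in second index");
            else if (!fa.equals(fb))
                problems.add(path + ": fields differ (" + fa + " vs " + fb + ")");
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        SortedMap<String, String> a = load(Files.readAllLines(Paths.get(args[0])));
        SortedMap<String, String> b = load(Files.readAllLines(Paths.get(args[1])));
        diff(a, b).forEach(System.out::println);
    }
}
```

Because both maps are keyed by path, document order in the two indexes no longer matters, which is exactly why the dump-then-compare approach works for validating the flat-file reindexing path.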