>> So unless you "store" that value
>> with the document as a stored field, you'll have to "uninvert" the
>> index yourself.

That helps, and it explains why there is no support for this in the standard API.

> Luke has some capabilities to look at the index at a low level,
> perhaps that could give you some pointers. I think you can pull
> the older branch from here:
> https://github.com/DmitryKey/luke

Thanks for the pointer. Luke supports reconstructing a Document, so it
should have logic to retrieve non-stored field names. I will have a look.

Chetan Mehrotra

On Tue, Jan 2, 2018 at 8:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Luke has some capabilities to look at the index at a low level,
> perhaps that could give you some pointers. I think you can pull
> the older branch from here:
> https://github.com/DmitryKey/luke
>
> or:
> https://code.google.com/archive/p/luke/
>
> NOTE: This is not a part of Lucene, but an independent project
> so it won't have the same labels.
>
> Best,
> Erick
>
> On Tue, Jan 2, 2018 at 2:06 AM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>> Ok. I think you should look at the Java API -- this will give you more
>> clarity of what is actually stored in the index
>> and how to extract it. The thing (I think) you're missing is that an
>> inverted index points in the "other" direction (from a given value to
>> all documents that contained it). So unless you "store" that value
>> with the document as a stored field, you'll have to "uninvert" the
>> index yourself.
>>
>> Dawid
>>
>> On Tue, Jan 2, 2018 at 10:05 AM, Chetan Mehrotra
>> <chetan.mehro...@gmail.com> wrote:
>>>> Only stored fields are kept for each document. If you need to dump
>>>> internal data structures (terms, positions, offsets, payloads, you
>>>> name it) you'll need to dive into the API and traverse all segments,
>>>> then dump the above (and note that document IDs are per-segment and
>>>> will have to be somehow consolidated back to your document IDs).
>>>
>>> Okie. So this would require deeper understanding of index format.
>>> Would have a look.
>>> To start with I was just looking for a way to dump
>>> indexed field names per document and nothing more
>>>
>>> /foo/bar|status, lastModified
>>> /foo/baz|status, type
>>>
>>> Where path is stored field (primary key) and rest of the stuff are
>>> sorted field names. Then such a file can be generated for both indexes
>>> and diff can be done post sorting
>>>
>>>> I don't quite understand the motive here -- the indexes should behave
>>>> identically regardless of the order of input documents; what's the
>>>> point of dumping all this information?
>>>
>>> This is because of way indexing logic is given access to the Node
>>> hierarchy. Would try to provide a brief explanation
>>>
>>> Jackrabbit Oak provides a hierarchical storage in a tree form where
>>> sub trees can be of specific type.
>>>
>>> /content/dam/assets/december/banner.png
>>>   - jcr:primaryType = "app:Asset"
>>>   + jcr:content
>>>     - jcr:primaryType = "app:AssetContent"
>>>     + metadata
>>>       - status = "published"
>>>       - jcr:lastModified = "2009-10-9T21:52:31"
>>>       - app:tags = ["properties:orientation/landscape",
>>>         "marketing:interest/product"]
>>>       - comment = "Image for december launch"
>>>       - jcr:title = "December Banner"
>>>       + xmpMM:History
>>>         + 1
>>>           - softwareAgent = "Adobe Photoshop"
>>>           - author = "David"
>>>     + renditions (nt:folder)
>>>       + original (nt:file)
>>>         + jcr:content
>>>           - jcr:data = ...
>>>
>>> To access this content Oak provides a NodeStore/NodeState api [1]
>>> which provides way to access the children. The default indexing logic
>>> uses this api to read the content to be indexed and uses index rules
>>> which allow to index content via relative path. For e.g. it would
>>> create a Lucene field status which maps to
>>> jcr:content/metadata/@status (for an index rule for nodes of type
>>> app:Asset).
>>>
>>> This mode of access proved to be slow over remote storage like Mongo
>>> specially for full reindexing case.
>>> So we implemented a newer approach
>>> where all content was dumped in a flat file (1 node per line) ->
>>> sorted file and then have a NodeState impl over this flat file. This
>>> changes the way how relative paths work and thus there may be some
>>> potential bugs in newer implementation.
>>>
>>> Hence we need to validate that indexing using new api produces same
>>> index as using the stable api. For a case both index would have a
>>> document for "/content/dam/assets/december/banner.png" but if newer
>>> impl had some bug then it may not have indexed the "status" field
>>>
>>> So I am looking for way where I can map all fieldNames for a given
>>> document. Actual indexed content would be same if both index have
>>> "status" field indexed so we only need to validate fieldnames per
>>> document. Something like
>>>
>>> Thanks for reading all this if you have read so far :)
>>>
>>> Chetan Mehrotra
>>> [1]
>>> https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeState.java
>>>
>>>
>>> On Tue, Jan 2, 2018 at 2:10 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>>>> Only stored fields are kept for each document. If you need to dump
>>>> internal data structures (terms, positions, offsets, payloads, you
>>>> name it) you'll need to dive into the API and traverse all segments,
>>>> then dump the above (and note that document IDs are per-segment and
>>>> will have to be somehow consolidated back to your document IDs).
>>>>
>>>> I don't quite understand the motive here -- the indexes should behave
>>>> identically regardless of the order of input documents; what's the
>>>> point of dumping all this information?
>>>>
>>>> Dawid
>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
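
For anyone following the thread, here is a rough sketch of what "uninverting" means for the field-name dump discussed above. It deliberately does not use the Lucene API: the nested map (field name -> term -> document IDs) is only a toy stand-in for an inverted index, and the class name FieldNameDump and the methods uninvert and dump are made up for illustration. With a real index the same loop would have to run per segment, with the per-segment document IDs consolidated as Dawid describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class FieldNameDump {

    /**
     * Walks a toy inverted index (field -> term -> posting list of doc IDs)
     * and flips it into doc ID -> sorted set of field names.
     * Hypothetical model only; this is not Lucene's data structure.
     */
    public static Map<Integer, TreeSet<String>> uninvert(
            Map<String, Map<String, int[]>> invertedIndex) {
        Map<Integer, TreeSet<String>> fieldsByDoc = new TreeMap<>();
        for (Map.Entry<String, Map<String, int[]>> field : invertedIndex.entrySet()) {
            for (int[] postings : field.getValue().values()) {
                for (int docId : postings) {
                    // Any posting for this field means the field was indexed for docId.
                    fieldsByDoc.computeIfAbsent(docId, id -> new TreeSet<>())
                               .add(field.getKey());
                }
            }
        }
        return fieldsByDoc;
    }

    /**
     * Produces sorted "path|field1, field2" lines. storedPaths plays the role
     * of the stored primary-key field (doc ID -> path).
     */
    public static List<String> dump(Map<Integer, String> storedPaths,
                                    Map<Integer, TreeSet<String>> fieldsByDoc) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, String> doc : storedPaths.entrySet()) {
            TreeSet<String> fields =
                    fieldsByDoc.getOrDefault(doc.getKey(), new TreeSet<>());
            lines.add(doc.getValue() + "|" + String.join(", ", fields));
        }
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample data mirroring the /foo/bar and /foo/baz example.
        Map<String, Map<String, int[]>> index = new TreeMap<>();
        index.put("status", Collections.singletonMap("published", new int[] {0, 1}));
        index.put("lastModified", Collections.singletonMap("2009-10-09", new int[] {0}));
        index.put("type", Collections.singletonMap("asset", new int[] {1}));

        Map<Integer, String> paths = new TreeMap<>();
        paths.put(0, "/foo/bar");
        paths.put(1, "/foo/baz");

        // Prints:
        // /foo/bar|lastModified, status
        // /foo/baz|status, type
        dump(paths, uninvert(index)).forEach(System.out::println);
    }
}
```

Because the lines come out sorted by path and the field names within each line are sorted too, two such dumps (one per index) could be compared with an ordinary text diff.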