This admittedly surprising behaviour appears to be intended: https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java#L39-L41
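Because the index is sorted by key, all entries matching a prefix sit next to each other, so a caller can simply stop at the first entry that no longer matches. Here is a minimal, self-contained sketch of that pattern in Scala. It does not touch the OpenWayback classes: `sortedKeys`, `fromFirstMatch` and `boundedByPrefix` are hypothetical names, and the keys are illustrative stand-ins for canonicalized URL keys. With the real index you would presumably compare against each result's URL key instead (I believe `CaptureSearchResult` exposes this as `getUrlKey`, but please check).

```scala
object PrefixBoundedIteration {
  // Illustrative stand-ins for canonicalized keys, in sorted order
  // (hypothetical data, not real canonicalizer output).
  val sortedKeys: List[String] = List(
    "rmspumptools.com/innovation.php",
    "sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents",
    "slaperoo.com/",
    "smarttech.com/patents",
    "speckip.com/"
  )

  // Mimics the behaviour described in this thread: seek to the first key
  // that is >= the prefix, then keep yielding entries until the end.
  def fromFirstMatch(prefix: String): Iterator[String] =
    sortedKeys.iterator.dropWhile(_ < prefix)

  // Client-side bound: stop at the first entry whose key no longer starts
  // with the prefix. Only valid because the input is sorted by key.
  def boundedByPrefix[A](it: Iterator[A], prefix: String)(keyOf: A => String): Iterator[A] =
    it.takeWhile(r => keyOf(r).startsWith(prefix))

  def main(args: Array[String]): Unit = {
    val prefix = "rmspumptools.com/innovation.php"
    boundedByPrefix(fromFirstMatch(prefix), prefix)(identity).foreach(println)
  }
}
```

Run as is, this prints only `rmspumptools.com/innovation.php`. Since `takeWhile` is lazy, nothing past the first non-matching entry is read.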
e.g. here's an example of the match being tested in the client rather than in the iterator:
https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/CaptureToUrlSearchResultIterator.java#L79

I'm a bit wary of changing FlatFile.java, as it is used in a number of places that may depend on this behaviour. However, it would seem reasonable to modify the CDXIndex or CDXFormatIndex child classes to behave in a less surprising fashion.

Best,
Andy

On Monday, 13 February 2017 15:43:26 UTC, David Portabella wrote:
>
> The API contains these two functions: getPrefixIterator and
> getUrlIterator.
>
> With getPrefixIterator, I would expect to get all entries that match a
> given prefix. With getUrlIterator, I would expect to get all entries that
> match a URL. The problem is that getPrefixIterator returns the first
> entry that matches a given prefix, and then all the following entries until
> the end of the archive (which do not match the prefix). The same goes for
> getUrlIterator: it returns the first entry that matches the full URL, and
> then all the following entries until the end of the archive.
>
> I think the example is pretty clear on this. I asked for all the
> entries that match the prefix rmspumptools.com/innovation.php
> and I get the entry http://www.rmspumptools.com/innovation.php
> (this is correct), but also
> https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents,
> and http://www.slaperoo.com/ and so on until the end of the archive (which
> do not match the prefix).
>
>
> On Monday, February 13, 2017 at 11:40:57 AM UTC+1, andrew.jackson wrote:
>>
>> I believe the idea is to allow you to use the same API to perform other
>> queries, such as listing all URIs that start with a given prefix (e.g. all
>> URIs for a host).
>>
>> If you only want the list that matches a specific URI, you should stop
>> pulling results from the iterator when the URI is no longer the one you
>> want.
>>
>> HTH,
>> Andy Jackson
>>
>> On Wednesday, 8 February 2017 17:20:16 UTC, David Portabella wrote:
>>>
>>> Using CDXFormatIndex.getPrefixIterator(prefix),
>>> I would expect to get only the entries that match this prefix.
>>>
>>> Instead, it finds the first entry matching this prefix, and then it returns
>>> all entries from that point until the end of the archive.
>>> So it returns entries that do not match the prefix.
>>>
>>> Why?
>>> How to *only* get the entries that match the prefix?
>>>
>>> Example (written in Scala):
>>>
>>> package application
>>>
>>> import org.archive.wayback.core.CaptureSearchResult
>>> import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
>>> import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
>>>
>>> import scala.collection.JavaConverters._
>>>
>>> object ResponseWarcReaderExample {
>>>   def main(args: Array[String]): Unit = {
>>>     val index = new CDXFormatIndex()
>>>     index.setPath("/dataset/files.warc.cdx")
>>>
>>>     val key = canonicalize("http://www.rmspumptools.com/innovation.php")
>>>     // Prints every entry the iterator yields, one per line
>>>     index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
>>>       println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
>>>     }
>>>   }
>>>
>>>   val canonicalizer = new AggressiveUrlCanonicalizer()
>>>   def canonicalize(url: String): String =
>>>     canonicalizer.urlStringToKey(url)
>>> }
>>>
>>> The output is as follows:
>>>
>>> files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
>>> files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
>>> files.warc.gz:11538: http://www.slaperoo.com/
>>> files.warc.gz:1268086: https://www.smarttech.com/patents
>>> files.warc.gz:826021: http://speckip.com/
>>> ...
>>>
>>> Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls
>>> FlatFile.getRecordIterator(final String prefix),
>>> which returns an input stream starting at the offset of the first entry
>>> matching the prefix, and then reads everything until the end of the
>>> archive:
>>>
>>>     RandomAccessFile raf = new RandomAccessFile(file, "r");
>>>     findKeyOffset(raf, prefix);
