The API contains these two functions: getPrefixIterator and getUrlIterator.
With getPrefixIterator, I would expect to get all entries that match a given prefix. With getUrlIterator, I would expect to get all entries that match a given URL.

The problem is that getPrefixIterator returns the first entry that matches the given prefix, followed by every subsequent entry up to the end of the archive, including entries that do not match the prefix. The same goes for getUrlIterator: it returns the first entry that matches the full URL, and then all the following entries until the end of the archive.

I think the example makes this pretty clear. I asked for all the entries that match the prefix rmspumptools.com/innovation.php, and I get the entry http://www.rmspumptools.com/innovation.php (this is correct), but also https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents, then http://www.slaperoo.com/, and so on until the end of the archive, even though these do not match the prefix.

On Monday, February 13, 2017 at 11:40:57 AM UTC+1, andrew.jackson wrote:
>
> I believe the idea is to allow you to use the same API to perform other
> queries, such as listing all URIs that start with a given prefix (e.g. all
> URIs for a host).
>
> If you only want the list that matches a specific URI, you should stop
> pulling results from the iterator when the URI is no longer the one you
> want.
>
> HTH,
> Andy Jackson
>
> On Wednesday, 8 February 2017 17:20:16 UTC, David Portabella wrote:
>>
>> Using CDXFormatIndex.getPrefixIterator(prefix),
>> I would expect to get only the entries that match this prefix.
>>
>> Instead, it finds the first entry matching this prefix, and then it returns
>> all entries from that point until the end of the archive.
>>
>> So, it returns entries that do not match the prefix.
>>
>> Why?
>> How to *only* get the entries that match the prefix?
>>
>> Example (written in Scala):
>>
>> package application
>>
>> import org.archive.wayback.core.CaptureSearchResult
>> import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
>> import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
>>
>> import scala.collection.JavaConverters._
>>
>> object ResponseWarcReaderExample {
>>   def main(args: Array[String]) {
>>     val index = new CDXFormatIndex()
>>     index.setPath("/dataset/files.warc.cdx")
>>
>>     val key = canonicalize("http://www.rmspumptools.com/innovation.php")
>>     val it = index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
>>       println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
>>     }
>>   }
>>
>>   val canonicalizer = new AggressiveUrlCanonicalizer()
>>   def canonicalize(url: String): String =
>>     canonicalizer.urlStringToKey(url)
>> }
>>
>> The output is as follows:
>>
>> files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
>> files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
>> files.warc.gz:11538: http://www.slaperoo.com/
>> files.warc.gz:1268086: https://www.smarttech.com/patents
>> files.warc.gz:826021: http://speckip.com/
>> ...
>>
>> Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls
>> FlatFile.getRecordIterator(final String prefix),
>> which returns an input stream starting at the offset of the first entry
>> matching the prefix, and then reads everything until the end of the
>> archive.
>>
>> RandomAccessFile raf = new RandomAccessFile(file, "r");
>> findKeyOffset(raf, prefix);
>>
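For anyone landing on this thread: Andy's suggestion (stop pulling results once the key no longer matches) could be sketched roughly as below. This is only a minimal, self-contained illustration and not the library's own code. The Entry case class and the matchingPrefix helper are hypothetical stand-ins I made up so the snippet runs without OpenWayback on the classpath, and the sample keys other than rmspumptools.com/innovation.php are invented. It assumes the index is sorted by key, so that all matching entries are contiguous.

```scala
object PrefixScanSketch {
  // Hypothetical stand-in for a CDX record: just the canonicalized key
  // and the original URL. The real class is CaptureSearchResult.
  final case class Entry(urlKey: String, originalUrl: String)

  // Keep consuming entries only while the key still starts with the prefix.
  // Assumes the underlying iterator is sorted by key, so the first
  // non-matching entry means there are no further matches.
  def matchingPrefix(it: Iterator[Entry], prefix: String): Iterator[Entry] =
    it.takeWhile(_.urlKey.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    // Invented sample data, already in sorted key order.
    val sorted = List(
      Entry("rmspumptools.com/innovation.php", "http://www.rmspumptools.com/innovation.php"),
      Entry("sjm.com/en", "https://www.sjm.com/en"),
      Entry("slaperoo.com/", "http://www.slaperoo.com/")
    )

    // Prints only http://www.rmspumptools.com/innovation.php
    matchingPrefix(sorted.iterator, "rmspumptools.com/innovation.php")
      .foreach(e => println(e.originalUrl))
  }
}
```

With the real index, the same takeWhile would presumably be applied to index.getPrefixIterator(key).asScala, comparing against whatever accessor exposes the canonicalized key on CaptureSearchResult; I have not verified that part, so treat it as an assumption.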
