I believe the idea is to allow you to use the same API to perform other
queries, such as listing all URIs that start with a given prefix (e.g. all
URIs for a host).
If you only want the entries that match a specific URI, stop pulling
results from the iterator as soon as the URI is no longer the one you
want.
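
In Scala, stopping early could look roughly like the sketch below. It is a minimal illustration, not the library's own mechanism: `PrefixOnly` and `takePrefix` are hypothetical names, and plain strings stand in for the index entries so that the snippet runs without a CDX file.

```scala
// Hypothetical helper (not part of OpenWayback): consume results only
// while their key still starts with the requested prefix.
// This assumes the underlying iterator yields entries in sorted key order,
// so all matching entries are contiguous at the start.
object PrefixOnly {
  def takePrefix[A](results: Iterator[A], prefix: String)(keyOf: A => String): Iterator[A] =
    results.takeWhile(r => keyOf(r).startsWith(prefix))

  def main(args: Array[String]): Unit = {
    // Stand-in for what a prefix iterator yields: the first match,
    // followed by everything up to the end of the file.
    val sortedKeys = Iterator(
      "rmspumptools.com/innovation.php",
      "sjm.com/en/legal-notices-patents",
      "slaperoo.com/"
    )
    // Iteration stops at the first key that no longer matches.
    takePrefix(sortedKeys, "rmspumptools.com/innovation.php")(identity)
      .foreach(println)
  }
}
```

With the real index this would presumably become something like `takePrefix(index.getPrefixIterator(key).asScala, key)(_.getUrlKey)`, assuming `getUrlKey` returns the canonicalized key for each result. For an exact URI match rather than a prefix match, compare with `==` instead of `startsWith`.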
HTH,
Andy Jackson
On Wednesday, 8 February 2017 17:20:16 UTC, David Portabella wrote:
>
> Using CDXFormatIndex.getPrefixIterator(prefix),
>
> I would expect to get only the entries that match this prefix.
>
> Instead, it finds the first entry matching this prefix and then returns
> all entries from that point until the end of the archive, so it also
> returns entries that do not match the prefix.
>
>
> Why?
>
> How can I get *only* the entries that match the prefix?
>
>
>
> Example (written in Scala)
>
>
> package application
>
> import org.archive.wayback.core.CaptureSearchResult
> import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
> import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
>
> import scala.collection.JavaConverters._
>
> object ResponseWarcReaderExample {
>   def main(args: Array[String]): Unit = {
>     val index = new CDXFormatIndex()
>     index.setPath("/dataset/files.warc.cdx")
>
>     val key = canonicalize("http://www.rmspumptools.com/innovation.php")
>     index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
>       println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
>     }
>   }
>
>   val canonicalizer = new AggressiveUrlCanonicalizer()
>   def canonicalize(url: String): String =
>     canonicalizer.urlStringToKey(url)
> }
>
>
> The output is as follows:
> files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
> files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
> files.warc.gz:11538: http://www.slaperoo.com/
> files.warc.gz:1268086: https://www.smarttech.com/patents
> files.warc.gz:826021: http://speckip.com/
> ...
>
> Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls
> FlatFile.getRecordIterator(final String prefix),
>
> which returns an input stream starting at the offset of the first entry
> matching the prefix, and then reads everything until the end of the
> archive.
>
> RandomAccessFile raf = new RandomAccessFile(file,"r");
>
> findKeyOffset(raf, prefix);