[openwayback-dev] Re: CDXFormatIndex.getPrefixIterator(prefix) returns entries that do not match the prefix

andrew.jackson Mon, 13 Feb 2017 02:41:59 -0800

I believe the idea is to allow you to use the same API to perform other 
queries, such as listing all URIs that start with a given prefix (e.g. all 
URIs for a host).


If you only want the list that matches a specific URI, you should stop 
pulling results from the iterator when the URI is no longer the one you 
want.
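
As a minimal sketch of that idea (not taken from the OpenWayback code base): because a CDX index is sorted by key, you can wrap the iterator in takeWhile and stop at the first record whose key no longer starts with the prefix. The Capture case class and the sample keys below are hypothetical stand-ins so the example runs without OpenWayback on the classpath. With the real API you would presumably pass the result of index.getPrefixIterator(key).asScala and read the key from each CaptureSearchResult (for example via getUrlKey; this is an assumption, so check it against your version).

```scala
// Hypothetical stand-in for org.archive.wayback.core.CaptureSearchResult,
// used only so this sketch is self-contained.
final case class Capture(urlKey: String, originalUrl: String)

object PrefixFilter {
  // Yield records only while their key still starts with the prefix.
  // This relies on the records arriving in sorted key order: once one
  // record fails the test, no later record can match.
  def matching[A](records: Iterator[A], prefix: String)(keyOf: A => String): Iterator[A] =
    records.takeWhile(r => keyOf(r).startsWith(prefix))

  def main(args: Array[String]): Unit = {
    // Illustrative data in sorted key order; the keys are made up.
    val sorted = Iterator(
      Capture("rmspumptools.com/innovation.php", "http://www.rmspumptools.com/innovation.php"),
      Capture("slaperoo.com/", "http://www.slaperoo.com/"),
      Capture("speckip.com/", "http://speckip.com/")
    )

    // Only the captures whose key starts with the prefix are printed.
    matching(sorted, "rmspumptools.com/innovation.php")(_.urlKey)
      .foreach(c => println(c.originalUrl))
  }
}
```

For an exact-URI match rather than a prefix match, compare the key with == instead of startsWith.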

HTH,
Andy Jackson

On Wednesday, 8 February 2017 17:20:16 UTC, David Portabella wrote:
>
> Using CDXFormatIndex.getPrefixIterator(prefix),
>
> I would expect to get only the entries that match this prefix.
>
> Instead, it finds the first entry matching this prefix, and then it returns 
> all entries from that point until the end of the archive.
>
> So it also returns entries that do not match the prefix.
>
>
> Why?
>
> How can I get *only* the entries that match the prefix?
>
>
>
> Example (written in Scala)
>
>
> package application
>
> import org.archive.wayback.core.CaptureSearchResult
> import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
> import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
>
> import scala.collection.JavaConverters._
>
> object ResponseWarcReaderExample {
>   def main(args: Array[String]): Unit = {
>     val index = new CDXFormatIndex()
>     index.setPath("/dataset/files.warc.cdx")
>
>     // Canonicalize the URL into the key form used by the CDX index.
>     val key = canonicalize("http://www.rmspumptools.com/innovation.php")
>     index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
>       println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
>     }
>   }
>
>   val canonicalizer = new AggressiveUrlCanonicalizer()
>   def canonicalize(url: String): String =
>     canonicalizer.urlStringToKey(url)
> }
>
>
> The output is as follows:
> files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
> files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
> files.warc.gz:11538: http://www.slaperoo.com/
> files.warc.gz:1268086: https://www.smarttech.com/patents
> files.warc.gz:826021: http://speckip.com/
> ...
>
>
>
>
>
> Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls 
> FlatFile.getRecordIterator(final String prefix),
>
> which returns an input stream starting at the offset of the first entry 
> matching the prefix; and then it reads everything until the end of the 
> archive.
>
>   RandomAccessFile raf = new RandomAccessFile(file,"r");
>
>   findKeyOffset(raf, prefix);
>
>
>
>
>
