When I call CDXFormatIndex.getPrefixIterator(prefix),
I would expect to get only the entries that match this prefix.
Instead, it finds the first entry matching the prefix and then returns all
entries from that point until the end of the archive,
so it also returns entries that do not match the prefix.
Why?
How can I get *only* the entries that match the prefix?
Example (written in Scala):

package application

import org.archive.wayback.core.CaptureSearchResult
import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
import scala.collection.JavaConverters._

object ResponseWarcReaderExample {
  def main(args: Array[String]): Unit = {
    val index = new CDXFormatIndex()
    index.setPath("/dataset/files.warc.cdx")
    // The index is keyed by canonicalized URLs, so canonicalize the query first.
    val key = canonicalize("http://www.rmspumptools.com/innovation.php")
    // Print every capture returned by the prefix iterator.
    index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
      println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
    }
  }

  val canonicalizer = new AggressiveUrlCanonicalizer()

  def canonicalize(url: String): String =
    canonicalizer.urlStringToKey(url)
}

The output is as follows:
files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
files.warc.gz:11538: http://www.slaperoo.com/
files.warc.gz:1268086: https://www.smarttech.com/patents
files.warc.gz:826021: http://speckip.com/
...
Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls
FlatFile.getRecordIterator(final String prefix),
which returns an input stream starting at the offset of the first entry matching
the prefix and then reads everything until the end of the archive:
RandomAccessFile raf = new RandomAccessFile(file,"r");
findKeyOffset(raf, prefix);
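
To make the behaviour described above concrete, here is a minimal, self-contained sketch that does not use the Wayback classes at all. The object name, the sample keys, and both helper methods are hypothetical. It assumes the index is a lexicographically sorted list of canonicalized URL keys, which is what a seek-to-first-match lookup relies on. Under that assumption all entries sharing a prefix are contiguous, so stopping at the first non-matching key is enough to get only the matches.

```scala
object PrefixIteratorSketch {
  // Hypothetical stand-in for a sorted CDX file: one canonicalized key per line.
  val sortedKeys: Vector[String] = Vector(
    "rmspumptools.com/about.php",
    "rmspumptools.com/innovation.php",
    "sjm.com/en/legal-notices-patents",
    "slaperoo.com/",
    "smarttech.com/patents"
  )

  // Mimics the observed behaviour: skip to the first key >= prefix,
  // then keep reading until the end of the data.
  def fromFirstMatch(prefix: String): Iterator[String] =
    sortedKeys.iterator.dropWhile(_ < prefix)

  // Same starting point, but stop at the first key that no longer
  // starts with the prefix. Valid only because the keys are sorted.
  def prefixOnly(prefix: String): Iterator[String] =
    fromFirstMatch(prefix).takeWhile(_.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    val prefix = "rmspumptools.com/innovation.php"
    println(fromFirstMatch(prefix).toList) // the match plus every later key
    println(prefixOnly(prefix).toList)     // only keys starting with the prefix
  }
}
```

With the real index, the same idea would presumably be to wrap the returned iterator in a takeWhile on the canonicalized key, for example index.getPrefixIterator(key).asScala.takeWhile(_.getUrlKey.startsWith(key)). This assumes that CaptureSearchResult exposes the key through getUrlKey and that the CDX file is sorted; I have not verified either against the current sources.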