[openwayback-dev] Re: CDXFormatIndex.getPrefixIterator(prefix) returns entries that do not match the prefix

David Portabella Mon, 13 Feb 2017 07:43:51 -0800

The API contains these two functions: getPrefixIterator and getUrlIterator.


With getPrefixIterator, I would expect to get all entries that match a 
given prefix. With getUrlIterator, I would expect to get all entries that 
match a URL. The problem is that getPrefixIterator returns the first entry 
that matches the given prefix, and then all the following entries until the 
end of the archive (which do not match the prefix). The same goes for 
getUrlIterator: it returns the first entry that matches the full URL, and 
then all the following entries until the end of the archive.

I think that the example is pretty clear on this. I asked for all the 
entries that match the prefix rmspumptools.com/innovation.php, 
and I get the entry http://www.rmspumptools.com/innovation.php 
(this is correct), but also 
https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents, 
http://www.slaperoo.com/ and so on until the end of the archive (none of 
which match the prefix).
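
Following the suggestion quoted below (stop pulling results once the entry no 
longer matches), a minimal sketch of the workaround could look like this. It 
relies on the index being sorted by key, so the first non-matching entry means 
no later entry can match. The Entry class, the matchingPrefix helper and the 
sample keys are hypothetical stand-ins, not part of the OpenWayback API.

```scala
object PrefixTakeWhileSketch {
  // Hypothetical stand-in for one CDX entry: canonicalized key plus original URL.
  final case class Entry(urlKey: String, originalUrl: String)

  // Consume entries only while the key still starts with the prefix.
  // This assumes the underlying iterator is sorted by key, so the first
  // non-matching entry marks the end of the matching range.
  def matchingPrefix(it: Iterator[Entry], prefix: String): Iterator[Entry] =
    it.takeWhile(_.urlKey.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    // Illustrative data: what an iterator positioned at the first match
    // might yield from there to the end of the index (keys are assumed).
    val fromFirstMatch = Iterator(
      Entry("rmspumptools.com/innovation.php", "http://www.rmspumptools.com/innovation.php"),
      Entry("slaperoo.com/", "http://www.slaperoo.com/"),
      Entry("speckip.com/", "http://speckip.com/")
    )

    matchingPrefix(fromFirstMatch, "rmspumptools.com/innovation.php")
      .foreach(e => println(e.originalUrl))
  }
}
```

Against the real index this would presumably become something like 
index.getPrefixIterator(key).asScala.takeWhile(_.getUrlKey.startsWith(key)), 
assuming getUrlKey on CaptureSearchResult returns the canonicalized key. 
Since takeWhile stops early, the underlying iterator may also need to be 
closed explicitly.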


On Monday, February 13, 2017 at 11:40:57 AM UTC+1, andrew.jackson wrote:
>
> I believe the idea is to allow you to use the same API to perform other 
> queries, such as listing all URIs that start with a given prefix (e.g. all 
> URIs for a host).
>
> If you only want the list that matches a specific URI, you should stop 
> pulling results from the iterator when the URI is no longer the one you 
> want.
>
> HTH,
> Andy Jackson
>
> On Wednesday, 8 February 2017 17:20:16 UTC, David Portabella wrote:
>>
>> Using CDXFormatIndex.getPrefixIterator(prefix),
>>
>> I would expect to get only the entries that match this prefix.
>>
>> Instead, it finds the first entry matching this prefix, and then it returns 
>> all entries from that point until the end of the archive.
>>
>> so, it returns entries that do not match the prefix.
>>
>>
>> Why?
>>
>> How can I get *only* the entries that match the prefix?
>>
>>
>>
>> Example (written in Scala)
>>
>>
>> package application
>>
>> import org.archive.wayback.core.CaptureSearchResult
>> import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
>> import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
>>
>> import scala.collection.JavaConverters._
>>
>> object ResponseWarcReaderExample {
>>   def main(args: Array[String]): Unit = {
>>     val index = new CDXFormatIndex()
>>     index.setPath("/dataset/files.warc.cdx")
>>
>>     val key = canonicalize("http://www.rmspumptools.com/innovation.php")
>>     // Print file, offset and original URL of every entry the iterator yields.
>>     index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
>>       println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
>>     }
>>   }
>>
>>   val canonicalizer = new AggressiveUrlCanonicalizer()
>>   def canonicalize(url: String): String =
>>     canonicalizer.urlStringToKey(url)
>> }
>>
>>
>> The output is as follows:
>> files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
>> files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
>> files.warc.gz:11538: http://www.slaperoo.com/
>> files.warc.gz:1268086: https://www.smarttech.com/patents
>> files.warc.gz:826021: http://speckip.com/
>> ...
>>
>>
>>
>>
>>
>> Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls 
>> FlatFile.getRecordIterator(final String prefix),
>>
>> which returns an input stream starting at the offset of the first entry 
>> matching the prefix; it then reads everything until the end of the 
>> archive.
>>
>>   RandomAccessFile raf = new RandomAccessFile(file,"r");
>>
>>   findKeyOffset(raf, prefix);
>>
>>
>>
>>
>>

