[openwayback-dev] Re: CDXFormatIndex.getPrefixIterator(prefix) returns entries that do not match the prefix

David Portabella Mon, 13 Feb 2017 08:59:53 -0800

I see.

This is my current workaround:


  def filterByPrefix(prefix: String): Iterator[CaptureSearchResult] = {
    val key = canonicalize(prefix)
    getIndex.getPrefixIterator(key).asScala.takeWhile(_.getUrlKey.startsWith(key))
  }

  def filterByUrl(url: String): Iterator[CaptureSearchResult] = {
    val key = canonicalize(url)
    getIndex.getUrlIterator(key).asScala.takeWhile(key == _.getUrlKey)
  }

  val canonicalizer = new AggressiveUrlCanonicalizer()
  def canonicalize(url: String): String =
    canonicalizer.urlStringToKey(url)
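
To show why takeWhile is enough, here is a minimal self-contained sketch. It 
assumes the index is sorted by URL key, so that all matching entries are 
contiguous. The names PrefixScanSketch, sortedKeys, seekFrom and withPrefix are 
hypothetical and not part of OpenWayback; a plain Vector of strings stands in 
for the CDX file.

```scala
object PrefixScanSketch {
  // Hypothetical stand-in for a sorted CDX index (real entries are CaptureSearchResult objects).
  val sortedKeys: Vector[String] = Vector(
    "example.com/a",
    "rmspumptools.com/innovation.php",
    "sjm.com/en",
    "slaperoo.com/"
  )

  // Mimics the behaviour described in this thread: seek to the first key
  // that is >= the prefix, then stream everything up to the end.
  def seekFrom(prefix: String): Iterator[String] =
    sortedKeys.iterator.dropWhile(_.compareTo(prefix) < 0)

  // Client-side bound: stop at the first key that no longer starts with the prefix.
  // This is only correct because the keys are sorted.
  def withPrefix(prefix: String): Iterator[String] =
    seekFrom(prefix).takeWhile(_.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    println(seekFrom("rmspumptools.com/").toList)
    println(withPrefix("rmspumptools.com/").toList)
  }
}
```

takeWhile (rather than filter) matters here: it stops reading at the first 
non-matching key instead of scanning to the end of the archive.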


What is also "surprising" is that calling getPrefixIterator a second time 
throws an Exception.
You need to close the CDXFormatIndex and open a new one.


Cheers,
David


On Monday, February 13, 2017 at 5:41:03 PM UTC+1, andrew.jackson wrote:
>
> This admittedly surprising behaviour appears to be intended: 
> https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java#L39-L41
>
> e.g. here's an example of the match being tested in the client rather than 
> in the iterator: 
> https://github.com/iipc/openwayback/blob/6475121cef79240b5e18a5f2c224ff9ba933b43d/wayback-core/src/main/java/org/archive/wayback/resourceindex/adapters/CaptureToUrlSearchResultIterator.java#L79
>
> I'm a bit wary of changing FlatFile.java as this is used in a number of 
> places that may depend on this behaviour. However, it would seem reasonable 
> to modify the CDXIndex or CDXFormatIndex child classes to behave in a less 
> surprising fashion.
>
>
> Best,
> Andy
>
>
> On Monday, 13 February 2017 15:43:26 UTC, David Portabella wrote:
>>
>> The API contains these two functions: getPrefixIterator and 
>> getUrlIterator.
>>
>> With getPrefixIterator, I would expect to get all entries that match 
>> a given prefix. With getUrlIterator, I would expect to get all entries 
>> that match a url. The problem is that getPrefixIterator returns the 
>> first entry that matches the given prefix, and then all the following 
>> entries until the end of the archive (which do not match the prefix). The 
>> same goes for getUrlIterator: it returns the first entry that matches the 
>> full url, and then all the following entries until the end of the archive.
>>
>> I think that the example is pretty clear on this. I asked for all the 
>> entries that match the prefix rmspumptools.com/innovation.php 
>> and I get the entry http://www.rmspumptools.com/innovation.php 
>> (this is correct), but also 
>> https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents, 
>> and http://www.slaperoo.com/ and ... until the end of the archive (which 
>> do not match the prefix).
>>
>>
>> On Monday, February 13, 2017 at 11:40:57 AM UTC+1, andrew.jackson wrote:
>>>
>>> I believe the idea is to allow you to use the same API to perform other 
>>> queries, such as listing all URIs that start with a given prefix (e.g. all 
>>> URIs for a host).
>>>
>>> If you only want the list that matches a specific URI, you should stop 
>>> pulling results from the iterator when the URI is no longer the one you 
>>> want.
>>>
>>> HTH,
>>> Andy Jackson
>>>
>>> On Wednesday, 8 February 2017 17:20:16 UTC, David Portabella wrote:
>>>>
>>>> Using CDXFormatIndex.getPrefixIterator(prefix),
>>>>
>>>> I would expect to get only the entries that match this prefix.
>>>>
>>>> Instead, it finds the first entry matching this prefix, and then it 
>>>> returns all entries from that point until the end of the archive.
>>>>
>>>> so, it returns entries that do not match the prefix.
>>>>
>>>>
>>>> Why?
>>>>
>>>> How to *only* get the entries that match the prefix?
>>>>
>>>>
>>>>
>>>> Example (written in Scala)
>>>>
>>>>
>>>> package application
>>>>
>>>> import org.archive.wayback.core.CaptureSearchResult
>>>> import org.archive.wayback.resourceindex.cdx.CDXFormatIndex
>>>> import org.archive.wayback.util.url.AggressiveUrlCanonicalizer
>>>>
>>>> import scala.collection.JavaConverters._
>>>>
>>>> object ResponseWarcReaderExample {
>>>>   def main(args: Array[String]): Unit = {
>>>>     val index = new CDXFormatIndex()
>>>>     index.setPath("/dataset/files.warc.cdx")
>>>>
>>>>     val key = canonicalize("http://www.rmspumptools.com/innovation.php")
>>>>     // Prints every entry from the first match to the end of the index
>>>>     index.getPrefixIterator(key).asScala.foreach { (r: CaptureSearchResult) =>
>>>>       println(s"${r.getFile}:${r.getOffset}: ${r.getOriginalUrl}")
>>>>     }
>>>>   }
>>>>
>>>>   val canonicalizer = new AggressiveUrlCanonicalizer()
>>>>   def canonicalize(url: String): String =
>>>>     canonicalizer.urlStringToKey(url)
>>>> }
>>>>
>>>>
>>>> The output is as follows:
>>>> files.warc.gz:1053529: http://www.rmspumptools.com/innovation.php
>>>> files.warc.gz:1181319: https://www.sjm.com/en/legal-notices-patents/patents/cardiac-rhythm-management-patents
>>>> files.warc.gz:11538: http://www.slaperoo.com/
>>>> files.warc.gz:1268086: https://www.smarttech.com/patents
>>>> files.warc.gz:826021: http://speckip.com/
>>>> ...
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Note that CDXFormatIndex/CDXIndex.getPrefixIterator calls 
>>>> FlatFile.getRecordIterator(final String prefix),
>>>>
>>>> which returns an input stream starting at the offset of the first entry 
>>>> matching the prefix, and then reads everything until the end of the 
>>>> archive.
>>>>
>>>>   RandomAccessFile raf = new RandomAccessFile(file,"r");
>>>>
>>>>   findKeyOffset(raf, prefix);
>>>>
>>>>
>>>>
>>>>
>>>>

