[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920609#action_12920609
 ] 

Karl Wright commented on CONNECTORS-118:


The key question here is how you describe the component of an archive.  There 
must be a URL to describe it, or there is no way the search results are going 
to mean anything.

Since URLs are the connector's job to assemble, this is likely to be 
connector-specific.  Also, most connectors will never be dealing with archives.  
Can you provide a list of connectors where you believe this is important, and 
what the URLs to get at the subpieces of the archive look like?


 Crawled archive files should be expanded into their constituent files
 ----------------------------------------------------------------------

 Key: CONNECTORS-118
 URL: https://issues.apache.org/jira/browse/CONNECTORS-118
 Project: ManifoldCF
 Issue Type: New Feature
 Components: Framework crawler agent
 Reporter: Jack Krupansky

 Archive files such as zip, mbox, tar, etc. should be expanded into their 
 constituent files during crawling of repositories so that any output 
 connector would output the flattened archive.
 This could be an option, defaulted to ON, since someone may want to implement 
 a copy connector that maintains crawled files as-is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920705#action_12920705
 ] 

Karl Wright commented on CONNECTORS-118:


So this scheme is specific to Apache VFS.  What connectors are used to crawl 
Apache VFS file systems?  Just the file system connector, no?





[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920730#action_12920730
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Support within the file system connector is obviously the top priority. 
Windows shares matter as well, and so does FTP/SFTP.





[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920781#action_12920781
 ] 

Karl Wright commented on CONNECTORS-118:


> So, if somebody wants to de-reference one of these pseudo URLS they must:

Ah.  So what you are saying is that the person must either be running a custom 
browser, or must do some kind of URL manipulation before the search results 
would be presented to the user, or - or what, exactly?  If the URL is in fact 
meant to be real, then it should refer to a custom proxy of some kind that 
would perform the necessary breakdown.  If there is no such service or proxy, 
those URLs will simply be broken.  This represents a major violation of the 
contract for URL generation within ManifoldCF connectors.

If there is no such proxy that you are aware of, then I'd much rather generate 
a real URL, which in its raw form would not send you to anything other than the 
archive itself, but which has enough information to be interpreted properly, by 
using the anchor trick I alluded to earlier.  If there *is* such a proxy, then 
that proxy's parameters must be added as part of the repository connection 
configuration.  The only case in which the solution you suggest is valid is if 
you are working on a file system where, when you go to your browser, you enter 
bz://... for the URL, and it actually does the unpacking for you.  That would 
*not* include CIFS, by the way.

Is this a fair statement of your proposal?  Or am I missing something?
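
To make the anchor idea concrete, here is a minimal sketch of what I have in 
mind.  It is an illustration only, not an existing ManifoldCF API: the class 
name ArchiveEntryUrl and its methods are hypothetical, and I am assuming the 
entry's path inside the archive is carried in the URL fragment.  The raw URL 
still resolves to the archive itself, while a client that understands the 
convention can recover the entry path.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: not part of ManifoldCF.
final class ArchiveEntryUrl {
    // Build a URL that points at the archive and names the entry in the fragment.
    // The URI constructor percent-encodes characters that are illegal in a fragment.
    static URI forEntry(URI archiveUrl, String entryPath) throws URISyntaxException {
        return new URI(archiveUrl.getScheme(),
                       archiveUrl.getSchemeSpecificPart(),
                       entryPath);
    }

    // Recover the (decoded) path of the entry inside the archive.
    static String entryOf(URI url) {
        return url.getFragment();
    }

    // Recover the URL of the archive itself by dropping the fragment.
    static URI archiveOf(URI url) throws URISyntaxException {
        return new URI(url.getScheme(), url.getSchemeSpecificPart(), null);
    }
}
```

A browser following such a URL would simply fetch the archive and ignore the 
fragment, so the link is never broken; anything smarter is left to the client.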




[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920787#action_12920787
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

Aperture's approach was just a starting point for discussion on how to form an 
id for a file in an archive file. As long as the MCF rules are functionally 
equivalent to the Apache VFS rules, we should be okay.

In short, my proposal does not have a requirement for what an id should look 
like, just a suggestion.





[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920801#action_12920801
 ] 

Jack Krupansky commented on CONNECTORS-118:
---

One of those VFS links points to all the Java packages used to access the list 
of archive formats I listed. I have personally written unit tests that 
generated archives in most of those formats, which Aperture then extracted.





[jira] Commented: (CONNECTORS-118) Crawled archive files should be expanded into their constituent files

2010-10-13 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920827#action_12920827
 ] 

Karl Wright commented on CONNECTORS-118:


Agreed, file system is quite straightforward, although CIFS may be a bit more 
challenging depending on whether the archive processing code accepts an 
InputStream as input.  If so, there would be no need to make a secondary copy 
in either case.
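
As a rough illustration of the InputStream case, here is a sketch for zip only, 
using the JDK's java.util.zip.ZipInputStream.  The class name 
StreamingZipExpander is hypothetical and this is not connector code; it also 
assumes entries are small enough to buffer in memory, which a real connector 
would probably not want.  The point is only that the archive is read straight 
from whatever stream the repository hands back, with no temporary file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: not part of ManifoldCF.
final class StreamingZipExpander {
    // Read a zip archive from a stream and return entry name -> entry bytes,
    // in archive order.  Directory entries are skipped.
    static Map<String, byte[]> expand(InputStream archive) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // read() returns -1 at the end of the current entry, not the archive.
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                files.put(entry.getName(), content.toByteArray());
            }
        }
        return files;
    }
}
```

Whether the other archive formats on the list can be read this way depends on 
the library used for each of them.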

