Re: Revisiting: Should Manifold include Pipelines

2012-01-10 Thread Mark Bennett
Hi Karl,

I wanted to acknowledge and thank you for your 2 emails.

I need to think a bit.  I *do* have answers to some of your concerns, and
hopefully reasonable-sounding ones at that.

Also, maybe I should take another look at Nutch - BUT Manifold's Web UI is
so much further along, and more in line with the type of admin "view" of
what's going on, that I had given up on Nutch for a bit.  I have some other
thoughts about Nutch but won't go into them here.

Also, to be clear, I in no way meant to even imply you had any other
motives for having materials in the book.  You've demonstrated, time and
again, that you sincerely want to share MCF, and info about it, with the
whole world!

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Tue, Jan 10, 2012 at 12:27 AM, Karl Wright wrote:

> you wanted a connection to be a pipeline component rather than what it
> is today.
>


[jira] [Resolved] (CONNECTORS-372) WebCrawler connector Japanese message properties file is not fully translated

2012-01-10 Thread Karl Wright (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-372.


Resolution: Fixed

r1229540


> WebCrawler connector Japanese message properties file is not fully translated
> -
>
> Key: CONNECTORS-372
> URL: https://issues.apache.org/jira/browse/CONNECTORS-372
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Web connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>Priority: Minor
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-372.patch
>
>
> WebCrawler connector Japanese message files is not fully translated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CONNECTORS-371) LiveLink connector should have Japanese message properties file

2012-01-10 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183281#comment-13183281
 ] 

Karl Wright commented on CONNECTORS-371:


This patch did not apply cleanly after the patch for CONNECTORS-363 was applied.

Could you sync up, and regenerate the patch?

Thanks!


> LiveLink connector should have Japanese message properties file
> ---
>
> Key: CONNECTORS-371
> URL: https://issues.apache.org/jira/browse/CONNECTORS-371
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: LiveLink connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-371.patch
>
>
> LiveLink connector's Japanese message properties file is not fully translated.





[jira] [Resolved] (CONNECTORS-363) LiveLink connector should be fully I18N

2012-01-10 Thread Karl Wright (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-363.


Resolution: Fixed

> LiveLink connector should be fully I18N
> ---
>
> Key: CONNECTORS-363
> URL: https://issues.apache.org/jira/browse/CONNECTORS-363
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: LiveLink connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>Priority: Minor
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-363.patch, CONNECTORS-363.patch
>
>
> Should extract out all messages to properties file.





[jira] [Commented] (CONNECTORS-363) LiveLink connector should be fully I18N

2012-01-10 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183277#comment-13183277
 ] 

Karl Wright commented on CONNECTORS-363:


r1229538 to put in context modifiers to appropriate getString method calls.


> LiveLink connector should be fully I18N
> ---
>
> Key: CONNECTORS-363
> URL: https://issues.apache.org/jira/browse/CONNECTORS-363
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: LiveLink connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>Priority: Minor
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-363.patch, CONNECTORS-363.patch
>
>
> Should extract out all messages to properties file.





[jira] [Commented] (CONNECTORS-363) LiveLink connector should be fully I18N

2012-01-10 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183268#comment-13183268
 ] 

Karl Wright commented on CONNECTORS-363:


r1229537 to fix the compilation errors in the second patch.


> LiveLink connector should be fully I18N
> ---
>
> Key: CONNECTORS-363
> URL: https://issues.apache.org/jira/browse/CONNECTORS-363
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: LiveLink connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>Priority: Minor
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-363.patch, CONNECTORS-363.patch
>
>
> Should extract out all messages to properties file.





[jira] [Commented] (CONNECTORS-363) LiveLink connector should be fully I18N

2012-01-10 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183264#comment-13183264
 ] 

Karl Wright commented on CONNECTORS-363:


r1229534 for second patch.


> LiveLink connector should be fully I18N
> ---
>
> Key: CONNECTORS-363
> URL: https://issues.apache.org/jira/browse/CONNECTORS-363
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: LiveLink connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>Priority: Minor
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-363.patch, CONNECTORS-363.patch
>
>
> Should extract out all messages to properties file.





[jira] [Updated] (CONNECTORS-371) LiveLink connector should have Japanese message properties file

2012-01-10 Thread Karl Wright (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-371:
---

Component/s: (was: Lucene/SOLR connector)
 LiveLink connector
   Assignee: Karl Wright

> LiveLink connector should have Japanese message properties file
> ---
>
> Key: CONNECTORS-371
> URL: https://issues.apache.org/jira/browse/CONNECTORS-371
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: LiveLink connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-371.patch
>
>
> LiveLink connector's Japanese message properties file is not fully translated.





[jira] [Assigned] (CONNECTORS-372) WebCrawler connector Japanese message properties file is not fully translated

2012-01-10 Thread Karl Wright (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-372:
--

Assignee: Karl Wright

> WebCrawler connector Japanese message properties file is not fully translated
> -
>
> Key: CONNECTORS-372
> URL: https://issues.apache.org/jira/browse/CONNECTORS-372
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Web connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>Priority: Minor
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-372.patch
>
>
> WebCrawler connector Japanese message files is not fully translated.





[jira] [Updated] (CONNECTORS-372) WebCrawler connector Japanese message properties file is not fully translated

2012-01-10 Thread Hitoshi Ozawa (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitoshi Ozawa updated CONNECTORS-372:
-

Attachment: CONNECTORS-372.patch

> WebCrawler connector Japanese message properties file is not fully translated
> -
>
> Key: CONNECTORS-372
> URL: https://issues.apache.org/jira/browse/CONNECTORS-372
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Web connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Priority: Minor
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-372.patch
>
>
> WebCrawler connector Japanese message files is not fully translated.





[jira] [Created] (CONNECTORS-372) WebCrawler connector Japanese message properties file is not fully translated

2012-01-10 Thread Hitoshi Ozawa (Created) (JIRA)
WebCrawler connector Japanese message properties file is not fully translated
-

 Key: CONNECTORS-372
 URL: https://issues.apache.org/jira/browse/CONNECTORS-372
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Web connector
Affects Versions: ManifoldCF 0.5
Reporter: Hitoshi Ozawa
Priority: Minor
 Fix For: ManifoldCF 0.5


WebCrawler connector Japanese message files is not fully translated.





[jira] [Updated] (CONNECTORS-371) LiveLink connector should have Japanese message properties file

2012-01-10 Thread Hitoshi Ozawa (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitoshi Ozawa updated CONNECTORS-371:
-

Attachment: CONNECTORS-371.patch

Translated message properties file to Japanese

> LiveLink connector should have Japanese message properties file
> ---
>
> Key: CONNECTORS-371
> URL: https://issues.apache.org/jira/browse/CONNECTORS-371
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-371.patch
>
>
> LiveLink connector's Japanese message properties file is not fully translated.





[jira] [Created] (CONNECTORS-371) LiveLink connector should have Japanese message properties file

2012-01-10 Thread Hitoshi Ozawa (Created) (JIRA)
LiveLink connector should have Japanese message properties file
---

 Key: CONNECTORS-371
 URL: https://issues.apache.org/jira/browse/CONNECTORS-371
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Lucene/SOLR connector
Affects Versions: ManifoldCF 0.5
Reporter: Hitoshi Ozawa
 Fix For: ManifoldCF 0.5


LiveLink connector's Japanese message properties file is not fully translated.





[jira] [Updated] (CONNECTORS-363) LiveLink connector should be fully I18N

2012-01-10 Thread Hitoshi Ozawa (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitoshi Ozawa updated CONNECTORS-363:
-

Attachment: CONNECTORS-363.patch

Extracted messages from LiveLinkConnector.java

> LiveLink connector should be fully I18N
> ---
>
> Key: CONNECTORS-363
> URL: https://issues.apache.org/jira/browse/CONNECTORS-363
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: LiveLink connector
>Affects Versions: ManifoldCF 0.5
>Reporter: Hitoshi Ozawa
>Assignee: Karl Wright
>Priority: Minor
>  Labels: I18N
> Fix For: ManifoldCF 0.5
>
> Attachments: CONNECTORS-363.patch, CONNECTORS-363.patch
>
>
> Should extract out all messages to properties file.





Re: Revisiting: Should Manifold include Pipelines

2012-01-10 Thread Karl Wright
As an exercise in understanding, it might be helpful to consider how
exactly a document specification in today's ManifoldCF would morph if
you wanted a connection to be a pipeline component rather than what it
is today.

Right now, the document specification for a job is an XML doc of a
form that only the underlying connector understands, which specifies
the following kinds of information:

- What documents to include in the crawl (which is meaningful only in
the context of an existing underlying connection);
- What parts of those documents to index (e.g. what metadata is included).

The information is used in several places during the crawl:

- At the time seeding is done (the initial documents)
- When a decision is being made to include a document on the queue
- Before a document is going to be fetched
- In order to set up the document for indexing
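
The four points above can be sketched in Java as follows. This is only an
illustration under stated assumptions: DocumentSpec, CrawlSketch, and every
method name here are hypothetical and are not part of the ManifoldCF API, and
the real specification is connector-specific XML rather than two lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a job's document specification.
// ManifoldCF stores this as connector-specific XML; plain lists are used here for brevity.
final class DocumentSpec {
    final List<String> includePrefixes; // which documents belong to the crawl
    final List<String> metadataFields;  // which metadata to index

    DocumentSpec(List<String> includePrefixes, List<String> metadataFields) {
        this.includePrefixes = includePrefixes;
        this.metadataFields = metadataFields;
    }
}

final class CrawlSketch {
    // 1. Seeding: the initial documents come straight from the specification.
    static List<String> seed(DocumentSpec spec) {
        return new ArrayList<>(spec.includePrefixes);
    }

    // 2. Queueing decision: a discovered document is added only if the spec includes it.
    static boolean shouldEnqueue(DocumentSpec spec, String documentId) {
        for (String prefix : spec.includePrefixes) {
            if (documentId.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // 3. Before fetching: consult the spec again (assumed here to use the same rule as queueing).
    static boolean shouldFetch(DocumentSpec spec, String documentId) {
        return shouldEnqueue(spec, documentId);
    }

    // 4. Indexing setup: keep only the metadata fields the spec asks for.
    static Map<String, String> prepareForIndexing(DocumentSpec spec, Map<String, String> rawMetadata) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String field : spec.metadataFields) {
            if (rawMetadata.containsKey(field)) {
                kept.put(field, rawMetadata.get(field));
            }
        }
        return kept;
    }
}
```

The point of the sketch is that one specification object is read at all four
stages, which is why only the connector that wrote it can interpret it.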

The repository connector allows you to edit the document specification
in the Crawler UI.  This is done by the repository connector
contributing tabs to the job.

Now, in order for a pipeline to work, most of the activities of the
connector will need to be broken out into separate pipeline tasks.
For instance, "seeding" would be a different task from "filtering"
which would be different from "enqueuing" which would be different
from "obtaining security info".  I would expect that each pipeline
step would have its own UI, so if you were using Connection X to seed,
then you would want to specify what documents to seed in the UI for
that step, in a manner consistent with the underlying connection.
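
A minimal Java sketch of what "separate pipeline tasks" might look like is
given here. All names (Candidate, PipelineStep, PipelineSketch) are
hypothetical illustrations and do not exist in ManifoldCF; the suffix filter
and the single access token are placeholders for whatever a real connection
would configure in its per-step UI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical document record passed between steps.
final class Candidate {
    final String id;
    boolean enqueued = false;
    final List<String> accessTokens = new ArrayList<>();

    Candidate(String id) {
        this.id = id;
    }
}

// One pipeline task. Each implementation would carry its own configuration (and its own UI).
interface PipelineStep {
    List<Candidate> apply(List<Candidate> input);
}

final class PipelineSketch {
    // "Seeding": adds the starting documents configured for this step.
    static PipelineStep seeding(String... seedIds) {
        return input -> {
            List<Candidate> out = new ArrayList<>(input);
            for (String id : seedIds) {
                out.add(new Candidate(id));
            }
            return out;
        };
    }

    // "Filtering": drops documents that do not match this step's rule.
    static PipelineStep filtering(String requiredSuffix) {
        return input -> {
            List<Candidate> out = new ArrayList<>();
            for (Candidate c : input) {
                if (c.id.endsWith(requiredSuffix)) {
                    out.add(c);
                }
            }
            return out;
        };
    }

    // "Enqueuing": marks the surviving documents as queued for fetching.
    static PipelineStep enqueuing() {
        return input -> {
            for (Candidate c : input) {
                c.enqueued = true;
            }
            return input;
        };
    }

    // "Obtaining security info": attaches an access token to each document.
    static PipelineStep securityInfo(String token) {
        return input -> {
            for (Candidate c : input) {
                c.accessTokens.add(token);
            }
            return input;
        };
    }

    // Runs the steps in order, feeding each step's output to the next.
    static List<Candidate> run(List<PipelineStep> steps) {
        List<Candidate> current = new ArrayList<>();
        for (PipelineStep step : steps) {
            current = step.apply(current);
        }
        return current;
    }
}
```

A job would then be a list such as seeding, filtering, enqueuing, and
securityInfo run in that order, with each step configured separately.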

So the connector would need to break up its document specification
into multiple pieces, e.g. a "seeding document specification" with a
seeding document specification UI.  There would be a corresponding
specification and UI for "connector document filtering" and for
"connector document enqueuing".  I suspect there would be a lot of
duplication and overlap too, which would be hard to avoid.
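
The duplication concern can be illustrated with a small Java sketch. The
step names and setting names below (startPaths, includeRules, and so on) are
invented for the example and are not taken from any real connector; the code
only shows how one might detect settings that several per-step
specifications would each have to carry.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SplitSpecSketch {
    // Given the settings each step's specification needs (step name -> setting names),
    // returns the settings that appear in more than one step, i.e. the duplication.
    static Set<String> duplicatedSettings(Map<String, Set<String>> settingsByStep) {
        Set<String> seen = new TreeSet<>();
        Set<String> duplicated = new TreeSet<>();
        for (Set<String> settings : settingsByStep.values()) {
            for (String setting : settings) {
                // add() returns false when an earlier step already needed this setting.
                if (!seen.add(setting)) {
                    duplicated.add(setting);
                }
            }
        }
        return duplicated;
    }

    // Hypothetical split of one connector's specification into per-step pieces.
    static Map<String, Set<String>> exampleSplit() {
        Map<String, Set<String>> byStep = new LinkedHashMap<>();
        byStep.put("seeding", new TreeSet<>(Arrays.asList("startPaths", "includeRules")));
        byStep.put("filtering", new TreeSet<>(Arrays.asList("includeRules", "excludeRules")));
        byStep.put("enqueuing", new TreeSet<>(Arrays.asList("includeRules", "maxDepth")));
        return byStep;
    }
}
```

In this invented split, the include rules would have to be entered, or at
least kept in sync, in all three step specifications.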

The end result of this exercise would be something that would allow
more flexibility, at the expense of ease of use.

Karl

On Tue, Jan 10, 2012 at 2:49 AM, Karl Wright wrote:
> Hi Mark,
>
> Please see below.
>
> On Mon, Jan 9, 2012 at 9:53 PM, Mark Bennett wrote:
>> Hi Karl,
>>
>> Thanks for the reply, most comments inline.
>>
>> General comments:
>>
>> I was wondering if you've used a custom pipeline like FAST ESP or
>> Ultraseek's old "patches.py", and if there were any that you liked or
>> disliked?  In more recent times the OpenPipeline effort has been a bit
>> nascent, I think in part because it lacks some of the connectors.  Coming from
>> my background I'm probably a bit biased to thinking of problems in terms of
>> a pipeline, and it's also a frequent discussion with some of our more
>> challenging clients.
>>
>> Generally speaking we define the virtual document to be the basic unit of
>> retrieval, and it doesn't really matter whether it starts life as a Web
>> Page or PDF or Outlook node.  Most "documents" have a create / modified
>> date, some type of title, and a few other semi-common meta data fields.
>> They do vary by source, but there are mapping techniques.
>>
>> Having more connector services, or even just more examples, is certainly a
>> step in the right direction.
>>
>> But leaving it at writing custom monolithic connectors has a few
>> disadvantages:
>> - Not as modular, so discourages code reuse
>> - Maintains 100% coding, vs. some mix of configure vs. code
>> - Keeps the bar at rather advanced Java programmers, vs. opening up to
>> folks that feel more comfortable with "scripting" (of a sort, not
>> suggesting a full language)
>> - I think folks tend to share more when using "configurable" systems,
>> though I have no proof.  It might just be the larger number of people.
>> - Sort of the "blank canvas syndrome" as each person tries to grasp all the
>> nuances; granted, the one I'm suggesting merely presents a smaller blank canvas,
>> but maybe with crayons and connect the dots, vs. oil paints.
>>
>
> It sounds to me like what you are proposing is a reorganization of the
> architecture of ManifoldCF so that documents that are fetched by
> repository connectors are only obliquely related to documents indexed
> through an output connector.  You are proposing that an indexed
> document be possibly assembled from multiple connector sources, but
> with arbitrary manipulation of the document content along the way.  Is
> this correct?
>
> If so, how would you handle document security?  Each repository
> connection today specifies the security context for the documents it
> fetches.  It also knows about relationships between those documents
> that come from the same connector, and about document versioning for
> documents fetched from that source.  How does this translate into a
> pipelined world in your view?  Is the security of the final indexed
> document the intersection of the security for all the sources of the
> indexed document?  Is the version of the indexed document