Thanks again.
I just ran an example setup to better understand what you said.
As you said, the web page URL gets set to the _id field.
The metadata that is sent to Elastic Search is as follows:
header-Content-Type: "text/html; charset=UTF-8"
header-Content-Length: "3278"
header-Keep-Alive: "timeout=5, max=100"
header-Server: "Apache/2.2"
header-Connection: "Keep-Alive"
type: "attachment"
file: ...
The file field looks to be base64 encoded. Is this always the case, or is
it specific to the web repository connector combined with the Elasticsearch
output connector? The decoded value must be the web page itself. I'm guessing
the header-Content-Type field, rather than the type field, holds the
document's content type.
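
To check the base64 guess by hand, here is a minimal Python sketch. It assumes
the file field is standard base64 and that the charset can be read from the
header-Content-Type field. The function names and the sample document are made
up for illustration and are not part of either connector.

```python
import base64


def charset_of(content_type, default="utf-8"):
    """Pull the charset parameter out of a Content-Type value, if present."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip('"')
    return default


def decode_file_field(source):
    """Decode the base64 'file' field of an indexed document back to text.

    Assumption: 'file' holds standard base64 of the raw page bytes.
    """
    raw = base64.b64decode(source["file"])
    charset = charset_of(source.get("header-Content-Type", ""))
    return raw.decode(charset, errors="replace")


# Hypothetical document, shaped like the metadata listed above.
sample_html = "<html><body>Hello</body></html>"
doc = {
    "header-Content-Type": "text/html; charset=UTF-8",
    "type": "attachment",
    "file": base64.b64encode(sample_html.encode("utf-8")).decode("ascii"),
}

print(decode_file_field(doc))
```

If the decoded text is the HTML of the crawled page, that would support the
guess that the file field carries the page body.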
On Mon, Feb 11, 2013 at 1:17 PM, Karl Wright <[email protected]> wrote:
> What emerges from the web connector is the following:
>
> - metadata, which you define on the web connector’s “Metadata” tab,
> that are named however you want;
> - forced acls, which get added to the document based on what you
> select on the “Security” tab;
> - the document’s content type;
> - the document’s url;
> - the document itself.
>
> What the elastic search connector does is:
> - Map the document’s url to ElasticSearch’s document id field (which I
> guess shows up in Elastic Search as the ‘uri’ field)
> - Output all the metadata directly to ElasticSearch using the name
> provided by the repository connector
> - Set the file value to “” (which seems wrong, since that could be
> helpful if available - let me know if you think a fix for this would
> be useful)
> - NONE of the rest of the document fields (content type, acls, etc)
> are communicated to Elastic Search at all right now, except for the
> document itself.
>
> Karl
>
>
> On Mon, Feb 11, 2013 at 2:55 PM, Tony Edgin <[email protected]> wrote:
> > Thanks for the speedy response!
> >
> > I eventually want to index the contents of our local website with Elastic
> > Search.
> >
> > I would use the Web repository connector with the no authority connector and
> > the Elasticsearch output connector. Would you mind letting me know the
> > names and meanings of the metadata that gets passed to Elastic Search?
> >
> > Thanks again.
> >
> >
> > On Mon, Feb 11, 2013 at 12:45 PM, Karl Wright <[email protected]> wrote:
> >>
> >> So let me get this clear - you are looking to find out what the
> >> names/meanings are of the metadata that gets passed to the output
> >> connector, for a given repository connection?
> >>
> >> If this is what you are looking for, I'm afraid that while at one
> >> point the end-user documentation described this pretty accurately, it
> >> is now significantly out of date. While it's not terribly hard to
> >> compile this information from source code etc., the work definitely
> >> needs to be repeated by somebody.
> >>
> >> If you want to ask this question about a specific connector, I can
> >> certainly try to answer it, though. If you want to contribute either
> >> the information or a documentation patch, this would be great too.
> >>
> >> Karl
> >>
> >> On Mon, Feb 11, 2013 at 2:38 PM, Tony Edgin <[email protected]> wrote:
> >> > I'm sure this is documented somewhere, and I apologize in advance for not
> >> > being able to find it.
> >> >
> >> > How do I determine the model or schema of the document passed to the search
> >> > engine by a given job?
> >> >
> >> > For instance, I'm running a job that crawls a directory on my local file
> >> > system and passes it to Elastic Search. Interrogating Elastic Search, I can
> >> > determine that the document has three fields, "file", "type" and "uri", all
> >> > strings. How would I have known that in advance?
> >> >
> >> > Thanks for any help.
> >
> >
>