Re: Determining document model passed to search engine

Tony Edgin Mon, 11 Feb 2013 13:01:05 -0800

Thanks again.

I just ran an example setup to better understand what you said.


As you said, the web page URL gets set to the _id field.
The metadata that is sent to Elastic Search is as follows:

      header-Content-Type: "text/html; charset=UTF-8"
      header-Content-Length: "3278"
      header-Keep-Alive: "timeout=5, max=100"
      header-Server: "Apache/2.2"
      header-Connection: "Keep-Alive"
      type: "attachment"
      file: ...

The file field looks to be base64 encoded.  Is this always the case, or is
it specific to the Web repository connector combined with the Elasticsearch
output connector?

This must be the web page itself. I'm guessing the header-Content-Type field,
rather than the type field, holds the document's content type.
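
For anyone following along, here is roughly how I would check that guess
from Python. It is only a sketch: it assumes the file field really is
base64, and that header-Content-Type carries the charset. The field names
are the ones from my test run above; the sample URL, the sample page and
the decode_file helper are made up for illustration.

```python
import base64

# Hypothetical document shaped like the one from my test run.
# Field names are from the output above; the values are invented.
doc = {
    "_id": "http://example.com/index.html",
    "_source": {
        "header-Content-Type": "text/html; charset=UTF-8",
        "type": "attachment",
        "file": base64.b64encode(
            "<html><body>Hello</body></html>".encode("utf-8")
        ).decode("ascii"),
    },
}


def decode_file(source):
    """Return (mime_type, text) for a document source dict.

    Assumes 'file' is base64 and that 'header-Content-Type' is a normal
    HTTP Content-Type value; falls back to UTF-8 if no charset is given.
    """
    content_type = source.get("header-Content-Type", "application/octet-stream")
    mime_type, _, params = content_type.partition(";")
    charset = "utf-8"
    for param in params.split(";"):
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            charset = value.strip('"')
    raw = base64.b64decode(source["file"])
    return mime_type.strip(), raw.decode(charset, errors="replace")


mime_type, text = decode_file(doc["_source"])
print(mime_type)
print(text)
```

If the decoded text comes back as the page's HTML, then the file field is
the page body and header-Content-Type is the place to look for its type.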





On Mon, Feb 11, 2013 at 1:17 PM, Karl Wright <[email protected]> wrote:

> What emerges from the web connector is the following:
>
> -       metadata, which you define on the web connector’s “Metadata” tab,
> that are named however you want;
> -       forced acls, which get added to the document based on what you
> select on the “Security” tab;
> -       the document’s content type;
> -       the document’s url;
> -       the document itself.
>
> What the elastic search connector does is:
> -       Map the document’s url to ElasticSearch’s document id field (which I
> guess shows up in Elastic Search as the ‘uri’ field)
> -       Output all the metadata directly to ElasticSearch using the name
> provided by the repository connector
> -       Set the file value to “” (which seems wrong, since that could be
> helpful if available - let me know if you think a fix for this would
> be useful)
> -       NONE of the rest of the document fields (content type, acls, etc)
> are communicated to Elastic Search at all right now, except for the
> document itself.
>
> Karl
>
>
> On Mon, Feb 11, 2013 at 2:55 PM, Tony Edgin <[email protected]>
> wrote:
> > Thanks for the speedy response!
> >
> > I eventually want to index the contents of our local website with Elastic
> > Search.
> >
> > I would use the Web repository connector with the no authority connector and
> > the Elasticsearch output connector.  Would you mind letting me know the
> > names and meanings of the metadata that gets passed to Elastic Search?
> >
> > Thanks again.
> >
> >
> > On Mon, Feb 11, 2013 at 12:45 PM, Karl Wright <[email protected]>
> wrote:
> >>
> >> So let me get this clear - you are looking to find out what the
> >> names/meanings are of the metadata that gets passed to the output
> >> connector, for a given repository connection?
> >>
> >> If this is what you are looking for, I'm afraid that while at one
> >> point the end-user documentation described this pretty accurately, it
> >> is now significantly out of date.  While it's not terribly hard to
> >> compile this information from source code etc., the work definitely
> >> needs to be repeated by somebody.
> >>
> >> If you want to ask this question about a specific connector, I can
> >> certainly try to answer it, though.  If you want to contribute either
> >> the information or a documentation patch, this would be great too.
> >>
> >> Karl
> >>
> >> On Mon, Feb 11, 2013 at 2:38 PM, Tony Edgin <[email protected]>
> >> wrote:
> >> > I'm sure this is documented somewhere, and I apologize in advance for not
> >> > being able to find it.
> >> >
> >> > How do I determine the model or schema of the document passed to the search
> >> > engine by a given job?
> >> >
> >> > For instance, I'm running a job that crawls a directory on my local file
> >> > system and passes it to Elastic Search.  Interrogating Elastic Search, I can
> >> > determine that the document has three fields, "file", "type" and "uri", all
> >> > strings.  How would I have known that in advance?
> >> >
> >> > Thanks for any help.
> >
> >
>
