Re: Metadata and Newline Characters at Content

Furkan KAMACI Thu, 24 Nov 2016 10:37:51 -0800

Hi Erick,

When I check the *Solr* documentation I see that [1]:


*In addition to Tika's metadata, Solr adds the following metadata (defined
in ExtractingMetadataConstants):*

*"stream_name" - The name of the ContentStream as uploaded to Solr.
Depending on how the file is uploaded, this may or may not be set.*
*"stream_source_info" - Any source info about the stream. See
ContentStream.*
*"stream_size" - The size of the stream in bytes(?)*
*"stream_content_type" - The content type of the stream, if available.*

So it seems that these fields are added by Solr rather than by Tika. Do you
know how to enable or disable this feature?

Kind Regards,
Furkan KAMACI

[1] https://wiki.apache.org/solr/ExtractingRequestHandler

On Thu, Nov 24, 2016 at 6:51 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> about PatternCaptureGroupFilterFactory. This isn't going to help. The
> data you see when you return stored data is _before_ any analysis so
> the Pattern....Factory won't be applied. You could do this in a
> ScriptUpdateProcessorFactory. Or, just don't worry about it and have
> the real app deal with it.
>
> I don't particularly know about the Tika settings, that's largely a guess.
>
> Best,
> Erick
>
> On Thu, Nov 24, 2016 at 8:43 AM, Furkan KAMACI <furkankam...@gmail.com>
> wrote:
> > Hi Erick,
> >
> > 1) I am looking at the stored data via the Solr Admin UI. I send the query
> > and check what is in the content field.
> >
> > 2) I can debug the Tika settings if you think that it is not the desired
> > behaviour to have such metadata fields combined into the content field.
> >
> > *PS:* Is there any way to get rid of it other than
> > using PatternCaptureGroupFilterFactory?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> 1> I'm assuming when you "see" this data you're looking at the stored
> >> data, right? It's a verbatim copy of whatever you sent to the field.
> >> I'm guessing it's a character-encoding mismatch between the source and
> >> what you use to display.
> >>
> >> 2> How are you extracting this data? There are Tika options I think
> >> that can/do mush fields together.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>
> >> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI <furkankam...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I'm testing Solr 4.9.1 and have indexed documents with it. The content
> >> > field in the schema uses the text_general field type, which is
> >> > unmodified from the original. I do not copy any fields to content.
> >> > When I check the data, I see content values like:
> >> >
> >> >  " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
> >> > application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
> >> > \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n \n
> >> > \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
> >> > directed by Elia Kazan \n"
> >> >
> >> > My questions:
> >> >
> >> > 1) Is it normal to have those newline characters?
> >> > 2) Is it normal to have file metadata at the beginning of the content
> >> > (e.g. stream_source_info, stream_content_type), or is it related to the
> >> > tool that I use to post data to Solr?
> >> >
> >> > Kind Regards,
> >> > Furkan KAMACI
> >>
>
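
For the "have the real app deal with it" route mentioned in the thread, a
minimal client-side sketch in Python might look like the following. It is not
a Solr or Tika API: the names clean_content and METADATA_KEYS are hypothetical,
the key list is taken only from the sample value quoted in the thread, and it
assumes the metadata always appears as a leading run of "key value" lines
before the real text.

```python
import re

# Keys observed in the sample content value from this thread (an assumption:
# other documents or handlers may emit different or additional keys).
METADATA_KEYS = {
    "stream_source_info",
    "stream_content_type",
    "stream_size",
    "stream_name",
    "Content-Type",
    "resourceName",
}


def clean_content(raw):
    """Drop the leading metadata block and normalize whitespace."""
    lines = [line.strip() for line in raw.split("\n")]

    # Skip the leading run of blank lines and "key value" metadata lines.
    start = 0
    while start < len(lines):
        first_token = lines[start].split(" ", 1)[0]
        if lines[start] and first_token not in METADATA_KEYS:
            break
        start += 1

    # Keep the remaining non-empty lines, collapsing internal whitespace runs.
    body = [re.sub(r"\s+", " ", line) for line in lines[start:] if line]
    return "\n".join(body)


if __name__ == "__main__":
    # The stored value as shown in the thread, with the JSON "\n" escapes
    # decoded to real newlines.
    raw = (
        " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type "
        "application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf   "
        "\nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n \n "
        "\n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\" "
        "directed by Elia Kazan \n"
    )
    print(clean_content(raw))
```

Only the leading block is removed, so a later line in the real text that
happens to start with one of these keys is left alone. This cleans the value
after retrieval; it does not change what is stored or indexed in Solr.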
