Re: Parsing and indexing parts of the input file paths

2015-07-22 Thread Andrew Musselman
Thanks; I don't know how the file path is getting into the id field.  Must
be some Tika default?

On Wed, Jul 22, 2015 at 9:52 AM, Erick Erickson 
wrote:

> the id field is absolutely NOT the thing you need to try to parse.
> Assuming you're stuffing the file path into that field, use a
> copyField to copy the filepath into another text (not string)
> field and do your work there.
>
> As far as whether the filepath is in some other field, well, you have
> to put it there, either through Tika configurations or explicitly through
> your crawler.
>
> Best,
> Erick
>
> On Wed, Jul 22, 2015 at 9:47 AM, Andrew Musselman
>  wrote:
> > Trying to figure out how to parse the file path, which when I run the
> > "cloud" instance becomes the "id" for each PDF document.
> >
> > Is that "id" field the thing to parse with PatternReplaceFilterFactory in
> > the config?  If not, is there a "file-path" field I can parse?
> >
> > On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson wrote:
> >
> >> Don't understand your question. If you're talking two different
> >> fields, use copyField.
> >>
> >> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
> >>  wrote:
> >> > Fwding to user..
> >> >
> >> > -- Forwarded message --
> >> > From: Andrew Musselman 
> >> > Date: Wed, Jul 22, 2015 at 8:54 AM
> >> > Subject: Re: Parsing and indexing parts of the input file paths
> >> > To: d...@lucene.apache.org
> >> >
> >> >
> >> > Thanks, and tell it to index the "id" field, which eventually contains
> >> > the file path?
> >> >
> >> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson
> >> > <erickerick...@gmail.com> wrote:
> >> >
> >> >> PatternReplaceFilterFactory would be just a configuration solution:
> >> >> construct a fieldType in schema.xml and you're done. It would require
> >> >> re-indexing of course.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
> >> >>  wrote:
> >> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known,
> >> >> > and can be put into config, let's assume.  Would this be config-only
> >> >> > or would it require some code, and could you point to some classes I
> >> >> > can start with if I need to write code, and some up-to-date docs?
> >> >> >
> >> >> > Same for the update processor, is there an example I could read?
> >> >> >
> >> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher
> >> >> > <erik.hatc...@gmail.com> wrote:
> >> >> >>
> >> >> >> If this is only for search, then an analysis chain could be crafted,
> >> >> >> likely with the pattern regex filter in the mix, to pull out pieces
> >> >> >> of the path.  How will you know the prefix of the file though?
> >> >> >>
> >> >> >> There’s also the ability to do this sort of thing in an update
> >> >> >> processor, most easily using the script update processor, using a
> >> >> >> bit of JavaScript to pull out the piece(s) you want to index (and
> >> >> >> even store at this point).
> >> >> >>
> >> >> >> —
> >> >> >> Erik Hatcher, Senior Solutions Architect
> >> >> >> http://www.lucidworks.com
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman
> >> >> >> <andrew.mussel...@gmail.com> wrote:
> >> >> >>
> >> >> >> Dear user and dev lists,
> >> >> >>
> >> >> >> We are loading files from a directory and would like to index a
> >> >> >> portion of each file path as a field as well as the text inside the
> >> >> >> file.
> >> >> >>
> >> >> >> E.g., on HDFS we have this file path:
> >> >> >>
> >> >> >> /user/andrew/1234/1234/file.pdf
> >> >> >>
> >> >> >> And we would like the "1234" token parsed from the file path and
> >> >> >> indexed as an additional field that can be searched on.
> >> >> >>
> >> >> >> From my initial searches I can't see how to do this easily, so would
> >> >> >> I need to write some custom code, or a plugin?
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >> -
> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >>
> >> >>
> >>
>


Re: Parsing and indexing parts of the input file paths

2015-07-22 Thread Erick Erickson
the id field is absolutely NOT the thing you need to try to parse.
Assuming you're stuffing the file path into that field, use a
copyField to copy the filepath into another text (not string)
field and do your work there.
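
For illustration only, here is a rough Python sketch of what that copyField setup might look like if it were sent as Schema API commands rather than hand-edited into schema.xml. It assumes a managed schema; the names "path_segment" and "path_segment_type", the "/user/andrew/" prefix, and the exact regex are assumptions made for the example, not settings confirmed anywhere in this thread. The small emulate_filter helper only mimics in Python what the pattern replace filter is expected to do to the single keyword token, so the regex can be checked without a running Solr.

```python
import json
import re

# Assumption: the known prefix is "/user/andrew/" and the first directory
# below it is the value to index. Adjust the pattern to the real layout.
PATTERN = r"^/user/andrew/([^/]+)/.*$"

# Hypothetical names: "path_segment_type" and "path_segment" are made up
# for this sketch. The command keys are Schema API commands.
schema_commands = {
    "add-field-type": {
        "name": "path_segment_type",
        "class": "solr.TextField",
        "analyzer": {
            # Keep the whole path as one token, then rewrite that token.
            "tokenizer": {"class": "solr.KeywordTokenizerFactory"},
            "filters": [
                {
                    "class": "solr.PatternReplaceFilterFactory",
                    "pattern": PATTERN,
                    "replacement": "$1",
                    "replace": "all",
                }
            ],
        },
    },
    "add-field": {
        "name": "path_segment",
        "type": "path_segment_type",
        "indexed": True,
        "stored": False,
    },
    # Leave "id" alone and analyze a copy of it instead.
    "add-copy-field": {"source": "id", "dest": ["path_segment"]},
}


def emulate_filter(token):
    """Approximate in Python what the pattern replace step does to one token."""
    return re.sub(PATTERN, r"\1", token)


if __name__ == "__main__":
    # Body one might POST to /solr/<collection>/schema (collection name not shown).
    print(json.dumps(schema_commands, indent=2))
    print(emulate_filter("/user/andrew/1234/1234/file.pdf"))
```

With a hand-maintained schema.xml the same three pieces would be a fieldType, a field, and a copyField element; either way the existing documents need re-indexing before the new field is searchable.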

As far as whether the filepath is in some other field, well, you have
to put it there, either through Tika configurations or explicitly through
your crawler.

Best,
Erick

On Wed, Jul 22, 2015 at 9:47 AM, Andrew Musselman
 wrote:
> Trying to figure out how to parse the file path, which when I run the
> "cloud" instance becomes the "id" for each PDF document.
>
> Is that "id" field the thing to parse with PatternReplaceFilterFactory in
> the config?  If not, is there a "file-path" field I can parse?
>
> On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson 
> wrote:
>
>> Don't understand your question. If you're talking two different
>> fields, use copyField.
>>
>> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
>>  wrote:
>> > Fwding to user..
>> >
>> > -- Forwarded message ------
>> > From: Andrew Musselman 
>> > Date: Wed, Jul 22, 2015 at 8:54 AM
>> > Subject: Re: Parsing and indexing parts of the input file paths
>> > To: d...@lucene.apache.org
>> >
>> >
>> > Thanks, and tell it to index the "id" field, which eventually contains
>> > the file path?
>> >
>> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson wrote:
>> >
>> >> PatternReplaceFilterFactory would be just a configuration solution:
>> >> construct a fieldType in schema.xml and you're done. It would require
>> >> re-indexing of course.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
>> >>  wrote:
>> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known,
>> >> > and can be put into config, let's assume.  Would this be config-only or
>> >> > would it require some code, and could you point to some classes I can
>> >> > start with if I need to write code, and some up-to-date docs?
>> >> >
>> >> > Same for the update processor, is there an example I could read?
>> >> >
>> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher
>> >> > <erik.hatc...@gmail.com> wrote:
>> >> >>
>> >> >> If this is only for search, then an analysis chain could be crafted,
>> >> >> likely with the pattern regex filter in the mix, to pull out pieces
>> >> >> of the path.  How will you know the prefix of the file though?
>> >> >>
>> >> >> There’s also the ability to do this sort of thing in an update
>> >> >> processor, most easily using the script update processor, using a
>> >> >> bit of JavaScript to pull out the piece(s) you want to index (and
>> >> >> even store at this point).
>> >> >>
>> >> >> —
>> >> >> Erik Hatcher, Senior Solutions Architect
>> >> >> http://www.lucidworks.com
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman
>> >> >> <andrew.mussel...@gmail.com> wrote:
>> >> >>
>> >> >> Dear user and dev lists,
>> >> >>
>> >> >> We are loading files from a directory and would like to index a
>> >> >> portion of each file path as a field as well as the text inside the
>> >> >> file.
>> >> >>
>> >> >> E.g., on HDFS we have this file path:
>> >> >>
>> >> >> /user/andrew/1234/1234/file.pdf
>> >> >>
>> >> >> And we would like the "1234" token parsed from the file path and
>> >> >> indexed as an additional field that can be searched on.
>> >> >>
>> >> >> From my initial searches I can't see how to do this easily, so would
>> >> >> I need to write some custom code, or a plugin?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>>


Re: Parsing and indexing parts of the input file paths

2015-07-22 Thread Andrew Musselman
Trying to figure out how to parse the file path, which when I run the
"cloud" instance becomes the "id" for each PDF document.

Is that "id" field the thing to parse with PatternReplaceFilterFactory in
the config?  If not, is there a "file-path" field I can parse?

On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson 
wrote:

> Don't understand your question. If you're talking two different
> fields, use copyField.
>
> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
>  wrote:
> > Fwding to user..
> >
> > -- Forwarded message --
> > From: Andrew Musselman 
> > Date: Wed, Jul 22, 2015 at 8:54 AM
> > Subject: Re: Parsing and indexing parts of the input file paths
> > To: d...@lucene.apache.org
> >
> >
> > Thanks, and tell it to index the "id" field, which eventually contains
> > the file path?
> >
> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson wrote:
> >
> >> PatternReplaceFilterFactory would be just a configuration solution:
> >> construct a fieldType in schema.xml and you're done. It would require
> >> re-indexing of course.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
> >>  wrote:
> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known,
> >> > and can be put into config, let's assume.  Would this be config-only or
> >> > would it require some code, and could you point to some classes I can
> >> > start with if I need to write code, and some up-to-date docs?
> >> >
> >> > Same for the update processor, is there an example I could read?
> >> >
> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher
> >> > <erik.hatc...@gmail.com> wrote:
> >> >>
> >> >> If this is only for search, then an analysis chain could be crafted,
> >> >> likely with the pattern regex filter in the mix, to pull out pieces
> >> >> of the path.  How will you know the prefix of the file though?
> >> >>
> >> >> There’s also the ability to do this sort of thing in an update
> >> >> processor, most easily using the script update processor, using a
> >> >> bit of JavaScript to pull out the piece(s) you want to index (and
> >> >> even store at this point).
> >> >>
> >> >> —
> >> >> Erik Hatcher, Senior Solutions Architect
> >> >> http://www.lucidworks.com
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman
> >> >> <andrew.mussel...@gmail.com> wrote:
> >> >>
> >> >> Dear user and dev lists,
> >> >>
> >> >> We are loading files from a directory and would like to index a
> >> >> portion of each file path as a field as well as the text inside the
> >> >> file.
> >> >>
> >> >> E.g., on HDFS we have this file path:
> >> >>
> >> >> /user/andrew/1234/1234/file.pdf
> >> >>
> >> >> And we would like the "1234" token parsed from the file path and
> >> >> indexed as an additional field that can be searched on.
> >> >>
> >> >> From my initial searches I can't see how to do this easily, so would
> >> >> I need to write some custom code, or a plugin?
> >> >>
> >> >> Thanks!
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
>


Re: Parsing and indexing parts of the input file paths

2015-07-22 Thread Erick Erickson
Don't understand your question. If you're talking two different
fields, use copyField.

On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
 wrote:
> Fwding to user..
>
> -- Forwarded message --
> From: Andrew Musselman 
> Date: Wed, Jul 22, 2015 at 8:54 AM
> Subject: Re: Parsing and indexing parts of the input file paths
> To: d...@lucene.apache.org
>
>
> Thanks, and tell it to index the "id" field, which eventually contains the
> file path?
>
> On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson 
> wrote:
>
>> PatternReplaceFilterFactory would be just a configuration solution:
>> construct a fieldType in schema.xml and you're done. It would require
>> re-indexing of course.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
>>  wrote:
>> > Erik, thanks; the prefix starting with "/user/andrew/" will be known, and
>> > can be put into config, let's assume.  Would this be config-only or
>> > would it require some code, and could you point to some classes I can
>> > start with if I need to write code, and some up-to-date docs?
>> >
>> > Same for the update processor, is there an example I could read?
>> >
>> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher 
>> > wrote:
>> >>
>> >> If this is only for search, then an analysis chain could be crafted,
>> >> likely with the pattern regex filter in the mix, to pull out pieces of
>> >> the path.  How will you know the prefix of the file though?
>> >>
>> >> There’s also the ability to do this sort of thing in an update
>> >> processor, most easily using the script update processor, using a bit
>> >> of JavaScript to pull out the piece(s) you want to index (and even
>> >> store at this point).
>> >>
>> >> —
>> >> Erik Hatcher, Senior Solutions Architect
>> >> http://www.lucidworks.com
>> >>
>> >>
>> >>
>> >>
>> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman
>> >> <andrew.mussel...@gmail.com> wrote:
>> >>
>> >> Dear user and dev lists,
>> >>
>> >> We are loading files from a directory and would like to index a portion
>> >> of each file path as a field as well as the text inside the file.
>> >>
>> >> E.g., on HDFS we have this file path:
>> >>
>> >> /user/andrew/1234/1234/file.pdf
>> >>
>> >> And we would like the "1234" token parsed from the file path and indexed
>> >> as an additional field that can be searched on.
>> >>
>> >> From my initial searches I can't see how to do this easily, so would I
>> >> need to write some custom code, or a plugin?
>> >>
>> >> Thanks!
>> >>
>> >>
>> >
>>
>>
>>


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Andrew Musselman
Which can only happen if I post it to a web service, and won't happen if I
do it through config?

On Tue, Jul 21, 2015 at 2:19 PM, Upayavira  wrote:

> yes, unless it has been added consciously as a separate field.
>
> On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote:
> > Thanks, so by the time we would get to an Analyzer the file path is
> > forgotten?
> >
> > https://cwiki.apache.org/confluence/display/solr/Analyzers
> >
> > On Tue, Jul 21, 2015 at 1:27 PM, Upayavira  wrote:
> >
> > > Solr generally does not interact with the file system in that way (with
> > > the exception of the DIH).
> > >
> > > It is the job of the code that pushes a file to Solr to process the
> > > filename and send that along with the request.
> > >
> > > See here for more info:
> > >
> > > https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> > >
> > > You could provide literal.filename=blah/blah
> > >
> > > Upayavira
> > >
> > >
> > > On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > > > I'm not sure, it's a remote team but will get more info.  For now,
> > > > assuming that a certain directory is specified, like "/user/andrew/",
> > > > and a regex is applied to capture anything two directories below
> > > > matching "*/*/*.pdf".
> > > >
> > > > Would there be a way to capture the wild-carded values and index them
> > > > as fields?
> > > >
> > > > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:
> > > >
> > > > > Keeping to the user list (the right place for this question).
> > > > >
> > > > > More information is needed here - how are you getting these
> > > > > documents into Solr? Are you posting them to /update/extract? Or
> > > > > using DIH, or?
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > > > Dear user and dev lists,
> > > > > >
> > > > > > We are loading files from a directory and would like to index a
> > > > > > portion of each file path as a field as well as the text inside
> > > > > > the file.
> > > > > >
> > > > > > E.g., on HDFS we have this file path:
> > > > > >
> > > > > > /user/andrew/1234/1234/file.pdf
> > > > > >
> > > > > > And we would like the "1234" token parsed from the file path and
> > > > > > indexed as an additional field that can be searched on.
> > > > > >
> > > > > > From my initial searches I can't see how to do this easily, so
> > > > > > would I need to write some custom code, or a plugin?
> > > > > >
> > > > > > Thanks!
> > > > >
> > >
>


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Upayavira
yes, unless it has been added consciously as a separate field.

On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote:
> Thanks, so by the time we would get to an Analyzer the file path is
> forgotten?
> 
> https://cwiki.apache.org/confluence/display/solr/Analyzers
> 
> On Tue, Jul 21, 2015 at 1:27 PM, Upayavira  wrote:
> 
> > Solr generally does not interact with the file system in that way (with
> > the exception of the DIH).
> >
> > It is the job of the code that pushes a file to Solr to process the
> > filename and send that along with the request.
> >
> > See here for more info:
> >
> > https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> >
> > You could provide literal.filename=blah/blah
> >
> > Upayavira
> >
> >
> > On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > > I'm not sure, it's a remote team but will get more info.  For now,
> > > assuming that a certain directory is specified, like "/user/andrew/",
> > > and a regex is applied to capture anything two directories below
> > > matching "*/*/*.pdf".
> > >
> > > Would there be a way to capture the wild-carded values and index them as
> > > fields?
> > >
> > > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:
> > >
> > > > Keeping to the user list (the right place for this question).
> > > >
> > > > More information is needed here - how are you getting these documents
> > > > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> > > >
> > > > Upayavira
> > > >
> > > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > > Dear user and dev lists,
> > > > >
> > > > > We are loading files from a directory and would like to index a
> > > > > portion of each file path as a field as well as the text inside
> > > > > the file.
> > > > >
> > > > > E.g., on HDFS we have this file path:
> > > > >
> > > > > /user/andrew/1234/1234/file.pdf
> > > > >
> > > > > And we would like the "1234" token parsed from the file path and
> > > > > indexed as an additional field that can be searched on.
> > > > >
> > > > > From my initial searches I can't see how to do this easily, so would
> > > > > I need to write some custom code, or a plugin?
> > > > >
> > > > > Thanks!
> > > >
> >


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Andrew Musselman
Thanks, so by the time we would get to an Analyzer the file path is
forgotten?

https://cwiki.apache.org/confluence/display/solr/Analyzers

On Tue, Jul 21, 2015 at 1:27 PM, Upayavira  wrote:

> Solr generally does not interact with the file system in that way (with
> the exception of the DIH).
>
> It is the job of the code that pushes a file to Solr to process the
> filename and send that along with the request.
>
> See here for more info:
>
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
> You could provide literal.filename=blah/blah
>
> Upayavira
>
>
> On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > I'm not sure, it's a remote team but will get more info.  For now,
> > assuming that a certain directory is specified, like "/user/andrew/",
> > and a regex is applied to capture anything two directories below
> > matching "*/*/*.pdf".
> >
> > Would there be a way to capture the wild-carded values and index them as
> > fields?
> >
> > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:
> >
> > > Keeping to the user list (the right place for this question).
> > >
> > > More information is needed here - how are you getting these documents
> > > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> > >
> > > Upayavira
> > >
> > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > Dear user and dev lists,
> > > >
> > > > We are loading files from a directory and would like to index a
> > > > portion of each file path as a field as well as the text inside the
> > > > file.
> > > >
> > > > E.g., on HDFS we have this file path:
> > > >
> > > > /user/andrew/1234/1234/file.pdf
> > > >
> > > > And we would like the "1234" token parsed from the file path and
> > > > indexed as an additional field that can be searched on.
> > > >
> > > > From my initial searches I can't see how to do this easily, so would
> > > > I need to write some custom code, or a plugin?
> > > >
> > > > Thanks!
> > >
>


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Upayavira
Solr generally does not interact with the file system in that way (with
the exception of the DIH).

It is the job of the code that pushes a file to Solr to process the
filename and send that along with the request.

See here for more info:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

You could provide literal.filename=blah/blah
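
As a rough illustration of that idea, the sketch below shows client-side code that derives the directory name from the path and attaches it as literal.* parameters on an /update/extract request. It only builds the URL and sends nothing. The host, the collection name "mycollection", and the field name "path_segment" are assumptions for the example; the "/user/andrew/" prefix is the one mentioned elsewhere in the thread.

```python
from urllib.parse import urlencode

# Hypothetical values: the Solr URL, collection name, and the
# "path_segment" field are assumptions, not taken from the thread.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycollection/update/extract"
KNOWN_PREFIX = "/user/andrew/"


def path_segment(file_path, prefix=KNOWN_PREFIX):
    """Return the first directory name after the known prefix, or None."""
    if not file_path.startswith(prefix):
        return None
    remainder = file_path[len(prefix):]
    head, sep, _ = remainder.partition("/")
    # Without a separator the remainder is a bare file name, not a directory.
    return head if sep and head else None


def extract_request_url(file_path):
    """Build the /update/extract URL carrying literal.* field values."""
    params = {"literal.id": file_path, "literal.filename": file_path}
    segment = path_segment(file_path)
    if segment is not None:
        params["literal.path_segment"] = segment
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(extract_request_url("/user/andrew/1234/1234/file.pdf"))
```

The file itself would then be posted as the request body to that URL by whatever code pushes documents to Solr; the target fields have to exist in the schema (or match a dynamic field) for the literal values to be accepted.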

Upayavira


On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> I'm not sure, it's a remote team but will get more info.  For now,
> assuming that a certain directory is specified, like "/user/andrew/",
> and a regex is applied to capture anything two directories below
> matching "*/*/*.pdf".
> 
> Would there be a way to capture the wild-carded values and index them as
> fields?
> 
> On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:
> 
> > Keeping to the user list (the right place for this question).
> >
> > More information is needed here - how are you getting these documents
> > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> >
> > Upayavira
> >
> > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > Dear user and dev lists,
> > >
> > > We are loading files from a directory and would like to index a portion
> > > of each file path as a field as well as the text inside the file.
> > >
> > > E.g., on HDFS we have this file path:
> > >
> > > /user/andrew/1234/1234/file.pdf
> > >
> > > And we would like the "1234" token parsed from the file path and indexed
> > > as an additional field that can be searched on.
> > >
> > > From my initial searches I can't see how to do this easily, so would I
> > > need to write some custom code, or a plugin?
> > >
> > > Thanks!
> >


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Andrew Musselman
I'm not sure, it's a remote team but will get more info.  For now, assuming
that a certain directory is specified, like "/user/andrew/", and a regex is
applied to capture anything two directories below matching "*/*/*.pdf".

Would there be a way to capture the wild-carded values and index them as
fields?
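
To make the question concrete, here is a small Python sketch of the capture being described: a regular expression anchored on the known prefix that pulls out the two directory names and the file name. The group names and the assumption that every file sits exactly two directories below the prefix are illustrative only.

```python
import re

# Assumed layout: <prefix><dir1>/<dir2>/<name>.pdf, following the
# "*/*/*.pdf" description. The group names are made up for this sketch.
PREFIX = "/user/andrew/"
PATH_PATTERN = re.compile(
    re.escape(PREFIX) + r"(?P<dir1>[^/]+)/(?P<dir2>[^/]+)/(?P<name>[^/]+\.pdf)"
)


def path_fields(file_path):
    """Return the captured path parts as a dict, or {} when the path does not fit."""
    match = PATH_PATTERN.fullmatch(file_path)
    return match.groupdict() if match else {}


if __name__ == "__main__":
    print(path_fields("/user/andrew/1234/1234/file.pdf"))
```

Each captured value could then be sent along with the document as its own field by whatever code feeds Solr.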

On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:

> Keeping to the user list (the right place for this question).
>
> More information is needed here - how are you getting these documents
> into Solr? Are you posting them to /update/extract? Or using DIH, or?
>
> Upayavira
>
> On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > Dear user and dev lists,
> >
> > We are loading files from a directory and would like to index a portion
> > of each file path as a field as well as the text inside the file.
> >
> > E.g., on HDFS we have this file path:
> >
> > /user/andrew/1234/1234/file.pdf
> >
> > And we would like the "1234" token parsed from the file path and indexed
> > as an additional field that can be searched on.
> >
> > From my initial searches I can't see how to do this easily, so would I
> > need to write some custom code, or a plugin?
> >
> > Thanks!
>


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Upayavira
Keeping to the user list (the right place for this question).

More information is needed here - how are you getting these documents
into Solr? Are you posting them to /update/extract? Or using DIH, or?

Upayavira

On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> Dear user and dev lists,
> 
> We are loading files from a directory and would like to index a portion
> of each file path as a field as well as the text inside the file.
>
> E.g., on HDFS we have this file path:
>
> /user/andrew/1234/1234/file.pdf
>
> And we would like the "1234" token parsed from the file path and indexed
> as an additional field that can be searched on.
>
> From my initial searches I can't see how to do this easily, so would I
> need to write some custom code, or a plugin?
> 
> Thanks!