RE: Solr fields for Microsoft files, image files, PDF, text files

2017-09-25 Thread Allison, Timothy B.
bq: How do I get a list of all valid field names based on the file type

bq: You don't. At least I've never found any. Plus various document formats 
will allow custom meta-data fields so there's no definitive list.

It would be trivial to add field counts per MIME type to tika-eval. If you're 
interested in this, please open a ticket on Tika's JIRA.
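
To make the idea concrete, here is a minimal sketch of what "field counts per 
MIME type" could look like. It is not tika-eval code and makes no claim about 
how tika-eval would implement it. It assumes you already have one metadata 
dictionary per file (for example, parsed from Tika's JSON metadata output); 
the function name and the sample records are made up for illustration.

```python
from collections import Counter, defaultdict


def field_counts_per_mime(records):
    """Count how often each metadata field name occurs for each MIME type."""
    counts = defaultdict(Counter)
    for metadata in records:
        # Drop parameters such as "; charset=UTF-8" so variants group together.
        mime = metadata.get("Content-Type", "unknown").split(";")[0].strip()
        for field in metadata:
            counts[mime][field] += 1
    return counts


# Made-up sample records standing in for per-file metadata extracted by Tika.
sample = [
    {"Content-Type": "application/pdf", "dc:creator": "A", "dc:title": "T"},
    {"Content-Type": "application/pdf", "dc:creator": "B"},
    {"Content-Type": "text/plain; charset=UTF-8"},
]

counts = field_counts_per_mime(sample)
for mime, fields in sorted(counts.items()):
    print(mime, dict(fields))
```

Run over a representative set of your own files, a tally like this shows 
which fields actually turn up for each type, which is the closest thing to a 
"list of valid fields" you are likely to get.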


Re: Solr fields for Microsoft files, image files, PDF, text files

2017-09-25 Thread Erik Hatcher
Phillip - You may want to start with example/files, which ships with Solr. It 
is specifically designed as a configuration (and UI!) for indexing rich files, 
and it does a bit more than the other examples: it pulls out acronyms, e-mail 
addresses, and URLs from text, as well as what you’ve asked about, mapping 
content types to friendlier human types (“image” instead of the whole gamut of 
image/* content types).
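
As a rough illustration of those two ideas (and not the way example/files 
actually implements them), the sketch below maps a MIME type to a coarse 
label and pulls e-mail addresses and URLs out of text with deliberately 
simple regular expressions. All names, labels, and patterns here are 
assumptions chosen for the example.

```python
import re

# Deliberately simple patterns for illustration; real-world e-mail and URL
# matching has many more edge cases.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def friendly_type(content_type):
    """Map a MIME type such as 'image/png' to a coarse label such as 'image'."""
    mime = content_type.split(";")[0].strip().lower()
    major, _, minor = mime.partition("/")
    if major in ("image", "audio", "video", "text"):
        return major
    if minor == "pdf":
        return "pdf"
    return "other"


def extract_entities(text):
    """Return the e-mail addresses and URLs found in a block of text."""
    return {"emails": EMAIL_RE.findall(text), "urls": URL_RE.findall(text)}


print(friendly_type("image/tiff"))
print(extract_entities("Mail bob@example.com or see https://example.org/docs today"))
```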

Erik




Re: Solr fields for Microsoft files, image files, PDF, text files

2017-09-25 Thread Erick Erickson
bq: How do I get a list of all valid field names based on the file type

You don't. At least I've never found any. Plus various document
formats will allow custom meta-data fields so there's no definitive
list.

bq: Also how do I search the "free form" text for a word/pattern in
the Solr search tool?

You put the extracted text (as opposed to meta-data) into an analyzed
field and search that.
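
For instance, a query against such a field can be expressed as a request to
Solr's select handler. The sketch below only builds the URL and does not
contact a server. The core name "files", the field name "content", and the
localhost address are assumptions for illustration; substitute whatever
your own setup uses.

```python
from urllib.parse import urlencode


def build_select_url(term, field="content", core="files",
                     base="http://localhost:8983/solr"):
    """Build a select URL that searches one analyzed field for a term."""
    # "field", "core" and "base" are hypothetical defaults, not fixed names.
    params = {"q": f"{field}:{term}", "fl": "id", "rows": 10}
    return f"{base}/{core}/select?{urlencode(params)}"


print(build_select_url("invoice"))
```

A wildcard pattern such as "invoic*" can be passed the same way; the query
string is URL-encoded before it is appended.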



NOTE: Solr is a search engine. The closest thing to an OOB "Solr
Search Tool" is the admin UI, which isn't intended to be an end-user
facing app.

Here's some SolrJ code that'll let you explore the meta-data fields in
various document types:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

You can pull out the RDBMS bits pretty easily.

Best,
Erick



Solr fields for Microsoft files, image files, PDF, text files

2017-09-24 Thread Phillip Wu

Hi,
I'm starting out with Solr on a Windows box.

I want to index the following documents:
doc;docx
xls;xlsx
ppt
vsd

pdf
txt

gif;jpeg;tiff

I understand that Solr uses Apache Tika to read these file types and return an 
XML stream back to Solr.
For Tika image processing, I've loaded Tesseract.

To be able to search the documents, I need to define "fields" in a file called 
managed-schema.

How do I get a list of all valid field names based on the file type? For 
example, for *.doc, what "fields" exist so that I can choose what to store?

I'm assuming that, for example, *.doc files contain metadata put into the file 
by Microsoft Word (e.g. author, date) as well as "free form" text.

So where is the list of valid fields per file type?

Also how do I search the "free form" text for a word/pattern in the Solr search 
tool?