Have you looked at Solr Cell? See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

When working with things like MS Word, there are a couple of things to
be aware of:
1> there has to be a mapping between the meta-data (last_edited,
author, whatever) and the field in Solr you want that meta-data to go
to.
2> each type of document may use different meta-data names for the same thing.
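
To make those two points concrete, here is a minimal sketch of such a mapping in plain Java. The class name, the meta-data keys, and the Solr field names are all made up for illustration; they are not a list of what Tika actually emits, so check the real keys your documents produce.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps extracted meta-data keys to Solr field names.
final class MetadataFieldMapper {
    // Illustrative keys only. Several source keys can point at the same
    // Solr field because different document types name the same thing
    // differently.
    private static final Map<String, String> FIELD_MAP = new LinkedHashMap<>();
    static {
        FIELD_MAP.put("author", "author");
        FIELD_MAP.put("creator", "author");
        FIELD_MAP.put("last_edited", "last_modified");
        FIELD_MAP.put("last-modified", "last_modified");
    }

    /** Returns Solr field -> value for every meta-data key that has a mapping. */
    static Map<String, String> toSolrFields(Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String solrField = FIELD_MAP.get(entry.getKey().toLowerCase(Locale.ROOT));
            if (solrField != null) {
                // Keep the first value seen if two keys map to the same field.
                fields.putIfAbsent(solrField, entry.getValue());
            }
        }
        return fields;
    }
}
```

Keys without a mapping are silently dropped here; whether to drop them or send them to a catch-all field is a choice you have to make for your own schema.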

The other alternative is to use Tika directly in a Java program and
take full control of what goes where. Here's an example (you can
remove the database stuff easily):
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
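
As a rough illustration of what "full control" buys you for the question below, a program like that can derive the bare file name from the path itself before sending the document. This is only a sketch in plain Java with no Tika or SolrJ calls; the class name and the sample file name are hypothetical, and the "id" and "name" fields are assumed to match your schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: decides which values go into which Solr fields.
final class DocumentNameFields {
    /** Returns the text after the last '/' or '\' in the given path. */
    static String baseName(String path) {
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(cut + 1);
    }

    /** Builds field -> value for one file: "id" keeps the full path, "name" only the file name. */
    static Map<String, String> fieldsFor(String path) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", path);
        fields.put("name", baseName(path));
        return fields;
    }

    public static void main(String[] args) {
        // Sample path only; in a real indexer these values would be added
        // to the Solr document alongside the text extracted by Tika.
        System.out.println(fieldsFor("D:\\Lucene\\document\\report.docx"));
    }
}
```

The separator is found by hand rather than through java.nio.file so that Windows-style paths are split the same way no matter which OS the indexer runs on.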

Best,
Erick

On Thu, Dec 3, 2015 at 4:00 AM, kostali hassan
<med.has.kost...@gmail.com> wrote:
> I started working with Solr 5.x by extracting Solr to D:\solr and running the
> Solr server with:
>
> D:\solr\solr-5.3.1\bin>solr start
>
> Then I created a core in standalone mode:
>
> D:\solr\solr-5.3.1\bin>solr create -c mycore
>
> I need to index files from the file system (Word and PDF), and the default
> schema has no "name" field for the document, so I added one using curl:
>
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>   "add-field":{
>      "name":"name",
>      "type":"text_general",
>      "stored":true,
>      "indexed":true }
> }' http://localhost:8983/solr/mycore/schema
>
> Then I re-indexed all documents with SimplePostTool on Windows:
>
> D:\solr\solr-5.3.1>java -classpath example\exampledocs\post.jar -Dauto=yes
> -Dc=mycore -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool
> D:\Lucene\document
>
> But even though the field "name" was added successfully, it is empty; the field
> "title" gets the name only for PDF documents, not for MS Word (.doc and .docx).
>
> Then I chose to index with the techproducts example because it doesn't use
> the Schema API, so I can modify its schema.xml directly:
>
> D:\solr\solr-5.3.1>solr -e techproducts
>
> Techproducts returns the name for all of the indexed .xml files.
>
> Then I created a new core, called demo, based on the Solr home
> example/techproducts/solr, using the schema.xml (which contains the field
> "name") and solrconfig.xml from techproducts.
>
> When I indexed all the documents, the field "name" existed but was still empty
> for every document.
>
> My question is: how can I get just the name of each document (MS Word and PDF),
> not the path as in the field "id" or the field "resource_name"? Do I have to
> create a new field type, or is there another way?
>
> Sorry for my basic English.
>
> Thank you.
