It's always frustrating when someone replies with "Why not do it a completely different way?". But I will anyway :).
There's no requirement at all that you send things to Solr to make Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ anyway, why not just parse on the client? This has the advantage of letting you offload the Tika processing, which can be quite expensive, from Solr. You can use the same Tika jars that come with Solr or download whatever version you want from the Tika project. That way, you can exercise much better control over what's done.

Here's a skeletal program with indexing from a DB mixed in, but it shouldn't be hard at all to pull the DB parts out.

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

FWIW,
Erick

On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> Is it possible to configure solr cell to only extract and store the body of
> a document when indexing? I'm currently doing the following which I
> thought would work
>
> ModifiableSolrParams params = new ModifiableSolrParams();
>
> params.set("defaultField", "content");
>
> params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");
>
> ContentStreamUpdateRequest up = new ContentStreamUpdateRequest(
> "/update/extract");
>
> up.setParams(params);
>
> FileStream f = new FileStream(new File(".."));
>
> up.addContentStream(f);
>
> up.setAction(ACTION.COMMIT, true, true);
>
> solrServer.request(up);
>
>
> But the result of content is as follows
>
> <arr name="content_mvtxt">
> <str/>
> <str>null</str>
> <str>ISO-8859-1</str>
> <str>text/plain; charset=ISO-8859-1</str>
> <str>Just a little test</str>
> </arr>
>
>
> What I had hoped for was just
>
> <arr name="content_mvtxt">
> <str>Just a little test</str>
> </arr>
>
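
For anyone who wants to see what the client-side approach from the reply might look like, here is a minimal sketch rather than a definitive implementation. It assumes the Tika parser jars and SolrJ are on the classpath. The class name BodyOnlyExtractor, its two helper methods, and the "id" field are made up for illustration; only the "content" field name comes from the original question. The idea is that Tika's BodyContentHandler collects just the body text, while the Metadata object (where values such as the charset and content type end up) is simply never copied into the document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Solr or Tika.
public class BodyOnlyExtractor {

    // Runs Tika on the client and returns only the text of the document body.
    static String extractBody(InputStream in)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the handler's default write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        // Tika fills this with content type, encoding, etc.
        // It is deliberately not indexed in this sketch.
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata, new ParseContext());
        return handler.toString().trim();
    }

    // Builds a Solr document holding only the fields we choose.
    // "id" is an assumed unique key; "content" is the field from the question.
    static SolrInputDocument toDocument(String id, String body) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("content", body);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            SolrInputDocument doc = toDocument(args[0], extractBody(in));
            System.out.println(doc);
            // With a SolrJ client such as the solrServer from the question:
            // solrServer.add(doc);
            // solrServer.commit();
        }
    }
}
```

Because the document is assembled by hand, nothing reaches the index unless it is added explicitly, so no xpath or defaultField parameters are needed on the Solr side.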