No intention of spamming but I also want to mention tika-python <https://github.com/chrismattmann/tika-python> in the toolchain.
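
For anyone who wants to try it, here is a minimal sketch of what client-side parsing with tika-python might look like. It assumes tika-python's `parser.from_file` (which returns a dict with `content` and `metadata` keys) and pysolr's `Solr.add`. The helper names, the use of the file path as `id`, and the `content` field name are illustrative assumptions, not something prescribed in this thread.

```python
from typing import Any, Dict, Optional


def build_solr_doc(doc_id: str, parsed: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a tika-python parse result into a flat Solr document.

    The "content" field name is an assumption; it must exist in your schema.
    """
    text: Optional[str] = parsed.get("content")  # may be None for empty files
    return {"id": doc_id, "content": (text or "").strip()}


def index_pdf(path: str, solr_url: str) -> None:
    """Parse one file on the client and send the extracted text to Solr."""
    # Imported lazily so build_solr_doc can be used without these packages.
    from tika import parser  # pip install tika (needs Java at run time)
    import pysolr            # pip install pysolr

    parsed = parser.from_file(path)
    solr = pysolr.Solr(solr_url, timeout=30)
    solr.add([build_solr_doc(path, parsed)], commit=True)
```

Calling `index_pdf("ESLII_print10.pdf", "http://localhost:8983/solr/mycore")` would then index one file, where the core URL is a placeholder. Parsing happens in the client process, so a parser failure or an out-of-memory error does not take Solr down with it.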
Ziyuan

On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan <ziyu...@gmail.com> wrote:

> Dear Erick and Timothy,
>
> I also took a look at the Python clients (say, SolrClient and pysolr)
> because Python is my main programming language. I have an impression that
> 1. they send HTTP requests to the server according to the server APIs; 2.
> they are not official and thus possibly not up to date. Does SolrJ talk to
> the server via HTTP or some other more native ways? Is the main benefit of
> SolrJ over other clients the official shipment with Solr? Thank you.
>
> Best regards,
> Ziyuan
>
> On Jun 19, 2017 18:43, "ZiYuan" <ziyu...@gmail.com> wrote:
>
>> Dear Erick and Timothy,
>>
>> yes I will parse from the client for all the benefits. I am just trying
>> to figure out what is going on by indexing one or two PDF files first.
>> Thank you both.
>>
>> Best regards,
>> Ziyuan
>>
>> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> bq: Hope that there is no side effect of not mapping the PDF
>>>
>>> Well, yes it will have that side effect. You can cure that with a
>>> copyField directive from content to _text_.
>>>
>>> But do really consider running this as a SolrJ program on the client.
>>> Tim knows in far more painful detail than I do what kinds of problems
>>> there are when parsing all the different formats so I'd _really_
>>> follow his advice.
>>>
>>> Tika pretty much has an impossible job. "Here, try to parse all these
>>> different formats, implemented by different vendors with different
>>> versions that more or less follow a spec which really isn't a spec in
>>> many cases just recommendations using packages that may or may not be
>>> actively maintained. And by the way, we'll try to handle that 1G
>>> document that someone sends us, but don't blame us if we hit an
>>> OOM.....". When Tika is run on the same box as Solr any problems in
>>> that entire chain can adversely affect your search.
>>>
>>> Not to mention that Tika has to do some heavy lifting, using CPU
>>> cycles that are unavailable for Solr.
>>>
>>> Extracting Request Handler is a fine way to get started, but for
>>> production seriously consider a separate client.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
>>> > Hi Erick,
>>> >
>>> > Now it is clear. I have to update the request handler of /update/extract/
>>> > from
>>> > "defaults":{"fmap.content":"_text_"}
>>> > to
>>> > "defaults":{"fmap.content":"content"}
>>> > to fill the field.
>>> >
>>> > Hope that there is no side effect of not mapping the PDF content to _text_.
>>> > Thank you for the hint.
>>> >
>>> > Best regards,
>>> > Ziyuan
>>> >
>>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <erik.hatc...@gmail.com>
>>> > wrote:
>>> >
>>> >> Ziyuan -
>>> >>
>>> >> You may be interested in the example/files that ships with Solr too. It’s
>>> >> got schema and config and even UI for file indexing and searching. Check
>>> >> out README.txt under example/files in your Solr install.
>>> >>
>>> >> Erik
>>> >>
>>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
>>> >> >
>>> >> > Hi Erick,
>>> >> >
>>> >> > thanks very much for the explanations!
>>> >> > Clarification for question 2: more
>>> >> > specifically I cannot see the field content in the returned JSON, with
>>> >> > the same definitions as in the post
>>> >> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>>> >> > :
>>> >> >
>>> >> > <field name="content" type="text_general" indexed="false" stored="true"/>
>>> >> > <field name="text" type="text_general" multiValued="true" indexed="true"
>>> >> > stored="false"/>
>>> >> > <copyField source="content" dest="text"/>
>>> >> >
>>> >> > Is it so that Tika does not fill these two fields automatically and I have
>>> >> > to write some client code to fill them?
>>> >> >
>>> >> > Best regards,
>>> >> > Ziyuan
>>> >> >
>>> >> >
>>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerick...@gmail.com>
>>> >> > wrote:
>>> >> >
>>> >> >> 1> Yes, you can use your single definition. The author identifies the
>>> >> >> "text" field as a catch-all. Somewhere in the schema there'll be a
>>> >> >> copyField directive copying (perhaps) many different fields to the
>>> >> >> "text" field. That permits simple searches against a single field
>>> >> >> rather than, say, using edismax to search across multiple separate
>>> >> >> fields.
>>> >> >>
>>> >> >> 2> The link you referenced is for Data Import Handler, which is much
>>> >> >> different than just posting files to Solr. See
>>> >> >> ExtractingRequestHandler:
>>> >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>>> >> >> There are ways to map meta-data fields from the doc into specific
>>> >> >> fields matching your schema. Be a little careful here. There is no
>>> >> >> standard across different types of docs as to what meta-data field is
>>> >> >> included. PDF might have a "last_edited" field.
>>> >> >> Word might have a
>>> >> >> "last_modified" field where the two mean the same thing. Here's a link
>>> >> >> to a SolrJ program that'll dump all the fields:
>>> >> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>>> >> >> hack out the DB bits.
>>> >> >>
>>> >> >> BTW, once you get more familiar with processing, I strongly recommend
>>> >> >> you do the document processing on the client, the reasons are outlined
>>> >> >> in that article.
>>> >> >>
>>> >> >> bq: even I define the fields as he said I cannot see them in the
>>> >> >> search results as keys in JSON
>>> >> >> are the fields set as stored="true"? They must be to be returned in
>>> >> >> requests (skipping the docValues discussion here).
>>> >> >>
>>> >> >> 3> Yes, the text field is a concatenation of all the other ones.
>>> >> >> Because it has stored=false, you can only search it, you cannot
>>> >> >> highlight or view. Fields you highlight must have stored=true BTW.
>>> >> >>
>>> >> >> Whether or not you can highlight "Trevor Hastie" depends on a lot of
>>> >> >> things, most particularly whether that text is ever actually in a
>>> >> >> field in your index. There's just no guarantee that the name
>>> >> >> of the file is indexed in a searchable/highlightable way.
>>> >> >>
>>> >> >> And the query q=id:Trevor Hastie won't do what you think. It'll be parsed
>>> >> >> as
>>> >> >> id:Trevor _text_:Hastie
>>> >> >> _text_ is the default field, look for a "df" parameter in your request
>>> >> >> handler in solrconfig.xml (usually "/select" or "/query").
>>> >> >>
>>> >> >> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyu...@gmail.com> wrote:
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> I am new to Solr and I need to implement a full-text search of some PDF
>>> >> >>> files. The indexing part works out of the box by using bin/post.
>>> >> >>> I can see
>>> >> >>> search results in the admin UI given some queries, though without the
>>> >> >>> matched texts and the context.
>>> >> >>>
>>> >> >>> Now I am reading this post
>>> >> >>> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>>> >> >>> for the highlighting part. It is for an older version of Solr when managed
>>> >> >>> schema was not available. Before fully understanding what it is doing I have
>>> >> >>> some questions:
>>> >> >>>
>>> >> >>> 1. He defined two fields:
>>> >> >>>
>>> >> >>> <field name="content" type="text_general" indexed="false" stored="true"
>>> >> >>> multiValued="false"/>
>>> >> >>> <field name="text" type="text_general" indexed="true" stored="false"
>>> >> >>> multiValued="true"/>
>>> >> >>>
>>> >> >>> But why are two fields needed? Can I define a field
>>> >> >>>
>>> >> >>> <field name="content" type="text_general" indexed="true" stored="true"
>>> >> >>> multiValued="true"/>
>>> >> >>>
>>> >> >>> to capture the full text?
>>> >> >>>
>>> >> >>> 2. How are the fields filled? I don't see relevant information in
>>> >> >>> TikaEntityProcessor's documentation
>>> >> >>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
>>> >> >>> The current text extractor should already be Tika (I can see
>>> >> >>>
>>> >> >>> "x_parsed_by":
>>> >> >>> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
>>> >> >>>
>>> >> >>> in the returned JSON of some query). But even I define the fields as he
>>> >> >>> said I cannot see them in the search results as keys in JSON.
>>> >> >>>
>>> >> >>> 3. The _text_ field seems a concatenation of other fields, does it contain
>>> >> >>> the full text? Though it does not seem to be accessible by default.
>>> >> >>>
>>> >> >>> To be brief, using The Elements of Statistical Learning
>>> >> >>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
>>> >> >>> as an example, how to highlight the relevant texts for the query "SVM"? And
>>> >> >>> if changing the file name into "The Elements of Statistical Learning -
>>> >> >>> Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the
>>> >> >>> query "id:Trevor Hastie"?
>>> >> >>>
>>> >> >>> Thank you.
>>> >> >>>
>>> >> >>> Best regards,
>>> >> >>> Ziyuan
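
As a footnote to the query-parsing point in the thread (an unquoted `id:Trevor Hastie` sends the second word to the default field), here is a rough sketch of building a select request that keeps the whole phrase in one field and asks for highlighting. `q`, `hl`, `hl.fl` and `wt` are standard Solr request parameters. The helper names, the core URL and the choice of `content` as the searched and highlighted field are assumptions for illustration only.

```python
from urllib.parse import urlencode


def phrase_query(field: str, phrase: str) -> str:
    """Return field:"phrase" so that every word is searched in `field`."""
    # Escape backslashes and double quotes so the phrase stays one token group.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def select_url(base_url: str, field: str, phrase: str, hl_field: str) -> str:
    """Build a /select URL that requests highlighting on `hl_field`."""
    params = {
        "q": phrase_query(field, phrase),
        "hl": "true",       # turn highlighting on
        "hl.fl": hl_field,  # field to highlight; it must be stored
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical core URL; adjust to your own installation.
    print(select_url("http://localhost:8983/solr/mycore",
                     "content", "Trevor Hastie", "content"))
```

Whether such a query returns highlighted snippets still depends on the points raised above: the field has to be stored, and the text has to actually be in that field.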