Yup! I have already set stored="true" for _text_. I will see to it. No 
worries.

BUT, I really need HELP with separating content & metadata. I checked, but 
there isn't any copyField directive copying values into the '_text_' field.
The only definition I have for _text_ is:
<field name="_text_" type="text_general" multiValued="true" indexed="true" 
stored="true"/>

For this: doc.addField("metadatafield1", value_of_metadata_field1);
I added the author name, etc. in the code, but I am not getting those fields. 
Also, doc.addField("_text_", textHandler.toString()); has a blank value in it.

Please help!
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 28 August 2019 16:50
To: solr-user@lucene.apache.org
Subject: Re: Require searching only for file content and not metadata

Attachments are aggressively stripped by this mailing list, so you’ll have to 
either post the file someplace and provide a link or paste the relevant 
sections into the e-mail.

You’re not getting any metadata because you’re not adding any metadata to the 
documents with doc.addField("metadatafield1", value_of_metadata_field1);

The only thing ever in the doc is what you explicitly put there. At this point 
it’s just “id” and “_text_”.
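As a minimal sketch of that selection step (plain HashMaps stand in here for 
Tika’s Metadata object and for SolrInputDocument so the logic runs on its own, 
and the field names are made up for illustration; in the real indexer each 
put becomes a doc.addField call):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class MetadataSelection {
    // Copy only the wanted metadata keys into the "document" map,
    // mirroring doc.addField(name, metadata.get(name)) in the real indexer.
    static Map<String, String> select(Map<String, String> metadata, Set<String> wanted) {
        Map<String, String> doc = new HashMap<>();
        for (String name : wanted) {
            String value = metadata.get(name);
            if (value != null) { // skip fields Tika didn't extract
                doc.put(name, value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Author", "someone");
        metadata.put("Content-Type", "application/pdf");
        metadata.put("producer", "noise-we-do-not-want");

        Map<String, String> doc = select(metadata, Set.of("Author", "Content-Type"));
        System.out.println(doc.size());        // 2
        System.out.println(doc.get("Author")); // someone
    }
}
```

Nothing goes into the document unless it appears in that explicit loop, which 
is exactly why keeping content and metadata in separate fields is under your 
control here.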

As for why _text_ isn’t showing up, does the schema have 'stored="true"' for 
the field? And when you query, are you specifying &fl=_text_? _text_ is usually 
a catch-all field in the default schemas with this definition:

<field name="_text_" type="text_general" indexed="true" stored="false" 
multiValued="true"/>

Since stored=false, well, it’s not stored so can’t be returned. If you’re 
successfully _searching_ on that field but not getting it back in the “fl” 
list, this is almost certainly a stored="false" issue.
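For instance, assuming the 'tika' collection name from your code, a request 
like the following should include the field in the response once 
stored="true" is in effect (documents indexed before the schema change have 
to be re-indexed before anything is returned):

```
http://localhost:8983/solr/tika/select?q=*:*&fl=id,_text_
```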

As for why you might have gotten all the metadata in this field with the post 
tool, check that there are no “copyField” directives in the schema that 
automatically copy other data into _text_.
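For example, some of the stock schemas ship with a catch-all directive along 
these lines (the exact source pattern varies by schema version); removing or 
narrowing it is what keeps metadata out of _text_:

```xml
<copyField source="*" dest="_text_"/>
```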

Best,
Erick



> On Aug 28, 2019, at 7:03 AM, Khare, Kushal (MIND) 
> <kushal.kh...@mind-infotech.com> wrote:
>
> Attaching managed-schema.xml
>
> -----Original Message-----
> From: Khare, Kushal (MIND) [mailto:kushal.kh...@mind-infotech.com]
> Sent: 28 August 2019 16:30
> To: solr-user@lucene.apache.org
> Subject: RE: Require searching only for file content and not metadata
>
> I already tried this example; I am currently working on it. I have compiled 
> the code, and it is indexing the documents. But it is not adding anything to 
> the _text_ field, and it is not giving any metadata.
> doc.addField("_text_", textHandler.toString()); --> here, 
> textHandler.toString() is blank for all 40 documents. All I am getting is 
> the 'id' & '_version_' field.
>
> This is the code that I tried :
>
> package mind.solr;
>
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> import org.apache.solr.client.solrj.impl.XMLResponseParser;
> import org.apache.solr.client.solrj.response.UpdateResponse;
> import org.apache.solr.common.SolrInputDocument;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.util.ArrayList;
> import java.util.Collection;
>
> public class solrJExtract {
>
>  private HttpSolrClient client;
>  private long start = System.currentTimeMillis();
>  private AutoDetectParser autoParser;
>  private int totalTika = 0;
>  private int totalSql = 0;
>
>  @SuppressWarnings("rawtypes")
> private Collection docList = new ArrayList();
>
>
> public static void main(String[] args) {
>    try {
>    solrJExtract idxer = new solrJExtract("http://localhost:8983/solr/tika");
>    idxer.doTikaDocuments(new File("D:\\docs"));
>    idxer.endIndexing();
>    } catch (Exception e) {
>      e.printStackTrace();
>    }
>  }
>
>  private  solrJExtract(String url) throws IOException, SolrServerException {
>    // Create a SolrCloud-aware client to send docs to Solr
>    // Use something like HttpSolrClient for stand-alone
>
>    client = new HttpSolrClient.Builder("http://localhost:8983/solr/tika")
>    .withConnectionTimeout(10000)
>    .withSocketTimeout(60000)
>    .build();
>
>    // binary parser is used by default for responses
>    client.setParser(new XMLResponseParser());
>
>    // One of the ways Tika can be used to attempt to parse arbitrary files.
>    autoParser = new AutoDetectParser();
>  }
>
> // Just a convenient place to wrap things up.
>  @SuppressWarnings("unchecked")
> private void endIndexing() throws IOException, SolrServerException {
>    if ( docList.size() > 0) { // Are there any documents left over?
>      client.add(docList, 300000); // Commit within 5 minutes
>    }
>    client.commit(); // Only needs to be done at the end,
>    // commitWithin should do the rest.
>    // Could even be omitted
>    // assuming commitWithin was specified.
>    long endTime = System.currentTimeMillis();
>    System.out.println("Total Time Taken: " + (endTime - start) +
>        " milliseconds to index " + totalSql +
>        " SQL rows and " + totalTika + " documents");
>
>  }
>
>  /**
>   * ***************************Tika processing here
>   */
>  // Recursively traverse the filesystem, parsing everything found.
>  private void doTikaDocuments(File root) throws IOException,
> SolrServerException {
>
>    // Simple loop for recursively indexing all the files
>    // in the root directory passed in.
>    for (File file : root.listFiles()) {
>      if (file.isDirectory()) {
>        doTikaDocuments(file);
>        continue;
>      }
>      // Get ready to parse the file.
>      ContentHandler textHandler = new BodyContentHandler();
>      Metadata metadata = new Metadata();
>      ParseContext context = new ParseContext();
>      // Tim Allison noted the following, thanks Tim!
>      // If you want Tika to parse embedded files (attachments within your 
> .doc or any other embedded
>      // files), you need to send in the autodetectparser in the parsecontext:
>      // context.set(Parser.class, autoParser);
>
>      InputStream input = new FileInputStream(file);
>
>      // Try parsing the file. Note we haven't checked at all to
>      // see whether this file is a good candidate.
>      try {
>        autoParser.parse(input, textHandler, metadata, context);
>      } catch (Exception e) {
>        // Needs better logging of what went wrong in order to
>        // track down "bad" documents.
>        System.out.println(String.format("File %s failed", 
> file.getCanonicalPath()));
>        e.printStackTrace();
>        continue;
>      }
>      // Just to show how much meta-data and what form it's in.
>      dumpMetadata(file.getCanonicalPath(), metadata);
>
>      // Index just a couple of the meta-data fields.
>      SolrInputDocument doc = new SolrInputDocument();
>
>      doc.addField("id", file.getCanonicalPath());
>
>      // Crude way to get known meta-data fields.
>      // Also possible to write a simple loop to examine all the
>      // metadata returned and selectively index it and/or
>      // just get a list of them.
>      // One can also use the Lucidworks field mapping to
>      // accomplish much the same thing.
>      String author = metadata.get("Author");
>
>      /*
>      if (author != null) {
>        doc.addField("author", author);
>      }
>      */
>
>      doc.addField("_text_", textHandler.toString());
>      //doc.addField("meta", metadata.get("Last_Modified"));
>      docList.add(doc);
>      ++totalTika;
>
>      // Completely arbitrary, just batch up more than one document
>      // for throughput!
>      if ( docList.size() >= 1000) {
>        // Commit within 5 minutes.
>        UpdateResponse resp = client.add(docList, 300000);
>        if (resp.getStatus() != 0) {
>        System.out.println("Some horrible error has occurred, status is: " +
>              resp.getStatus());
>        }
>        docList.clear();
>      }
>    }
>  }
>
>  // Just to show all the metadata that's available.
>  private void dumpMetadata(String fileName, Metadata metadata) {
>    System.out.println("Dumping metadata for file: " + fileName);
>    for (String name : metadata.names()) {
>      System.out.println(name + ":" + metadata.get(name));
>    }
>    System.out.println("........xxxxxxxxxxxxxxxxxxxxxxxxx..........");
>  }
> }
>
>
> Also, I am attaching the solrconfig.xml & managed-schema.xml for my 
> collection. Please look at them & suggest where I am going wrong.
> I can't even see the _text_ field in the query result, even though the 
> stored parameter is true.
> Any help would really be appreciated.
> Thanks !
>
> -----Original Message-----
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: 28 August 2019 14:18
> To: solr-user@lucene.apache.org
> Subject: Re: Require searching only for file content and not metadata
>
> On 8/27/2019 7:18 AM, Khare, Kushal (MIND) wrote:
>> Basically, what problem I am facing is - I am getting the textual content + 
>> other metadata in my _text_ field. But, I want only the textual content 
>> written inside the document.
>> I tried various Request Handler Update Extract configurations, but none of 
>> them worked for me.
>> Please help me resolve this as I am badly stuck in this.
>
> Controlling exactly what gets indexed in which fields is likely going to 
> require that you write the indexing software yourself -- a program that 
> extracts the data you want and sends it to Solr for indexing.
>
> We do not recommend running the Extracting Request Handler in
> production
> -- Tika is known to crash when given some documents (usually PDF files are 
> the problematic ones, but other formats can cause it too), and if it crashes 
> while running inside Solr, it will take Solr down with it.
>
> Here is an example program that uses Tika for rich document parsing.  It also 
> talks to a database, but that part could be easily removed or modified:
>
> https://lucidworks.com/post/indexing-with-solrj/
>
> Thanks,
> Shawn
>
> ________________________________
>
> The information contained in this electronic message and any
> attachments to this message are intended for the exclusive use of the
> addressee(s) and may contain proprietary, confidential or privileged
> information. If you are not the intended recipient, you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately and destroy all copies of this message and any
> attachments. WARNING: Computer viruses can be transmitted via email.
> The recipient should check this email and any attachments for the
> presence of viruses. The company accepts no liability for any damage
> caused by any virus/trojan/worms/malicious code transmitted by this
> email. www.motherson.com
>

