Re: Indexing information on number of attachments and their names in EML file

2019-08-02 Thread Tim Allison
I'd strongly recommend rolling your own ingest code. See Erick's superb: https://lucidworks.com/post/indexing-with-solrj/ You can easily get attachments via the RecursiveParserWrapper, e.g.

Re: problem indexing GPS metadata for video upload

2019-05-10 Thread Tim Allison
Unfortunately, It Depends(TM)*...these are the steps I take: https://wiki.apache.org/tika/UpgradingTikaInSolr There can be version conflicts and other awful, unforeseen things if you don't get it right. We're on the cusp of the release for 1.21 (I mean it this time)...I'll upgrade Solr as soon

Re: problem indexing GPS metadata for video upload

2019-05-02 Thread Tim Allison
Sorry build #182: https://builds.apache.org/job/tika-branch-1x/ On Thu, May 2, 2019 at 12:01 PM Tim Allison wrote: > > I just pushed a fix for TIKA-2861. If you can either build locally or > wait a few hours for Jenkins to build #182, let me know if that works > with straight

Re: problem indexing GPS metadata for video upload

2019-05-02 Thread Tim Allison
I just pushed a fix for TIKA-2861. If you can either build locally or wait a few hours for Jenkins to build #182, let me know if that works with straight tika-app.jar. On Thu, May 2, 2019 at 5:00 AM Where is Where wrote: > > Thank you Alex and Tim. > I have looked at the solrconfig.xml file (I

Re: problem indexing GPS metadata for video upload

2019-05-01 Thread Tim Allison
Related? https://issues.apache.org/jira/plugins/servlet/mobile#issue/TIKA-2861 On Wed, May 1, 2019 at 8:09 AM Alexandre Rafalovitch wrote: > What happens when you run it against a standalone Tika (recommended option > anyway)? Do you see the relevant fields? > > Not every Tika field is

Re: SOLR Text Field

2019-04-06 Thread Tim Allison
TextField is a classname. Look in managed-schema and pick a field type by name, e.g. text_general On Sat, Apr 6, 2019 at 9:00 AM Dave Beckstrom wrote: > Hi Everyone, > > I'm really hating SOLR. All I want is to define a text field that data > can be indexed into and which is searchable.

Why is elevate not working when I convert a request to local parameters?

2019-03-22 Thread Tim Allison
Should probably send this one from an anonymous email... :( I can see from the results that elevate is working with this: select?=edismax=transcript=my_field However, elevate is not working with this: select?={!edismax%20v=transcript%20qf=my_field} This is Solr 4.x...y, I know... What am I

Re: Help with a DIH config file

2019-03-15 Thread Tim Allison
Haha, looks like Jörn just answered this... onError="skip|continue" >greatly preferable if the indexing process could ignore exceptions Please, no. I'm 100% behind the sentiment that DIH should gracefully handle Tika exceptions, but the better option is to log the exceptions, store the

Re: by: java.util.zip.DataFormatException: invalid distance too far back reported by Solr API

2019-02-05 Thread Tim Allison
>At the end of the day it would be a much better architecture to parse the > PDFs using plain standalone TikaServer +1 Also, note that we added a -spawnChild switch to tika-server that will run the server in a child process and kill+restart the child process if there is an infinite

TokenizerChain.getMultiTermAnalyzer().normalize() no longer normalizes multiterms in 8.x?!

2019-01-25 Thread Tim Allison
All, I don't know if this change was intended, but it feels like a bug to me... TokenFilterFactory[] filters = new TokenFilterFactory[2]; filters[0] = new LowerCaseFilterFactory(Collections.EMPTY_MAP); filters[1] = new ASCIIFoldingFilterFactory(Collections.EMPTY_MAP); TokenizerChain chain = new

Re: 8.0.0-SNAPSHOT snapshot repo poms broken?

2019-01-17 Thread Tim Allison
User error..please ignore. On Thu, Jan 17, 2019 at 4:36 PM Tim Allison wrote: > > All, > I recently tried to upgrade a project that relies on the snapshot > repos[1], but maven wasn't able to pull lucene-highlighter, > lucene-test-framework, lucene-memory, among a few o

8.0.0-SNAPSHOT snapshot repo poms broken?

2019-01-17 Thread Tim Allison
All, I recently tried to upgrade a project that relies on the snapshot repos[1], but maven wasn't able to pull lucene-highlighter, lucene-test-framework, lucene-memory, among a few others. However, maven was able to pull lucene-core and most other artifacts for 8.0.0-SNAPSHOT. I manually

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-17 Thread Tim Allison
Y, I tracked this down within Solr. This is a feature, not a bug. I found a solution (set {{captureAttr}} to {{true}}): https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263 Please, though, for

Re: Solr OCR Support

2018-11-02 Thread Tim Allison
g Nuance (or tesseract), I just wish to point out that > what to OCR is important, because OCR works well when it has good input. > > > -Original Message- > > From: Tim Allison > > Sent: Friday, November 2, 2018 11:03 AM > > To: solr-user@lucene.apache.org

Re: Solr OCR Support

2018-11-02 Thread Tim Allison
OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr! We have an open ticket to make it "just work", but we aren't there yet (TIKA-2749). You have to tell Tika how you want to process images from PDFs via the tika-config.xml file. You've seen this link in the links you mentioned:

Re: Tesseract language

2018-10-27 Thread Tim Allison
path-variables pointing to > > > "Tesseract-OCR/tessdata". > > > > > > Now Tesseract works with Danish language from the CMD, but now I can't > > > make the code work in Java, not even with default settings (which I > > > could before). Am I missing somethin

Re: Tesseract language

2018-10-26 Thread Tim Allison
Tika relies on you to install tesseract and all the language libraries you'll need. If you can successfully call `tesseract testing/eurotext.png testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan" with your code above. On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
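
A quick way to check whether the JVM sees the same tesseract install as the command prompt is to launch that exact command from Java. This is only a minimal sketch using the plain JDK `ProcessBuilder`, not Tika's own OCR code; the class name `TesseractCheck` and its methods are made up for illustration, and the paths are the ones from the command quoted above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): runs tesseract the way the JVM sees it.
public class TesseractCheck {

    // Builds the command line quoted above: tesseract <image> <outputBase> -l <lang>
    public static List<String> command(String image, String outputBase, String lang) {
        return Arrays.asList("tesseract", image, outputBase, "-l", lang);
    }

    // Starts tesseract with the JVM's own environment and returns its exit code.
    public static int run(String image, String outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image, outputBase, lang))
                .inheritIO() // show tesseract's own messages on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            int exit = run("testing/eurotext.png", "testing/eurotext-dan", "dan");
            System.out.println("tesseract exit code: " + exit);
        } catch (IOException e) {
            // Typically means the executable is not on the PATH visible to this JVM.
            System.out.println("could not start tesseract: " + e.getMessage());
        }
    }
}
```

If this fails from Java but the same command works from the CMD, the difference is most likely in the environment (PATH or the tessdata location) that the Java process inherits.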

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison
r > > Hi Tim, > > It is msg files and I added tika-app-1.14.jar to the build path - and now > it works. But how do I get it to read the attachments as well? > > -Original Message- > From: Tim Allison > Sent: 25. oktober 2018 21:57 > To: solr-user@lucene.ap

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison
how do I get it to read the attachments as well? > > -Original Message- > From: Tim Allison > Sent: 25. oktober 2018 21:57 > To: solr-user@lucene.apache.org > Subject: Re: Reading data using Tika to Solr > > If you’re processing actual msg (not eml), you’l

Re: Reading data using Tika to Solr

2018-10-25 Thread Tim Allison
> Tika-parsers-1.4.jar > Tika-core-1.4.jar > Commons-io-2.5.jar > Httpclient-4.5.3 > Httpcore-4.4.6.jar > Httpmime-4.5.3.jar > Slf4j-api1-7-24.jar > Jcl-over--slf4j-1.7.24.jar > Solr-cell-7.5.0.jar > Solr-core-7.5.0.jar > Solr-solrj-7.5.0.jar > Noggit-0.8.jar

Re: Reading data using Tika to Solr

2018-10-25 Thread Tim Allison
To follow up w Erick’s point, there are a bunch of transitive dependencies from tika-parsers. If you aren’t using maven or similar build system to grab the dependencies, it can be tricky to get it right. If you aren’t using maven, and you can afford the risks of jar hell, consider using tika-app

Re: Encoding issue in solr

2018-10-05 Thread Tim Allison
This is probably caused by an encoding detection problem in Nutch and/or Tika. If you can share the file on the Tika user’s list, I can take a look. On Fri, Oct 5, 2018 at 7:11 AM UMA MAHESWAR wrote: > HI ALL, > > while i am using nutch for crawling and indexing in to solr,while storing > data
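
For anyone unsure what an encoding detection problem looks like in the index, here is a small illustration with the plain JDK. It does not use Nutch's or Tika's detectors; `MojibakeSketch` and `decodeAs` are hypothetical names, and the "guessed" charset simply stands in for whatever a detector might pick.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: the same bytes decoded with a right and a wrong charset.
public class MojibakeSketch {

    // Decodes raw bytes with whatever charset the detector guessed.
    public static String decodeAs(byte[] raw, Charset guessed) {
        return new String(raw, guessed);
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute (U+00E9), stored as UTF-8 on disk.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Correct guess: the text round-trips unchanged.
        System.out.println(decodeAs(raw, StandardCharsets.UTF_8));

        // Wrong guess: the two UTF-8 bytes of e-acute become two separate characters.
        System.out.println(decodeAs(raw, StandardCharsets.ISO_8859_1));
    }
}
```

Once the wrong string has been indexed, Solr cannot repair it, which is why the fix has to happen at detection time and why a sample file is needed to diagnose it.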

Re: solr and diversification

2018-09-28 Thread Tim Allison
If you haven’t already, might want to check out maximal marginal relevance...original paper: Carbonell and Goldstein. On Thu, Sep 27, 2018 at 7:29 PM Joel Bernstein wrote: > Yeah, I think your plan sounds fine. > > Do you have a specific use case for diversity of results. I've been > wondering

Re: Memory Leak in 7.3 to 7.4

2018-08-06 Thread Tim Allison
+1 to Shawn's and Erick's points about isolating Tika in a separate jvm. Y, please do let us know: u...@tika.apache.org We might be able to help out, and you, in turn, can help the community figure out what's going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703 On Sun, Aug 5,

Re: Index protected zip

2018-05-29 Thread Tim Allison
l" place but the real story is in another > > place, > > > one we alternately tell people to sometimes ignore but sometimes keep > up > > to > > > date? Even I'm confused. > > > > > > On Sat, May 26, 2018 at 6:41 PM Erick Erickson < > ericke

Re: Index protected zip

2018-05-26 Thread Tim Allison
W00t! Thank you, Shawn! The "don't use ERH in production" response comes up frequently enough > that I have created a wiki page we can use for responses: > > https://wiki.apache.org/solr/RecommendCustomIndexingWithTika > > Tim, you are extremely well-qualified to expand and correct this page. >

Re: simple enrich uploaded binary documents with sha256 hashes

2018-05-26 Thread Tim Allison
+1 as always to Erick’s advice. DIH is only a PoC. We do have a DigestingParser in Tika, and when you combine that w the RecursiveParserWrapper, you can get digests not only of the main file but also on all embedded files/attachments...which can be pretty neat for some use cases. Operators are
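
If all that is needed is a SHA-256 of the main file, client-side ingest code can compute it with the JDK alone before sending the document to Solr. The sketch below is not Tika's DigestingParser and does not cover embedded files or attachments; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (plain JDK, not Tika): SHA-256 of a file's bytes as lowercase hex.
public class Sha256Sketch {

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Digest of an in-memory byte array.
    public static String sha256Hex(byte[] data) {
        return toHex(newDigest().digest(data));
    }

    // Digest of a stream, read in chunks so large files need not fit in memory.
    public static String sha256Hex(InputStream in) throws IOException {
        MessageDigest digest = newDigest();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        return toHex(digest.digest());
    }
}
```

The resulting hex string can then be added as an ordinary field on the Solr document by the ingest code.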

Re: Index protected zip

2018-05-26 Thread Tim Allison
...@mail.gmail.com%3e On Sat, May 26, 2018 at 6:34 AM Tim Allison <talli...@apache.org> wrote: > You’ll need to provide a PasswordProvider in the ParseContext. I don’t > think that is currently possible in the Solr integration. Please open a > ticket if SolrJ doesn’t meet your needs.

Re: Index protected zip

2018-05-26 Thread Tim Allison
You’ll need to provide a PasswordProvider in the ParseContext. I don’t think that is currently possible in the Solr integration. Please open a ticket if SolrJ doesn’t meet your needs. On Thu, May 24, 2018 at 1:03 PM Alexandre Rafalovitch wrote: > Hmm. If it works, then it