Re: Indexing information on number of attachments and their names in EML file

2019-08-02 Thread Tim Allison
I'd strongly recommend rolling your own ingest code.  See Erick's
superb post: https://lucidworks.com/post/indexing-with-solrj/

You can easily get attachments via the RecursiveParserWrapper, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java#L351

This will return a list of Metadata objects; the first one will be the
main/container document, and each subsequent entry will be an attachment.
Let us know if you have any questions/surprises.  There are a couple of
todos for .eml...
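
In case it's useful, here's a bare-bones, untested sketch of that approach
(class names assume Tika 1.19+; "message.eml" and the printed labels are just
placeholders -- you'd push those values into your SolrInputDocument in your
own SolrJ ingest code):

import java.io.InputStream;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

public class EmlAttachmentLister {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        try (InputStream is = TikaInputStream.get(Paths.get("message.eml"))) {
            wrapper.parse(is, handler, new Metadata(), new ParseContext());
        }
        List<Metadata> metadataList = handler.getMetadataList();
        // entry 0 is the container email; everything after it is an attachment
        System.out.println("attachment_count=" + (metadataList.size() - 1));
        for (int i = 1; i < metadataList.size(); i++) {
            // "resourceName" holds the embedded file name when the parser can recover it
            System.out.println("attachment_name=" + metadataList.get(i).get(Metadata.RESOURCE_NAME_KEY));
        }
    }
}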

On Fri, Aug 2, 2019 at 3:43 AM Jan Høydahl  wrote:
>
> Try the Apache Tika mailing list.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 2 Aug 2019 at 05:01, Zheng Lin Edwin Yeo wrote:
> >
> > Hi,
> >
> > Does anyone knows if this can be done on the Solr side?
> > Or it has to be done on the Tika side?
> >
> > Regards,
> > Edwin
> >
> > On Thu, 1 Aug 2019 at 09:38, Zheng Lin Edwin Yeo 
> > wrote:
> >
> >> Hi,
> >>
> >> Would like to check: is there any way we can detect the number of
> >> attachments and their names during indexing of EML files in Solr, and index
> >> that information into Solr?
> >>
> >> Currently, Solr is able to use Tika and Tesseract OCR to extract the
> >> contents of the attachments. However, I could not find the information
> >> about the number of attachments in the EML file and what their
> >> filenames are.
> >>
> >> I am using Solr 7.6.0 in production, and also trying out on the new Solr
> >> 8.2.0.
> >>
> >> Regards,
> >> Edwin
> >>
>


Re: problem indexing GPS metadata for video upload

2019-05-10 Thread Tim Allison
Unfortunately, It Depends(TM)*...these are the steps I take:
https://wiki.apache.org/tika/UpgradingTikaInSolr

There can be version conflicts and other awful, unforeseen things if
you don't get it right.

We're on the cusp of the release for 1.21 (I mean it this time)...I'll
upgrade Solr as soon as Tika is out (I also mean it this time).


*TM by Erick Erickson

On Fri, May 3, 2019 at 3:44 AM Where is Where  wrote:
>
> Thank you very much Tim, I wonder how to make the Tika change apply to
> Solr? I saw Tika core, parse and xml jar files tika-core.jar
> tika-parsers.jar tika-xml.jar in solr contrib/extraction/lib folder. Do we
> just  replace these files? Thanks!
>
> On Thu, May 2, 2019 at 12:16 PM Where is Where  wrote:
>
> > Thank you Alex and Tim.
> > I have looked at the solrconfig.xml file (I am trying the techproducts
> > demo config), the only related place I can find is the extract handler
> >
> > <requestHandler name="/update/extract"
> >                 startup="lazy"
> >                 class="solr.extraction.ExtractingRequestHandler" >
> >   <lst name="defaults">
> >     <str name="lowernames">true</str>
> >     <str name="uprefix">ignored_</str>
> >
> >     <!-- capture link hrefs but ignore div attributes -->
> >     <str name="captureAttr">true</str>
> >     <str name="fmap.a">links</str>
> >     <str name="fmap.div">ignored_</str>
> >   </lst>
> > </requestHandler>
> >
> > I am using this command bin/post -c techproducts
> > example/exampledocs/1.mp4 -params "literal.id=mp4_1&uprefix=attr_"
> >
> > I have tried commenting out ignored_ and
> > changing to div
> > but still not working. I don't quite get why image is getting gps etc
> > metadata but video is acting differently while it is using the same
> > solrconfig and the gps metadata are in the same fields. There is no
> > differentiation in solrconfig setting between image and video.
> >
> > Tim yes this is related to the TIKA link. Thank you!
> >
> > Here is the output in solr for mp4.
> >
> > {
> > "attr_meta":["stream_size",
> >   "5721559",
> >   "date",
> >   "2019-03-29T04:36:39Z",
> >   "X-Parsed-By",
> >   "org.apache.tika.parser.DefaultParser",
> >   "X-Parsed-By",
> >   "org.apache.tika.parser.mp4.MP4Parser",
> >   "stream_content_type",
> >   "application/octet-stream",
> >   "meta:creation-date",
> >   "2019-03-29T04:36:39Z",
> >   "Creation-Date",
> >   "2019-03-29T04:36:39Z",
> >   "tiff:ImageLength",
> >   "1080",
> >   "resourceName",
> >   "/Volumes/Data/inData/App/solr/example/exampledocs/1.mp4",
> >   "dcterms:created",
> >   "2019-03-29T04:36:39Z",
> >   "dcterms:modified",
> >   "2019-03-29T04:36:39Z",
> >   "Last-Modified",
> >   "2019-03-29T04:36:39Z",
> >   "Last-Save-Date",
> >   "2019-03-29T04:36:39Z",
> >   "xmpDM:audioSampleRate",
> >   "1000",
> >   "meta:save-date",
> >   "2019-03-29T04:36:39Z",
> >   "modified",
> >   "2019-03-29T04:36:39Z",
> >   "tiff:ImageWidth",
> >   "1920",
> >   "xmpDM:duration",
> >   "2.64",
> >   "Content-Type",
> >   "video/mp4"],
> > "id":"mp4_4",
> > "attr_stream_size":["5721559"],
> > "attr_date":["2019-03-29T04:36:39Z"],
> > "attr_x_parsed_by":["org.apache.tika.parser.DefaultParser",
> >   "org.apache.tika.parser.mp4.MP4Parser"],
> > "attr_stream_content_type":["application/octet-stream"],
> > "attr_meta_creation_date":["2019-03-29T04:36:39Z"],
> > "attr_creation_date":["2019-03-29T04:36:39Z"],
> > "attr_tiff_imagelength":["1080"],
> > 
> > "resourcename":"/Volumes/Data/inData/App/solr/example/exampledocs/1.mp4",
> > "attr_dcterms_created":["2019-03-29T04:36:39Z"],
> > "attr_dcterms_modified":["2019-03-29T04:36:39Z"],
> > "last_modified":"2019-03-29T04:36:39Z",
> > "attr_last_save_date":["2019-03-29T04:36:39Z"],
> > "attr_xmpdm_audiosamplerate":["1000"],
> > "attr_meta_save_date":["2019-03-29T04:36:39Z"],
> > "attr_modified":["2019-03-29T04:36:39Z"],
> > "attr_tiff_imagewidth":["1920"],
> > "attr_xmpdm_duration":["2.64"],
> > "content_type":["video/mp4"],
> > "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  
> > \n  \n  \n  \n  \n  \n  \n  \n \n   "],
> > "_version_":1632383499325407232}]
> >   }}
> >
> > JPEG is getting these:
> > "attr_meta":[
> > "GPS Latitude",
> >   "37° 47' 41.99\"",
> > 
> > "attr_gps_latitude":["37° 47' 41.99\""],
> >
> >
> > On Wed, May 1, 2019 at 2:57 PM Where is Where  wrote:
> >
> >> uploading video to solr via tika
> >> https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html
> >> The index has no video GPS metadata which is extracted and indexed for
> >> images such as jpeg. I have checked both MP4 and MOV files, the files I
> >> checked all have GPS Exif data embedded in the same fields as image. Any
> >> idea? Thanks!
> >>
> >


Re: problem indexing GPS metadata for video upload

2019-05-02 Thread Tim Allison
Sorry build #182: https://builds.apache.org/job/tika-branch-1x/

On Thu, May 2, 2019 at 12:01 PM Tim Allison  wrote:
>
> I just pushed a fix for TIKA-2861.  If you can either build locally or
> wait a few hours for Jenkins to build #182, let me know if that works
> with straight tika-app.jar.
>
> On Thu, May 2, 2019 at 5:00 AM Where is Where  wrote:
> >
> > Thank you Alex and Tim.
> > I have looked at the solrconfig.xml file (I am trying the techproducts demo
> > config), the only related place I can find is the extract handler
> >
> > <requestHandler name="/update/extract"
> >                 startup="lazy"
> >                 class="solr.extraction.ExtractingRequestHandler" >
> >   <lst name="defaults">
> >     <str name="lowernames">true</str>
> >     <str name="uprefix">ignored_</str>
> >
> >     <!-- capture link hrefs but ignore div attributes -->
> >     <str name="captureAttr">true</str>
> >     <str name="fmap.a">links</str>
> >     <str name="fmap.div">ignored_</str>
> >   </lst>
> > </requestHandler>
> >
> > I am using this command bin/post -c techproducts example/exampledocs/1.mp4
> > -params "literal.id=mp4_1&uprefix=attr_"
> >
> > I have tried commenting out ignored_ and changing
> > to div
> > but still not working. I don't quite get why image is getting gps etc
> > metadata but video is acting differently while it is using the same
> > solrconfig and the gps metadata are in the same fields. There is no
> > differentiation in solrconfig setting between image and video.
> >
> > Tim yes this is related to the TIKA link. Thank you!
> >
> > Here is the output in solr for mp4.
> >
> > {
> > "attr_meta":["stream_size",
> >   "5721559",
> >   "date",
> >   "2019-03-29T04:36:39Z",
> >   "X-Parsed-By",
> >   "org.apache.tika.parser.DefaultParser",
> >   "X-Parsed-By",
> >   "org.apache.tika.parser.mp4.MP4Parser",
> >   "stream_content_type",
> >   "application/octet-stream",
> >   "meta:creation-date",
> >   "2019-03-29T04:36:39Z",
> >   "Creation-Date",
> >   "2019-03-29T04:36:39Z",
> >   "tiff:ImageLength",
> >   "1080",
> >   "resourceName",
> >   "/Volumes/Data/inData/App/solr/example/exampledocs/1.mp4",
> >   "dcterms:created",
> >   "2019-03-29T04:36:39Z",
> >   "dcterms:modified",
> >   "2019-03-29T04:36:39Z",
> >   "Last-Modified",
> >   "2019-03-29T04:36:39Z",
> >   "Last-Save-Date",
> >   "2019-03-29T04:36:39Z",
> >   "xmpDM:audioSampleRate",
> >   "1000",
> >   "meta:save-date",
> >   "2019-03-29T04:36:39Z",
> >   "modified",
> >   "2019-03-29T04:36:39Z",
> >   "tiff:ImageWidth",
> >   "1920",
> >   "xmpDM:duration",
> >   "2.64",
> >   "Content-Type",
> >   "video/mp4"],
> > "id":"mp4_4",
> > "attr_stream_size":["5721559"],
> > "attr_date":["2019-03-29T04:36:39Z"],
> > "attr_x_parsed_by":["org.apache.tika.parser.DefaultParser",
> >   "org.apache.tika.parser.mp4.MP4Parser"],
> > "attr_stream_content_type":["application/octet-stream"],
> > "attr_meta_creation_date":["2019-03-29T04:36:39Z"],
> > "attr_creation_date":["2019-03-29T04:36:39Z"],
> > "attr_tiff_imagelength":["1080"],
> > 
> > "resourcename":"/Volumes/Data/inData/App/solr/example/exampledocs/1.mp4",
> > "attr_dcterms_created":["2019-03-29T04:36:39Z"],
> > "attr_dcterms_modified":["2019-03-29T04:36:39Z"],
> > "last_modified":"2019-03-29T04:36:39Z",
> > "attr_last_save_date":["2019-03-29T04:36:39Z"],
> > "attr_xmpdm_audiosamplerate":["1000"],
> > "attr_meta_save_date":["2019-03-29T04:36:39Z"],
> > "attr_modified":["2019-03-29T04:36:39Z"],
> > "attr_tiff_imagewidth":["1920"],
> > "attr_xmpdm_duration":["2.64"],
> > "content_type":["video/mp4"],
> > "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
> >  \n  \n  \n  \n  \n  \n  \n  \n  \n \n   "],
> > "_version_":1632383499325407232}]
> >   }}
> >
> > JPEG is getting these:
> > "attr_meta":[
> > "GPS Latitude",
> >   "37° 47' 41.99\"",
> > 
> > "attr_gps_latitude":["37° 47' 41.99\""],
> >
> >
> > On Wed, May 1, 2019 at 2:57 PM Where is Where  wrote:
> >
> > > uploading video to solr via tika
> > > https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html
> > > The index has no video GPS metadata which is extracted and indexed for
> > > images such as jpeg. I have checked both MP4 and MOV files, the files I
> > > checked all have GPS Exif data embedded in the same fields as image. Any
> > > idea? Thanks!
> > >


Re: problem indexing GPS metadata for video upload

2019-05-02 Thread Tim Allison
I just pushed a fix for TIKA-2861.  If you can either build locally or
wait a few hours for Jenkins to build #182, let me know if that works
with straight tika-app.jar.

On Thu, May 2, 2019 at 5:00 AM Where is Where  wrote:
>
> Thank you Alex and Tim.
> I have looked at the solrconfig.xml file (I am trying the techproducts demo
> config), the only related place I can find is the extract handler
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>
>     <!-- capture link hrefs but ignore div attributes -->
>     <str name="captureAttr">true</str>
>     <str name="fmap.a">links</str>
>     <str name="fmap.div">ignored_</str>
>   </lst>
> </requestHandler>
>
> I am using this command bin/post -c techproducts example/exampledocs/1.mp4
> -params "literal.id=mp4_1&uprefix=attr_"
>
> I have tried commenting out ignored_ and changing
> to div
> but still not working. I don't quite get why image is getting gps etc
> metadata but video is acting differently while it is using the same
> solrconfig and the gps metadata are in the same fields. There is no
> differentiation in solrconfig setting between image and video.
>
> Tim yes this is related to the TIKA link. Thank you!
>
> Here is the output in solr for mp4.
>
> {
> "attr_meta":["stream_size",
>   "5721559",
>   "date",
>   "2019-03-29T04:36:39Z",
>   "X-Parsed-By",
>   "org.apache.tika.parser.DefaultParser",
>   "X-Parsed-By",
>   "org.apache.tika.parser.mp4.MP4Parser",
>   "stream_content_type",
>   "application/octet-stream",
>   "meta:creation-date",
>   "2019-03-29T04:36:39Z",
>   "Creation-Date",
>   "2019-03-29T04:36:39Z",
>   "tiff:ImageLength",
>   "1080",
>   "resourceName",
>   "/Volumes/Data/inData/App/solr/example/exampledocs/1.mp4",
>   "dcterms:created",
>   "2019-03-29T04:36:39Z",
>   "dcterms:modified",
>   "2019-03-29T04:36:39Z",
>   "Last-Modified",
>   "2019-03-29T04:36:39Z",
>   "Last-Save-Date",
>   "2019-03-29T04:36:39Z",
>   "xmpDM:audioSampleRate",
>   "1000",
>   "meta:save-date",
>   "2019-03-29T04:36:39Z",
>   "modified",
>   "2019-03-29T04:36:39Z",
>   "tiff:ImageWidth",
>   "1920",
>   "xmpDM:duration",
>   "2.64",
>   "Content-Type",
>   "video/mp4"],
> "id":"mp4_4",
> "attr_stream_size":["5721559"],
> "attr_date":["2019-03-29T04:36:39Z"],
> "attr_x_parsed_by":["org.apache.tika.parser.DefaultParser",
>   "org.apache.tika.parser.mp4.MP4Parser"],
> "attr_stream_content_type":["application/octet-stream"],
> "attr_meta_creation_date":["2019-03-29T04:36:39Z"],
> "attr_creation_date":["2019-03-29T04:36:39Z"],
> "attr_tiff_imagelength":["1080"],
> 
> "resourcename":"/Volumes/Data/inData/App/solr/example/exampledocs/1.mp4",
> "attr_dcterms_created":["2019-03-29T04:36:39Z"],
> "attr_dcterms_modified":["2019-03-29T04:36:39Z"],
> "last_modified":"2019-03-29T04:36:39Z",
> "attr_last_save_date":["2019-03-29T04:36:39Z"],
> "attr_xmpdm_audiosamplerate":["1000"],
> "attr_meta_save_date":["2019-03-29T04:36:39Z"],
> "attr_modified":["2019-03-29T04:36:39Z"],
> "attr_tiff_imagewidth":["1920"],
> "attr_xmpdm_duration":["2.64"],
> "content_type":["video/mp4"],
> "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>  \n  \n  \n  \n  \n  \n  \n  \n  \n \n   "],
> "_version_":1632383499325407232}]
>   }}
>
> JPEG is getting these:
> "attr_meta":[
> "GPS Latitude",
>   "37° 47' 41.99\"",
> 
> "attr_gps_latitude":["37° 47' 41.99\""],
>
>
> On Wed, May 1, 2019 at 2:57 PM Where is Where  wrote:
>
> > uploading video to solr via tika
> > https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html
> > The index has no video GPS metadata which is extracted and indexed for
> > images such as jpeg. I have checked both MP4 and MOV files, the files I
> > checked all have GPS Exif data embedded in the same fields as image. Any
> > idea? Thanks!
> >


Re: problem indexing GPS metadata for video upload

2019-05-01 Thread Tim Allison
Related?

https://issues.apache.org/jira/plugins/servlet/mobile#issue/TIKA-2861


On Wed, May 1, 2019 at 8:09 AM Alexandre Rafalovitch 
wrote:

> What happens when you run it against a standalone Tika (recommended option
> anyway)? Do you see the relevant fields?
>
> Not every Tika field is captured, that is configured in solrconfig.xml. So
> if Tika extracts them, next step is to check the mapping.
>
> Regards,
>  Alex
>
> On Wed, May 1, 2019, 5:38 AM Where is Where,  wrote:
>
> > uploading video to solr via tika
> >
> >
> https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html
> > The index has no video GPS metadata which is extracted and indexed for
> > images such as jpeg. I have checked both MP4 and MOV files, the files I
> > checked all have GPS Exif data embedded in the same fields as image. Any
> > idea? Thanks!
> >
>


Re: SOLR Text Field

2019-04-06 Thread Tim Allison
TextField is a class name. Look in managed-schema and pick a field type by
name, e.g. text_general.
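
If you'd rather script it than hand-edit managed-schema, a rough SolrJ Schema
API sketch (the core URL and field name below are placeholders lifted from
your error message, not something Solr ships with):

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddTextField {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/MyCore").build()) {
            Map<String, Object> field = new LinkedHashMap<>();
            field.put("name", "metadata.myfield");
            field.put("type", "text_general");   // a field *type* name from managed-schema, not the class name
            field.put("stored", true);
            field.put("indexed", true);
            new SchemaRequest.AddField(field).process(client);
        }
    }
}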

On Sat, Apr 6, 2019 at 9:00 AM Dave Beckstrom 
wrote:

> Hi Everyone,
>
> I'm really hating SOLR.   All I want is to define a text field that data
> can be indexed into and which is searchable.  Should be super simple.  But
> I run into issue after issue.  I'm running SOLR 7.3 because it's compatible
> with the version of NUTCH I'm running.
>
> The docs say that SOLR ships with a default TextField but that seems to be
> wrong.  I define:
>
>
> <field name="metadata.myfield" type="TextField" stored="true"
>  indexed="true"/>
>
> The above throws error  "Unable to create core [MyCore] Caused by: Unknown
> fieldType 'TextField' specified on field metadata.myfield"
>
> Then I try:
>
> 
>
> Same error.
>
> Then as a workaround I got into defining a "Text_General" field because I
> couldn't get Text to work.  Text_General extends the Text field which seems
> to indicate there should be a text field built into SOLR!
>
> Text_General causes a new set of problems.   How does one go about using
> the supposed default text field available in SOLR?
>
> When I defined Text_General:
>
> <updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory"
>   name="add-schema-fields">
>   <lst name="typeMapping">
>     <str name="valueClass">java.lang.String</str>
>     <str name="fieldType">text_general</str>
>     <bool name="default">true</bool>
>   </lst>
> </updateProcessor>
>
> Text_General with type=string complains when I try and insert data that has
> characters and numbers:
>
> java.lang.Exception:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at http://127.0.0.1:/solr/MyCore: ERROR: [doc=
> http://xxx.xxx.com/services/mydocument.htm] Error adding field
> 'metatag.myfield'='15c0188' msg=For input string: "15c0188"
>
> I'm very frustrated.  If anyone is able to help sort this out I would
> really appreciate it!  What do I need to do to be able to define a simple
> text field that is stored and searchable?
>
> Thank you!
>
> --
> *Fig Leaf Software, Inc.*
> https://www.figleaf.com/
> 
>
> Full-Service Solutions Integrator
>
>
>
>
>
>
>


Why is elevate not working when I convert a request to local parameters?

2019-03-22 Thread Tim Allison
Should probably send this one from an anonymous email... :(

I can see from the results that elevate is working with this:

select?defType=edismax&q=transcript&qf=my_field

However, elevate is not working with this:

select?q={!edismax%20v=transcript%20qf=my_field}

This is Solr 4.x...y, I know...

What am I doing wrong?  How can I fix this?

Thank you.

Best,

 Tim


Re: Help with a DIH config file

2019-03-15 Thread Tim Allison
Haha, looks like Jörn just answered this... onError="skip|continue"

>greatly preferable if the indexing process could ignore exceptions
Please, no.  I'm 100% behind the sentiment that DIH should gracefully
handle Tika exceptions, but the better option is to log the
exceptions, store the stacktraces and report your high priority
problems to Apache Tika and/or its dependencies so that we can fix
them.  Try running tika-eval[0] against a subset of your docs,
perhaps.

That said, DIH's integration with Tika is not intended for robust
production use.  It is intended to get people up to speed quickly and,
effectively, for demo purposes.  I recognize that it is being used in
production around the world, but it really shouldn't be.

See Erick Erickson's[1]:
>But, i wouldn’t really recommend that you just ship the docs to Solr, I’d 
>recommend that you build a little program to do the extraction on one or more 
>clients, the details of why are here:

>https://lucidworks.com/2012/02/14/indexing-with-solrj/

[0] https://wiki.apache.org/tika/TikaEval
[1] 
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201903.mbox/ajax/%3CF2034803-D4A8-48E1-889A-DA9E44961EE6%40gmail.com%3E

On Fri, Mar 15, 2019 at 7:44 AM Demian Katz  wrote:
>
> Jörn (and anyone else with more experience with this than I have),
>
> I've been working with Whitney on this issue. It is a PDF file, and it can be
> opened successfully in a PDF reader. Interestingly, if I try to extract data 
> from it on the command line, Tika version 1.3 throws a lot of warnings but 
> does successfully extract data, but several newer versions, including 1.17 
> and 1.20 (haven't tested other intermediate versions) encounter a fatal error 
> and extract nothing. So this seems like something that used to work but has 
> stopped. Unfortunately, we haven't been able to find a way to downgrade to an 
> old enough Tika in her Solr installation to work around the problem that way.
>
> The bigger question, though, is whether there's a way to allow the DIH to 
> simply ignore errors and keep going. Whitney needs to index several terabytes 
> of arbitrary documents for her project, and at this scale, she can't afford 
> the time to stop and manually intervene for every strange document that 
> happens to be in the collection. It would be greatly preferable if the 
> indexing process could ignore exceptions and proceed on than if it just stops 
> dead at the first problem. (I'm also pretty sure that Whitney is already 
> using the ignoreTikaException attribute in her configuration, but it doesn't 
> seem to help in this instance).
>
> Any suggestions would be greatly appreciated!
>
> thanks,
> Demian
>
> -Original Message-
> From: Jörn Franke 
> Sent: Friday, March 15, 2019 4:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Help with a DIH config file
>
> Do you have an exception?
> It could be that the pdf is broken - can you open it on your computer with a 
> pdfreader?
>
> If the exception is related to Tika and pdf then file an issue with the 
> pdfbox project. If there is an issue with Tika and MsOffice documents then 
> Apache poi is the right project to ask.
>
> > On 15.03.2019 at 03:41, wclarke wrote:
> >
> > Thank you so much.  You helped a great deal.  I am running into one
> > last issue where the Tika DIH is stopping at a specific language and
> > fails there (Malayalam).  Do you know of a work around?
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: by: java.util.zip.DataFormatException: invalid distance too far back reported by Solr API

2019-02-05 Thread Tim Allison
>At the end of the day it would be a much better architecture to parse the
> PDFs using plain standalone TikaServer

+1

Also, note that we added a -spawnChild switch to tika-server that will
run the server in a child process and kill+restart the child process
if there is an infinite loop/oom/segfault, etc.  Your client will need
to handle tika-server being down for a second or two during restarts
and/or 503 while shutting down.
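
For what it's worth, a rough sketch of that client-side handling (the
endpoint and port are tika-server defaults; the retry/backoff numbers are
arbitrary):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaServerClient {
    // PUT the file to tika-server's /tika endpoint, retrying while the
    // child process is being restarted (connection refused or HTTP 503).
    static String extract(Path file, int maxAttempts) throws IOException, InterruptedException {
        byte[] body = Files.readAllBytes(file);
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL("http://localhost:9998/tika").openConnection();
            try {
                conn.setRequestMethod("PUT");
                conn.setDoOutput(true);
                conn.setRequestProperty("Accept", "text/plain");
                try (OutputStream os = conn.getOutputStream()) {
                    os.write(body);
                }
                if (conn.getResponseCode() == 200) {
                    try (InputStream is = conn.getInputStream()) {
                        return new String(is.readAllBytes(), StandardCharsets.UTF_8);
                    }
                }
                // anything else (e.g. 503 while shutting down) falls through to a retry
            } catch (IOException e) {
                // server briefly unavailable while -spawnChild restarts the child
            } finally {
                conn.disconnect();
            }
            Thread.sleep(1000L * attempt);
        }
        throw new IOException("tika-server unavailable after " + maxAttempts + " attempts: " + file);
    }
}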

>In fact, moving the parsing to the client solved the problem!
Yay!

On Mon, Feb 4, 2019 at 2:01 PM Monique Monteiro
 wrote:
>
> Hi all,
>
> In fact, moving the parsing to the client solved the problem!
>
> Thanks!
> Monique
>
> On Thu, Jan 31, 2019 at 8:25 AM Jan Høydahl  wrote:
>
> > Hi
> >
> > This is Apache Tika that cannot parse a zip file or possibly a zip
> > formatted office file.
> > You have to post the full stack trace (which you'll find in the solr.log
> > on server side)
> > if you want help in locating the source of the issue, you may be able to
> > configure Tika
> >
> > Have you tried to specify ignoreTikaException=true on the request? See
> > https://lucene.apache.org/solr/guide/7_6/uploading-data-with-solr-cell-using-apache-tika.html
> >
> > At the end of the day it would be a much better architecture to parse the
> > PDFs using plain standalone TikaServer and then construct a Solr Document
> > in your Python code which is then posted to Solr. Reason is you have much
> > better control over parse errors and how to map metadata to your schema
> > fields. Also you don't want to overload Solr with all this work, it can
> > even crash the whole Solr server if some parser crashes or gets stuck in an
> > infinite loop.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> > > On 30 Jan 2019 at 20:49, Monique Monteiro wrote:
> > >
> > > Hi all,
> > >
> > > I'm writing a Python routine to upload thousands of PDF files to Solr,
> > and
> > > after trying to upload some files, Solr reports the following error in a
> > > HTTP 500 response:
> > >
> > > "by: java.util.zip.DataFormatException: invalid distance too far back"
> > >
> > > Does anyone have any idea about how to overcome this?
> > >
> > > Thanks in advance,
> > > Monique Monteiro
> >
> >
>
> --
> Monique Monteiro
> Twitter: http://twitter.com/monilouise


TokenizerChain.getMultiTermAnalyzer().normalize() no longer normalizes multiterms in 8.x?!

2019-01-25 Thread Tim Allison
All,
  I don't know if this change was intended, but it feels like a bug to me...

TokenFilterFactory[] filters = new TokenFilterFactory[2];
filters[0] = new LowerCaseFilterFactory(Collections.EMPTY_MAP);
filters[1] = new ASCIIFoldingFilterFactory(Collections.EMPTY_MAP);
TokenizerChain chain = new TokenizerChain(
        new MockTokenizerFactory(Collections.EMPTY_MAP), filters);
System.out.println("NORMALIZE: " +
        chain.normalize("text0", "f\u00F6\u00F6Ba").utf8ToString());
System.out.println("NORMALIZE with multiterm: " +
        chain.getMultiTermAnalyzer().normalize("text0", "f\u00F6\u00F6Ba").utf8ToString());

output:
NORMALIZE: fooba
NORMALIZE with multiterm: fööBa

If this is a bug and not the desired behavior, the source of the
problem is that in TokenizerChain's getMultiTermAnalyzer(), there's no
override of #normalize(String fieldName, TokenStream ts)...which means
that the multiTermAnalyzer returned by TokenizerChain doesn't actually
work to normalize multiterms!

If this is a bug, I'll open a ticket.
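
For reference, this is roughly the kind of override I'd expect in the
Analyzer returned by getMultiTermAnalyzer() -- only a sketch against the 8.x
factory API ('filters' here stands for TokenizerChain's filter array), not an
actual patch:

Analyzer multiTerm = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KeywordTokenizer();
        return new TokenStreamComponents(tokenizer, tokenizer);
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream result = in;
        for (TokenFilterFactory filter : filters) {
            // TokenFilterFactory#normalize is a no-op by default and is overridden
            // by the multi-term-aware factories (lowercase, ascii folding, ...)
            result = filter.normalize(result);
        }
        return result;
    }
};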


Re: 8.0.0-SNAPSHOT snapshot repo poms broken?

2019-01-17 Thread Tim Allison
User error...please ignore.

On Thu, Jan 17, 2019 at 4:36 PM Tim Allison  wrote:
>
> All,
>   I recently tried to upgrade a project that relies on the snapshot
> repos[1], but maven wasn't able to pull lucene-highlighter,
> lucene-test-framework, lucene-memory, among a few others.  However,
> maven was able to pull lucene-core and most other artifacts for
> 8.0.0-SNAPSHOT.  I manually checked that the jars and poms for the
> artifacts that maven wasn't able to pull were in fact there.
>   Is this user error or something wrong with the poms or something else?
>
>Thank you.
>
>  Best,
>
>Tim
>
>
> [1]
> <repositories>
>   <repository>
>     <id>apache-snapshot</id>
>     <url>http://repository.apache.org/snapshots/</url>
>   </repository>
> </repositories>


8.0.0-SNAPSHOT snapshot repo poms broken?

2019-01-17 Thread Tim Allison
All,
  I recently tried to upgrade a project that relies on the snapshot
repos[1], but maven wasn't able to pull lucene-highlighter,
lucene-test-framework, lucene-memory, among a few others.  However,
maven was able to pull lucene-core and most other artifacts for
8.0.0-SNAPSHOT.  I manually checked that the jars and poms for the
artifacts that maven wasn't able to pull were in fact there.
  Is this user error or something wrong with the poms or something else?

   Thank you.

 Best,

   Tim


[1]
<repositories>
  <repository>
    <id>apache-snapshot</id>
    <url>http://repository.apache.org/snapshots/</url>
  </repository>
</repositories>


Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-17 Thread Tim Allison
Y, I tracked this down within Solr.  This is a feature, not a bug.  I
found a solution (set {{captureAttr}} to {{true}}):
https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263

Please, though, for the sake of Solr, run Tika outside of Solr
in production (e.g. SolrJ...see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/)
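
The SolrJ side of that is small -- a hedged sketch (URL, collection, and
field names below are placeholders):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexExtractedText {
    public static void main(String[] args) throws Exception {
        // 'plainTextBody' would come from running Tika yourself (e.g. on the
        // text/plain part of the .eml) instead of posting the raw file to Solr
        String plainTextBody = "...extracted with Tika outside Solr...";
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/emails").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "message-001");
            doc.addField("content", plainTextBody);
            client.add(doc);
            client.commit();
        }
    }
}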

On Thu, Jan 17, 2019 at 2:15 AM Zheng Lin Edwin Yeo
 wrote:
>
> Based on the discussion in Tika and also on the Jira (TIKA-2814), it was
> said that the issue could be with Solr's ExtractingRequestHandler, in
> which the HTMLParser is either not being applied, or is somehow not
> stripping the content of style elements. Straight Tika app is able to do
> the right thing.
>
> Regards,
> Edwin
>
> On Tue, 15 Jan 2019 at 10:56, Zheng Lin Edwin Yeo 
> wrote:
>
> > Hi Alex,
> >
> > Thanks for the suggestions.
> > Yes, I have posted it in the Tika mailing list too.
> >
> > Regards,
> > Edwin
> >
> > On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch 
> > wrote:
> >
> >> I think asking this question on Tika mailing list may give you better
> >> answers. Then, if the conclusion is that the behavior is configurable,
> >> you can see how to do it in Solr. It may be however, that you need to
> >> do the parsing outside of Solr with standalone Tika. Standalone Tika
> >> is a production advice anyway.
> >>
> >> I would suggest the title be something like "How to prefer plain/text
> >> part of an email message when parsing .eml files".
> >>
> >> Regards,
> >>   Alex.
> >>
> >> On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo 
> >> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I have uploaded a sample EML file here:
> >> >
> >> https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
> >> >
> >> > This is what is indexed in the content:
> >> >
> >> > "content":"  font-size: 14pt; font-family: book antiqua,
> >> > palatino, serif;  Hi There,font-size: 14pt; font-family:
> >> > book antiqua, palatino, serif;  My client owns the domain name “
> >> > font-size: 14pt; color: #ff; font-family: arial black, sans-serif;
> >> >  TravelInsuranceEurope.com   font-size: 14pt; font-family: book
> >> > antiqua, palatino, serif;  ” and is considering putting it in market.
> >> > It is keyword rich domain with good search volume,adword bidding and
> >> > type-in-traffic.font-size: 14pt; font-family: book
> >> > antiqua, palatino, serif;  Based on our extensive study, we strongly
> >> > feel that you should consider buying this domain name to improve the
> >> > SEO, Online visibility, brand image, authority and type-in-traffic for
> >> > your business. We also do provide free 1 year hosting and unlimited
> >> > emails along with domain name.font-size: 14pt;
> >> > font-family: book antiqua, palatino, serif;  Besides this, if you need
> >> > any other domain name, web and app designing services and digital
> >> > marketing services (SEO, PPC and SMO) at reasonable charges, feel free
> >> > to contact us.font-size: 14pt; font-family: book antiqua,
> >> > palatino, serif;  Best Regards,font-size: 14pt;
> >> > font-family: book antiqua, palatino, serif;  Josh   ",
> >> >
> >> >
> >> > As you can see, this is taken from the Content-Type: text/html part.
> >> > However, the Content-Type: text/plain part looks clean, and that is what
> >> > we want to be indexed.
> >> >
> >> > How can we configure Tika in Solr to change the priority so that it gets
> >> > the content from Content-Type: text/plain instead of Content-Type:
> >> > text/html?
> >> >
> >> > On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo  >> >
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I am using Solr 7.5.0 with Tika 1.18.
> >> > >
> >> > > Currently I am facing a situation during the indexing of EML files,
> >> > > whereby the content is being extracted from the Content-type=text/html
> >> > > instead of Content-type=text/plain.
> >> > >
> >> > > The problem with Content-type=text/html is that it contains alot of
> >> words
> >> > > like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> >> > > these get indexed in Solr as well, which makes the content very
> >> cluttered,
> >> > > and it also affect the search, as when we search for words like
> >> "font", all
> >> > > the contents gets returned because of this.
> >> > >
> >> > > Would like to enquire on the following:
> >> > > 1. Why Tika didn't get the text part (text/plain). Is there any way to
> >> > > configure the Tika in Solr to change the priority to get the text part
> >> > > (text/plain) instead of html part (text/html).
> >> > > 2. If that is not possible, as you can see, the content is not clean,
> >> > > which is not right. How can we get this to be clean when Tika is
> >> extracting
> >> > > text?
> >> > >
> >> > > Regards,
> >> > > Edwin
> >> > >
> >>
> >


Re: Solr OCR Support

2018-11-02 Thread Tim Allison
+1 Thank you, Daniel.  If you have any interest in helping out on
TIKA-2749, please join the fun. :D
On Fri, Nov 2, 2018 at 12:12 PM Davis, Daniel (NIH/NLM) [C]
 wrote:
>
> I think that you also have to process a PDF pretty deeply to decide if you
> want it to be OCR'd.   I have worked on projects where all of the PDFs are
> really like faxes - images are encoded in JBIG2 black and white or similar,
> and there is really one image per page, and no text.   I have also worked on
> projects where it really is unstructured data, but if a PDF has one image per
> page and no text, it should be OCR'd.
>
> I've had problems, not with Tesseract, but even with Nuance OCR OEM 
> libraries, where text was missed because one image was the top of the 
> letters, and the image on the next line was the bottom half of the letters.   
> I don't mean to ding Nuance (or tesseract), I just wish to point out that 
> what to OCR is important, because OCR works well when it has good input.
>
> > -Original Message-
> > From: Tim Allison 
> > Sent: Friday, November 2, 2018 11:03 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr OCR Support
> >
> > OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr!  We
> > have an open ticket to make it "just work", but we aren't there yet
> > (TIKA-2749).
> >
> > You have to tell Tika how you want to process images from PDFs via the
> > tika-config.xml file.
> >
> > You've seen this link in the links you mentioned:
> > https://wiki.apache.org/tika/TikaOCR
> >
> > This one is key for PDFs:
> > https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
> > On Fri, Nov 2, 2018 at 10:30 AM Furkan KAMACI 
> > wrote:
> > >
> > > Hi All,
> > >
> > > I want to index images and pdf documents which have images into Solr. I
> > > test it with my Solr 6.3.0.
> > >
> > > I've installed tesseract at my computer (Mac). I verify that Tesseract
> > > works fine to extract text from an image.
> > >
> > > I index image into Solr but it has no content. However, as far as I know, 
> > > I
> > > don't need to do anything else to integrate Tesseract with Solr.
> > >
> > > I've checked these but they were not useful for me:
> > >
> > > http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
> > > http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html
> > >
> > > My question is, how can I support OCR with Solr?
> > >
> > > Kind Regards,
> > > Furkan KAMACI


Re: Solr OCR Support

2018-11-02 Thread Tim Allison
OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr!  We
have an open ticket to make it "just work", but we aren't there yet
(TIKA-2749).

You have to tell Tika how you want to process images from PDFs via the
tika-config.xml file.

You've seen this link in the links you mentioned:
https://wiki.apache.org/tika/TikaOCR

This one is key for PDFs:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
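
If you're driving Tika from your own code rather than through tika-config.xml,
roughly the same knobs are available on the ParseContext -- a sketch only
("eng" and the file name are placeholders, and tesseract itself has to be
installed and on the path):

import java.io.InputStream;
import java.nio.file.Paths;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrPdf {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);       // so embedded/inline images get parsed too

        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);  // hand the page images to the OCR parser
        context.set(PDFParserConfig.class, pdfConfig);

        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage("eng");            // whatever traineddata you have installed
        context.set(TesseractOCRConfig.class, ocrConfig);

        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream is = TikaInputStream.get(Paths.get("scanned.pdf"))) {
            parser.parse(is, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}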
On Fri, Nov 2, 2018 at 10:30 AM Furkan KAMACI  wrote:
>
> Hi All,
>
> I want to index images and pdf documents which have images into Solr. I
> test it with my Solr 6.3.0.
>
> I've installed tesseract at my computer (Mac). I verify that Tesseract
> works fine to extract text from an image.
>
> I index image into Solr but it has no content. However, as far as I know, I
> don't need to do anything else to integrate Tesseract with Solr.
>
> I've checked these but they were not useful for me:
>
> http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
> http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html
>
> My question is, how can I support OCR with Solr?
>
> Kind Regards,
> Furkan KAMACI


Re: Tesseract language

2018-10-27 Thread Tim Allison
Martin,
  Let’s move this over to user@tika.

Rohan,
  Is there something about Tika’s use of tesseract for image files that can
be improved?

Best,
   Tim

On Sat, Oct 27, 2018 at 3:40 AM Rohan Kasat  wrote:

> I used tess4j for image formats and Tika for scanned PDFs and images within
> PDFs.
>
> Regards,
> Rohan Kasat
>
> On Sat, Oct 27, 2018 at 12:39 AM Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi Rohan,
> >
> > Thanks for your reply, are you using tess4j with Tika or on its own?  I
> > will take a look at tess4j if I can't make it work with Tika alone.
> >
> > Best regards
> > Martin
> >
> >
> > -Original Message-
> > From: Rohan Kasat 
> > Sent: 26. oktober 2018 21:45
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tesseract language
> >
> > Hi Martin,
> >
> > Are you using it for image formats? I think you can try tess4j and
> > give TESSDATA_PREFIX as the home for the tesseract configs.
> >
> > I have tried it and it works pretty well on my local machine.
> >
> > I have used Java 8 and Tesseract 3 for the same.
> >
> > Regards,
> > Rohan Kasat
> >
> > On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) 
> > wrote:
> >
> > > Hi Tim,
> > >
> > > You were right.
> > >
> > > When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> > > dan`, I got an error message so I downloaded "dan.traineddata" and
> > > added it to the Tesseract-OCR/tessdata folder. Furthermore I added the
> > > 'TESSDATA_PREFIX' variable to the path-variables pointing to
> > > "Tesseract-OCR/tessdata".
> > >
> > > Now Tesseract works with Danish language from the CMD, but now I can't
> > > make the code work in Java, not even with default settings (which I
> > > could before). Am I missing something or just mixing some things up?
> > >
> > >
> > >
> > > -Original Message-
> > > From: Tim Allison 
> > > Sent: 26. oktober 2018 19:58
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Tesseract language
> > >
> > > Tika relies on you to install tesseract and all the language libraries
> > > you'll need.
> > >
> > > If you can successfully call `tesseract testing/eurotext.png
> > > testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> > > with your code above.
> > > On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> > > 
> > > wrote:
> > > >
> > > > Hi again,
> > > >
> > > > Now I moved the OCR part to Tika, but I still can't make it work
> > > > with
> > > Danish. It works when using default language settings and it seems
> > > like Tika is missing Danish dictionary.
> > > >
> > > > My java code looks like this:
> > > >
> > > > {
> > > > File file = new File(pathfilename);
> > > >
> > > > Metadata meta = new Metadata();
> > > >
> > > > InputStream stream = TikaInputStream.get(file);
> > > >
> > > > Parser parser = new AutoDetectParser();
> > > > BodyContentHandler handler = new
> > > > BodyContentHandler(Integer.MAX_VALUE);
> > > >
> > > > TesseractOCRConfig config = new TesseractOCRConfig();
> > > > config.setLanguage("dan"); // code works if this phrase
> > > > is
> > > commented out.
> > > >
> > > > ParseContext parseContext = new ParseContext();
> > > >
> > > >  parseContext.set(TesseractOCRConfig.class, config);
> > > >
> > > > parser.parse(stream, handler, meta, parseContext);
> > > > System.out.println(handler.toString());
> > > > }
> > > >
> > > > Hope that someone can help here.
> > > >
> > > > -Original Message-
> > > > From: Martin Frank Hansen (MHQ) 
> > > > Sent: 22. oktober 2018 07:58
> > > > To: solr-user@lucene.apache.org
> > > > Subject: SV: Tesseract language
> > > >
> > > > Hi Erick,
> > > >

Re: Tesseract language

2018-10-26 Thread Tim Allison
Tika relies on you to install tesseract and all the language libraries
you'll need.

If you can successfully call `tesseract testing/eurotext.png
testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.
On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)  wrote:
>
> Hi again,
>
> Now I moved the OCR part to Tika, but I still can't make it work with Danish. 
> It works when using default language settings and it seems like Tika is 
> missing Danish dictionary.
>
> My java code looks like this:
>
> {
> File file = new File(pathfilename);
>
> Metadata meta = new Metadata();
>
> InputStream stream = TikaInputStream.get(file);
>
> Parser parser = new AutoDetectParser();
> BodyContentHandler handler = new 
> BodyContentHandler(Integer.MAX_VALUE);
>
> TesseractOCRConfig config = new TesseractOCRConfig();
> config.setLanguage("dan"); // code works if this phrase is 
> commented out.
>
> ParseContext parseContext = new ParseContext();
>
>  parseContext.set(TesseractOCRConfig.class, config);
>
> parser.parse(stream, handler, meta, parseContext);
> System.out.println(handler.toString());
> }
>
> Hope that someone can help here.
>
> -Original Message-
> From: Martin Frank Hansen (MHQ) 
> Sent: 22. oktober 2018 07:58
> To: solr-user@lucene.apache.org
> Subject: SV: Tesseract language
>
> Hi Erick,
>
> Thanks for the help! I will take a look at it.
>
>
> Martin Frank Hansen, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
> -Original Message-
> From: Erick Erickson
> Sent: 21 October 2018 22:49
> To: solr-user
> Subject: Re: Tesseract language
>
> Here's a skeletal program that uses Tika in a stand-alone client. Rip the 
> RDBMS parts out
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch  
> wrote:
> >
> > Usually, we just say to do a custom solution using SolrJ client to
> > connect. This gives you maximum flexibility and allows to integrate
> > Tika either inside your code or as a server. Latest Tika actually has
> > some off-thread handling I believe, to make it safer to embed.
> >
> > For DIH alternatives, if you want configuration over custom code, you
> > could look at something like Apache NiFI. It can push data into Solr.
> > Obviously it is a bigger solution, but it is correspondingly more
> > robust too.
> >
> > Regards,
> >Alex.
> > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)  wrote:
> > >
> > > Hi Alexandre,
> > >
> > > Thanks for your reply.
> > >
> > > Yes right now it is just for testing the possibilities of Solr and 
> > > Tesseract.
> > >
> > > I will take a look at the Tika documentation to see if I can make it work.
> > >
> > > You said that DIH are not recommended for production usage, what is the 
> > > recommended method(s) to upload data to a Solr instance?
> > >
> > > Best regards
> > >
> > > Martin Frank Hansen
> > >
> > > -Original Message-
> > > From: Alexandre Rafalovitch
> > > Sent: 21 October 2018 16:26
> > > To: solr-user
> > > Subject: Re: Tesseract language
> > >
> > > There is a couple of things mixed in here:
> > > 1) Extract handler is not recommended for production usage. It is great 
> > > for a quick test, just like you did it, but going to production, running 
> > > it externally is better. Tika - especially with large files can use up a 
> > > lot of memory and trip up the Solr instance it is running within.
> > > 2) If you are still just testing, you can configure Tika within Solr but 
> > > specifying parseContent.config file as shown at the link and described 
> > > further down in the same document:
> > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > You still need to check in the Tika documentation whether Tesseract
> > > can take its configuration from the parseContext file.
> > > 3) If you are still testing with multiple files, Data Import Handler can 
> > > iterate through files and then - as a nested entity - feed it to Tika 
> > > processor for further extraction. I think one of the examples shows that.
> > > However, I am not sure you can pass parseContext that way and DIH is also 
> > > not recommended for production.
> > >
> > > I hope this helps,
> > > Alex.
> > >
> > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)  
> > > wrote:
> > >
> > > > Hi again,
> > > >
> > > >
> > > >
> > > > Is there anyone who has some experience of using Tesseract’s OCR
> > > > module within Solr? The files I am trying to read into Solr is
> > > > Danish Tiff documents.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > >
> > > > 

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison
Ha...emails passed in the ether.

As you saw, we added the RecursiveParserWrapper a while back into Tika so
no need to re-invent that wheel.  That’s my preferred method/format because
it maintains metadata from attachments and lets you know about exceptions
in embedded files. The legacy method concatenates contents, throws out
attachment metadata and silently swallows attachment exceptions.

On Fri, Oct 26, 2018 at 6:25 AM Martin Frank Hansen (MHQ) 
wrote:

> Hi again,
>
> Never mind, I got manage to get the content of the msg-files as well using
> the following link as inspiration:
> https://wiki.apache.org/tika/RecursiveMetadata
>
> But thanks again for all your help!
>
> -Original Message-
> From: Martin Frank Hansen (MHQ) 
> Sent: 26. oktober 2018 10:14
> To: solr-user@lucene.apache.org
> Subject: RE: Reading data using Tika to Solr
>
> Hi Tim,
>
> It is msg files and I added tika-app-1.14.jar to the build path - and now
> it works. But how do I get it to read the attachments as well?
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 21:57
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> If you’re processing actual msg (not eml), you’ll also need poi and
> poi-scratchpad and their dependencies, but then those msgs could have
> attachments, at which point, you may as well just add tika-app. :D
>
> On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi Erick and Tim,
> >
> > Thanks for your answers, I can see that my mail got messed up on the
> > way through the server. It looked much more readable at my end  The
> > attachment simply included my build-path.
> >
> > @Erick I am compiling the program using Netbeans at the moment.
> >
> > I updated to tika-1.7 but that did not help, and I haven't tried maven
> > yet but will probably have to give that a chance. I just find it a bit
> > odd that I can see the dependencies are included in the jar files I
> > added to the project, but I must be missing something?
> >
> > My buildpath looks as follows:
> >
> > Tika-parsers-1.4.jar
> > Tika-core-1.4.jar
> > Commons-io-2.5.jar
> > Httpclient-4.5.3
> > Httpcore-4.4.6.jar
> > Httpmime-4.5.3.jar
> > Slf4j-api1-7-24.jar
> > Jcl-over--slf4j-1.7.24.jar
> > Solr-cell-7.5.0.jar
> > Solr-core-7.5.0.jar
> > Solr-solrj-7.5.0.jar
> > Noggit-0.8.jar
> >
> >
> >
> > -Original Message-
> > From: Tim Allison 
> > Sent: 25. oktober 2018 20:21
> > To: solr-user@lucene.apache.org
> > Subject: Re: Reading data using Tika to Solr
> >
> > To follow up w Erick’s point, there are a bunch of transitive
> > dependencies from tika-parsers. If you aren’t using maven or similar
> > build system to grab the dependencies, it can be tricky to get it
> > right. If you aren’t using maven, and you can afford the risks of jar
> > hell, consider using tika-app or, better perhaps, tika-server.
> >
> > Stay tuned for SOLR-11721...
> >
> > On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> > 
> > wrote:
> >
> > > Martin:
> > >
> > > The mail server is pretty aggressive about stripping attachments,
> > > your png didn't come though. You might also get a more informed
> > > answer on the Tika mailing list.
> > >
> > > That said (and remember I can't see your png so this may be a silly
> > > question), how are you executing the program .vs. compiling it? You
> > > mentioned the "build path". I'm usually lazy and just execute it in
> > > IntelliJ for development and have forgotten to set my classpath on
> > > _numerous_ occasions when running it from a command line ;)
> > >
> > > Best,
> > > Erick
> > >
> > > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ)
> > > 
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I am trying to read content of msg-files using Tika and index
> > > > these in
> > > Solr, however I am having some problems with the OfficeParser(). I
> > > keep getting the error java.lang.NoClassDefFoundError for the
> > > OfficeParcer, even though both tika-core and tika-parsers are
> > > included
> > in the build path.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > I am using Java with the following code:
> > > >
> > >

Re: Reading data using Tika to Solr

2018-10-26 Thread Tim Allison
IIRC, somewhere btwn 1.14 and now (1.19.1), we changed the default behavior
for the AutoDetectParser from skip attachments to include attachments.

So, two options: 1) upgrade to 1.19.1 and use the AutoDetectParser or 2)
pass an AutoDetectParser via the ParseContext to be used for attachments.
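
For option 2, the usual idiom is just to register the parser in the
ParseContext so it gets reused for embedded documents -- roughly (reusing the
stream/handler/meta variables from your earlier snippet):

Parser parser = new AutoDetectParser();
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, parser);   // embedded docs/attachments get parsed with the same parser
parser.parse(stream, handler, meta, parseContext);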

If you’re wondering why you might upgrade to 1.19.1, look no further than:
https://tika.apache.org/security.html



On Fri, Oct 26, 2018 at 4:14 AM Martin Frank Hansen (MHQ) 
wrote:

> Hi Tim,
>
> It is msg files and I added tika-app-1.14.jar to the build path - and now
> it works. But how do I get it to read the attachments as well?
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 21:57
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> If you’re processing actual msg (not eml), you’ll also need poi and
> poi-scratchpad and their dependencies, but then those msgs could have
> attachments, at which point, you may as well just add tika-app. :D
>


Re: Reading data using Tika to Solr

2018-10-25 Thread Tim Allison
If you’re processing actual msg (not eml), you’ll also need poi and
poi-scratchpad and their dependencies, but then those msgs could have
attachments, at which point, you may as well just add tika-app. :D

On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Erick and Tim,
>
> Thanks for your answers, I can see that my mail got messed up on the way
> through the server. It looked much more readable at my end. The
> attachment simply included my build-path.
>
> @Erick I am compiling the program using Netbeans at the moment.
>
> I updated to tika-1.7 but that did not help, and I haven't tried maven yet
> but will probably have to give that a chance. I just find it a bit odd that
> I can see the dependencies are included in the jar files I added to the
> project, but I must be missing something?
>
> My buildpath looks as follows:
>
> Tika-parsers-1.4.jar
> Tika-core-1.4.jar
> Commons-io-2.5.jar
> Httpclient-4.5.3
> Httpcore-4.4.6.jar
> Httpmime-4.5.3.jar
> Slf4j-api1-7-24.jar
> Jcl-over--slf4j-1.7.24.jar
> Solr-cell-7.5.0.jar
> Solr-core-7.5.0.jar
> Solr-solrj-7.5.0.jar
> Noggit-0.8.jar
>
>
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 20:21
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> To follow up w Erick’s point, there are a bunch of transitive dependencies
> from tika-parsers. If you aren’t using maven or similar build system to
> grab the dependencies, it can be tricky to get it right. If you aren’t
> using maven, and you can afford the risks of jar hell, consider using
> tika-app or, better perhaps, tika-server.
>
> Stay tuned for SOLR-11721...
>
> On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson 
> wrote:
>
> > Martin:
> >
> > The mail server is pretty aggressive about stripping attachments, your
> > png didn't come though. You might also get a more informed answer on
> > the Tika mailing list.
> >
> > That said (and remember I can't see your png so this may be a silly
> > question), how are you executing the program .vs. compiling it? You
> > mentioned the "build path". I'm usually lazy and just execute it in
> > IntelliJ for development and have forgotten to set my classpath on
> > _numerous_ occasions when running it from a command line ;)
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ) 
> > wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I am trying to read content of msg-files using Tika and index these
> > > in
> > Solr, however I am having some problems with the OfficeParser(). I
> > keep getting the error java.lang.NoClassDefFoundError for the
> > OfficeParcer, even though both tika-core and tika-parsers are included
> in the build path.
> > >
> > >
> > >
> > >
> > >
> > > I am using Java with the following code:
> > >
> > >
> > >
> > >
> > >
> > > public static void main(final String[] args) throws IOException, SAXException, TikaException {
> > >     processDocument(pathtofile)
> > > }
> > >
> > > private static void processDocument(String pathfilename) {
> > >     try {
> > >         File file = new File(pathfilename);
> > >         Metadata meta = new Metadata();
> > >         InputStream input = TikaInputStream.get(file);
> > >         BodyContentHandler handler = new BodyContentHandler();
> > >         Parser parser = new OfficeParser();
> > >         ParseContext context = new ParseContext();
> > >         parser.parse(input, handler, meta, context);
> > >         String doccontent = handler.toString();
> > >         System.out.println(doccontent);
> > >         System.out.println(meta);
> > >     }
> > > }

Re: Reading data using Tika to Solr

2018-10-25 Thread Tim Allison
To follow up w Erick’s point, there are a bunch of transitive dependencies
from tika-parsers. If you aren’t using maven or similar build system to
grab the dependencies, it can be tricky to get it right. If you aren’t
using maven, and you can afford the risks of jar hell, consider using
tika-app or, better perhaps, tika-server.

Stay tuned for SOLR-11721...

On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson 
wrote:

> Martin:
>
> The mail server is pretty aggressive about stripping attachments, your
> png didn't come though. You might also get a more informed answer on
> the Tika mailing list.
>
> That said (and remember I can't see your png so this may be a silly
> question), how are you executing the program .vs. compiling it? You
> mentioned the "build path". I'm usually lazy and just execute it in
> IntelliJ for development and have forgotten to set my classpath on
> _numerous_ occasions when running it from a command line ;)
>
> Best,
> Erick
>
> On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ) 
> wrote:
> >
> > Hi,
> >
> >
> >
> > I am trying to read content of msg-files using Tika and index these in
> Solr, however I am having some problems with the OfficeParser(). I keep
> getting the error java.lang.NoClassDefFoundError for the OfficeParcer, even
> though both tika-core and tika-parsers are included in the build path.
> >
> >
> >
> >
> >
> > I am using Java with the following code:
> >
> >
> >
> >
> >
> > public static void main(final String[] args) throws IOException, SAXException, TikaException {
> >     processDocument(pathtofile)
> > }
> >
> > private static void processDocument(String pathfilename) {
> >     try {
> >         File file = new File(pathfilename);
> >         Metadata meta = new Metadata();
> >         InputStream input = TikaInputStream.get(file);
> >         BodyContentHandler handler = new BodyContentHandler();
> >         Parser parser = new OfficeParser();
> >         ParseContext context = new ParseContext();
> >         parser.parse(input, handler, meta, context);
> >         String doccontent = handler.toString();
> >         System.out.println(doccontent);
> >         System.out.println(meta);
> >     }
> > }
> >
> > In the buildpath I have the following dependencies:
> >
> >
> >
> >
> >
> > Any help is appreciate.
> >
> >
> >
> > Thanks in advance.
> >
> >
> >
> > Best regards,
> >
> >
> >
> > Martin Hansen
> >
> >
> >
> > Protection of your personal data is important to us. Here you can read
> KMD’s Privacy Policy outlining how we process your personal data.
> >
> > Please note that this message may contain confidential information. If
> you have received this message by mistake, please inform the sender of the
> mistake by sending a reply, then delete the message from your system
> without making, distributing or retaining any copies of it. Although we
> believe that the message and any attachments are free from viruses and
> other errors that might affect the computer or it-system where it is
> received and read, the recipient opens the message at his or her own risk.
> We assume no responsibility for any loss or damage arising from the receipt
> or use of this message.
>
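
A quick way to test the runtime-vs-compile classpath theory from inside the
program itself. This is only a sketch; the two class names below are just the
usual suspects for this particular error (OfficeParser lives in tika-parsers and
leans on Apache POI), so adjust them to whatever the stack trace actually reports:

    public class ClasspathCheck {
        public static void main(String[] args) {
            // What the JVM actually sees at runtime, as opposed to the IDE build path
            System.out.println("java.class.path = " + System.getProperty("java.class.path"));
            String[] suspects = {
                    "org.apache.tika.parser.microsoft.OfficeParser",   // tika-parsers
                    "org.apache.poi.poifs.filesystem.POIFSFileSystem"  // POI, a tika-parsers dependency
            };
            for (String cls : suspects) {
                try {
                    Class.forName(cls);
                    System.out.println("OK at runtime: " + cls);
                } catch (ClassNotFoundException | NoClassDefFoundError e) {
                    System.out.println("MISSING at runtime: " + cls + " (" + e + ")");
                }
            }
        }
    }

If one of these goes missing at runtime only, the fix is usually to put the full set
of tika-parsers dependencies (not just the two Tika jars) on the runtime classpath.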


Re: Encoding issue in solr

2018-10-05 Thread Tim Allison
This is probably caused by an encoding detection problem in Nutch and/or
Tika. If you can share the file on the Tika user’s list, I can take a look.
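
One way to narrow down where the mis-detection happens is to run Tika's charset
detector over the raw fetched bytes, outside Nutch and Solr. A rough sketch,
assuming tika-parsers is on the classpath and the raw fetched page has been saved
to a local file:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.parser.txt.CharsetDetector;
    import org.apache.tika.parser.txt.CharsetMatch;

    public class DetectCharset {
        public static void main(String[] args) throws Exception {
            byte[] bytes = Files.readAllBytes(Paths.get(args[0])); // the raw fetched page
            CharsetDetector detector = new CharsetDetector();
            detector.setText(bytes);
            CharsetMatch match = detector.detect();
            System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
            System.out.println(new String(bytes, match.getName()));
        }
    }

If the detector gets it right standalone but the indexed title is still mangled,
the problem is more likely on the Nutch side before the text ever reaches Solr.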

On Fri, Oct 5, 2018 at 7:11 AM UMA MAHESWAR 
wrote:

> HI ALL,
>
> While I am using Nutch for crawling and indexing into Solr, I am facing an
> encoding issue when the data is stored in Solr.
>
>
> The site has the title:
>
> title : ebm-papst Motoren & Ventilatoren GmbH - Axialventilatoren und
> Radialventilatoren aus Linz, Österreich
>
> but in Solr it is stored in the format below:
>
> "title": "ebm-papst Motoren & Ventilatoren GmbH - Axialventilatoren und
> Radialventilatoren aus Linz, Österrei",
>
> Please suggest how to store the actual data in Solr.
>
> thanks for your suggestions.
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: solr and diversification

2018-09-28 Thread Tim Allison
If you haven’t already, might want to check out maximal marginal
relevance...original paper: Carbonell and Goldstein.
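
For context, MMR re-ranks an already-retrieved, relevance-scored list by repeatedly
picking the candidate that best trades off relevance against similarity to what has
already been selected. A rough sketch of just that re-ranking step; the similarity
function is a placeholder, to be backed by cosine over whatever term vectors you
extract:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiFunction;

    public class MmrReRanker {

        /** A retrieved document plus its query-relevance score. */
        public static class Scored<T> {
            final T doc;
            final double relevance;
            public Scored(T doc, double relevance) { this.doc = doc; this.relevance = relevance; }
        }

        /**
         * Pick k documents from a relevance-ranked candidate list using
         * mmr(d) = lambda * rel(d) - (1 - lambda) * max sim(d, already selected).
         */
        public static <T> List<T> select(List<Scored<T>> candidates, int k, double lambda,
                                         BiFunction<T, T, Double> sim) {
            List<Scored<T>> remaining = new ArrayList<>(candidates);
            List<T> selected = new ArrayList<>();
            while (!remaining.isEmpty() && selected.size() < k) {
                Scored<T> best = null;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (Scored<T> c : remaining) {
                    double maxSim = 0.0;
                    for (T s : selected) {
                        maxSim = Math.max(maxSim, sim.apply(c.doc, s));
                    }
                    double mmr = lambda * c.relevance - (1 - lambda) * maxSim;
                    if (mmr > bestScore) {
                        bestScore = mmr;
                        best = c;
                    }
                }
                selected.add(best.doc);
                remaining.remove(best);
            }
            return selected;
        }
    }

With lambda = 1.0 this collapses to plain relevance ordering; lowering lambda buys
more diversity at the cost of raw relevance.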

On Thu, Sep 27, 2018 at 7:29 PM Joel Bernstein  wrote:

> Yeah, I think your plan sounds fine.
>
> Do you have a specific use case for diversity of results. I've been
> wondering if diversity of results would provide better perceived relevance.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Sep 27, 2018 at 1:39 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
> > Yeah, I think Kmeans might be a way to implement the "top 3 stories that
> > are more distant", but you can also have a more naïve (and faster)
> strategy
> > like
> >  - sending a threshold
> >  - scan the documents according to the relevance score
> >  - select the top documents that have diversity > threshold.
> >
> > I would allow to define the strategy and select it from the request.
> >
> > From: solr-user@lucene.apache.org At: 09/27/18 18:25:43To:  Diego
> > Ceccarelli (BLOOMBERG/ LONDON ) ,  solr-user@lucene.apache.org
> > Subject: Re: solr and diversification
> >
> > I've thought about this problem a little bit. What I was considering was
> > using Kmeans clustering to cluster the top 50 docs, then pulling the top
> > scoring doc from each cluster as the top documents. This should be fast
> and
> > effective at getting diversity.
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Thu, Sep 27, 2018 at 1:20 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > dceccarel...@bloomberg.net> wrote:
> >
> > > Hi,
> > >
> > > I'm considering writing a component for diversifying the results. I
> know
> > > that diversification can be achieved by using grouping but I'm thinking
> > > about something different and query biased.
> > > The idea is to have something that gets applied after the normal
> > retrieval
> > > and selects the top k documents more diverse based on some distance
> > metric:
> > >
> > > Example:
> > > imagine that you are asking for 10 rows, and you set diversify.rows=3
> > > diversity.metric=tfidf  diversify.field=body
> > >
> > > Solr might retrieve the top 10 rows as usual, extract tfidf vectors
> > > for the bodies and select the top 3 stories that are more distant
> > according
> > > to the cosine similarity.
> > > This would be different from grouping because documents will be
> > > 'collapsed' or not based on the subset of documents retrieved for the
> > > query.
> > > Do you think it would make sense to have it as a component?  any
> feedback
> > > / idea?
> > >
> > >
> > >
> >
> >
> >
>


Re: Memory Leak in 7.3 to 7.4

2018-08-06 Thread Tim Allison
+1 to Shawn's and Erick's points about isolating Tika in a separate jvm.

Y, please do let us know:  u...@tika.apache.org  We might be able to
help out, and you, in turn, can help the community figure out what's
going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703
On Sun, Aug 5, 2018 at 1:22 PM Shawn Heisey  wrote:
>
> On 8/2/2018 5:30 AM, Thomas Scheffler wrote:
> > my final verdict is that the culprit is the upgrade to Tika 1.17. If I downgrade the 
> > libraries just for Tika back to 1.16 and keep the rest of Solr 7.4.0, the heap usage 
> > after about 85 % of the index process and a manual trigger of the garbage 
> > collector is about 60-70 MB (That low!!!)
> >
> > My problem now is that we have several setups that trigger this reliably 
> > but there is no simple test case that „fails“ if Tika 1.17 or 1.18 is used. 
> > I also do not know if the error is inside Tika or inside the glue code that 
> > makes Tika usable in SOLR.
>
> If downgrading Tika fixes the issue, then it doesn't seem (to me) very
> likely that Solr's glue code for ERH has a problem. If it's not Solr's
> code that has the problem, there will be nothing we can do about it
> other than change the Tika library included with Solr.
>
> Before filing an issue, you should discuss this with the Tika project on
> their mailing list.  They'll want to make sure that they can fix the
> problem in a future version.  It might not be an actual memory leak ...
> it could just be that one of the documents you're trying to index is one
> that Tika requires a huge amount of memory to handle.  But it could be a
> memory leak.
>
> If you know which document is being worked on when it runs out of
> memory, can you try not including that document in your indexing, to see
> if it still has a problem?
>
> Please note that it is strongly recommended that you do not use the
> Extracting Request Handler in production.  Tika is prone to many
> problems, and those problems will generally affect Solr if Tika is being
> run inside Solr.  Because of this, it is recommended that you write a
> separate program using Tika that handles extracting information from
> documents and sending that data to Solr.  If that program crashes, Solr
> remains operational.
>
> There is already an issue to upgrade Tika to the latest version in Solr,
> but you've said that you tried 1.18 already with no change to the
> problem.  So whatever the problem is, it will need to be solved in 1.19
> or later.
>
> Thanks,
> Shawn
>
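
For anyone landing on this thread later, a bare-bones sketch of the separate-program
approach Shawn describes: Tika runs in its own JVM and only the extracted fields are
sent to Solr over HTTP via SolrJ. The collection URL and field names below are
placeholders, and the jars assumed on the classpath are tika-parsers (plus its
dependencies) and solr-solrj:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class StandaloneTikaIndexer {
        public static void main(String[] args) throws Exception {
            Path file = Paths.get(args[0]);

            // Extraction happens here, in this JVM; if Tika blows up, Solr is unaffected.
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream stream = Files.newInputStream(file)) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }

            // Only the already-extracted text crosses the wire to Solr.
            try (SolrClient solr =
                         new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.toAbsolutePath().toString());
                doc.addField("title_txt", metadata.get(TikaCoreProperties.TITLE));
                doc.addField("content_txt", handler.toString());
                solr.add(doc);
                solr.commit();
            }
        }
    }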


Re: Index protected zip

2018-05-29 Thread Tim Allison
I’m happy to contribute to this message in any way I can.  Let me know how
I can help.

On Tue, May 29, 2018 at 2:31 PM Cassandra Targett 
wrote:

> It's not as simple as a banner. Information was added to the wiki that does
> not exist in the Ref Guide.
>
> Before you say "go look at the Ref Guide" you need to make sure it says
> what you want it to say, and the creation of this page just 3 days ago
> indicates to me that the Ref Guide is missing something.
>
> On Tue, May 29, 2018 at 1:04 PM Erick Erickson 
> wrote:
>
> > On further reflection, +1 to marking the Wiki page superseded by the
> > reference guide. I'd be fine with putting a banner at the top of all
> > the Wiki pages saying "check the Solr reference guide first" ;)
> >
> > On Tue, May 29, 2018 at 10:59 AM, Cassandra Targett
> >  wrote:
> > > Couldn't the same information on that page be put into the Solr Ref
> > Guide?
> > >
> > > I mean, if that's what we recommend, it should be documented officially
> > > that it's what we recommend.
> > >
> > > I mean, is anyone surprised people keep stumbling over this? Shawn's
> wiki
> > > page doesn't point to the Ref Guide (instead pointing at other wiki
> pages
> > > that are out of date) and the Ref Guide doesn't point to that page. So
> > half
> > > the info is in our "official" place but the real story is in another
> > place,
> > > one we alternately tell people to sometimes ignore but sometimes keep
> up
> > to
> > > date? Even I'm confused.
> > >
> > > On Sat, May 26, 2018 at 6:41 PM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > >> Thanks! now I can just record the URL and then paste it in ;)
> > >>
> > >> Who knows, maybe people will see it first too!
> > >>
> > >> On Sat, May 26, 2018 at 9:48 AM, Tim Allison 
> > wrote:
> > >> > W00t! Thank you, Shawn!
> > >> >
> > >> > The "don't use ERH in production" response comes up frequently
> enough
> > >> >> that I have created a wiki page we can use for responses:
> > >> >>
> > >> >> https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
> > >> >>
> > >> >> Tim, you are extremely well-qualified to expand and correct this
> > page.
> > >> >> Erick may be interested in making adjustments also. The flow of the
> > page
> > >> >> feels a little bit awkward to me, but I'm not sure how to improve
> it.
> > >> >>
> > >> >> If the page name is substandard, feel free to rename.  I've already
> > >> >> renamed it once!  I searched for an existing page like this before
> I
> > >> >> started creating it.  I did put a link to the new page on the
> > >> >> ExtractingRequestHandler page.
> > >> >>
> > >> >> Thanks,
> > >> >> Shawn
> > >> >>
> > >> >>
> > >>
> >
>


Re: Index protected zip

2018-05-26 Thread Tim Allison
W00t! Thank you, Shawn!

The "don't use ERH in production" response comes up frequently enough
> that I have created a wiki page we can use for responses:
>
> https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
>
> Tim, you are extremely well-qualified to expand and correct this page.
> Erick may be interested in making adjustments also. The flow of the page
> feels a little bit awkward to me, but I'm not sure how to improve it.
>
> If the page name is substandard, feel free to rename.  I've already
> renamed it once!  I searched for an existing page like this before I
> started creating it.  I did put a link to the new page on the
> ExtractingRequestHandler page.
>
> Thanks,
> Shawn
>
>


Re: simple enrich uploaded binary documents with sha256 hashes

2018-05-26 Thread Tim Allison
+1 as always to Erick’s advice. DIH is only a PoC.

We do have a DigestingParser in Tika, and when you combine that with the
RecursiveParserWrapper, you can get digests not only of the main file but
also on all embedded files/attachments...which can be pretty neat for some
use cases.
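
A minimal sketch of that combination, assuming a Tika 1.17+ style API (older versions
wire the RecursiveParserWrapper up a little differently), and noting that the exact
digest metadata key can vary slightly by version:

    import java.io.InputStream;
    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.DigestingParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.RecursiveParserWrapper;
    import org.apache.tika.parser.utils.CommonsDigester;
    import org.apache.tika.sax.BasicContentHandlerFactory;
    import org.apache.tika.sax.RecursiveParserWrapperHandler;

    public class Sha256Digests {
        public static void main(String[] args) throws Exception {
            // Wrap AutoDetectParser so every parsed stream is digested as it is read.
            DigestingParser digesting = new DigestingParser(
                    new AutoDetectParser(),
                    new CommonsDigester(1_000_000, CommonsDigester.DigestAlgorithm.SHA256));
            RecursiveParserWrapper wrapper = new RecursiveParserWrapper(digesting);
            RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                    new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

            try (InputStream is = TikaInputStream.get(Paths.get(args[0]))) {
                wrapper.parse(is, handler, new Metadata(), new ParseContext());
            }

            // First entry is the container; the rest are embedded files/attachments.
            List<Metadata> metadataList = handler.getMetadataList();
            for (Metadata m : metadataList) {
                System.out.println(m.get("resourceName") + " -> " + m.get("X-TIKA:digest:SHA256"));
            }
        }
    }

From there it is a short hop to copying the digest into whatever field you send to Solr.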

Operators are standing by on the user list for Tika when you have
questions. :)

Cheers,
Tim

On Fri, May 25, 2018 at 11:10 AM Erick Erickson 
wrote:

> I'd consider using a separate Java program that uses Tika directly, or
> one of various services. Then you can assemble whatever you please
> before sending the doc to Solr. There are multiple reasons to
> recommend this, see:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> There are other reasons why using extractingRequestHandler is
> problematic in production, the biggest one being that it can blow up
> your server. Tika has to try to cope with every variant of every
> document format it processes, and I personally guarantee that the
> implementations from company X (which is no longer in business) for a
> PDF file  (from a spec current 10 years ago) may "interpret" that
> spec...er...freely ;) And Tika has to then try to cope. It does a
> brilliant job, but there's going to be case N+1
>
> The inference, of course, is that extractingRequestHandler is largely
> a PoC tool IMO, it gets people going without having to write an external
> program but not something I'd recommend for production.
>
> Best,
> Erick
>
> On Thu, May 24, 2018 at 10:06 PM, Thomas Lustig 
> wrote:
> > dear community,
> >
> > I would like to automatically add a sha256 filehash to a Document field
> > after a binary file is posted to an ExtractingRequestHandler.
> > First I thought that the ExtractingRequestHandler had such a feature, but
> > so far I did not find such a configuration.
> > It was mentioned that I should implement my own  Update Request Processor
> > to calculate the hash and add it to a field.
> > The  SignatureUpdateProcessor seemed to be an out-of-the-box option, but
> it
> > only supports md5 and also does not access the raw binary stream.
> >
> > The important thing is that I do need the binary stream of the uploaded
> > file to calculate a correct hash value (e.g. md5, sha256, ...).
> > Is it possible to also arrange this with a ScriptUpdateProcessor and
> > JavaScript?
> >
> > thanks in advance for any help
> >
> > Tom
>


Re: Index protected zip

2018-05-26 Thread Tim Allison
On third thought, I can’t think of how you’d easily inject a
PasswordProvider into Solr’s integration.

Please see Erick Erickson’s evergreen advice and linked blog post:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201805.mbox/%3ccan4yxve_0gn0a1y7wjpr27inuddo6+jzwwfgvzkfs40gh3r...@mail.gmail.com%3e


On Sat, May 26, 2018 at 6:34 AM Tim Allison <talli...@apache.org> wrote:

> You’ll need to provide a PasswordProvider in the ParseContext.  I don’t
> think that is currently possible in the Solr integration. Please open a
> ticket if SolrJ doesn’t meet your needs.
>
> On Thu, May 24, 2018 at 1:03 PM Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> Hmm. If it works, then it is Tika magic. Which may mean they may have a
>> setting for passwords. Which would need to be configured and then exposed
>> through Solr.
>>
>> So, I would check if you can extract text with Tika standalone first.
>>
>> Regards,
>> Alex
>>
>> On Thu, May 24, 2018, 5:05 AM Dimitris Kardarakos, <
>> dimitris.kardara...@iteam.gr> wrote:
>>
>> > Hello everyone.
>> >
>> > In Solr 7.3.0 I can successfully index the content of zip files.
>> >
>> > But if the zip file is password protected, running something like the
>> > below:
>> >
>> > curl
>> > "
>> >
>> http://localhost:8983/solr/sample/update/extract?commit=true&=enc.zip=1234
>> "
>> >
>> > -H "Content-Type: application/zip" --data-binary @enc.zip
>> >
>> > only the names of the files contained are indexed.
>> >
> >> > Is it a known issue or am I doing something wrong?
>> >
>> > Thanks!
>> >
>> > --
>> > Dimitris Kardarakos
>> >
>> >
>>
>


Re: Index protected zip

2018-05-26 Thread Tim Allison
You’ll need to provide a PasswordProvider in the ParseContext.  I don’t
think that is currently possible in the Solr integration. Please open a
ticket if SolrJ doesn’t meet your needs.
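
For the standalone-Tika check Alexandre suggests below, the mechanism looks roughly
like this; whether the password actually unlocks the content depends on the format and
the parser behind it, and "1234" here is just the example password from the curl call:

    import java.io.InputStream;
    import java.nio.file.Paths;

    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.PasswordProvider;
    import org.apache.tika.sax.BodyContentHandler;

    public class ProtectedArchiveCheck {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            ParseContext context = new ParseContext();
            // Hand the password to whichever parser asks for one.
            context.set(PasswordProvider.class, md -> "1234");

            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream is = TikaInputStream.get(Paths.get(args[0]))) {
                parser.parse(is, handler, metadata, context);
            }
            System.out.println(handler.toString());
        }
    }

If that works standalone but not through /update/extract, that points back at the
missing hook in the Solr integration mentioned above.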

On Thu, May 24, 2018 at 1:03 PM Alexandre Rafalovitch 
wrote:

> Hmm. If it works, then it is Tika magic. Which may mean they may have a
> setting for passwords. Which would need to be configured and then exposed
> through Solr.
>
> So, I would check if you can extract text with Tika standalone first.
>
> Regards,
> Alex
>
> On Thu, May 24, 2018, 5:05 AM Dimitris Kardarakos, <
> dimitris.kardara...@iteam.gr> wrote:
>
> > Hello everyone.
> >
> > In Solr 7.3.0 I can successfully index the content of zip files.
> >
> > But if the zip file is password protected, running something like the
> > below:
> >
> > curl
> > "
> >
> http://localhost:8983/solr/sample/update/extract?commit=true&=enc.zip=1234
> "
> >
> > -H "Content-Type: application/zip" --data-binary @enc.zip
> >
> > only the names of the files contained are indexed.
> >
> > > Is it a known issue or am I doing something wrong?
> >
> > Thanks!
> >
> > --
> > Dimitris Kardarakos
> >
> >
>