[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458950#comment-16458950 ]

Markus Schuch commented on SOLR-2416:
-------------------------------------

I just tested ZIP extraction with 7.3.0 and I can confirm that, due to the new default behavior of Tika 1.15+, the Extracting Request Handler extracts the text of the embedded documents as well, and not only the file names as stated in the issue description. So this was fixed with SOLR-10335.

> Solr Cell fails to index Zip file contents
> ------------------------------------------
>
> Key: SOLR-2416
> URL: https://issues.apache.org/jira/browse/SOLR-2416
> Project: Solr
> Issue Type: Bug
> Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
> Affects Versions: 1.4.1
> Reporter: Jayendra Patil
> Priority: Major
> Fix For: 6.0
>
> Attachments: SOLR-2416_ExtractingDocumentLoader.patch, SOLR-4216.patch
>
>
> Working with the latest Solr Trunk code, it seems the Tika handlers for Solr Cell (ExtractingDocumentLoader.java) and Data Import handler (TikaEntityProcessor.java) fail to index the zip file contents again.
> It just indexes the file names again.
> This issue was addressed some time back, late last year, but seems to have reappeared with the latest code.
> Jira for the Data Import handler part with the patch and the testcase - https://issues.apache.org/jira/browse/SOLR-2332.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084606#comment-16084606 ]

Tim Allison commented on SOLR-2416:
-----------------------------------

For more fun with embedded docs, see the issue on adding the RecursiveParserWrapper's behavior to Solr -- SOLR-7229
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084599#comment-16084599 ]

Tim Allison commented on SOLR-2416:
-----------------------------------

This should have been fixed by SOLR-7189, no? Or am I confusing DIH and Solr Cell?

In Tika 1.15 (TIKA-2096), we changed the default behavior to add an embedded parser if a user fails to pass one in via the parse context. So, if we upgrade to Tika 1.16 (just out), this will be fixed, too. We'll probably want to let Solr users turn off embedded document handling via configuration...
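The "fall back to a default when the caller passes nothing in the context" pattern described in this comment can be sketched roughly as follows. This is a toy model using only the Java standard library, not Tika's or Solr's code: `Context`, `EmbeddedHandler`, `DEFAULT`, `NAMES_ONLY` and `parseEmbedded` are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddedFallbackSketch {

    /** Hypothetical callback invoked for each embedded document. */
    public interface EmbeddedHandler {
        String handle(String name, String text);
    }

    /** Hypothetical, simplified context: a bag of objects keyed by type. */
    public static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        public <T> Context set(Class<T> key, T value) {
            entries.put(key, value);
            return this;
        }

        /** Returns the stored value, or the fallback when nothing was set. */
        public <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Default used when the caller supplies nothing: keep name and text. */
    public static final EmbeddedHandler DEFAULT = (name, text) -> name + ": " + text;

    /** Explicit opt-out: keep only the file name. */
    public static final EmbeddedHandler NAMES_ONLY = (name, text) -> name;

    /** Looks up the handler in the context, falling back to DEFAULT. */
    public static String parseEmbedded(String name, String text, Context context) {
        return context.get(EmbeddedHandler.class, DEFAULT).handle(name, text);
    }

    public static void main(String[] args) {
        // Nothing set in the context: the default handler keeps the text.
        System.out.println(parseEmbedded("a.txt", "hello", new Context()));
        // prints "a.txt: hello"

        // Caller opts out by supplying its own handler.
        Context optOut = new Context().set(EmbeddedHandler.class, NAMES_ONLY);
        System.out.println(parseEmbedded("a.txt", "hello", optOut));
        // prints "a.txt"
    }
}
```

In this model, switching embedded handling off is just a matter of placing a different handler in the context, which is the kind of configuration hook the comment suggests Solr users may want.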
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067311#comment-16067311 ]

Jan Høydahl commented on SOLR-2416:
-----------------------------------

[~smolloy] you filed the last patch for this, and it is really small and contained. Are you able to write a unit test and a CHANGES entry to make this ready for final review and commit? I'm happy to commit this for 7.x once it is ready.
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048048#comment-14048048 ]

Steve Molloy commented on SOLR-2416:
------------------------------------

Attached a patch that includes a parameter to make the behavior optional.

Fix For: 5.0
Attachments: SOLR-2416_ExtractingDocumentLoader.patch, SOLR-4216.patch
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546719#comment-13546719 ]

Maciej Lizewski commented on SOLR-2416:
---------------------------------------

I think this is a really needed feature. Also, earlier versions of Solr behaved differently than the current one: grepping the code of org.apache.solr.handler.extraction.ExtractingDocumentLoader from version 1.4.0.1 shows that no context was created. Instead, the autoDetectParser::parse function was called with 3 parameters (without a context), which caused a context to be created automatically with Parser=autoDetectParser. So this is a backward-compatibility violation introduced when PasswordProvider was added.

A comment in the current code also suggests that someone was not sure about the consequences of such a change:

TODO: should we design a way to pass in parse context?

The patch is already attached, as I see. Anyway, does anyone have this handler refactored as an external jar, so it can be added to a running Solr instance without changing and recompiling the core libs?

Fix For: 5.0
Attachments: SOLR-2416_ExtractingDocumentLoader.patch
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547308#comment-13547308 ]

Jan Høydahl commented on SOLR-2416:
-----------------------------------

As far as I can see, this behaviour has been consistent over several years and versions, seemingly unrelated to the PasswordProvider change for v4.0. Thus there are probably more Solr users expecting today's behavior than the pre-1.4.1 one.

As with open source in general, features are added because of real-world needs, by contributors who want to help. If you need this feature for your company, the first thing to do would be to test the attached patch, add a configuration param for enabling/disabling it, add JUnit tests, and work step by step towards a mature patch.

Fix For: 5.0
Attachments: SOLR-2416_ExtractingDocumentLoader.patch
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188616#comment-13188616 ]

Jayendra Patil commented on SOLR-2416:
--------------------------------------

Tika parses the zip file and extracts the complete content of the files as well. It parses all the files in the zip, as well as any zip inside the zip. The metadata is that of the zip file rather than of the individual files.

No special handling would be required on the Solr side. The metadata for the zip and its contents would be indexed as well.

Also, Solr doesn't allow attaching multiple files to a single document. A zip is a nice way of associating a document with multiple files. And the current behavior of indexing a zip with just the file names doesn't have much value.

Fix For: 3.6, 4.0
Attachments: SOLR-2416_ExtractingDocumentLoader.patch
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188628#comment-13188628 ]

Jan Høydahl commented on SOLR-2416:
-----------------------------------

I see. Perhaps we should make recursive parsing a config option, so people can choose?

Also, according to http://wiki.apache.org/tika/RecursiveMetadata the parser passed to the context is the parser used to parse inner files. Your patch assumes that it is always AutoDetectParser, but if someone passes stream.type=application/zip, you'll be lost. So perhaps a better way is to create a new AutoDetectParser to pass to the context.

Would you like to attempt a new patch with this fix, also controlling the behavior via a config parameter, e.g. recurseContainers=true? Please also add a JUnit test case to the patch to verify the fix.

Fix For: 3.6, 4.0
Attachments: SOLR-2416_ExtractingDocumentLoader.patch
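To make the two behaviors in this exchange concrete (file names only versus full contents, including a zip inside a zip), here is a minimal sketch that uses only `java.util.zip` from the standard library. It does not use Tika or Solr Cell and is not the attached patch. The `recurseContainers` flag borrows the parameter name suggested above, while `ZipRecursionSketch`, `extract` and `zip` are hypothetical names, and detecting nested archives by the `.zip` suffix is a simplifying assumption (Tika detects types by content).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipRecursionSketch {

    /**
     * Collects indexable strings from a ZIP held in memory.
     * With recurseContainers=false only entry names are returned;
     * with true, entry contents are added and nested ZIPs are descended into.
     */
    public static List<String> extract(byte[] zipBytes, boolean recurseContainers) {
        List<String> out = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                out.add(entry.getName());
                if (!recurseContainers || entry.isDirectory()) {
                    continue;
                }
                byte[] body = zin.readAllBytes(); // reads the current entry only
                if (entry.getName().endsWith(".zip")) {
                    out.addAll(extract(body, true)); // zip inside zip
                } else {
                    out.add(new String(body, StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    /** Builds a small in-memory ZIP; names[i] is stored with bodies[i]. */
    public static byte[] zip(String[] names, byte[][] bodies) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(buf)) {
            for (int i = 0; i < names.length; i++) {
                zout.putNextEntry(new ZipEntry(names[i]));
                zout.write(bodies[i]);
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] inner = zip(new String[] {"inner.txt"},
                new byte[][] {"nested text".getBytes(StandardCharsets.UTF_8)});
        byte[] outer = zip(new String[] {"a.txt", "inner.zip"},
                new byte[][] {"hello".getBytes(StandardCharsets.UTF_8), inner});

        System.out.println(extract(outer, false)); // [a.txt, inner.zip]
        System.out.println(extract(outer, true));
        // [a.txt, hello, inner.zip, inner.txt, nested text]
    }
}
```

Everything ends up in one flat list here, which roughly corresponds to indexing the whole archive as a single document; per-file metadata, which the thread also raises, is deliberately left out of the sketch.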
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188648#comment-13188648 ]

Jayendra Patil commented on SOLR-2416:
--------------------------------------

sure .. will try to check on this.

Fix For: 3.6, 4.0
Attachments: SOLR-2416_ExtractingDocumentLoader.patch
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188206#comment-13188206 ]

Jan Høydahl commented on SOLR-2416:
-----------------------------------

Unless I get good answers to the questions above, I'll close this as Not a problem

Fix For: 3.6, 4.0
Attachments: SOLR-2416_ExtractingDocumentLoader.patch
[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176371#comment-13176371 ]

Jan Høydahl commented on SOLR-2416:
-----------------------------------

If we add this, the behavior should probably be parameter driven. Some questions arise:

a) What to do with metadata? Should metadata for all files in the ZIP be added to the document? What's Tika's default?
b) How do you present the title of such a document consisting of multiple docs from a ZIP? Each individual document has its own title metadata...
c) Do you always want to traverse all files in the ZIP or only some types?
d) What do you do when a ZIP contains another ZIP?

All in all, perhaps this isn't such a useful feature after all?

Fix For: 3.6, 4.0
Attachments: SOLR-2416_ExtractingDocumentLoader.patch
[jira] Commented: (SOLR-2416) Solr Cell fails to index Zip file contents
[ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008284#comment-13008284 ]

Jayendra Patil commented on SOLR-2416:
--------------------------------------

This issue existed in Solr 1.4 packaged with Tika 0.4, which prevented us from using the stable version.
Thread - http://lucene.472066.n3.nabble.com/Issue-Indexing-zip-file-content-in-Solr-1-4-td504914.html

The issue was resolved with the Tika 0.5 upgrade @ https://issues.apache.org/jira/browse/SOLR-1567

We are working on a Snapshot of Solr Trunk 4.X marked around 15 August 2010, which uses the Tika 0.8 snapshot jars, and the extraction works fine for us.
However, with the latest Trunk upgraded to the stable release of Tika 0.8, it does not have the same behaviour.

Fix For: 3.2
Attachments: SOLR-2416_ExtractingDocumentLoader.patch