[jira] [Updated] (SOLR-2480) Text extraction of password protected files

2011-05-14 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2480:
-

Attachment: SOLR-2480.patch

New patch.

By convention, the ExtractingRequestHandlerTest class belongs in the 
o.a.s.handler.extraction package, but it was in o.a.s.handler. This patch 
moves it.

 Text extraction of password protected files
 ---

 Key: SOLR-2480
 URL: https://issues.apache.org/jira/browse/SOLR-2480
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1, 3.1
Reporter: Shinichiro Abe
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2480-idea1.patch, SOLR-2480.patch, SOLR-2480.patch, 
 SOLR-2480.patch, password-is-solrcell.docx


 Proposal:
 Some files are password-protected: PDFs and Office documents in the 2007 and
 97 formats.
 These files are posted using SolrCell.
 We cannot read these files if we do not know their read password, so no text
 can be extracted from them.
 My requirement is that these files be processed normally, without extracting
 text and without throwing an exception.
 Background:
 Currently, when you post a password-protected file, Solr returns a 500 server
 error.
 Solr catches the error in ExtractingDocumentLoader and throws a TikaException.
 I use ManifoldCF.
 If the Solr server responds with 500, ManifoldCF decides that the document
 should be retried, because it cannot tell what happened.
 It then retries the post many times, never obtaining the password.
 In another case, my customer posts files with embedded images.
 Sometimes Solr throws a TikaException of unknown cause.
 He wants to post just the metadata without extracting text, but the exception
 stops him from posting.
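
The retry problem described above can be illustrated with a minimal sketch. It assumes a client that retries whenever the server answers with a 5xx status. The class and method names are hypothetical, and this is not ManifoldCF's actual retry logic.

```java
import java.util.function.IntSupplier;

/** Illustrative only: this is not ManifoldCF's actual retry logic. */
final class RetryOn500Sketch {

    /**
     * Calls the supplied post function (which returns an HTTP status code)
     * and retries while the server answers with a 5xx status.
     * Returns the number of attempts made.
     */
    static int postWithRetries(IntSupplier post, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            int status = post.getAsInt();
            if (status < 500) {
                break; // success or a client error: retrying would not help
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // A password-protected file fails the same way on every attempt,
        // so every retry is wasted.
        System.out.println(postWithRetries(() -> 500, 5)); // prints 5
        // A response below 500 ends the loop after the first attempt.
        System.out.println(postWithRetries(() -> 200, 5)); // prints 1
    }
}
```

Because the failure is permanent for a file whose password is unknown, a non-5xx response is the only thing that stops such a client from retrying.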

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2480) Text extraction of password protected files

2011-05-13 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2480:
-

Affects Version/s: 1.4.1
Fix Version/s: 4.0, 3.2

 Text extraction of password protected files
 ---

 Key: SOLR-2480
 URL: https://issues.apache.org/jira/browse/SOLR-2480
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1, 3.1
Reporter: Shinichiro Abe
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2480-idea1.patch





[jira] [Updated] (SOLR-2480) Text extraction of password protected files

2011-05-13 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-2480:
-

Attachment: password-is-solrcell.docx, SOLR-2480.patch

Attached the next patch and the password-protected Word file used for the test.

I added test cases for both ignoreTikaException=true and 
ignoreTikaException=false.

I think this is ready to commit.
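
The two ignoreTikaException cases can be sketched roughly as follows. This is a dependency-free illustration, not the code in the patch: ExtractionException stands in for Tika's TikaException, and the class, method, and field names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: none of these names come from the Solr patch. */
final class IgnoreExtractionErrorSketch {

    /** Stand-in for TikaException so the sketch has no dependencies. */
    static final class ExtractionException extends Exception {
        ExtractionException(String message) {
            super(message);
        }
    }

    /** Hypothetical extractor that fails for password-protected input. */
    static String extractText(String fileName, boolean passwordProtected)
            throws ExtractionException {
        if (passwordProtected) {
            throw new ExtractionException("cannot read encrypted file: " + fileName);
        }
        return "text of " + fileName;
    }

    /**
     * Builds the fields to index. With ignoreException=true an extraction
     * failure is swallowed and only metadata is kept; with false the
     * exception propagates to the caller.
     */
    static Map<String, String> buildDocument(String fileName,
                                             boolean passwordProtected,
                                             boolean ignoreException)
            throws ExtractionException {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("resourcename", fileName); // metadata known without parsing
        try {
            doc.put("content", extractText(fileName, passwordProtected));
        } catch (ExtractionException e) {
            if (!ignoreException) {
                throw e;
            }
            // Swallowed: the document keeps its metadata but has no text.
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // true: the document is built without a "content" field.
        System.out.println(buildDocument("password-is-solrcell.docx", true, true));
        // false: the failure reaches the caller.
        try {
            buildDocument("password-is-solrcell.docx", true, false);
        } catch (ExtractionException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

With the flag set to true, a client gets a normal response for a protected file and has no reason to retry; with false, the current behavior is preserved.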

 Text extraction of password protected files
 ---

 Key: SOLR-2480
 URL: https://issues.apache.org/jira/browse/SOLR-2480
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1, 3.1
Reporter: Shinichiro Abe
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2480-idea1.patch, SOLR-2480.patch, SOLR-2480.patch, 
 password-is-solrcell.docx





[jira] [Updated] (SOLR-2480) Text extraction of password protected files

2011-05-02 Thread Shinichiro Abe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shinichiro Abe updated SOLR-2480:
-

Attachment: SOLR-2480-idea1.patch

 Text extraction of password protected files
 ---

 Key: SOLR-2480
 URL: https://issues.apache.org/jira/browse/SOLR-2480
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 3.1
Reporter: Shinichiro Abe
Priority: Minor
 Attachments: SOLR-2480-idea1.patch

