[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

2017-02-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866401#comment-15866401
 ] 

ASF GitHub Bot commented on NIFI-1815:
--

Github user jdye64 commented on the issue:

https://github.com/apache/nifi/pull/397
  
Unfortunately this isn't going to work due to license issues so closing.


> Tesseract OCR Processor
> ---
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Jeremy Dyer
>Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties expose to configure Tesseract
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

2017-02-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866402#comment-15866402
 ] 

ASF GitHub Bot commented on NIFI-1815:
--

Github user jdye64 closed the pull request at:

https://github.com/apache/nifi/pull/397


> Tesseract OCR Processor
> ---
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Jeremy Dyer
>Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties expose to configure Tesseract
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

2017-02-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866378#comment-15866378
 ] 

ASF GitHub Bot commented on NIFI-1815:
--

Github user joewitt commented on the issue:

https://github.com/apache/nifi/pull/397
  
I think we should close out this PR until licensing challenges are resolved 
and momentum is restored.  It is certainly cool so even if we end up with a 
process that puts an abnormal burden on the user for administration of it there 
may still be good use for it.  But for now I'd advocate closing this.


> Tesseract OCR Processor
> ---
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Jeremy Dyer
>Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties expose to configure Tesseract
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

2017-01-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837705#comment-15837705
 ] 

ASF GitHub Bot commented on NIFI-1815:
--

Github user jdye64 commented on the issue:

https://github.com/apache/nifi/pull/397
  
@trixpan unfortunately I was unable to get this implementation to run 
properly when removing the libraries that had licensing conflicts. Its 
unfortunate but maybe we just need to close this issue since I don't think its 
going to work due to license issues. Thoughts?


> Tesseract OCR Processor
> ---
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Jeremy Dyer
>Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties expose to configure Tesseract
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

2017-01-16 Thread Jeremy Dyer (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824212#comment-15824212
 ] 

Jeremy Dyer commented on NIFI-1815:
---

[~trixpan] I'm thinking there is going to be too many licensing challenges 
unfortunately. Let me give it another look today and see if there is anything 
that can be done to omit those dependencies with the licensing conflicts.

> Tesseract OCR Processor
> ---
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Jeremy Dyer
>Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties expose to configure Tesseract
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

2017-01-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824137#comment-15824137
 ] 

ASF GitHub Bot commented on NIFI-1815:
--

Github user trixpan commented on the issue:

https://github.com/apache/nifi/pull/397
  
@jdye64 is the PR active or should we close it? 

Cheers


> Tesseract OCR Processor
> ---
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Jeremy Dyer
>Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties expose to configure Tesseract
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

2016-11-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704587#comment-15704587
 ] 

Sergio Fernández commented on NIFI-1815:


Hi guys, so long since your incubation...

I just saw [PR #397|https://github.com/apache/nifi/pull/397], and I'd like to 
ask if you have properly checked the license. Because tess4j (AL2) depends on 
ghost4j (GPL3), so my understanding it goes into [License Category 
X|https://www.apache.org/legal/resolved.html#category-x], which may not be 
included within Apache products.

So I guess it needs further investigation. Feedback is welcomed!



> Tesseract OCR Processor
> ---
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Jeremy Dyer
>Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties expose to configure Tesseract
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

2016-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580785#comment-15580785
 ] 

ASF GitHub Bot commented on NIFI-1815:
--

Github user trixpan commented on the issue:

https://github.com/apache/nifi/pull/397
  
@olegz @jdye64 is this PR still active / plan to review?

Cheers


> Tesseract OCR Processor
> ---
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Jeremy Dyer
>Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties expose to configure Tesseract
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)