[GitHub] [tika] lewismc opened a new pull request #444: TIKA-3403 Create example for Transcription

GitBox Thu, 13 May 2021 19:40:37 -0700


lewismc opened a new pull request #444:
URL: https://github.com/apache/tika/pull/444

This pull request addresses https://issues.apache.org/jira/browse/TIKA-3403
In addition to implementing the example file, it proposes the following
improvements:
* minor upgrade of the AWS libraries to `1.11.1018`
* adds a new configuration option for the AWS transcriber, allowing the client to
write to a specific region, cf. `transcribe.REGION`
* makes use of
[SelectObjectContentRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/SelectObjectContentRequest.html)
which filters the contents of an Amazon S3 object (transcription) based on a
simple Structured Query Language (SQL) statement. In the request, along with
the SQL expression, we specify JSON as the data serialization format of the
object. Amazon S3 uses this to parse object data into records, and returns only
records that match the specified SQL expression. In our case this means we ONLY
return the transcription text. This dramatically (orders of magnitude) reduces
the amount of data we egress from S3 to the client.
* the implementation will now automatically create the bucket (to store the
transcription) if one does not already exist. This is merely a utility
feature.
* introduces a LOT of exception handling and checks which will assist the
client in debugging errors/anomalies.
* reformats `GoogleTranslator.java` with 4-space indents.
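
As a rough illustration of the region option listed above, the sketch below resolves `transcribe.REGION` from a plain `java.util.Properties` object. This is only an assumption about how a client might supply the value: the class name, the `resolveRegion` helper and the `us-east-1` fallback are hypothetical and are not taken from this pull request, which may wire the option through Tika's own configuration instead.

```java
import java.util.Properties;

public class TranscribeRegionExample {

    // Hypothetical fallback used when no region is configured (an assumption,
    // not a default defined by this pull request).
    static final String DEFAULT_REGION = "us-east-1";

    /**
     * Returns the value of `transcribe.REGION`, trimmed, or the fallback
     * when the key is missing or blank.
     */
    static String resolveRegion(Properties config) {
        String region = config.getProperty("transcribe.REGION");
        if (region == null || region.trim().isEmpty()) {
            return DEFAULT_REGION;
        }
        return region.trim();
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("transcribe.REGION", "eu-west-1");

        // Region taken from the configuration.
        System.out.println(resolveRegion(config));
        // No region configured, so the hypothetical fallback is used.
        System.out.println(resolveRegion(new Properties()));
    }
}
```

Falling back to a fixed region keeps the sketch short; a real client might prefer to fail fast when the region is missing, so that transcriptions are never written to an unintended region.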

Thanks.

CC @rohan2810 FYI

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
