[
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291406#comment-17291406
]
ASF GitHub Bot commented on TIKA-94:
------------------------------------
lewismc commented on a change in pull request #406:
URL: https://github.com/apache/tika/pull/406#discussion_r583392435
##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
Review comment:
Please change name of interface from `Transcriber.java` to
`Transcribe.java`
Why?
The Interface doesn't do the transcribing... the implementation does.
##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
Review comment:
Excellent. Thank you for adding this. We will populate it when we
complete the pull request.
##########
File path: tika-core/pom.xml
##########
@@ -84,6 +84,12 @@
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
+ <dependency>
Review comment:
Please push this into the `tika-translate` module
##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+ /**
+ * @return
Review comment:
First, we need a description of the interface. This is REALLY important.
Next we add parameters,
then we add `@throws`,
then `@return`.
This method signature needs to change. It is too tightly coupled to the AWS
Transcribe input. Please model the interface on the `tika-translate` API.
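To make the request concrete, here is a minimal sketch of a backend-neutral interface with the Javadoc in the order described (description, parameters, `@throws`, `@return`). It assumes the `tika-translate` `Translator` shape as I understand it: content in, an optional source language passed as a plain `String`, and an `isAvailable()` check. The method names, the `InputStream` parameter, and the `EchoTranscribe` stand-in are hypothetical. The real interface would presumably also declare `TikaException`, which is left out only so the sketch compiles without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical backend-neutral transcription interface (not the PR's code).
 * Nothing here refers to AWS types such as LanguageCode, job names or buckets.
 */
interface Transcribe {
    /**
     * Transcribes the given media, letting the backend pick the language.
     *
     * @param media audio or video content to transcribe
     * @throws IOException if the media cannot be read
     * @return the transcribed text
     */
    String transcribe(InputStream media) throws IOException;

    /**
     * Transcribes the given media, which is spoken in a known language.
     *
     * @param media audio or video content to transcribe
     * @param sourceLanguage language tag of the media, for example "en-US"
     * @throws IOException if the media cannot be read
     * @return the transcribed text
     */
    String transcribe(InputStream media, String sourceLanguage) throws IOException;

    /**
     * Reports whether the backend is configured and usable.
     *
     * @return true if transcription can be attempted
     */
    boolean isAvailable();
}

/** Stand-in backend that treats the stream bytes as UTF-8 text. Illustration only. */
class EchoTranscribe implements Transcribe {
    @Override
    public String transcribe(InputStream media) throws IOException {
        // readAllBytes() requires Java 9 or later.
        return new String(media.readAllBytes(), StandardCharsets.UTF_8);
    }

    @Override
    public String transcribe(InputStream media, String sourceLanguage) throws IOException {
        return "[" + sourceLanguage + "] " + transcribe(media);
    }

    @Override
    public boolean isAvailable() {
        return true;
    }
}

public class TranscribeSketch {
    /** Runs the stand-in backend on a fixed input and returns its output. */
    static String demo() {
        // The variable is typed as the interface, not as the implementation.
        Transcribe transcriber = new EchoTranscribe();
        InputStream media = new ByteArrayInputStream(
                "where is the bus stop?".getBytes(StandardCharsets.UTF_8));
        try {
            return transcriber.transcribe(media, "en-US");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the design is that callers only ever see `Transcribe`, so a second backend can be added later without touching the interface.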
##########
File path:
tika-transcribe/src/main/resources/org/apache/tika/transcribe/transcribe/transcribe.amazon.properties
##########
@@ -0,0 +1,18 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+transcribe.AWS_ACCESS_KEY=dummy_key
+transcribe.AWS_SECRET_KEY=dummy_key
+transcribe.BUCKET_NAME=dummy_name
Review comment:
I feel that we need to move more out of the interface and into the
implementation. The same goes for pushing more backend-specific method
parameters into this config file.
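As a rough sketch of that idea, the implementation could read everything backend-specific from the properties file once, so that none of it has to appear in method signatures. The three `transcribe.*` keys below are the ones already in this file; the `AmazonTranscribeConfig` class, the `transcribe.JOB_NAME_PREFIX` key and its default are hypothetical and only illustrate where such a parameter could live.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for backend-specific settings read from a properties file. */
final class AmazonTranscribeConfig {
    final String accessKey;
    final String secretKey;
    final String bucketName;
    final String jobNamePrefix; // hypothetical setting, not part of the PR

    private AmazonTranscribeConfig(Properties props) {
        this.accessKey = require(props, "transcribe.AWS_ACCESS_KEY");
        this.secretKey = require(props, "transcribe.AWS_SECRET_KEY");
        this.bucketName = require(props, "transcribe.BUCKET_NAME");
        // Optional key with a default, so callers never pass it as an argument.
        this.jobNamePrefix = props.getProperty("transcribe.JOB_NAME_PREFIX", "tika-");
    }

    /** Returns the trimmed value for key, or fails if it is missing or blank. */
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required property: " + key);
        }
        return value.trim();
    }

    /** Parses properties-format text from the reader into a config object. */
    static AmazonTranscribeConfig load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return new AmazonTranscribeConfig(props);
    }
}

public class ConfigSketch {
    /** Loads the dummy values from the properties file shown in this diff. */
    static AmazonTranscribeConfig demo() {
        String text = "transcribe.AWS_ACCESS_KEY=dummy_key\n"
                + "transcribe.AWS_SECRET_KEY=dummy_key\n"
                + "transcribe.BUCKET_NAME=dummy_name\n";
        try {
            return AmazonTranscribeConfig.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        AmazonTranscribeConfig config = demo();
        System.out.println(config.bucketName + " " + config.jobNamePrefix);
    }
}
```

Failing fast on a missing key also gives the constructor a clear condition for setting `isAvailable` to false.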
##########
File path:
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+ private AmazonTranscribeAsync amazonTranscribe;
+
+ private AmazonS3 amazonS3;
+
+ private static final Logger LOG =
LoggerFactory.getLogger(AmazonTranscribe.class);
+
+ private String bucketName;
+
+ private boolean isAvailable; // Flag for whether or not translation is
available.
+
+ private String clientId;
+
+ private String clientSecret; // Keys used for the API calls.
+
+// private HashSet<String> validSourceLanguages = new
HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
Review comment:
Is this not available from the AWS Java API? This is difficult to
maintain otherwise.
##########
File path:
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+ private AmazonTranscribeAsync amazonTranscribe;
+
+ private AmazonS3 amazonS3;
+
+ private static final Logger LOG =
LoggerFactory.getLogger(AmazonTranscribe.class);
+
+ private String bucketName;
+
+ private boolean isAvailable; // Flag for whether or not translation is
available.
+
+ private String clientId;
+
+ private String clientSecret; // Keys used for the API calls.
+
+// private HashSet<String> validSourceLanguages = new
HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+// "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR")); // Valid inputs
to StartStreamTranscription for language of source file (audio)
+
+ public AmazonTranscribe() {
+ this.isAvailable = true;
+ Properties config = new Properties();
+ try {
+ config.load(AmazonTranscribe.class
+ .getResourceAsStream(
+ "transcribe.amazon.properties"));
+ this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+ this.clientSecret =
config.getProperty("transcribe.AWS_SECRET_KEY");
+ this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+ } catch (Exception e) {
+ LOG.warn("Exception reading config file", e);
+ isAvailable = false;
+ }
+ }
+
+
+ /**
+ * Audio to text function without language specification
+ * @param fileName
+ * @return Transcribed text
+ * @throws TikaException
+ * @throws IOException
+ */
+ @Override
+ public void startTranscribeAudio(String fileName, String jobName) throws
TikaException, IOException {
+ if (!isAvailable())
+ return;
+ StartTranscriptionJobRequest startTranscriptionJobRequest = new
StartTranscriptionJobRequest();
+ Media media = new Media();
+ media.setMediaFileUri(amazonS3.getUrl(bucketName,
fileName).toString());
Review comment:
What about source language?
##########
File path:
tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+ AmazonTranscribe transcriber;
Review comment:
This should be
```
Transcribe transcriber;
```
##########
File path:
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+ private AmazonTranscribeAsync amazonTranscribe;
+
+ private AmazonS3 amazonS3;
+
+ private static final Logger LOG =
LoggerFactory.getLogger(AmazonTranscribe.class);
+
+ private String bucketName;
+
+ private boolean isAvailable; // Flag for whether or not translation is
available.
+
+ private String clientId;
+
+ private String clientSecret; // Keys used for the API calls.
+
+// private HashSet<String> validSourceLanguages = new
HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+// "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR")); // Valid inputs
to StartStreamTranscription for language of source file (audio)
+
+ public AmazonTranscribe() {
+ this.isAvailable = true;
+ Properties config = new Properties();
+ try {
+ config.load(AmazonTranscribe.class
+ .getResourceAsStream(
+ "transcribe.amazon.properties"));
+ this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+ this.clientSecret =
config.getProperty("transcribe.AWS_SECRET_KEY");
+ this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+ } catch (Exception e) {
+ LOG.warn("Exception reading config file", e);
+ isAvailable = false;
+ }
+ }
+
+
+ /**
+ * Audio to text function without language specification
+ * @param fileName
+ * @return Transcribed text
+ * @throws TikaException
+ * @throws IOException
+ */
+ @Override
+ public void startTranscribeAudio(String fileName, String jobName) throws
TikaException, IOException {
+ if (!isAvailable())
+ return;
+ StartTranscriptionJobRequest startTranscriptionJobRequest = new
StartTranscriptionJobRequest();
+ Media media = new Media();
+ media.setMediaFileUri(amazonS3.getUrl(bucketName,
fileName).toString());
+ startTranscriptionJobRequest.withMedia(media)
+ .withOutputBucketName(this.bucketName)
+ .setTranscriptionJobName(jobName);
+ amazonTranscribe.startTranscriptionJob(startTranscriptionJobRequest);
+ }
+
+ /**
+ * Audio to text function with language specification
+ * @param fileName
+ * @param sourceLanguage
+ * @return Transcribed text
+ * @throws TikaException
+ * @throws IOException
+ */
+ @Override
+ public void startTranscribeAudio(String fileName, LanguageCode
sourceLanguage, String jobName) throws TikaException, IOException {
+ if (!isAvailable())
+ return;
+ StartTranscriptionJobRequest startTranscriptionJobRequest = new
StartTranscriptionJobRequest();
+ Media media = new Media();
+ media.setMediaFileUri(amazonS3.getUrl(bucketName,
fileName).toString());
+ startTranscriptionJobRequest.withMedia(media)
+ .withLanguageCode(sourceLanguage)
+ .withOutputBucketName(this.bucketName)
+ .setTranscriptionJobName(jobName);
+ amazonTranscribe.startTranscriptionJob(startTranscriptionJobRequest);
+ }
+
+ @Override
+ public void startTranscribeVideo(String fileName, String jobName) throws
TikaException, IOException {
+ if (!isAvailable())
+ return;
+ //TODO
+
+ }
+
+ /**
+ * Audio to text function with language specification
+ * @param fileName
+ * @param sourceLanguage
+ * @return Transcribed text
+ * @throws TikaException
+ * @throws IOException
+ */
+ @Override
+ public void startTranscribeVideo(String fileName, LanguageCode
sourceLanguage, String jobName) throws TikaException, IOException {
+ if (!isAvailable())
+ return;
+ //boolean validSourceLanguageFlag =
validSourceLanguages.contains(sourceLanguage); // Checks if sourceLanguage in
validSourceLanguages O(1) lookup time
+
+ //if (!validSourceLanguageFlag) { // Throws TikaException if the input
sourceLanguage is not present in validSourceLanguages
+ // throw new TikaException("Provided Source Language is Not Valid.
Run without language parameter or please select one of: " +
+ // "en-US, en-GB, es-US, fr-CA, fr-FR, en-AU, it-IT, de-DE,
pt-BR, ja-JP, ko-KR"); }
+ //TODO
+
+ }
+
+ /**
+ * @return Valid AWS Credentials
+ */
+ public boolean isAvailable() {
+ return this.isAvailable;
+ }
+
+ /** Gets Transcriptioni result from AWS S3 bucket given bucketNamee and key
+ * @param key
+ * @return
+ */
+ @Override
+ public String getTranscriptResult(String key) {
+ TranscriptionJob transcriptionJob =
retrieveObjectWhenJobCompleted(key);
+ if (transcriptionJob != null &&
!TranscriptionJobStatus.FAILED.equals(transcriptionJob.getTranscriptionJobStatus()))
{
+ return amazonS3.getObjectAsString(this.bucketName, key + ".json");
+ } else
+ return null;
+ }
+
+ /**
+ * Private helper function to get object from s3
+ * @param key
+ * @return
+ */
+ private TranscriptionJob retrieveObjectWhenJobCompleted(String key) {
+ GetTranscriptionJobRequest getTranscriptionJobRequest = new
GetTranscriptionJobRequest();
+ getTranscriptionJobRequest.setTranscriptionJobName(key);
+
+ while (true) {
+ GetTranscriptionJobResult innerResult =
amazonTranscribe.getTranscriptionJob(getTranscriptionJobRequest);
+ String status =
innerResult.getTranscriptionJob().getTranscriptionJobStatus();
+ if (TranscriptionJobStatus.COMPLETED.name().equals(status) ||
+ TranscriptionJobStatus.FAILED.name().equals(status)) {
+ return innerResult.getTranscriptionJob();
+ }
+ }
+ }
+
+ /**
+ * Call this method in order to upload a file to the Amazon S3 bucket.
+ * @param bucketName
+ * @param fileName
+ * @param fullFileName
+ */
+ @Override
+ public void uploadFileToBucket(String bucketName, String fileName, String
fullFileName) {
Review comment:
This needs to exist in the AWS implementation but NOT in the Transcribe
Interface.
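One way this could look, as a sketch only: the upload becomes a helper that the implementation calls internally, and the interface never mentions buckets. `SimpleTranscribe` is a reduced one-method stand-in for the interface, and `FakeBucketTranscribe` replaces S3 with an in-memory map so the example runs without the AWS SDK; both names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Reduced stand-in for the backend-neutral interface (hypothetical). */
interface SimpleTranscribe {
    String transcribe(String filePath);
}

/** Stand-in for the AWS implementation; a map plays the role of the S3 bucket. */
class FakeBucketTranscribe implements SimpleTranscribe {
    private final Map<String, String> bucket = new HashMap<>();

    @Override
    public String transcribe(String filePath) {
        // The upload is an internal step of this backend, invisible to callers.
        String key = uploadFileToBucket(filePath);
        return "transcript of " + key;
    }

    /** Backend-specific helper: exists here, but is NOT declared on the interface. */
    private String uploadFileToBucket(String filePath) {
        // Use the file name (text after the last '/') as the object key.
        String key = filePath.substring(filePath.lastIndexOf('/') + 1);
        bucket.put(key, filePath);
        return key;
    }

    /** Number of objects "uploaded" so far, exposed only for the demo. */
    int storedObjects() {
        return bucket.size();
    }
}

public class UploadSketch {
    public static void main(String[] args) {
        SimpleTranscribe transcriber = new FakeBucketTranscribe();
        System.out.println(transcriber.transcribe("src/test/resources/ShortAudioSample.mp3"));
    }
}
```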
##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+ /**
+ * @return
+ * @param fileName
+ * @param jobName
+ * @throws TikaException When there is an error translating.
+ * @throws java.io.IOException
+ * @since TODO
+ */
+ public void startTranscribeAudio(String fileName, String jobName) throws
TikaException, IOException;
+
+ /**
+ * @return
+ * @param fileName
+ * @param sourceLanguage
+ * @param jobName
+ * @throws TikaException When there is an error translating.
+ * @throws java.io.IOException
+ * @since TODO
+ */
+ public void startTranscribeAudio(String fileName, LanguageCode
sourceLanguage, String jobName) throws TikaException, IOException;
+
+ /**
+ * @return
+ * @param fileName
+ * @param jobName
+ * @throws TikaException When there is an error translating.
+ * @throws java.io.IOException
+ * @since TODO
+ */
+ public void startTranscribeVideo(String fileName, String jobName) throws
TikaException, IOException;
+
+ /**
+ * @return
+ * @param fileName
+ * @param jobName
+ * @param sourceLanguage
+ * @throws TikaException When there is an error translating.
+ * @throws java.io.IOException
+ * @since TODO
+ */
+ public void startTranscribeVideo(String fileName, LanguageCode
sourceLanguage, String jobName) throws TikaException, IOException;
+
+ /**
+ * Gets transcription result from S3
+ * @param key
+ * @return
+ */
+ public String getTranscriptResult(String key);
+
+ /**
+ * Upload file to s3
+ * @param bucketName
+ * @param fileName
+ * @param filePath
+ */
+ public void uploadFileToBucket(String bucketName, String fileName, String
filePath);
Review comment:
This should never be in an interface. This is WAY too AWS-specific.
##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
Review comment:
Remove whitespace
##########
File path:
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
Review comment:
Please look at the package naming here...
```
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/
```
should be
```
tika-transcribe/src/main/java/org/apache/tika/transcribe
```
##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+ /**
+ * @return
+ * @param fileName
Review comment:
Also, what about the language the transcription service should work
on?
##########
File path: tika-transcribe/pom.xml
##########
@@ -0,0 +1,144 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+
+ <parent>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-parent</artifactId>
+ <version>2.0.0-SNAPSHOT</version>
+ <relativePath>../tika-parent/pom.xml</relativePath>
+ </parent>
+
+ <artifactId>tika-transcribe</artifactId>
+ <packaging>bundle</packaging>
+ <name>Apache Tika transcribe</name>
+ <url>http://tika.apache.org/</url>
+ <!--TODO use latest aws version or the one defined in the tika-parent-->
Review comment:
Defining in `tika-parent` is fine but not in `tika-core`
##########
File path:
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
Review comment:
Please order all `import`s alphabetically
##########
File path:
tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+ AmazonTranscribe transcriber;
+
+ @Before
+ public void setUp() {
+ transcriber = new AmazonTranscribe();
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageAudioShortTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of
ShortAudioSample.mp3
+ /*
+ URL res =
getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+ File file = Paths.get(res.toURI()).toFile();
+ String absolutePath = file.getAbsolutePath();
+ Necessary to get the correct file path from our test resource folder?
*/
+ //TODO: is the above commented block necessary to obtain the proper
filepath for a file located in the tika-translate/test/resources directory?
+
+ String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+ String result = null;
+
+ if (transcriber.isAvailable()) {
+ try {
+ result = transcriber.transcribeAudio(audioFilePath);
+ assertNotNull(result);
+ assertEquals("Result: [" + result
+ + "]: not equal to expected: [" + expected +
"]",
+ expected, result);
+ } catch (Exception e) {
+ e.printStackTrace();
+ fail(e.getMessage());
+ }
+ }
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageAudioLongTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of
LongAudioSample.mp3
+ String audioFilePath = "src/test/resources/LongAudioSample.mp3";
Review comment:
Where is this file?
##########
File path:
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+ private AmazonTranscribeAsync amazonTranscribe;
+
+ private AmazonS3 amazonS3;
+
+ private static final Logger LOG =
LoggerFactory.getLogger(AmazonTranscribe.class);
+
+ private String bucketName;
+
+ private boolean isAvailable; // Flag for whether or not translation is
available.
+
+ private String clientId;
+
+ private String clientSecret; // Keys used for the API calls.
+
+// private HashSet<String> validSourceLanguages = new
HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+// "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR")); // Valid inputs
to StartStreamTranscription for language of source file (audio)
+
+ public AmazonTranscribe() {
+ this.isAvailable = true;
+ Properties config = new Properties();
+ try {
+ config.load(AmazonTranscribe.class
+ .getResourceAsStream(
+ "transcribe.amazon.properties"));
+ this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+ this.clientSecret = config.getProperty("transcribe.AWS_SECRET_KEY");
+ this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+ } catch (Exception e) {
+ LOG.warn("Exception reading config file", e);
+ isAvailable = false;
+ }
+ }
+
+
+ /**
+ * Audio to text function without language specification
+ * @param fileName
+ * @return Transcribed text
Review comment:
Please populate all of this Javadoc based upon the guidance I provided
above.
##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+ AmazonTranscribe transcriber;
+
+ @Before
+ public void setUp() {
+ transcriber = new AmazonTranscribe();
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageAudioShortTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+ /*
+ URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+ File file = Paths.get(res.toURI()).toFile();
+ String absolutePath = file.getAbsolutePath();
+ Necessary to get the correct file path from our test resource folder? */
+ //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+ String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
Review comment:
Where is this file?
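On the commented-out block quoted above: looking the sample up through the class loader, instead of relying on the relative `src/test/resources/...` path, would make a missing file obvious right away. A minimal sketch, assuming the samples are meant to live under the module's `src/test/resources`; `TestResourceLocator` and `resolve` are hypothetical names and are not part of this pull request:

```java
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;

// Hypothetical helper, not part of the pull request.
final class TestResourceLocator {

    private TestResourceLocator() {
    }

    /**
     * Returns the absolute path of a classpath resource,
     * or null when the resource is not on the classpath.
     */
    static String resolve(String resourceName) {
        URL res = TestResourceLocator.class.getClassLoader().getResource(resourceName);
        if (res == null) {
            // The file was never committed, or it is not under src/test/resources.
            return null;
        }
        try {
            return Paths.get(res.toURI()).toFile().getAbsolutePath();
        } catch (URISyntaxException e) {
            throw new IllegalStateException("Malformed resource URL: " + res, e);
        }
    }
}
```

A test could then call `TestResourceLocator.resolve("ShortAudioSample.mp3")` and fail with a clear message when the result is null.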
##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+ AmazonTranscribe transcriber;
+
+ @Before
+ public void setUp() {
+ transcriber = new AmazonTranscribe();
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageAudioShortTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+ /*
+ URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+ File file = Paths.get(res.toURI()).toFile();
+ String absolutePath = file.getAbsolutePath();
+ Necessary to get the correct file path from our test resource folder? */
+ //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+ String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+ String result = null;
+
+ if (transcriber.isAvailable()) {
+ try {
+ result = transcriber.transcribeAudio(audioFilePath);
+ assertNotNull(result);
+ assertEquals("Result: [" + result
+ + "]: not equal to expected: [" + expected + "]",
+ expected, result);
+ } catch (Exception e) {
+ e.printStackTrace();
+ fail(e.getMessage());
+ }
+ }
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageAudioLongTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of LongAudioSample.mp3
+ String audioFilePath = "src/test/resources/LongAudioSample.mp3";
+ String result = null;
+
+ if (transcriber.isAvailable()) {
+ try {
+ result = transcriber.transcribeAudio(audioFilePath);
+ assertNotNull(result);
+ assertEquals("Result: [" + result
+ + "]: not equal to expected: [" + expected + "]",
+ expected, result);
+ } catch (Exception e) {
+ e.printStackTrace();
+ fail(e.getMessage());
+ }
+ }
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageShortVideoTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of ShortVideoSample.mp4
+ String videoFilePath = "src/test/resources/ShortVideoSample.mp4";
Review comment:
Where is this file?
##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+ AmazonTranscribe transcriber;
+
+ @Before
+ public void setUp() {
+ transcriber = new AmazonTranscribe();
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageAudioShortTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+ /*
+ URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+ File file = Paths.get(res.toURI()).toFile();
+ String absolutePath = file.getAbsolutePath();
+ Necessary to get the correct file path from our test resource folder? */
+ //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+ String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+ String result = null;
+
+ if (transcriber.isAvailable()) {
+ try {
+ result = transcriber.transcribeAudio(audioFilePath);
+ assertNotNull(result);
+ assertEquals("Result: [" + result
+ + "]: not equal to expected: [" + expected + "]",
+ expected, result);
+ } catch (Exception e) {
+ e.printStackTrace();
+ fail(e.getMessage());
+ }
+ }
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageAudioLongTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of LongAudioSample.mp3
+ String audioFilePath = "src/test/resources/LongAudioSample.mp3";
+ String result = null;
+
+ if (transcriber.isAvailable()) {
+ try {
+ result = transcriber.transcribeAudio(audioFilePath);
+ assertNotNull(result);
+ assertEquals("Result: [" + result
+ + "]: not equal to expected: [" + expected + "]",
+ expected, result);
+ } catch (Exception e) {
+ e.printStackTrace();
+ fail(e.getMessage());
+ }
+ }
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageShortVideoTest() {
+ String expected = "where is the bus stop? where is the bus stop?";
+ //TODO: "expected" should be changed to reflect the contents of ShortVideoSample.mp4
+ String videoFilePath = "src/test/resources/ShortVideoSample.mp4";
+ String result = null;
+
+ if (transcriber.isAvailable()) {
+ try {
+ result = transcriber.transcribeVideo(videoFilePath);
+ assertNotNull(result);
+ assertEquals("Result: [" + result
+ + "]: not equal to expected: [" + expected + "]",
+ expected, result);
+ } catch (Exception e) {
+ e.printStackTrace();
+ fail(e.getMessage());
+ }
+ }
+ }
+
+ @Test
+ public void AmazonTranscribeGuessLanguageLongVideoTest() {
+ String expected = "hello sir";
+ //TODO: "expected" should be changed to reflect the contents of LongVideoSample.mp4
+ String videoFilePath = "src/test/resources/LongVideoSample.mp4";
Review comment:
?
> Speech recognition
> ------------------
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Lewis John McGibbney
> Priority: Minor
> Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and
> comes with a friendly license.