[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291406#comment-17291406
 ] 

ASF GitHub Bot commented on TIKA-94:
------------------------------------

lewismc commented on a change in pull request #406:
URL: https://github.com/apache/tika/pull/406#discussion_r583392435



##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*

Review comment:
       Please change the name of the interface from `Transcriber.java` to 
`Transcribe.java`.
   Why? 
   The interface doesn't do the transcribing... the implementation does.

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO

Review comment:
       Excellent. Thank you for adding this. We will populate it when we 
complete the pull request.

##########
File path: tika-core/pom.xml
##########
@@ -84,6 +84,12 @@
       <artifactId>junit</artifactId>
       <scope>test</scope>
     </dependency>
+      <dependency>

Review comment:
       Please push this into the `tika-translate` module

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+    /**
+     * @return

Review comment:
       First, we need a description of the interface. This is REALLY important.
   Next we add parameters,
   then we add `@throws`,
   then `@return`.
   
   This method signature needs to change. It is too tightly coupled to the AWS 
transcribe. Please model the interface on the `tika-translate` API. 

##########
File path: 
tika-transcribe/src/main/resources/org/apache/tika/transcribe/transcribe/transcribe.amazon.properties
##########
@@ -0,0 +1,18 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+transcribe.AWS_ACCESS_KEY=dummy_key
+transcribe.AWS_SECRET_KEY=dummy_key
+transcribe.BUCKET_NAME=dummy_name

Review comment:
       I feel that we need to push more out of the interface and into the 
implementation. The same goes for pushing more backend-specific method 
parameters into this config file. 
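
   One way this could look, as a rough sketch: the implementation reads its backend-specific settings from the properties file once, so they no longer appear as method parameters. The key `transcribe.BUCKET_NAME` comes from the file above; the key `transcribe.SOURCE_LANGUAGE` and the class `AmazonTranscribeConfig` are hypothetical names introduced for this example. The demo loads from an inline string rather than the classpath so that it runs on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Holds backend-specific settings so they stay out of the shared interface. */
final class AmazonTranscribeConfig {
    final String bucketName;
    final String sourceLanguage; // null means "let the service detect the language"

    private AmazonTranscribeConfig(String bucketName, String sourceLanguage) {
        this.bucketName = bucketName;
        this.sourceLanguage = sourceLanguage;
    }

    /** Builds the config from loaded properties, failing fast on a missing bucket. */
    static AmazonTranscribeConfig from(Properties props) {
        String bucket = props.getProperty("transcribe.BUCKET_NAME");
        if (bucket == null || bucket.trim().isEmpty()) {
            throw new IllegalArgumentException("transcribe.BUCKET_NAME is required");
        }
        // Hypothetical key: not present in the pull request's properties file.
        String language = props.getProperty("transcribe.SOURCE_LANGUAGE");
        return new AmazonTranscribeConfig(
                bucket.trim(), language == null ? null : language.trim());
    }
}

class ConfigDemo {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(
                "transcribe.BUCKET_NAME=dummy_name\n"
              + "transcribe.SOURCE_LANGUAGE=en-US\n"));
        AmazonTranscribeConfig config = AmazonTranscribeConfig.from(props);
        System.out.println(config.bucketName + " " + config.sourceLanguage);
    }
}
```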

##########
File path: 
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+    private AmazonTranscribeAsync amazonTranscribe;
+
+    private AmazonS3 amazonS3;
+
+    private static final Logger LOG = 
LoggerFactory.getLogger(AmazonTranscribe.class);
+
+    private String bucketName;
+
+    private boolean isAvailable; // Flag for whether or not translation is 
available.
+
+    private String clientId;
+
+    private String clientSecret;  // Keys used for the API calls.
+
+//    private HashSet<String> validSourceLanguages = new 
HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",

Review comment:
       Is this not available from the AWS Java API? This is difficult to 
maintain otherwise. 
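
   To illustrate the maintenance concern: the pull request already imports `com.amazonaws.services.transcribe.model.LanguageCode`, and if that type is an enum (as it appears to be in the AWS SDK for Java), the set of valid codes can be derived from its constants instead of a hand-written list. The sketch below uses a stand-in enum, `SupportedLanguage`, which is hypothetical and holds only three of the codes from the commented-out list, so that the snippet runs without the SDK on the classpath.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

/** Stand-in for the SDK's language enum; a real implementation would use the SDK type. */
enum SupportedLanguage {
    EN_US("en-US"),
    EN_GB("en-GB"),
    FR_FR("fr-FR");

    private final String code;

    SupportedLanguage(String code) {
        this.code = code;
    }

    String code() {
        return code;
    }
}

class LanguageCheck {
    // Derived from the enum once, so a new constant extends the valid set automatically.
    static final Set<String> VALID_CODES = Arrays.stream(SupportedLanguage.values())
            .map(SupportedLanguage::code)
            .collect(Collectors.toSet());

    /** Returns true if the given language tag matches one of the enum constants. */
    static boolean isSupported(String code) {
        return code != null && VALID_CODES.contains(code);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("en-US"));
        System.out.println(isSupported("xx-XX"));
    }
}
```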

##########
File path: 
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+    private AmazonTranscribeAsync amazonTranscribe;
+
+    private AmazonS3 amazonS3;
+
+    private static final Logger LOG = 
LoggerFactory.getLogger(AmazonTranscribe.class);
+
+    private String bucketName;
+
+    private boolean isAvailable; // Flag for whether or not translation is 
available.
+
+    private String clientId;
+
+    private String clientSecret;  // Keys used for the API calls.
+
+//    private HashSet<String> validSourceLanguages = new 
HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+//            "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR"));  // Valid inputs 
to StartStreamTranscription for language of source file (audio)
+
+    public AmazonTranscribe() {
+        this.isAvailable = true;
+        Properties config = new Properties();
+        try {
+            config.load(AmazonTranscribe.class
+                    .getResourceAsStream(
+                            "transcribe.amazon.properties"));
+            this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+            this.clientSecret = 
config.getProperty("transcribe.AWS_SECRET_KEY");
+            this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+        } catch (Exception e) {
+            LOG.warn("Exception reading config file", e);
+            isAvailable = false;
+        }
+    }
+
+    
+    /**
+     * Audio to text function without language specification
+     * @param fileName
+     * @return Transcribed text
+     * @throws TikaException
+     * @throws IOException
+     */
+    @Override
+    public void startTranscribeAudio(String fileName, String jobName) throws 
TikaException, IOException {
+        if (!isAvailable())
+            return;
+        StartTranscriptionJobRequest startTranscriptionJobRequest = new 
StartTranscriptionJobRequest();
+        Media media = new Media();
+        media.setMediaFileUri(amazonS3.getUrl(bucketName, 
fileName).toString());

Review comment:
       What about source language?

##########
File path: 
tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;

Review comment:
       This should be
   ```
   Transcribe transcriber;
   ```
   

##########
File path: 
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+    private AmazonTranscribeAsync amazonTranscribe;
+
+    private AmazonS3 amazonS3;
+
+    private static final Logger LOG = 
LoggerFactory.getLogger(AmazonTranscribe.class);
+
+    private String bucketName;
+
+    private boolean isAvailable; // Flag for whether or not translation is 
available.
+
+    private String clientId;
+
+    private String clientSecret;  // Keys used for the API calls.
+
+//    private HashSet<String> validSourceLanguages = new 
HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+//            "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR"));  // Valid inputs 
to StartStreamTranscription for language of source file (audio)
+
+    public AmazonTranscribe() {
+        this.isAvailable = true;
+        Properties config = new Properties();
+        try {
+            config.load(AmazonTranscribe.class
+                    .getResourceAsStream(
+                            "transcribe.amazon.properties"));
+            this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+            this.clientSecret = 
config.getProperty("transcribe.AWS_SECRET_KEY");
+            this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+        } catch (Exception e) {
+            LOG.warn("Exception reading config file", e);
+            isAvailable = false;
+        }
+    }
+
+    
+    /**
+     * Audio to text function without language specification
+     * @param fileName
+     * @return Transcribed text
+     * @throws TikaException
+     * @throws IOException
+     */
+    @Override
+    public void startTranscribeAudio(String fileName, String jobName) throws 
TikaException, IOException {
+        if (!isAvailable())
+            return;
+        StartTranscriptionJobRequest startTranscriptionJobRequest = new 
StartTranscriptionJobRequest();
+        Media media = new Media();
+        media.setMediaFileUri(amazonS3.getUrl(bucketName, 
fileName).toString());
+        startTranscriptionJobRequest.withMedia(media)
+                .withOutputBucketName(this.bucketName)
+                .setTranscriptionJobName(jobName);
+        amazonTranscribe.startTranscriptionJob(startTranscriptionJobRequest);
+    }
+
+    /**
+     * Audio to text function with language specification
+     * @param fileName
+     * @param sourceLanguage
+     * @return Transcribed text
+     * @throws TikaException
+     * @throws IOException
+     */
+    @Override
+    public void startTranscribeAudio(String fileName, LanguageCode 
sourceLanguage, String jobName) throws TikaException, IOException {
+        if (!isAvailable())
+                       return;
+        StartTranscriptionJobRequest startTranscriptionJobRequest = new 
StartTranscriptionJobRequest();
+        Media media = new Media();
+        media.setMediaFileUri(amazonS3.getUrl(bucketName, 
fileName).toString());
+        startTranscriptionJobRequest.withMedia(media)
+                .withLanguageCode(sourceLanguage)
+                .withOutputBucketName(this.bucketName)
+                .setTranscriptionJobName(jobName);
+        amazonTranscribe.startTranscriptionJob(startTranscriptionJobRequest);
+    }
+
+    @Override
+    public void startTranscribeVideo(String fileName, String jobName) throws 
TikaException, IOException {
+        if (!isAvailable())
+            return;
+        //TODO
+
+    }
+
+    /**
+     * Audio to text function with language specification
+     * @param fileName
+     * @param sourceLanguage
+     * @return Transcribed text
+     * @throws TikaException
+     * @throws IOException
+     */
+    @Override
+    public void startTranscribeVideo(String fileName, LanguageCode 
sourceLanguage, String jobName) throws TikaException, IOException {
+        if (!isAvailable())
+            return;
+        //boolean validSourceLanguageFlag = 
validSourceLanguages.contains(sourceLanguage); // Checks if sourceLanguage in 
validSourceLanguages O(1) lookup time
+
+        //if (!validSourceLanguageFlag) { // Throws TikaException if the input 
sourceLanguage is not present in validSourceLanguages
+        //    throw new TikaException("Provided Source Language is Not Valid. 
Run without language parameter or please select one of: " +
+        //           "en-US, en-GB, es-US, fr-CA, fr-FR, en-AU, it-IT, de-DE, 
pt-BR, ja-JP, ko-KR"); }
+        //TODO
+
+    }
+
+    /**
+     * @return Valid AWS Credentials
+     */
+       public boolean isAvailable() {
+               return this.isAvailable;
+       }
+
+    /** Gets Transcription result from AWS S3 bucket given bucketName and key
+     * @param key
+     * @return
+     */
+    @Override
+    public String getTranscriptResult(String key) {
+        TranscriptionJob transcriptionJob = 
retrieveObjectWhenJobCompleted(key);
+        if (transcriptionJob != null && 
!TranscriptionJobStatus.FAILED.equals(transcriptionJob.getTranscriptionJobStatus()))
 {
+            return amazonS3.getObjectAsString(this.bucketName, key + ".json");
+        } else
+            return null;
+    }
+
+    /**
+     * Private helper function to get object from s3
+     * @param key
+     * @return
+     */
+    private TranscriptionJob retrieveObjectWhenJobCompleted(String key) {
+        GetTranscriptionJobRequest getTranscriptionJobRequest = new 
GetTranscriptionJobRequest();
+        getTranscriptionJobRequest.setTranscriptionJobName(key);
+
+        while (true) {
+            GetTranscriptionJobResult innerResult = 
amazonTranscribe.getTranscriptionJob(getTranscriptionJobRequest);
+            String status = 
innerResult.getTranscriptionJob().getTranscriptionJobStatus();
+            if (TranscriptionJobStatus.COMPLETED.name().equals(status) ||
+                    TranscriptionJobStatus.FAILED.name().equals(status)) {
+                return innerResult.getTranscriptionJob();
+            }
+        }
+    }
+
+    /**
+     * Call this method in order to upload a file to the Amazon S3 bucket.
+     * @param bucketName
+     * @param fileName
+     * @param fullFileName
+     */
+    @Override
+    public void uploadFileToBucket(String bucketName, String fileName, String 
fullFileName) {

Review comment:
       This needs to exist in the AWS implementation but NOT in the Transcribe 
Interface. 

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+    /**
+     * @return
+     * @param fileName
+     * @param jobName
+     * @throws TikaException       When there is an error translating.
+     * @throws java.io.IOException
+     * @since TODO
+     */
+    public void startTranscribeAudio(String fileName, String jobName) throws 
TikaException, IOException;
+
+    /**
+     * @return
+     * @param fileName
+     * @param sourceLanguage
+     * @param jobName
+     * @throws TikaException       When there is an error translating.
+     * @throws java.io.IOException
+     * @since TODO
+     */
+    public void startTranscribeAudio(String fileName, LanguageCode 
sourceLanguage, String jobName) throws TikaException, IOException;
+
+    /**
+     * @return
+     * @param fileName
+     * @param jobName
+     * @throws TikaException       When there is an error translating.
+     * @throws java.io.IOException
+     * @since TODO
+     */
+    public void startTranscribeVideo(String fileName, String jobName) throws 
TikaException, IOException;
+
+    /**
+     * @return
+     * @param fileName
+     * @param jobName
+     * @param sourceLanguage
+     * @throws TikaException       When there is an error translating.
+     * @throws java.io.IOException
+     * @since TODO
+     */
+    public void startTranscribeVideo(String fileName, LanguageCode 
sourceLanguage, String jobName) throws TikaException, IOException;
+
+    /**
+     * Gets transcription result from S3
+     * @param key
+     * @return
+     */
+    public String getTranscriptResult(String key);
+
+    /**
+     * Upload file to s3
+     * @param bucketName
+     * @param fileName
+     * @param filePath
+     */
+    public void uploadFileToBucket(String bucketName, String fileName, String 
filePath);

Review comment:
       This should never be in an interface. This is WAY too AWS-specific. 

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+
+package org.apache.tika.transcribe;
+

Review comment:
       Remove whitespace

##########
File path: 
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*

Review comment:
       Please look at the package naming here...
   ```
   tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/
   ```
   should be
   ```
   tika-transcribe/src/main/java/org/apache/tika/transcribe
   ```
   
   

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+    /**
+     * @return
+     * @param fileName

Review comment:
       Also, what about the language that the transcription service should 
work on?

##########
File path: tika-transcribe/pom.xml
##########
@@ -0,0 +1,144 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+
+    <parent>
+        <groupId>org.apache.tika</groupId>
+        <artifactId>tika-parent</artifactId>
+        <version>2.0.0-SNAPSHOT</version>
+        <relativePath>../tika-parent/pom.xml</relativePath>
+    </parent>
+
+    <artifactId>tika-transcribe</artifactId>
+    <packaging>bundle</packaging>
+    <name>Apache Tika transcribe</name>
+    <url>http://tika.apache.org/</url>
+    <!--TODO use latest aws version or the one defined in the tika-parent-->

Review comment:
       Defining in `tika-parent` is fine but not in `tika-core`

##########
File path: 
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;

Review comment:
       Please order all `import`s alphabetically

##########
File path: 
tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;
+
+    @Before
+    public void setUp() {
+        transcriber = new AmazonTranscribe();
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioShortTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of 
ShortAudioSample.mp3
+        /*
+        URL res = 
getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+        File file = Paths.get(res.toURI()).toFile();
+        String absolutePath = file.getAbsolutePath();
+        Necessary to get the correct file path from our test resource folder? 
*/
+        //TODO: is the above commented block necessary to obtain the proper 
filepath for a file located in the tika-translate/test/resources directory?
+
+        String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + 
"]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioLongTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of 
LongAudioSample.mp3
+        String audioFilePath = "src/test/resources/LongAudioSample.mp3";

Review comment:
       Where is this file?

##########
File path: 
tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+    private AmazonTranscribeAsync amazonTranscribe;
+
+    private AmazonS3 amazonS3;
+
+    private static final Logger LOG = LoggerFactory.getLogger(AmazonTranscribe.class);
+
+    private String bucketName;
+
+    private boolean isAvailable; // Flag for whether or not translation is available.
+
+    private String clientId;
+
+    private String clientSecret;  // Keys used for the API calls.
+
+//    private HashSet<String> validSourceLanguages = new HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+//            "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR"));  // Valid inputs to StartStreamTranscription for language of source file (audio)
+
+    public AmazonTranscribe() {
+        this.isAvailable = true;
+        Properties config = new Properties();
+        try {
+            config.load(AmazonTranscribe.class
+                    .getResourceAsStream(
+                            "transcribe.amazon.properties"));
+            this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+            this.clientSecret = config.getProperty("transcribe.AWS_SECRET_KEY");
+            this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+        } catch (Exception e) {
+            LOG.warn("Exception reading config file", e);
+            isAvailable = false;
+        }
+    }
+
+    
+    /**
+     * Audio to text function without language specification
+     * @param fileName
+     * @return Transcribed text

Review comment:
       Please populate all of this Javadoc based upon the guidance I provided 
above. 
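
For illustration only, fully populated Javadoc on a transcription method might look like the sketch below. The interface name `SpeechToText`, the parameter name, and the documented exception are assumptions made for this example; they are not the signatures in the pull request.

```java
import java.io.IOException;

/**
 * Hypothetical stand-in for the transcription interface under review;
 * the name and signature are assumptions made for this example.
 */
interface SpeechToText {

    /**
     * Transcribes the audio file at the given path without a language hint,
     * leaving language detection to the underlying service.
     *
     * @param filePath path to a local audio file; must not be {@code null}
     * @return the transcribed text
     * @throws IOException if the file cannot be read or uploaded
     */
    String transcribeAudio(String filePath) throws IOException;
}
```

The point of the sketch is that the summary sentence, every `@param`, the `@return`, and each `@throws` carry a description rather than a bare tag.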

##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;
+
+    @Before
+    public void setUp() {
+        transcriber = new AmazonTranscribe();
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioShortTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+        /*
+        URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+        File file = Paths.get(res.toURI()).toFile();
+        String absolutePath = file.getAbsolutePath();
+        Necessary to get the correct file path from our test resource folder? */
+        //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+        String audioFilePath = "src/test/resources/ShortAudioSample.mp3";

Review comment:
       Where is this file?
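
On the TODO in the hunk above: one way to locate a file under `src/test/resources` independently of the working directory is to resolve it from the classpath, which also makes a missing file explicit. A minimal sketch, assuming the sample ends up on the test classpath as a plain file; `TestResourceLocator` and `locate` are hypothetical names, not part of the pull request.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;

/** Hypothetical helper for resolving test resources from the classpath. */
final class TestResourceLocator {

    private TestResourceLocator() {
    }

    /**
     * Returns the file backing a classpath resource, or null if the resource
     * is not on the classpath. Assumes the resource is a plain file on disk
     * (for example under target/test-classes), not an entry inside a jar.
     */
    static File locate(String resourceName) {
        URL url = TestResourceLocator.class.getClassLoader().getResource(resourceName);
        if (url == null) {
            return null; // resource was never added to the classpath
        }
        try {
            return Paths.get(url.toURI()).toFile();
        } catch (URISyntaxException e) {
            throw new IllegalStateException("Malformed resource URL: " + url, e);
        }
    }
}
```

A test could call `TestResourceLocator.locate("ShortAudioSample.mp3")` and fail with a clear message when it returns `null`, instead of passing along a relative path that may not exist.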

##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;
+
+    @Before
+    public void setUp() {
+        transcriber = new AmazonTranscribe();
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioShortTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+        /*
+        URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+        File file = Paths.get(res.toURI()).toFile();
+        String absolutePath = file.getAbsolutePath();
+        Necessary to get the correct file path from our test resource folder? */
+        //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+        String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioLongTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of LongAudioSample.mp3
+        String audioFilePath = "src/test/resources/LongAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageShortVideoTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortVideoSample.mp4
+        String videoFilePath = "src/test/resources/ShortVideoSample.mp4";

Review comment:
       Where is this file?

##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;
+
+    @Before
+    public void setUp() {
+        transcriber = new AmazonTranscribe();
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioShortTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+        /*
+        URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+        File file = Paths.get(res.toURI()).toFile();
+        String absolutePath = file.getAbsolutePath();
+        Necessary to get the correct file path from our test resource folder? */
+        //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+        String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioLongTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of LongAudioSample.mp3
+        String audioFilePath = "src/test/resources/LongAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageShortVideoTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortVideoSample.mp4
+        String videoFilePath = "src/test/resources/ShortVideoSample.mp4";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeVideo(videoFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageLongVideoTest() {
+        String expected = "hello sir";
+        //TODO: "expected" should be changed to reflect the contents of LongVideoSample.mp4
+        String videoFilePath = "src/test/resources/LongVideoSample.mp4";

Review comment:
       Where is this file?






> Speech recognition
> ------------------
>
>                 Key: TIKA-94
>                 URL: https://issues.apache.org/jira/browse/TIKA-94
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>              Labels: new-parser
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
