Re: [PR] Tika based document analyzer - DO NOT MERGE [creadur-rat]

2024-04-25 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1579860679


##
apache-rat-core/src/main/java/org/apache/rat/api/Document.java:
##
@@ -33,47 +36,416 @@ public interface Document {
  */
 enum Type {
 /** A generated document. */
-GENERATED, 
+GENERATED,
 /** An unknown document type. */
 UNKNOWN,
 /** An archive type document. */
-ARCHIVE, 
+ARCHIVE,
 /** A notice document (e.g. LICENSE file) */
 NOTICE,
 /** A binary file */
 BINARY,
 /** A standard document */
-STANDARD}
+STANDARD;
+
+public static Map<String, Type> documentTypeMap;
+
+public static Type fromContentType(String documentType, Log log) {
+Type result = documentTypeMap.get(documentType);
+if (result == null) {
+log.warn(String.format("Please open a Jira ticket with the subject: 'Unknown media type %s in Document.Type'", documentType));
+return UNKNOWN;
+}
+return result;
+}
+
+/*
+ * https://tika.apache.org/3.0.0-BETA/formats.html 
+ */
+static {
+documentTypeMap = new HashMap<>();

Review Comment:
   Shall we just open a JIRA ticket for this? I think that collecting some of
the files will be difficult.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@creadur.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Tika based document analyzer - DO NOT MERGE [creadur-rat]

2024-04-24 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1578972448


##
apache-rat-core/pom.xml:
##
@@ -126,5 +126,10 @@
   assertj-core
   test
 
+

Review Comment:
   I missed this. I put it here for development; yes, I will move it.






Re: [PR] Tika based document analyzer - DO NOT MERGE [creadur-rat]

2024-04-24 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1578938426


##
apache-rat-core/src/main/java/org/apache/rat/api/Document.java:
##
@@ -33,47 +36,416 @@ public interface Document {
  */
 enum Type {
 /** A generated document. */
-GENERATED, 
+GENERATED,
 /** An unknown document type. */
 UNKNOWN,
 /** An archive type document. */
-ARCHIVE, 
+ARCHIVE,
 /** A notice document (e.g. LICENSE file) */
 NOTICE,
 /** A binary file */
 BINARY,
 /** A standard document */
-STANDARD}
+STANDARD;
+
+public static Map<String, Type> documentTypeMap;
+
+public static Type fromContentType(String documentType, Log log) {
+Type result = documentTypeMap.get(documentType);
+if (result == null) {
+log.warn(String.format("Please open a Jira ticket with the subject: 'Unknown media type %s in Document.Type'", documentType));
+return UNKNOWN;
+}
+return result;
+}
+
+/*
+ * https://tika.apache.org/3.0.0-BETA/formats.html 
+ */
+static {
+documentTypeMap = new HashMap<>();

Review Comment:
   Seeing this long list, I wondered whether we should add a new module,
apache-rat-regression-tests, that contains at least one example for each file
type and can be used to measure and integration-test RAT.
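
   As a rough illustration of the idea above, the sketch below reports which
media types in a type map have no sample file yet. It is only a sketch under
stated assumptions: MediaTypeCoverage, missingSamples, the simplified Type
enum, and the three map entries are hypothetical and not taken from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of RAT. */
final class MediaTypeCoverage {

    /** Simplified stand-in for the Document.Type enum shown in the diff. */
    enum Type { GENERATED, UNKNOWN, ARCHIVE, NOTICE, BINARY, STANDARD }

    /**
     * Returns, in sorted order, the media types that appear in the map
     * but for which no sample file has been collected.
     */
    static Set<String> missingSamples(Map<String, Type> documentTypeMap,
                                      Set<String> sampledMediaTypes) {
        Set<String> missing = new TreeSet<>(documentTypeMap.keySet());
        missing.removeAll(sampledMediaTypes);
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative entries only; the real mapping lives in the PR.
        Map<String, Type> documentTypeMap = new LinkedHashMap<>();
        documentTypeMap.put("application/zip", Type.ARCHIVE);
        documentTypeMap.put("application/json", Type.STANDARD);
        documentTypeMap.put("image/png", Type.BINARY);

        // Media types for which a sample file is assumed to exist.
        Set<String> sampled = Set.of("application/zip", "image/png");

        System.out.println("Media types without a sample file: "
                + missingSamples(documentTypeMap, sampled));
    }
}
```

   A regression module could fail the build, or merely log a warning, when
the returned set is not empty.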






Re: [PR] Tika based document analyzer - DO NOT MERGE [creadur-rat]

2024-04-24 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1578937305


##
apache-rat-core/pom.xml:
##
@@ -126,5 +126,10 @@
   assertj-core
   test
 
+

Review Comment:
   Could we add the dependency to a dependencyManagement block so that the
version is defined in only one pom.xml and referenced from the other?






Re: [PR] Tika based document analyzer - DO NOT MERGE [creadur-rat]

2024-04-23 Thread via GitHub


ottlinger commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2072736075

   I'd prefer a file filter that allows ignoring no-comment plaintext files 
such as JSON.
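
   A minimal sketch of such a filter, assuming a plain extension check is
enough. NoCommentFileFilter and its constructor argument are hypothetical
names, not existing RAT classes; only java.io.FileFilter is a real JDK
interface.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Locale;
import java.util.Set;

/** Hypothetical filter for illustration only; not an existing RAT class. */
final class NoCommentFileFilter implements FileFilter {

    /** Lower-case extensions assumed to denote formats without comment syntax. */
    private final Set<String> ignoredExtensions;

    NoCommentFileFilter(Set<String> ignoredExtensions) {
        this.ignoredExtensions = ignoredExtensions;
    }

    /** Accepts files that should still be checked; rejects ignored extensions. */
    @Override
    public boolean accept(File file) {
        String name = file.getName().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        // Files without a dot have an empty extension and are always accepted.
        String extension = dot < 0 ? "" : name.substring(dot + 1);
        return !ignoredExtensions.contains(extension);
    }

    public static void main(String[] args) {
        FileFilter filter = new NoCommentFileFilter(Set.of("json"));
        System.out.println(filter.accept(new File("package.json")));
        System.out.println(filter.accept(new File("Document.java")));
    }
}
```

   Matching on the file extension is the simplest option; a filter based on
the detected media type would be an alternative once Tika detection is in
place.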

