Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-04 Thread via GitHub


Claudenw merged PR #240:
URL: https://github.com/apache/creadur-rat/pull/240


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@creadur.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-04 Thread via GitHub


ottlinger commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2094202616

   @Claudenw pls review my latest additions concerning RAT-301, after that go 
ahead with the merge. Thanks for your work and the cool addition of more 
functionality to RAT #kudos





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-04 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1589986059


##
src/changes/changes.xml:
##
@@ -72,6 +72,22 @@ 
https://maven.apache.org/plugins/maven-changes-plugin/xsd/changes-1.0.0.xsd
 
 -->
 
+  
+MIME Detection Using Tika
+  
+  
+Changed to detecting binary by content not name.
+  
+  
+Change to detect non UTF-8 text as text not binary.

Review Comment:
   @Claudenw I brought in changes related to RAT-301 - is that fine for you or 
should the Chinese character example go somewhere else? Thanks






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-04 Thread via GitHub


Claudenw commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2094116895

   I updated the checklist.





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-03 Thread via GitHub


ottlinger commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2093681175

   > @ottlinger If you approve I can merge this. If you want more eyes on it,
let's invite a few reviewers.
   
   In the PR's main description you've created a checklist - is that already
done?
   I'm still wondering whether we should add more documentation about the change
or leave it to a bigger entry in the release notes about RAT's possibly changed
behaviour.





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-03 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1589675326


##
src/changes/changes.xml:
##
@@ -72,6 +72,22 @@ 
https://maven.apache.org/plugins/maven-changes-plugin/xsd/changes-1.0.0.xsd
 
 -->
 
+  
+MIME Detection Using Tika
+  
+  
+Changed to detecting binary by content not name.
+  
+  
+Change to detect non UTF-8 text as text not binary.

Review Comment:
   @Claudenw would you mind integrating the above file in order to see if it's
properly detected with Tika? This could allow solving RAT-301 as well :)






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-03 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1589673093


##
apache-rat-core/src/test/java/org/apache/rat/document/impl/guesser/BinaryGuesserTest.java:
##
@@ -1,150 +1,150 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one   *
- * or more contributor license agreements.  See the NOTICE file *
- * distributed with this work for additional information*
- * regarding copyright ownership.  The ASF licenses this file   *
- * to you under the Apache License, Version 2.0 (the*
- * "License"); you may not use this file except in compliance   *
- * with the License.  You may obtain a copy of the License at   *
- *  *
- *   http://www.apache.org/licenses/LICENSE-2.0 *
- *  *
- * Unless required by applicable law or agreed to in writing,   *
- * software distributed under the License is distributed on an  *
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY   *
- * KIND, either express or implied.  See the License for the*
- * specific language governing permissions and limitations  *
- * under the License.   *
- */
-package org.apache.rat.document.impl.guesser;
-
-import org.apache.commons.io.IOUtils;
-import org.apache.rat.document.MockDocument;
-import org.apache.rat.document.impl.FileDocument;
-import org.apache.rat.test.utils.Resources;
-import org.junit.jupiter.api.Test;
-
-import java.io.IOException;
-import java.io.Reader;
-import java.util.Arrays;
-import java.util.List;
-
-import static org.junit.jupiter.api.Assertions.assertEquals;
-import static org.junit.jupiter.api.Assertions.assertFalse;
-import static org.junit.jupiter.api.Assertions.assertTrue;
-
-public class BinaryGuesserTest {
-
-private static final List BINARY_FILES = Arrays.asList(//
-"image.png",//
-"image.pdf",//
-"image.psd",//
-"image.gif",//
-"image.giff",//
-"image.jpg",//
-"image.jpeg",//
-"image.exe",//
-"Whatever.class",//
-"data.dat",//
-"libicuda.so.34",//
-"my.truststore",//
-//"foo.Java", //
-//"manifest.Mf",//
-"deprecatedtechnology.swf",
-"xyz.aif",
-"abc.iff",
-// Audio Files
-"test.m3u", "test.m4a",
-"test-audio.mid", "test-audio.mp3",
-"test-audio.mpa", "test-audio.wav",
-"test-audio.wma"
-);
-
-@Test
-public void testMatches() {
-for (String name : BINARY_FILES) {
-assertTrue(BinaryGuesser.isBinary(new MockDocument(name)), ()->"'" + name + "' should be detected as a binary");
-}
-
-}
-
-@Test
-public void testIsBinary() {
-for (String name : BINARY_FILES) {
-assertTrue(BinaryGuesser.isBinary(name), ()->"'" + name + "' should be detected as a binary");
-}
-}
-
-/**
- * Used to swallow a MalformedInputException and return false
- * because the encoding of the stream was different from the
- * platform's default encoding.
- *
- * @throws Exception
- * @see "RAT-81"
- */
-@Test
-public void binaryWithMalformedInputRAT81() throws Exception {
-FileDocument doc = new FileDocument(Resources.getResourceFile("/binaries/UTF16_with_signature.xml"));
-Reader r = doc.reader(); // this will fail test if file is not readable
-try {
-char[] dummy = new char[100];
-r.read(dummy);
-// if we get here, the UTF-16 encoded file didn't throw
-// any exception, try the UTF-8 encoded one
-r.close();
-r = null; // ensure we detect failure to read second file
-doc = new FileDocument(Resources.getResourceFile("/binaries/UTF8_with_signature.xml"));
-r = doc.reader();
-r.read(dummy);
-// still here?  can't test on this platform
-System.err.println("Skipping testBinaryWithMalformedInput");
-} catch (IOException e) {
-if (r != null) {
-IOUtils.closeQuietly(r);
-} else {
-throw e; // could not open the second file
-}
-r = null;
-assertTrue(BinaryGuesser.isBinary(doc), "Expected binary for " + doc.getName());
-} finally {
-IOUtils.closeQuietly(r);
-}
-}
-
-@Test
-public void realBinaryContent() throws IOException {
-// This test is not accurate on all platforms
-final String encoding = System.getProperty("file.encoding");
-final boolean isBinary = BinaryGuesser.isBinary(new FileDocument(Resources.getResourceFile("/binaries/Image-png.not")));
-   

Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-03 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1589669854


##
apache-rat-core/pom.xml:
##
@@ -126,5 +126,10 @@
   assertj-core
   test
 
+

Review Comment:
   Thanks.






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-01 Thread via GitHub


Claudenw commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2089624723

   @ottlinger If you approve I can merge this.  If you want more eyes on it,
let's invite a few reviewers.





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-01 Thread via GitHub


ottlinger commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2088992912

   @Claudenw the extraction into the Tika-class looks very nice - thanks.





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-01 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1586731939


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -152,27 +163,35 @@ public SortedSet 
getLicenseFamilies(LicenseFilter filter) {
  * @param filter define which type of licenses to return.
  * @return The sorted set of approved licenseIds.
  */
-public SortedSet getLicenseIds(LicenseFilter filter) {
+public SortedSet getLicenseIds(final LicenseFilter filter) {
 return setFactory.getLicenseFamilyIds(filter);
 }
+
+public static FilenameFilter getFilesToIgnore() {
+return FILES_TO_IGNORE;
+}
+
+public static IOFileFilter getDirectoriesToIgnore() {
+return DIRECTORIES_TO_IGNORE;
+}
 
 /**
  * The Defaults builder.
  */
-public static class Builder {
+public final static class Builder {
 private final Set fileNames = new 
TreeSet<>(Comparator.comparing(URL::toString));
 
 private Builder() {
 fileNames.add(DEFAULT_CONFIG_URL);
 }
 
 /**
- * Adds a URL to a configuration file to be read.
+ * Adds a URI to a configuration file to be read.

Review Comment:
   Thanks.






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-05-01 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1585924926


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -152,27 +163,35 @@ public SortedSet 
getLicenseFamilies(LicenseFilter filter) {
  * @param filter define which type of licenses to return.
  * @return The sorted set of approved licenseIds.
  */
-public SortedSet getLicenseIds(LicenseFilter filter) {
+public SortedSet getLicenseIds(final LicenseFilter filter) {
 return setFactory.getLicenseFamilyIds(filter);
 }
+
+public static FilenameFilter getFilesToIgnore() {
+return FILES_TO_IGNORE;
+}
+
+public static IOFileFilter getDirectoriesToIgnore() {
+return DIRECTORIES_TO_IGNORE;
+}
 
 /**
  * The Defaults builder.
  */
-public static class Builder {
+public final static class Builder {
 private final Set fileNames = new 
TreeSet<>(Comparator.comparing(URL::toString));
 
 private Builder() {
 fileNames.add(DEFAULT_CONFIG_URL);
 }
 
 /**
- * Adds a URL to a configuration file to be read.
+ * Adds a URI to a configuration file to be read.

Review Comment:
   I opened ticket https://issues.apache.org/jira/browse/RAT-371 to deal with 
URL -> URI transition.






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-30 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1584567119


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -152,27 +163,35 @@ public SortedSet 
getLicenseFamilies(LicenseFilter filter) {
  * @param filter define which type of licenses to return.
  * @return The sorted set of approved licenseIds.
  */
-public SortedSet getLicenseIds(LicenseFilter filter) {
+public SortedSet getLicenseIds(final LicenseFilter filter) {
 return setFactory.getLicenseFamilyIds(filter);
 }
+
+public static FilenameFilter getFilesToIgnore() {
+return FILES_TO_IGNORE;
+}
+
+public static IOFileFilter getDirectoriesToIgnore() {
+return DIRECTORIES_TO_IGNORE;
+}
 
 /**
  * The Defaults builder.
  */
-public static class Builder {
+public final static class Builder {
 private final Set fileNames = new 
TreeSet<>(Comparator.comparing(URL::toString));
 
 private Builder() {
 fileNames.add(DEFAULT_CONFIG_URL);
 }
 
 /**
- * Adds a URL to a configuration file to be read.
+ * Adds a URI to a configuration file to be read.

Review Comment:
   Agree, I started to look at it but I think we need a ticket just to handle 
the URL -> URI conversion.






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-30 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1584556357


##
apache-rat-core/src/main/java/org/apache/rat/analysis/DefaultAnalyserFactory.java:
##
@@ -63,8 +60,8 @@ private final static class DefaultAnalyser implements 
IDocumentAnalyser {
 
 /**
  * Constructs a DocumentAnalyser for the specified license.
- * 
- * @param license The license to analyse
+ * @param log the Log to use
+ * @param licenses The license to analyse

Review Comment:
   typo: licenses to analyse






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-30 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1584553794


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -152,27 +163,35 @@ public SortedSet 
getLicenseFamilies(LicenseFilter filter) {
  * @param filter define which type of licenses to return.
  * @return The sorted set of approved licenseIds.
  */
-public SortedSet getLicenseIds(LicenseFilter filter) {
+public SortedSet getLicenseIds(final LicenseFilter filter) {
 return setFactory.getLicenseFamilyIds(filter);
 }
+
+public static FilenameFilter getFilesToIgnore() {
+return FILES_TO_IGNORE;
+}
+
+public static IOFileFilter getDirectoriesToIgnore() {
+return DIRECTORIES_TO_IGNORE;
+}
 
 /**
  * The Defaults builder.
  */
-public static class Builder {
+public final static class Builder {
 private final Set fileNames = new 
TreeSet<>(Comparator.comparing(URL::toString));
 
 private Builder() {
 fileNames.add(DEFAULT_CONFIG_URL);
 }
 
 /**
- * Adds a URL to a configuration file to be read.
+ * Adds a URI to a configuration file to be read.

Review Comment:
   URI or URL? SpotBugs complains that URL comparison may lead to performance
issues, as each URL is resolved/queried via DNS. Simply changing to URI is
only possible with more changes within RAT.
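
   The concern above can be sketched in a few lines. `java.net.URL.equals` and
`hashCode` may resolve host names, whereas a `TreeSet` ordered by
`Comparator.comparing(URL::toString)` (as in the diff) or a `java.net.URI`
compares text only. The class and method names below are hypothetical and are
not RAT code; this is an illustration under those assumptions.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not part of RAT.
final class UrlOrderingSketch {

    // A sorted set that orders URLs by their string form. TreeSet relies on the
    // comparator only, so URL.equals/hashCode (and any host lookup) is never called.
    static Set<URL> newUrlSet() {
        return new TreeSet<>(Comparator.comparing(URL::toString));
    }

    // Adds the given absolute URL strings to the set and reports how many are distinct.
    static int countDistinct(String... specs) {
        Set<URL> urls = newUrlSet();
        for (String spec : specs) {
            try {
                urls.add(URI.create(spec).toURL());
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Not a usable URL: " + spec, e);
            }
        }
        return urls.size();
    }

    // URI equality is a textual comparison of the URI components; no network access.
    static boolean sameUri(String left, String right) {
        return URI.create(left).equals(URI.create(right));
    }

    public static void main(String[] args) {
        System.out.println(countDistinct("file:/tmp/a.xml", "file:/tmp/a.xml", "file:/tmp/b.xml"));
        System.out.println(sameUri("file:/tmp/a.xml", "file:/tmp/a.xml"));
    }
}
```

   Ordering by string form sidesteps the lookup without changing the element
type, which may be why it is the smaller change compared with a full switch to
URI.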






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-29 Thread via GitHub


Claudenw commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2081968269

   I extracted the Tika processing to its own class.
   I added the Tika `MediaType` to our metadata.
   The process now assumes that all "text/*" media types are `STANDARD`
documents;
   everything else is `BINARY` unless it is listed in the
`documentTypeMap`, which is now in the `TikaProcessor` class.
   
   You will notice that `application/json` is listed in the `documentTypeMap`.
This is a stupid move on my part and will be removed.  This will mean that we
can then remove the `WildcardFileName` filter for *.json from the
`filesToIgnore` filter in `ReportConfiguration`.  I will fix this soon.
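
   The classification rule described above could be sketched roughly as
follows. This is not the actual `TikaProcessor`: it works on plain media type
strings rather than Tika's `MediaType`, and the class name, the enum, and the
`application/xml` map entry are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the rule described in the comment; not RAT's TikaProcessor.
final class MediaTypeClassifierSketch {

    // Only the two document types named in the comment are modelled here.
    enum DocType { STANDARD, BINARY }

    // Explicit overrides for non-text media types. The entry is illustrative.
    private static final Map<String, DocType> DOCUMENT_TYPE_MAP = new HashMap<>();
    static {
        DOCUMENT_TYPE_MAP.put("application/xml", DocType.STANDARD);
    }

    // text/* is STANDARD; anything else is BINARY unless the map says otherwise.
    static DocType classify(String mediaType) {
        String normalized = mediaType.toLowerCase(Locale.ROOT);
        if (normalized.startsWith("text/")) {
            return DocType.STANDARD;
        }
        return DOCUMENT_TYPE_MAP.getOrDefault(normalized, DocType.BINARY);
    }

    public static void main(String[] args) {
        System.out.println(classify("text/plain"));
        System.out.println(classify("image/png"));
        System.out.println(classify("application/xml"));
    }
}
```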





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-28 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1582370881


##
src/changes/changes.xml:
##
@@ -72,6 +72,22 @@ 
https://maven.apache.org/plugins/maven-changes-plugin/xsd/changes-1.0.0.xsd
 
 -->
 
+  
+MIME Detection Using Tika
+  
+  
+Changed to detecting binary by content not name.
+  
+  
+Change to detect non UTF-8 text as text not binary.

Review Comment:
   
https://github.com/apache/linkis/blob/master/linkis-public-enhancements/linkis-pes-common/src/main/java/org/apache/linkis/udf/entity/UDFVersion.java






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-28 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1582029861


##
src/changes/changes.xml:
##
@@ -72,6 +72,22 @@ 
https://maven.apache.org/plugins/maven-changes-plugin/xsd/changes-1.0.0.xsd
 
 -->
 
+  
+MIME Detection Using Tika
+  
+  
+Changed to detecting binary by content not name.
+  
+  
+Change to detect non UTF-8 text as text not binary.

Review Comment:
   147 is about improving the guesser so that non-UTF-8 text is not detected
as binary.
   301 is about extended characters (Chinese in the report) that are UTF-8
encoded being detected as binary.
   
   They are related but not the same.  I was waiting for an example to test 301
with.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@creadur.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-28 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1582028644


##
apache-rat-core/src/main/java/org/apache/rat/walker/Walker.java:
##
@@ -33,38 +34,32 @@ public abstract class Walker implements IReportable {
 protected final File file;
 protected final String name;
 
-protected final FilenameFilter filter;
+protected final FilenameFilter filesToIgnore;
 
 protected static FilenameFilter regexFilter(final Pattern pattern) {
 return (dir, name) -> {
 final boolean result;
 if (pattern == null) {
-result = true;
+result = false;
 } else {
-result = !pattern.matcher(name).matches();
+result = pattern.matcher(name).matches();

Review Comment:
   I will go back and double check that this is correct.  However, I did change
the nomenclature to be specific: the name `filesToIgnore` should indicate
that if the pattern is matched the file is excluded (same with directories).
Previously the name was `fileFilter`, and with that it is difficult to know
whether you are including or excluding files in the filter.
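
   The semantics described above can be sketched as follows: a filter named
`filesToIgnore` accepts a name exactly when the file should be skipped, and a
missing pattern ignores nothing. The class name and the `shouldProcess` helper
are hypothetical, added only to show how a caller would read the result.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Walker class itself.
final class IgnoreFilterSketch {

    // accept(...) == true means "ignore this file".
    // A null pattern matches nothing, so no file is ignored.
    static FilenameFilter filesToIgnore(final Pattern pattern) {
        return (dir, name) -> pattern != null && pattern.matcher(name).matches();
    }

    // Caller side: a file is processed only when the ignore filter rejects it.
    static boolean shouldProcess(FilenameFilter filesToIgnore, File dir, String name) {
        return !filesToIgnore.accept(dir, name);
    }

    public static void main(String[] args) {
        FilenameFilter ignoreJson = filesToIgnore(Pattern.compile(".*\\.json"));
        File dir = new File(".");
        System.out.println(shouldProcess(ignoreJson, dir, "data.json"));
        System.out.println(shouldProcess(ignoreJson, dir, "Main.java"));
    }
}
```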






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-28 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1582027702


##
apache-rat-core/src/main/java/org/apache/rat/report/claim/ClaimStatistic.java:
##
@@ -57,45 +58,71 @@ public int getCounter(Counter counter) {
 return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
- */
-public Map getCounterMap() {
-return counterMap;
+public void incCounter(Counter key, int value) {
+final int[] num = counterMap.get(key);
+
+if (num == null) {
+counterMap.put(key, new int[] { value });
+} else {
+num[0] += value;

Review Comment:
   I had not thought about atomic properties, but I did think the array
solution is no longer required.  I think I have a solution using a concurrent
map and an `IncrementableInt` class.  I will put that together shortly.
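
   One possible shape for such a counter, using only standard library classes,
is sketched below. It uses `AtomicInteger` as a stand-in for the proposed
`IncrementableInt` class, whose design is not shown in this thread; the class
name and the enum constants are likewise assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the ClaimStatistic class itself.
final class CounterSketch {

    // Illustrative counter keys; the real Counter type is not shown in the diff.
    enum Counter { APPROVED, UNAPPROVED }

    private final ConcurrentMap<Counter, AtomicInteger> counterMap = new ConcurrentHashMap<>();

    // computeIfAbsent creates the counter at most once per key, and addAndGet
    // updates it atomically, so no int[] holder or null check is needed.
    void incCounter(Counter key, int value) {
        counterMap.computeIfAbsent(key, k -> new AtomicInteger()).addAndGet(value);
    }

    // A key that was never incremented reports zero.
    int getCounter(Counter key) {
        AtomicInteger count = counterMap.get(key);
        return count == null ? 0 : count.get();
    }

    public static void main(String[] args) {
        CounterSketch stats = new CounterSketch();
        stats.incCounter(Counter.APPROVED, 2);
        stats.incCounter(Counter.APPROVED, 3);
        System.out.println(stats.getCounter(Counter.APPROVED));
        System.out.println(stats.getCounter(Counter.UNAPPROVED));
    }
}
```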






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-28 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1582027540


##
apache-rat-core/src/main/java/org/apache/rat/ReportConfiguration.java:
##
@@ -179,31 +177,31 @@ public boolean isDryRun() {
 /**
  * @return The filename filter for the potential input files.
  */
-public FilenameFilter getInputFileFilter() {
-return inputFileFilter;
+public FilenameFilter getFilesToIgnore() {
+return filesToIgnore;
 }
 
 /**
- * @param inputFileFilter the filename filter to filter the input files.
+ * @param filesToIgnore the filename filter to filter the input files.
  */
-public void setInputFileFilter(FilenameFilter inputFileFilter) {
-this.inputFileFilter = inputFileFilter;
+public void setFilesToIgnore(FilenameFilter filesToIgnore) {
+this.filesToIgnore = filesToIgnore;
 }
 
-public IOFileFilter getDirectoryFilter() {
-return directoryFilter;
+public IOFileFilter getDirectoriesToIgnore() {
+return directoriesToIgnore;
 }
 
-public void setDirectoryFilter(IOFileFilter directoryFilter) {
-if (directoryFilter == null) {
-this.directoryFilter = FalseFileFilter.FALSE;
+public void setDirectoriesToIgnore(IOFileFilter directoriesToIgnore) {

Review Comment:
   ReportConfiguration is the only place where this is handled, so that
wherever we need directoriesToIgnore it is already set and no null check is
needed.  This allows a UI to set directoriesToIgnore to null (the old
setting) while still guaranteeing that directoriesToIgnore has a value.  In
future we might change the UIs to require that they send a defined static value
to signal no directories, but this is the fix that does not modify the UIs.
   
   This also applies to filesToIgnore.
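
   The null-normalising setter described above could look roughly like this.
The sketch uses `java.io.FilenameFilter` and a hypothetical `IGNORE_NOTHING`
constant as a stand-in for whatever static "match nothing" filter the real
configuration uses; the class name is also an assumption.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical sketch, not the ReportConfiguration class itself.
final class ConfigSketch {

    // Stand-in for a static "ignore nothing" filter: it never accepts a name.
    static final FilenameFilter IGNORE_NOTHING = (dir, name) -> false;

    private FilenameFilter filesToIgnore = IGNORE_NOTHING;

    // A null argument (the legacy way for a UI to say "no filter") is replaced
    // with IGNORE_NOTHING, so the field is never null.
    void setFilesToIgnore(FilenameFilter filesToIgnore) {
        this.filesToIgnore = filesToIgnore == null ? IGNORE_NOTHING : filesToIgnore;
    }

    // Callers can use the result directly without a null check.
    FilenameFilter getFilesToIgnore() {
        return filesToIgnore;
    }

    public static void main(String[] args) {
        ConfigSketch config = new ConfigSketch();
        config.setFilesToIgnore(null);
        System.out.println(config.getFilesToIgnore().accept(new File("."), "anything.txt"));
    }
}
```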






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-28 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1582027053


##
apache-rat-core/src/main/java/org/apache/rat/Report.java:
##
@@ -452,11 +452,11 @@ private static IReportable getDirectory(String 
baseDirectory, ReportConfiguratio
 }
 
 if (base.isDirectory()) {
-return new DirectoryWalker(base, config.getInputFileFilter(), 
config.getDirectoryFilter());
+return new DirectoryWalker(base, config.getFilesToIgnore(), 
config.getDirectoriesToIgnore());

Review Comment:
   I think that the ReportConfiguration class needs to be reworked into 
enclosed objects.  The enclosed objects would be something like ScanDefault.  
However, I also think that the definition of the objects needs to be based on 
the structure of the command line client, and that the command line client 
needs to be reworked to have better names for the options that signify what 
part of the config is being set.  The names are currently too short to be 
meaningful.






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


Claudenw commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1582025702


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -57,6 +62,10 @@ public class Defaults {
 public static final String UNAPPROVED_LICENSES_STYLESHEET = 
"org/apache/rat/unapproved-licenses.xsl";
 
 private final LicenseSetFactory setFactory;
+
+private final FilenameFilter filesToIgnore = 
WildcardFileFilter.builder().setWildcards("*.json").setIoCase(IOCase.INSENSITIVE).get();
+
+private final IOFileFilter directoriesToIgnore = 
NameBasedHiddenFileFilter.HIDDEN;

Review Comment:
   Defaults is intended to be the System defaults for the ReportConfiguration.  
There are some cases where the report option has to be set by the UI before the 
Defaults can be tested.  And there is a flag for no defaults so the Defaults 
need to be specified outside of the ReportConfiguration initialization.  
   
   I changed the description of Defaults and updated the values and methods to 
be static.
   
   I also added a checklist at the top of this ticket to track the things we
need to update, as I suspect it is going to get longish.  Feel free to add to
it.  I think items on the list can be closed if we account for them in this
change or open a ticket to track them for a new change.






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906723


##
apache-rat-core/src/main/java/org/apache/rat/report/claim/ClaimStatistic.java:
##
@@ -57,45 +58,71 @@ public int getCounter(Counter counter) {
 return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
- */
-public Map getCounterMap() {
-return counterMap;
+public void incCounter(Counter key, int value) {
+final int[] num = counterMap.get(key);
+
+if (num == null) {
+counterMap.put(key, new int[] { value });
+} else {
+num[0] += value;
+}
 }
 
-
 /**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
+ * Returns the counts for the counter.
+ * @param documentType the document type to get the counter for.
+ * @return Returns the number of files with approved licenses.
  */
-public Map getDocumentCategoryMap() {
-return documentCategoryMap;
+public int getCounter(Document.Type documentType) {
+int[] count = documentCategoryMap.get(documentType);
+return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the license family codes. The map
- * keys are license family category names,
- * the map values are integers with the number of resources
- * matching the license family code.
- */
-public Map getLicenseFamilyCodeMap() {
-return licenseFamilyCodeMap;
+public void incCounter(Document.Type documentType, int value) {
+final int[] num = documentCategoryMap.get(documentType);
+
+if (num == null) {
+documentCategoryMap.put(documentType, new int[] { value });
+} else {
+num[0] += value;
+}
 }
 
-/**
- * @return Returns a map with the license family codes. The map
- * keys are the names of the license families and
- * the map values are integers with the number of resources
- * matching the license family name.
- */
-public Map getLicenseFileNameMap() {
-return licenseFamilyNameMap;
+public int getLicenseFamilyCount(String licenseFamilyName) {
+int[] count = licenseFamilyCodeMap.get(licenseFamilyName);
+return count == null ? 0 : count[0];
+}
+
+public void incLicenseFamilyCount(String licenseFamilyName, int value) {
+final int[] num = licenseFamilyCodeMap.get(licenseFamilyName);
+
+if (num == null) {
+licenseFamilyCodeMap.put(licenseFamilyName, new int[] { value });
+} else {
+num[0] += value;
+}
 }
 
+public Set getLicenseFamilyNames() {
+return Collections.unmodifiableSet(licenseFamilyCodeMap.keySet());
+}
+
+public Set getLicenseFileNames() {
+return Collections.unmodifiableSet(licenseFamilyNameMap.keySet());
+}
+
+public int getLicenseFileNameCount(String licenseFilename) {
+int[] count = licenseFamilyNameMap.get(licenseFilename);
+return count == null ? 0 : count[0];
+}
+
+public void incLicenseFileNameCount(String licenseFileNameName, int value) 
{
+final int[] num = licenseFamilyNameMap.get(licenseFileNameName);
+
+if (num == null) {
+licenseFamilyNameMap.put(licenseFileNameName, new int[] { value });
+} else {
+num[0] += value;

Review Comment:
   AtomicInteger?



##
apache-rat-core/src/main/java/org/apache/rat/walker/DirectoryWalker.java:
##
@@ -38,56 +38,42 @@ public class DirectoryWalker extends Walker implements 
IReportable {
 
 private static final FileNameComparator COMPARATOR = new 
FileNameComparator();
 
-private final IOFileFilter directoryFilter;
-
-/**
- * Constructs a walker.
- *
- * @param file the directory to walk.
- * @param directoryFilter directory filter to eventually exclude some 
directories/files from the scan.
- */
-public DirectoryWalker(File file, IOFileFilter directoryFilter) {
-this(file, (FilenameFilter) null, directoryFilter);
-}
+private final IOFileFilter directoriesToIgnore;
 
 /**
  * Constructs a walker.
  *
  * @param file the directory to walk (not null).
- * @param filter filters input files (optional),
+ * @param filesToIgnore filters input files (optional),
  *   or null when no filtering should be performed
- * @param directoryFilter filters directories (optional), or null when no 
filtering should be performed.
+ * @param directoriesToIgnore filters directories (optional), or null when 
no 

Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906646


##
apache-rat-core/src/main/java/org/apache/rat/report/claim/ClaimStatistic.java:
##
@@ -57,45 +58,71 @@ public int getCounter(Counter counter) {
 return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
- */
-public Map getCounterMap() {
-return counterMap;
+public void incCounter(Counter key, int value) {
+final int[] num = counterMap.get(key);
+
+if (num == null) {
+counterMap.put(key, new int[] { value });
+} else {
+num[0] += value;
+}
 }
 
-
 /**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
+ * Returns the counts for the counter.
+ * @param documentType the document type to get the counter for.
+ * @return Returns the number of files with approved licenses.
  */
-public Map getDocumentCategoryMap() {
-return documentCategoryMap;
+public int getCounter(Document.Type documentType) {
+int[] count = documentCategoryMap.get(documentType);
+return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the license family codes. The map
- * keys are license family category names,
- * the map values are integers with the number of resources
- * matching the license family code.
- */
-public Map getLicenseFamilyCodeMap() {
-return licenseFamilyCodeMap;
+public void incCounter(Document.Type documentType, int value) {
+final int[] num = documentCategoryMap.get(documentType);
+
+if (num == null) {
+documentCategoryMap.put(documentType, new int[] { value });
+} else {
+num[0] += value;

Review Comment:
   AtomicInteger to be threadsafe?






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906606


##
apache-rat-core/src/main/java/org/apache/rat/report/claim/ClaimStatistic.java:
##
@@ -57,45 +58,71 @@ public int getCounter(Counter counter) {
 return count == null ? 0 : count[0];
 }
 
-/**
- * @return Returns a map with the file types. The map keys
- * are file type names and the map values
- * are integers with the number of resources matching
- * the file type.
- */
-public Map getCounterMap() {
-return counterMap;
+public void incCounter(Counter key, int value) {
+final int[] num = counterMap.get(key);
+
+if (num == null) {
+counterMap.put(key, new int[] { value });
+} else {
+num[0] += value;

Review Comment:
   Are we running into problems here when we do multithreaded analysis? Do we 
need AtomicInteger here?
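
A minimal sketch of what the thread-safe variant asked about above might look like, assuming the counters are kept in a ConcurrentHashMap with AtomicInteger values instead of int[] cells. The class name CounterSketch and its generic key are hypothetical and only stand in for the maps in ClaimStatistic; this is not the PR's code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the int[]-based counter maps in ClaimStatistic. */
class CounterSketch<K> {
    private final Map<K, AtomicInteger> counters = new ConcurrentHashMap<>();

    /** Atomically adds value to the counter for key, creating the counter if absent. */
    void incCounter(K key, int value) {
        counters.computeIfAbsent(key, k -> new AtomicInteger()).addAndGet(value);
    }

    /** Returns the current count, or 0 if the key was never incremented. */
    int getCounter(K key) {
        AtomicInteger count = counters.get(key);
        return count == null ? 0 : count.get();
    }
}
```

With this shape the null check in the increment path disappears, because computeIfAbsent creates the counter on first use. Whether it is needed at all depends on whether the analysis really runs on several threads.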






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906451


##
apache-rat-core/src/main/java/org/apache/rat/ReportConfiguration.java:
##
@@ -179,31 +177,31 @@ public boolean isDryRun() {
 /**
  * @return The filename filter for the potential input files.
  */
-public FilenameFilter getInputFileFilter() {
-return inputFileFilter;
+public FilenameFilter getFilesToIgnore() {
+return filesToIgnore;
 }
 
 /**
- * @param inputFileFilter the filename filter to filter the input files.
+ * @param filesToIgnore the filename filter to filter the input files.
  */
-public void setInputFileFilter(FilenameFilter inputFileFilter) {
-this.inputFileFilter = inputFileFilter;
+public void setFilesToIgnore(FilenameFilter filesToIgnore) {
+this.filesToIgnore = filesToIgnore;
 }
 
-public IOFileFilter getDirectoryFilter() {
-return directoryFilter;
+public IOFileFilter getDirectoriesToIgnore() {
+return directoriesToIgnore;
 }
 
-public void setDirectoryFilter(IOFileFilter directoryFilter) {
-if (directoryFilter == null) {
-this.directoryFilter = FalseFileFilter.FALSE;
+public void setDirectoriesToIgnore(IOFileFilter directoriesToIgnore) {

Review Comment:
   ScanDefault could have boolean hasFiles()/hasDirectories() methods to not 
have to handle null values while interacting with the scan configuration






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906153


##
apache-rat-core/src/main/java/org/apache/rat/Report.java:
##
@@ -452,11 +452,11 @@ private static IReportable getDirectory(String 
baseDirectory, ReportConfiguratio
 }
 
 if (base.isDirectory()) {
-return new DirectoryWalker(base, config.getInputFileFilter(), 
config.getDirectoryFilter());
+return new DirectoryWalker(base, config.getFilesToIgnore(), 
config.getDirectoriesToIgnore());

Review Comment:
   DirectoryWalker would also benefit from a separate class ScanDefault. 
WDYT?






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581906050


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -57,6 +62,10 @@ public class Defaults {
 public static final String UNAPPROVED_LICENSES_STYLESHEET = 
"org/apache/rat/unapproved-licenses.xsl";
 
 private final LicenseSetFactory setFactory;
+
+private final FilenameFilter filesToIgnore = 
WildcardFileFilter.builder().setWildcards("*.json").setIoCase(IOCase.INSENSITIVE).get();
+
+private final IOFileFilter directoriesToIgnore = 
NameBasedHiddenFileFilter.HIDDEN;

Review Comment:
   Should we create a new class for these 2 members? A ScanDefault that contains 
the files and directories to ignore?
   
   ScanDefault {
   List filesToIgnore; // wildcards
   List directoriesToIgnore; // not sure if IOFileFilter is the 
correct superclass
   }
   
   WDYT?
   
   This would allow us to have a static version of this configuration set in 
Defaults.java, such as
   
   public static ScanDefault RAT_DEFAULT_SCAN = new 
ScanDefault(List.of("*.json"), List.of(NameBasedHiddenFileFilter.HIDDEN));
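
To make the proposal a bit more concrete, here is a rough sketch of such a holder. It is only an illustration: ScanDefault does not exist in the code base, java.io.FileFilter is used as a dependency-free stand-in for IOFileFilter, and the hasFiles()/hasDirectories() helpers are assumptions about what callers might want.

```java
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for the default exclusions; not part of the PR. */
final class ScanDefault {
    private final List<String> filesToIgnore;           // wildcard patterns
    private final List<FileFilter> directoriesToIgnore; // stand-in for IOFileFilter

    ScanDefault(List<String> filesToIgnore, List<FileFilter> directoriesToIgnore) {
        this.filesToIgnore = copyOrEmpty(filesToIgnore);
        this.directoriesToIgnore = copyOrEmpty(directoriesToIgnore);
    }

    /** Null input becomes an empty list so callers never have to handle null. */
    private static <T> List<T> copyOrEmpty(List<T> source) {
        if (source == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(new ArrayList<>(source));
    }

    boolean hasFiles() {
        return !filesToIgnore.isEmpty();
    }

    boolean hasDirectories() {
        return !directoriesToIgnore.isEmpty();
    }

    List<String> getFilesToIgnore() {
        return filesToIgnore;
    }

    List<FileFilter> getDirectoriesToIgnore() {
        return directoriesToIgnore;
    }
}
```

The defensive copies keep a shared static default immutable, which matters if one instance is exposed as a constant in Defaults.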






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581905166


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -57,6 +62,10 @@ public class Defaults {
 public static final String UNAPPROVED_LICENSES_STYLESHEET = 
"org/apache/rat/unapproved-licenses.xsl";
 
 private final LicenseSetFactory setFactory;
+
+private final FilenameFilter filesToIgnore = 
WildcardFileFilter.builder().setWildcards("*.json").setIoCase(IOCase.INSENSITIVE).get();
+
+private final IOFileFilter directoriesToIgnore = 
NameBasedHiddenFileFilter.HIDDEN;

Review Comment:
   Would it make sense to add these 2 special cases to the documentation?






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-27 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1581905166


##
apache-rat-core/src/main/java/org/apache/rat/Defaults.java:
##
@@ -57,6 +62,10 @@ public class Defaults {
 public static final String UNAPPROVED_LICENSES_STYLESHEET = 
"org/apache/rat/unapproved-licenses.xsl";
 
 private final LicenseSetFactory setFactory;
+
+private final FilenameFilter filesToIgnore = 
WildcardFileFilter.builder().setWildcards("*.json").setIoCase(IOCase.INSENSITIVE).get();
+
+private final IOFileFilter directoriesToIgnore = 
NameBasedHiddenFileFilter.HIDDEN;

Review Comment:
   Would it make sense to add these 2 special cases to the documentation?
   Didn't you (or JB) add a configuration option to scan for hidden files?






Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-26 Thread via GitHub


Claudenw commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2079782431

   Well this blew up to something bigger than I wanted but...
   
   I added a default exclusion for "*.json" files in the Defaults class and used 
that to configure the ReportConfiguration defaults.
   In the future the default exclusions should be configured in the Defaults class.
   
   I changed the parameter names and instance variables to "filesToIgnore" and 
"directoriesToIgnore" to make it clear what the filters were doing.
   
   I cleaned up the DirectoryWalker and Walker classes.
   
   I ensured that filesToIgnore and directoriesToIgnore always have a value 
(not null).  If set to null in the configuration, the value is translated into a 
filter that always returns false.
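
As an illustration of the null handling described above, a minimal sketch using only java.io.FilenameFilter. The class name IgnoreFilterSketch and the constant IGNORE_NOTHING are hypothetical; the directory setter in the diff uses FalseFileFilter.FALSE from commons-io for the same purpose, and a plain lambda stands in for it here to keep the example dependency-free.

```java
import java.io.FilenameFilter;

/** Hypothetical sketch of the "never null" setter pattern; not the PR's class. */
class IgnoreFilterSketch {
    /** Filter that ignores nothing: accept() is false for every file. */
    static final FilenameFilter IGNORE_NOTHING = (dir, name) -> false;

    private FilenameFilter filesToIgnore = IGNORE_NOTHING;

    /** A null argument is translated into the always-false filter. */
    void setFilesToIgnore(FilenameFilter filter) {
        this.filesToIgnore = filter == null ? IGNORE_NOTHING : filter;
    }

    /** Never returns null, so callers can invoke accept() directly. */
    FilenameFilter getFilesToIgnore() {
        return filesToIgnore;
    }
}
```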
   
   I think there may be an issue in the directoriesToIgnore processing, but it 
has been there from the beginning.  I will investigate and open a ticket if 
necessary.





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-26 Thread via GitHub


ottlinger commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2079291560

   Pls add a reference to all the old tickets in the changelog & thanks for 
taking care of the old tickets/bugs.





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-26 Thread via GitHub


Claudenw commented on PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#issuecomment-2078759677

   I am adding a file filter to remove JSON files.  Initially this will be 
hard-coded.  I will open a subsequent ticket to generalize it so that we can 
define a list of extensions to remove.  This will probably mean more command 
line options, so I'll have to open that can of worms as well.
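
For the follow-up ticket, a rough sketch of what a generalized, list-driven exclusion could look like. Everything here is an assumption for discussion: the class ExtensionFilterSketch and the method ignoring() are hypothetical, and the sketch matches on file-name suffix rather than reusing the commons-io wildcard filter from the PR.

```java
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: builds an exclusion filter from a list of extensions. */
class ExtensionFilterSketch {
    /**
     * Returns a filter whose accept() is true for any file name that ends
     * with one of the given extensions, compared case-insensitively.
     */
    static FilenameFilter ignoring(List<String> extensions) {
        final List<String> suffixes = new ArrayList<>();
        for (String ext : extensions) {
            suffixes.add("." + ext.toLowerCase(Locale.ROOT));
        }
        return (dir, name) -> {
            String lower = name.toLowerCase(Locale.ROOT);
            for (String suffix : suffixes) {
                if (lower.endsWith(suffix)) {
                    return true;
                }
            }
            return false;
        };
    }
}
```

A command line option could then feed a comma-separated list straight into ignoring(), with "json" as the built-in default.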





Re: [PR] RAT-54: Tika based document analyzer [creadur-rat]

2024-04-25 Thread via GitHub


ottlinger commented on code in PR #240:
URL: https://github.com/apache/creadur-rat/pull/240#discussion_r1580070068


##
apache-rat-core/src/main/java/org/apache/rat/api/Document.java:
##
@@ -33,47 +36,416 @@ public interface Document {
  */
 enum Type {
 /** A generated document. */
-GENERATED, 
+GENERATED,
 /** An unknown document type. */
 UNKNOWN,
 /** An archive type document. */
-ARCHIVE, 
+ARCHIVE,
 /** A notice document (e.g. LICENSE file) */
 NOTICE,
 /** A binary file */
 BINARY,
 /** A standard document */
-STANDARD}
+STANDARD;
+
+public static Map documentTypeMap;
+
+public static Type fromContentType(String documentType, Log log) {
+Type result = documentTypeMap.get(documentType);
+if (result == null) {
+log.warn(String.format("Please open a Jira ticket with the 
subject: 'Unknown media type %s in Document.Type'", documentType));
+return UNKNOWN;
+}
+return result;
+}
+
+/*
+ * https://tika.apache.org/3.0.0-BETA/formats.html 
+ */
+static {
+documentTypeMap = new HashMap<>();

Review Comment:
   I created RAT-370 for that.


