date:20220125

Build failed in Jenkins: ManifoldCF » ManifoldCF-ant-1x #37

2022-01-25 Thread Apache Jenkins Server

See 


Changes:


--
Started by an SCM change
Running as SYSTEM
[EnvInject] - Loading node environment variables.
Building remotely on builds35 (ubuntu) in workspace 

Updating https://svn.apache.org/repos/asf/manifoldcf/branches/dev_1x at revision '2022-01-26T02:04:07.392 +'
At revision 1897480

[ManifoldCF-ant-1x] $ ant clean-core-deps make-core-deps clean
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/tools/ant/launch/Launcher : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Build step 'Invoke Ant' marked build as failure
Archiving artifacts
Publishing Javadoc
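
The error above says the Ant launcher class uses class-file version 52.0, which the JVM on the build node does not support. Major version 52 corresponds to Java 8, so the node is running an older JVM. As a minimal sketch (the class and method names here are illustrative, not part of Ant or Jenkins), the version can be read from the first eight bytes of any class file:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ClassVersion {
    /** Reads the major version from a class-file header (magic, minor, major). */
    public static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("Not a class file: bad magic number");
        }
        data.readUnsignedShort(); // minor version, not needed here
        return data.readUnsignedShort();
    }

    /** Maps a major version to a Java release; meaningful for major >= 49 (Java 5). */
    public static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) throws IOException {
        // Hand-built header standing in for a real .class file: version 52.0.
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        int major = majorVersion(new ByteArrayInputStream(header));
        System.out.println("major " + major + " -> Java " + javaRelease(major));
    }
}
```

Pointing the same method at a `FileInputStream` for a class extracted from the Ant jar would show which Java release the build node needs at minimum.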

[jira] [Commented] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector

2022-01-25 Thread Karl Wright (Jira)



[ 
https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482114#comment-17482114
 ] 

Karl Wright commented on CONNECTORS-1695:
-

The interestingMimeTypes are mime types that may be HTML or XHTML, or documents 
where links cannot be extracted but where the content can be indexed.  What 
actually happens to a given document depends on what is actually in it rather 
than what the mime type says.  The mimetypes are a crude filter only.

The code that we'd want to modify would be the extractLinks code:

{code}
// Now, extract links.
// We'll call the "link extractor" series, so we can plug more stuff in over time.
boolean indexDocument = extractLinks(documentIdentifier,activities,filter);
{code}

The code for this method is:

{code}
  /** Code to extract links from an already-fetched document. */
  protected boolean extractLinks(String documentIdentifier, IProcessActivity activities, DocumentURLFilter filter)
    throws ManifoldCFException, ServiceInterruption
  {
    ProcessActivityRedirectionHandler redirectHandler = new ProcessActivityRedirectionHandler(documentIdentifier,activities,filter);
    handleRedirects(documentIdentifier,redirectHandler);
    if (Logging.connectors.isDebugEnabled() && redirectHandler.shouldIndex() == false)
      Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of redirection");

    // For html, we don't want any actions, because we don't do form submission.
    ProcessActivityHTMLHandler htmlHandler = new ProcessActivityHTMLHandler(documentIdentifier,activities,filter,metaRobotsTagsUsage);
    handleHTML(documentIdentifier,htmlHandler);
    if (Logging.connectors.isDebugEnabled() && htmlHandler.shouldIndex() == false)
      Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of HTML robots or content tags prohibiting indexing");

    ProcessActivityXMLHandler xmlHandler = new ProcessActivityXMLHandler(documentIdentifier,activities,filter);
    handleXML(documentIdentifier,xmlHandler);
    if (Logging.connectors.isDebugEnabled() && xmlHandler.shouldIndex() == false)
      Logging.connectors.debug("Web: Not indexing document '"+documentIdentifier+"' because of XML robots or content tags prohibiting indexing");

    // May add more later for other extraction tasks.
    return htmlHandler.shouldIndex() && redirectHandler.shouldIndex() && xmlHandler.shouldIndex();
  }
{code}

Note that there are three different parsing attempts made: HTML, XML (which I 
believe is RSS feeds only at this point), and redirection pages.  You could add a 
fourth.

Most of these invoke the fuzzyml parser, which is a bottom-up parser with 
overrides for specific interesting tags.  Even though the sitemap xml is 
supposedly well formed, you wouldn't want to bet on it, and the fuzzyml parser 
would be a reasonable technology to do parsing of this kind since it is quite 
resilient against syntax errors of all kinds.

So the trick would be to identify the tag structure of a sitemap document and 
extend the overrides present for the parser the web connector is using to 
understand that xml syntax IN ADDITION TO the xhtml it already understands.
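
To make the tag structure concrete: in the sitemaps.org 0.9 format, both sitemap index files and URL sets carry their target URLs in <loc> elements. The sketch below is only an illustration of tolerant extraction. It uses a plain regular expression as a stand-in, NOT the fuzzyml parser, and the class and method names are hypothetical rather than part of ManifoldCF:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SitemapLocExtractor {
    // Matches <loc>...</loc> regardless of case or surrounding structure,
    // so unbalanced outer tags do not prevent extraction.
    private static final Pattern LOC = Pattern.compile(
        "<loc\\b[^>]*>(.*?)</loc\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed contents of every <loc> element, in document order. */
    public static List<String> extractLocs(String xml) {
        List<String> urls = new ArrayList<>();
        Matcher m = LOC.matcher(xml);
        while (m.find()) {
            String url = m.group(1).trim();
            if (!url.isEmpty()) {
                // Only the ampersand entity is unescaped in this sketch.
                urls.add(url.replace("&amp;", "&"));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Deliberately malformed: the closing </urlset> tag is missing.
        String xml = "<urlset><url><loc> https://example.com/a </loc></url>"
                   + "<url><loc>https://example.com/b</loc></url>";
        for (String url : extractLocs(xml)) {
            System.out.println(url);
        }
    }
}
```

In the connector itself, the equivalent logic would presumably live in parser tag overrides and hand each discovered URL to the handler as a link reference, which this sketch does not attempt to model.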


> Sitemap xml not detected in version 2.17 webconnector
> -
>
> Key: CONNECTORS-1695
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
> Project: ManifoldCF
>  Issue Type: Bug

[jira] [Commented] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector

2022-01-25 Thread DK (Jira)



[ 
https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481863#comment-17481863
 ] 

DK commented on CONNECTORS-1695:


In that case, what is the significance of 'interestingMimeType'? As per the 
defect related to application/xml, it was missing from that variable and was 
added.

My understanding was that the web connector would treat a sitemap as special, 
pull the individual URLs, and submit the HTML to Solr for indexing.

If that is not the case, can we say ManifoldCF does not support sitemap 
indexing? And what would it take to add the support? I am willing to help.

> Sitemap xml not detected in version 2.17 webconnector
> -
>
> Key: CONNECTORS-1695
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.17
>Reporter: DK
>Priority: Major
>
> When trying to index a sitemap XML, the web connector indexes the whole XML into Solr.
> Please fix in version 2.17.
> If there is any special config that needs to be taken care of, please add it here and 
> add it to the documentation to make it clear.
>  
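> Sitemap.xml:
> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
> <sitemap>
> <loc>https:///sitemap_1.xml</loc>
> <lastmod>2022-01-21T16:04:45Z</lastmod>
> </sitemap>
> </sitemapindex>
>  
> sitemap_1.xml:
> <urlset>
> <url>
> <loc>https://</loc>
> <lastmod>2018-10-31T11:25:27Z</lastmod>
> </url>
> </urlset>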



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Created] (CONNECTORS-1696) MariaDB JDBC Driver 2.7.5 not working with AWS Aurora

2022-01-25 Thread Markus Schuch (Jira)

Markus Schuch created CONNECTORS-1696:
-

 Summary: MariaDB JDBC Driver 2.7.5 not working with AWS Aurora
 Key: CONNECTORS-1696
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1696
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework core
Reporter: Markus Schuch


With 2.7.5, they changed how they detect whether the database is AWS Aurora:
 * [https://jira.mariadb.org/browse/CONJ-824]
 * [https://github.com/mariadb-corporation/mariadb-connector-j/commit/a3cf53117614a8a706ef0f62c57bc1801eeeb374#diff-5a3e7974abae96c88e31e1d3a815f89e31157a5eaaff66c71a7d290c382b5fc4R360]

This causes the activation of usePipelineAuth and useBatchMultiSend, which are 
not supported by AWS Aurora.

We need to provide a special database interface implementation that sets the JDBC 
URL correctly for Aurora.
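
One way such an implementation might set the URL is to turn both options off explicitly, so the result no longer depends on the driver's Aurora detection. This is a sketch under assumptions: the helper class is hypothetical, the host name is a placeholder, and the second option is spelled `useBatchMultiSend` as in the Connector/J 2.x documentation. It only builds the URL string and does not open a connection:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AuroraJdbcUrl {
    /** Appends options that disable the features Aurora does not support. */
    public static String withAuroraSafeOptions(String jdbcUrl) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("usePipelineAuth", "false");
        options.put("useBatchMultiSend", "false");

        StringBuilder url = new StringBuilder(jdbcUrl);
        // Start the query string, or extend an existing one.
        char separator = jdbcUrl.indexOf('?') >= 0 ? '&' : '?';
        for (Map.Entry<String, String> option : options.entrySet()) {
            url.append(separator)
               .append(option.getKey())
               .append('=')
               .append(option.getValue());
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Placeholder host and database names, for illustration only.
        System.out.println(withAuroraSafeOptions(
            "jdbc:mariadb://aurora-host.example.com:3306/manifoldcf"));
    }
}
```

Whether ManifoldCF's existing database interface exposes a suitable hook for rewriting the URL is not stated in this issue, so the integration point is left open here.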



