Re: [VOTE] Release Apache ManifoldCF 2.0.2, RC0

2015-02-27 Thread Erlend Garåsen


Local tests are running fine, but there is a problem with a table which 
is not properly installed on our Resin Deployment server.


I guess the following command should install the needpriority table? No 
errors are shown when running this command.


$MCF_HOME/executecommand.sh org.apache.manifoldcf.agents.Install

But the following does not register properly:

$MCF_HOME/executecommand.sh org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent


ERROR: column needpriority does not exist

The complete output of the first command is shown below (the output from 
the second command, with the stack trace, follows after it):


Configuration file successfully read
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:host.name=solr-test02.uio.no
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.7.0_75
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Oracle Corporation
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usit/solr-test02/www/java/jdk1.7.0_75-x86_64/jre
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=.:../lib/mcf-core.jar:../lib/mcf-agents.jar:../lib/mcf-pull-agent.jar:../lib/hsqldb-2.3.2.jar:../lib/postgresql-9.1-901.jdbc4.jar:../lib/commons-codec-1.9.jar:../lib/commons-collections-3.2.1.jar:../lib/commons-discovery-0.5.jar:../lib/commons-el-1.0.jar:../lib/commons-fileupload-1.2.2.jar:../lib/commons-io-2.1.jar:../lib/commons-lang-2.6.jar:../lib/commons-logging-1.1.3.jar:../lib/ecj-4.3.1.jar:../lib/httpclient-4.3.5.jar:../lib/httpcore-4.3.2.jar:../lib/jasper-6.0.35.jar:../lib/jasper-el-6.0.35.jar:../lib/javax.servlet-api-3.1.0.jar:../lib/json-20090211.jar:../lib/json-simple-1.1.jar:../lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:../lib/juli-6.0.35.jar:../lib/log4j-1.2.16.jar:../lib/mail-1.4.5.jar:../lib/serializer-2.7.1.jar:../lib/slf4j-api-1.7.7.jar:../lib/slf4j-simple-1.7.7.jar:../lib/velocity-1.7.jar:../lib/xalan-2.7.1.jar:../lib/xercesImpl-2.10.0.jar:../lib/xml-apis-1.4.01.jar:../lib/zookeeper-3.4.6.jar:
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/local/opt/oraclient10.2/product/10.2.0/lib::/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=NA
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.version=2.6.18-400.1.1.el5
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.name=resin
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.home=/www/home/resin
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/usit/solr-test02/www/var/data/mcf/mcf-1/conf
[main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=localhost:8349 sessionTimeout=2000 watcher=org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection$ZooKeeperWatcher@37b335b7
[main-SendThread(localhost.localdomain:8349)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost.localdomain/127.0.0.1:8349. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(localhost.localdomain:8349)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to localhost.localdomain/127.0.0.1:8349, initiating session
[main-SendThread(localhost.localdomain:8349)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server localhost.localdomain/127.0.0.1:8349, sessionid = 0x14b3573590c001a, negotiated timeout = 2

Agent tables installed
[Shutdown thread] INFO org.apache.zookeeper.ZooKeeper - Session: 0x14b3573590c001a closed
[main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down


Here's the stacktrace from the other command:

org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (42703): ERROR: column needpriority does not exist
	at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
	at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
	at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
	at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
	at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
	at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
	at 

Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1

2014-09-22 Thread Erlend Garåsen


I can check for a possible network problem by using file-based 
synchronization instead.


I'll do that right away and test RC2 as well, even though you already 
have three +1's.


The three other jobs I started before I left my office on Thursday did 
all complete successfully.


Erlend

On 19.09.14 12:27, Karl Wright wrote:

Well, it's crawled fine over night, with no issues whatsoever.  I'm using a
Zookeeper setup, with MCF 1.7.1 RC1.

I still maintain you've got something broken with the network in your
production machine.

Karl

On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright daddy...@gmail.com wrote:


Well, FWIW it is still crawling perfectly.  I'll let it run until done.

Karl


On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen 
e.f.gara...@usit.uio.no wrote:


I know. I spent a lot of time creating the rules, which seem to index
what we really want. Your observation is correct. Crawling Dspace
repositories is very difficult. There are a lot of nonsense pages we need
to filter out.

We have crawled this host for the last two years using file-based synch.

I'm planning a new approach, i.e. using a connector etc.

E

Sent from my iPhone


On 18. sep. 2014, at 22:35, Karl Wright daddy...@gmail.com wrote:

Ok, I started this crawl.  It fetched and processed robots.txt perfectly.

And then I saw the following: lots of fetches of fairly good-sized
documents, with very few ingestions.  The documents that did not ingest
look like this:



https://www.duo.uio.no/handle/10852/163/discover?order=DESCr...pp=100sort_by=dc.date.issued_dt



I think your index inclusion rules may be excluding most of the content.

Karl




On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright daddy...@gmail.com wrote:


Thanks -- I will probably not be able to get to this further until tonight
anyhow.

Karl

On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:


I tried to fetch documents by using curl from our prod server just in
case a webmaster had blocked access. No problem. Maybe I should ask the
webmaster of that host anyway, just to be sure.

The interrupted message may have been caused by an abort of that job.

I think I should just stop the problematic job and start all the other
three remaining jobs instead. I bet they will all complete. Ideally we
shouldn't crawl www.duo.uio.no at all since it's a Dspace resource. I
have just contacted someone who is indexing Dspace resources. I guess a
Dspace connector is a better approach.

Below you'll find some parameters.

REPOSITORY CONNECTION
-
Throttling - max connections: 30
Throttling - Max fetches/min: 100
Bandwidth - max connections: 25
Bandwidth - max kbytes/sec: 8000
Bandwidth - max fetches/min: 20

JOB SETTINGS


Hop filters: Keep forever

Seeds: https://www.duo.uio.no/

Exclude from crawl:
# Exclude some file types:
\.gif$
\.GIF$
\.jpeg$
\.JPEG$
\.jpg$
\.JPG$
\.png$
\.PNG$
\.mpg$
\.MPG$
\.mpeg$
\.MPEG$
\.exe$
\.bmp$
\.BMP$
\.mov$
\.MOV$
\.wmf$
\.css$
\.ico$
\.ICO$
\.mp2$
\.mp3$
\.mp4$
\.wmv$
\.tif$
\.tiff$
\.avi$
\.ogg$
\.ogv$
\.zip$
\.gz$
\.psd$

# TIKA-1011
\.mhtml$

# Exclude log files:
\.log$
\.logfile$

# In general, indexing of DUO search results is not allowed:
https?://www\.duo\.uio\.no/sok/search.*

# Other elements in DUO that should be excluded:
https://www\.duo\.uio\.no.*open-search/description\.xml$
https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*

# Skip locale settings - makes duplicates:
https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$

# Temporarily skip PDFs since we are indexing abstracts:
https://www\.duo\.uio\.no/bitstream/handle/.+

# skip full item record:
https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
# new URL structure:
https://www\.duo\.uio\.no/handle/.*\?show=full$

# Skip all navigations but start with letter:
https://www\.duo\.uio\.no/.*type=(author|dateissued)$

# Skip search:
#https://www\.duo\.uio\.no/handle/.*/discover\?.*
https://www\.duo\.uio\.no/handle/.*search-filter\?.*
# new URL structure:
https://www\.duo\.uio\.no/discover\?.*
https://www\.duo\.uio\.no/search-filter\?.*

# Skip statistics:
https://www\.duo\.uio\.no/handle/.*/statistics$

Exclude from index:
# Exclude front page - no valuable info and we have QL:
https?://www\.duo\.uio\.no/$

# Do not index navigation, but follow:
https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
# new URL structure:
https://www\.duo\.uio\.no/handle/\d+/\d+/.+

# Exclude short numeric IDs, probably category listings:
https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
# new URL structure:
https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$

Thanks for looking at this!

BTW: Within an hour, I will be away from my computer and cannot test
anymore until Monday. I'm leaving Oslo for some days, but I will still be
able to read and answer emails.

Erlend



On 18.09.14 13:43, Karl Wright wrote:

Hi Erlend,

The Interrupted: null message with a -104 code means only that the fetch
was interrupted

Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1

2014-09-22 Thread Erlend Garåsen


I'm able to fetch documents from www.duo.uio.no using file-based 
synchronization, so there are no network problems.


Anyway, I'll continue to test RC2. Even though I'm not able to use 
Zookeeper-based synchronization on that host, I may find other 
bugs/problems.


Erlend


Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1

2014-09-22 Thread Erlend Garåsen


Even though Zookeeper is running on the same machine?

I'm planning to investigate this issue further by using tcpdump. I have 
already turned on DEBUG logging, but nothing suspicious is showing up in 
my logs.


This machine is on a very strict network, and that may cause these 
problems, but it's strange that all the other jobs are working perfectly.


Erlend

On 22.09.14 12:26, Karl Wright wrote:

Hi Erlend,

What I think you might want to look for, network-wise, are periods of
significant packet loss.  Normally your server seems to have no trouble
talking to either zookeeper or the external network, but periodically, it
seems to lose that ability for times of at least 20 seconds.  It could be
bad hardware, it could be routing, hard to tell.

What I'd suggest to prove this is to set up a long-running ping, e.g.
ping -n 1, from that machine to the server that zookeeper is running
on, and then do a crawl.  I will wager, well, quite a lot of money, that
you will see periods of packet loss. ;-)

Karl
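
The check suggested above can also be approximated without ping. The following is a minimal sketch, not part of ManifoldCF: it probes a TCP port once per interval and reports periods of at least 20 seconds without a successful connection. The function names, the one-second interval, and the use of port 8349 (taken from a Zookeeper connect string that appears elsewhere in this archive) are assumptions for illustration only.

```python
import socket
import time


def find_gaps(success_times, threshold=20.0):
    """Return (start, end) pairs where two consecutive successful probes
    are at least `threshold` seconds apart."""
    gaps = []
    for earlier, later in zip(success_times, success_times[1:]):
        if later - earlier >= threshold:
            gaps.append((earlier, later))
    return gaps


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor(host, port, duration, interval=1.0):
    """Probe once per `interval` seconds for `duration` seconds and return
    the monotonic timestamps of the successful probes."""
    successes = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        if probe(host, port):
            successes.append(time.monotonic())
        time.sleep(interval)
    return successes
```

Running something like `find_gaps(monitor("localhost", 8349, duration=3600))` during a crawl would list any such periods. Note that the sketch only sees gaps between two successful probes, so an outage at the very start or end of the run is not reported.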



Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1

2014-09-18 Thread Erlend Garåsen


I tried to restart the job dealing with www.duo.uio.no on our test server, 
but it does not seem to touch the robots.txt file at all. That's the 
reason why it's able to continue. Both servers are set up to obey the 
rules of such files.


Erlend

On 18.09.14 11:12, Erlend Garåsen wrote:


I'm facing the same problems with robots.txt files using RC1, so maybe
this is another issue we have to fix. Can you please try to fetch the
host below? For some odd reason, it seems that MCF on our test server
can handle it.

This is exactly what happened when I started MCF (referring to
my previous post) after I had deployed RC1:
09-18-2014 11:02:14.400 robots parse https:www.duo.uio.no:443 ERRORS 0 3 Unknown robots.txt line: ''

No activity after this error.

Here's the robots.txt file:
https://www.duo.uio.no/robots.txt

This is the content of manifoldcf.log after the startup:
  WARN 2014-09-18 11:02:14,401 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': ''
  WARN 2014-09-18 11:02:14,401 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': 'The contents of this file are subject to the license and copyright'
  WARN 2014-09-18 11:02:14,402 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': 'detailed in the LICENSE and NOTICE files at the root of the source'
  WARN 2014-09-18 11:02:14,402 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': 'tree and available online at'
  WARN 2014-09-18 11:02:14,402 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': ' http://www.dspace.org/license/'
  WARN 2014-09-18 11:02:14,402 (Worker thread '19') - Web: Unknown robots.txt line from 'https:www.duo.uio.no:443': ''

E


On 18.09.14 03:12, Karl Wright wrote:

Please vote on whether to release Apache ManifoldCF 1.7.1, RC1.

This release fixes a number of critical issues, as well as a number of
user priorities, most notably:

- A bad Zookeeper support issue, which made locking support fail when
Zookeeper connections got lost and then restored;
- The Alfresco connector, which was nonfunctional in both MCF 1.6 and 1.7;
- Solr Cloud support, which had ceased working due to changes to SolrJ;
- Non-null connector components caused failure;
- PostgreSQL queries not performing well.

The complete list of included fixes can be found at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7.1-RC1/CHANGES.txt


The release candidate can be downloaded from:

http://people.apache.org/~kwright/apache-manifoldcf-1.7.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7.1-RC1

Thanks,
Karl







Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1

2014-09-18 Thread Erlend Garåsen


MCF should handle invalid robots.txt files. We cannot rely on what 
people have entered into such files. So I guess MCF should just ignore 
invalid robots.txt files. I guess it already does.


It seems invalid due to the use of the = symbol instead of a #. I'm not an 
expert on such files, so I'm not completely sure.


E
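
To illustrate the point about the header lines, here is a minimal sketch of a tolerant line classifier. It is not ManifoldCF's robots parser; the function name and the set of recognised fields are assumptions chosen for illustration. It treats lines starting with # as comments, which is the usual robots.txt convention, and flags anything that is not a `field: value` pair with a known field.

```python
# Hypothetical set of recognised fields, chosen for this illustration only.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def classify_robots_line(line):
    """Classify one robots.txt line as 'blank', 'comment', 'directive'
    or 'unknown'."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("#"):
        return "comment"
    # Drop a trailing comment, then expect "field: value".
    body = stripped.split("#", 1)[0]
    field, sep, _value = body.partition(":")
    if sep and field.strip().lower() in KNOWN_FIELDS:
        return "directive"
    return "unknown"
```

With this sketch, the license text quoted in the log is classified as unknown, while the same text prefixed with # would be classified as a comment and skipped.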

On 18.09.14 12:04, Karl Wright wrote:

Hi Erlend,

Your robots file has this at the top:


 The contents of this file are subject to the license and copyright
 detailed in the LICENSE and NOTICE files at the root of the source
 tree and available online at

 http://www.dspace.org/license/


That's fine except to the best of my knowledge the robots spec does
not allow for comments at all.

If you have reason to believe that has changed, then please point me
at a reference and I can change our robots parser.

Thanks,
Karl



On Thu, Sep 18, 2014 at 6:02 AM, Karl Wright daddy...@gmail.com wrote:


Hi Erlend,

MCF caches the robots.txt file in the database, which it considers valid
for 1 hour.

I'll look at the logs and thread dump and let you know if this is a
locking issue or something else.  Please stand by.

Karl



Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1

2014-09-18 Thread Erlend Garåsen

On 18.09.14 13:00, Karl Wright wrote:

Hi Erlend,

please can you also add the manifoldcf log as well?


Yes, I will, but it includes entries from RC0 as well.

MCF works perfectly using the other jobs for the other hosts. Take a 
look at the following once again. MCF is being interrupted:
INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException| Interrupted: Interrupted: null


You can find this entry near the other entries regarding the robots.txt file:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

Erlend
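
For anyone scanning the log for similar entries, the following is a minimal sketch that splits the pipe-delimited part of the FETCH entry above. The field names are assumptions inferred from this single entry, not from ManifoldCF documentation: the third field appears to be a start time in epoch milliseconds plus an elapsed time in milliseconds, followed by the result code, a byte count, an exception class, and a message.

```python
from datetime import datetime, timedelta, timezone

# The pipe-delimited part of the log entry quoted above.
SAMPLE = ("URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|"
          "org.apache.manifoldcf.core.interfaces.ManifoldCFException| "
          "Interrupted: Interrupted: null")

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_fetch_entry(entry):
    """Split a FETCH entry into named fields (the names are assumptions)."""
    _kind, url, timing, code, size, exc_class, message = entry.split("|", 6)
    start_ms, elapsed_ms = (int(part) for part in timing.split("+"))
    return {
        "url": url,
        "start": EPOCH + timedelta(milliseconds=start_ms),
        "elapsed_ms": elapsed_ms,
        "code": int(code),
        "bytes": int(size),
        "exception": exc_class,
        "message": message.strip(),
    }


entry = parse_fetch_entry(SAMPLE)
```

Under this reading, the fetch started at 09:02:20 UTC and the entry was logged roughly 683 seconds later, which agrees with the 11:13:42 local timestamp on the log line if the server runs on Central European Summer Time.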



Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1

2014-09-18 Thread Erlend Garåsen


I tried to fetch documents by using curl from our prod server just in 
case a webmaster had blocked access. No problem. Maybe I should ask the 
webmaster of that host anyway, just to be sure.


The interrupted message may have been caused by an abort of that job.

I think I should just stop the problematic job and start all the other 
three remaining jobs instead. I bet they will all complete. Ideally we 
shouldn't crawl www.duo.uio.no at all since it's a Dspace resource. I 
have just contacted someone who is indexing Dspace resources. I guess a 
Dspace connector is a better approach.


Below you'll find some parameters.

REPOSITORY CONNECTION
-
Throttling - max connections: 30
Throttling - Max fetches/min: 100
Bandwidth - max connections: 25
Bandwidth - max kbytes/sec: 8000
Bandwidth - max fetches/min: 20

JOB SETTINGS


Hop filters: Keep forever

Seeds: https://www.duo.uio.no/

Exclude from crawl:
# Exclude some file types:
\.gif$
\.GIF$
\.jpeg$
\.JPEG$
\.jpg$
\.JPG$
\.png$
\.PNG$
\.mpg$
\.MPG$
\.mpeg$
\.MPEG$
\.exe$
\.bmp$
\.BMP$
\.mov$
\.MOV$
\.wmf$
\.css$
\.ico$
\.ICO$
\.mp2$
\.mp3$
\.mp4$
\.wmv$
\.tif$
\.tiff$
\.avi$
\.ogg$
\.ogv$
\.zip$
\.gz$
\.psd$

# TIKA-1011
\.mhtml$

# Exclude log files:
\.log$
\.logfile$

# In general, indexing of DUO search results is not allowed:
https?://www\.duo\.uio\.no/sok/search.*

# Other elements in DUO that should be excluded:
https://www\.duo\.uio\.no.*open-search/description\.xml$
https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*

# Skip locale settings - makes duplicates:
https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$

# Temporarily skip PDFs since we are indexing abstracts:
https://www\.duo\.uio\.no/bitstream/handle/.+

# skip full item record:
https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
# new URL structure:
https://www\.duo\.uio\.no/handle/.*\?show=full$

# Skip all navigations but start with letter:
https://www\.duo\.uio\.no/.*type=(author|dateissued)$

# Skip search:
#https://www\.duo\.uio\.no/handle/.*/discover\?.*
https://www\.duo\.uio\.no/handle/.*search-filter\?.*
# new URL structure:
https://www\.duo\.uio\.no/discover\?.*
https://www\.duo\.uio\.no/search-filter\?.*

# Skip statistics:
https://www\.duo\.uio\.no/handle/.*/statistics$

Exclude from index:
# Exclude front page - no valuable info and we have QL:
https?://www\.duo\.uio\.no/$

# Do not index navigation, but follow:
https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
# new URL structure:
https://www\.duo\.uio\.no/handle/\d+/\d+/.+

# Exclude short numeric IDs, probably category listings:
https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
# new URL structure:
https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
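
The effect of the "Exclude from index" rules above can be checked offline. The sketch below is a rough approximation in Python, assuming the rules are applied with "find anywhere in the URL" semantics (suggested by patterns such as `\.gif$`) and that Python's `re` behaves like the Java regex engine for these particular patterns. The helper name and the test URLs are illustrative; the first URL is a shortened form of the one reported as not ingested.

```python
import re

# The "Exclude from index" rules listed above, copied verbatim.
EXCLUDE_FROM_INDEX = [
    r"https?://www\.duo\.uio\.no/$",
    r"https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+",
    r"https://www\.duo\.uio\.no/handle/\d+/\d+/.+",
    r"https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$",
    r"https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$",
]


def matching_rules(url, rules):
    """Return the rules that match somewhere in the URL."""
    return [rule for rule in rules if re.search(rule, url)]


# Shortened form of the URL that was fetched but not ingested.
discover_url = "https://www.duo.uio.no/handle/10852/163/discover?order=DESC"
# Hypothetical item URL with a longer numeric ID.
item_url = "https://www.duo.uio.no/handle/10852/12345"

print(matching_rules(discover_url, EXCLUDE_FROM_INDEX))
print(matching_rules(item_url, EXCLUDE_FROM_INDEX))
```

Under these assumptions the discover URL matches the "new URL structure" navigation rule, so it would be followed but not indexed, while the hypothetical item URL matches no exclusion rule. Because the `handle/.*/discover` crawl exclusion is commented out, such pages are still fetched, which would be consistent with many fetches and few ingestions.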

Thanks for looking at this!

BTW: Within an hour, I will be away from my computer and cannot test 
anymore until Monday. I'm leaving Oslo for some days, but I will still 
be able to read and answer emails.


Erlend

On 18.09.14 13:43, Karl Wright wrote:

Hi Erlend,

The Interrupted: null message with a -104 code means only that the fetch
was interrupted by something.  Unfortunately, the message is not clear
about what the cause of the interruption is.  This is unrelated to
Zookeeper; but I agree that it is suspicious that many such interruptions
appear right after robots is parsed.

One cause of a -104 is when the target server forcibly drops the
connection, so an InterruptedIOException is thrown.  Having a look at the
timestamps for the fetch messages, it looks believable that you might have
exceeded some predetermined limit on that machine.  They're all within a
few milliseconds of each other.  When a robots file needs to be read,
ManifoldCF creates an event for that, and the urls blocked by that event
will all be 'fetchable' as soon as the event is released.  Perhaps your
throttling needs to be adjusted now that the rate limit bug has been fixed?

I won't be able to work with this without at least your crawling parameters
for the server in question.  I can ping that server so if you would like I
can try crawling that server from here.

For zookeeper, I would still try to either increase your tick count to
maybe 1, or better yet, find out why you periodically lose the ability
to transmit pings from MCF to your zookeeper process.

Thanks,
Karl




On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen e.f.gara...@usit.uio.no
wrote:


On 18.09.14 13:00, Karl Wright wrote:


Hi Erlend,

please can you also add the manifoldcf log as well?



Yes, I will, but it includes entries from RC0 as well.

MCF works perfectly using the other jobs for the other hosts. Take a look
at the following once again. MCF is being interrupted:
INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
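
The pipe-separated fields in a FETCH entry like the one above can be pulled apart with a few lines of Python. This is only a sketch: the field meanings (URL, start time in epoch milliseconds plus elapsed milliseconds, result code, byte count, exception details) are an assumption read off the log lines in this thread, not a documented format, and `parse_fetch_line` is a hypothetical helper.

```python
from datetime import datetime, timezone


def parse_fetch_line(line):
    """Split a 'WEB: FETCH URL|...' log entry into named fields.

    Assumed layout: url|start_ms+elapsed_ms|code|bytes|details...
    """
    payload = line.split("FETCH URL|", 1)[1]
    fields = payload.split("|")
    start_ms, elapsed_ms = (int(part) for part in fields[1].split("+"))
    return {
        "url": fields[0],
        "start": datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc),
        "elapsed_ms": elapsed_ms,
        "code": int(fields[2]),
        "bytes": int(fields[3]),
        # Anything after the byte count (exception class, message), minus empties.
        "details": [field for field in fields[4:] if field],
    }


entry = parse_fetch_line(
    "INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|"
    "https://www.duo.uio.no/|1411030940209+682605|-104|4096|"
    "org.apache.manifoldcf.core.interfaces.ManifoldCFException|"
)
print(entry["start"].isoformat(), entry["elapsed_ms"], entry["code"])
```

For this entry the start time lands shortly after 09:02 UTC on 18 September and the elapsed value is roughly 683 seconds. If the field reading is right, the fetch had been pending for more than eleven minutes before it was interrupted.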

Re: Bug-fix release 1.7.1

2014-09-17 Thread Erlend Garåsen


+1

As an exception to the rule, I will deploy a patched version on our 
production server just to be sure that we have fixed the problem. For 
some reason, I'm not able to reproduce the Zookeeper problem on our test 
server, so I'll go ahead on our prod server instead. I'll let you know 
whether this solved our problem.


Erlend

On 16.09.14 23:21, Karl Wright wrote:

I think we're going to need a bug fix release for 1.7.  Specifically, we
need the fixes for CONNECTORS-1031, as well as the fix for the Axis
classpath that allows the Alfresco connector to work.

Please let me know if you agree that a point fix is warranted.

Thanks,

Karl





Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC0

2014-09-17 Thread Erlend Garåsen


I got the following on both my test and prod server. The error also 
shows up in simple history:
Error: KeeperErrorCode = NoNode for 
/org.apache.manifoldcf.locks-_Cache_OUTPUTCONNECTION_Solr/read-0001039554


I guess it is related to the shutdown process - either when I stopped 
the Resin instance or the Agent. Just mentioning.


BTW, for some reason I had to restart the job on our test server. Our 
prod server hangs at the moment, so I will try to restart everything 
once again. The applied patch works well, but version 1.7.1 seems to be 
tricky. I'll publish a thread dump and my logs if I don't manage to get 
MCF running on our prod server before I leave the office.


This is what I can see in manifoldcf.log:
 WARN 2014-09-17 14:06:52,228 (Shutdown thread) - Exception tossed on 
repository connector pool cleanup: KeeperErrorCode = NoNode for 
/org.apache.manifoldcf.serviceactive-_REPOSITORYCONNECTORPOOL_Web-_ANON_5
org.apache.manifoldcf.core.interfaces.ManifoldCFException: 
KeeperErrorCode = NoNode for 
/org.apache.manifoldcf.serviceactive-_REPOSITORYCONNECTORPOOL_Web-_ANON_5
	at 
org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.handleKeeperException(ZooKeeperConnection.java:941)
	at 
org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.deleteNode(ZooKeeperConnection.java:155)
	at 
org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager.endServiceActivity(ZooKeeperLockManager.java:478)
	at 
org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.releaseAll(ConnectorPool.java:735)
	at 
org.apache.manifoldcf.core.connectorpool.ConnectorPool.closeAllConnectors(ConnectorPool.java:381)
	at 
org.apache.manifoldcf.crawler.repositoryconnectorpool.RepositoryConnectorPool.closeAllConnectors(RepositoryConnectorPool.java:144)
	at 
org.apache.manifoldcf.crawler.system.ManifoldCF.localCleanup(ManifoldCF.java:110)
	at 
org.apache.manifoldcf.crawler.system.CrawlerAgent.cleanUp(CrawlerAgent.java:105)
	at 
org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:171)
	at 
org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:386)
	at 
org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1295)
	at 
org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1483)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for 
/org.apache.manifoldcf.serviceactive-_REPOSITORYCONNECTORPOOL_Web-_ANON_5

at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
	at 
org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.deleteNode(ZooKeeperConnection.java:150)

... 10 more

On 17.09.14 13:35, Karl Wright wrote:

Please vote on whether to release Apache ManifoldCF 1.7.1, RC0.

This release fixes a number of critical issues, as well as a number of user
priorities, most notably:

- A bad Zookeeper support issue, which made locking support fail when
Zookeeper connections got lost and then restored;
- The Alfresco connector, which was nonfunctional in both MCF 1.6 and 1.7;
- Solr Cloud support, which had ceased working due to changes to SolrJ;
- Non-null connector components caused failure;
- PostgreSQL queries not performing well.

The complete list of included fixes can be found at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7.1-RC0/CHANGES.txt

The release candidate can be downloaded from:

http://people.apache.org/~kwright/apache-manifoldcf-1.7.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7.1-RC0

Thanks,
Karl





Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC0

2014-09-17 Thread Erlend Garåsen

On 17.09.14 14:55, Karl Wright wrote:

Hi Erlend,

Yes, this is shutdown related.  The patch file did not include the fix for
this particular problem.  The release candidate, however, does.


This is not from the patch, but from 1.7.1. I just meant to say that I 
did not have any problems using the patch.


The thread dump is included in my stdout log file, since the output of 
kill -3 was placed there. Please note that it is at THE END 
of that file. I'm in a hurry, so I didn't have time to delete all the 
other irrelevant entries. Sorry about that:

http://folk.uio.no/erlendfg/manifoldcf/mcf_agent.stdout.log

I'll try to restart everything and get MCF up and running. Runs fine on 
our test server, but not on prod. I'll get back to this.


E


Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC0

2014-09-17 Thread Erlend Garåsen


Both servers are running now. Not sure what caused the problems on 
prod. The only thing I did differently was to do a lock clean on prod 
prior to startup.


I'll keep both servers up and running for 24 hours and vote thereafter.

Erlend





Re: [VOTE] Release Apache ManifoldCF 1.7 RC2

2014-08-21 Thread Erlend Garåsen


+1

- Deployed binary dist on Caucho Resin on Linux and ran:
  - a huge crawl using FileLockManager
- Built source dist on OS X and:
  - Ran single-process version under example directory
  - Ran ant uitest and test

Erlend

On 20.08.14 02:58, Mingchun Zhao wrote:

Hi all,

Please vote on whether to release the ManifoldCF, version 1.7, RC2.
RC2 included the following changes from RC1.

CONNECTORS-1009: Fix CMIS connector again, to handle typical case
where a new version is a new node.
CONNECTORS-1011: Upgrade to httpclient 4.3.5.
CONNECTORS-1012: Upgrade xmlbeans and POI to fix various CVE's.

You can find the artifact at:

http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC2

There is also a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC2

Vote will remain open at least 72 hours.

Thanks!
Mingchun Zhao





Re: [VOTE] Release Apache ManifoldCF 1.7 RC0

2014-08-15 Thread Erlend Garåsen

On 12.08.14 05:13, Mingchun Zhao wrote:

Hi all,

Please vote on whether to release the ManifoldCF, version 1.7, RC0.

You can find the artifact at:

http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0

There is also a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0

Vote will remain open at least 72 hours.

Thanks!
Mingchun Zhao





Re: [VOTE] Release Apache ManifoldCF 1.7 RC0

2014-08-15 Thread Erlend Garåsen


-1

All my first tests pass, but I think I found a blocker when I ran the 
last one.


When running MCF with FileLockManager, I get the following error, 
and MCF just retries this task over and over again. My synch folder 
now contains a lot of files and is still growing. I think MCF should 
handle long URLs and simply truncate the file name if it becomes 
too long.


INFO 2014-08-15 09:30:54,485 (Worker thread '9') - WEB: FETCH 
URL|https://www.journals.uio.no/index.php/nordina/search/advancedResults?subject=effective%20continuing%20professional%20development%2C%20authentic%20and%20entrepreneurial%20learning%2C%20science%20and%20technology%20education|1408087853848+633|200|15735|
 WARN 2014-08-15 09:30:54,609 (Worker thread '9') - Attempt to set file 
lock 
'/www/var/data/mcf/mcf-1/conf/../data/synchdir/948/350/lock-Solr58!https58!47!47!www.journals.uio.no58!44347!index.php47!nordina47!search47!advancedResults?subject61!effective%20continuing%20professional%20development%2C%20authentic%20and%20entrepreneurial%20learning%2C%20science%20and%20technology%20education.lock' 
failed: File name too long

java.io.IOException: File name too long
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:947)
	at 
org.apache.manifoldcf.core.lockmanager.FileLockObject.grabFileLock(FileLockObject.java:221)
	at 
org.apache.manifoldcf.core.lockmanager.FileLockObject.obtainGlobalWriteLockNoWait(FileLockObject.java:77)
	at 
org.apache.manifoldcf.core.lockmanager.LockObject.obtainGlobalWriteLock(LockObject.java:121)
	at 
org.apache.manifoldcf.core.lockmanager.LockObject.enterWriteLock(LockObject.java:74)
	at 
org.apache.manifoldcf.core.lockmanager.LockGate.enterWriteLock(LockGate.java:177)
	at 
org.apache.manifoldcf.core.lockmanager.BaseLockManager.enter(BaseLockManager.java:1473)
	at 
org.apache.manifoldcf.core.lockmanager.BaseLockManager.enterLocks(BaseLockManager.java:803)
	at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$OutputAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3329)
	at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3051)
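
The truncation idea suggested above could look roughly like the following Python sketch. It is not how MCF builds its lock file names. `bounded_lock_name` is a hypothetical helper, the 255-byte limit is an assumption about the underlying filesystem, and the SHA-1 suffix is added so that two long keys sharing the same prefix still map to different files.

```python
import hashlib

# Assumed per-component file name limit; typical for Linux filesystems.
MAX_NAME_BYTES = 255


def bounded_lock_name(key, prefix="lock-", suffix=".lock"):
    """Build a lock file name that never exceeds MAX_NAME_BYTES.

    Short keys are kept verbatim. Long keys are cut and given a hash of the
    full key so that distinct keys do not collide after truncation.
    """
    name = prefix + key + suffix
    if len(name.encode("utf-8")) <= MAX_NAME_BYTES:
        return name
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Bytes left for the readable part: total minus prefix, suffix, dash, digest.
    budget = MAX_NAME_BYTES - len(prefix) - len(suffix) - len(digest) - 1
    head = key.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return prefix + head + "-" + digest + suffix


long_key = "Solr58!https58!47!47!www.journals.uio.no" + "%20learning" * 40
print(len(bounded_lock_name(long_key).encode("utf-8")))
```

Plain truncation without the hash would be simpler, but two different long URLs could then end up sharing one lock file.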




On 12.08.14 05:13, Mingchun Zhao wrote:

Hi all,

Please vote on whether to release the ManifoldCF, version 1.7, RC0.

You can find the artifact at:

http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0

There is also a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0

Vote will remain open at least 72 hours.

Thanks!
Mingchun Zhao





Re: [VOTE] Release Apache ManifoldCF 1.7 RC0

2014-08-15 Thread Erlend Garåsen


Another thing. It's not possible to abort the job due to this problem. 
LockManager still tries to set locks over and over again. It's not just 
the previous URL/filename I entered, but several others:


 WARN 2014-08-15 10:07:46,178 (Worker thread '31') - Attempt to set 
file lock 
'/www/var/data/mcf/mcf-1/conf/../data/synchdir/664/756/lock-Solr58!https58!47!47!www.journals.uio.no58!44347!index.php47!nordina47!search47!advancedResults?subject61!Small%20group%20learning%2C%203rd%20graders%2C%20learning%20of%20DC-circuit%20phenomena%2C%20active%20and%20spontaneous%20learning.lock' 
failed: File name too long


Erlend








Re: [VOTE] Release Apache ManifoldCF 1.6.1, RC1

2014-05-30 Thread Erlend Garåsen

+1 from me.

1. Ran test, uitest
2. Ran single process example, registered a Solr server and performed a 
web crawl

3. Deployed on Resin, ran huge crawl with Multiprocess/Zookeeper model

Looks good!

Erlend

On 29.05.14 10:51, Karl Wright wrote:

This minor release of ManifoldCF fixes a number of critical problems,
including compatibility with JDK 8.  A full list of changes can be found
with the release candidate.  The release candidate can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.6.1

There is also a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.6.1-RC1

Voting will be held open the Apache-mandated 72 hours.

Thanks!





IO exception during indexing: null

2014-05-21 Thread Erlend Garåsen


I'm getting the following error after I upgraded to version 1.6. I think 
HttpClient is the source of the problem and that the following ticket 
describes the issue in detail:

https://issues.apache.org/jira/browse/CONNECTORS-661

I have turned on HttpClient logging and placed the manifoldcf.log here:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

The first line indicates that the Solr server requires authentication. 
Then it seems that the authentication was unsuccessful using the BASIC 
method.


Then we're getting the following from HttpClient, just like we did as 
described in CONNECTORS-661: NonRepeatableRequestException: Cannot retry 
request with a non-repeatable request entity.
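
The reason a 401 can end in this kind of exception is easier to see with a toy model. The Python sketch below is not HttpClient code; every class and function in it is hypothetical. It only illustrates the general mechanism: if the request body comes from a stream that can be read once, the client has nothing left to resend after the server answers the first attempt with an authentication challenge.

```python
class RepeatableBody:
    """Request body held in memory; it can be sent any number of times."""

    def __init__(self, data):
        self._data = data

    def read_all(self):
        return self._data


class OneShotBody:
    """Request body backed by a stream that can be consumed only once."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._consumed = False

    def read_all(self):
        if self._consumed:
            raise RuntimeError("cannot retry request with a non-repeatable body")
        self._consumed = True
        return b"".join(self._chunks)


def fake_server(body_bytes, credentials):
    """Hypothetical server: 401 without credentials, 200 with them."""
    return 200 if credentials else 401


def post_with_auth_retry(body, credentials):
    """Send once without credentials, then retry with them after a 401."""
    status = fake_server(body.read_all(), None)
    if status == 401:
        # The retry needs the body again; a one-shot body fails here.
        status = fake_server(body.read_all(), credentials)
    return status


print(post_with_auth_retry(RepeatableBody(b"<doc/>"), ("user", "secret")))
```

Expect-continue sidesteps the problem by letting the server reject the request before the body is sent at all, which is why it was the fix discussed in CONNECTORS-661.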


There is nothing wrong with my settings. The realm, user ID and password 
values are all correctly set.


Last time it was something about two HTTP statuses sent (100 and 401), 
but I can only see the 401 status this time. We have also changed our 
authentication implementation on our Solr server since then to only rely 
on Apache, i.e. a setup in httpd.conf. So neither Resin nor mod_caucho 
should be the problem here.


Erlend


Re: IO exception during indexing: null

2014-05-21 Thread Erlend Garåsen


Thanks for looking at this, Karl.

I have sent the tcpdump output directly to you.

Erlend

On 21.05.14 14:42, Karl Wright wrote:

Looking at CONNECTORS-661, the fix to this was to enable expect-continue.
The current code still does this, so clearly it's not working as expected.
I'll post to the HTTPCLIENT list for answers.  In the meantime, I think we
should open a ticket.

Karl


On Wed, May 21, 2014 at 8:35 AM, Karl Wright daddy...@gmail.com wrote:


Hi Erlend,

Looking at the log you provided, it's missing critical information, namely
what ManifoldCF is sending to your server.  So in its current form it's not
very helpful.  I can see that there are two 401 responses, but that's about
it.

Karl



On Wed, May 21, 2014 at 6:39 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:



I'm getting the following error after I upgraded to version 1.6. I think
HttpClient is the source of the problem and that the following ticket
describes the issue in detail:
https://issues.apache.org/jira/browse/CONNECTORS-661

I have turned on HttpClient logging and placed the manifoldcf.log here:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

The first line indicates that the Solr server requires authentication.
Then it seems that the authentication was unsuccessful using the BASIC
method.

Then we're getting the following from HttpClient, just like we did as
described in CONNECTORS-661: NonRepeatableRequestException: Cannot retry
request with a non-repeatable request entity.

There is nothing wrong with my settings. The realm, user ID and password
values are all correctly set.

Last time it was something about two HTTP statuses sent (100 and 401),
but I can only see the 401 status this time. We have also changed our
authentication implementation on our Solr server since then to only rely on
Apache, i.e. a setup in httpd.conf. So neither Resin nor mod_caucho should
be the problem here.

Erlend










Re: IO exception during indexing: null

2014-05-21 Thread Erlend Garåsen


The complete log is now available here:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

Erlend













Re: New committer: Graeme Seaton

2014-03-14 Thread Erlend Garåsen

Greetings from Oslo, Norway, and welcome aboard, Graeme!

Erlend

On 10.03.14 08:18, Karl Wright wrote:

The Project Management Committee (PMC) for Apache ManifoldCF has asked
Graeme Seaton to become a committer and we are pleased to announce
that they have accepted.

Graeme has been instrumental in driving the ManifoldCF project in
the direction of scalability, and will no doubt continue to do
so now that he has committer and PMC privileges.

Being a committer enables easier contribution to the project since
there is no need to go via the patch submission process. This should
enable better productivity. Being a PMC member enables assistance with
the management and to guide the direction of the project.

Thanks,
Karl Wright





Re: [VOTE] Release Apache ManifoldCF 1.5, RC7

2014-02-07 Thread Erlend Garåsen

+1

- Ran ant test | uitest | doc
- Installed binary version and ran single process model
- Installed source version, built and ran multi-process model and a huge 
crawl

- Deployed on Resin application server and ran a huge crawl

Erlend

On 04.02.14 13:33, Karl Wright wrote:

This is a major release of ManifoldCF that includes the following:

- Federated authority support
- Multiple authorization domains
- ZooKeeper process coordination
- Multiple agents processes
- Support for SharePoint Claims-based authorization
- An Email connector
- A revamped look-and-feel

Voting will remain open for 3 days.

You can download the artifacts from
http://people.apache.org/~kwright/apache-manifoldcf-1.5 .  There is also a
release tag at
https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.5-RC7 .

This RC includes changes to the dist directory organization so that jar
files are not duplicated, saving 40MB from each binary download.  It also
fixes an issue with connection limits in the zookeeper example.  Finally,
it fixes a limitation in the CMIS connector (CONNECTORS-864) and a maven
build problem (CONNECTORS-865).  Also fixes CONNECTORS-866 (the lockclean
script), and two more Maven version issues.  Finally, corrects a LiveLink
connector reversion described in CONNECTORS-871.  Missing SolrJ
dependencies in CONNECTORS-873.  Workaround for SolrJ runtime exception
being thrown in CONNECTORS-874.  Throttling lockup dealt with, improved,
and tested in CONNECTORS-872.

Karl





Re: [VOTE] Release Apache ManifoldCF 1.5, RC7

2014-02-06 Thread Erlend Garåsen


We're still having problems with this release on our test server. It 
runs stable and does not hang anymore, but nothing gets sent to Solr. 
Since there was a problem with the SSL certificate in previous RCs, 
maybe there is a similar problem related to the Solr Output Connector? 
We have configured the same certificate in order to post documents to Solr.


I get entries like this in manifoldcf.log which indicates that documents 
should be indexed, but they aren't:
DEBUG 2014-02-06 10:28:06,609 (Worker thread '29') - WEB: Decided to 
ingest 'http://www.ibsen.uio.no/varia.xhtml'


In Simple history, only fetch activities are shown. Any suggestions how 
to debug what's really going on? I can try to turn on debug logging for 
Httpclient in case that helps.


Erlend






Re: [VOTE] Release Apache ManifoldCF 1.5, RC7

2014-02-06 Thread Erlend Garåsen

On 06.02.14 12:41, Erlend Garåsen wrote:

p://www.ibsen.uio.no/diktsamlinger.xhtml]} 0 16


select * from repohistory where entityid like 
'%www.ibsen.uio.no/diktsamlinger.xhtml%'


11227;1391439283905;1391439277790;1391439283890;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200;
11227;1391439283948;1391439277782;1391439283923;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;document 
ingest (Solr);OK

11227;1391678979841;1391678941727;1391678951418;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200;
11227;1391685900353;1391685874021;1391685881017;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200;
11227;1391686685694;1391686673738;1391686540299;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200;

So it should show up in simple history.

Erlend


Re: [VOTE] Release Apache ManifoldCF 1.5, RC7

2014-02-06 Thread Erlend Garåsen


Any suggestions on what to include in logging.ini? I have tried the 
following without any success:

log4j.logger.org.postgresql=DEBUG
log4j.logger.java.sql.Connection=DEBUG
log4j.logger.java.sql=DEBUG
log4j.logger.java.sql.ResultSet=TRACE

Erlend

On 06.02.14 13:34, Karl Wright wrote:

Hi Erlend,

This isn't making much sense.  Nothing here has changed, AFAIK, between
1.4.1 and 1.5.  If you want to see the queries being submitted for the
simple history, try these steps:

(1) Stop agents process and Resin
(2) Enable db debugging
(3) Start Resin
(4) try the simple history in the UI
(5) have a look at the log; the query should be there

Karl



On Thu, Feb 6, 2014 at 6:59 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:


On 06.02.14 12:41, Erlend Garåsen wrote:


p://www.ibsen.uio.no/diktsamlinger.xhtml]} 0 16



select * from repohistory where entityid like '%www.ibsen.uio.no/diktsamlinger.xhtml%'

11227;1391439283905;1391439277790;1391439283890;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200
11227;1391439283948;1391439277782;1391439283923;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;document ingest (Solr);OK
11227;1391678979841;1391678941727;1391678951418;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200
11227;1391685900353;1391685874021;1391685881017;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200
11227;1391686685694;1391686673738;1391686540299;http://www.ibsen.uio.no/diktsamlinger.xhtml;Web;fetch;200

So it should show up in simple history.

Erlend







Re: [VOTE] Release Apache ManifoldCF 1.5, RC7

2014-02-06 Thread Erlend Garåsen

On 06.02.14 15:25, Karl Wright wrote:


So I conclude that simple history is working fine, but since it is only
returning indexing results within the last hour by default it is confusing
you.  I also think it is likely that documents are getting skipped because
you've crawled this set before with the same job and many of the documents
have not changed.


Karl, we are indexing these documents:

I have tail -F opened up from our Solr test server at the moment:
[2014-02-06 15:21:00.321] INFO [uio] OP crawl 
{add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=B]} 0 38
[2014-02-06 15:21:00.359] INFO [uio] OP crawl 
{add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=N]} 0 23
[2014-02-06 15:21:29.732] INFO [uio] OP crawl 
{add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=G]} 0 29
[2014-02-06 15:22:11.954] INFO [uio] OP crawl 
{add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=S]} 0 38
[2014-02-06 15:22:15.752] INFO [uio] OP crawl 
{add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=D]} 0 28
[2014-02-06 15:22:18.323] INFO [uio] OP crawl 
{add=[http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=H]} 0 34
[2014-02-06 15:22:21.657] INFO [uio] OP crawl 
{add=[http://www.ibsen.uio.no/variakronologi.xhtml]} 0 73


How could these log entries show up on our Solr server if the documents 
were skipped?


And why did I get entries like this earlier today:
DEBUG 2014-02-06 10:28:06,609 (Worker thread '29') - WEB: Decided to 
ingest 'http://www.ibsen.uio.no/varia.xhtml'


(I have changed the log level back to INFO right now, so I cannot see 
these entries for the last crawl, but I will re-enable DEBUG again).


I have re-ingested all documents several times today to be sure that all 
documents were crawled all over again.


Of course, I can try to remove all jobs, delete all tables in PostgreSQL 
and try to create everything from scratch in case the old settings did 
not get upgraded successfully. Unfortunately MCF will delete all tables 
in my index as well.


Erlend


Re: [VOTE] Release Apache ManifoldCF 1.5, RC7

2014-02-06 Thread Erlend Garåsen


And why do I get the following result from pgAdmin when I run the 
following SQL?:
select * from repohistory where entityid = 
'http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=H'


54251;1391440247586;1391440244203;1391440247542;http://www.ibsen.uio.no/brevmottakere.xhtml?bokstav=H;Web;document 
ingest (Solr);OK


This shows that the document was indexed, but it's not visible inside 
simple history.


Erlend


Re: [VOTE] Release Apache ManifoldCF 1.5, RC7

2014-02-06 Thread Erlend Garåsen

On 06.02.14 15:53, Karl Wright wrote:

Hi Erlend,

Please go into the Simple History, and change the start time of the query
to be one day earlier than the default.  By default, Simple History only
reports the last hour's worth of events.


Then it only displays the crawl which completed tonight before I did the 
upgrade of MCF.


If Solr is indexing the documents, you should also see the entries in 
simple history. I changed the start time to four hours earlier than the 
default which should catch the Solr activity.


The query I posted seems to include an old start time (3rd of Feb) and 
that's the reason why pgAdmin displays a result set. At that time, I 
reindexed Solr prior to the MCF upgrade.
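
The date check behind that observation can be sketched in a few lines of Python. The values in repohistory look like epoch milliseconds; that reading is an assumption, as is which column holds which timestamp. `epoch_ms_to_utc` and `within_window` are hypothetical helpers, and the two numbers are taken from rows quoted in this thread.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(ms):
    """Convert epoch milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def within_window(event_ms, now_ms, hours=1):
    """True if the event falls inside the last `hours` before `now_ms`."""
    age = epoch_ms_to_utc(now_ms) - epoch_ms_to_utc(event_ms)
    return timedelta(0) <= age <= timedelta(hours=hours)


ingest_ms = 1391440247586  # first value of the 'document ingest (Solr)' row
job_end_ms = 1391696593141  # a 'job end' value from the same afternoon

print(epoch_ms_to_utc(ingest_ms).isoformat())
print(within_window(ingest_ms, job_end_ms, hours=4))
```

The ingest row converts to 3 February 2014, so it falls outside both the default one-hour window and a four-hour window on the 6th.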


If I re-ingest all documents and start the job, I see activity in 
both our Solr log and in manifoldcf.log (Decided to ingest...). And 
if I continuously refresh the simple history window, all I can see is 
fetch activities (and job start etc.).


For some odd reason, 'document ingest (Solr)' as an activity type does 
not seem to be added to my repohistory table after I did the upgrade.


Take a look at this query:
select count(*) from repohistory where owner='Web' AND starttime > 
1391691978799 and activitytype = 'fetch'

==> 141.
(This is everything from 1:06 pm until now.)

But then take a look at this one:
select starttime, activitytype from repohistory where owner='Web' AND 
starttime > 1391691978799 and activitytype <> 'fetch'

==>
1391693068680;job stop
1391693560219;job start
1391693602432;robots parse
1391694720907;job stop
1391694830347;job start
1391694870310;job stop
1391695481359;job continue
1391695518007;robots parse
1391696593141;job end
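
Both checks can be folded into one grouped query that lists every activity type recorded since a cutoff. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL. The table has only the three columns named in the queries above, and the rows are a small sample, partly taken from this thread and partly made up.

```python
import sqlite3


def activity_counts(conn, owner, since_ms):
    """Count repohistory rows per activity type for one owner since a cutoff."""
    rows = conn.execute(
        "SELECT activitytype, COUNT(*) FROM repohistory "
        "WHERE owner = ? AND starttime > ? GROUP BY activitytype",
        (owner, since_ms),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repohistory (owner TEXT, starttime INTEGER, activitytype TEXT)"
)
conn.executemany(
    "INSERT INTO repohistory VALUES (?, ?, ?)",
    [
        ("Web", 1391693068680, "job stop"),
        ("Web", 1391693560219, "job start"),
        ("Web", 1391693602432, "robots parse"),
        ("Web", 1391693700000, "fetch"),  # made-up row
        ("Web", 1391440247586, "document ingest (Solr)"),  # 3 Feb, before cutoff
    ],
)

counts = activity_counts(conn, "Web", 1391691978799)
print(counts)
```

A missing 'document ingest (Solr)' key in the result would point at the connector not recording the activity rather than at the Simple History view filtering it out.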

I can try to debug more tomorrow.

Erlend


Re: [VOTE] Release Apache ManifoldCF 1.5, RC7

2014-02-06 Thread Erlend Garåsen

On 06.02.14 18:18, Karl Wright wrote:

Actually yes, I found it.  Only exceptions/errors are recorded by the solr
connector.

CONNECTORS-884.  However, I don't think this rises to the level of needing
to respin the RC.  Do you agree?


Since we are on RC7 now, I agree. I'll start a complete crawl after 
dinner. If that completes, I'll place my final vote. :)


Erlend


Re: [VOTE] Release Apache ManifoldCF 1.5, RC5

2014-01-29 Thread Erlend Garåsen

On 1/29/14 3:57 PM, Karl Wright wrote:

Thanks - this shows that threads are all waiting on connection throttling.
How many simultaneous connections did you make available to the site you
are crawling, and can you look at the simple history report to confirm that
there is no activity?  I'll dig further through your stack trace while I
await your answer.


I'll answer even though you have cancelled the vote.

No activity in simple history and no new log entries in manifoldcf.log 
other than the debug entries included in the screenshot.


Throttling
----------
Max connections: 30
Max avg fetches/min: 100

Bandwidth
---------
Max connections: 25
Max kbytes/sec: 2000
Max fetches/min: 20
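
To get a feeling for what these limits mean in practice, here is a rough back-of-the-envelope sketch. It is only arithmetic on the numbers above; how ManifoldCF actually enforces its throttles is not shown in this thread, and the class and method names are invented for illustration:

```java
// Hypothetical arithmetic helper; it does not call any ManifoldCF code.
final class ThrottleMath {

    // Average spacing between fetches implied by a fetches-per-minute cap.
    static long minIntervalMillis(int maxFetchesPerMinute) {
        return 60_000L / maxFetchesPerMinute;
    }

    // Seconds needed to transfer a document of the given size at a kbytes/sec cap.
    static double transferSeconds(long documentKBytes, int maxKBytesPerSecond) {
        return (double) documentKBytes / maxKBytesPerSecond;
    }

    public static void main(String[] args) {
        // Bandwidth section: 20 fetches/min means one fetch every 3000 ms.
        System.out.println(minIntervalMillis(20));
        // Throttling section: 100 fetches/min on average means one every 600 ms.
        System.out.println(minIntervalMillis(100));
        // A 500 kB page at 2000 kbytes/sec takes a quarter of a second.
        System.out.println(transferSeconds(500, 2000));
    }
}
```

With a cap of 20 fetches per minute, most of the 25 to 30 allowed connections would sit idle most of the time, which fits worker threads waiting on throttling.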

Erlend


Re: [VOTE] Release Apache ManifoldCF 1.4, RC1

2013-10-23 Thread Erlend Garåsen


+1

Looks good.

1. Deployed the binary version on Resin and did a test crawl.
2. Built the source version using Ant
3. Ran UI tests
4. Built docs (ant doc) using Forrest 0.9
5. Ran ant test
6. Ran the single process model within the example dir, started the web 
crawler and posted to Solr 4

7. Ran multiprocess example, started the web crawler and posted to Solr.

Erlend

On 10/22/13 12:05 AM, Karl Wright wrote:

Please vote on whether to release Apache ManifoldCF 1.4, RC1.

The release candidate can be downloaded from:

http://people.apache.org/~kwright/apache-manifoldcf-1.4

There is a tag at:

http://svn.apache.org/repos/asf/manifoldcf/tags/release-1.4-RC1

This release contains a substantial refactoring of the SharePoint
connector, as
well as new features including attachment crawling.  Proxy support for the
wiki
and jira connectors has also been added.

It also fixes CONNECTORS-790 and CONNECTORS-791.

Vote will remain open for at least 72 hours.

Thanks,
Karl





Re: [VOTE] Release Apache ManifoldCF 1.4, RC0

2013-10-21 Thread Erlend Garåsen


-1

I'm getting an NPE when running ant test.

I will of course withdraw my vote if I have forgotten a crucial step 
before running the tests, but I don't think that is the case.


Otherwise, I completed the following tests successfully:

1. Deployed the binary version on Resin and did a test crawl.
2. Built the source version using Ant
3. Ran UI tests
4. Built docs (ant doc) using Forrest 0.9

Stack trace:

[junit] 2178 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - started o.e.j.w.WebAppContext{/mcf-api-service,file:/private/var/folders/11/q0gk5wfs4pl662rzx319gg14gn/T/jetty-0.0.0.0-8346-mcf-api-service.war-_mcf-api-service-any-/webapp/},../../../framework/build/war-proprietary/mcf-api-service.war
[junit] 2190 [main] INFO org.eclipse.jetty.server.AbstractConnector - Started SelectChannelConnector@0.0.0.0:8346 STARTING
[junit] 32865 [qtp1273157698-145] WARN org.eclipse.jetty.servlet.ServletHandler - /mcf-api-service/json/jobstatuses/1382362019797
[junit] java.lang.NullPointerException
[junit] 	at org.apache.manifoldcf.crawler.system.ManifoldCF.apiReadJobStatus(ManifoldCF.java:2063)
[junit] 	at org.apache.manifoldcf.crawler.system.ManifoldCF.executeReadCommand(ManifoldCF.java:3196)
[junit] 	at org.apache.manifoldcf.apiservlet.APIServlet.executeRead(APIServlet.java:231)
[junit] 	at org.apache.manifoldcf.apiservlet.APIServlet.doGet(APIServlet.java:77)
[junit] 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
[junit] 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
[junit] 	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:547)
[junit] 	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:480)
[junit] 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
[junit] 	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:520)
[junit] 	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
[junit] 	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:941)
[junit] 	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:409)
[junit] 	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:186)
[junit] 	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:875)
[junit] 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
[junit] 	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
[junit] 	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
[junit] 	at org.eclipse.jetty.server.Server.handle(Server.java:349)
[junit] 	at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:441)
[junit] 	at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:919)
[junit] 	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:582)
[junit] 	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:218)
[junit] 	at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:51)
[junit] 	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:586)
[junit] 	at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:44)
[junit] 	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:598)
[junit] 	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:533)
[junit] 	at java.lang.Thread.run(Thread.java:722)
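
The request that triggers the NPE is a plain GET on the jobstatuses resource of the API service. A minimal sketch for repeating that request by hand while the test Jetty is up, assuming the host is localhost and using the port (8346) and path from the log above; the class and method names are invented for illustration:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical reproduction helper; only the path and port come from the log.
final class JobStatusProbe {

    // Build the jobstatuses URI seen in the Jetty warning.
    static URI jobStatusUri(String host, int port, long jobId) {
        return URI.create("http://" + host + ":" + port
                + "/mcf-api-service/json/jobstatuses/" + jobId);
    }

    // Issue a GET and return the HTTP status code (needs a running server).
    static int statusCode(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Print the URI only; call statusCode(...) when the server is running.
        System.out.println(jobStatusUri("localhost", 8346, 1382362019797L));
    }
}
```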

Erlend

On 10/21/13 2:23 AM, Karl Wright wrote:

Please vote on whether to release Apache ManifoldCF 1.4, RC0.

The release candidate can be downloaded from:

http://people.apache.org/~kwright/apache-manifoldcf-1.4

There is a tag at:

http://svn.apache.org/repos/asf/manifoldcf/tags/release-1.4-RC0

This release contains a substantial refactoring of the SharePoint
connector, as
well as new features including attachment crawling.  Proxy support for the
wiki
and jira connectors has also been added.

Vote will remain open for at least 72 hours.

Thanks,
Karl





Re: 1.3 release schedule

2013-06-27 Thread Erlend Garåsen


I'm sorry to report that I haven't worked on the Hydra connector for the 
last month. I have been busy with a major release of our search project 
at the university, and my summer vacation starts tomorrow. I have 
also been busy creating patches for Solr.


Erlend

On 6/24/13 5:13 PM, Karl Wright wrote:

Hi All,

Our quarterly release schedule means that we'll need to code-freeze what is
in 1.3 in about 1 month.

There is currently a *lot* of outstanding activity that needs to solidify
between now and then, and I'm not likely to be able to get done with
everything currently assigned to me.  If you have the capacity to look at
any of these tickets, please let me know.

Thanks,
Karl





Re: [VOTE] Release Apache ManifoldCF 1.2, RC1

2013-05-08 Thread Erlend Garåsen


+1

- Deployed and started a big crawl on Resin.
- Ran:
ant uitest
ant doc
- Built using Maven 3.0.4

I will withdraw my vote if CONNECTORS-682 is still not resolved. So far 
so good, the job has been running for 7 hours. I will check it again on 
Friday since I will be away from my computer till then.


Erlend

On 08.05.13 12.50, Karl Wright wrote:

Please vote on whether to release Apache ManifoldCF 1.2, RC1.

The release artifact can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.2

The release tag can be found at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.2-RC1

Fixes from the last RC include:
- DropBox connector documentation
- A fix for CONNECTORS-682

The 1.2 release has a large number of changes in it, including a new
connector for DropBox, and also new framework functionality for minimal
crawls, with better support for ADD_CHANGE_DELETE models of crawling.  (See
CHANGES.txt for a complete list.)

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.2, RC1

2013-05-08 Thread Erlend Garåsen

On 08.05.13 18.00, Erlend Garåsen wrote:


I will withdraw my vote if CONNECTORS-682 is still not resolved. So far
so good, the job has been running for 7 hours. I will check it again on
Friday since I will be away from my computer till then.


I just want to report that the job completed without any errors, so 
CONNECTORS-682 seems to be resolved. Over 14000 documents crawled.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Release status

2013-04-29 Thread Erlend Garåsen

On 28.04.13 23.27, Karl Wright wrote:


The upshot is that the release is going to be late.  I am not yet sure
*how* late.  Will keep everyone posted.


I have explained the problem in detail to my colleague, who will get 
back to work tomorrow when I'm leaving Norway. Unfortunately I don't 
think I will bring my laptop with me to the US since I want to carry as 
little as possible, only my iPad, so it will be difficult to work more 
on this issue from tomorrow. I'll be back on Monday next week.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Disabling retries

2013-03-07 Thread Erlend Garåsen


I enabled this functionality once. In previous releases I think the 
retry functionality was enabled by default. In HttpClient 4, this became 
more complicated. I think retryRequest must return true in order to enable 
retries, but first you need to check for several exception types to 
decide whether a retry makes sense.


Just a warning - I haven't tested this, but here's what I added 
some time ago. It will probably need to be adapted to the MCF environment 
to decide when to retry:


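import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;
import org.apache.http.HttpEntityEnclosingRequest;
import org.apache.http.HttpRequest;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.protocol.ExecutionContext;
import org.apache.http.protocol.HttpContext;

HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {

    public boolean retryRequest(IOException exception,
            int executionCount, HttpContext context) {

        if (executionCount >= 3) {
            // Do not retry if over max retry count
            return false;
        }
        if (exception instanceof InterruptedIOException) {
            // Timeout
            return false;
        }
        if (exception instanceof UnknownHostException) {
            // Unknown host
            return false;
        }
        if (exception instanceof ConnectException) {
            // Connection refused
            return false;
        }
        if (exception instanceof SSLException) {
            // SSL handshake exception
            return false;
        }
        HttpRequest request = (HttpRequest)
            context.getAttribute(ExecutionContext.HTTP_REQUEST);
        // Requests without an enclosed entity are treated as idempotent
        boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);

        if (idempotent) {
            // Retry if the request is considered idempotent
            return true;
        }
        return false;
    }
};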
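
The decision table in the handler above can be tried out without HttpClient on the classpath. The following is only a sketch that restates the same rules with JDK exception types; the class name, the method name, and the boolean idempotent parameter (standing in for the HttpEntityEnclosingRequest check) are assumptions for illustration:

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

// Library-free restatement of the retry rules; not the HttpClient API itself.
final class RetryDecision {

    static final int MAX_EXECUTIONS = 3;

    static boolean shouldRetry(IOException exception, int executionCount,
            boolean idempotent) {
        if (executionCount >= MAX_EXECUTIONS) {
            return false; // over the max retry count
        }
        if (exception instanceof InterruptedIOException) {
            return false; // timeout (includes SocketTimeoutException)
        }
        if (exception instanceof UnknownHostException) {
            return false; // unknown host
        }
        if (exception instanceof ConnectException) {
            return false; // connection refused
        }
        if (exception instanceof SSLException) {
            return false; // SSL handshake problem
        }
        // Everything else is retried only for idempotent requests.
        return idempotent;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(new IOException("connection reset"), 1, true));  // true
        System.out.println(shouldRetry(new IOException("connection reset"), 1, false)); // false
        System.out.println(shouldRetry(new ConnectException("refused"), 1, true));      // false
    }
}
```

For the Solr case discussed in this thread, posts carry a non-reusable entity, so under these rules they would never be retried.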

Erlend

On 07.03.13 14.09, Karl Wright wrote:

Hi all,

We have code that creates a DefaultHttpClient instance for use with
Solr.  The HttpEntity that is created when sending data is not
reusable, so we've disabled retries (we thought) using the following
code:

  DefaultHttpClient localClient =
    new DefaultHttpClient(connectionManager, params);

  // No retries
  localClient.setHttpRequestRetryHandler(new HttpRequestRetryHandler()
  {
    public boolean retryRequest(
      IOException exception,
      int executionCount,
      HttpContext context)
    {
      return false;
    }
  });


Unfortunately it does not seem to have actually worked; we are still
seeing non-reusable stream retry errors in some cases.  Has anybody
seen this before, and what
are we doing wrong?

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Fwd: FW: Google Summer of Code 2013

2013-03-07 Thread Erlend Garåsen


You were thinking about the general Email connector?
https://issues.apache.org/jira/browse/CONNECTORS-553

Since I want to finish the Hydra Output Connector before I start working 
on this one, I think this might be a good candidate, as long as it's not 
too complicated and doesn't take longer than eight weeks to finish.


What about a simple LDAP connector for indexing LDAP content? 
(people/unit trees etc.).


Erlend

On 07.03.13 16.19, Karl Wright wrote:

Hey everyone -

Now is the time to come up with reasonable ManifoldCF enhancement
ideas that a student could likely succeed at in eight weeks' time.
This can range from new connectors (I think we still have an
outstanding ticket for a Microsoft Outlook connector, for instance),
to significant enhancements you can think of that might be useful.  If
you don't want to create a ticket yourself, I'm happy to do it.  Just
drop me an email.

Karl



-Original Message-
From: ext Ulrich Stärk [mailto:u...@apache.org]
Sent: Tuesday, March 05, 2013 10:27 AM
To: p...@apache.org
Subject: Google Summer of Code 2013

Hello PMCs,

Google Summer of Code [1] is the ideal opportunity for you to attract
new contributors to your projects.

The ASF will apply as a participating organization meaning individual
projects don't have to apply separately.

If you want to participate with your project you NOW need to

- understand what it means to be a mentor [2].

- record your project ideas. Just create issues in JIRA, label them
with gsoc2013, and they will show up at [3]. Please be as specific as
possible when describing your idea. Include the programming language,
the tools and skills required, but try not to scare potential students
away. They are supposed to learn what's required before the program
starts. Use labels, e.g. for the programming language (java, c, c++,
erlang, python, brainfuck, ...) or technology area (cloud, xml, web,
foo, bar, ...) and record them at [5]. Please use the COMDEV JIRA
project for recording your ideas if your project doesn't use JIRA
(e.g. httpd, ooo). Contact d...@community.apache.org if you need
assistance.

- subscribe to code-awa...@apache.org (restricted to potential
mentors, meant to be used as a private list - general discussions on
the public d...@community.apache.org list as much as possible please).
Use a recognized address when subscribing (@apache.org or one of your
alias addresses on record).

Note that the ASF isn't accepted yet, nevertheless you *really* should
start recording your ideas now.

Over the years we were able to complete hundreds of projects
successfully. Some of our prior students are active contributors now!
Let's make this a success again this year!


Uli

P.S.: Except for the private parts (label spreadsheet mostly), this
email is free to be shared publicly if you want to.

[1] http://www.google-melange.com/gsoc/homepage/google/gsoc2013
[2] http://community.apache.org/guide-to-being-a-mentor.html
[3] http://s.apache.org/gsoc2013ideas
[4] http://community.apache.org/gsoc.html
[5] http://s.apache.org/gsoclabels




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Fwd: FW: Google Summer of Code 2013

2013-03-07 Thread Erlend Garåsen

On 07.03.13 19.26, Karl Wright wrote:

Yes, I thought the email connector might be a good choice.


I think an email connector is more useful than an ldap connector.


I'm not sure what the use case would be for an LDAP repository
connector.  Can you describe a scenario where it might be useful?


We were thinking about indexing LDAP resources at the university since 
we're offering people and unit searches on our web pages. The problem is 
that we are limited by the search facilities LDAP itself offers. If we 
had the opportunity to search this information from Solr, we could do 
faceting and more complex searches, for instance list all professors 
within an academic field/topic etc.


*If* we decide to develop such a connector, we will probably make it, 
but we might find other solutions.


So basically, I think an LDAP connector might be interesting for bigger 
institutions with a lot of information stored in LDAP.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0

2013-02-13 Thread Erlend Garåsen


+1 (I'm withdrawing my previous -1).

- Downloaded the binary zip version:
  * It now runs stably on Resin, even after a test where I stopped the 
running job and started it once again. Posting to Solr 4.0 works.

- Downloaded the source zip version:
  * ran ant uitest
  * ran Jetty within both the example and multithread-example folders 
and posted to Solr 3.1 (backward compatibility test)


Erlend

On 12.02.13 20.39, Karl Wright wrote:

Right, since this is a long-standing problem and this is a patch
release, I'd hope that we could hold off until the next real release
for this one.

I doubt it will be too challenging to fix.

Karl

On Tue, Feb 12, 2013 at 1:22 PM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:


You are probably right. I can withdraw my vote, but I'm unsure whether I
should wait and see what happens with the crawl I just started on our test
server with new hop filter settings. I can then make a final vote tomorrow.
All the other tests I have done with this RC have passed.

Erlend


On 12.02.13 19.06, Karl Wright wrote:


If this problem is non-critical, and has been around a long time, it
is not necessary to cancel a release in order to fix it.  The logic in
question has not changed since probably ManifoldCF 0.3 or so.

Karl

On Tue, Feb 12, 2013 at 1:04 PM, Erlend Garåsen e.f.gara...@usit.uio.no
wrote:



-1 due to CONNECTORS-644.

The job restart link does not work, causing "Error: Repeated service
interruptions - failure getting document version". This functionality
is so basic that I think it should be fixed for this release.

This problem is totally unrelated to Resin. It happens if you are running
Jetty as well.


Erlend

On 10.02.13 20.01, Karl Wright wrote:



Please vote on whether to release Apache ManifoldCF 1.1.1, RC0.

The release artifact can be downloaded from:

http://people.apache.org/~kwright/apache-manifoldcf-1.1.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1.1-RC0

This release has been made primarily to fix a leak of connection
handles, described by CONNECTORS-638.  Other major fixes have also
been included, specifically:

- Fix the maven build (various tickets)
- Fix the rather broken Elastic Search connector (also various tickets)

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0

2013-02-12 Thread Erlend Garåsen


I tried to restart the crawl ten minutes after I started it. The job 
ends after a while and will not start again. This is the status after it 
stopped:

Error: Repeated service interruptions - failure getting document version

If I start it manually, it just fetches and fetches without posting 
anything to Solr.


The only thing I did while it was running the first time was to edit the 
exclude list once - removed a white space at the end of a reg exp rule.


Then I commented out the regexp line in case it DID affect the 
documents (it shouldn't) and restarted again. Same problem - the job 
does not want to start: Error: Repeated service interruptions - failure 
getting document version


Just before the job ends, the result description shows Interrupted: Job 
no longer active. This is normal, but why won't MCF start the job again 
after it stops?


Same problem after I manually start it - MCF just fetches and fetches 
without posting anything to Solr.


E

On 12.02.13 13.38, Erlend Garåsen wrote:


I have changed some settings in MCF which will reduce the heavy load on
our PG server (changed hop count mode to Keep unreachable documents,
forever).

I will start a new crawl today and make a final vote tomorrow.

Erlend

On 11.02.13 20.49, Karl Wright wrote:

I've looked at this enough now to conclude that this problem is
probably not intrinsic to ManifoldCF.  It may instead be due to
timeouts present in Erlend's PostgreSQL installation.  I am therefore
leaving the vote open until there is some reason to believe that there
is a general problem here.

Thanks,
Karl


On Mon, Feb 11, 2013 at 10:02 AM, Erlend Garåsen
e.f.gara...@usit.uio.no wrote:


The job just stopped working and nothing suspicious in my logs. The
database
people are saying that we have connection locks again (idle in
transaction).

Karl, you mentioned that in order to use the following parameter:
<property name="org.apache.manifoldcf.database.connectiontracking"
value="true"/>
there was no way back to use an older release due to changes in the
database. That's ok, but was that just a temporary functionality, which
means, I need to clear my database in order to use 1.1.1 RC0?

Erlend


On 10.02.13 20.01, Karl Wright wrote:


Please vote on whether to release Apache ManifoldCF 1.1.1, RC0.

The release artifact can be downloaded from:

http://people.apache.org/~kwright/apache-manifoldcf-1.1.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1.1-RC0

This release has been made primarily to fix a leak of connection
handles, described by CONNECTORS-638.  Other major fixes have also
been included, specifically:

- Fix the maven build (various tickets)
- Fix the rather broken Elastic Search connector (also various tickets)

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0

2013-02-12 Thread Erlend Garåsen


You are probably right. I can withdraw my vote, but I'm unsure whether I 
should wait and see what happens with the crawl I just started on our 
test server with new hop filter settings. I can then make a final vote 
tomorrow. All the other tests I have done with this RC have passed.


Erlend

On 12.02.13 19.06, Karl Wright wrote:

If this problem is non-critical, and has been around a long time, it
is not necessary to cancel a release in order to fix it.  The logic in
question has not changed since probably ManifoldCF 0.3 or so.

Karl

On Tue, Feb 12, 2013 at 1:04 PM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:


-1 due to CONNECTORS-644.

The job restart link does not work, causing "Error: Repeated service
interruptions - failure getting document version". This functionality
is so basic that I think it should be fixed for this release.

This problem is totally unrelated to Resin. It happens if you are running
Jetty as well.


Erlend

On 10.02.13 20.01, Karl Wright wrote:


Please vote on whether to release Apache ManifoldCF 1.1.1, RC0.

The release artifact can be downloaded from:

http://people.apache.org/~kwright/apache-manifoldcf-1.1.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1.1-RC0

This release has been made primarily to fix a leak of connection
handles, described by CONNECTORS-638.  Other major fixes have also
been included, specifically:

- Fix the maven build (various tickets)
- Fix the rather broken Elastic Search connector (also various tickets)

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0

2013-02-11 Thread Erlend Garåsen

On 11.02.13 16.26, Karl Wright wrote:

trunk has a different schema than 1.1.1.  So yes, you'd have to blow
away the old database to go back.


OK, who knows. Maybe this was the reason why it just stopped. I'll clean 
up by deleting all tables and related db resources and try again.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.1.1, RC0

2013-02-11 Thread Erlend Garåsen



On 11.02.13 17.50, Karl Wright wrote:

The trace didn't look like it was the result of stuck ManifoldCF locks.


Even though it does not seem to be a result of this, I found an error in 
our control script. The path to executecommand.sh was incorrect for the 
stop function.


I'll send your suggestions to the PostgreSQL admins and try to get more 
information.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: ManifoldCF 1.1.1?

2013-02-08 Thread Erlend Garåsen

On 08.02.13 10.31, Maciej Liżewski wrote:

I would go with patch. It is serious enough because causes problems with
running crawler jobs...


+1.

I'm running the job once again and examining the logs. I will report any 
suspicious deviations continuously.


I think it is advisable that I complete this job before I conclude that 
we have got rid of the problem. This may take some time, probably about 
30 hours.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.1, RC7

2013-01-30 Thread Erlend Garåsen


Difficult to say what caused these problems, but I have now deployed RC8, 
which works well on Resin. I just have a couple more tests to do, so 
I will give my final vote within a couple of hours.


Erlend


On 29.01.13 17.41, Karl Wright wrote:

I ran the multiprocess example, with httpmime.jar just as we deliver
it in the connector-lib directory, and I did not see this issue.  It
is almost certainly a configuration issue.

Karl

On Tue, Jan 29, 2013 at 11:26 AM, Erlend Garåsen
e.f.gara...@usit.uio.no wrote:


I have to run now, but I will investigate this further. BTW, I have the
following in my lib folder, so it should work:
httpmime.jar

I did not see this yesterday when I was testing RC6 with Resin. The
difference now is that the crawler just fetches and fetches, but nothing
gets posted to Solr.

I hope it is me who has misconfigured something, but I will get back to this
as soon as possible.

FATAL 2013-01-29 17:19:17,609 (Worker thread '17') - Error tossed: org/apache/http/entity/mime/content/ContentBody
java.lang.NoClassDefFoundError: org/apache/http/entity/mime/content/ContentBody
 at org.apache.manifoldcf.agents.output.solr.HttpPoster.init(HttpPoster.java:246)
 at org.apache.manifoldcf.agents.output.solr.SolrConnector.getSession(SolrConnector.java:256)
 at org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:629)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:1598)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:469)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
 at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1651)
 at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.deleteDocument(WorkerThread.java:1672)
 at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1445)
 at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
Caused by: java.lang.ClassNotFoundException: org.apache.http.entity.mime.content.ContentBody
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 ... 11 more
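
A NoClassDefFoundError like the one above means the class was not visible to the class loader that needed it. A minimal sketch for checking whether a class can be loaded from a given classpath; the class name ClasspathProbe is invented, and it only tests the class loader it runs in, which may differ from the loader ManifoldCF uses for connector jars:

```java
// Hypothetical diagnostic; run it with the same -cp as the failing process.
final class ClasspathProbe {

    // True if the named class can be found by this class's own class loader.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static init.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.apache.http.entity.mime.content.ContentBody";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

If the probe reports false with httpmime.jar supposedly in place, the jar is probably in a directory that is not on the effective classpath.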

Erlend


On 28.01.13 23.09, Karl Wright wrote:


Please vote on whether or not to release ManifoldCF 1.1, RC7.

The release artifact can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC7

This release candidate fixes a packaging problem for wars similar to
CONNECTORS-619.  It also fixes a problem with the CMIS connector
and another SolrJ-related issue (CONNECTORS-622 and CONNECTORS-623).

This release candidate provides a better workaround for
CONNECTORS-616 than RC5.  It also fixes CONNECTORS-617.

This release candidate fixes one problem since RC4, which is
the inconfigurability of the commit action path for Solr commits in
the Solr connector.  This needed to be fixed to maintain backwards
compatibility.  CONNECTORS-621.

This release candidate fixes two problems since RC3.  The problems
were in the included jars for the multiprocess example (CONNECTORS-619)
and in connection leakage for JDBC handles (CONNECTORS-620).

This release candidate fixes one problem since RC2.  The problem is
CONNECTORS-618, which relates to MySQL performance.

This release candidate fixes one additional problem since RC1.  The
problem is CONNECTORS-616, and relates to Solr dropping connections
during
indexing.

This release candidate fixes two other problems since RC0, both
related to Solr 4.0.0 support.
- CONNECTORS-613: The version of Tika used in Solr 4.0.0 cannot
extract text unless told an accurate mime type.  While this is
probably a Tika bug, in this ticket we at least make sure a good guess
as to the mime type is sent to Solr.
- CONNECTORS-614: Fix logic having to do with releasing idle Solr
connections.  This shows up as socket timeout exceptions, because it
becomes very easy to exhaust the Solr application server's thread pool
when idle connections are not released in a timely way.

This release includes a significant amount of long-planned upgrading
and refactoring since Apache ManifoldCF 1.0.1, including:
- Port to HttpComponents from commons-httpclient
- Port

Re: [VOTE] Release Apache ManifoldCF 1.1, RC8

2013-01-30 Thread Erlend Garåsen


+1

- Deployed on Resin. Adding documents to Solr 4.0 and document deletion 
works

- Ran ant uitest
- Ran ant doc
- Ran single process.
- Started Jetty from the example folder and tried to post to Solr 3.1 
(backward compatibility test)


I have examined all the logs in case there were stack traces or other 
error messages, but everything seems to be OK.


Erlend

On 30.01.13 03.07, Karl Wright wrote:

Please vote on whether or not to release ManifoldCF 1.1, RC8.

The release artifact can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC8

This release candidate upgrades to SolrJ 4.1.0, with necessary dependency
upgrades.  This was necessary because SolrJ 4.0.0 had serious issues in its
Solr Cloud support.  SolrJ 4.0.0 also would not work with Solr 4.1.0, but the
SolrJ 4.1.0 does work with Solr 4.0.0, mostly.  Potential issues remain with
cross-version SolrCloud support.  See CONNECTORS-627.

This release candidate fixes a packaging problem for wars similar to
CONNECTORS-619.  It also fixes a problem with the CMIS connector
and another SolrJ-related issue (CONNECTORS-622 and CONNECTORS-623).

This release candidate provides a better workaround for
CONNECTORS-616 than RC5.  It also fixes CONNECTORS-617.

This release candidate fixes one problem since RC4, which is
the inconfigurability of the commit action path for Solr commits in
the Solr connector.  This needed to be fixed to maintain backwards
compatibility.  CONNECTORS-621.

This release candidate fixes two problems since RC3.  The problems
were in the included jars for the multiprocess example (CONNECTORS-619)
and in connection leakage for JDBC handles (CONNECTORS-620).

This release candidate fixes one problem since RC2.  The problem is
CONNECTORS-618, which relates to MySQL performance.

This release candidate fixes one additional problem since RC1.  The
problem is CONNECTORS-616, and relates to Solr dropping connections
during indexing.

This release candidate fixes two other problems since RC0, both
related to Solr 4.0.0 support.
- CONNECTORS-613: The version of Tika used in Solr 4.0.0 cannot
extract text unless told an accurate mime type.  While this is
probably a Tika bug, in this ticket we at least make sure a good guess
as to the mime type is sent to Solr.
- CONNECTORS-614: Fix logic having to do with releasing idle Solr
connections.  This shows up as socket timeout exceptions, because it
becomes very easy to exhaust the Solr application server's thread pool
when idle connections are not released in a timely way.

This release includes a significant amount of long-planned upgrading
and refactoring since Apache ManifoldCF 1.0.1, including:
- Port to HttpComponents from commons-httpclient
- Port to SolrJ from homegrown for the Solr connector, so that
SolrCloud is supported
- Improved NTLM support
- Partial Kerberos support
- Many other improvements, which are summarized in CHANGES.txt




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.1, RC7

2013-01-29 Thread Erlend Garåsen


I have to run now, but I will investigate this further. BTW, I have the 
following in my lib folder, so it should work:

httpmime.jar

I did not see this yesterday when I was testing RC6 with Resin. The 
difference now is that the crawler just fetches and fetches, but nothing 
gets posted to Solr.


I hope it is me who has misconfigured something, but I will get back to 
this as soon as possible.


FATAL 2013-01-29 17:19:17,609 (Worker thread '17') - Error tossed: org/apache/http/entity/mime/content/ContentBody
java.lang.NoClassDefFoundError: org/apache/http/entity/mime/content/ContentBody
	at org.apache.manifoldcf.agents.output.solr.HttpPoster.init(HttpPoster.java:246)
	at org.apache.manifoldcf.agents.output.solr.SolrConnector.getSession(SolrConnector.java:256)
	at org.apache.manifoldcf.agents.output.solr.SolrConnector.removeDocument(SolrConnector.java:629)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:1598)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:469)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1651)
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.deleteDocument(WorkerThread.java:1672)
	at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1445)
	at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
Caused by: java.lang.ClassNotFoundException: org.apache.http.entity.mime.content.ContentBody

at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 11 more
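
A quick way to check whether the class named in the trace above is visible
on a given classpath is a small loader test. This is only a minimal sketch:
the class name ClasspathCheck and the method isLoadable are made up for
illustration and are not part of ManifoldCF. Inside an application server
each web application usually has its own class loader, so the result is only
meaningful when the check runs with the same classpath as the failing process.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but one of its dependencies is missing
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class reported in the stack trace
        String name = args.length > 0
                ? args[0]
                : "org.apache.http.entity.mime.content.ContentBody";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Running it with the same lib folder on the classpath as the agents process
should show whether httpmime.jar is actually picked up there.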

Erlend

On 28.01.13 23.09, Karl Wright wrote:

Please vote on whether or not to release ManifoldCF 1.1, RC7.

The release artifact can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC7

This release candidate fixes a packaging problem for wars similar to
CONNECTORS-619.  It also fixes a problem with the CMIS connector
and another SolrJ-related issue (CONNECTORS-622 and CONNECTORS-623).

This release candidate provides a better workaround for
CONNECTORS-616 than RC5.  It also fixes CONNECTORS-617.

This release candidate fixes one problem since RC4, which is
the inconfigurability of the commit action path for Solr commits in
the Solr connector.  This needed to be fixed to maintain backwards
compatibility.  CONNECTORS-621.

This release candidate fixes two problems since RC3.  The problems
were in the included jars for the multiprocess example (CONNECTORS-619)
and in connection leakage for JDBC handles (CONNECTORS-620).

This release candidate fixes one problem since RC2.  The problem is
CONNECTORS-618, which relates to MySQL performance.

This release candidate fixes one additional problem since RC1.  The
problem is CONNECTORS-616, and relates to Solr dropping connections
during indexing.

This release candidate fixes two other problems since RC0, both
related to Solr 4.0.0 support.
- CONNECTORS-613: The version of Tika used in Solr 4.0.0 cannot
extract text unless told an accurate mime type.  While this is
probably a Tika bug, in this ticket we at least make sure a good guess
as to the mime type is sent to Solr.
- CONNECTORS-614: Fix logic having to do with releasing idle Solr
connections.  This shows up as socket timeout exceptions, because it
becomes very easy to exhaust the Solr application server's thread pool
when idle connections are not released in a timely way.

This release includes a significant amount of long-planned upgrading
and refactoring since Apache ManifoldCF 1.0.1, including:
- Port to HttpComponents from commons-httpclient
- Port to SolrJ from homegrown for the Solr connector, so that
SolrCloud is supported
- Improved NTLM support
- Partial Kerberos support
- Many other improvements, which are summarized in CHANGES.txt




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [CANCEL][VOTE] Release Apache ManifoldCF 1.1, RC6

2013-01-28 Thread Erlend Garåsen


RC6 runs well on Resin 4, so please prepare the next RC7.

Erlend

On 25.01.13 15.57, Karl Wright wrote:

FWIW, I'm going to hold off on spinning RC7 until Erlend signs off on
RC6 in his University of Oslo Resin environment.  Hopefully that should
be sometime this weekend.

Karl

On Fri, Jan 25, 2013 at 9:31 AM, Karl Wright daddy...@gmail.com wrote:

Agreed that we aren't shooting for perfection.  Absence of significant
regression is the best we can hope for. ;-)

Unfortunately, due to the significant amount of refactoring that took
place in this release, we're still discovering exactly where the
bodies lie. But I think we are getting close.

Karl

On Fri, Jan 25, 2013 at 8:42 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:

Hi,

On Fri, Jan 25, 2013 at 3:35 PM, Karl Wright daddy...@gmail.com wrote:

I've pulled up this fix to the release branch, and the vote on RC6 is
canceled.  However, can everyone do at least a preliminary smoke-test
evaluation of RC6 at this time, and not wait for the final vote?  We
aren't going to converge if we all keep waiting to do the evaluation
until we think there are no more RC's likely.


No software is perfect and it's always possible to cut more releases
to address issues as they come along, so in general I'd recommend
people not to vote -1 because of each individual bug they encounter.

As long as the code compiles, passes the existing test suite and has
no other major issues (known security vulnerabilities, licensing
problems, etc.), it should be ready to release. Smaller issues like
CONNECTORS-622 can and IMHO should be fixed in the following release
instead of holding up the current one.

It's better to release early and often than to wait for perfection.

BR,

Jukka Zitting



--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache Manifold 1.1, RC3

2013-01-22 Thread Erlend Garåsen


-1 so far, until the problem described below is solved or explained.

Running Jetty within the example folder seems to work normally, but not 
within the multiprocess-example folder. In both configurations I have 
defined a Solr Output Connector and a web crawler. The funny thing 
within the latter folder is that nothing is sent to Solr. The crawler 
just fetches and fetches, and that is the only activity I can see.


I have run:
./start-database.sh
./initialize.sh
./start-agents.sh
./start-webapps.sh

The Solr Output connection is working and I have gone through the 
settings in my job, which are very similar to the configuration from my 
first attempt within the example folder, but nothing shows up.


When I looked in my logs, I discovered this:
FATAL 2013-01-22 14:10:31,802 (Worker thread '43') - Error tossed: Could not initialize class org.apache.solr.client.solrj.impl.HttpSolrServer
java.lang.NoClassDefFoundError: Could not initialize class org.apache.solr.client.solrj.impl.HttpSolrServer
	at org.apache.manifoldcf.agents.output.solr.HttpPoster.init(HttpPoster.java:246)
	at org.apache.manifoldcf.agents.output.solr.SolrConnector.getSession(SolrConnector.java:256)
	at org.apache.manifoldcf.agents.output.solr.SolrConnector.addOrReplaceDocument(SolrConnector.java:609)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1651)
	at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1409)
	at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)


BTW, I'm running Solr 3.1, not the latest version. I don't think this 
has anything to do with the problems described above since my Solr 
server does not seem to be hit by MCF at all.


Erlend

On 22.01.13 09.59, Karl Wright wrote:

Please vote on whether or not to release ManifoldCF 1.1, RC3.

The release artifact can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC3

Please vote on whether or not to release ManifoldCF 1.1, RC2.

The release artifact can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.1

There is a tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.1-RC2

This release candidate fixes one problem since RC2.  The problem is
CONNECTORS-618, which relates to MySQL performance.

This release candidate fixes one additional problem since RC1.  The
problem is CONNECTORS-616, and relates to Solr dropping connections
during indexing.

This release candidate fixes two other problems since RC0, both
related to Solr 4.0.0 support.
- CONNECTORS-613: The version of Tika used in Solr 4.0.0 cannot
extract text unless told an accurate mime type.  While this is
probably a Tika bug, in this ticket we at least make sure a good guess
as to the mime type is sent to Solr.
- CONNECTORS-614: Fix logic having to do with releasing idle Solr
connections.  This shows up as socket timeout exceptions, because it
becomes very easy to exhaust the Solr application server's thread pool
when idle connections are not released in a timely way.

This release includes a significant amount of long-planned upgrading
and refactoring since Apache ManifoldCF 1.0.1, including:
- Port to HttpComponents from commons-httpclient
- Port to SolrJ from homegrown for the Solr connector, so that
SolrCloud is supported
- Improved NTLM support
- Partial Kerberos support
- Many other improvements, which are summarized in CHANGES.txt

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: New committer: Minoru Osuka

2013-01-21 Thread Erlend Garåsen


Welcome as a ManifoldCF committer, Minoru!

Erlend

On 10.01.13 21.00, Karl Wright wrote:

The Project Management Committee (PMC) for Apache ManifoldCF
has asked Minoru Osuka to become a committer and we are pleased
to announce that they have accepted.

Minoru has been active in using advanced features of Solr, and has been
instrumental in bringing our Solr connector into the modern era.

Please join me in welcoming Minoru to the Apache ManifoldCF project!

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Do we need an org.apache.manifoldcf.core.DBClean command class?

2012-12-18 Thread Erlend Garåsen

On 18.12.12 16.29, Karl Wright wrote:

Hmm, somehow you lost a connector jar out of the connector-lib or
connector-lib-proprietary area.  Deleting the jars before you clean up
the database is not going to work. ;-)


I guess not. This is probably what I have done. Maybe this behaviour is 
so odd that we do not need the functionality I was mentioning? As long 
as one is removing things in the correct order, problems will not show up.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Issues marked fix in ManifoldCF next

2012-12-10 Thread Erlend Garåsen


The Hydra framework has recently been changed and I only got some 
necessary documentation two weeks ago. The Hydra connector will 
therefore be included in version 1.2 instead.


Anyway, I have started to work on it and will commit some very basic 
stuff soon.


Fix version for CONNECTORS-193 is changed from next to 1.2 as well since 
this has been assigned to me for a while now.


Erlend

On 08.12.12 03.29, Karl Wright wrote:

Hi folks,

I've created a ManifoldCF 1.2 release in JIRA and triaged some tickets
I intend to work on for that release.   I've also closed/resolved a
fair number of tickets that were hanging around marked fix in
ManifoldCF next.  You may want to do the same...

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Developing an Email Connector

2012-10-31 Thread Erlend Garåsen


Yes, I'm aware of that. It's on my TODO list.

Thanks for all your suggestions and comments, but please make a comment 
in Jira next time. It will be easier for me if everything related to 
this connector is placed there. :)


The connector will not be ready for the next release since I have to 
finish another connector first.


Erlend

On 24.10.12 15.44, Maciej Liżewski wrote:

I was thinking about one more thing: document IDs for indexed emails... it
should be somehow possible to form them as valid URLs, but POP3 and IMAP
do not support this by themselves. So you have to create links to some
third party web-mail system, but there are a number of such systems and
the document ID should be customizable to support as many of them as
possible... what do you think?
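
One way to make the document ID customizable is a URL template that is filled
in per message. The following is only a rough sketch under assumptions: the
class MailDocumentId, the placeholders {folder} and {messageId}, and the
example web-mail URL are all hypothetical and do not describe the connector
tracked in CONNECTORS-553.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class MailDocumentId {

    /**
     * Fills a user-supplied URL template with the folder name and message ID.
     * The placeholder names are hypothetical, chosen for this sketch only.
     */
    public static String build(String template, String folder, String messageId) {
        return template
                .replace("{folder}", encode(folder))
                .replace("{messageId}", encode(messageId));
    }

    /** Form-encodes a value so it is safe inside a URL query string. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java runtime
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical web-mail URL layout; each installation would supply its own
        String template = "https://webmail.example.org/view?folder={folder}&id={messageId}";
        System.out.println(build(template, "INBOX", "<123@example.org>"));
    }
}
```

With a template per connection, each installation could point the IDs at
whatever web-mail system it actually runs.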

2012/10/15 Erlend Garåsen e.f.gara...@usit.uio.no



Sounds like a good idea.

I didn't even think about attachments, even though it's quite obvious that
we need to take care of them. :)

Erlend


On 15.10.12 14.40, Maciej Liżewski wrote:


I would like to add my 5cents here. I would like the connector to set
multivalued cat field with some categories which I could use with faceted
dynamic groups, like: with attachment, sender domain, sender name, etc.
Also would be nice to index also the attachments as linked documents (they
could have same categories as email message above)
On 15 Oct 2012 14:13, Karl Wright daddy...@gmail.com wrote:

  Sounds great!  I can't wait to see it.


Karl


On Mon, Oct 15, 2012 at 6:31 AM, Erlend Garåsen e.f.gara...@usit.uio.no
wrote:

Karl and I had a short discussion about such a connector in Cambridge
some months ago. Now I have created the following ticket regarding an
Email Connector:
https://issues.apache.org/jira/browse/CONNECTORS-553

I'm notifying the list in case some of you have comments or special
wishes.

Generally it will support IMAP and POP3, SSL/TLS, the possibility to
specify the port numbers if necessary, server certificate upload etc.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Developing an Email Connector

2012-10-15 Thread Erlend Garåsen


Sounds like a good idea.

I didn't even think about attachments, even though it's quite obvious 
that we need to take care of them. :)


Erlend

On 15.10.12 14.40, Maciej Liżewski wrote:

I would like to add my 5cents here. I would like the connector to set
multivalued cat field with some categories which I could use with faceted
dynamic groups, like: with attachment, sender domain, sender name, etc.
Also would be nice to index also the attachments as linked documents (they
could have same categories as email message above)
On 15 Oct 2012 14:13, Karl Wright daddy...@gmail.com wrote:


Sounds great!  I can't wait to see it.

Karl


On Mon, Oct 15, 2012 at 6:31 AM, Erlend Garåsen e.f.gara...@usit.uio.no
wrote:


Karl and I had a short discussion about such a connector in Cambridge
some months ago. Now I have created the following ticket regarding an
Email Connector:
https://issues.apache.org/jira/browse/CONNECTORS-553

I'm notifying the list in case some of you have comments or special
wishes.

Generally it will support IMAP and POP3, SSL/TLS, the possibility to
specify the port numbers if necessary, server certificate upload etc.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [PROPOSAL] Release a ManifoldCF 1.0.1 release

2012-10-10 Thread Erlend Garåsen

+1

Erlend

On 09.10.12 22.53, Karl Wright wrote:

Hi folks,

Due to the potential severity of CONNECTORS-551, I think it might be a
good idea to release a ManifoldCF 1.0.1 release which contains the fix
for this ticket.  Please can I have a show of hands as to whether
people agree that this is serious enough to warrant such a release.

Thanks!
Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: question about multiple languages

2012-10-09 Thread Erlend Garåsen

On 08.10.12 17.03, Maciej Liżewski wrote:


Now there are two possibilities:
1. when fields are untouched - processing data (stemming, etc) is same
for every document, which is rather wrong because polish stemming is
different from english one... :)
2. attributes are mapped to *_lang and every *_lang field has
different processing definition (stemming, stop words, etc).


The latter seems more reasonable to me and is the more common practice. 
There are different stemmers you may try out, such as Hunspell.


If you want to detect languages, I would use 
TikaLanguageIdentifierUpdateProcessorFactory:

http://wiki.apache.org/solr/LanguageDetection

It can be configured by using an Update Request Processor:
http://wiki.apache.org/solr/UpdateRequestProcessor


This part I understand,
but I am confused on how to perform valid queries in both cases? I
have single (simple) page which should work google-like: you enter a
text and get results. But there is no language guess process for
queries... Do I have to specify on each query whether it should search
in 'text_en' or 'text_pl' fields? If so - it is not very good because
I would like users to get all documents that match query no matter
what language they are written in. There are many similar words,
technical names, etc, which are same in many languages...


I think you should search in both fields, yes. I will explain why 
further down.



In other words - how to achieve google-like search with stemming for
multiple languages and without to force users to select language they
would like to search in?


Google guesses the query language. If you hit 
www.google.com, you will be redirected to www.google.pl if you're 
sitting in Poland. The same may be achieved in your application by 
detecting the browser's locale etc. Many web application frameworks have 
support for this. Then you may give (at query time) a higher boost to 
the fields belonging to the detected language.
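
As a rough illustration of the query-time boost, the sketch below builds a
value for the qf parameter of Solr's dismax/edismax query parsers. It assumes
that one of those parsers is in use and that the fields follow the text_en /
text_pl naming from your mail. The class name LanguageBoost and the boost
value are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LanguageBoost {

    /**
     * Builds a qf value that searches every language field and boosts the
     * field matching the user's locale, e.g. "text_en text_pl^2.0".
     */
    public static String queryFields(List<String> languages, Locale userLocale, double boost) {
        StringBuilder qf = new StringBuilder();
        for (String lang : languages) {
            if (qf.length() > 0) {
                qf.append(' ');
            }
            qf.append("text_").append(lang);
            // Boost only the field whose language matches the detected locale
            if (lang.equals(userLocale.getLanguage())) {
                qf.append('^').append(boost);
            }
        }
        return qf.toString();
    }

    public static void main(String[] args) {
        // A browser locale of pl-PL boosts the Polish field; English is still searched
        System.out.println(queryFields(Arrays.asList("en", "pl"), new Locale("pl", "PL"), 2.0));
    }
}
```

All fields are still searched, so words that are the same in several
languages keep matching regardless of the detected locale.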


Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.0, RC7

2012-10-03 Thread Erlend Garåsen
-manifoldcf-1.0

There is also an SVN tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.0-RC7

Fixes since RC6:

CONNECTORS-549

Fixes since RC5:

CONNECTORS-547
CONNECTORS-548 (documentation fix)

Fixes since RC4:

CONNECTORS-545

Fixes since RC3:

CONNECTORS-544




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.0, RC5

2012-10-02 Thread Erlend Garåsen

On 28.09.12 13.31, Erlend Garåsen wrote:


OK, I will give you a stack trace in the beginning of next week.


Do you still need the stack trace? If you do, I need to adjust the log 
level and/or change the source code in order to print it out.


I'm still a little bit worried about how MCF deals with 500 server 
errors since the job I started last Friday is still running. It retries 
and retries the three documents I previously mentioned.


Is it really normal behaviour that MCF retries the same document every 
fourth second after the last attempt and continues to do this (perhaps) 
a thousand times? MCF has probably been retrying these documents for 
four days now. I doubt this is normal behaviour.


The job should have ended in the middle of the day on Saturday, and now 
it's Tuesday.


I will test the latest RC after these issues have been clarified.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.0, RC5

2012-10-02 Thread Erlend Garåsen


Karl, you wrote: "I was able to reproduce the exception here using your 
URL.  It is indeed a bug in how it handles the 500 error."


OK, then I guess that the StringIndexOutOfBoundsException *was* related 
to the 500 server issue (you wrote earlier: "It is not clear at all that 
it is related to the 500 error you described before, but it could be.").

To clarify another thing: These three documents have been fetched over 
and over again every fourth second (for four days). I mention this in 
case we have another issue.


I'm just trying to clarify this before I deploy RC7 as I wrote.

Anyway, I will deploy RC7 now and start my job once more.

Erlend

On 02.10.12 11.03, Karl Wright wrote:

No stack trace needed.  If you read the rest of the mail, you will
note that I was able to reproduce the issue using the URL you had
provided.  There have been two RC's since; we are on RC7 now.

Karl


On Tue, Oct 2, 2012 at 4:38 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

On 28.09.12 13.31, Erlend Garåsen wrote:


OK, I will give you a stack trace in the beginning of next week.



Do you still need the stack trace? If you do, I need to adjust the log level
and/or change the source code in order to print it out.

I'm still a little bit worried about how MCF deals with 500 server errors
since the job I started last Friday is still running. It retries and retries
the three documents I previously mentioned.

Is it really normal behaviour that MCF retries the same document every
fourth second after the last attempt and continues to do this (perhaps)
a thousand times? MCF has probably been retrying these documents for four
days now. I doubt this is normal behaviour.

The job should have ended in the middle of the day on Saturday, and now it's
Tuesday.

I will test the latest RC after these issues have been clarified.

Erlend


--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050



--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.0, RC5

2012-09-28 Thread Erlend Garåsen


I'm trying to start a crawl before I have to run to the airport. I just 
discovered that MCF recrawls the same host over and over again when it 
returns result code 500:
09-28-2012 11:40:11.024  fetch  http://foreninger.uio.no/go/oslo_open_2012_no.php  500

It's not just this document, but several others returning the same HTTP 
result code.


Meanwhile, the following is filling up my log:
FATAL 2012-09-28 11:42:32,112 (Worker thread '29') - Error tossed: 
String index out of range: -1

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

I'm pretty sure they are related to each other.

I will end this job before I leave because I'm afraid that MCF will try 
to fetch these documents over and over again during this weekend.


Erlend

On 28.09.12 09.58, Karl Wright wrote:

Please vote +1 to release ManifoldCF 1.0, RC5.  The release artifact
can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.0

There is also an SVN tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.0-RC5

Fixes since RC4:

CONNECTORS-545

Fixes since RC3:

CONNECTORS-544




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.0, RC5

2012-09-28 Thread Erlend Garåsen


OK, I will give you a stack trace in the beginning of next week.

I will start the crawler once more and check the results when I'm back 
and change my vote then if it is ok.


Erlend

On 28.09.12 13.26, Karl Wright wrote:

Meanwhile, the following is filling up my log:
FATAL 2012-09-28 11:42:32,112 (Worker thread '29') - Error tossed:
String index out of range: -1
java.lang.StringIndexOutOfBoundsException: String index out of range: -1

This is indeed a problem I agree we should fix, but in order to do
that I need a stack trace.  It is not clear at all that it is related
to the 500 error you described before, but it could be.  I will create
a ticket for it though.
Karl

On Fri, Sep 28, 2012 at 5:49 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:


I'm trying to start a crawl before I have to run to the airport. I just
discovered that MCF recrawls the same host over and over again when it
returns result code 500:
09-28-2012 11:40:11.024 fetch
http://foreninger.uio.no/go/oslo_open_2012_no.php
 500

It's not just this document, but several others returning the same HTTP
result code.

Meanwhile, the following is filling up my log:
FATAL 2012-09-28 11:42:32,112 (Worker thread '29') - Error tossed: String
index out of range: -1
java.lang.StringIndexOutOfBoundsException: String index out of range: -1

I'm pretty sure they are related to each other.

I will end this job before I leave because I'm afraid that MCF will try to
fetch these documents over and over again during this weekend.

Erlend


On 28.09.12 09.58, Karl Wright wrote:


Please vote +1 to release ManifoldCF 1.0, RC5.  The release artifact
can be found at:

http://people.apache.org/~kwright/apache-manifoldcf-1.0

There is also an SVN tag at:

https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.0-RC5

Fixes since RC4:

CONNECTORS-545

Fixes since RC3:

CONNECTORS-544




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050





Re: [VOTE] Release Apache ManifoldCF 1.0, RC3

2012-09-27 Thread Erlend Garåsen


Resin restores the application because of a new feature in version 4 
related to its clustering functionality. Resin stores all versions of an 
application (war) in a local Git repository, which means it can restore 
an application after it has been deleted:

http://www.caucho.com/resin-4.0/admin/clustering-overview.xtp#DeployingApplicationstoaCluster

Since I'm going to Amsterdam tomorrow morning, I don't want to change 
anything in Resin right now in case I break something (other apps are 
running on the same server).


I tried to deploy the combined war on Tomcat, but then I couldn't 
connect to our PostgreSQL server because it seems to have a new SSL 
certificate I haven't installed into my local keystore.


I guess it is possible to configure HSQLDB, but I'm afraid that my time 
is running out. Sorry.


Erlend

On 26.09.12 18.00, Erlend Garåsen wrote:


Yes, I know, Karl, but I'm actually deleting the places where they are
unpacked. I will get back to this as soon as I have spoken to my
colleague. I will find out why tomorrow after our Solr meeting.

If I get a reply this evening, I will try to do a new test from home.

Erlend

On 26.09.12 17.55, Karl Wright wrote:

Usually application servers unpack the war somewhere.  Unless you
remove the place where it is unpacked you will continue to have the
applications even after the war is gone.

Karl

On Wed, Sep 26, 2012 at 11:52 AM, Erlend Garåsen
e.f.gara...@usit.uio.no wrote:


Hm, it seems that Resin manages to restore the three applications even
though I delete the three war files and the path where they are built.
This makes it a little bit more difficult to test.

I haven't restarted Resin, only the instance where MCF is running since
other applications are running on the same server. I have asked someone
with better server skills and Resin knowledge.

Erlend


On 26.09.12 15.02, Erlend Garåsen wrote:


On 26.09.12 14.39, Karl Wright wrote:


I didn't do documentation (or tests) because it is experimental at
this point.

It replaces ALL of manifoldcf in one war.  So it is exactly like the
single-process example (and would use the same properties.xml) but
deployable as a war.  Does this help?



Sure! I will test it within 24 hours, probably later today.

Erlend




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050








Re: [VOTE] Release Apache ManifoldCF 1.0, RC0

2012-09-24 Thread Erlend Garåsen

On 21.09.12 20.55, Karl Wright wrote:


I see what has happened here.  You unregistered the connectors before
you deleted the job.  That basically meant that the job cleanup can't
take place until the connector(s) it requires are registered again.


That's correct, and I also see the problem now. I forgot to install the 
Filesystem connector before I did the configuration import (normally we 
do not use this connector, but I installed one as a part of a test I did).


After I installed it, I no longer get an NPE.

Maybe our routines for upgrading MCF need to be changed. We want to be 
sure that these connectors do not need new fields in the database tables 
due to changed/new functions. Therefore I thought this was the safest 
approach: first we export the configuration, then we unregister all 
connectors by using the executecommand script, then we delete the tables 
by running an agents.Uninstall command, then we reinstall everything, 
and finally we import the configuration.
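
The upgrade routine described above can be sketched as an ordered list of executecommand.sh invocations. This is only an illustration: the fully qualified class names are assumptions expanded from the short names used in this thread, the file name mcf-config.zip is hypothetical, and the sketch prints the commands rather than running them. Re-registering the agent and the connectors (part of "reinstall everything") is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: lists the upgrade steps in order instead of executing them.
final class UpgradePlan {
    // Script location as it appears in the bash history quoted in this thread.
    static final String SCRIPT = "$MCF_HOME/processes/script/executecommand.sh";

    // Builds the command lines in the order described in the message.
    // All package prefixes below are assumptions, not verified class names.
    static List<String> steps(String configFile) {
        List<String> steps = new ArrayList<>();
        // 1. Export the current crawler configuration to a file.
        steps.add(SCRIPT + " org.apache.manifoldcf.crawler.ExportConfiguration " + configFile);
        // 2. Unregister all connectors.
        steps.add(SCRIPT + " org.apache.manifoldcf.crawler.UnRegisterAll");
        // 3. Remove the database tables.
        steps.add(SCRIPT + " org.apache.manifoldcf.agents.Uninstall");
        // 4. Recreate the tables (agent and connector registration omitted here).
        steps.add(SCRIPT + " org.apache.manifoldcf.agents.Install");
        // 5. Import the configuration exported in step 1.
        steps.add(SCRIPT + " org.apache.manifoldcf.crawler.ImportConfiguration " + configFile);
        return steps;
    }

    public static void main(String[] args) {
        for (String step : steps("mcf-config.zip")) {
            System.out.println(step);
        }
    }
}
```

Note that, per Karl's reply quoted above, a job that is to be deleted needs its connectors to still be registered, so any job deletion would have to happen before step 2.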


I still cannot delete my jobs, since their status is "cleaning up". 
Is the reason that I didn't delete my jobs prior to executing 
crawler.UnRegisterAll?


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: [VOTE] Release Apache ManifoldCF 1.0, RC0

2012-09-21 Thread Erlend Garåsen


According to my bash history, I did a LockClean prior to the upgrade:
solr-test02 mcf-1 $ history | grep LockClean
  154  $MCF_HOME/processes/script/executecommand.sh org.apache.manifoldcf.core.LockClean


I will create a ticket. Let's hope this is a local problem on my server 
and not a bug, but I think it's best to investigate it to be sure.


I will try to provide as much information as possible in the ticket.

Erlend

On 21.09.12 15.08, Karl Wright wrote:

Hmm, that's not good.

Can you open a ticket for this NPE, and also please attach the .java
file it is referring to: _simplereport__jsp.java?  It should be found
in resin's workarea somewhere.

As for the job not being deleted, can you supply further details of
your setup?  Specifically, properties.xml (so I can see what synch
settings you have), and what database, etc.

If there is nothing in the log, I'd shut down everything, execute the
lock-clean procedure, and start everything back up, and see if that
fixed the issue.

Karl


On Fri, Sep 21, 2012 at 9:01 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

On 21.09.12 14.47, Karl Wright wrote:


A temporary error should not block a (non running) job from getting
cleaned up.  The job can't be deleted until it is stopped, and no
outstanding documents are being worked on.

How many documents are still listed for the job in the UI?



Documents: 0
Active: 0
Processed: 0

Nothing in the simple history, but I get a 500 Server Exception if I try to
read the history of the file connector (due to an NPE); the stack trace is
further down in this post.

I can try to unregister all connectors and empty the database and try again.

[2012-09-21 14:57:10.220] {resin-port-127.0.0.1:6945-143}
java.lang.NullPointerException
 at _jsp._simplereport__jsp._jspService(_simplereport__jsp.java:367)
 at _jsp._simplereport__jsp._jspService(_simplereport__jsp.java:36)
 at com.caucho.jsp.JavaPage.service(JavaPage.java:64)
 at com.caucho.jsp.Page.pageservice(Page.java:542)
 at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:194)
 at com.caucho.server.webapp.DispatchFilterChain.doFilter(DispatchFilterChain.java:126)
 at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:289)
 at com.caucho.server.webapp.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:298)
 at com.caucho.server.webapp.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:116)
 at com.caucho.jsp.PageContextImpl.forward(PageContextImpl.java:1149)
 at _jsp._execute__jsp._jspService(_execute__jsp.java:1072)
 at _jsp._execute__jsp._jspService(_execute__jsp.java:36)
 at com.caucho.jsp.JavaPage.service(JavaPage.java:64)
 at com.caucho.jsp.Page.pageservice(Page.java:542)
 at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:194)
 at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:156)
 at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:289)
 at com.caucho.server.hmux.HmuxRequest.handleInvocation(HmuxRequest.java:468)
 at com.caucho.server.hmux.HmuxRequest.handleRequestImpl(HmuxRequest.java:369)
 at com.caucho.server.hmux.HmuxRequest.handleRequest(HmuxRequest.java:336)
 at com.caucho.network.listen.TcpSocketLink.dispatchRequest(TcpSocketLink.java:1301)
 at com.caucho.network.listen.TcpSocketLink.handleRequest(TcpSocketLink.java:1257)
 at com.caucho.network.listen.TcpSocketLink.handleRequestsImpl(TcpSocketLink.java:1241

Japanese translation needed for CONNECTORS-486

2012-09-20 Thread Erlend Garåsen


A Japanese translation is needed for CONNECTORS-486. I have just committed 
the English text (r1388020).


It seems that we do not have a Japanese translation of the "how to build and 
deploy" page, so I'm unsure whether we need to translate my changes at this time.


If we *do* have a translation for that page, here's what needs to be 
translated:


1. New row in the properties.xml table:
Yes, if file encryption is used

Specify the seed value to be used for encrypting the file to which the 
crawler configuration is exported.


2. (New section under the commands table):

Encrypting crawler configuration data

By adding a passcode as a second argument to the ExportConfiguration 
command class, the file will be encrypted using the AES algorithm. 
This can be useful to prevent repository passwords from being stored in 
clear text. To use this functionality, you must add a seed value to 
your configuration file. The same passcode, along with the seed value, is 
used to decrypt the file with the ImportConfiguration command class. See 
the documentation for the commands and properties above to find the 
correct arguments and settings.
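
To illustrate the idea of combining a passcode with a seed value for AES encryption, here is a minimal sketch using only the standard javax.crypto API. It is not the code committed for CONNECTORS-486: the class name ConfigCrypto, the use of PBKDF2 for key derivation, the iteration count, the 128-bit key size, and the CBC mode with a prepended IV are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: derives an AES key from a passcode and a seed (used as salt).
final class ConfigCrypto {
    private static final int IV_LENGTH = 16; // AES block size in bytes

    // Derive a 128-bit AES key; iteration count and key size are illustrative choices.
    static SecretKeySpec deriveKey(String passcode, String seed) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(passcode.toCharArray(),
                seed.getBytes(StandardCharsets.UTF_8), 10000, 128);
        byte[] keyBytes = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1")
                .generateSecret(spec).getEncoded();
        return new SecretKeySpec(keyBytes, "AES");
    }

    // Encrypts the data and prepends the random IV to the ciphertext.
    static byte[] encrypt(byte[] plain, String passcode, String seed) throws Exception {
        byte[] iv = new byte[IV_LENGTH];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, deriveKey(passcode, seed), new IvParameterSpec(iv));
        byte[] body = cipher.doFinal(plain);
        byte[] out = new byte[IV_LENGTH + body.length];
        System.arraycopy(iv, 0, out, 0, IV_LENGTH);
        System.arraycopy(body, 0, out, IV_LENGTH, body.length);
        return out;
    }

    // Reads the IV from the first 16 bytes and decrypts the remainder.
    static byte[] decrypt(byte[] data, String passcode, String seed) throws Exception {
        IvParameterSpec iv = new IvParameterSpec(data, 0, IV_LENGTH);
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, deriveKey(passcode, seed), iv);
        return cipher.doFinal(data, IV_LENGTH, data.length - IV_LENGTH);
    }

    public static void main(String[] args) throws Exception {
        byte[] secret = "repository password".getBytes(StandardCharsets.UTF_8);
        byte[] encrypted = encrypt(secret, "my-passcode", "my-seed");
        byte[] decrypted = decrypt(encrypted, "my-passcode", "my-seed");
        System.out.println(Arrays.equals(secret, decrypted));
    }
}
```

The point of the sketch is that decryption needs both values: the passcode given on the command line and the seed kept in the configuration file.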


Thanks,
Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: ant test - BUILD FAILED

2012-09-11 Thread Erlend Garåsen


You're right!

I have been so busy the last two weeks that I have barely read the 
latest emails. Sorry.


Erlend

On 11.09.12 15.06, Karl Wright wrote:

Hi Erlend,

I posted a warning about this a few days back.  You need to rerun ant
make-core-deps.

On Tue, Sep 11, 2012 at 9:02 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

BUILD FAILED
/Users/erlendfg/tmp/mcf_2012/build.xml:870: The following error occurred
while executing this line:
/Users/erlendfg/tmp/mcf_2012/connectors/sharepoint/build.xml:71:
/Users/erlendfg/tmp/mcf_2012/lib/sharepoint-2007 does not exist.

I encountered this just before I was going to commit my own contribution
regarding CONNECTORS-486.

Since my changes are not related to Sharepoint, I will commit my changes
anyway, but later in the evening.

BTW, I will try to promote ManifoldCF at Scandinavia's biggest developer
conference tomorrow and Thursday. Basically I'm responsible for promoting
Oslo Solr Community, but I have a lot of time to recommend ManifoldCF to
people who need a new open source search engine for their business.
http://jz12.java.no/



Thanks for doing this!

Karl


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050





Re: New committer: Ahmet Arslan

2012-09-05 Thread Erlend Garåsen


Welcome to the MCF community, Ahmet!

Erlend

On 28.08.12 23.01, Karl Wright wrote:

The Project Management Committee (PMC) for Apache ManifoldCF
has asked Ahmet Arslan to become a committer and we are pleased
to announce that he has accepted.

Ahmet brings significant skills and resources in the area of SharePoint
development to the project, and we look forward to his continuing
contribution in this area, and any other area he wishes to address.

Please join me in welcoming Ahmet to the community!

Karl




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Where do you MCF committers live?

2012-08-15 Thread Erlend Garåsen


Since I'm traveling a lot, I'm curious about where you committers live 
in the world. I have already met Karl in Boston in May this year, and I 
would like to meet other committers as well.


I live in Oslo, Norway's capital and largest city.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Missing jdbcpool in Maven repository

2012-06-27 Thread Erlend Garåsen


Thanks!

mvn-bootstrap.sh is not related to this problem, but I managed to build 
core with Maven after an svn up. I still have some problems building 
other parts of MCF, but I'm afraid that I have to get back to this issue 
tomorrow since I have to leave my office in five minutes.


Erlend

On 27.06.12 18.31, Karl Wright wrote:

Yup, the dependency is superfluous, and r1354618 removes it.

If you find any others, please let me know.

Karl

On Wed, Jun 27, 2012 at 12:29 PM, Karl Wright daddy...@gmail.com wrote:

Also, we no longer have a dependency on jdbcpool so I think that pom
can be modified to just remove it.  Let me check.

Karl


On Wed, Jun 27, 2012 at 12:28 PM, Karl Wright daddy...@gmail.com wrote:

You need to run the mvn-bootstrap.sh script first.

Karl

On Wed, Jun 27, 2012 at 11:09 AM, Erlend Garåsen
e.f.gara...@usit.uio.no wrote:


mvn eclipse:eclipse fails, probably because the following is not available
in any Maven repository (framework/core/pom.xml):

<dependency>
  <groupId>com.bitmechanic</groupId>
  <artifactId>jdbcpool</artifactId>
  <version>${jdbcpool.version}</version>
</dependency>

Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050








Re: Translation issues [was: Re: ManifoldCF 0.6 release]

2012-06-25 Thread Erlend Garåsen

On 25.06.12 03.25, Shinichiro Abe wrote:


5番目のタブはコミットタブです。これはコミットを動作を制御することができます。すべてのジョブの終了時にドキュメントをコミットするようデフォルトで有効になっています。また、ミリ秒単位で一定時間内に各ドキュメントをコミットすることができます(10秒以内にコミットなら1と登録します)。commit
 withinの挙動はManifoldCFでなくSolrに委ねられています。タブは以下のように表示されます:


Can you please place a link to:
http://wiki.apache.org/solr/CommitWithin
in your translation?

Here is the English version including a link:
The fifth tab is the Commits tab, which allows you to control the 
commit strategies. As well as committing documents at the end of every 
job, an option which is enabled by default, you may also commit each 
document within a certain time in milliseconds (e.g. 10000 for 
committing within 10 seconds). The <a 
href="http://wiki.apache.org/solr/CommitWithin">commit within</a> 
strategy will leave the responsibility to Solr instead of ManifoldCF. 
The tab looks like:
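
As a small illustration of the "commit within" setting, the sketch below converts a time in seconds to the millisecond value Solr expects and appends it as a commitWithin request parameter to an update URL. The class name, the helper methods, and the localhost URL are hypothetical; how ManifoldCF itself passes the value to Solr is not shown here.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper showing the seconds-to-milliseconds conversion for commitWithin.
final class CommitWithinExample {
    // Solr's commitWithin value is expressed in milliseconds.
    static long toMillis(long seconds) {
        return TimeUnit.SECONDS.toMillis(seconds);
    }

    // Appends commitWithin to an update URL, respecting an existing query string.
    static String updateUrl(String baseUpdateUrl, long commitWithinSeconds) {
        String separator = baseUpdateUrl.contains("?") ? "&" : "?";
        return baseUpdateUrl + separator + "commitWithin=" + toMillis(commitWithinSeconds);
    }

    public static void main(String[] args) {
        // Assumed local Solr instance; 10 seconds becomes commitWithin=10000.
        System.out.println(updateUrl("http://localhost:8983/solr/update", 10));
    }
}
```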


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050




Translation issues [was: Re: ManifoldCF 0.6 release]

2012-06-23 Thread Erlend Garåsen


The following needs translation:

You can for instance add update.chain=myChain to select the document 
processing pipeline/chain to use for processing documents in Solr.

Google Translation:
あなたは、たとえば、Solrのドキュメントを処理するために使用する文書処理 
パイプライン/チェーンを選択するupdate.chain= myChainを追加することができ 
ます。


The fourth tab is the Documents tab, which allows you to do document 
filtering based on size and mime types. By specifying a maximum document 
length in bytes, you can filter out documents which exceed that size 
(e.g. 10485760 which is equivalent to 10 MB). If you only want to add 
documents with specific mime types, you can enter them into the 
included mime types field (e.g. text/html for filtering out all 
documents but HTML). The excluded mime types field is for excluding 
documents with specific mime types (e.g. image/jpeg for filtering out 
JPEG images). The tab looks like:

===
4番目のタブは、ドキュメントのサイズやMIMEタイプに基づいてフィルタリング 
を行うことができますドキュメントタブです。バイト単位の最大ドキュメント 
の長さを指定することによって、あなたはそのサイズ(10 MBと同等です例えば 
10485760)を超えてドキュメントをフィルタリングすることができます。あなた 
が特定のMIMEタイプを使用してドキュメントを追加したい場合は、(すべての文 
書が、HTMLをフィルタリングするために必要な、例えば text / htmlの)が 
含まれてMIMEタイプフィールドにそれらを入力することができます。 MIMEタ 
イプの除外フィールドは、特定のMIMEタイプ(JPEG画像をフィルタリングする 
例: image / jpegに)を使って文書を除くためのものです。タブは以下のよ 
うに表示されます:
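
The filtering rules described for the Documents tab can be sketched roughly as follows. This is an illustration of the described behaviour, not the Solr connector's code: the class and method names are hypothetical, and treating an empty included list as "accept every mime type" is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter combining a maximum length with included/excluded mime types.
final class DocumentFilter {
    private final long maxLengthBytes;
    private final Set<String> includedMimeTypes;
    private final Set<String> excludedMimeTypes;

    DocumentFilter(long maxLengthBytes, Set<String> included, Set<String> excluded) {
        this.maxLengthBytes = maxLengthBytes;
        this.includedMimeTypes = included;
        this.excludedMimeTypes = excluded;
    }

    // Returns true if the document should be sent on for indexing.
    boolean accepts(long lengthBytes, String mimeType) {
        if (lengthBytes > maxLengthBytes) {
            return false; // exceeds the maximum document length
        }
        if (excludedMimeTypes.contains(mimeType)) {
            return false; // explicitly excluded
        }
        // Assumption: an empty included list means no restriction.
        return includedMimeTypes.isEmpty() || includedMimeTypes.contains(mimeType);
    }

    public static void main(String[] args) {
        DocumentFilter filter = new DocumentFilter(
                10485760L, // 10 MB
                new HashSet<>(Arrays.asList("text/html")),
                Collections.<String>emptySet());
        System.out.println(filter.accepts(2048, "text/html"));
        System.out.println(filter.accepts(2048, "image/jpeg"));
    }
}
```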


The fifth tab is the Commits tab, which allows you to control the 
commit strategies. As well as committing documents at the end of every 
job, an option which is enabled by default, you may also commit each 
document within a certain time in milliseconds (e.g. 10000 for 
committing within 10 seconds). The commit within strategy will leave the 
responsibility to Solr instead of ManifoldCF. The tab looks like:

===
5番目のタブでは、コミットの戦略を制御することができますコミットタブで 
す。同様にすべてのジョブの終了時にドキュメントをコミットするように、デ 
フォルトで有効になっているオプションは、また、ミリ秒単位で一定時間(10秒 
以内にコミット例えば1)内の各ドキュメントをコミットすることができ 
ます。戦略にコミットではなくManifoldCFのSolrに責任を残します。あなたは 
Solrの出力接続を作成するときに、5つのコンフィギュレーションタブが表示さ 
れます。


Comment: I need to place a link to:
http://wiki.apache.org/solr/CommitWithin
over the text: commit within (last sentence). I need help to place the 
link correctly for the Japanese translation.


I haven't committed my changes yet since the password for my Apache 
account has been changed. If I do not manage to fix my account by the 
beginning of next week, I will add a patch instead.


Karl: I'm not sure I explained the purpose of the included mime types 
field correctly. Please review and comment if necessary.


Thanks,
Erlend

On 22.06.12 17.03, Erlend Garåsen wrote:

On 22.06.12 15.08, Karl Wright wrote:

That's OK.  Any improvement welcome. ;-)


Shinichiro Abe: Could you help me translate the documentation? I have
made Japanese screenshots as well, but the content has changed slightly;
that is, I have removed certain sentences for the existing tabs (for the
Solr Output Connection).

I will mark the ticket with needs translation after I have committed
my changes (including my attempts to do the translations). Or I can
simply email the suggested sentences to the list.

I will get back to this issue on Sunday (or tomorrow if I get time).

Erlend




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050