Build failed in Hudson: Nutch-Nightly #108

2007-06-05 Thread hudson
See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/108/

--
started
Checking out http://svn.apache.org/repos/asf/lucene/nutch/trunk
FATAL: null
java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:454)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:125)
    at java.nio.ByteBuffer.get(ByteBuffer.java:674)
    at org.tmatesoft.svn.core.internal.delta.SVNDeltaReader.deflate(SVNDeltaReader.java:159)
    at org.tmatesoft.svn.core.internal.delta.SVNDeltaReader.nextWindow(SVNDeltaReader.java:125)
    at org.tmatesoft.svn.core.internal.io.dav.handlers.BasicDAVDeltaHandler.characters(BasicDAVDeltaHandler.java:98)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.characters(AbstractSAXParser.java:570)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanContent(XMLDocumentFragmentScannerImpl.java:1062)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1649)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.readData(HTTPConnection.java:631)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.readData(HTTPConnection.java:594)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPRequest.dispatch(HTTPRequest.java:197)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:284)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:229)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:217)
    at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.doReport(DAVConnection.java:219)
    at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.doReport(DAVConnection.java:211)
    at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.update(DAVRepository.java:601)
    at org.tmatesoft.svn.core.wc.SVNUpdateClient.doUpdate(SVNUpdateClient.java:162)
    at org.tmatesoft.svn.core.wc.SVNUpdateClient.doCheckout(SVNUpdateClient.java:322)
    at hudson.scm.SubversionSCM$1.invoke(SubversionSCM.java:259)
    at hudson.scm.SubversionSCM$1.invoke(SubversionSCM.java:247)
    at hudson.FilePath.act(FilePath.java:226)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:247)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:225)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:281)
    at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:150)
    at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:116)
    at hudson.model.Run.run(Run.java:549)
    at hudson.model.Build.run(Build.java:99)
    at hudson.model.Executor.run(Executor.java:61)



Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Briggs

I have already 'fixed' the concurrency issue.  I did as Sami suggested
and added a ThreadLocal variable for the NGramProfile, and I am
currently testing it. This is another one that is difficult to set up
a test for: nobody has run into the problem before, because it is a
very quiet bug.

On 6/5/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> On 6/5/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> >
> > I just saw this on api and assumed it had to do with detecting the
> > language, I might be wrong:
> >
> > http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html#getLanguage()
>
> I think that method is used to get detected charset's ISO code. Like,
> it returns "tr" for ISO-8859-9.
>
> That being said, language identification is a very crucial feature and
> if it doesn't work properly, well, someone should do something about
> it :).
>
> > --
> >  Sami Siren
>
> --
> Doğacan Güney




--
"Conscious decisions by conscious minds are what make reality real"


[jira] Updated: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Sami Siren (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated NUTCH-496:
-

Attachment: nutch-496.txt

This patch changes LanguageIdentifier to use one NGramProfile per thread
instead of a single shared one.
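
The per-thread approach described above can be sketched as follows. This is only a minimal illustration, not the contents of nutch-496.txt: the class PerThreadProfileSketch, its nested Profile class (a stand-in for NGramProfile), and the helper methods are all hypothetical names. It also uses ThreadLocal.withInitial, which requires Java 8 or later; on older JVMs the same effect comes from subclassing ThreadLocal and overriding initialValue().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PerThreadProfileSketch {

    // Hypothetical stand-in for NGramProfile: mutable and not thread-safe.
    static class Profile {
        private final List<String> ngrams = new ArrayList<>();

        void add(String ngram) {
            ngrams.add(ngram);
        }

        List<String> getSorted() {
            Collections.sort(ngrams); // sorts the profile's own list in place
            return ngrams;
        }
    }

    // One Profile per thread, created lazily on first access from that thread.
    private static final ThreadLocal<Profile> PROFILE =
            ThreadLocal.withInitial(Profile::new);

    static Profile current() {
        return PROFILE.get();
    }

    // Helper for the demo: returns the Profile that a fresh thread sees.
    static Profile profileOnNewThread() {
        Profile[] seen = new Profile[1];
        Thread worker = new Thread(() -> seen[0] = current());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        Profile mine = current();
        System.out.println(mine == current());            // same thread: same instance
        System.out.println(mine != profileOnNewThread()); // other thread: own instance
    }
}
```

Because no Profile instance is ever visible to more than one thread, the unsynchronized list inside it is never iterated and modified concurrently. The trade-off is one profile's worth of memory per worker thread.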

> ConcurrentModificationException can be thrown when getSorted() is called.
> -
>
>                 Key: NUTCH-496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-496
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: Nutch application, during fetch.
>            Reporter: Briggs
>         Attachments: language_analyzer_ngram.patch, nutch-496.txt
>
>
> NGramProfile (in the org.apache.nutch.analysis.lang package) is not
> thread-safe: a ConcurrentModificationException can occur if, while one
> thread is iterating over the List returned by getSorted(), another
> thread calls getSorted().
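
The failure mode in the description can be reproduced deterministically with a small sketch. The real NGramProfile.getSorted() source is not quoted in this thread, so SharedProfile below is a hypothetical class that simply assumes getSorted() rebuilds one shared list on every call. The interleaving of two threads is simulated on a single thread so the outcome does not depend on timing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class SharedSortedListSketch {

    // Hypothetical profile that keeps its sorted view in one shared, mutable list.
    static class SharedProfile {
        private final List<String> entries = new ArrayList<>();
        private final List<String> sorted = new ArrayList<>();

        void add(String entry) {
            entries.add(entry);
        }

        // Assumption: every call rebuilds the shared list, which is a
        // structural modification that invalidates any live iterator.
        List<String> getSorted() {
            sorted.clear();
            sorted.addAll(entries);
            Collections.sort(sorted);
            return sorted;
        }
    }

    // Returns true if the simulated interleaving ends in a
    // ConcurrentModificationException.
    static boolean interleavedCallFails() {
        SharedProfile profile = new SharedProfile();
        profile.add("th");
        profile.add("he");
        profile.add("an");

        Iterator<String> it = profile.getSorted().iterator(); // "thread A" starts iterating
        it.next();
        profile.getSorted();                                  // "thread B" calls getSorted()
        try {
            it.next();                                        // "thread A" resumes
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(interleavedCallFails()); // prints true
    }
}
```

With real threads the same exception appears only when the second call happens to land between two next() calls of the first thread, which is consistent with the bug being hard to trigger on demand.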

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Doğacan Güney

On 6/5/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> I just saw this on api and assumed it had to do with detecting the
> language, I might be wrong:
>
> http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html#getLanguage()

I think that method is used to get detected charset's ISO code. Like,
it returns "tr" for ISO-8859-9.

That being said, language identification is a very crucial feature and
if it doesn't work properly, well, someone should do something about
it :).

> --
>  Sami Siren

--
Doğacan Güney


Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Sami Siren

2007/6/5, Doğacan Güney <[EMAIL PROTECTED]>:
> Can you give a few links? I have looked at icu4j's API, but I haven't
> found any info about language identification.

I just saw this on api and assumed it had to do with detecting the
language, I might be wrong:

http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html#getLanguage()

--
Sami Siren


Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Doğacan Güney

On 6/4/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Briggs wrote:
> > Yeah, you are correct there.  How does this thing actually even
> > remotely begin to work on a predictable level?
>
> One crucial aspect of language identification is that the input is
> properly encoded. There was a patch that added icu4j character set
> encoding detection into Nutch. I believe icu4j also offers language
> identification in addition to character set detection. Has anyone
> checked how usable the language identification from icu4j would be?
>
> There are severe problems with the current language identification
> for CJK, for example.

Can you give a few links? I have looked at icu4j's API, but I haven't
found any info about language identification.

IBM does have something called Linguini
(http://www-306.ibm.com/software/globalization/topics/linguini/index.jsp).
It doesn't seem to be open source, though.

> --
>  Sami Siren

--
Doğacan Güney