A short update for future readers:
Most of my issues were punishment for modifying the raw data on Windows.
* The charset issue can be avoided by running "java ...
  -Dfile.encoding=UTF-8 -cp ..." (commands for both fixes are just below)
* For the array index issue: it could be tabs, but also line endings: they
  should not include carriage returns! (at least when creating the model on
  Linux; Windows has no problem with them)
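Concretely, both fixes look roughly like this on my end (the -Xmx value follows Jo's memory estimate further down; the jar path, the remaining arguments and the exact file names are placeholders for your own setup):

    # strip carriage returns from the raw TSV files before training (GNU sed)
    sed -i 's/\r$//' sfAndTotalCounts    # repeat for the other count files

    # force UTF-8 regardless of the platform default, and give the JVM enough heap
    java -Xmx25g -Dfile.encoding=UTF-8 -cp <spotlight-jar> db.CreateSpotlightModel <args ...>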
Now it's just a matter of finding enough memory to run this ;-)
On 16-5-2014 15:02, Joachim Daiber wrote:
Check if there are lines with too many tab characters in them; you might
have to remove them manually. I think I fixed that upstream (by ignoring
such lines), but you might still run into the problem. For creating the
full English model, you need around 20-25 GB of memory.
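Something along these lines should filter them out (I'm assuming the file is supposed to have exactly three tab-separated columns; adjust the column count to whatever your file actually uses):

    # keep only lines with the expected number of tab-separated columns
    awk -F'\t' 'NF == 3' sfAndTotalCounts > sfAndTotalCounts.filtered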
Jo
On Fri, May 16, 2014 at 2:48 PM, Alex Olieman <[email protected]> wrote:
Hey Jo,
Thanks a lot! This must be the case; when using the same files on
a different machine, I got a bit further.
Now it still fails while parsing sfAndTotalCounts, but with a
java.lang.ArrayIndexOutOfBoundsException (without a stack trace).
Should I assume there is a line with only one column in the file?
Perhaps I messed up the TSV format with my modifications. Is there
any quoting or escape character I should know about when reading &
writing the TSV files?
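For what it's worth, this is roughly how I'm checking for malformed lines (the expected column count of three is my guess; adjust it if that's wrong):

    # print the line numbers of any rows that don't have the expected column count
    awk -F'\t' 'NF != 3 {print NR": "$0}' sfAndTotalCounts | head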
Btw, how much RAM is needed, in your experience, to create the model from
the raw files? My modified files are about 60% of the size of the originals.
Best,
Alex
On 16-5-2014 14:22, Joachim Daiber wrote:
Hey Alex,
they should be UTF-8. I get the same error as you if not all of my bash
language variables are set. Check with "locale" whether they are all set,
and export LC_ALL etc. with a UTF-8 value if they are not.
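For example (en_US.UTF-8 is just the UTF-8 locale I happen to have installed; use whichever one your system provides):

    # inspect the current locale settings
    locale

    # switch this shell session to a UTF-8 locale before training
    export LANG=en_US.UTF-8
    export LC_ALL=en_US.UTF-8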
Best,
Jo
On Fri, May 16, 2014 at 12:55 PM, Alex Olieman <[email protected]> wrote:
Hi,
Lately I've been tinkering with the raw data to see if I can create a
model with a filtered set of disambiguation candidates, one that spots
more lowercase surface forms. When I train the model on the modified
data, however, I'm faced with character encoding issues.
For example:
> INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
> INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating SurfaceFormSource...
> INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading annotated and total counts...
> Exception in thread "main" java.nio.charset.UnmappableCharacterException: Input length = 1
>     at java.nio.charset.CoderResult.throwException(Unknown Source)
>     at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
I had expected these files to be encoded in UTF-8, but it looks like this
isn't the case. The chardet library tells me it is ISO-8859-2, a.k.a.
Latin-2, instead. Can someone tell me in which character encoding the raw
data (Pig output) files should be for db.CreateSpotlightModel to read them
correctly? If this really should be one of the "Western" character sets, I
would expect it to be Latin-1 instead.
Best,
Alex
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users