Check whether any lines contain too many tab characters; you might
have to remove them manually. I think I fixed that upstream (by ignoring
such lines), but you may still run into the problem. For
creating the full English model, you need around 20-25GB of memory.
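A quick way to find such lines is to compare each row's column count against the first row's. This is only a sketch: it assumes every row of the TSV is supposed to have the same number of tab-separated columns, and the function name is made up for illustration.

```shell
# Report rows whose tab-separated column count differs from the first row's.
# Assumes a uniform TSV; pass a suspect file (e.g. sfAndTotalCounts).
check_columns() {
  awk -F'\t' 'NR == 1 { n = NF }
              NF != n { printf "line %d: %d columns (expected %d)\n", NR, NF, n }' "$1"
}
```

Running it on each raw data file before starting model creation should point at any rows worth removing by hand.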
Jo
On Fri, May 16, 2014 at 2:48 PM, Alex Olieman <[email protected]> wrote:
> Hey Jo,
>
> Thanks a lot! This must be the case; when using the same files on a
> different machine, I got a bit further.
>
> Now it still fails on parsing sfAndTotalCounts, but with a
> java.lang.ArrayIndexOutOfBoundsException (without a stack trace). Should I
> assume there is a line with only one column in the file? Perhaps I messed
> up the TSV format with my modifications. Is there any quoting or escape
> character I should know about when reading & writing the TSV files?
>
> Btw, how much RAM is needed to create the model from the files, in your
> experience? My modified files are about 60% the size of the originals.
>
> Best,
> Alex
>
>
> On 16-5-2014 14:22, Joachim Daiber wrote:
>
> Hey Alex,
>
> They should be UTF-8. I get the same error as you when not all of my shell
> locale variables are set. Check with "locale" whether they are all set, and
> export LC_ALL etc. with a UTF-8 value if they are not.
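In shell terms, Jo's suggestion amounts to something like the following (en_US.UTF-8 is only an example value; any installed UTF-8 locale will do):

```shell
# Inspect the current locale settings; LANG and the LC_* variables
# should all end in UTF-8.
locale

# If they do not, force a UTF-8 locale for this shell before running
# the model creation step (en_US.UTF-8 is an example value):
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```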
>
> Best,
> Jo
>
>
> On Fri, May 16, 2014 at 12:55 PM, Alex Olieman <[email protected]> wrote:
>
>> Hi,
>>
>> Lately I've been tinkering with the raw data to see if I can create a
>> model with a filtered set of disambiguation candidates, one that spots
>> more lowercase surface forms. When I train the model on the modified
>> data, however, I run into character encoding issues.
>>
>> For example:
>> > INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
>> > INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating
>> > SurfaceFormSource...
>> > INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading
>> > annotated and total counts...
>> > Exception in thread "main"
>> > java.nio.charset.UnmappableCharacterException: Input length = 1
>> > at java.nio.charset.CoderResult.throwException(Unknown Source)
>> > at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
>>
>> I had expected these files to be encoded in UTF-8, but it looks like
>> this isn't the case. The chardet library tells me it is ISO-8859-2
>> (a.k.a. Latin-2) instead. Can someone tell me which character encoding
>> the raw data (Pig output) files should be in for db.CreateSpotlightModel
>> to read them correctly? If this really should be one of the "Western"
>> character sets, I would expect Latin-1 instead.
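For what it's worth, a suspect file can be checked and, if necessary, converted with standard tools. This is a sketch only: chardet's ISO-8859-2 guess can be a false positive on mostly-ASCII data, and the sample file below is fabricated to stand in for a real data file.

```shell
# Create a small Latin-2 sample (0xE9 is "é" in ISO-8859-2) to stand in
# for a mis-encoded data file:
printf 'caf\xe9\tParis\n' > sample.tsv

# Ask `file` what encoding it guesses:
file -bi sample.tsv

# Convert the file to UTF-8, which is what the model creation step expects:
iconv -f ISO-8859-2 -t UTF-8 sample.tsv > sample.utf8.tsv
```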
>>
>> Best,
>> Alex
>>
>>
>>
>> ------------------------------------------------------------------------------
>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>> Instantly run your Selenium tests across 300+ browser/OS combos.
>> Get unparalleled scalability from the best Selenium testing platform
>> available
>> Simple to use. Nothing to install. Get started now for free."
>> http://p.sf.net/sfu/SauceLabs
>> _______________________________________________
>> Dbp-spotlight-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
>>
>
>
>