Hey there,
this is mainly because the indexing interfaces are single objects (the
*Source classes in [1]), which keeps the data sources more or less
interchangeable. As a consequence, all data from the Pig output is loaded
into memory first. It would be relatively simple to change the indexer so
that it does not load everything into memory; I just wasn't motivated to do
that back then, because realistically not that many people run the indexing
themselves, and you still would not get below the final model sizes (which
can be around 7-10 GB for English).
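
For illustration, a streaming variant of one of these sources could look
roughly like this (a minimal sketch with hypothetical names, not the actual
Spotlight API):

    import scala.io.Source

    // Instead of materializing all (surface form, count) pairs in memory,
    // expose them lazily as an Iterator so the indexer can consume the
    // Pig output one line at a time.
    def surfaceFormCounts(tsvFile: String): Iterator[(String, Int)] =
      Source.fromFile(tsvFile, "UTF-8").getLines().map { line =>
        val cols = line.split('\t')
        (cols(0), cols(1).toInt)
      }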
Best,
Jo
[1]
https://github.com/dbpedia-spotlight/dbpedia-spotlight/tree/master/index/src/main/scala/org/dbpedia/spotlight/db/io
On Fri, May 16, 2014 at 5:31 PM, Pablo N. Mendes <[email protected]> wrote:
>
> This is the kind of thing where the disk-backed models could be useful,
> no? It would take longer, but at indexing time it is OK to wait. A second
> step would then just translate them into the in-memory models.
> On May 16, 2014 7:12 AM, "Alex Olieman" <[email protected]> wrote:
>
>> A short update for future readers:
>> Most of my issues were punishment for modifying the raw data on Windows.
>>
>> - The charset issue can be avoided by running with "java ...
>> -Dfile.encoding=UTF-8 -cp ..."
>> - For the array index issue: it can be caused by extra tabs, but also by
>> line endings; they must not include carriage returns! (At least when
>> creating the model on Linux; Windows has no problem with them.) See the
>> sketch below.
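>>
>> Roughly, my normalization amounts to the following (a Scala sketch; the
>> file names are just placeholders):
>>
>>     import java.io.PrintWriter
>>     import scala.io.Source
>>
>>     // Re-write the file as UTF-8 with plain LF line endings, so that
>>     // model creation on Linux does not trip over Windows CRLF terminators.
>>     val out = new PrintWriter("sfAndTotalCounts.unix", "UTF-8")
>>     Source.fromFile("sfAndTotalCounts", "UTF-8").getLines()
>>       .foreach(line => out.print(line.stripSuffix("\r") + "\n"))
>>     out.close()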
>>
>> Now it's just a matter of finding enough memory to run this ;-)
>>
>> On 16-5-2014 15:02, Joachim Daiber wrote:
>>
>> Check whether there are lines with too many tab characters in them; you
>> might have to remove those manually. I think I fixed that upstream (by
>> ignoring such lines), but you might still run into the problem. For
>> creating the full English model, you need around 20-25 GB of memory.
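>>
>> If you want to clean the file up front, a filter along these lines should
>> work (a rough Scala sketch; the expected column count is an assumption,
>> adjust it per file):
>>
>>     import java.io.PrintWriter
>>     import scala.io.Source
>>
>>     // Keep only lines with the expected number of tab-separated columns
>>     // (3 is assumed here for sfAndTotalCounts; adjust for other files).
>>     val expectedCols = 3
>>     val out = new PrintWriter("sfAndTotalCounts.clean", "UTF-8")
>>     Source.fromFile("sfAndTotalCounts", "UTF-8").getLines()
>>       .filter(_.split("\t", -1).length == expectedCols)
>>       .foreach(line => out.println(line))
>>     out.close()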
>>
>> Jo
>>
>>
>> On Fri, May 16, 2014 at 2:48 PM, Alex Olieman <[email protected]> wrote:
>>
>>> Hey Jo,
>>>
>>> Thanks a lot! That must have been the problem; when using the same files
>>> on a different machine, I got a bit further.
>>>
>>> Now it still fails when parsing sfAndTotalCounts, but with a
>>> java.lang.ArrayIndexOutOfBoundsException (without a stack trace). Should
>>> I assume there is a line with only one column in the file? Perhaps I
>>> messed up the TSV format with my modifications. Is there any quoting or
>>> escape character I should know about when reading and writing the TSV
>>> files?
>>>
>>> Btw, how much RAM is needed to create the model from the files, in your
>>> experience? My modified files are about 60% of the size of the originals.
>>>
>>> Best,
>>> Alex
>>>
>>>
>>> On 16-5-2014 14:22, Joachim Daiber wrote:
>>>
>>> Hey Alex,
>>>
>>> they should be UTF-8. I get the same error as you when not all of my
>>> bash locale variables are set. Check with "locale" whether they are all
>>> set, and if they are not, export LC_ALL etc. with a UTF-8 locale.
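>>>
>>> Alternatively, you can make the reading side independent of the locale
>>> by opening the file with an explicit decoder, e.g. (a minimal sketch;
>>> the file name is a placeholder):
>>>
>>>     import java.io.{BufferedReader, FileInputStream, InputStreamReader}
>>>     import java.nio.charset.StandardCharsets
>>>
>>>     // An explicit UTF-8 decoder makes reading independent of the JVM
>>>     // default charset, which is derived from LC_ALL and friends.
>>>     val reader = new BufferedReader(new InputStreamReader(
>>>       new FileInputStream("sfAndTotalCounts"), StandardCharsets.UTF_8))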
>>>
>>> Best,
>>> Jo
>>>
>>>
>>> On Fri, May 16, 2014 at 12:55 PM, Alex Olieman <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Lately I've been tinkering with the raw data to see if I can create a
>>>> model with a filtered set of disambiguation candidates that also spots
>>>> more lowercase surface forms. When I train the model on the modified
>>>> data, however, I run into character encoding issues.
>>>>
>>>> For example:
>>>> > INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
>>>> > INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating
>>>> > SurfaceFormSource...
>>>> > INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading
>>>> > annotated and total counts...
>>>> > Exception in thread "main"
>>>> > java.nio.charset.UnmappableCharacterException: Input length = 1
>>>> > at java.nio.charset.CoderResult.throwException(Unknown Source)
>>>> > at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
>>>>
>>>> I had expected these files to be encoded in UTF-8, but it looks like
>>>> this isn't the case: the chardet library tells me it is ISO-8859-2
>>>> (a.k.a. Latin-2) instead. Can someone tell me which character encoding
>>>> the raw data (Pig output) files should be in for db.CreateSpotlightModel
>>>> to read them correctly? If it really should be one of the "Western"
>>>> character sets, I would expect Latin-1 instead.
>>>>
>>>> Best,
>>>> Alex
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users