Hi Rupert,

Thanks a lot for the reply.

In the meantime, I added the following line to namespaceprefix.mappings,
and it does seem to change things a bit.

<space> http://rdf.freebase.com/ns/

After the above change, I kicked off the run again, and I am now seeing the
following lines at the end of the output. I also see some data in the tdb
directory.

Is it doing the right thing or should I re-start the process per your
suggestion?

Thanks in advance,
Rajan
..................................................Log.....................................
Batch: 57,045 slots/s / Avg: 89,972 slots/s)
07:23:27,329 [Thread-3] INFO  jenatdb.RdfResourceImporter - Index SPO->POS:
356,000,000 slots (Batch: 43,782 slots/s / Avg: 89,945 slots/s)
07:23:27,329 [Thread-3] INFO  jenatdb.RdfResourceImporter -   Elapsed:
83,935.45 seconds [2015/05/23 07:23:27 EDT]
07:23:28,485 [Thread-3] INFO  jenatdb.RdfResourceImporter - Index SPO->POS:
356,100,000 slots (Batch: 86,505 slots/s / Avg: 89,944 slots/s)
07:23:31,334 [Thread-3] INFO  jenatdb.RdfResourceImporter - Index SPO->POS:
356,200,000 slots (Batch: 35,100 slots/s / Avg: 89,904 slots/s)
07:23:32,966 [Thread-3] INFO  jenatdb.RdfResourceImporter - Index SPO->POS:
356,300,000 slots (Batch: 61,274 slots/s / Avg: 89,893 slots/s)
07:23:34,348 [Thread-3] INFO  jenatdb.RdfResourceImporter - Index SPO->POS:
356,400,000 slots (Batch: 72,358 slots/s / Avg: 89,887 slots/s)
07:23:36,850 [Thread-3] INFO  jenatdb.R
................................................................................................................
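
As a rough way to read progress lines like the ones above, the sketch below extracts the slot count and rates from the wrapped log text and estimates the time left in the current index pass. It is only an illustration: the regular expression is fitted to the lines shown here, `parse_progress` and `eta_seconds` are hypothetical helper names, and the total slot count is an assumption that has to be supplied by the reader (the loader does not print it in these lines).

```python
import re

# Matches progress lines such as
#   "Index SPO->POS: 356,000,000 slots (Batch: 43,782 slots/s / Avg: 89,945 slots/s)"
# \s+ is used between tokens so that mail-client line wraps do not break the match.
PROGRESS = re.compile(
    r"Index\s+(\w+)->(\w+):\s+([\d,]+)\s+slots\s+"
    r"\(Batch:\s+([\d,]+)\s+slots/s\s+/\s+Avg:\s+([\d,]+)\s+slots/s\)"
)


def to_int(text):
    """Convert a number with thousands separators, e.g. '356,000,000', to int."""
    return int(text.replace(",", ""))


def parse_progress(log_text):
    """Return one dict per progress line found in log_text, in order."""
    return [
        {
            "index": f"{m.group(1)}->{m.group(2)}",
            "slots": to_int(m.group(3)),
            "batch_rate": to_int(m.group(4)),
            "avg_rate": to_int(m.group(5)),
        }
        for m in PROGRESS.finditer(log_text)
    ]


def eta_seconds(slots_done, slots_total, avg_rate):
    """Estimate remaining seconds for one index pass.

    slots_total is an assumption supplied by the caller, not a logged value.
    """
    remaining = max(slots_total - slots_done, 0)
    return remaining / avg_rate


# Two wrapped lines copied from the log excerpt above.
SAMPLE = """07:23:27,329 [Thread-3] INFO  jenatdb.RdfResourceImporter - Index SPO->POS:
356,000,000 slots (Batch: 43,782 slots/s / Avg: 89,945 slots/s)
07:23:34,348 [Thread-3] INFO  jenatdb.RdfResourceImporter - Index SPO->POS:
356,400,000 slots (Batch: 72,358 slots/s / Avg: 89,887 slots/s)
"""

entries = parse_progress(SAMPLE)
```

If the triple count reported by the earlier run (570,859,352) also holds for this run, which is an assumption, `eta_seconds(entries[-1]["slots"], 570_859_352, entries[-1]["avg_rate"])` gives a ballpark for the rest of the SPO->POS pass.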

---------------- files in tdb directory .....................

-rw-r--r--   1      8388608 May 22 07:49 GOSP.dat
-rw-r--r--   1      8388608 May 22 07:49 GOSP.idn
-rw-r--r--   1      8388608 May 22 07:49 GPOS.dat
-rw-r--r--   1      8388608 May 22 07:49 GPOS.idn
-rw-r--r--   1      8388608 May 22 07:49 GSPO.dat
-rw-r--r--   1      8388608 May 22 07:49 GSPO.idn
-rw-r--r--   1      8388608 May 22 07:49 OSP.dat
-rw-r--r--   1      8388608 May 22 07:49 OSP.idn
-rw-r--r--   1      8388608 May 22 07:49 OSPG.dat
-rw-r--r--   1      8388608 May 22 07:49 OSPG.idn
-rw-r--r--   1  16399728640 May 23 07:24 POS.dat
-rw-r--r--   1    117440512 May 23 07:24 POS.idn
-rw-r--r--   1      8388608 May 22 07:49 POSG.dat
-rw-r--r--   1      8388608 May 22 07:49 POSG.idn
-rw-r--r--      25954353152 May 23 07:24 SPO.dat
-rw-r--r--        192937984 May 23 07:17 SPO.idn
-rw-r--r--   1      8388608 May 22 07:49 SPOG.dat
-rw-r--r--   1      8388608 May 22 07:49 SPOG.idn
-rw-r--r--       8808038400 May 23 07:15 node2id.dat
-rw-r--r--   1     50331648 May 23 06:50 node2id.idn
-rw-r--r--   1  18672429231 May 23 06:17 nodes.dat
-rw-r--r--   1      8388608 May 22 07:49 prefix2id.dat
-rw-r--r--   1      8388608 May 22 07:49 prefix2id.idn
-rw-r--r--   1      8388608 May 22 07:49 prefixIdx.dat
-rw-r--r--   1      8388608 May 22 07:49 prefixIdx.idn
-rw-r--r--   1            0 May 22 07:49 prefixes.dat
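
For reference, the id-namespace suggestion quoted below boils down to prepending the namespace to the local name found in each line of incoming_links.txt. The following is a minimal sketch of that idea, not the indexing tool's actual code: `entity_uri` is a hypothetical helper, and the parsing (trim, then split on whitespace) is an assumption based on the comments in iditerator.properties.

```python
# Namespace value suggested in the thread for 'id-namespace'.
ID_NAMESPACE = "http://rdf.freebase.com/ns/"


def entity_uri(line, id_namespace=ID_NAMESPACE):
    """Turn one 'score local-name' line into (score, full entity URI).

    Leading spaces are trimmed first, mirroring the 'trimLine' setting.
    """
    parts = line.strip().split()
    if len(parts) != 2:
        raise ValueError(f"expected '<score> <id>', got: {line!r}")
    score, local_name = parts
    return int(score), id_namespace + local_name


# First line of incoming_links.txt as quoted in the thread.
score, uri = entity_uri("10888430 m.0kpv11")
```

With the namespace above, the first line of the file maps to 'http://rdf.freebase.com/ns/m.0kpv11', which is the URI form the indexer is expected to look up in the triple store.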

On Sat, May 23, 2015 at 12:12 AM, Rupert Westenthaler <
rupert.westentha...@gmail.com> wrote:

> Hi
>
> you need to enable 'id-namespace' in the iditerator.properties file
> and set the value to 'http://rdf.freebase.com/ns/' (the same value as
> defined by http://prefix.cc/fb)
>
> This will ensure that the indexing tool is looking for the correct
> Entity URIs (e.g. 'http://rdf.freebase.com/ns/m.0kpv11' for '10888430
> m.0kpv11' the first line in the incoming_links.txt file)
>
> best
> Rupert
>
>
> On Fri, May 22, 2015 at 3:43 PM, Rajan Shah <raja...@gmail.com> wrote:
> > Hi Rupert,
> >
> > Thanks for the quick turnaround, I really appreciate your prompt response.
> >
> > Please find the requested details included at the end.
> >
> > Thanks in advance,
> > Rajan
> >
> > *a. iditerator.properties*
> >
> > #NOTES:
> > # Lines in this file start with spaces in cases the score is lower than one
> > # million. because of that we need to trim leading spaces
> > trimLine
> > # after trimming the lines the
> > #  -> first position is always an empty string
> > #  -> score should be at the first position
> > score-pos=1
> > #  -> second position should be the local name of the entity
> > id-pos=2
> > #the file needs to be in the source (default="/indexing/resource") folder!
> > source=incoming_links.txt
> > encodeIds=false
> > charset=UTF-8
> > # set the separator to ' '
> > separator=
> > # and URLdecode the IDs
> > decodeIds=false
> >
> > # freebase uses namespace prefixes for IDs, because of that we do not need
> > # the id-namespace parameter. NOTE that the 'ns' prefix need to be set to
> > # http://www.
> > #id-namespace=http://freebase.com/
> > ns-prefix-state=false
> >
> > *b. incoming_links.txt*
> >
> > Some lines are as follows:
> >
> > 10888430 m.0kpv11
> > 3741261 m.019h
> > 2667858 m.0775xx5
> > 2667804 m.0775xvm
> > 1875352 m.01xryvm
> > 1739262 m.05zppz
> > 1369590 m.01xrzlb
> > 1336481 m.0g4g
> > 1202333 m.04l
> > 1093642 m.01xryw5
> > 1079153 m.09gn
> > 1070544 m.0kpv17
> > 1066210 m.09c7w0
> > 925879 m.01x32j1
> > 922312 m.0jst35z
> > 921239 m.08x8
> > 864526 m.02nsjl9
> > 832558 m.01xlj26
> > 769191 m.02lx2r
> > 736892 m.04m8
> >
> > On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <
> > rupert.westentha...@gmail.com> wrote:
> >
> >> Hi Rajan,
> >>
> >> > *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >> > impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
> >>
> >> You have not indexed a single entity, so something in your indexing
> >> configuration is wrong. Most likely you are not correctly building the
> >> URIs of the entities from the incoming_links.txt file. Can you provide
> >> me an example line of the 'incoming_links.txt' file and the contents
> >> of the 'iditerator.properties' file? Those specify how Entity URIs are
> >> built.
> >>
> >> Short answers to the other questions
> >>
> >>
> >> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> > it ran for almost 3 days and generated index.
> >>
> >> That's good. It means you now have the Freebase dump in your Jena
> >> TDB triple store. You will not need to repeat this (until you want to
> >> use a newer dump). On the next call to the indexing tool it will
> >> immediately start with the indexing step.
> >>
> >>
> >> >
> >> > Couple questions come to mind:
> >> >
> >> > a. Is there any particular log/error file the process generates besides
> >> > printing out on stdout/stderr?
> >>
> >> The indexer writes a zip archive with the IDs of all the indexed
> >> entities. It's in the indexing/destination folder.
> >>
> >> > b. Must the Stanbol full launcher be running the whole time
> >> > while indexing is going on?
> >>
> >> No Stanbol instance is needed by the indexing process.
> >>
> >> > c. Could it cause issues if the machine is not connected to the
> >> > internet for a couple of minutes?
> >>
> >> No Internet connectivity is needed during indexing. Only if you want
> >> to use the namespace prefix mappings of prefix.cc you need to have
> >> internet connectivity when starting the indexing tool.
> >>
> >> best
> >> Rupert
> >>
> >> >
> >> > I would really appreciate it if you could shed some light on "what
> >> > could be wrong" or a "potential approach to nail down this issue". If
> >> > you need, I am happy to share any additional logs/properties.
> >> >
> >> > With best regards,
> >> > Rajan
> >> >
> >> > *1. Configuration changes*
> >> >
> >> > a. set ns-prefix-state=false*
> >> > [within /indexing/config/iditerator.properties]*
> >> > b. add empty space mapping to   http://rdf.freebase.com/ns/*
> >> > [within namespaceprefix.mappings]*
> >> > c. enable bunch of properties within mappings.txt such as following
> >> >
> >> > fb:music.artist.genre
> >> > fb:music.artist.label
> >> > fb:music.artist.album
> >> >
> >> > *2. Contents of indexing/dist directory*
> >> >
> >> > -rw-r--r--  108899 May 22 05:11 freebase.solrindex.zip
> >> > -rw-r--r--  3457 May 22 05:11
> >> > org.apache.stanbol.data.site.freebase-1.0.0.jar
> >> >
> >> > *3. Contents of /tmp/freebase/indexing/resources/imported directory*
> >> >
> >> > -rw-r--r--  1 31026810858 May 20 07:32 freebase.nt.gz
> >> >
> >> > *4. Contents of /tmp/freebase/indexing/resources directory*
> >> >
> >> > -rw-r--r--   1 1206745360 May 19 09:38 incoming_links.txt
> >> >
> >> > *5. The indexer log*
> >> >
> >> > *04:31:57,236 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add:
> >> > 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
> >> > *04:32:00,727 [Thread-3] INFO  jenatdb.RdfResourceImporter - Filtered:
> >> > 2429800000 triples (80.97554853864854%)*
> >> > *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish
> >> > triples data phase*
> >> > *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - ** Data:
> >> > 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per
> >> > second]*
> >> > *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Start
> >> > triples index phase*
> >> > *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish
> >> > triples index phase*
> >> > *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish
> >> > triples load*
> >> > *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - **
> >> Completed:
> >> > 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per
> >> > second]*
> >> > 04:32:56,880 [Thread-3] INFO  source.ResourceLoader -    ... moving
> >> > imported file freebase.nt.gz to imported/freebase.nt.gz
> >> > 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -    - completed
> in
> >> > 157675 seconds
> >> > 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -  > loading
> >> > '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
> >> > 04:32:56,944 [Thread-3] WARN  jenatdb.RdfResourceImporter - ignore
> File
> >> {}
> >> > because of unknown extension
> >> > 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -    - completed
> in 0
> >> > seconds
> >> > 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 2 files
> >> imported
> >> > in 157675 seconds
> >> > 04:32:56,958 [Thread-3] INFO  source.ResourceLoader - Loding 0 File
> ...
> >> > 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 0 files
> >> imported
> >> > in 0 seconds
> >> > 04:32:56,971 [main] INFO  impl.IndexerImpl -  ... delete existing
> >> > IndexedEntityId file
> >> > /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
> >> > 04:32:56,982 [main] INFO  impl.IndexerImpl - Initialisation completed
> >> > 04:32:56,982 [main] INFO  impl.IndexerImpl -   ... initialisation
> >> completed
> >> > 04:32:56,982 [main] INFO  impl.IndexerImpl - start indexing ...
> >> > 04:32:56,982 [main] INFO  impl.IndexerImpl - Indexing started ...
> >> >
> >> >
> >> >
> >> > 04:45:48,075 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'nsogi' valid , namespace '
> >> > http://prefix.cc/nsogi:' invalid -> mapping ignored!
> >> > 04:45:48,076 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'category' valid , namespace '
> >> > http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> > 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'chebi' valid , namespace '
> >> > http://bio2rdf.org/chebi:' invalid -> mapping ignored!
> >> > 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'hgnc' valid , namespace '
> >> > http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
> >> > 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace '
> >> > http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >> > 04:45:48,077 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'dbc' valid , namespace '
> >> > http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'pubmed' valid , namespace '
> >> > http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'dbt' valid , namespace '
> >> > http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'dbrc' valid , namespace '
> >> > http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'call' valid , namespace '
> >> > http://webofcode.org/wfn/call:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'dbcat' valid , namespace '
> >> > http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> > 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace '
> >> > http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping
> ignored!
> >> > 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'bgcat' valid , namespace '
> >> > http://bg.dbpedia.org/resource/Категория:' invalid -> mapping
> ignored!
> >> > 04:45:48,084 [pool-1-thread-1] WARN  impl.NamespacePrefixProviderImpl
> -
> >> > Invalid Namespace Mapping: prefix 'condition' valid , namespace '
> >> > http://www.kinjal.com/condition:' invalid -> mapping ignored!
> >> > 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO
> >> impl.IndexerImpl
> >> > - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
> >> > 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
> >> impl.IndexerImpl
> >> > -  > current sequence : 0
> >> > 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
> >> impl.IndexerImpl
> >> > -  > new sequence: 1
> >> > 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
> >> impl.IndexerImpl
> >> > - Send end-of-queue to Deamons with Sequence 1
> >> > 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
> impl.IndexerImpl -
> >> > Indexing: Entity Processor Deamon completed (sequence=1) ...
> >> > 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
> impl.IndexerImpl -
> >> >  > current sequence : 1
> >> > 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
> impl.IndexerImpl -
> >> >  > new sequence: 2
> >> > 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
> impl.IndexerImpl -
> >> > Send end-of-queue to Deamons with Sequence 2
> >> > 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
> >> impl.IndexerImpl -
> >> > Indexing: Entity Perstisting Deamon completed (sequence=2) ...
> >> > 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
> >> impl.IndexerImpl -
> >> >  > current sequence : 2
> >> > 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
> >> impl.IndexerImpl -
> >> >  > new sequence: 3
> >> > 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
> >> impl.IndexerImpl -
> >> > Send end-of-queue to Deamons with Sequence 3
> >> > *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >> >  impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
> >> > processing:  -1.000ms/item | queue:  -1.000ms*
> >> > 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >> >  impl.IndexerImpl -   - source   :  -1.000ms/item
> >> > 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >> >  impl.IndexerImpl -   - processing:  -1.000ms/item
> >> > 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >> >  impl.IndexerImpl -   - store     :  -1.000ms/item
> >> > 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
> >> >  impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed
> >> > (sequence=3) ...
> >> > 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
> >> >  impl.IndexerImpl -  > current sequence : 3
> >> > 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
> >> >  impl.IndexerImpl -  > new sequence: 4
> >> > 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
> >> >  impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
> >> > 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO
> >> impl.IndexerImpl
> >> > - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
> >> > 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO
> >> impl.IndexerImpl
> >> > -  > current sequence : 4
> >> > 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... indexing completed
> >> > 05:11:41,910 [main] INFO  impl.IndexerImpl - start post-processing ...
> >> > 05:11:41,910 [main] INFO  impl.IndexerImpl - PostProcessing started
> ...
> >> > 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... post-processing
> >> finished
> >> > ...
> >> > 05:11:41,911 [main] INFO  impl.IndexerImpl - start finalisation....
> >> >
> >> >
> >> >
> >> > On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <
> >> > rupert.westentha...@gmail.com> wrote:
> >> >
> >> >> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com>
> wrote:
> >> >> > Hi Rupert and Antonio,
> >> >> >
> >> >> > Thanks a lot for the reply.
> >> >> >
> >> >> > I started to follow Rupert's suggestion; however, it failed again at
> >> >> >
> >> >> > 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88]
> >> >> > illegal escape sequence value: $ (0x24)
> >> >> >
> >> >> > Is there any way it can be resolved for the entire file?
> >> >> >
> >> >>
> >> >> The indexing tool uses Apache Jena, and those are Jena parsing errors.
> >> >> So the Jena mailing lists would be the better place to look for
> >> >> answers.
> >> >> This specific issue looks like an invalid URI that is not fixed by the
> >> >> fixit script.
> >> >>
> >> >>
> >> >> > I requested an access to latest BaseKB bucket, as it doesn't seem
> to
> >> be
> >> >> > open.
> >> >> >
> >> >> > s3cmd ls s3://basekb-now/2015-04-15-18-54/
> >> >> >  --add-header="x-amz-request-payer: requester"
> >> >> > ERROR: Access to bucket 'basekb-now' was denied
> >> >> >
> >> >> >
> >> >> > *Couple additional questions:*
> >> >> >
> >> >> > *1. indexing enhancements:*
> >> >> > What settings/properties one can tweak to gain most out of the
> >> indexing.
> >> >> >
> >> >>
> >> >> In general you only want the information needed for your application
> >> >> case in the index.
> >> >> For EntityLinking only labels and type are required.
> >> >> Additional properties will only be used for dereferencing Entities. So
> >> >> this will depend on your application needs (your dereferencing
> >> >> configuration).
> >> >>
> >> >> In general I try to exclude as much information as possible from the
> >> >> index to keep the size of the Solr Index as small as possible.
> >> >>
> >> >> > a. for ex. domain specific such as Pharmaceutical, Law etc...
> within
> >> >> > freebase
> >> >> > b. potential optimizations to speed up the overall indexing
> >> >>
> >> >> Most of the time will be needed to load the Freebase dump into Jena
> >> >> TDB. Even with an SSD equipped Server this will take several days.
> >> >> Assigning more RAM will speed up this process as Jena TDB can cache
> >> >> more things in RAM.
> >> >>
> >> >> Usually it is a good idea to cancel the indexing process after the
> >> >> importing of the RDF data has finished (and the indexing of the
> >> >> Entities has started). This is because after importing, all the RAM
> >> >> will be used by Jena TDB for caching stuff that is no longer needed in
> >> >> the read-only operations during indexing. So a fresh start can speed up
> >> >> the indexing part of the process.
> >> >>
> >> >> Also have a look at the Freebase Indexing Tool Readme
> >> >>
> >> >> >
> >> >> > *2. demo:*
> >> >> > I see that, in recent github commit(s) the eHealth and other demos
> >> have
> >> >> > been commented out. How can I get demo source code and other
> >> components
> >> >> for
> >> >> > these demos. I prefer to build it myself to see the power of
> stanbol.
> >> >> >
> >> >>
> >> >> The eHealth demo is still in the 0.12 branch [1]. This is fully
> >> >> compatible with the trunk version.
> >> >>
> >> >> > *3. custom vocabulary:*
> >> >> > Suppose, I have custom vocabulary in CSV format. Is there a
> preferred
> >> way
> >> >> > to upload it to Stanbol and have it recognize my entities?
> >> >>
> >> >> Google Refine[2] with the RDF extension [3]. You can also try to use
> >> >> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version
> >> >> but AFAIK this combination is not so stable and might not work at
> all.
> >> >>
> >> >> * Google Refine allows you to import your CSV file.
> >> >> * Clean it up (if necessary)
> >> >> * The RDF extension allows you to map your CSV data to RDF
> >> >> * based on this mapping you can save your data as RDF
> >> >> * after that you can import the RDF data to Apache Stanbol
> >> >>
> >> >> hope this helps
> >> >> best
> >> >> Rupert
> >> >>
> >> >> >
> >> >> > Thanks in advance,
> >> >> > Rajan
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> [1]
> >> >>
> >>
> http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
> >> >> [2] https://code.google.com/p/google-refine/
> >> >> [3] http://refine.deri.ie/
> >> >> [4] http://openrefine.org/
> >> >>
> >> >> > On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <
> >> >> > rupert.westentha...@gmail.com> wrote:
> >> >> >
> >> >> >> Hi Rajan,
> >> >> >>
> >> >> >> I think this is because you named your file
> >> >> >> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF
> >> >> >> format is not provided by the file extension. Renaming the file to
> >> >> >> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
> >> >> >>
> >> >> >> The suggestion of Antonio to use BaseKB is also a valid option.
> >> >> >>
> >> >> >> best
> >> >> >> Rupert
> >> >> >>
> >> >> >> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales
> >> >> >> <ape...@zaizi.com> wrote:
> >> >> >> > Hi Rajan
> >> >> >> >
> >> >> >> > The Freebase dump contains some things that do not fit very well
> >> >> >> > with the indexer.
> >> >> >> > I advise you to use the dump provided by BaseKB (http://basekb.com),
> >> >> >> > which is a curated Freebase dump.
> >> >> >> > I did not have any problem indexing it using that dump.
> >> >> >> >
> >> >> >> > Regards
> >> >> >> >
> >> >> >> > On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com>
> >> >> wrote:
> >> >> >> >
> >> >> >> >> Hi,
> >> >> >> >>
> >> >> >> >> I am working on indexing Freebase data within EntityHub and
> >> observed
> >> >> >> >> following issue:
> >> >> >> >>
> >> >> >> >> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ]
> >> Element
> >> >> or
> >> >> >> >> attribute do not match QName production:
> >> QName::=(NCName':')?NCName.
> >> >> >> >>
> >> >> >> >> I would appreciate any help pertaining to this issue.
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >> Rajan
> >> >> >> >>
> >> >> >> >> *Steps followed:*
> >> >> >> >>
> >> >> >> >> *1. Initialization: *
> >> >> >> >> java -jar
> >> >> >> org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar
> >> >> >> >>  init
> >> >> >> >>
> >> >> >> >> *2. Download the data:*
> >> >> >> >> Download data and copy it to
> >> >> >> https://developers.google.com/freebase/data
> >> >> >> >>
> >> >> >> >> *3. Performed execution of fbrankings-uri.sh*
> >> >> >> >> It generated incoming_links.txt under resources directory as
> >> follows
> >> >> >> >>
> >> >> >> >> 10888430 m.0kpv11
> >> >> >> >> 3741261 m.019h
> >> >> >> >> 2667858 m.0775xx5
> >> >> >> >> 2667804 m.0775xvm
> >> >> >> >> 1875352 m.01xryvm
> >> >> >> >> 1739262 m.05zppz
> >> >> >> >> 1369590 m.01xrzlb
> >> >> >> >>
> >> >> >> >> *4. Performed execution of fixit script*
> >> >> >> >>
> >> >> >> >> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
> >> >> >> >>
> >> >> >> >> *5. Rename the fixed file to freebase.rdf.gz and copy it *
> >> >> >> >> to indexing/resources/rdfdata
> >> >> >> >>
> >> >> >> >> *6. config/iditer.properties file has following setting*
> >> >> >> >> #id-namespace=http://freebase.com/
> >> >> >> >> ns-prefix-state=false
> >> >> >> >>
> >> >> >> >> *7. Performed run of following command:*
> >> >> >> >> java -jar -Xmx32g
> >> >> >> >>
> org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar
> >> >> index
> >> >> >> >>
> >> >> >> >> The error dump on stdout is as follows:
> >> >> >> >>
> >> >> >> >> 01:37:32,884 [Thread-0] INFO
> >> solryard.SolrYardIndexingDestination -
> >> >> >> ...
> >> >> >> >> copy Solr Configuration form
> >> >> >> /private/tmp/freebase/indexing/config/freebase
> >> >> >> >> to
> >> >> /private/tmp/freebase/indexing/destination/indexes/default/freebase
> >> >> >> >> 01:37:32,895 [Thread-3] INFO  jenatdb.RdfResourceImporter -
>  -
> >> >> bulk
> >> >> >> >> loading File freebase.rdf.gz using Format Lang:RDF/XML
> >> >> >> >> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> >> Start
> >> >> >> >> triples data phase
> >> >> >> >> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - **
> >> Load
> >> >> >> empty
> >> >> >> >> triples table
> >> >> >> >> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ]
> >> >> Element or
> >> >> >> >> attribute do not match QName production:
> >> QName::=(NCName':')?NCName.*
> >> >> >> >> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> >> Finish
> >> >> >> >> triples data phase
> >> >> >> >> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> >> Finish
> >> >> >> >> triples load
> >> >> >> >> 01:37:32,960 [Thread-3] INFO  source.ResourceLoader - Ignore
> Error
> >> >> for
> >> >> >> File
> >> >> >> >>
> /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz
> >> and
> >> >> >> >> continue
> >> >> >> >>
> >> >> >> >> Additional Reference Point:
> >> >> >> >>
> >> >> >> >> *Original Freebase dump size:*  31025015397 May 14 18:10
> >> >> >> >> freebase-rdf-latest.gz
> >> >> >> >> *Fixed Freebase dump size:* 31026818367 May 15 12:45
> >> >> >> >> freebase-rdf-latest-fixed.gz
> >> >> >> >> *Incoming Links size: *1206745360 May 17 00:42
> incoming_links.txt
> >> >> >> >>
> >> >> >> >
> >> >> >> > --
> >> >> >> >
> >> >> >> > ------------------------------
> >> >> >> > This message should be regarded as confidential. If you have
> >> received
> >> >> >> this
> >> >> >> > email in error please notify the sender and destroy it
> immediately.
> >> >> >> > Statements of intent shall only become binding when confirmed in
> >> hard
> >> >> >> copy
> >> >> >> > by an authorised signatory.
> >> >> >> >
> >> >> >> > Zaizi Ltd is registered in England and Wales with the
> registration
> >> >> number
> >> >> >> > 6440931. The Registered Office is Brook House, 229 Shepherds
> Bush
> >> >> Road,
> >> >> >> > London W6 7AN.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> | Rupert Westenthaler             rupert.westentha...@gmail.com
> >> >> >> | Bodenlehenstraße 11                              ++43-699-11108907
> >> >> >> | A-5500 Bischofshofen
> >> >> >> | REDLINK.CO ..........................................................................
> >> >> >> | http://redlink.co/
