[ https://issues.apache.org/jira/browse/RYA-500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532874#comment-16532874 ]
ASF GitHub Bot commented on RYA-500:
------------------------------------

Github user kchilton2 commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/299#discussion_r200156492

--- Diff: extras/rya.manual/src/site/markdown/loaddata.md ---
@@ -92,29 +92,55 @@ The default "format" is RDF/XML, but these formats are supported : RDFXML, NTRIP
 
 ## Bulk Loading data
 
-Bulk loading data is done through Map Reduce jobs
+Bulk loading data is done through Map Reduce jobs.
 
 ### Bulk Load RDF data
 
-This Map Reduce job will read files into memory and parse them into statements. The statements are saved into the store. Here is an example for storing in Accumulo:
+This Map Reduce job will read files into memory and parse them into statements. The statements are saved into the triplestore.
+Here are the steps to prepare and run the job:
+
+ * Load the RDF data to HDFS. It can be single of multiple volumes and directories in them.
+ * Also load the `mapreduce/target/rya.mapreduce-<version>-shaded.jar` executable jar file to HDFS.
+ * Run the following sample command:
 
 ```
-hadoop jar target/rya.mapreduce-3.2.10-SNAPSHOT-shaded.jar org.apache.rya.accumulo.mr.RdfFileInputTool -Dac.zk=localhost:2181 -Dac.instance=accumulo -Dac.username=root -Dac.pwd=secret -Drdf.tablePrefix=triplestore_ -Drdf.format=N-Triples /tmp/temp.ntrips
+hadoop hdfs://volume/rya.mapreduce-<version>-shaded.jar org.apache.rya.accumulo.mr.tools.RdfFileInputTool -Dac.zk=localhost:2181 -Dac.instance=accumulo -Dac.username=root -Dac.pwd=secret -Drdf.tablePrefix=triplestore_ -Drdf.format=N-Triples hdfs://volume/dir1,hdfs://volume/dir2,hdfs://volume/file1.nt
 ```
 
 Options:
 
-- rdf.tablePrefix : The tables (spo, po, osp) are prefixed with this qualifier. The tables become: (rdf.tablePrefix)spo,(rdf.tablePrefix)po,(rdf.tablePrefix)osp
-- ac.* : Accumulo connection parameters
-- rdf.format : See RDFFormat from RDF4J, samples include (Trig, N-Triples, RDF/XML)
-- sc.use_freetext, sc.use_geo, sc.use_temporal, sc.use_entity : If any of these are set to true, statements will also be
+- **rdf.tablePrefix** - The tables (spo, po, osp) are prefixed with this qualifier.
+  The tables become: (rdf.tablePrefix)spo,(rdf.tablePrefix)po,(rdf.tablePrefix)osp
+- **ac.*** - Accumulo connection parameters
+- **rdf.format** - See RDFFormat from RDF4J, samples include (Trig, N-Triples, RDF/XML)
+- **sc.use_freetext, sc.use_geo, sc.use_temporal, sc.use_entity** - If any of these are set to true, statements will also be
 added to the enabled secondary indices.
-- sc.freetext.predicates, sc.geo.predicates, sc.temporal.predicates: If the associated indexer is enabled, these options specify
+- **sc.freetext.predicates, sc.geo.predicates, sc.temporal.predicates** - If the associated indexer is enabled, these options specify
 which statements should be sent to that indexer (based on the predicate). If not given, all indexers will attempt to index all statements.
-The argument is the directory/file to load. This file needs to be loaded into HDFS before running. If loading a directory, all files should have the same RDF
-format.
+The positional argument is a comma separated list of directories/files to load.
+They need to be loaded into HDFS before running. If loading a directory,
+all files should have the same RDF format.
+
+Once the data is loaded, it is actually a good practice to compact your tables.
+You can do this by opening the accumulo shell shell and running the compact

--- End diff --

    "the accumulo shell shell and"

> Make RdfFileInputTool to accept multiple input paths
> ----------------------------------------------------
>
>                 Key: RYA-500
>                 URL: https://issues.apache.org/jira/browse/RYA-500
>             Project: Rya
>          Issue Type: Improvement
>    Affects Versions: 3.2.12
>            Reporter: Maxim Kolchin
>            Priority: Trivial
>              Labels: mapreduce
>
> We store RDF files in multiple folders where each folder contains data about
> a specific type of entity (e.g. person, company, etc.). So it's not
> convenient that the RdfFileInputTool allows only a single input path.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
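The improvement the issue asks for amounts to treating RdfFileInputTool's single positional argument as a comma separated list of HDFS paths rather than one path. A minimal, self-contained sketch of that parsing step (this is illustrative only, not the actual Rya implementation; the class and method names are hypothetical, and in practice Hadoop's `FileInputFormat.setInputPaths(Job, String)` already accepts such a comma separated string):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Illustrative sketch for RYA-500: split the tool's positional argument
 * into multiple input paths. Class/method names are hypothetical; the
 * real tool would hand the resulting paths to Hadoop's FileInputFormat.
 */
public class InputPathsSketch {

    /** Split a comma separated path list, trimming whitespace and dropping empty entries. */
    public static List<String> parseInputPaths(String arg) {
        return Arrays.stream(arg.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Same positional argument shape as the sample command in the diff above.
        String positionalArg = "hdfs://volume/dir1,hdfs://volume/dir2,hdfs://volume/file1.nt";
        List<String> paths = parseInputPaths(positionalArg);
        System.out.println(paths.size());   // 3
        System.out.println(paths.get(2));   // hdfs://volume/file1.nt
    }
}
```

One caveat worth noting for a real implementation: this simple split assumes the paths themselves contain no commas, which is the usual convention Hadoop itself relies on for comma separated input path lists.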