Paolo Castagna wrote:
> Hi,
> I have a MapReduce job with a map function which parses a line from an
> N-Quads file:
>
> private static final Logger log = LoggerFactory.getLogger(FirstMapper.class);
> private String inputFileName;
> private MapReduceParserProfile profile;
> private LabelToNode labelMapping;
>
> public void setup(Context context) throws IOException, InterruptedException {
>     inputFileName = context.getConfiguration().get("mapreduce.map.input.file");
>     Prologue prologue = new Prologue(null, IRIResolver.createNoResolve());
>     labelMapping = new MapReduceLabelToNode(inputFileName);
>     profile = new MapReduceParserProfile(prologue,
>         ErrorHandlerFactory.errorHandlerStd, labelMapping);
> }
>
> @Override
> public void map(LongWritable key, Text value, Context context)
>         throws IOException, InterruptedException {
>     if (log.isDebugEnabled()) log.debug("< ({}, {})", key, value);
>     SinkToContext sink = new SinkToContext(context);
>     Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(value.toString());
>     LangNQuads parser = new LangNQuads(tokenizer, profile, sink);
>     parser.parse();
> }
FirstMapper.java is here:
https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/FirstMapper.java
>
> (A RecordReader<LongWritable, QuadWritable> would be better, but for now the
> snippet above does its job. Almost.)
>
> The problem I have is with blank node labels.
>
> With MapReduce, the same file is split into multiple file splits which are
> parsed on different machines. Therefore, I would like to have my own
> LabelToNode implementation with an Allocator<String, Node> which takes the
> filename (or a hash of it) into account when it creates a new blank node.
>
> Something along these lines:
>
> public Node create(String label) {
>     return Node.createAnon(new AnonId(filename + "-" + label));
> }
>
> So, I have my MapReduceLabelToNode:
>
> public class MapReduceLabelToNode extends LabelToNode {
>
>     public MapReduceLabelToNode(String filename) {
>         super(new SingleScopePolicy(), new MapReduceAllocator(filename));
>     }
>
> ...
>
MapReduceLabelToNode.java is here:
https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/MapReduceLabelToNode.java
> But the LabelToNode constructor is private.
>
> Could we make it protected?
>
> Or, alternatively, how can I construct a LabelToNode object which uses my
> MapReduceAllocator?
>
> Thanks,
> Paolo
>
Another thing I struggled with is using RIOT parsers in a QuadRecordReader:
https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/io/QuadRecordReader.java
The problem is that Hadoop splits a file on a byte-level basis, so a split
often starts in the middle of a line. Is there a way I can tell RIOT, given an
InputStream, to skip to the next quad/line boundary and start parsing from
there?
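For plain text this is what Hadoop's LineRecordReader does: when a split does
not start at byte 0, it discards everything up to and including the next
newline, because that partial line belongs to the previous split. A minimal
sketch of that skip in plain java.io terms (the class and method names are
mine, not a RIOT or Hadoop API):

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: aligns a stream that starts mid-file to the next
// line boundary, the way Hadoop's LineRecordReader handles text splits.
public class SplitAligner {

    // If the split starts mid-file (start > 0), the partial line at the
    // beginning belongs to the previous split, so discard bytes up to and
    // including the next '\n' before parsing begins.
    // Returns the number of bytes skipped.
    public static long skipToNextLine(InputStream in, long start) throws IOException {
        if (start == 0) {
            return 0; // first split: already on a line boundary
        }
        long skipped = 0;
        int b;
        while ((b = in.read()) != -1) {
            skipped++;
            if (b == '\n') {
                break;
            }
        }
        return skipped;
    }
}
```

After the skip the stream starts on a quad boundary (N-Quads being
line-based); the question is whether RIOT can do this resynchronisation
itself, or whether I should do it before handing the stream to the tokenizer.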
Thanks,
Paolo