Dylan Bethune-Waddell created TINKERPOP-1117:
------------------------------------------------
Summary: InputFormatRDD.readGraphRDD requires a valid
gremlin.hadoop.inputLocation, breaking InputFormats (Cassandra, HBase) that
don't need one
Key: TINKERPOP-1117
URL: https://issues.apache.org/jira/browse/TINKERPOP-1117
Project: TinkerPop
Issue Type: Improvement
Affects Versions: 3.2.0-incubating
Reporter: Dylan Bethune-Waddell
Priority: Minor
Fix For: 3.2.0-incubating
On line 43, the call to Constants.getSearchGraphLocation returns
Optional.empty() when gremlin.hadoop.inputLocation=none, the setting advised
for Titan's CassandraInputFormat and HBaseInputFormat. Changing readGraphRDD to
call .isPresent() and set the storage location in the config only when a value
is present allows SparkGraphComputer from the 3.2.0-SNAPSHOT branch to work
with Titan via CassandraInputFormat in a traversal source:
{code}
// Added import
import java.util.Optional;

@Override
public JavaPairRDD<Object, VertexWritable> readGraphRDD(final Configuration configuration, final JavaSparkContext sparkContext) {
    final org.apache.hadoop.conf.Configuration hadoopConfiguration = ConfUtil.makeHadoopConfiguration(configuration);
    // Previously this value was passed directly to hadoopConfiguration.set(...)
    final Optional<String> searchGraph = Constants.getSearchGraphLocation(
            configuration.getString(Constants.GREMLIN_HADOOP_INPUT_LOCATION),
            FileSystemStorage.open(hadoopConfiguration));
    // Only set the storage location when one was actually found
    if (searchGraph.isPresent()) {
        hadoopConfiguration.set(configuration.getString(Constants.GREMLIN_HADOOP_INPUT_LOCATION), searchGraph.get());
    }
    return sparkContext.newAPIHadoopRDD(hadoopConfiguration,
            (Class<InputFormat<NullWritable, VertexWritable>>) hadoopConfiguration.getClass(Constants.GREMLIN_HADOOP_GRAPH_INPUT_FORMAT, InputFormat.class),
            NullWritable.class,
            VertexWritable.class)
            .mapToPair(tuple -> new Tuple2<>(tuple._2().get().id(), new VertexWritable(tuple._2().get())));
}
{code}
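For illustration only, here is a minimal standalone sketch of the failure mode described above and of the guard that avoids it. It is not TinkerPop code: findLocation is a hypothetical stand-in that merely mimics getSearchGraphLocation returning Optional.empty() for "none", the "input.location" key is made up, and a plain Map stands in for the Hadoop configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

public class OptionalLocationSketch {

    // Hypothetical stand-in for Constants.getSearchGraphLocation: yields an
    // empty Optional when no usable input location is configured.
    static Optional<String> findLocation(final String configured) {
        if (configured == null || configured.equals("none")) {
            return Optional.empty();
        }
        return Optional.of(configured);
    }

    // Unguarded variant: calls get() directly, so an empty Optional throws
    // NoSuchElementException.
    static void setUnguarded(final Map<String, String> conf, final String configured) {
        conf.put("input.location", findLocation(configured).get());
    }

    // Guarded variant: writes the location only when one was found.
    static void setGuarded(final Map<String, String> conf, final String configured) {
        findLocation(configured).ifPresent(location -> conf.put("input.location", location));
    }

    public static void main(final String[] args) {
        final Map<String, String> conf = new HashMap<>();
        setGuarded(conf, "none");
        System.out.println("guarded, none: " + conf);
        try {
            setUnguarded(conf, "none");
        } catch (final NoSuchElementException e) {
            System.out.println("unguarded, none: NoSuchElementException");
        }
        setGuarded(conf, "data/graph.kryo");
        System.out.println("guarded, path: " + conf);
    }
}
```

The guarded variant leaves the configuration untouched for input formats that take no location, which is the behaviour the change to readGraphRDD is aiming for.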
I don't really understand the intended behaviour, so this is probably not the
right thing to do. Would it work instead to add a configuration variable such
as "gremlin.hadoop.inputLocationRequired" that defaults to true and can be set
to false for these other input formats?
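A rough sketch of how such a flag might behave, assuming the property name proposed above. Neither the flag nor the resolve helper exists in TinkerPop; java.util.Properties stands in for the real configuration, and treating "none" as a missing location is an assumption carried over from the report.

```java
import java.util.Optional;
import java.util.Properties;

public class InputLocationRequiredSketch {

    static final String INPUT_LOCATION = "gremlin.hadoop.inputLocation";
    // Hypothetical property taken from the proposal; not an existing setting.
    static final String INPUT_LOCATION_REQUIRED = "gremlin.hadoop.inputLocationRequired";

    // Returns the location to use, or empty when none is configured and none
    // is required. Throws if a location is required (the default) but missing.
    static Optional<String> resolve(final Properties props) {
        final boolean required =
                Boolean.parseBoolean(props.getProperty(INPUT_LOCATION_REQUIRED, "true"));
        final String location = props.getProperty(INPUT_LOCATION);
        final boolean missing = location == null || location.equals("none");
        if (missing && required) {
            throw new IllegalStateException(INPUT_LOCATION + " must be set");
        }
        return missing ? Optional.empty() : Optional.of(location);
    }

    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.setProperty(INPUT_LOCATION, "none");
        props.setProperty(INPUT_LOCATION_REQUIRED, "false");
        // With the flag off, a missing location is tolerated.
        System.out.println("location present: " + resolve(props).isPresent());
    }
}
```

Defaulting the flag to true keeps file-based input formats failing fast on a missing location, while Cassandra- or HBase-backed formats could opt out explicitly.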