Hi folks,

We are planning to use Giraph to replace some of the graph processing in our Hive workflow. We didn't use HiveGiraphRunner to submit the job directly; instead, we configure and submit the job from our own program.
However, after the job is submitted to Hadoop, an NPE is thrown while HiveApiInputFormat is computing the InputSplits.

Below is the Scala code snippet for the job configuration:

==================================================
val hive_config_copy = new HiveConf(hive_config)
val workers = 1
val dbName = "default"
val edgeInputTableStr = "transitMatrix"
val vertexInputTableStr = "initialRank"
val vertexOutputTableStr = "twitterRank"

HIVE_TO_VERTEX_CLASS.set(hive_config_copy, classOf[InitialRankToVertex])
HIVE_TO_EDGE_CLASS.set(hive_config_copy, classOf[TransitMatrixToEdge])
hive_config_copy.setClass(HiveVertexWriter.VERTEX_TO_HIVE_KEY,
  classOf[TRVertexToHive],
  classOf[VertexToHive[Text, DoubleWritable, Writable]])

val job = new GiraphJob(hive_config_copy, getClass().getName())
var giraphConf = job.getConfiguration()
giraphConf.setVertexClass(classOf[TwitterRankVertex])

var hiveVertexInputDescription = new HiveInputDescription()
var hiveEdgeInputDescription = new HiveInputDescription()
var hiveOutputDescription = new HiveOutputDescription()

/**
 * Initialize hive input db and tables
 */
hiveVertexInputDescription.setDbName(dbName)
hiveEdgeInputDescription.setDbName(dbName)
hiveOutputDescription.setDbName(dbName)
hiveEdgeInputDescription.setTableName(edgeInputTableStr)
hiveVertexInputDescription.setTableName(vertexInputTableStr)
hiveOutputDescription.setTableName(vertexOutputTableStr)

/**
 * Initialize the hive input settings
 */
hiveVertexInputDescription.setNumSplits(HIVE_VERTEX_SPLITS.get(giraphConf))
HiveApiInputFormat.setProfileInputDesc(giraphConf, hiveVertexInputDescription, VERTEX_INPUT_PROFILE_ID)
giraphConf.setVertexInputFormatClass(classOf[HiveVertexInputFormat[Text, DoubleWritable, Writable]])
HiveTableSchemas.put(giraphConf, VERTEX_INPUT_PROFILE_ID, hiveVertexInputDescription.hiveTableName())

hiveEdgeInputDescription.setNumSplits(HIVE_EDGE_SPLITS.get(giraphConf))
HiveApiInputFormat.setProfileInputDesc(giraphConf, hiveEdgeInputDescription, EDGE_INPUT_PROFILE_ID)
giraphConf.setEdgeInputFormatClass(classOf[HiveEdgeInputFormat[Text, DoubleWritable]]);
HiveTableSchemas.put(giraphConf, EDGE_INPUT_PROFILE_ID, hiveEdgeInputDescription.hiveTableName())

/**
 * Initialize the hive output settings
 */
HiveApiOutputFormat.initProfile(giraphConf, hiveOutputDescription, VERTEX_OUTPUT_PROFILE_ID)
giraphConf.setVertexOutputFormatClass(classOf[HiveVertexOutputFormat[Text, DoubleWritable, Writable]])
HiveTableSchemas.put(giraphConf, VERTEX_OUTPUT_PROFILE_ID, hiveOutputDescription.hiveTableName())

/**
 * Set number of workers
 */
giraphConf.setWorkerConfiguration(workers, workers, 100.0f)

/**
 * Run the job
 */
if (job.run(true))
  return true
else
  return false
=========================================

Here are the task logs:

2013-11-16 12:19:19,032 INFO com.facebook.giraph.hive.input.HiveApiInputFormat: getSplits for profile vertex_input_profile
2013-11-16 12:19:19,034 WARN org.apache.hadoop.hive.conf.HiveConf: hive-site.xml not found on CLASSPATH
2013-11-16 12:19:19,161 INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
2013-11-16 12:19:19,164 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.mapred.TextInputFormat.isSplitable(TextInputFormat.java:42)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:232)
    at com.facebook.giraph.hive.input.HiveApiInputFormat.computeSplits(HiveApiInputFormat.java:183)
    at com.facebook.giraph.hive.input.HiveApiInputFormat.getSplits(HiveApiInputFormat.java:166)
    at com.facebook.giraph.hive.input.HiveApiInputFormat.getSplits(HiveApiInputFormat.java:147)
    at org.apache.giraph.hive.input.vertex.HiveVertexInputFormat.getSplits(HiveVertexInputFormat.java:60)
    at org.apache.giraph.master.BspServiceMaster.generateInputSplits(BspServiceMaster.java:314)
    at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:626)
    at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:692)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:100)

I'm not sure whether something is missing from the job configuration. Can anybody help?

Thanks in advance.

Best Wishes,
~Andy