Hi Brian,

1. To enumerate databases and tables and to get the Hive table schema, you
can use the code I provided earlier (a minimal sketch follows below).
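
In case that earlier note is not handy, here is a minimal sketch using
HiveMetaStoreClient; the metastore URI is a placeholder, so adjust it to
your cluster:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.thrift.TException;

void dumpSchemas() throws TException {
  HiveConf hiveConf = new HiveConf();
  // placeholder URI; point this at your metastore
  hiveConf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://localhost:9083");
  HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf);
  try {
    for (String dbName : client.getAllDatabases()) {
      for (String tableName : client.getAllTables(dbName)) {
        Table table = client.getTable(dbName, tableName);
        // the Hive flavor of the schema: one FieldSchema per column
        for (FieldSchema column : table.getSd().getCols()) {
          System.out.println(dbName + "." + tableName + ": "
              + column.getName() + " " + column.getType());
        }
      }
    }
  } finally {
    client.close();
  }
}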

2. To get the HCatalog flavor of the table schema, you can use something
like this (the imports below cover points 2-4 as well):

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.shims.ShimLoader;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.common.ErrorType;
import org.apache.hive.hcatalog.common.HCatConstants;
import org.apache.hive.hcatalog.common.HCatException;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.schema.HCatFieldSchema;
import org.apache.hive.hcatalog.data.schema.HCatSchema;
import org.apache.hive.hcatalog.data.schema.HCatSchemaUtils;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hive.hcatalog.mapreduce.HCatSplit;

...................

Job job = new Job(configuration);
job.setJarByClass(XXXX.class); // your class name
job.setInputFormatClass(HCatInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
HCatInputFormat.setInput(job, "data_base_name", "table_name",
    "partition_filter");
// the HCatalog flavor of the schema for the table just bound to the job
HCatSchema schema = HCatInputFormat.getTableSchema(job.getConfiguration());


3. To get input splits you will have to do some gymnastics, as the team IMHO
made an unfortunate decision to change ReaderContext in a way that is not
compatible across versions. The new ReaderContext does not provide a
convenient way to get input splits, so you will have to use HCatInputFormat
directly:

ReadEntity.Builder builder = .... // fill in database/table/filter
ReadEntity entity = builder.build();
Job job = new Job(configuration);
HCatInputFormat inputFormat = HCatInputFormat.setInput(job,
    entity.getDbName(), entity.getTableName(), entity.getFilterString());
List<InputSplit> splits =
    inputFormat.getSplits(ShimLoader.getHadoopShims().newJobContext(job));
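
You can then feed each split to the record-reading code from point 4 below.
For example (SplitReader here is a hypothetical wrapper that holds the split
and the job's configuration and exposes the read() method sketched in
point 4; exception handling is elided):

for (InputSplit split : splits) {
  // hypothetical helper; see point 4 for its read() method
  Iterator<HCatRecord> records =
      new SplitReader(job.getConfiguration(), split).read();
  while (records.hasNext()) {
    System.out.println(records.next());
  }
}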



4. To get an iterator over the records, a bit more effort is needed. Here is
code that should do the work for a given input split and Hadoop
configuration (taken from the Hadoop job); it assumes the split and the
configuration are stored in fields of the enclosing class:


Iterator<HCatRecord> read() throws HCatException {
  HCatInputFormat inputFormat = new HCatInputFormat();
  @SuppressWarnings("rawtypes")
  final RecordReader<WritableComparable, HCatRecord> recordReader;
  try {
    TaskAttemptContext context = ShimLoader.getHadoopShims()
        .newTaskAttemptContext(this.configuration, null);
    recordReader = inputFormat.createRecordReader(this.inputSplit, context);
    recordReader.initialize(this.inputSplit, context);
  } catch (Exception e) {
    throw new HCatException(ErrorType.ERROR_NOT_INITIALIZED, e);
  }
  return new Iterator<HCatRecord>() {
    // nextKeyValue() advances the reader, so cache its result to keep
    // repeated hasNext() calls from skipping records
    private boolean fetched;
    private boolean hasValue;

    @Override
    public boolean hasNext() {
      if (!fetched) {
        try {
          hasValue = recordReader.nextKeyValue();
          fetched = true;
          if (!hasValue) {
            // no more records, so close the reader
            recordReader.close();
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
      return hasValue;
    }

    @Override
    public HCatRecord next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      fetched = false;
      try {
        return recordReader.getCurrentValue();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }

    @Override
    public void remove() {
      throw new UnsupportedOperationException();
    }
  };
}
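
With the schema from point 2 you can then pull individual fields out of each
record; a small sketch ("some_column" is a placeholder column name, and this
assumes it runs somewhere HCatException can propagate):

Iterator<HCatRecord> records = read();
while (records.hasNext()) {
  HCatRecord record = records.next();
  // "some_column" is a placeholder; take the name from the schema
  Object value = record.get("some_column", schema); // throws HCatException
  System.out.println(value);
}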

Note also that when you create the Hadoop configuration you will have to
disable the Hive client cache:

configuration.set(HCatConstants.HCAT_HIVE_CLIENT_DISABLE_CACHE,
    Boolean.TRUE.toString());

otherwise the client process will not terminate (see HIVE-6268).
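
For completeness, creating that configuration might look like this (the
metastore URI is again a placeholder for your cluster's):

Configuration configuration = new Configuration();
// placeholder URI; point the client at your Hive metastore
configuration.set("hive.metastore.uris", "thrift://metastore-host:9083");
configuration.set(HCatConstants.HCAT_HIVE_CLIENT_DISABLE_CACHE,
    Boolean.TRUE.toString());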

Hope this helps.

On Sat, Jun 21, 2014 at 9:11 AM, Brian Jeltema <
brian.jelt...@digitalenvoy.net> wrote:

> I’m also experimenting with version 0.13, and see that it differs from
> 0.12 significantly.
> Can you give me a code example for 0.13?
>
> Thanks
> Brian
>
> On Jun 13, 2014, at 9:25 AM, Brian Jeltema <brian.jelt...@digitalenvoy.net>
> wrote:
>
> Version 0.12.0.
>
> I’d like to obtain the table’s schema, scan a table partition, and use the
> schema to parse the rows.
>
> I can probably figure this out by looking at the HCatalog source. My
> concern was that
> the HCatalog packages in the Hive distributions are excluded in the
> JavaDoc, which implies
> that the API is not public. Is there a reason for this?
>
> Brian
>
> On Jun 13, 2014, at 9:10 AM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:
>
> You should be able to access this information. The exact API depends on
> the version of Hive/HCat. As you know, the earlier HCat API is being deprecated
> and will be removed in Hive 0.14.0. I can provide you with the code sample
> if you tell me what you are trying to do and what version of Hive you are
> using.
>
>
> On Fri, Jun 13, 2014 at 7:33 AM, Brian Jeltema <
> brian.jelt...@digitalenvoy.net> wrote:
>
>> I’m experimenting with HCatalog, and would like to be able to access
>> tables and their schema
>> from a Java application (not Hive/Pig/MapReduce). However, the API seems
>> to be hidden, which
>> leads me to believe that this is not a supported use case. Is
>> HCatalog use limited to
>> one of the supported frameworks?
>>
>> TIA
>>
>> Brian
>
>
>
>
>
