Thanks. I’d already implemented something like this based on some docs I found.
I’m a little confused about the scenario for reading the splits on slaves:

  - does each slave read all of the splits, or is the master process
    responsible for obtaining the list of splits and then modifying the
    ReaderContext to contain a partial list before passing the ReaderContext
    to the slave?

  - how does the master pass the ReaderContext to the slave?

  - are there any real-world code examples?

Thanks
Brian

On Jun 16, 2014, at 12:19 PM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:

> Here is the code sketch to get you started: 
> 
> Step 1. Create a builder:
> 
> ReadEntity.Builder builder = new ReadEntity.Builder();
> String database = ...
> builder.withDatabase(database);
> String table = ...
> builder.withTable(table);
> String filter = ...
> if (filter != null) {
>   builder.withFilter(filter);
> }
> String region = getString(context.getRegion());
> if (region != null) {
>   builder.withRegion(region);
> }
> 
> 
> Step 2: Get initial reader context
> 
> Map<String, String> config = ...
> // make sure that you have the hive.metastore.uris property in the config
> ReadEntity entity = builder.build();
> ReaderContext readerContext =
>     DataTransferFactory.getHCatReader(entity, config).prepareRead();
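> 
> For example, the config map could be built like this (a minimal sketch; the
> thrift URI is a placeholder for your own metastore host):
> 
> Map<String, String> config = new HashMap<String, String>();
> config.put("hive.metastore.uris", "thrift://your-metastore-host:9083");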
> 
> Step 3: Get the input splits and the Hadoop Configuration
> 
> List<InputSplit> splits = readerContext.getSplits();
> Configuration conf = readerContext.getConf(); // "conf", to avoid clashing with the config map from Step 2
> 
> Step 4: Get records
> 
> a) for each input split get the reader:
> 
> HCatReader hcatReader = DataTransferFactory.getHCatReader(inputSplit, conf);
> 
> Iterator<HCatRecord> records = hcatReader.read();
> 
> b) Iterate over the records for that reader
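> 
> Putting a) and b) together, a rough sketch of the read loop, using the splits
> and conf obtained in Steps 2 and 3 above (untested):
> 
> for (InputSplit split : splits) {
>   HCatReader hcatReader = DataTransferFactory.getHCatReader(split, conf);
>   Iterator<HCatRecord> records = hcatReader.read();
>   while (records.hasNext()) {
>     HCatRecord record = records.next();
>     // process the record, e.g. record.get(0) for the first column
>   }
> }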
> 
> On Mon, Jun 16, 2014 at 9:57 AM, Brian Jeltema 
> <brian.jelt...@digitalenvoy.net> wrote:
> regarding:
> 
>> 3. To read the HCat records....
>> 
>> It depends on how you'd like to read the records... will you be reading ALL
>> the records remotely from the client app, or will you get input splits and
>> read the records on mappers?
>> 
>> The code will be different (somewhat)... let me know...
> 
> 
> in this case I’d be reading all of the records remotely from the client app
> 
> TIA
> Brian
> 
> On Jun 13, 2014, at 9:51 AM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:
> 
>> I am not sure about java docs... ;-]
>> I have spent the last three years integrating with HCat and to make it work
>> had to go through the code...
>> 
>> So here are some samples that can be helpful to start with. If you are using 
>> Hive 0.12.0 I would not bother with the new APIs... I had to create some 
>> shim classes for HCat to make my code version independent but I cannot share 
>> that. 
>> 
>> So 
>> 
>> 1. To enumerate tables ... just use Hive client ... this seems to be version 
>> independent 
>> 
>>    HiveMetaStoreClient hiveMetastoreClient = new HiveMetaStoreClient(conf);
>>    // the conf should contain the "hive.metastore.uris" property that points to
>>    // your Hive Metastore thrift server
>>    List<String> databases = hiveMetastoreClient.getAllDatabases();
>>    // this will get you all the databases
>>    List<String> tables = hiveMetastoreClient.getAllTables(database);
>>    // this will get you all the tables for the given database
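>> 
>> For example, the conf used above could be created like this (a minimal
>> sketch; the thrift URI is a placeholder for your own metastore host):
>> 
>> import org.apache.hadoop.hive.conf.HiveConf;
>> import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
>> 
>> HiveConf conf = new HiveConf();
>> conf.set("hive.metastore.uris", "thrift://your-metastore-host:9083");
>> HiveMetaStoreClient hiveMetastoreClient = new HiveMetaStoreClient(conf);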
>> 
>> 2. To get the table schema... I assume that you are after the HCat schema
>> 
>> 
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.mapreduce.InputSplit;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
>> import org.apache.hcatalog.data.schema.HCatSchema;
>> import org.apache.hcatalog.mapreduce.HCatInputFormat;
>> import org.apache.hcatalog.mapreduce.HCatSplit;
>> import org.apache.hcatalog.mapreduce.InputJobInfo;
>> 
>> 
>>   Job job = new Job(config);
>>   job.setJarByClass(XXXXXX.class); // this will be your class
>>   job.setInputFormatClass(HCatInputFormat.class);
>>   job.setOutputFormatClass(TextOutputFormat.class);
>>   InputJobInfo inputJobInfo = InputJobInfo.create("my_data_base",
>>       "my_table", "partition filter");
>>   HCatInputFormat.setInput(job, inputJobInfo);
>>   HCatSchema s = HCatInputFormat.getTableSchema(job);
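>> 
>> Once you have the HCatSchema you can use it to inspect the columns and,
>> later, to pull fields out of each HCatRecord by name. A rough sketch (the
>> column name is a placeholder):
>> 
>> for (String fieldName : s.getFieldNames()) {
>>   System.out.println(fieldName + ": " + s.get(fieldName).getType());
>> }
>> // while iterating records (see item 3 below):
>> // Object value = record.get("my_column", s);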
>> 
>> 
>> 3. To read the HCat records....
>> 
>> It depends on how you'd like to read the records... will you be reading ALL
>> the records remotely from the client app, or will you get input splits and
>> read the records on mappers?
>> 
>> The code will be different (somewhat)... let me know...
>> 
>> On Fri, Jun 13, 2014 at 8:25 AM, Brian Jeltema 
>> <brian.jelt...@digitalenvoy.net> wrote:
>> Version 0.12.0.
>> 
>> I’d like to obtain the table’s schema, scan a table partition, and use the 
>> schema to parse the rows.
>> 
>> I can probably figure this out by looking at the HCatalog source. My concern
>> was that the HCatalog packages in the Hive distributions are excluded from
>> the JavaDoc, which implies that the API is not public. Is there a reason
>> for this?
>> 
>> Brian
>> 
>> On Jun 13, 2014, at 9:10 AM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:
>> 
>>> You should be able to access this information. The exact API depends on the
>>> version of Hive/HCat. As you know, the earlier HCat API is being deprecated
>>> and will be removed in Hive 0.14.0. I can provide you with a code sample if
>>> you tell me what you are trying to do and what version of Hive you are
>>> using.
>>> 
>>> 
>>> On Fri, Jun 13, 2014 at 7:33 AM, Brian Jeltema 
>>> <brian.jelt...@digitalenvoy.net> wrote:
>>> I’m experimenting with HCatalog, and would like to be able to access tables 
>>> and their schema
>>> from a Java application (not Hive/Pig/MapReduce). However, the API seems to 
>>> be hidden, which
>>> leads me to believe that this is not a supported use case. Is
>>> HCatalog use limited to
>>> one of the supported frameworks?
>>> 
>>> TIA
>>> 
>>> Brian
>>> 
>> 
>> 
> 
> 
