Thanks. I’d already implemented something like this based on some docs I found. I’m a little confused about the scenario for reading the splits on slaves:
- does each slave read all of the splits, or is the master process responsible for obtaining the list of splits and then modifying the ReaderContext to contain a partial list before passing the ReaderContext to the slave?
- how does the master pass the ReaderContext to the slave?
- are there any real-world code examples?

Thanks
Brian

On Jun 16, 2014, at 12:19 PM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:

> Here is the code sketch to get you started.
>
> Step 1: Create a builder:
>
>     ReadEntity.Builder builder = new ReadEntity.Builder();
>     String database = ...
>     builder.withDatabase(database);
>     String table = ...
>     builder.withTable(table);
>     String filter = ...
>     if (filter != null) {
>         builder.withFilter(filter);
>     }
>     String region = ...
>     if (region != null) {
>         builder.withRegion(region);
>     }
>
> Step 2: Get the initial reader context:
>
>     // make sure the config contains the hive.metastore.uris property
>     Map<String, String> config = ...
>     ReadEntity entity = builder.build();
>     ReaderContext readerContext =
>         DataTransferFactory.getHCatReader(entity, config).prepareRead();
>
> Step 3: Get the input splits and the Hadoop Configuration:
>
>     List<InputSplit> splits = readerContext.getSplits();
>     Configuration conf = readerContext.getConf();
>
> Step 4: Get the records:
>
> a) for each input split, get a reader:
>
>     HCatReader hcatReader = DataTransferFactory.getHCatReader(split, conf);
>     Iterator<HCatRecord> records = hcatReader.read();
>
> b) iterate over the records for that reader.
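Steps 1-3 above run on the master; step 4 is what each slave does. One workable way to connect the two, assuming the Hive 0.12 org.apache.hcatalog.data.transfer API (where ReaderContext is Externalizable): the master calls prepareRead() once and serializes the whole ReaderContext, every slave deserializes the same bytes, and each slave reads only the splits assigned to it. The context never has to be rewritten to hold a partial split list. A minimal sketch follows; the metastore URI, the database and table names, and the round-robin split assignment are all illustrative:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hcatalog.data.HCatRecord;
    import org.apache.hcatalog.data.transfer.DataTransferFactory;
    import org.apache.hcatalog.data.transfer.HCatReader;
    import org.apache.hcatalog.data.transfer.ReadEntity;
    import org.apache.hcatalog.data.transfer.ReaderContext;

    public class DistributedReadSketch {

        // Master: prepare the read once and serialize the resulting context.
        // ReaderContext is Externalizable, so plain Java serialization is one
        // way to ship it to the slaves.
        static byte[] prepareOnMaster() throws Exception {
            ReadEntity.Builder builder = new ReadEntity.Builder();
            builder.withDatabase("my_database"); // hypothetical names
            builder.withTable("my_table");
            ReadEntity entity = builder.build();

            Map<String, String> config = new HashMap<String, String>();
            config.put("hive.metastore.uris", "thrift://metastore-host:9083"); // adjust

            ReaderContext context =
                DataTransferFactory.getHCatReader(entity, config).prepareRead();

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(context); // send these bytes to every slave
            out.close();
            return bytes.toByteArray();
        }

        // Slave: deserialize the SAME context everywhere; each slave reads only
        // the splits assigned to it (here, round-robin by slave index).
        static void readOnSlave(byte[] contextBytes, int slaveIndex, int slaveCount)
                throws Exception {
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(contextBytes));
            ReaderContext context = (ReaderContext) in.readObject();
            in.close();

            List<InputSplit> splits = context.getSplits();
            for (int i = slaveIndex; i < splits.size(); i += slaveCount) {
                HCatReader reader =
                    DataTransferFactory.getHCatReader(splits.get(i), context.getConf());
                Iterator<HCatRecord> records = reader.read();
                while (records.hasNext()) {
                    HCatRecord record = records.next();
                    // process the record ...
                }
            }
        }
    }

How the serialized bytes travel from the master to the slaves (a file on shared storage, an RPC payload, a job configuration entry) is up to the application; the API does not prescribe a transport.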
> On Mon, Jun 16, 2014 at 9:57 AM, Brian Jeltema <brian.jelt...@digitalenvoy.net> wrote:
>
> regarding:
>
>> 3. To read the HCat records....
>>
>> It depends on how you'd like to read the records ... will you be reading ALL the records remotely from the client app, or will you get input splits and read the records on mappers....???
>>
>> The code will be different (somewhat)... let me know...
>
> in this case I’d be reading all of the records remotely from the client app
>
> TIA
> Brian
>
> On Jun 13, 2014, at 9:51 AM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:
>
>> I am not sure about the java docs... ;-]
>> I have spent the last three years integrating with HCat, and to make it work I had to go through the code...
>>
>> So here are some samples that can be helpful to start with. If you are using Hive 0.12.0 I would not bother with the new APIs... I had to create some shim classes for HCat to make my code version independent, but I cannot share that.
>>
>> 1. To enumerate tables ... just use the Hive client ... this seems to be version independent:
>>
>>     HiveMetaStoreClient hiveMetastoreClient = new HiveMetaStoreClient(conf);
>>     // the conf should contain the "hive.metastore.uris" property pointing
>>     // to your Hive Metastore thrift server
>>
>>     // this will get you all the databases
>>     List<String> databases = hiveMetastoreClient.getAllDatabases();
>>
>>     // this will get you all the tables for the given database
>>     List<String> tables = hiveMetastoreClient.getAllTables(database);
>>
>> 2. To get the table schema... I assume that you are after the HCat schema:
>>
>>     import org.apache.hadoop.conf.Configuration;
>>     import org.apache.hadoop.mapreduce.Job;
>>     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
>>     import org.apache.hcatalog.data.schema.HCatSchema;
>>     import org.apache.hcatalog.mapreduce.HCatInputFormat;
>>     import org.apache.hcatalog.mapreduce.InputJobInfo;
>>
>>     Job job = new Job(config);
>>     job.setJarByClass(XXXXXX.class); // this will be your class
>>     job.setInputFormatClass(HCatInputFormat.class);
>>     job.setOutputFormatClass(TextOutputFormat.class);
>>     InputJobInfo inputJobInfo =
>>         InputJobInfo.create("my_data_base", "my_table", "partition filter");
>>     HCatInputFormat.setInput(job, inputJobInfo);
>>     HCatSchema s = HCatInputFormat.getTableSchema(job);
>>
>> 3. To read the HCat records....
>>
>> It depends on how you'd like to read the records ... will you be reading ALL the records remotely from the client app, or will you get input splits and read the records on mappers....???
>>
>> The code will be different (somewhat)... let me know...
>>
>> On Fri, Jun 13, 2014 at 8:25 AM, Brian Jeltema <brian.jelt...@digitalenvoy.net> wrote:
>>
>> Version 0.12.0.
>>
>> I’d like to obtain the table’s schema, scan a table partition, and use the schema to parse the rows.
>>
>> I can probably figure this out by looking at the HCatalog source. My concern was that the HCatalog packages in the Hive distributions are excluded from the JavaDoc, which implies that the API is not public. Is there a reason for this?
>>
>> Brian
>>
>> On Jun 13, 2014, at 9:10 AM, Dmitry Vasilenko <dvasi...@gmail.com> wrote:
>>
>>> You should be able to access this information. The exact API depends on the version of Hive/HCat. As you know, the earlier HCat API is being deprecated and will be removed in Hive 0.14.0. I can provide you with a code sample if you tell me what you are trying to do and what version of Hive you are using.
>>>
>>> On Fri, Jun 13, 2014 at 7:33 AM, Brian Jeltema <brian.jelt...@digitalenvoy.net> wrote:
>>>
>>> I’m experimenting with HCatalog, and would like to be able to access tables and their schema from a Java application (not Hive/Pig/MapReduce). However, the API seems to be hidden, which leads me to believe that this is not a supported use case. Is HCatalog use limited to one of the supported frameworks?
>>>
>>> TIA
>>>
>>> Brian
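To round out the Jun 13 samples, here is a self-contained sketch combining sample 1 (enumerating databases and tables through HiveMetaStoreClient) and sample 2 (fetching the HCat schema of one table) in a plain Java client. It assumes Hive/HCatalog 0.12 with the old org.apache.hcatalog packages; the metastore URI and the table name are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hcatalog.data.schema.HCatSchema;
    import org.apache.hcatalog.mapreduce.HCatInputFormat;
    import org.apache.hcatalog.mapreduce.InputJobInfo;

    public class HCatClientSketch {

        public static void main(String[] args) throws Exception {
            String metastoreUri = "thrift://metastore-host:9083"; // adjust to your cluster

            // Sample 1: enumerate databases and their tables via the metastore client.
            HiveConf hiveConf = new HiveConf();
            hiveConf.set("hive.metastore.uris", metastoreUri);
            HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf);
            for (String database : client.getAllDatabases()) {
                for (String table : client.getAllTables(database)) {
                    System.out.println(database + "." + table);
                }
            }
            client.close();

            // Sample 2: fetch the HCat schema of one table. The Job object only
            // carries configuration; nothing is submitted to the cluster.
            Configuration conf = new Configuration();
            conf.set("hive.metastore.uris", metastoreUri);
            Job job = new Job(conf);
            job.setInputFormatClass(HCatInputFormat.class);
            InputJobInfo inputJobInfo =
                InputJobInfo.create("my_database", "my_table", null); // null = no partition filter
            HCatInputFormat.setInput(job, inputJobInfo);
            HCatSchema schema = HCatInputFormat.getTableSchema(job);
            System.out.println(schema.getFieldNames());
        }
    }

setInput() contacts the metastore and caches the table information in the job configuration, and getTableSchema() reads it back from there, which is why no jar or output format needs to be configured just to inspect the schema.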