Re: HBase 0.98 addon for Flink 0.8

Flavio Pompermaier Fri, 07 Nov 2014 03:50:26 -0800

I also fixed the profile for Cloudera CDH5.1.3. You can build it with the
command:
      mvn clean install -Dmaven.test.skip=true -Dhadoop.profile=2 -Pvendor-repos,cdh5.1.3


However, it would be good to generate the specific jar when releasing
(e.g. flink-addons:flink-hbase:0.8.0-hadoop2-cdh5.1.3-incubating).

Best,
Flavio

On Fri, Nov 7, 2014 at 12:44 PM, Flavio Pompermaier <[email protected]>
wrote:

> I've just updated the code on my fork (synced with the current master and
> applied the improvements suggested in the comments on the related PR).
> I still have to understand how to write results back to an HBase
> Sink/OutputFormat...
>
>
> On Mon, Nov 3, 2014 at 12:05 PM, Flavio Pompermaier <[email protected]>
> wrote:
>
>> Thanks for the detailed answer. So if I run a job from my machine, I'll
>> have to download all the scanned data of a table, right?
>>
>> Still regarding the GenericTableOutputFormat, it is not clear to me how
>> to proceed.
>> I saw in the Hadoop compatibility addon that it is possible to get such
>> compatibility using the HBaseUtils class, so the open method should become
>> something like:
>>
>> @Override
>> public void open(int taskNumber, int numTasks) throws IOException {
>>     if (Integer.toString(taskNumber + 1).length() > 6) {
>>         throw new IOException("Task id too large.");
>>     }
>>     // Zero-pad the one-based task number to six digits.
>>     String paddedTask = String.format(java.util.Locale.ROOT, "%06d", taskNumber + 1);
>>     TaskAttemptID taskAttemptID =
>>         TaskAttemptID.forName("attempt__0000_r_" + paddedTask + "_0");
>>     this.configuration.set("mapred.task.id", taskAttemptID.toString());
>>     this.configuration.setInt("mapred.task.partition", taskNumber + 1);
>>     // for hadoop 2.2
>>     this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
>>     this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);
>>     try {
>>         this.context =
>>             HadoopUtils.instantiateTaskAttemptContext(this.configuration, taskAttemptID);
>>     } catch (Exception e) {
>>         throw new RuntimeException(e);
>>     }
>>     final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
>>     try {
>>         this.writer = outFormat.getRecordWriter(this.context);
>>     } catch (InterruptedException iex) {
>>         throw new IOException("Opening the writer was interrupted.", iex);
>>     }
>> }
>>
>> But I'm not sure how to pass the JobConf to the class, whether to merge
>> config files, where HFileOutputFormat2 writes the data, and how to
>> implement the public void writeRecord(Record record) API.
>> Could I have a short chat off the mailing list with the implementor of
>> this extension?
>>
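The attempt ID built in the quoted open() method is the prefix "attempt__0000_r_", the one-based task number zero-padded to six digits, and the suffix "_0". A minimal standalone sketch of just that formatting step follows. The class and method names are hypothetical and not part of Flink or Hadoop, and the sketch throws an unchecked exception (rather than IOException) only to stay self-contained.

```java
import java.util.Locale;

/** Hypothetical helper; not a Flink or Hadoop class. */
final class AttemptIdSketch {

    /** Builds "attempt__0000_r_NNNNNN_0" for a zero-based task number. */
    static String attemptIdFor(int taskNumber) {
        int oneBased = taskNumber + 1;
        if (Integer.toString(oneBased).length() > 6) {
            throw new IllegalArgumentException("Task id too large.");
        }
        // %06d left-pads with zeros to exactly six digits; Locale.ROOT keeps ASCII digits.
        return "attempt__0000_r_" + String.format(Locale.ROOT, "%06d", oneBased) + "_0";
    }

    public static void main(String[] args) {
        System.out.println(attemptIdFor(0));   // attempt__0000_r_000001_0
        System.out.println(attemptIdFor(41));  // attempt__0000_r_000042_0
    }
}
```

Using a single "%06d" also covers task numbers that already have six digits, where no padding is needed.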
>> On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <[email protected]>
>> wrote:
>>
>>> Hi Flavio
>>>
>>> let me try to answer your last question on the user's list (to the best
>>> of my HBase knowledge).
>>> "I just wanted to know if and how region splitting is handled. Can you
>>> explain to me in detail how Flink and HBase work? What is not fully clear
>>> to me is when computation is done by region servers and when data starts
>>> to flow to a Flink worker (which in my test job is only my PC), and how to
>>> better understand the important logged info to tell whether my job is
>>> performing well"
>>>
>>> HBase partitions its tables into so-called "regions" of keys and stores
>>> the regions distributed in the cluster using HDFS. I think an HBase region
>>> can be thought of as an HDFS block. To make reading an HBase table
>>> efficient, region reads should be done locally, i.e., an InputFormat should
>>> primarily read regions that are stored on the same machine the IF is
>>> running on. Flink's InputSplits partition the HBase input by regions and
>>> add information about the storage location of each region. During
>>> execution, input splits are assigned to InputFormats that can do local
>>> reads.
>>>
>>> Best, Fabian
>>>
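The locality idea described above (one split per region, tagged with the host that stores it, handed preferentially to a reader on that host) can be modelled in a few lines. This is a simplified sketch under stated assumptions, not Flink's actual split assignment code: the RegionSplit type, the nextSplit method, and the host and region names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of locality-aware split assignment; not Flink or HBase classes. */
final class LocalSplitSketch {

    /** One split per region, tagged with the host that stores the region. */
    static final class RegionSplit {
        final String regionName;
        final String host;

        RegionSplit(String regionName, String host) {
            this.regionName = regionName;
            this.host = host;
        }
    }

    /**
     * Removes and returns a split stored on the requesting host if one remains,
     * otherwise the first remaining split (a remote read), or null if none are left.
     */
    static RegionSplit nextSplit(List<RegionSplit> remaining, String requestingHost) {
        for (int i = 0; i < remaining.size(); i++) {
            if (remaining.get(i).host.equals(requestingHost)) {
                return remaining.remove(i);
            }
        }
        return remaining.isEmpty() ? null : remaining.remove(0);
    }

    public static void main(String[] args) {
        List<RegionSplit> splits = new ArrayList<>();
        splits.add(new RegionSplit("region-a", "host1"));
        splits.add(new RegionSplit("region-b", "host2"));
        // A reader on host2 gets its local region first, then falls back to a remote one.
        System.out.println(nextSplit(splits, "host2").regionName);
        System.out.println(nextSplit(splits, "host2").regionName);
    }
}
```

In this model a job started on a single machine that hosts no regions always takes the fallback branch, so every region is read remotely.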
>>> 2014-11-03 11:13 GMT+01:00 Stephan Ewen <[email protected]>:
>>>
>>> > Hi!
>>> >
>>> > The way of passing parameters through the configuration is very old
>>> > (the original HBase format dates back to that time). I would simply make
>>> > the HBase format take those parameters through the constructor.
>>> >
>>> > Greetings,
>>> > Stephan
>>> >
>>> >
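The constructor-based approach suggested above could look roughly like the following. It is only an illustrative skeleton: the class name, field names, and the describe() method are hypothetical, and the two fields merely stand in for the values referred to elsewhere in the thread as pact.hbase.jtkey and pact.job.id.

```java
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical format skeleton that takes its settings through the constructor. */
final class ConstructorConfiguredFormat implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String jobTrackerKey; // stands in for "pact.hbase.jtkey"
    private final String jobId;         // stands in for "pact.job.id"

    ConstructorConfiguredFormat(String jobTrackerKey, String jobId) {
        // Fail at construction time instead of at a later configuration lookup.
        this.jobTrackerKey = Objects.requireNonNull(jobTrackerKey, "jobTrackerKey");
        this.jobId = Objects.requireNonNull(jobId, "jobId");
    }

    String describe() {
        return "jtKey=" + jobTrackerKey + ", jobId=" + jobId;
    }

    public static void main(String[] args) {
        System.out.println(new ConstructorConfiguredFormat("jt-key", "job-1").describe());
    }
}
```

Because the fields are final and the class is Serializable, the settings travel with the format object itself and need no string-keyed lookup at runtime.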
>>> > On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <
>>> [email protected]>
>>> > wrote:
>>> >
>>> > > The problem is that I also removed the GenericTableOutputFormat
>>> > > because there is an incompatibility between hadoop1 and hadoop2 for the
>>> > > classes TaskAttemptContext and TaskAttemptContextImpl.
>>> > > Also, it would be nice if the user didn't have to worry about passing
>>> > > the pact.hbase.jtkey and pact.job.id parameters.
>>> > > I think it is probably a good idea to remove hadoop1 compatibility,
>>> > > keep the HBase addon enabled only for hadoop2 (as before), and decide
>>> > > how to manage those 2 parameters.
>>> > >
>>> > > On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <[email protected]>
>>> wrote:
>>> > >
>>> > > > It is fine to remove it, in my opinion.
>>> > > >
>>> > > > On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <
>>> > > [email protected]>
>>> > > > wrote:
>>> > > >
>>> > > > > That is one class I removed because it was using the deprecated
>>> > > > > API GenericDataSink. I can restore it, but then it would be a good
>>> > > > > idea to remove those warnings (also because, from what I
>>> > > > > understood, the Record APIs are going to be removed).
>>> > > > >
>>> > > > > On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <
>>> [email protected]>
>>> > > > wrote:
>>> > > > >
>>> > > > > > I'm not familiar with the HBase connector code, but are you
>>> > > > > > maybe looking for the GenericTableOutputFormat?
>>> > > > > >
>>> > > > > > 2014-11-03 9:44 GMT+01:00 Flavio Pompermaier <
>>> [email protected]
>>> > >:
>>> > > > > >
>>> > > > > > > I was trying to modify the example by setting
>>> > > > > > > hbaseDs.output(new HBaseOutputFormat()); but I can't see any
>>> > > > > > > HBaseOutputFormat class. Maybe we should use another class?
>>> > > > > > >
>>> > > > > > > On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <
>>> > > > > [email protected]
>>> > > > > > >
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > Maybe that's something I could add to the HBase example and
>>> > > > > > > > that could be better documented in the Wiki.
>>> > > > > > > >
>>> > > > > > > > Since we're talking about the wiki: I was looking at the
>>> > > > > > > > Java API
>>> > > > > > > > (http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html)
>>> > > > > > > > and the link to the KMeans example is not working (where it
>>> > > > > > > > says "For a complete example program, have a look at KMeans
>>> > > > > > > > Algorithm").
>>> > > > > > > >
>>> > > > > > > > Best,
>>> > > > > > > > Flavio
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <
>>> > > > > > [email protected]
>>> > > > > > > >
>>> > > > > > > > wrote:
>>> > > > > > > >
>>> > > > > > > >> Ah ok, perfect! That was the reason why I removed it :)
>>> > > > > > > >>
>>> > > > > > > >> On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <
>>> > [email protected]>
>>> > > > > > wrote:
>>> > > > > > > >>
>>> > > > > > > >>> You do not really need a HBase data sink. You can call
>>> > > > > > > >>> "DataSet.output(new HBaseOutputFormat())"
>>> > > > > > > >>>
>>> > > > > > > >>> Stephan
>>> > > > > > > >>> Am 02.11.2014 23:05 schrieb "Flavio Pompermaier" <
>>> > > > > > [email protected]
>>> > > > > > > >:
>>> > > > > > > >>>
>>> > > > > > > >>> > Just one last thing: I removed the HbaseDataSink because
>>> > > > > > > >>> > I think it was using the old APIs. Can someone help me
>>> > > > > > > >>> > update that class?
>>> > > > > > > >>> >
>>> > > > > > > >>> > On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier <
>>> > > > > > > >>> [email protected]>
>>> > > > > > > >>> > wrote:
>>> > > > > > > >>> >
>>> > > > > > > >>> > > Indeed this time the build has been successful :)
>>> > > > > > > >>> > >
>>> > > > > > > >>> > > On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske <
>>> > > > > > [email protected]
>>> > > > > > > >
>>> > > > > > > >>> > wrote:
>>> > > > > > > >>> > >
>>> > > > > > > >>> > >> You can also set up Travis to build your own GitHub
>>> > > > > > > >>> > >> repositories by linking it to your GitHub account. That
>>> > > > > > > >>> > >> way Travis can build all your branches (and you can also
>>> > > > > > > >>> > >> trigger rebuilds if something fails).
>>> > > > > > > >>> > >> Not sure if we can manually retrigger builds on the
>>> > > > > > > >>> > >> Apache repository.
>>> > > > > > > >>> > >>
>>> > > > > > > >>> > >> Support for Hadoop 1 and 2 is indeed a very good
>>> > > > > > > >>> > >> addition :-)
>>> > > > > > > >>> > >>
>>> > > > > > > >>> > >> For the discussion about the PR itself, I would need a
>>> > > > > > > >>> > >> bit more time to become more familiar with HBase. I also
>>> > > > > > > >>> > >> do not have an HBase setup available here.
>>> > > > > > > >>> > >> Maybe somebody else in the community who was involved
>>> > > > > > > >>> > >> with a previous version of the HBase connector could
>>> > > > > > > >>> > >> comment on your question.
>>> > > > > > > >>> > >>
>>> > > > > > > >>> > >> Best, Fabian
>>> > > > > > > >>> > >>
>>> > > > > > > >>> > >> 2014-11-02 9:57 GMT+01:00 Flavio Pompermaier <
>>> > > > > > > [email protected]
>>> > > > > > > >>> >:
>>> > > > > > > >>> > >>
>>> > > > > > > >>> > >> > As suggested by Fabian, I moved the discussion to
>>> > > > > > > >>> > >> > this mailing list.
>>> > > > > > > >>> > >> >
>>> > > > > > > >>> > >> > I think that what is still to be discussed is how to
>>> > > > > > > >>> > >> > retrigger the build on Travis (I don't have an account)
>>> > > > > > > >>> > >> > and whether the PR can be integrated.
>>> > > > > > > >>> > >> >
>>> > > > > > > >>> > >> > Maybe what I can do is move the HBase example into the
>>> > > > > > > >>> > >> > test package (right now I left it in the main folder)
>>> > > > > > > >>> > >> > so that it forces Travis to rebuild.
>>> > > > > > > >>> > >> > I'll do it within a couple of hours.
>>> > > > > > > >>> > >> >
>>> > > > > > > >>> > >> > Another thing I forgot to say is that the HBase
>>> > > > > > > >>> > >> > extension is now compatible with both hadoop 1 and 2.
>>> > > > > > > >>> > >> >
>>> > > > > > > >>> > >> > Best,
>>> > > > > > > >>> > >> > Flavio
>>> > > > > > > >>> > >>
>>> > > > > > > >>> > >
>>> > > > > > > >>> >
>>> > > > > > > >>>
>>> > > > > > > >>
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>>
>
