We have now confirmed that instantiating HTable and Scan in the configure()
method of TableInputFormat causes problems in a distributed environment!
If you look at my implementation at
https://github.com/fpompermaier/incubator-flink/blob/master/flink-addons/flink-hbase/src/main/java/org/apache/flink/addons/hbase/TableInputFormat.java
you can see that Scan and HTable were made transient and are recreated within
configure(), but this causes HBaseConfiguration.create() to fail when searching
the classpath for its configuration files... could you help us understand why?
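
To make the discussion concrete, here is a minimal sketch of the pattern in
question (simplified, not the exact code from the linked class): Scan and
HTable are transient, so they are not shipped with the serialized job, and
configure() recreates them on each worker. HBaseConfiguration.create()
searches the worker's classpath for hbase-site.xml, which is where it fails
if the file is only present on the client:

import java.io.IOException;

import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;

public abstract class TransientTableSketch {

    // transient: not serialized with the job, rebuilt on every worker
    private transient HTable table;
    private transient Scan scan;

    protected abstract String getTableName();

    protected abstract Scan createScanner();

    public void configure(Configuration parameters) {
        // create() looks up hbase-site.xml on the *worker's* classpath
        org.apache.hadoop.conf.Configuration hConf = HBaseConfiguration.create();
        try {
            this.table = new HTable(hConf, getTableName());
            this.scan = createScanner();
        } catch (IOException e) {
            throw new RuntimeException("Could not create HTable in configure().", e);
        }
    }
}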

On Wed, Nov 12, 2014 at 8:10 PM, Flavio Pompermaier <pomperma...@okkam.it>
wrote:

> Usually, when I run a mapreduce job both on Spark and Hadoop, I just put the
> *-site.xml files into the war I submit to the cluster and that's it. I think
> the problem appeared when I made the HTable a private transient field and
> moved the table instantiation into the configure() method.
> Could that be a valid reason? We still have to do some deeper debugging, but
> I'm trying to figure out where to investigate..
> On Nov 12, 2014 8:03 PM, "Robert Metzger" <rmetz...@apache.org> wrote:
>
>> Hi,
>> Maybe it's an issue with the classpath? As far as I know, Hadoop reads the
>> configuration files from the classpath. Maybe the hbase-site.xml file is
>> not accessible through the classpath when running on the cluster?
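>>
>> A quick way to verify this (just a standalone sketch, not code from the
>> thread) is to ask the task's classloader for the resource that
>> HBaseConfiguration.create() would load:
>>
>> import java.net.URL;
>>
>> public class ClasspathCheck {
>>     public static void main(String[] args) {
>>         // If this prints null, hbase-site.xml is not visible to this
>>         // classloader and HBase falls back to its default configuration.
>>         URL res = Thread.currentThread().getContextClassLoader()
>>                 .getResource("hbase-site.xml");
>>         System.out.println("hbase-site.xml on classpath: " + res);
>>     }
>> }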
>>
>> On Wed, Nov 12, 2014 at 7:40 PM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>> > Today we tried to execute a job on the cluster instead of on the local
>> > executor, and we faced the problem that the hbase-site.xml was basically
>> > ignored. Is there a reason why the TableInputFormat works correctly in
>> > the local environment while it doesn't on a cluster?
>> > On Nov 10, 2014 10:56 AM, "Fabian Hueske" <fhue...@apache.org> wrote:
>> >
>> > > I don't think we need to bundle the HBase input and output format in a
>> > > single PR.
>> > > So, I think we can proceed with the IF only and target the OF later.
>> > > However, the fix for Kryo should be in the master before merging the PR.
>> > > Till is currently working on that and said he expects this to be done by
>> > > the end of the week.
>> > >
>> > > Cheers, Fabian
>> > >
>> > >
>> > > 2014-11-07 12:49 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>> > >
>> > > > I also fixed the profile for Cloudera CDH 5.1.3. You can build it
>> > > > with the command:
>> > > >
>> > > >     mvn clean install -Dmaven.test.skip=true -Dhadoop.profile=2 -Pvendor-repos,cdh5.1.3
>> > > >
>> > > > However, it would be good to generate the vendor-specific jar when
>> > > > releasing.. (e.g.
>> > > > flink-addons:flink-hbase:0.8.0-hadoop2-cdh5.1.3-incubating)
>> > > >
>> > > > Best,
>> > > > Flavio
>> > > >
>> > > > On Fri, Nov 7, 2014 at 12:44 PM, Flavio Pompermaier <pomperma...@okkam.it>
>> > > > wrote:
>> > > >
>> > > > > I've just updated the code on my fork (synced with the current
>> > > > > master and applied the improvements coming from the comments on the
>> > > > > related PR).
>> > > > > I still have to understand how to write results back to an HBase
>> > > > > Sink/OutputFormat...
>> > > > >
>> > > > >
>> > > > > On Mon, Nov 3, 2014 at 12:05 PM, Flavio Pompermaier <pomperma...@okkam.it>
>> > > > > wrote:
>> > > > >
>> > > > >> Thanks for the detailed answer. So if I run a job from my machine,
>> > > > >> I'll have to download all the scanned data of a table.. right?
>> > > > >>
>> > > > >> Still regarding the GenericTableOutputFormat, it is not clear to me
>> > > > >> how to proceed..
>> > > > >> I saw in the hadoop compatibility addon that it is possible to have
>> > > > >> such compatibility using the HadoopUtils class, so the open method
>> > > > >> should become something like:
>> > > > >>
>> > > > >> @Override
>> > > > >> public void open(int taskNumber, int numTasks) throws IOException {
>> > > > >>     if (Integer.toString(taskNumber + 1).length() > 6) {
>> > > > >>         throw new IOException("Task id too large.");
>> > > > >>     }
>> > > > >>     TaskAttemptID taskAttemptID = TaskAttemptID.forName("attempt__0000_r_"
>> > > > >>             + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
>> > > > >>             + Integer.toString(taskNumber + 1)
>> > > > >>             + "_0");
>> > > > >>     this.configuration.set("mapred.task.id", taskAttemptID.toString());
>> > > > >>     this.configuration.setInt("mapred.task.partition", taskNumber + 1);
>> > > > >>     // for hadoop 2.2
>> > > > >>     this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
>> > > > >>     this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);
>> > > > >>     try {
>> > > > >>         this.context = HadoopUtils.instantiateTaskAttemptContext(this.configuration, taskAttemptID);
>> > > > >>     } catch (Exception e) {
>> > > > >>         throw new RuntimeException(e);
>> > > > >>     }
>> > > > >>     final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
>> > > > >>     try {
>> > > > >>         this.writer = outFormat.getRecordWriter(this.context);
>> > > > >>     } catch (InterruptedException iex) {
>> > > > >>         throw new IOException("Opening the writer was interrupted.", iex);
>> > > > >>     }
>> > > > >> }
>> > > > >>
>> > > > >> But I'm not sure about how to pass the JobConf to the class, whether
>> > > > >> to merge the config files, where HFileOutputFormat2 writes the data,
>> > > > >> and how to implement the public void writeRecord(Record record) API.
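>> > > > >>
>> > > > >> Something like this rough sketch is what I imagine (the Record layout
>> > > > >> is made up: row key, family, qualifier and value as StringValue in
>> > > > >> fields 0-3; the real mapping is exactly what is unclear to me):
>> > > > >>
>> > > > >> @Override
>> > > > >> public void writeRecord(Record record) throws IOException {
>> > > > >>     byte[] row = Bytes.toBytes(record.getField(0, StringValue.class).getValue());
>> > > > >>     byte[] family = Bytes.toBytes(record.getField(1, StringValue.class).getValue());
>> > > > >>     byte[] qualifier = Bytes.toBytes(record.getField(2, StringValue.class).getValue());
>> > > > >>     byte[] value = Bytes.toBytes(record.getField(3, StringValue.class).getValue());
>> > > > >>     KeyValue kv = new KeyValue(row, family, qualifier, value);
>> > > > >>     try {
>> > > > >>         // the writer obtained from HFileOutputFormat2 in open()
>> > > > >>         this.writer.write(new ImmutableBytesWritable(row), kv);
>> > > > >>     } catch (InterruptedException iex) {
>> > > > >>         throw new IOException("Writing the record was interrupted.", iex);
>> > > > >>     }
>> > > > >> }
>> > > > >>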
>> > > > >> Could I have a little chat off the mailing list with the implementor
>> > > > >> of this extension?
>> > > > >>
>> > > > >> On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <fhue...@apache.org>
>> > > > >> wrote:
>> > > > >>
>> > > > >>> Hi Flavio,
>> > > > >>>
>> > > > >>> let me try to answer your last question from the user's list (to
>> > > > >>> the best of my HBase knowledge):
>> > > > >>> "I just wanted to know if and how region splitting is handled. Can
>> > > > >>> you explain to me in detail how Flink and HBase work together? What
>> > > > >>> is not fully clear to me is when computation is done by the region
>> > > > >>> servers and when data starts to flow to a Flink worker (which in my
>> > > > >>> test job is only my pc), and how to read the important logged info
>> > > > >>> to understand whether my job is performing well."
>> > > > >>>
>> > > > >>> HBase partitions its tables into so-called "regions" of keys and
>> > > > >>> stores the regions distributed across the cluster using HDFS. I
>> > > > >>> think an HBase region can be thought of as an HDFS block. To make
>> > > > >>> reading an HBase table efficient, regions should be read locally,
>> > > > >>> i.e., an InputFormat should primarily read regions that are stored
>> > > > >>> on the same machine it is running on.
>> > > > >>> Flink's InputSplits partition the HBase input by regions and add
>> > > > >>> information about the storage location of each region. During
>> > > > >>> execution, input splits are assigned to InputFormats that can do
>> > > > >>> local reads.
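>> > > > >>>
>> > > > >>> To illustrate the idea (a minimal sketch with simplified, assumed
>> > > > >>> signatures, not Flink's actual TableInputSplit code): one locatable
>> > > > >>> split per region, carrying the hostname of the region server, so
>> > > > >>> the scheduler can place the reading task on the same machine:
>> > > > >>>
>> > > > >>> import java.io.IOException;
>> > > > >>> import java.util.ArrayList;
>> > > > >>> import java.util.List;
>> > > > >>>
>> > > > >>> import org.apache.flink.core.io.LocatableInputSplit;
>> > > > >>> import org.apache.hadoop.hbase.HRegionLocation;
>> > > > >>> import org.apache.hadoop.hbase.client.HTable;
>> > > > >>> import org.apache.hadoop.hbase.util.Pair;
>> > > > >>>
>> > > > >>> public class RegionSplitSketch {
>> > > > >>>
>> > > > >>>     public static List<LocatableInputSplit> createSplits(HTable table)
>> > > > >>>             throws IOException {
>> > > > >>>         // One split per region: the region boundaries define the key
>> > > > >>>         // range, the region server's hostname is the preferred
>> > > > >>>         // location for a local read.
>> > > > >>>         Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
>> > > > >>>         List<LocatableInputSplit> splits = new ArrayList<LocatableInputSplit>();
>> > > > >>>         for (int i = 0; i < keys.getFirst().length; i++) {
>> > > > >>>             HRegionLocation location = table.getRegionLocation(keys.getFirst()[i]);
>> > > > >>>             splits.add(new LocatableInputSplit(i, new String[] { location.getHostname() }));
>> > > > >>>         }
>> > > > >>>         return splits;
>> > > > >>>     }
>> > > > >>> }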
>> > > > >>>
>> > > > >>> Best, Fabian
>> > > > >>>
>> > > > >>> 2014-11-03 11:13 GMT+01:00 Stephan Ewen <se...@apache.org>:
>> > > > >>>
>> > > > >>> > Hi!
>> > > > >>> >
>> > > > >>> > That way of passing parameters through the configuration is very
>> > > > >>> > old (the original HBase format dates back to that time). I would
>> > > > >>> > simply make the HBase format take those parameters through the
>> > > > >>> > constructor.
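>> > > > >>> >
>> > > > >>> > Roughly like this (a sketch with hypothetical names):
>> > > > >>> >
>> > > > >>> > import org.apache.hadoop.hbase.client.Scan;
>> > > > >>> >
>> > > > >>> > public class ParameterizedTableInputFormat {
>> > > > >>> >
>> > > > >>> >     private final String tableName;
>> > > > >>> >     // Scan is not serializable, so it would either be kept
>> > > > >>> >     // transient and rebuilt on the worker, or encoded as a String
>> > > > >>> >     private transient Scan scan;
>> > > > >>> >
>> > > > >>> >     public ParameterizedTableInputFormat(String tableName, Scan scan) {
>> > > > >>> >         this.tableName = tableName;
>> > > > >>> >         this.scan = scan;
>> > > > >>> >     }
>> > > > >>> > }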
>> > > > >>> >
>> > > > >>> > Greetings,
>> > > > >>> > Stephan
>> > > > >>> >
>> > > > >>> >
>> > > > >>> > On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <pomperma...@okkam.it>
>> > > > >>> > wrote:
>> > > > >>> >
>> > > > >>> > > The problem is that I also removed the GenericTableOutputFormat
>> > > > >>> > > because there is an incompatibility between hadoop1 and hadoop2
>> > > > >>> > > for the classes TaskAttemptContext and TaskAttemptContextImpl..
>> > > > >>> > > Then it would be nice if the user didn't have to worry about
>> > > > >>> > > passing the pact.hbase.jtkey and pact.job.id parameters..
>> > > > >>> > > I think it is probably a good idea to remove hadoop1
>> > > > >>> > > compatibility, keep the HBase addon enabled only for hadoop2 (as
>> > > > >>> > > before), and decide how to manage those 2 parameters..
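>> > > > >>> > >
>> > > > >>> > > For reference, the usual way around the TaskAttemptContext
>> > > > >>> > > incompatibility (a sketch; I assume it mirrors the reflection
>> > > > >>> > > approach of the hadoop-compatibility HadoopUtils) is to resolve
>> > > > >>> > > the concrete class at runtime, since TaskAttemptContext is a
>> > > > >>> > > class in hadoop1 but an interface in hadoop2:
>> > > > >>> > >
>> > > > >>> > > import org.apache.hadoop.conf.Configuration;
>> > > > >>> > > import org.apache.hadoop.mapreduce.TaskAttemptContext;
>> > > > >>> > > import org.apache.hadoop.mapreduce.TaskAttemptID;
>> > > > >>> > >
>> > > > >>> > > public class TaskAttemptContextFactory {
>> > > > >>> > >
>> > > > >>> > >     public static TaskAttemptContext create(Configuration conf,
>> > > > >>> > >             TaskAttemptID id) throws Exception {
>> > > > >>> > >         Class<?> clazz;
>> > > > >>> > >         try {
>> > > > >>> > >             // hadoop2: concrete implementation of the interface
>> > > > >>> > >             clazz = Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl");
>> > > > >>> > >         } catch (ClassNotFoundException e) {
>> > > > >>> > >             // hadoop1: TaskAttemptContext is itself instantiable
>> > > > >>> > >             clazz = Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext");
>> > > > >>> > >         }
>> > > > >>> > >         return (TaskAttemptContext) clazz
>> > > > >>> > >                 .getConstructor(Configuration.class, TaskAttemptID.class)
>> > > > >>> > >                 .newInstance(conf, id);
>> > > > >>> > >     }
>> > > > >>> > > }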
>> > > > >>> > >
>> > > > >>> > > On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <se...@apache.org>
>> > > > >>> > > wrote:
>> > > > >>> > >
>> > > > >>> > > > It is fine to remove it, in my opinion.
>> > > > >>> > > >
>> > > > >>> > > > On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <pomperma...@okkam.it>
>> > > > >>> > > > wrote:
>> > > > >>> > > >
>> > > > >>> > > > > That is one of the classes I removed because it was using
>> > > > >>> > > > > the deprecated GenericDataSink API.. I can restore it, but
>> > > > >>> > > > > then it would be a good idea to remove those warnings (also
>> > > > >>> > > > > because, from what I understood, the Record APIs are going
>> > > > >>> > > > > to be removed).
>> > > > >>> > > > >
>> > > > >>> > > > > On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <fhue...@apache.org>
>> > > > >>> > > > > wrote:
>> > > > >>> > > > >
>> > > > >>> > > > > > I'm not familiar with the HBase connector code, but are
>> > > > >>> > > > > > you maybe looking for the GenericTableOutputFormat?
>> > > > >>> > > > > >
>> > > > >>> > > > > > 2014-11-03 9:44 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>> > > > >>> > > > > >
>> > > > >>> > > > > > > I was trying to modify the example setting
>> > > > >>> > > > > > > hbaseDs.output(new HBaseOutputFormat()); but I can't see
>> > > > >>> > > > > > > any HBaseOutputFormat class.. maybe we should use
>> > > > >>> > > > > > > another class?
>> > > > >>> > > > > > >
>> > > > >>> > > > > > > On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <pomperma...@okkam.it>
>> > > > >>> > > > > > > wrote:
>> > > > >>> > > > > > >
>> > > > >>> > > > > > > > Maybe that's something I could add to the HBase
>> > > > >>> > > > > > > > example and that could be better documented in the
>> > > > >>> > > > > > > > wiki.
>> > > > >>> > > > > > > >
>> > > > >>> > > > > > > > Since we're talking about the wiki.. I was looking at
>> > > > >>> > > > > > > > the Java API (
>> > > > >>> > > > > > > > http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html
>> > > > >>> > > > > > > > ) and the link to the KMeans example is not working
>> > > > >>> > > > > > > > (where it says "For a complete example program, have a
>> > > > >>> > > > > > > > look at KMeans Algorithm").
>> > > > >>> > > > > > > >
>> > > > >>> > > > > > > > Best,
>> > > > >>> > > > > > > > Flavio
>> > > > >>> > > > > > > >
>> > > > >>> > > > > > > >
>> > > > >>> > > > > > > > On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <pomperma...@okkam.it>
>> > > > >>> > > > > > > > wrote:
>> > > > >>> > > > > > > >
>> > > > >>> > > > > > > >> Ah ok, perfect! That was the reason why I removed it :)
>> > > > >>> > > > > > > >>
>> > > > >>> > > > > > > >> On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <se...@apache.org>
>> > > > >>> > > > > > > >> wrote:
>> > > > >>> > > > > > > >>
>> > > > >>> > > > > > > >>> You do not really need an HBase data sink. You can
>> > > > >>> > > > > > > >>> call "DataSet.output(new HBaseOutputFormat())".
>> > > > >>> > > > > > > >>>
>> > > > >>> > > > > > > >>> Stephan
>> > > > >>> > > > > > > >>> On 02.11.2014 at 23:05, "Flavio Pompermaier" <pomperma...@okkam.it>
>> > > > >>> > > > > > > >>> wrote:
>> > > > >>> > > > > > > >>>
>> > > > >>> > > > > > > >>> > Just one last thing.. I removed the HbaseDataSink
>> > > > >>> > > > > > > >>> > because I think it was using the old APIs.. can
>> > > > >>> > > > > > > >>> > someone help me in updating that class?
>> > > > >>> > > > > > > >>> >
>> > > > >>> > > > > > > >>> > On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier <pomperma...@okkam.it>
>> > > > >>> > > > > > > >>> > wrote:
>> > > > >>> > > > > > > >>> >
>> > > > >>> > > > > > > >>> > > Indeed, this time the build has been successful :)
>> > > > >>> > > > > > > >>> > >
>> > > > >>> > > > > > > >>> > > On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske <fhue...@apache.org>
>> > > > >>> > > > > > > >>> > > wrote:
>> > > > >>> > > > > > > >>> > >
>> > > > >>> > > > > > > >>> > >> You can also set up Travis to build your own
>> > > > >>> > > > > > > >>> > >> GitHub repositories by linking it to your
>> > > > >>> > > > > > > >>> > >> GitHub account. That way Travis can build all
>> > > > >>> > > > > > > >>> > >> your branches (and you can also trigger
>> > > > >>> > > > > > > >>> > >> rebuilds if something fails).
>> > > > >>> > > > > > > >>> > >> I'm not sure if we can manually retrigger
>> > > > >>> > > > > > > >>> > >> builds on the Apache repository.
>> > > > >>> > > > > > > >>> > >>
>> > > > >>> > > > > > > >>> > >> Support for Hadoop 1 and 2 is indeed a very
>> > > > >>> > > > > > > >>> > >> good addition :-)
>> > > > >>> > > > > > > >>> > >>
>> > > > >>> > > > > > > >>> > >> For the discussion about the PR itself, I
>> > > > >>> > > > > > > >>> > >> would need a bit more time to become more
>> > > > >>> > > > > > > >>> > >> familiar with HBase. I also do not have an
>> > > > >>> > > > > > > >>> > >> HBase setup available here.
>> > > > >>> > > > > > > >>> > >> Maybe somebody else from the community who was
>> > > > >>> > > > > > > >>> > >> involved with a previous version of the HBase
>> > > > >>> > > > > > > >>> > >> connector could comment on your question.
>> > > > >>> > > > > > > >>> > >>
>> > > > >>> > > > > > > >>> > >> Best, Fabian
>> > > > >>> > > > > > > >>> > >>
>> > > > >>> > > > > > > >>> > >> 2014-11-02 9:57 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>> > > > >>> > > > > > > >>> > >>
>> > > > >>> > > > > > > >>> > >> > As suggested by Fabian, I moved the
>> > > > >>> > > > > > > >>> > >> > discussion to this mailing list.
>> > > > >>> > > > > > > >>> > >> >
>> > > > >>> > > > > > > >>> > >> > I think that what is still to be discussed
>> > > > >>> > > > > > > >>> > >> > is how to retrigger the build on Travis (I
>> > > > >>> > > > > > > >>> > >> > don't have an account) and whether the PR
>> > > > >>> > > > > > > >>> > >> > can be integrated.
>> > > > >>> > > > > > > >>> > >> >
>> > > > >>> > > > > > > >>> > >> > Maybe what I can do is move the HBase
>> > > > >>> > > > > > > >>> > >> > example into the test package (right now I
>> > > > >>> > > > > > > >>> > >> > left it in the main folder), so it will
>> > > > >>> > > > > > > >>> > >> > force Travis to rebuild.
>> > > > >>> > > > > > > >>> > >> > I'll do it within a couple of hours.
>> > > > >>> > > > > > > >>> > >> >
>> > > > >>> > > > > > > >>> > >> > Another thing I forgot to say is that the
>> > > > >>> > > > > > > >>> > >> > hbase extension is now compatible with both
>> > > > >>> > > > > > > >>> > >> > hadoop 1 and 2.
>> > > > >>> > > > > > > >>> > >> >
>> > > > >>> > > > > > > >>> > >> > Best,
>> > > > >>> > > > > > > >>> > >> > Flavio
