Have you added the hbase.jar file with your HBase config to the ./lib folders of your Flink setup (JobManager, TaskManager) or is it bundled with your job.jar file?
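A quick way to see why moving the HTable/Scan instantiation into configure() changes behavior on a cluster: Flink serializes the input format on the client and ships it to the TaskManagers, so transient fields arrive as null and must be rebuilt on the worker, where HBaseConfiguration.create() then needs hbase-site.xml on the worker's classpath. A minimal plain-Java sketch of that life cycle (the class and method names here are made up for illustration; this is not Flink code):

```java
import java.io.*;

// Minimal sketch of an input format with a transient handle (a stand-in
// for HTable). After serialization ("shipping" to a TaskManager) the
// transient field is null and must be recreated in configure() -- which
// is where a missing hbase-site.xml on the worker's classpath bites.
public class TransientFieldDemo implements Serializable {
    private final String tableName;       // serialized, survives shipping
    private transient Object tableHandle; // stand-in for HTable; lost on shipping

    public TransientFieldDemo(String tableName) {
        this.tableName = tableName;
        this.tableHandle = new Object();  // created on the client
    }

    // What configure() must do on the worker side: rebuild the handle.
    public void configure() {
        if (tableHandle == null) {
            // In the real format this is where HBaseConfiguration.create()
            // would run and look for hbase-site.xml on the classpath.
            tableHandle = new Object();
        }
    }

    public boolean hasHandle() {
        return tableHandle != null;
    }

    // Simulate shipping the format to a worker via Java serialization.
    static TransientFieldDemo roundTrip(TransientFieldDemo f) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(f);
        oos.flush();
        ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        return (TransientFieldDemo) ois.readObject();
    }

    public static void main(String[] args) throws Exception {
        TransientFieldDemo onClient = new TransientFieldDemo("testTable");
        TransientFieldDemo onWorker = roundTrip(onClient); // ship to a worker
        System.out.println(onWorker.hasHandle()); // false: handle was transient
        onWorker.configure();
        System.out.println(onWorker.hasHandle()); // true: rebuilt on the worker
    }
}
```

The same reasoning explains why putting *-site.xml files into the submitted jar works for a plain MapReduce job but not when the configuration lookup happens lazily inside configure() on a worker whose classpath lacks the file.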
--
Fabian Hueske
Phone: +49 170 5549438
Email: fhue...@gmail.com
Web: http://www.user.tu-berlin.de/fabian.hueske

From: Flavio Pompermaier
Sent: Thursday, 13 November 2014 18:36
To: dev@flink.incubator.apache.org

Any help with this? :(

On Thu, Nov 13, 2014 at 2:06 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

We definitely discovered that instantiating HTable and Scan in the configure() method of TableInputFormat causes problems in a distributed environment! If you look at my implementation at
https://github.com/fpompermaier/incubator-flink/blob/master/flink-addons/flink-hbase/src/main/java/org/apache/flink/addons/hbase/TableInputFormat.java
you can see that Scan and HTable were made transient and are recreated within configure(), but this causes HBaseConfiguration.create() to fail when searching for classpath files... could you help us understand why?

On Wed, Nov 12, 2014 at 8:10 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Usually, when I run a mapreduce job on either Spark or Hadoop, I just put the *-site.xml files into the war I submit to the cluster and that's it. I think the problem appeared when I made the HTable a private transient field and moved the table instantiation into the configure method. Could that be a valid reason? We still have to do a deeper debug, but I'm trying to figure out where to investigate.

On Nov 12, 2014 8:03 PM, "Robert Metzger" <rmetz...@apache.org> wrote:

Hi,
Maybe it's an issue with the classpath? As far as I know, Hadoop reads the configuration files from the classpath. Maybe the hbase-site.xml file is not accessible through the classpath when running on the cluster?

On Wed, Nov 12, 2014 at 7:40 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Today we tried to execute a job on the cluster instead of on the local executor, and we found that the hbase-site.xml was basically ignored. Is there a reason why the TableInputFormat works correctly in the local environment but not on a cluster?

On Nov 10, 2014 10:56 AM, "Fabian Hueske" <fhue...@apache.org> wrote:

I don't think we need to bundle the HBase input and output format in a single PR. So, I think we can proceed with the IF only and target the OF later. However, the fix for Kryo should be in the master before merging the PR. Till is currently working on that and said he expects this to be done by the end of the week.

Cheers, Fabian

2014-11-07 12:49 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:

I also fixed the profile for Cloudera CDH 5.1.3. You can build it with the command:

    mvn clean install -Dmaven.test.skip=true -Dhadoop.profile=2 -Pvendor-repos,cdh5.1.3

However, it would be good to generate the vendor-specific jar when releasing (e.g. flink-addons:flink-hbase:0.8.0-hadoop2-cdh5.1.3-incubating).

Best,
Flavio

On Fri, Nov 7, 2014 at 12:44 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

I've just updated the code on my fork (synced with the current master and applied improvements coming from comments on the related PR). I still have to understand how to write results back to an HBase Sink/OutputFormat...

On Mon, Nov 3, 2014 at 12:05 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Thanks for the detailed answer. So if I run a job from my machine, I'll have to download all the scanned data in a table, right?
Regarding the GenericTableOutputFormat, it is still not clear to me how to proceed. I saw in the hadoop compatibility addon that it is possible to have such compatibility using the HadoopUtils class, so the open method should become something like:

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        if (Integer.toString(taskNumber + 1).length() > 6) {
            throw new IOException("Task id too large.");
        }
        // builds a zero-padded attempt id such as "attempt__0000_r_000001_0"
        TaskAttemptID taskAttemptID = TaskAttemptID.forName("attempt__0000_r_"
            + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
            + Integer.toString(taskNumber + 1)
            + "_0");
        this.configuration.set("mapred.task.id", taskAttemptID.toString());
        this.configuration.setInt("mapred.task.partition", taskNumber + 1);
        // for hadoop 2.2
        this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
        this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);
        try {
            this.context = HadoopUtils.instantiateTaskAttemptContext(this.configuration, taskAttemptID);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
        try {
            this.writer = outFormat.getRecordWriter(this.context);
        } catch (InterruptedException iex) {
            throw new IOException("Opening the writer was interrupted.", iex);
        }
    }

But I'm not sure how to pass the JobConf to the class, whether to merge config files, where HFileOutputFormat2 writes the data, and how to implement the public void writeRecord(Record record) API. Could I have a little chat off the mailing list with the implementor of this extension?

On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <fhue...@apache.org> wrote:

Hi Flavio,

let me try to answer your last question from the user's list (to the best of my HBase knowledge):

"I just wanted to know if and how region splitting is handled. Can you explain to me in detail how Flink and HBase work together? What is not fully clear to me is when computation is done by the region servers and when data starts to flow to a Flink worker (which in my test job is only my PC), and how to understand the important logged info to see whether my job is performing well."

HBase partitions its tables into so-called "regions" of keys and stores the regions distributed across the cluster using HDFS. I think an HBase region can be thought of as an HDFS block. To make reading an HBase table efficient, region reads should be done locally, i.e., an InputFormat should primarily read regions that are stored on the same machine it is running on. Flink's InputSplits partition the HBase input by regions and add information about the storage location of each region. During execution, input splits are assigned to InputFormats that can do local reads.
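The locality-aware split assignment described above can be sketched in a few lines of plain Java (an illustration only; the names RegionSplit and nextSplit are made up, and Flink's actual split assignment is more involved):

```java
import java.util.*;

// Hypothetical sketch of locality-aware input split assignment: each split
// carries the hostnames storing its HBase region, and a worker asking for
// work preferentially receives a split whose region is stored locally.
public class SplitAssignmentSketch {

    // Stand-in for a locatable input split (one HBase region).
    static final class RegionSplit {
        final String regionName;
        final Set<String> hosts;
        RegionSplit(String regionName, Set<String> hosts) {
            this.regionName = regionName;
            this.hosts = hosts;
        }
    }

    private final List<RegionSplit> unassigned;

    public SplitAssignmentSketch(List<RegionSplit> splits) {
        this.unassigned = new ArrayList<>(splits);
    }

    // Hand out a split for the given worker host: a local one if possible,
    // otherwise any remaining split (which means a remote read).
    public Optional<RegionSplit> nextSplit(String workerHost) {
        for (Iterator<RegionSplit> it = unassigned.iterator(); it.hasNext(); ) {
            RegionSplit s = it.next();
            if (s.hosts.contains(workerHost)) {
                it.remove();
                return Optional.of(s); // local read
            }
        }
        if (unassigned.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(unassigned.remove(0)); // no local split: remote read
    }

    public static void main(String[] args) {
        SplitAssignmentSketch assigner = new SplitAssignmentSketch(List.of(
                new RegionSplit("region-A", Set.of("host1", "host2")),
                new RegionSplit("region-B", Set.of("host3"))));
        // host3 first gets its local region, then falls back to a remote one.
        System.out.println(assigner.nextSplit("host3").get().regionName); // region-B
        System.out.println(assigner.nextSplit("host3").get().regionName); // region-A
    }
}
```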
Best, Fabian

2014-11-03 11:13 GMT+01:00 Stephan Ewen <se...@apache.org>:

Hi!

The way of passing parameters through the configuration is very old (the original HBase format dates back to that time). I would simply make the HBase format take those parameters through the constructor.

Greetings,
Stephan

On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

The problem is that I also removed the GenericTableOutputFormat because there is an incompatibility between hadoop1 and hadoop2 for the classes TaskAttemptContext and TaskAttemptContextImpl. It would also be nice if the user didn't have to worry about passing the pact.hbase.jtkey and pact.job.id parameters. I think it is probably a good idea to drop hadoop1 compatibility, keep the HBase addon enabled only for hadoop2 (as before), and decide how to manage those two parameters.

On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <se...@apache.org> wrote:

It is fine to remove it, in my opinion.
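Stephan's suggestion of passing parameters through the constructor instead of through configuration keys such as pact.hbase.jtkey could look roughly like this (the class name and fields are hypothetical, not Flink's actual HBase format):

```java
// Hypothetical sketch of constructor-based parameter passing, as opposed
// to stringly-typed configuration keys. Parameters arrive typed and are
// validated once at construction time, before the job is even submitted.
public class ConstructorConfiguredFormat {
    private final String tableName;
    private final byte[] startRow;
    private final byte[] stopRow;

    public ConstructorConfiguredFormat(String tableName, byte[] startRow, byte[] stopRow) {
        if (tableName == null || tableName.isEmpty()) {
            // Fails fast on the client instead of deep inside a worker.
            throw new IllegalArgumentException("table name must be set");
        }
        this.tableName = tableName;
        this.startRow = startRow;
        this.stopRow = stopRow;
    }

    public String getTableName() {
        return tableName;
    }

    public static void main(String[] args) {
        ConstructorConfiguredFormat format =
                new ConstructorConfiguredFormat("testTable", null, null);
        System.out.println(format.getTableName()); // testTable
    }
}
```

With configuration-key passing, a typo in a key string only surfaces at runtime on the cluster; with constructor injection the compiler and the constructor check catch the mistake on the client.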
On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

That is one class I removed because it was using the deprecated GenericDataSink API. I can restore it, but then it would be a good idea to remove those warnings (also because, from what I understood, the Record APIs are going to be removed).

On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <fhue...@apache.org> wrote:

I'm not familiar with the HBase connector code, but are you maybe looking for the GenericTableOutputFormat?

2014-11-03 9:44 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:

I was trying to modify the example by setting hbaseDs.output(new HBaseOutputFormat()); but I can't see any HBaseOutputFormat class... maybe we should use another class?

On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Maybe that's something I could add to the HBase example, and it could be better documented in the Wiki.

Since we're talking about the wiki: I was looking at the Java API (http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html) and the link to the KMeans example is not working (where it says "For a complete example program, have a look at KMeans Algorithm").

Best,
Flavio

On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Ah ok, perfect! That was the reason why I removed it :)

On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <se...@apache.org> wrote:

You do not really need an HBase data sink. You can call "DataSet.output(new HBaseOutputFormat())".

Stephan

On 02.11.2014 23:05, "Flavio Pompermaier" <pomperma...@okkam.it> wrote:

Just one last thing: I removed the HbaseDataSink because I think it was using the old APIs. Can someone help me update that class?

On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

Indeed, this time the build has been successful :)

On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske <fhue...@apache.org> wrote:

You can also set up Travis to build your own GitHub repositories by linking it to your GitHub account. That way Travis can build all your branches (and you can also trigger rebuilds if something fails). I'm not sure if we can manually retrigger builds on the Apache repository.

Support for Hadoop 1 and 2 is indeed a very good addition :-)

For the discussion about the PR itself, I would need a bit more time to become more familiar with HBase. I also do not have an HBase setup available here. Maybe somebody else in the community who was involved with a previous version of the HBase connector could comment on your question.

Best, Fabian

2014-11-02 9:57 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:

As suggested by Fabian, I moved the discussion to this mailing list.

I think what still needs to be discussed is how to retrigger the build on Travis (I don't have an account) and whether the PR can be integrated.

Maybe what I can do is move the HBase example into the test package (right now I left it in the main folder) so it will force Travis to rebuild. I'll do it within a couple of hours.

Another thing I forgot to say is that the hbase extension is now compatible with both hadoop 1 and 2.

Best,
Flavio