Re: Re: Re: Re: What way to improve MTTR other than DLR(distributed log replay)

2016-10-21 Thread Enis Söztutar
A bit late, but let me give my perspective. This can also be moved to jira
or dev@ I think.

DLR was a nice feature and had pretty good gains for MTTR. However, dealing with
the sequence ids, onlining regions, etc., and the replay paths proved too
difficult in practice. I think the way forward would be to not bring DLR back,
but to actually fix the long-standing log split problems.

The main gain in DLR is that we do not create lots and lots of tiny files, but
instead rely on the regular region flushes to flush bigger files. This also
helps with handling requests coming from different log files, etc. The only
gain that I can think of that you get with DLR but not with log split is that
writes are enabled online while the recovery is going on. However, I think it
is not worth keeping DLR just for this feature.

Now, what are the problems with log split, you ask? The problems are:
  - we create a lot of tiny files
  - these tiny files are replayed sequentially when the region is assigned
  - the region has to replay and flush all the data coming from all these tiny
files sequentially

In terms of IO, we pay the cost of reading the original WAL files and writing
the same amount of data into many small files, where the NN overhead is huge.
Then, for every region, we serially sort the data by re-reading the tiny WAL
files (recovered edits), sorting them in memory, and flushing the data. This
means we do two times the reads and writes that we would otherwise need.

The way to solve our log split bottlenecks is to re-read the Bigtable paper
and implement WAL recovery as described there.
 - Implement an HFile format that can contain data from multiple regions.
Something like a concatenated HFile format where each region has its own
section, with its own sequence id, etc.
 - Implement links to these files, where a link can refer to this data. This
is very similar to our ReferenceFile concept.
 - In each log splitter task, instead of generating tiny WAL files as
recovered edits, buffer the edits up in memory and do a sort per region (the
same sort as inserting into the memstore). A WAL is ~100 MB on average, so it
should not be a big problem to buffer this up. At the end of the WAL split
task, write an hfile containing data from all the regions as described above.
Also do a multi NN request to create links in the regions that refer to these
files (not sure whether the NN has a batch RPC call or not).
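
To make the split-task idea concrete, here is a minimal, hypothetical sketch.
The types MultiRegionHFileWriter and LinkCreator are invented names for
illustration only (nothing like them exists in HBase today), and edits are
reduced to one value per row key for brevity:

import java.io.IOException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Invented placeholder types -- not existing HBase APIs.
interface MultiRegionHFileWriter {
  void startRegionSection(String encodedRegionName, long maxSequenceId) throws IOException;
  void append(String rowKey, byte[] value) throws IOException;
  void closeRegionSection() throws IOException;
  String finishAndGetPath() throws IOException;
}

interface LinkCreator {
  // Ideally a single batched NN call that drops a reference into every region dir.
  void createLinks(Set<String> encodedRegionNames, String hfilePath) throws IOException;
}

public class WalSplitTaskSketch {
  // One sorted buffer per region; the sort mirrors inserting into the memstore.
  private final Map<String, TreeMap<String, byte[]>> buffers = new TreeMap<>();
  private final Map<String, Long> maxSeqIds = new TreeMap<>();

  /** Called for every edit read from the ~100 MB WAL being split. */
  public void bufferEdit(String encodedRegionName, long seqId, String rowKey, byte[] value) {
    buffers.computeIfAbsent(encodedRegionName, r -> new TreeMap<String, byte[]>()).put(rowKey, value);
    maxSeqIds.merge(encodedRegionName, seqId, Math::max);
  }

  /** End of the split task: one multi-region hfile plus one link per region. */
  public void finish(MultiRegionHFileWriter writer, LinkCreator linker) throws IOException {
    for (Map.Entry<String, TreeMap<String, byte[]>> region : buffers.entrySet()) {
      writer.startRegionSection(region.getKey(), maxSeqIds.get(region.getKey()));
      for (Map.Entry<String, byte[]> cell : region.getValue().entrySet()) {
        writer.append(cell.getKey(), cell.getValue());
      }
      writer.closeRegionSection();
    }
    String hfilePath = writer.finishAndGetPath();
    linker.createLinks(buffers.keySet(), hfilePath);
  }
}

The point is only that the per-region sort happens inside the split task, so
the region open path has nothing left to replay.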

The reason this will be on par with or better than DLR is that we are only
doing 1 read and 1 write, and the sort is parallelized. The region opening
does not have to block on replaying anything or waiting for a flush, because
the data is already sorted and in HFile format. These hfiles will be used the
normal way by adding them to the KVHeaps, etc. When compactions run, we will
remove the links to these files using the regular mechanisms.

Enis

On Tue, Oct 18, 2016 at 6:58 PM, Ted Yu  wrote:

> Allan:
> One factor to consider is that the assignment manager in hbase 2.0 would be
> quite different from those in 0.98 and 1.x branches.
>
> Meaning, you may need to come up with two solutions for a single problem.
>
> FYI
>
> On Tue, Oct 18, 2016 at 6:11 PM, Allan Yang  wrote:
>
> > Hi, Ted
> > These issues I mentioned above(HBASE-13567, HBASE-12743, HBASE-13535,
> > HBASE-14729) are ALL reproduced in our HBase1.x test environment. Fixing
> > them is exactly what I'm going to do. I haven't found the root cause yet,
> > but  I will update if I find solutions.
> >  what I afraid is that, there are other issues I don't know yet. So if
> you
> > or other guys know other issues related to DLR, please let me know
> >
> >
> > Regards
> > Allan Yang
> >
> >
> >
> >
> >
> >
> >
> > At 2016-10-19 00:19:06, "Ted Yu"  wrote:
> > >Allan:
> > >I wonder how you deal with open issues such as HBASE-13535.
> > >From your description, it seems your team fixed more DLR issues.
> > >
> > >Cheers
> > >
> > >On Mon, Oct 17, 2016 at 11:37 PM, allanwin  wrote:
> > >
> > >>
> > >>
> > >>
> > >> Here is the thing. We have backported DLR(HBASE-7006) to our 0.94
> > >> clusters  in production environment(of course a lot of bugs are fixed
> > and
> > >> it is working well). It is was proven to be a huge gain. When a large
> > >> cluster crash down, the MTTR improved from several hours to less than
> a
> > >> hour. Now, we want to move on to HBase1.x, and still we want DLR. This
> > >> time, we don't want to backport the 'backported' DLR to HBase1.x, but
> it
> > >> seems like that the community have determined to remove DLR...
> > >>
> > >>
> > >> The DLR feature is proven useful in our production environment, so I
> > think
> > >> I will try to fix its issues in branch-1.x
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> At 2016-10-18 13:47:17, "Anoop John"  wrote:
> > >> >Agree with ur observation.. But DLR feature we wanted to get
> removed..
> > >> >Because it 

Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

2016-10-21 Thread Mich Talebzadeh
Hi Demai,

As I understand it, you want to use HBase as the real-time layer and the Hive
data warehouse as the batch layer for analytics.

In other words, ingest data in real time from the source into HBase and push
that data into Hive on a recurring basis.

If you partition your target ORC table by DtStamp and INSERT/OVERWRITE into
this table using Spark as the execution engine for Hive (as opposed to
map-reduce), it should be pretty fast.

Hive is going to get an in-memory database in the next release or so, so it is
a perfect choice.
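
If it helps, here is a rough sketch of driving that recurring load from Java
over Hive JDBC. The HiveServer2 URL and the table names (borrowed from the DDL
further down this thread) are assumptions, not a prescribed setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.time.LocalDate;

public class DailyOrcLoad {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String today = LocalDate.now().toString();  // becomes the DateStamp partition value
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      stmt.execute("SET hive.execution.engine=spark");  // Spark instead of map-reduce
      stmt.execute("INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = '" + today + "') "
          + "SELECT key, ticker, timecreated, price FROM marketDataHbase");
    }
  }
}

Scheduling this class from cron (or Oozie) once a day gives you the recurring
batch layer on top of the real-time HBase writes.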


HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 22:28, Demai Ni  wrote:

> Mich,
>
> thanks for the detail instructions.
>
> While aware of the Hive method, I have a few questions/concerns:
> 1) the Hive method is a "INSERT FROM SELECT " ,which usually not perform as
> good as a bulk load though I am not familiar with the real implementation
> 2) I have another SQL-on-Hadoop engine working well with ORC file. So if
> possible, I'd like to avoid the system dependency on Hive(one fewer
> component to maintain).
> 3) HBase has well running back-end process for Replication(HBASE-1295) or
> Backup(HBASE-7912), so  wondering anything can be piggy-back on it to deal
> with day-to-day works
>
> The goal is to have HBase as a OLTP front(to receive data), and the ORC
> file(with a SQL engine) as the OLAP end for reporting/analytic. the ORC
> file will also serve as my backup in the case for DR.
>
> Demai
>
>
> On Fri, Oct 21, 2016 at 1:57 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> wrote:
>
> > Create an external table in Hive on Hbase atble. Pretty straight forward.
> >
> > hive>  create external table marketDataHbase (key STRING, ticker STRING,
> > timecreated STRING, price STRING)
> >
> > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'WITH
> > SERDEPROPERTIES ("hbase.columns.mapping" =
> > ":key,price_info:ticker,price_info:timecreated, price_info:price")
> >
> > TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");
> >
> >
> >
> > then create a normal table in hive as ORC
> >
> >
> > CREATE TABLE IF NOT EXISTS marketData (
> >  KEY string
> >, TICKER string
> >, TIMECREATED string
> >, PRICE float
> > )
> > PARTITIONED BY (DateStamp  string)
> > STORED AS ORC
> > TBLPROPERTIES (
> > "orc.create.index"="true",
> > "orc.bloom.filter.columns"="KEY",
> > "orc.bloom.filter.fpp"="0.05",
> > "orc.compress"="SNAPPY",
> > "orc.stripe.size"="16777216",
> > "orc.row.index.stride"="1" )
> > ;
> > --show create table marketData;
> > --Populate target table
> > INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
> > SELECT
> >   KEY
> > , TICKER
> > , TIMECREATED
> > , PRICE
> > FROM MarketDataHbase
> >
> >
> > Run this job as a cron every often
> >
> >
> > HTH
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >  > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 21 October 2016 at 21:48, Demai Ni  wrote:
> >
> > > hi,
> > >
> > > I am wondering whether there are existing methods to ETL HBase data to
> > > ORC(or other open source columnar) file?
> > >
> > > I understand in Hive "insert into Hive_ORC_Table from SELET * from
> > > HIVE_HBase_Table", can probably get the job done. Is this the common
> way
> > to
> > > do so? Performance is acceptable and able to handle the delta update in
> > the
> > > case HBase table changed?
> > >
> > > I did a bit google, and find this
> > > https://community.hortonworks.com/questions/2632/loading-
> > > hbase-from-hive-orc-tables.html
> > >
> > > which is another way around.
> > >
> > > Will it perform better(comparing to above Hive stmt) if using either
> > > replication logic or snapshot backup to generate ORC file from hbase
> > tables
> > > and with incremental update ability?
> > >
> > > I hope to has as fewer dependency as possible. in the Example of ORC,
> > 

Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

2016-10-21 Thread Demai Ni
Mich,

thanks for the detail instructions.

While aware of the Hive method, I have a few questions/concerns:
1) the Hive method is an "INSERT FROM SELECT", which usually does not perform
as well as a bulk load, though I am not familiar with the actual implementation
2) I have another SQL-on-Hadoop engine working well with ORC files. So if
possible, I'd like to avoid the system dependency on Hive (one fewer
component to maintain).
3) HBase has well-running back-end processes for Replication (HBASE-1295) and
Backup (HBASE-7912), so I am wondering whether anything can be piggy-backed on
them to handle the day-to-day work

The goal is to have HBase as an OLTP front end (to receive data) and the ORC
file (with a SQL engine) as the OLAP end for reporting/analytics. The ORC file
will also serve as my backup in the DR case.

Demai


On Fri, Oct 21, 2016 at 1:57 PM, Mich Talebzadeh 
wrote:

> Create an external table in Hive on Hbase atble. Pretty straight forward.
>
> hive>  create external table marketDataHbase (key STRING, ticker STRING,
> timecreated STRING, price STRING)
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'WITH
> SERDEPROPERTIES ("hbase.columns.mapping" =
> ":key,price_info:ticker,price_info:timecreated, price_info:price")
>
> TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");
>
>
>
> then create a normal table in hive as ORC
>
>
> CREATE TABLE IF NOT EXISTS marketData (
>  KEY string
>, TICKER string
>, TIMECREATED string
>, PRICE float
> )
> PARTITIONED BY (DateStamp  string)
> STORED AS ORC
> TBLPROPERTIES (
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="KEY",
> "orc.bloom.filter.fpp"="0.05",
> "orc.compress"="SNAPPY",
> "orc.stripe.size"="16777216",
> "orc.row.index.stride"="1" )
> ;
> --show create table marketData;
> --Populate target table
> INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
> SELECT
>   KEY
> , TICKER
> , TIMECREATED
> , PRICE
> FROM MarketDataHbase
>
>
> Run this job as a cron every often
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  OABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 October 2016 at 21:48, Demai Ni  wrote:
>
> > hi,
> >
> > I am wondering whether there are existing methods to ETL HBase data to
> > ORC(or other open source columnar) file?
> >
> > I understand in Hive "insert into Hive_ORC_Table from SELET * from
> > HIVE_HBase_Table", can probably get the job done. Is this the common way
> to
> > do so? Performance is acceptable and able to handle the delta update in
> the
> > case HBase table changed?
> >
> > I did a bit google, and find this
> > https://community.hortonworks.com/questions/2632/loading-
> > hbase-from-hive-orc-tables.html
> >
> > which is another way around.
> >
> > Will it perform better(comparing to above Hive stmt) if using either
> > replication logic or snapshot backup to generate ORC file from hbase
> tables
> > and with incremental update ability?
> >
> > I hope to has as fewer dependency as possible. in the Example of ORC,
> will
> > only depend on Apache ORC's API, and not depend on Hive
> >
> > Demai
> >
>


Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
I was asked an interesting question.

Can one update data in HBase? My answer was that it is append only.

Can one update data in Hive? My answer was yes, if the table is created as ORC
and the table properties are set with "transactional"="true":


STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"transactional"="true",
"orc.create.index"="true",
"orc.bloom.filter.columns"="object_id",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1" )




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 22:01, Ted Yu  wrote:

> It is true in the sense that hfile, once written (and closed), becomes
> immutable.
>
> Compaction would remove obsolete content and generate new hfiles.
>
> Cheers
>
> On Fri, Oct 21, 2016 at 1:59 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> wrote:
>
> > BTW. I always understood that Hbase is append only. is that generally
> true?
> >
> > thx
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >  > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 21 October 2016 at 21:57, Mich Talebzadeh 
> > wrote:
> >
> > > agreed much like any rdbms
> > >
> > >
> > >
> > > Dr Mich Talebzadeh
> > >
> > >
> > >
> > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >  AAEWh2gBxianrbJd6zP6AcPCCd
> > OABUrV8Pw>*
> > >
> > >
> > >
> > > http://talebzadehmich.wordpress.com
> > >
> > >
> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
> > > loss, damage or destruction of data or any other property which may
> arise
> > > from relying on this email's technical content is explicitly
> disclaimed.
> > > The author will in no case be liable for any monetary damages arising
> > from
> > > such loss, damage or destruction.
> > >
> > >
> > >
> > > On 21 October 2016 at 21:54, Ted Yu  wrote:
> > >
> > >> Well, updates (in memory) would ultimately be flushed to disk,
> resulting
> > >> in
> > >> new hfiles.
> > >>
> > >> On Fri, Oct 21, 2016 at 1:50 PM, Mich Talebzadeh <
> > >> mich.talebza...@gmail.com>
> > >> wrote:
> > >>
> > >> > thanks
> > >> >
> > >> > bq. all updates are done in memory o disk access
> > >> >
> > >> > I meant data updates are operated in memory, no disk access.
> > >> >
> > >> > in other much like rdbms read data into memory and update it there
> > >> > (assuming that data is not already in memory?)
> > >> >
> > >> > HTH
> > >> >
> > >> > Dr Mich Talebzadeh
> > >> >
> > >> >
> > >> >
> > >> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > >> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >> >  > >> Jd6zP6AcPCCd
> > >> > OABUrV8Pw>*
> > >> >
> > >> >
> > >> >
> > >> > http://talebzadehmich.wordpress.com
> > >> >
> > >> >
> > >> > *Disclaimer:* Use it at your own risk. Any and all responsibility
> for
> > >> any
> > >> > loss, damage or destruction of data or any other property which may
> > >> arise
> > >> > from relying on this email's technical content is explicitly
> > disclaimed.
> > >> > The author will in no case be liable for any monetary damages
> arising
> > >> from
> > >> > such loss, damage or destruction.
> > >> >
> > >> >
> > >> >
> > >> > On 21 October 2016 at 21:46, Ted Yu  wrote:
> > >> >
> > >> > > bq. this search is carried out through map-reduce on region
> servers?
> > >> > >
> > >> > > No map-reduce. region server uses its own thread(s).
> > >> > >
> > >> > > bq. all updates are done in memory o disk access
> > >> > >
> > >> > > Can you clarify ? There seems to be some missing letters.
> > >> > >
> > >> > > On Fri, Oct 21, 2016 at 1:43 PM, Mich Talebzadeh <
> > >> > > mich.talebza...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > thanks
> > >> > > >
> > >> > > > 

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
BTW. I always understood that Hbase is append only. is that generally true?

thx

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 21:57, Mich Talebzadeh 
wrote:

> agreed much like any rdbms
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 October 2016 at 21:54, Ted Yu  wrote:
>
>> Well, updates (in memory) would ultimately be flushed to disk, resulting
>> in
>> new hfiles.
>>
>> On Fri, Oct 21, 2016 at 1:50 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com>
>> wrote:
>>
>> > thanks
>> >
>> > bq. all updates are done in memory o disk access
>> >
>> > I meant data updates are operated in memory, no disk access.
>> >
>> > in other much like rdbms read data into memory and update it there
>> > (assuming that data is not already in memory?)
>> >
>> > HTH
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn * https://www.linkedin.com/profile/view?id=
>> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> > > Jd6zP6AcPCCd
>> > OABUrV8Pw>*
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any
>> > loss, damage or destruction of data or any other property which may
>> arise
>> > from relying on this email's technical content is explicitly disclaimed.
>> > The author will in no case be liable for any monetary damages arising
>> from
>> > such loss, damage or destruction.
>> >
>> >
>> >
>> > On 21 October 2016 at 21:46, Ted Yu  wrote:
>> >
>> > > bq. this search is carried out through map-reduce on region servers?
>> > >
>> > > No map-reduce. region server uses its own thread(s).
>> > >
>> > > bq. all updates are done in memory o disk access
>> > >
>> > > Can you clarify ? There seems to be some missing letters.
>> > >
>> > > On Fri, Oct 21, 2016 at 1:43 PM, Mich Talebzadeh <
>> > > mich.talebza...@gmail.com>
>> > > wrote:
>> > >
>> > > > thanks
>> > > >
>> > > > having read the docs it appears to me that the main reason of hbase
>> > being
>> > > > faster is:
>> > > >
>> > > >
>> > > >1. it behaves like an rdbms like oracle tetc. reads are looked
>> for
>> > in
>> > > >the buffer cache for consistent reads and if not found then store
>> > > files
>> > > > on
>> > > >disks are searched. Does this mean that this search is carried
>> out
>> > > > through
>> > > >map-reduce on region servers?
>> > > >2. when the data is written it is written to log file
>> sequentially
>> > > >first, then to in-memory store, sorted like b-tree of rdbms and
>> then
>> > > >flushed to disk. this is exactly what checkpoint in an rdbms does
>> > > >3. one can point out that hbase is faster because log structured
>> > merge
>> > > >tree (LSM-trees)  has less depth than a B-tree in rdbms.
>> > > >4. all updates are done in memory o disk access
>> > > >5. in summary LSM-trees reduce disk access when data is read from
>> > disk
>> > > >because of reduced seek time again less depth to get data with
>> > > LSM-tree
>> > > >
>> > > >
>> > > > appreciate any comments
>> > > >
>> > > >
>> > > > cheers
>> > > >
>> > > >
>> > > > Dr Mich Talebzadeh
>> > > >
>> > > >
>> > > >
>> > > > LinkedIn * https://www.linkedin.com/profile/view?id=
>> > > > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> > > > > > AAEWh2gBxianrbJd6zP6AcPCCd
>> > > > OABUrV8Pw>*
>> > > >
>> > > >
>> > > >
>> > > > http://talebzadehmich.wordpress.com
>> > > >
>> > > >
>> > > > *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for
>> > any
>> > > > loss, damage or destruction of data or any other property which may
>> > arise
>> > > > from relying on this email's technical content is explicitly
>> > disclaimed.
>> > > > The author will in no case be liable for 

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
agreed much like any rdbms



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 21:54, Ted Yu  wrote:

> Well, updates (in memory) would ultimately be flushed to disk, resulting in
> new hfiles.
>
> On Fri, Oct 21, 2016 at 1:50 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> wrote:
>
> > thanks
> >
> > bq. all updates are done in memory o disk access
> >
> > I meant data updates are operated in memory, no disk access.
> >
> > in other much like rdbms read data into memory and update it there
> > (assuming that data is not already in memory?)
> >
> > HTH
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >  > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 21 October 2016 at 21:46, Ted Yu  wrote:
> >
> > > bq. this search is carried out through map-reduce on region servers?
> > >
> > > No map-reduce. region server uses its own thread(s).
> > >
> > > bq. all updates are done in memory o disk access
> > >
> > > Can you clarify ? There seems to be some missing letters.
> > >
> > > On Fri, Oct 21, 2016 at 1:43 PM, Mich Talebzadeh <
> > > mich.talebza...@gmail.com>
> > > wrote:
> > >
> > > > thanks
> > > >
> > > > having read the docs it appears to me that the main reason of hbase
> > being
> > > > faster is:
> > > >
> > > >
> > > >1. it behaves like an rdbms like oracle tetc. reads are looked for
> > in
> > > >the buffer cache for consistent reads and if not found then store
> > > files
> > > > on
> > > >disks are searched. Does this mean that this search is carried out
> > > > through
> > > >map-reduce on region servers?
> > > >2. when the data is written it is written to log file sequentially
> > > >first, then to in-memory store, sorted like b-tree of rdbms and
> then
> > > >flushed to disk. this is exactly what checkpoint in an rdbms does
> > > >3. one can point out that hbase is faster because log structured
> > merge
> > > >tree (LSM-trees)  has less depth than a B-tree in rdbms.
> > > >4. all updates are done in memory o disk access
> > > >5. in summary LSM-trees reduce disk access when data is read from
> > disk
> > > >because of reduced seek time again less depth to get data with
> > > LSM-tree
> > > >
> > > >
> > > > appreciate any comments
> > > >
> > > >
> > > > cheers
> > > >
> > > >
> > > > Dr Mich Talebzadeh
> > > >
> > > >
> > > >
> > > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > > > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > >  > AAEWh2gBxianrbJd6zP6AcPCCd
> > > > OABUrV8Pw>*
> > > >
> > > >
> > > >
> > > > http://talebzadehmich.wordpress.com
> > > >
> > > >
> > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> > any
> > > > loss, damage or destruction of data or any other property which may
> > arise
> > > > from relying on this email's technical content is explicitly
> > disclaimed.
> > > > The author will in no case be liable for any monetary damages arising
> > > from
> > > > such loss, damage or destruction.
> > > >
> > > >
> > > >
> > > > On 21 October 2016 at 17:51, Ted Yu  wrote:
> > > >
> > > > > See some prior blog:
> > > > >
> > > > > http://www.cyanny.com/2014/03/13/hbase-architecture-
> > > > > analysis-part1-logical-architecture/
> > > > >
> > > > > w.r.t. compaction in Hive, it is used to compact deltas into a base
> > > file
> > > > > (in the context of transactions).  Likely they're different.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Fri, Oct 21, 2016 at 9:08 AM, Mich Talebzadeh <
> > > > > mich.talebza...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Can someone in a nutshell explain *the *Hbase use of
> log-structured
> > > > > > merge-tree (LSM-tree) as data storage architecture
> > > > > >
> > > > > > The idea of merging smaller files to larger 

Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

2016-10-21 Thread Mich Talebzadeh
Create an external table in Hive on the HBase table. Pretty straightforward.

hive>  create external table marketDataHbase (key STRING, ticker STRING,
timecreated STRING, price STRING)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" =
":key,price_info:ticker,price_info:timecreated, price_info:price")

TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");



then create a normal table in hive as ORC


CREATE TABLE IF NOT EXISTS marketData (
 KEY string
   , TICKER string
   , TIMECREATED string
   , PRICE float
)
PARTITIONED BY (DateStamp  string)
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="KEY",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="1" )
;
--show create table marketData;
--Populate target table
INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
SELECT
  KEY
, TICKER
, TIMECREATED
, PRICE
FROM MarketDataHbase


Run this job as a cron job every so often


HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 21:48, Demai Ni  wrote:

> hi,
>
> I am wondering whether there are existing methods to ETL HBase data to
> ORC(or other open source columnar) file?
>
> I understand in Hive "insert into Hive_ORC_Table from SELET * from
> HIVE_HBase_Table", can probably get the job done. Is this the common way to
> do so? Performance is acceptable and able to handle the delta update in the
> case HBase table changed?
>
> I did a bit google, and find this
> https://community.hortonworks.com/questions/2632/loading-
> hbase-from-hive-orc-tables.html
>
> which is another way around.
>
> Will it perform better(comparing to above Hive stmt) if using either
> replication logic or snapshot backup to generate ORC file from hbase tables
> and with incremental update ability?
>
> I hope to has as fewer dependency as possible. in the Example of ORC, will
> only depend on Apache ORC's API, and not depend on Hive
>
> Demai
>


Re: Hbase fast access

2016-10-21 Thread Ted Yu
Well, updates (in memory) would ultimately be flushed to disk, resulting in
new hfiles.
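
To make that concrete, from the client API an "update" is simply another Put
for the same row and column: the newest cell version is what a Get returns,
and obsolete versions disappear when compaction rewrites the hfiles. A minimal
sketch (the table name "t1", family "cf" and the values are just assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUpdateSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      // Logically an update; physically a new cell version is appended.
      Put p = new Put(Bytes.toBytes("row1"));
      p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v2"));
      table.put(p);

      // Reads return the latest version; older versions are dropped at compaction.
      Result r = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))));
    }
  }
}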

On Fri, Oct 21, 2016 at 1:50 PM, Mich Talebzadeh 
wrote:

> thanks
>
> bq. all updates are done in memory o disk access
>
> I meant data updates are operated in memory, no disk access.
>
> in other much like rdbms read data into memory and update it there
> (assuming that data is not already in memory?)
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  OABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 October 2016 at 21:46, Ted Yu  wrote:
>
> > bq. this search is carried out through map-reduce on region servers?
> >
> > No map-reduce. region server uses its own thread(s).
> >
> > bq. all updates are done in memory o disk access
> >
> > Can you clarify ? There seems to be some missing letters.
> >
> > On Fri, Oct 21, 2016 at 1:43 PM, Mich Talebzadeh <
> > mich.talebza...@gmail.com>
> > wrote:
> >
> > > thanks
> > >
> > > having read the docs it appears to me that the main reason of hbase
> being
> > > faster is:
> > >
> > >
> > >1. it behaves like an rdbms like oracle tetc. reads are looked for
> in
> > >the buffer cache for consistent reads and if not found then store
> > files
> > > on
> > >disks are searched. Does this mean that this search is carried out
> > > through
> > >map-reduce on region servers?
> > >2. when the data is written it is written to log file sequentially
> > >first, then to in-memory store, sorted like b-tree of rdbms and then
> > >flushed to disk. this is exactly what checkpoint in an rdbms does
> > >3. one can point out that hbase is faster because log structured
> merge
> > >tree (LSM-trees)  has less depth than a B-tree in rdbms.
> > >4. all updates are done in memory o disk access
> > >5. in summary LSM-trees reduce disk access when data is read from
> disk
> > >because of reduced seek time again less depth to get data with
> > LSM-tree
> > >
> > >
> > > appreciate any comments
> > >
> > >
> > > cheers
> > >
> > >
> > > Dr Mich Talebzadeh
> > >
> > >
> > >
> > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >  AAEWh2gBxianrbJd6zP6AcPCCd
> > > OABUrV8Pw>*
> > >
> > >
> > >
> > > http://talebzadehmich.wordpress.com
> > >
> > >
> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
> > > loss, damage or destruction of data or any other property which may
> arise
> > > from relying on this email's technical content is explicitly
> disclaimed.
> > > The author will in no case be liable for any monetary damages arising
> > from
> > > such loss, damage or destruction.
> > >
> > >
> > >
> > > On 21 October 2016 at 17:51, Ted Yu  wrote:
> > >
> > > > See some prior blog:
> > > >
> > > > http://www.cyanny.com/2014/03/13/hbase-architecture-
> > > > analysis-part1-logical-architecture/
> > > >
> > > > w.r.t. compaction in Hive, it is used to compact deltas into a base
> > file
> > > > (in the context of transactions).  Likely they're different.
> > > >
> > > > Cheers
> > > >
> > > > On Fri, Oct 21, 2016 at 9:08 AM, Mich Talebzadeh <
> > > > mich.talebza...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Can someone in a nutshell explain *the *Hbase use of log-structured
> > > > > merge-tree (LSM-tree) as data storage architecture
> > > > >
> > > > > The idea of merging smaller files to larger files periodically to
> > > reduce
> > > > > disk seeks,  is this similar concept to compaction in HDFS or Hive?
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > Dr Mich Talebzadeh
> > > > >
> > > > >
> > > > >
> > > > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > > > > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > > >  > > AAEWh2gBxianrbJd6zP6AcPCCd
> > > > > OABUrV8Pw>*
> > > > >
> > > > >
> > > > >
> > > > > http://talebzadehmich.wordpress.com
> > > > >
> > > > >
> > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility
> for
> > > any
> > > > > loss, damage or destruction of data or any other property which may
> > > arise
> > > > > from relying on this email's technical content is explicitly
> > > disclaimed.
> > > > > The author will in no case be liable for any monetary damages
> arising
> > > > from
> > > > > such loss, damage or destruction.

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
thanks

bq. all updates are done in memory o disk access

I meant data updates are operated in memory, no disk access.

in other much like rdbms read data into memory and update it there
(assuming that data is not already in memory?)

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 21:46, Ted Yu  wrote:

> bq. this search is carried out through map-reduce on region servers?
>
> No map-reduce. region server uses its own thread(s).
>
> bq. all updates are done in memory o disk access
>
> Can you clarify ? There seems to be some missing letters.
>
> On Fri, Oct 21, 2016 at 1:43 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> wrote:
>
> > thanks
> >
> > having read the docs it appears to me that the main reason of hbase being
> > faster is:
> >
> >
> >1. it behaves like an rdbms like oracle tetc. reads are looked for in
> >the buffer cache for consistent reads and if not found then store
> files
> > on
> >disks are searched. Does this mean that this search is carried out
> > through
> >map-reduce on region servers?
> >2. when the data is written it is written to log file sequentially
> >first, then to in-memory store, sorted like b-tree of rdbms and then
> >flushed to disk. this is exactly what checkpoint in an rdbms does
> >3. one can point out that hbase is faster because log structured merge
> >tree (LSM-trees)  has less depth than a B-tree in rdbms.
> >4. all updates are done in memory o disk access
> >5. in summary LSM-trees reduce disk access when data is read from disk
> >because of reduced seek time again less depth to get data with
> LSM-tree
> >
> >
> > appreciate any comments
> >
> >
> > cheers
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >  > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 21 October 2016 at 17:51, Ted Yu  wrote:
> >
> > > See some prior blog:
> > >
> > > http://www.cyanny.com/2014/03/13/hbase-architecture-
> > > analysis-part1-logical-architecture/
> > >
> > > w.r.t. compaction in Hive, it is used to compact deltas into a base
> file
> > > (in the context of transactions).  Likely they're different.
> > >
> > > Cheers
> > >
> > > On Fri, Oct 21, 2016 at 9:08 AM, Mich Talebzadeh <
> > > mich.talebza...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Can someone in a nutshell explain *the *Hbase use of log-structured
> > > > merge-tree (LSM-tree) as data storage architecture
> > > >
> > > > The idea of merging smaller files to larger files periodically to
> > reduce
> > > > disk seeks,  is this similar concept to compaction in HDFS or Hive?
> > > >
> > > > Thanks
> > > >
> > > >
> > > > Dr Mich Talebzadeh
> > > >
> > > >
> > > >
> > > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > > > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > >  > AAEWh2gBxianrbJd6zP6AcPCCd
> > > > OABUrV8Pw>*
> > > >
> > > >
> > > >
> > > > http://talebzadehmich.wordpress.com
> > > >
> > > >
> > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> > any
> > > > loss, damage or destruction of data or any other property which may
> > arise
> > > > from relying on this email's technical content is explicitly
> > disclaimed.
> > > > The author will in no case be liable for any monetary damages arising
> > > from
> > > > such loss, damage or destruction.
> > > >
> > > >
> > > >
> > > > On 21 October 2016 at 15:27, Mich Talebzadeh <
> > mich.talebza...@gmail.com>
> > > > wrote:
> > > >
> > > > > Sorry that should read Hive not Spark here
> > > > >
> > > > > Say compared to Spark that is basically a SQL layer relying on
> > > different
> > > > > engines (mr, Tez, Spark) to execute the code
> > > > >
> > > > > Dr Mich Talebzadeh
> > > > >
> > > > >
> > > > >
> > > > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > > > 

ETL HBase HFile+HLog to ORC(or Parquet) file?

2016-10-21 Thread Demai Ni
hi,

I am wondering whether there are existing methods to ETL HBase data to
ORC (or other open source columnar) files?

I understand that in Hive "insert into Hive_ORC_Table from SELECT * from
HIVE_HBase_Table" can probably get the job done. Is this the common way to
do so? Is the performance acceptable, and can it handle the delta update in
case the HBase table changed?

I did a bit of googling and found this:
https://community.hortonworks.com/questions/2632/loading-hbase-from-hive-orc-tables.html

which goes the other way around (loading HBase from ORC).

Will it perform better (compared to the above Hive stmt) to use either the
replication logic or a snapshot backup to generate ORC files from hbase
tables, with incremental update ability?

I hope to have as few dependencies as possible. In the example of ORC, it will
only depend on Apache ORC's API, and not depend on Hive
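
For reference, a minimal sketch of writing an ORC file directly through the
Apache ORC core API with no Hive dependency (it assumes orc-core and
hadoop-common on the classpath; the schema, path and values are made up):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteSketch {
  public static void main(String[] args) throws Exception {
    TypeDescription schema =
        TypeDescription.fromString("struct<rowkey:string,price:double>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/export.orc"),
        OrcFile.writerOptions(new Configuration()).setSchema(schema));

    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector rowkey = (BytesColumnVector) batch.cols[0];
    DoubleColumnVector price = (DoubleColumnVector) batch.cols[1];

    // In a real export these values would come from an HBase scan,
    // replication sink or snapshot reader.
    int row = batch.size++;
    rowkey.setVal(row, "row-0001".getBytes(StandardCharsets.UTF_8));
    price.vector[row] = 42.0;

    writer.addRowBatch(batch);
    writer.close();
  }
}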

Demai


Re: Hbase fast access

2016-10-21 Thread Ted Yu
bq. this search is carried out through map-reduce on region servers?

No map-reduce. region server uses its own thread(s).

bq. all updates are done in memory o disk access

Can you clarify ? There seems to be some missing letters.

On Fri, Oct 21, 2016 at 1:43 PM, Mich Talebzadeh 
wrote:

> thanks
>
> having read the docs it appears to me that the main reason of hbase being
> faster is:
>
>
>1. it behaves like an rdbms like oracle tetc. reads are looked for in
>the buffer cache for consistent reads and if not found then store files
> on
>disks are searched. Does this mean that this search is carried out
> through
>map-reduce on region servers?
>2. when the data is written it is written to log file sequentially
>first, then to in-memory store, sorted like b-tree of rdbms and then
>flushed to disk. this is exactly what checkpoint in an rdbms does
>3. one can point out that hbase is faster because log structured merge
>tree (LSM-trees)  has less depth than a B-tree in rdbms.
>4. all updates are done in memory o disk access
>5. in summary LSM-trees reduce disk access when data is read from disk
>because of reduced seek time again less depth to get data with LSM-tree
>
>
> appreciate any comments
>
>
> cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  OABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 October 2016 at 17:51, Ted Yu  wrote:
>
> > See some prior blog:
> >
> > http://www.cyanny.com/2014/03/13/hbase-architecture-
> > analysis-part1-logical-architecture/
> >
> > w.r.t. compaction in Hive, it is used to compact deltas into a base file
> > (in the context of transactions).  Likely they're different.
> >
> > Cheers
> >
> > On Fri, Oct 21, 2016 at 9:08 AM, Mich Talebzadeh <
> > mich.talebza...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Can someone in a nutshell explain *the *Hbase use of log-structured
> > > merge-tree (LSM-tree) as data storage architecture
> > >
> > > The idea of merging smaller files to larger files periodically to
> reduce
> > > disk seeks,  is this similar concept to compaction in HDFS or Hive?
> > >
> > > Thanks
> > >
> > >
> > > Dr Mich Talebzadeh
> > >
> > >
> > >
> > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >  AAEWh2gBxianrbJd6zP6AcPCCd
> > > OABUrV8Pw>*
> > >
> > >
> > >
> > > http://talebzadehmich.wordpress.com
> > >
> > >
> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
> > > loss, damage or destruction of data or any other property which may
> arise
> > > from relying on this email's technical content is explicitly
> disclaimed.
> > > The author will in no case be liable for any monetary damages arising
> > from
> > > such loss, damage or destruction.
> > >
> > >
> > >
> > > On 21 October 2016 at 15:27, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> > > wrote:
> > >
> > > > Sorry that should read Hive not Spark here
> > > >
> > > > Say compared to Spark that is basically a SQL layer relying on
> > different
> > > > engines (mr, Tez, Spark) to execute the code
> > > >
> > > > Dr Mich Talebzadeh
> > > >
> > > >
> > > >
> > > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > > >  > AAEWh2gBxianrbJd6zP6AcPCCd
> > > OABUrV8Pw>*
> > > >
> > > >
> > > >
> > > > http://talebzadehmich.wordpress.com
> > > >
> > > >
> > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> > any
> > > > loss, damage or destruction of data or any other property which may
> > arise
> > > > from relying on this email's technical content is explicitly
> > disclaimed.
> > > > The author will in no case be liable for any monetary damages arising
> > > from
> > > > such loss, damage or destruction.
> > > >
> > > >
> > > >
> > > > On 21 October 2016 at 13:17, Ted Yu  wrote:
> > > >
> > > >> Mich:
> > > >> Here is brief description of hbase architecture:
> > > >> https://hbase.apache.org/book.html#arch.overview
> > > >>
> > > >> You can also get more details from Lars George's or Nick Dimiduk's
> > > books.
> > > >>
> > > >> HBase doesn't support SQL directly. There is no cost based
> > optimization.
> > > >>
> > > >> Cheers
> > > >>
> > > >> > On Oct 21, 2016, at 1:43 AM, Mich Talebzadeh <
> > > 

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
thanks

having read the docs it appears to me that the main reason of hbase being
faster is:


   1. it behaves like an rdbms like oracle tetc. reads are looked for in
   the buffer cache for consistent reads and if not found then store files on
   disks are searched. Does this mean that this search is carried out through
   map-reduce on region servers?
   2. when the data is written it is written to log file sequentially
   first, then to in-memory store, sorted like b-tree of rdbms and then
   flushed to disk. this is exactly what checkpoint in an rdbms does
   3. one can point out that hbase is faster because log structured merge
   tree (LSM-trees)  has less depth than a B-tree in rdbms.
   4. all updates are done in memory o disk access
   5. in summary LSM-trees reduce disk access when data is read from disk
   because of reduced seek time again less depth to get data with LSM-tree
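
To make points 2 and 4 concrete, here is a toy, generic LSM write-path sketch
(standard library only, deliberately not HBase's actual classes): every write
is appended to a log, applied to a sorted in-memory map, and flushed to an
immutable sorted file once a threshold is reached.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class TinyLsmStore {
  private final TreeMap<String, String> memstore = new TreeMap<>(); // sorted in memory
  private final PrintWriter wal;                                    // sequential log
  private final int flushThreshold;
  private int flushCount = 0;

  public TinyLsmStore(String walPath, int flushThreshold) throws IOException {
    this.wal = new PrintWriter(new FileWriter(walPath, true), true); // append + autoflush
    this.flushThreshold = flushThreshold;
  }

  /** A write or an "update" never touches existing files: log it, buffer it. */
  public void put(String key, String value) throws IOException {
    wal.println(key + "\t" + value);   // 1) sequential log append
    memstore.put(key, value);          // 2) sorted in-memory store (updates land here)
    if (memstore.size() >= flushThreshold) {
      flush();                         // 3) flush sorted data to an immutable file
    }
  }

  private void flush() throws IOException {
    Path file = Paths.get("flush-" + (flushCount++) + ".txt");
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : memstore.entrySet()) {
      sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
    }
    Files.write(file, sb.toString().getBytes(StandardCharsets.UTF_8)); // already sorted
    memstore.clear();
  }
}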


appreciate any comments


cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 17:51, Ted Yu  wrote:

> See some prior blog:
>
> http://www.cyanny.com/2014/03/13/hbase-architecture-
> analysis-part1-logical-architecture/
>
> w.r.t. compaction in Hive, it is used to compact deltas into a base file
> (in the context of transactions).  Likely they're different.
>
> Cheers
>
> On Fri, Oct 21, 2016 at 9:08 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Can someone in a nutshell explain *the *Hbase use of log-structured
> > merge-tree (LSM-tree) as data storage architecture
> >
> > The idea of merging smaller files to larger files periodically to reduce
> > disk seeks,  is this similar concept to compaction in HDFS or Hive?
> >
> > Thanks
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >  > OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 21 October 2016 at 15:27, Mich Talebzadeh 
> > wrote:
> >
> > > Sorry that should read Hive not Spark here
> > >
> > > Say compared to Spark that is basically a SQL layer relying on
> different
> > > engines (mr, Tez, Spark) to execute the code
> > >
> > > Dr Mich Talebzadeh
> > >
> > >
> > >
> > > LinkedIn * https://www.linkedin.com/profile/view?id=
> > AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > >  AAEWh2gBxianrbJd6zP6AcPCCd
> > OABUrV8Pw>*
> > >
> > >
> > >
> > > http://talebzadehmich.wordpress.com
> > >
> > >
> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
> > > loss, damage or destruction of data or any other property which may
> arise
> > > from relying on this email's technical content is explicitly
> disclaimed.
> > > The author will in no case be liable for any monetary damages arising
> > from
> > > such loss, damage or destruction.
> > >
> > >
> > >
> > > On 21 October 2016 at 13:17, Ted Yu  wrote:
> > >
> > >> Mich:
> > >> Here is brief description of hbase architecture:
> > >> https://hbase.apache.org/book.html#arch.overview
> > >>
> > >> You can also get more details from Lars George's or Nick Dimiduk's
> > books.
> > >>
> > >> HBase doesn't support SQL directly. There is no cost based
> optimization.
> > >>
> > >> Cheers
> > >>
> > >> > On Oct 21, 2016, at 1:43 AM, Mich Talebzadeh <
> > mich.talebza...@gmail.com>
> > >> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > This is a general question.
> > >> >
> > >> > Is Hbase fast because Hbase uses Hash tables and provides random
> > access,
> > >> > and it stores the data in indexed HDFS files for faster lookups.
> > >> >
> > >> > Say compared to Spark that is basically a SQL layer relying on
> > different
> > >> > engines (mr, Tez, Spark) to execute the code (although it has Cost
> > Base
> > >> > Optimizer), how Hbase fares, beyond relying on these engines
> > >> >
> > >> > Thanks
> > >> >
> > >> >
> > >> > Dr Mich Talebzadeh
> > >> >
> > >> >
> > >> >
> 

Re: mapreduce get error while update HBase

2016-10-21 Thread Ted Yu
Can you give us more information so that we can match the line numbers in
the stack trace with actual code ?

release of hbase
hadoop version

If you can show snippet of related code, that would be nice.

Thanks

On Fri, Oct 21, 2016 at 11:16 AM, 乔彦克  wrote:

> Hi, all
>
> I use mapreduce to update HBase data, got this error just now, and
> I have no idea why this happened.
>
> below is the error log:
>
> 2016-10-22 01:14:49,047 WARN [main]
> org.apache.hadoop.mapred.YarnChild: Exception running child :
> java.lang.RuntimeException: java.lang.NullPointerException
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.
> callWithoutRetries(RpcRetryingCaller.java:208)
> at org.apache.hadoop.hbase.client.ClientScanner.call(
> ClientScanner.java:314)
> at org.apache.hadoop.hbase.client.ClientScanner.
> loadCache(ClientScanner.java:397)
> at org.apache.hadoop.hbase.client.ClientScanner.next(
> ClientScanner.java:358)
> at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.
> nextKeyValue(TableRecordReaderImpl.java:220)
> at org.apache.hadoop.hbase.mapreduce.TableRecordReader.
> nextKeyValue(TableRecordReader.java:151)
> at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.
> nextKeyValue(MapTask.java:556)
> at org.apache.hadoop.mapreduce.task.MapContextImpl.
> nextKeyValue(MapContextImpl.java:80)
> at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.
> nextKeyValue(WrappedMapper.java:91)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1671)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.
> callWithoutRetries(RpcRetryingCaller.java:199)
> ... 16 more
>
> Any reply is appreciated.
>
>
> Best regards
>
> Qiao Yanke
>


mapreduce get error while update HBase

2016-10-21 Thread 乔彦克
Hi, all

I use mapreduce to update HBase data, got this error just now, and
I have no idea why this happened.

below is the error log:

2016-10-22 01:14:49,047 WARN [main]
org.apache.hadoop.mapred.YarnChild: Exception running child :
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
at 
org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)
at 
org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:397)
at 
org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:358)
at 
org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:220)
at 
org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:151)
at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at 
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at 
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:199)
... 16 more

Any reply is appreciated.


Best regards

Qiao Yanke


Re: Hbase fast access

2016-10-21 Thread Ted Yu
See some prior blog:

http://www.cyanny.com/2014/03/13/hbase-architecture-analysis-part1-logical-architecture/

w.r.t. compaction in Hive, it is used to compact deltas into a base file
(in the context of transactions).  Likely they're different.

Cheers

On Fri, Oct 21, 2016 at 9:08 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> Can someone in a nutshell explain *the *Hbase use of log-structured
> merge-tree (LSM-tree) as data storage architecture
>
> The idea of merging smaller files to larger files periodically to reduce
> disk seeks,  is this similar concept to compaction in HDFS or Hive?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  OABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 October 2016 at 15:27, Mich Talebzadeh 
> wrote:
>
> > Sorry that should read Hive not Spark here
> >
> > Say compared to Spark that is basically a SQL layer relying on different
> > engines (mr, Tez, Spark) to execute the code
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >  OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 21 October 2016 at 13:17, Ted Yu  wrote:
> >
> >> Mich:
> >> Here is brief description of hbase architecture:
> >> https://hbase.apache.org/book.html#arch.overview
> >>
> >> You can also get more details from Lars George's or Nick Dimiduk's
> books.
> >>
> >> HBase doesn't support SQL directly. There is no cost based optimization.
> >>
> >> Cheers
> >>
> >> > On Oct 21, 2016, at 1:43 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> >> wrote:
> >> >
> >> > Hi,
> >> >
> >> > This is a general question.
> >> >
> >> > Is Hbase fast because Hbase uses Hash tables and provides random
> access,
> >> > and it stores the data in indexed HDFS files for faster lookups.
> >> >
> >> > Say compared to Spark that is basically a SQL layer relying on
> different
> >> > engines (mr, Tez, Spark) to execute the code (although it has Cost
> Base
> >> > Optimizer), how Hbase fares, beyond relying on these engines
> >> >
> >> > Thanks
> >> >
> >> >
> >> > Dr Mich Talebzadeh
> >> >
> >> >
> >> >
> >> > LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
> >> >
> >> >
> >> >
> >> > http://talebzadehmich.wordpress.com
> >> >
> >> >
> >> > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> >> any
> >> > loss, damage or destruction of data or any other property which may
> >> arise
> >> > from relying on this email's technical content is explicitly
> disclaimed.
> >> > The author will in no case be liable for any monetary damages arising
> >> from
> >> > such loss, damage or destruction.
> >>
> >
> >
>


Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
Hi,

Can someone explain, in a nutshell, HBase's use of the log-structured
merge-tree (LSM-tree) as its data storage architecture?

The idea of periodically merging smaller files into larger files to reduce
disk seeks - is this a similar concept to compaction in HDFS or Hive?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 15:27, Mich Talebzadeh 
wrote:

> Sorry that should read Hive not Spark here
>
> Say compared to Spark that is basically a SQL layer relying on different
> engines (mr, Tez, Spark) to execute the code
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 October 2016 at 13:17, Ted Yu  wrote:
>
>> Mich:
>> Here is brief description of hbase architecture:
>> https://hbase.apache.org/book.html#arch.overview
>>
>> You can also get more details from Lars George's or Nick Dimiduk's books.
>>
>> HBase doesn't support SQL directly. There is no cost based optimization.
>>
>> Cheers
>>
>> > On Oct 21, 2016, at 1:43 AM, Mich Talebzadeh 
>> wrote:
>> >
>> > Hi,
>> >
>> > This is a general question.
>> >
>> > Is Hbase fast because Hbase uses Hash tables and provides random access,
>> > and it stores the data in indexed HDFS files for faster lookups.
>> >
>> > Say compared to Spark that is basically a SQL layer relying on different
>> > engines (mr, Tez, Spark) to execute the code (although it has Cost Base
>> > Optimizer), how Hbase fares, beyond relying on these engines
>> >
>> > Thanks
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any
>> > loss, damage or destruction of data or any other property which may
>> arise
>> > from relying on this email's technical content is explicitly disclaimed.
>> > The author will in no case be liable for any monetary damages arising
>> from
>> > such loss, damage or destruction.
>>
>
>


ImportTSV write to remote HDFS concurrently.

2016-10-21 Thread Vadim Vararu

Hi guys,

I'm trying to run the ImportTsv job and write the result to a
remote HDFS. Isn't it supposed to write data concurrently? I'm asking because
I get the same run time with 2 and 4 nodes, and I can see that only
1 reducer is running.

Where is the bottleneck?
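For what it's worth, if the job is run with -Dimporttsv.bulk.output (the
bulk-load path), the number of reducers generally follows the number of regions
in the target table, so a single-region target gives a single reducer no matter
how many nodes there are. A hedged sketch of pre-splitting the target table
before the import (table name, column family and split points are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PreSplitTarget {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        HTableDescriptor desc =
            new HTableDescriptor(TableName.valueOf("import_target")); // placeholder
        desc.addFamily(new HColumnDescriptor("d"));                   // placeholder family
        // Placeholder split points; pick boundaries that match your key distribution.
        byte[][] splits = {
            Bytes.toBytes("2"), Bytes.toBytes("4"),
            Bytes.toBytes("6"), Bytes.toBytes("8")
        };
        // 4 split points -> 5 regions -> typically 5 reducers on the bulk-load path.
        admin.createTable(desc, splits);
      }
    }
  }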

Thanks, Vadim.


Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
Sorry, that should read Hive, not Spark, in the sentence below:

"Say compared to Spark that is basically a SQL layer relying on different
engines (mr, Tez, Spark) to execute the code"

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 October 2016 at 13:17, Ted Yu  wrote:

> Mich:
> Here is brief description of hbase architecture:
> https://hbase.apache.org/book.html#arch.overview
>
> You can also get more details from Lars George's or Nick Dimiduk's books.
>
> HBase doesn't support SQL directly. There is no cost based optimization.
>
> Cheers
>
> > On Oct 21, 2016, at 1:43 AM, Mich Talebzadeh 
> wrote:
> >
> > Hi,
> >
> > This is a general question.
> >
> > Is Hbase fast because Hbase uses Hash tables and provides random access,
> > and it stores the data in indexed HDFS files for faster lookups.
> >
> > Say compared to Spark that is basically a SQL layer relying on different
> > engines (mr, Tez, Spark) to execute the code (although it has Cost Base
> > Optimizer), how Hbase fares, beyond relying on these engines
> >
> > Thanks
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
>


Re: Hbase fast access

2016-10-21 Thread Ted Yu
Mich:
Here is a brief description of the HBase architecture:
https://hbase.apache.org/book.html#arch.overview

You can also get more details from Lars George's or Nick Dimiduk's books. 

HBase doesn't support SQL directly. There is no cost-based optimization.
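To make the access model concrete, here is a minimal sketch of the plain Java
client API: reads go either by row key (Get) or by a row-key range (Scan), with
no SQL planner in between. Table, family and qualifier names are placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class KeyAccessExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("my_table"))) {
        // Random access: one row fetched directly by its key.
        Result row = table.get(new Get(Bytes.toBytes("row-0042")));
        byte[] value = row.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"));
        System.out.println(value == null ? "miss" : Bytes.toString(value));

        // Range access: a scan bounded by start/stop row keys.
        Scan scan = new Scan(Bytes.toBytes("row-0000"), Bytes.toBytes("row-0100"));
        try (ResultScanner scanner = table.getScanner(scan)) {
          for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
          }
        }
      }
    }
  }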

Cheers

> On Oct 21, 2016, at 1:43 AM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> This is a general question.
> 
> Is Hbase fast because Hbase uses Hash tables and provides random access,
> and it stores the data in indexed HDFS files for faster lookups.
> 
> Say compared to Spark that is basically a SQL layer relying on different
> engines (mr, Tez, Spark) to execute the code (although it has Cost Base
> Optimizer), how Hbase fares, beyond relying on these engines
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
> 
> 
> 
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
> 
> 
> 
> http://talebzadehmich.wordpress.com
> 
> 
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.


Re: Scan a region in parallel

2016-10-21 Thread Anil
Thank you, Ram. Now it's clear. I will take a look at it.

Thanks again.

On 21 October 2016 at 14:25, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> Phoenix does support intelligent ways when you query using columns since it
> is a SQL engine.
>
> There the parallelism happens by using guideposts - those are fixed spaced
> row keys stored in a seperate stats table. So when you do a query the
> Phoenix internally spawns parallels scan queries using those guide posts
> and thus making querying faster.
>
> Regards
> Ram
>
> On Fri, Oct 21, 2016 at 1:26 PM, Anil  wrote:
>
> > Thank you Ram.
> >
> > "So now  you are spawning those many scan threads equal to the number of
> > regions " - YES
> >
> > There are two ways of scanning region in parallel
> >
> > 1. scan a region with start row and stop row in parallel with single scan
> > operation on server side and hbase take care of parallelism internally.
> > 2. transform a start row and stop row of a region into number of start
> and
> > stop rows (by some criteria) and span scan query for each start and stop
> > row.
> >
> > #1 is not supported (as you also said).
> >
> > i am looking for #2. i checked the phoenix documentation and code. it
> seems
> > to me that phoenix is doing #2. i looked into phoenix code and could not
> > understand it completely.
> >
> > The usecase is very simple. Hbase not good (at least in terms of
> > performance for OLTP) query by all columns (other than row key) and
> sorting
> > of all columns of a row. even phoenix too.
> >
> > So i am planning load the hbase/phoenix table into in-memory data base
> for
> > faster access.
> >
> > scanning of big region sequentially will lead to larger load time. so
> > finding ways to minimize the load time.
> >
> > Hope this helps.
> >
> > Thanks.
> >
> >
> > On 21 October 2016 at 09:30, ramkrishna vasudevan <
> > ramkrishna.s.vasude...@gmail.com> wrote:
> >
> > > Hi Anil
> > >
> > > So now  you are spawning those many scan threads equal to the number of
> > > regions.
> > > bq.Is there any way to scan a region in parallel ?
> > > You mean with in a region you want to scan parallely? Which means that
> a
> > > single query you want to split up into N number of small scans and read
> > and
> > > aggregate on the client side/server side?
> > >
> > > Currently you cannot do that. Once you set a start and stoprow the scan
> > > will determine which region it belongs to and retrieves the data
> > > sequentially in that region (it applies the filtering that you do
> during
> > > the course of the scan).
> > >
> > > Have you tried Apache Phoenix?  Its a SQL wrapper over HBase and there
> > you
> > > could do parallel scans for a given SQL query if there are some guide
> > posts
> > > collected. Such things cannot be an integral part of HBase. But I fear
> > as I
> > > am not aware of your usecase we cannot suggest on this.
> > >
> > > REgards
> > > Ram
> > >
> > >
> > > On Fri, Oct 21, 2016 at 8:40 AM, Anil  wrote:
> > >
> > > > Any pointers ?
> > > >
> > > > On 20 October 2016 at 18:15, Anil  wrote:
> > > >
> > > > > HI,
> > > > >
> > > > > I am loading hbase table into an in-memory db to support filter,
> > > ordering
> > > > > and pagination.
> > > > >
> > > > > I am scanning region and inserting data into in-memory db. each
> > region
> > > > > scan is done in single thread so each region is scanned in
> parallel.
> > > > >
> > > > > Is there any way to scan a region in parallel ? any pointers would
> be
> > > > > helpful.
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > >
> >
>


Re: Scan a region in parallel

2016-10-21 Thread ramkrishna vasudevan
Phoenix does support intelligent ways of querying by columns, since it
is a SQL engine.

There, the parallelism happens through guideposts - fixed-spaced row keys
stored in a separate stats table. When you run a query, Phoenix internally
spawns parallel scan queries using those guideposts, which makes querying
faster.
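A hedged sketch of what that looks like from the Phoenix JDBC client (the
ZooKeeper quorum and table name are placeholders): UPDATE STATISTICS collects
the guideposts into SYSTEM.STATS, and later queries are chunked along them
automatically:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class PhoenixGuidepostExample {
    public static void main(String[] args) throws Exception {
      // Placeholder ZooKeeper quorum in the Phoenix JDBC URL.
      try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
           Statement stmt = conn.createStatement()) {
        // Collect/refresh guideposts (stats) for the table; Phoenix keeps them
        // in SYSTEM.STATS and uses them to chunk queries into parallel scans.
        stmt.execute("UPDATE STATISTICS MY_TABLE"); // placeholder table name
        // A full-table query that Phoenix can now split along the guideposts.
        try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM MY_TABLE")) {
          if (rs.next()) {
            System.out.println("rows: " + rs.getLong(1));
          }
        }
      }
    }
  }

The guidepost spacing, and therefore how many chunks a query is split into, can
be tuned, e.g. via the phoenix.stats.guidepost.width setting.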

Regards
Ram

On Fri, Oct 21, 2016 at 1:26 PM, Anil  wrote:

> Thank you Ram.
>
> "So now  you are spawning those many scan threads equal to the number of
> regions " - YES
>
> There are two ways of scanning region in parallel
>
> 1. scan a region with start row and stop row in parallel with single scan
> operation on server side and hbase take care of parallelism internally.
> 2. transform a start row and stop row of a region into number of start and
> stop rows (by some criteria) and span scan query for each start and stop
> row.
>
> #1 is not supported (as you also said).
>
> i am looking for #2. i checked the phoenix documentation and code. it seems
> to me that phoenix is doing #2. i looked into phoenix code and could not
> understand it completely.
>
> The usecase is very simple. Hbase not good (at least in terms of
> performance for OLTP) query by all columns (other than row key) and sorting
> of all columns of a row. even phoenix too.
>
> So i am planning load the hbase/phoenix table into in-memory data base for
> faster access.
>
> scanning of big region sequentially will lead to larger load time. so
> finding ways to minimize the load time.
>
> Hope this helps.
>
> Thanks.
>
>
> On 21 October 2016 at 09:30, ramkrishna vasudevan <
> ramkrishna.s.vasude...@gmail.com> wrote:
>
> > Hi Anil
> >
> > So now  you are spawning those many scan threads equal to the number of
> > regions.
> > bq.Is there any way to scan a region in parallel ?
> > You mean with in a region you want to scan parallely? Which means that a
> > single query you want to split up into N number of small scans and read
> and
> > aggregate on the client side/server side?
> >
> > Currently you cannot do that. Once you set a start and stoprow the scan
> > will determine which region it belongs to and retrieves the data
> > sequentially in that region (it applies the filtering that you do during
> > the course of the scan).
> >
> > Have you tried Apache Phoenix?  Its a SQL wrapper over HBase and there
> you
> > could do parallel scans for a given SQL query if there are some guide
> posts
> > collected. Such things cannot be an integral part of HBase. But I fear
> as I
> > am not aware of your usecase we cannot suggest on this.
> >
> > REgards
> > Ram
> >
> >
> > On Fri, Oct 21, 2016 at 8:40 AM, Anil  wrote:
> >
> > > Any pointers ?
> > >
> > > On 20 October 2016 at 18:15, Anil  wrote:
> > >
> > > > HI,
> > > >
> > > > I am loading hbase table into an in-memory db to support filter,
> > ordering
> > > > and pagination.
> > > >
> > > > I am scanning region and inserting data into in-memory db. each
> region
> > > > scan is done in single thread so each region is scanned in parallel.
> > > >
> > > > Is there any way to scan a region in parallel ? any pointers would be
> > > > helpful.
> > > >
> > > > Thanks
> > > >
> > >
> >
>


Hbase fast access

2016-10-21 Thread Mich Talebzadeh
Hi,

This is a general question.

Is HBase fast because it uses hash tables and provides random access,
and because it stores the data in indexed HDFS files for faster lookups?

Say, compared to Spark, which is basically a SQL layer relying on different
engines (mr, Tez, Spark) to execute the code (although it has a Cost Based
Optimizer), how does HBase fare, beyond relying on these engines?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Scan a region in parallel

2016-10-21 Thread Anil
Thank you, Ram.

"So now you are spawning those many scan threads equal to the number of
regions" - YES.

There are two ways of scanning a region in parallel:

1. Scan a region with a start row and stop row in a single scan operation
and have HBase take care of the parallelism internally on the server side.
2. Transform a region's start row and stop row into a number of start and
stop rows (by some criteria) and spawn a scan query for each start/stop
pair (see the sketch below).

#1 is not supported (as you also said).

I am looking for #2. I checked the Phoenix documentation and code, and it seems
to me that Phoenix is doing #2, but I looked into the Phoenix code and could not
understand it completely.

The use case is very simple. HBase is not good (at least in terms of OLTP
performance) at querying by all columns (other than the row key) or at sorting
by all columns of a row; even Phoenix isn't.

So I am planning to load the HBase/Phoenix table into an in-memory database for
faster access.

Scanning a big region sequentially will lead to a long load time, so
I am looking for ways to minimize it.
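A minimal sketch of #2, assuming the row keys begin with a fixed-width,
zero-padded numeric prefix (that assumption is only for illustration; the real
split criteria depend on your key design). Each sub-range gets its own Scan on
its own thread, and the per-range work (here just a row count; in your case the
insert into the in-memory database) stays on the client:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ParallelRegionScan {

    // Scan one region's key range with several client-side sub-scans.
    // ASSUMPTION: row keys start with an 8-digit zero-padded numeric prefix and
    // regionStart/regionStop carry that prefix (i.e. an interior region).
    static long scanRegionInParallel(final Connection conn, final TableName table,
        String regionStart, String regionStop, int parallelism) throws Exception {
      long lo = Long.parseLong(regionStart.substring(0, 8));
      long hi = Long.parseLong(regionStop.substring(0, 8));
      long step = Math.max(1, (hi - lo) / parallelism);

      ExecutorService pool = Executors.newFixedThreadPool(parallelism);
      List<Future<Long>> futures = new ArrayList<Future<Long>>();
      for (long from = lo; from < hi; from += step) {
        final String start = String.format("%08d", from);
        final String stop = String.format("%08d", Math.min(from + step, hi));
        futures.add(pool.submit(new Callable<Long>() {
          @Override
          public Long call() throws Exception {
            long count = 0;
            // One Table per worker; the shared Connection is thread-safe.
            try (Table t = conn.getTable(table);
                 ResultScanner rs = t.getScanner(
                     new Scan(Bytes.toBytes(start), Bytes.toBytes(stop)))) {
              for (Result r : rs) {
                count++; // in your case: insert r into the in-memory DB here
              }
            }
            return count;
          }
        }));
      }
      long total = 0;
      for (Future<Long> f : futures) {
        total += f.get(); // propagates any scan failure
      }
      pool.shutdown();
      return total;
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf)) {
        // Placeholder table name and region boundary prefixes.
        long rows = scanRegionInParallel(conn, TableName.valueOf("my_table"),
            "00000000", "10000000", 4);
        System.out.println("rows scanned: " + rows);
      }
    }
  }

Creating one Table per worker from a shared Connection is intentional: the
Connection is thread-safe, while Table instances are not.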

Hope this helps.

Thanks.


On 21 October 2016 at 09:30, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> Hi Anil
>
> So now  you are spawning those many scan threads equal to the number of
> regions.
> bq.Is there any way to scan a region in parallel ?
> You mean with in a region you want to scan parallely? Which means that a
> single query you want to split up into N number of small scans and read and
> aggregate on the client side/server side?
>
> Currently you cannot do that. Once you set a start and stoprow the scan
> will determine which region it belongs to and retrieves the data
> sequentially in that region (it applies the filtering that you do during
> the course of the scan).
>
> Have you tried Apache Phoenix?  Its a SQL wrapper over HBase and there you
> could do parallel scans for a given SQL query if there are some guide posts
> collected. Such things cannot be an integral part of HBase. But I fear as I
> am not aware of your usecase we cannot suggest on this.
>
> REgards
> Ram
>
>
> On Fri, Oct 21, 2016 at 8:40 AM, Anil  wrote:
>
> > Any pointers ?
> >
> > On 20 October 2016 at 18:15, Anil  wrote:
> >
> > > HI,
> > >
> > > I am loading hbase table into an in-memory db to support filter,
> ordering
> > > and pagination.
> > >
> > > I am scanning region and inserting data into in-memory db. each region
> > > scan is done in single thread so each region is scanned in parallel.
> > >
> > > Is there any way to scan a region in parallel ? any pointers would be
> > > helpful.
> > >
> > > Thanks
> > >
> >
>