Re: Re: Re: Re: What way to improve MTTR other than DLR (distributed log replay)

2016-10-21 Thread Enis Söztutar
A bit late, but let me give my perspective. This can also be moved to jira or dev@, I think. DLR was nice and had pretty good gains for MTTR. However, dealing with the sequence ids, onlining regions, etc. and the replay paths proved too difficult in practice. I think the way forward would be

Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

2016-10-21 Thread Mich Talebzadeh
Hi Demai, As I understand it, you want to use HBase as the real-time layer and a Hive data warehouse as the batch layer for analytics. In other words, ingest data in real time from the source into HBase and push that data into Hive on a recurring basis. If you partition your target ORC table with DtStamp and

Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

2016-10-21 Thread Demai Ni
Mich, thanks for the detailed instructions. While aware of the Hive method, I have a few questions/concerns: 1) the Hive method is an "INSERT FROM SELECT", which usually does not perform as well as a bulk load, though I am not familiar with the real implementation; 2) I have another SQL-on-Hadoop engine

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
I was asked an interesting question: can one update data in HBase? My answer was that it is append-only. Can one update data in Hive? My answer was yes, if the table is created as ORC with table properties set with "transactional"="true": STORED AS ORC TBLPROPERTIES ( "orc.compress"="SNAPPY",

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
BTW, I always understood that HBase is append-only. Is that generally true? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
Agreed, much like any RDBMS.

Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

2016-10-21 Thread Mich Talebzadeh
Create an external table in Hive on the HBase table. Pretty straightforward. hive> create external table marketDataHbase (key STRING, ticker STRING, timecreated STRING, price STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" =

Re: Hbase fast access

2016-10-21 Thread Ted Yu
Well, updates (in memory) would ultimately be flushed to disk, resulting in new hfiles. On Fri, Oct 21, 2016 at 1:50 PM, Mich Talebzadeh wrote: > thanks > > bq. all updates are done in memory o disk access > > I meant data updates are operated in memory, no disk
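To make Ted's point concrete — updates are buffered in memory and eventually flushed out as brand-new immutable files rather than edits to existing ones — here is a toy Python sketch. The class name, the size-based threshold, and the use of plain dicts are all illustrative assumptions, not HBase's actual implementation:

```python
# Toy model of the HBase write path: puts land in an in-memory MemStore;
# once it passes a threshold it is flushed as a new immutable "HFile".
class ToyRegion:
    def __init__(self, flush_threshold=3):
        self.memstore = {}   # mutable, in memory
        self.hfiles = []     # immutable, "on disk"
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value          # the update happens in memory only
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # A flush writes a brand-new file; existing files are never modified.
        self.hfiles.append(dict(self.memstore))
        self.memstore = {}

    def get(self, key):
        # Newest data wins: memstore first, then files newest-to-oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return hfile[key]
        return None

region = ToyRegion()
region.put("row1", "v1")
region.put("row2", "v1")
region.put("row3", "v1")   # third key triggers a flush to a new hfile
region.put("row1", "v2")   # the "update" is just a newer entry in memory
```

After this sequence the flushed file still holds the old "row1" value; the newer value shadows it in the memstore, which is exactly why updates look append-only on disk.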

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
thanks bq. all updates are done in memory o disk access I meant data updates are operated in memory, no disk access. In other words, much like an RDBMS: read data into memory and update it there (assuming that data is not already in memory?) HTH

ETL HBase HFile+HLog to ORC(or Parquet) file?

2016-10-21 Thread Demai Ni
Hi, I am wondering whether there are existing methods to ETL HBase data to an ORC (or other open-source columnar) file? I understand that in Hive, "insert into Hive_ORC_Table from SELECT * from HIVE_HBase_Table" can probably get the job done. Is this the common way to do so? Performance is acceptable and

Re: Hbase fast access

2016-10-21 Thread Ted Yu
bq. this search is carried out through map-reduce on region servers? No map-reduce; the region server uses its own thread(s). bq. all updates are done in memory o disk access Can you clarify? There seem to be some missing letters. On Fri, Oct 21, 2016 at 1:43 PM, Mich Talebzadeh

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
Thanks. Having read the docs, it appears to me that the main reason for HBase being faster is: 1. it behaves like an RDBMS such as Oracle etc.: reads are looked up in the buffer cache for consistent reads and, if not found, the store files on disk are searched. Does this mean that this
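The read path described here — check an in-memory cache first, fall back to the store files only on a miss — can be sketched in a few lines of Python. The function names, the dict-based "files", and the cache-on-read behaviour are illustrative assumptions, not HBase's actual block cache:

```python
# Toy sketch of the read path: a read is first looked up in an in-memory
# cache; only on a miss are the store files searched, and the value is
# then cached so a repeat read needs no "disk" access.
def make_reader(store_files):
    cache = {}
    stats = {"hits": 0, "misses": 0}

    def get(key):
        if key in cache:
            stats["hits"] += 1
            return cache[key]
        stats["misses"] += 1
        for hfile in reversed(store_files):  # newest file first
            if key in hfile:
                cache[key] = hfile[key]      # warm the cache for next time
                return hfile[key]
        return None

    return get, stats

get, stats = make_reader([{"row1": "v1"}, {"row2": "v2"}])
get("row1")   # miss: searched the store files, then cached the value
get("row1")   # hit: served from memory, no file access
```

The second lookup never touches the files, which is the "buffer cache" behaviour being compared to an RDBMS above.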

Re: mapreduce get error while update HBase

2016-10-21 Thread Ted Yu
Can you give us more information so that we can match the line numbers in the stack trace with actual code? The release of HBase, and the Hadoop version. If you can show a snippet of the related code, that would be nice. Thanks On Fri, Oct 21, 2016 at 11:16 AM, 乔彦克 wrote: > Hi, all > > I

mapreduce get error while update HBase

2016-10-21 Thread 乔彦克
Hi all, I use MapReduce to update HBase data and got this error just now, and I have no idea why it happened. Below is the error log: 2016-10-22 01:14:49,047 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException:

Re: Hbase fast access

2016-10-21 Thread Ted Yu
See some prior blog: http://www.cyanny.com/2014/03/13/hbase-architecture-analysis-part1-logical-architecture/ w.r.t. compaction in Hive, it is used to compact deltas into a base file (in the context of transactions). Likely they're different. Cheers On Fri, Oct 21, 2016 at 9:08 AM, Mich

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
Hi, Can someone explain, in a nutshell, HBase's use of the log-structured merge-tree (LSM-tree) as its data storage architecture? The idea of merging smaller files into larger files periodically to reduce disk seeks — is this a similar concept to compaction in HDFS or Hive? Thanks
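The LSM merge step asked about here can be sketched concretely: several small sorted files are merged into one larger sorted file, keeping only the newest version of each key, so later reads have to consult fewer files. The (key, seq, value) tuple layout is an illustrative assumption (higher seq means newer), not HBase's on-disk format:

```python
import heapq

# Minimal sketch of an LSM-tree compaction: merge several small sorted
# runs into one sorted output, keeping only the newest value per key.
def compact(files):
    newest = {}
    # heapq.merge streams the pre-sorted files in key order.
    for key, seq, value in heapq.merge(*files):
        if key not in newest or seq > newest[key][0]:
            newest[key] = (seq, value)
    return [(k, v) for k, (s, v) in sorted(newest.items())]

small_files = [
    [("a", 1, "old"), ("c", 1, "c1")],
    [("a", 2, "new"), ("b", 2, "b1")],
]
merged_file = compact(small_files)
# merged_file == [("a", "new"), ("b", "b1"), ("c", "c1")]
```

Note the stale version of "a" disappears in the merged output — that reclamation of superseded versions is the part HBase compaction and Hive's delta-to-base compaction have in common, even though the mechanisms differ.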

ImportTSV write to remote HDFS concurrently.

2016-10-21 Thread Vadim Vararu
Hi guys, I'm trying to run the importTSV job and write the result into a remote HDFS. Isn't it supposed to write data concurrently? Asking because I get the same time with 2 and 4 nodes, and I can see that there is only one reducer running. Where is the bottleneck? Thanks, Vadim.

Re: Hbase fast access

2016-10-21 Thread Mich Talebzadeh
Sorry, that should read Hive, not Spark, here: "Say compared to Spark that is basically a SQL layer relying on different engines (mr, Tez, Spark) to execute the code"

Re: Hbase fast access

2016-10-21 Thread Ted Yu
Mich: Here is a brief description of the HBase architecture: https://hbase.apache.org/book.html#arch.overview You can also get more details from Lars George's or Nick Dimiduk's books. HBase doesn't support SQL directly. There is no cost-based optimization. Cheers > On Oct 21, 2016, at 1:43 AM,

Re: Scan a region in parallel

2016-10-21 Thread Anil
Thank you Ram. Now it's clear. I will take a look at it. Thanks again. On 21 October 2016 at 14:25, ramkrishna vasudevan < ramkrishna.s.vasude...@gmail.com> wrote: > Phoenix does support intelligent ways when you query using columns since it > is a SQL engine. > > There the parallelism happens

Re: Scan a region in parallel

2016-10-21 Thread ramkrishna vasudevan
Phoenix does support intelligent ways when you query using columns, since it is a SQL engine. There the parallelism happens by using guideposts — those are fixed-spaced row keys stored in a separate stats table. So when you do a query, Phoenix internally spawns parallel scan queries using
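The guidepost idea described here can be sketched in a few lines: guideposts are evenly spaced row keys kept in a stats table, and a scan's [start, stop) range is cut at every guidepost that falls inside it, giving one chunk per thread. This is purely illustrative Python under those assumptions, not Phoenix's actual code:

```python
# Split one logical scan into parallel chunks at the guideposts that
# fall strictly inside the scan's [start, stop) key range.
def split_scan(start, stop, guideposts):
    cuts = [g for g in guideposts if start < g < stop]
    bounds = [start] + cuts + [stop]
    # Consecutive bounds form the sub-ranges, one per scan thread.
    return list(zip(bounds, bounds[1:]))

ranges = split_scan("row100", "row500",
                    ["row200", "row300", "row400", "row900"])
# "row900" lies outside the scan range, so it produces no cut
```

Because guideposts are evenly spaced by data volume, each resulting sub-scan covers roughly the same amount of data, which is what makes the parallelism effective.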

Hbase fast access

2016-10-21 Thread Mich Talebzadeh
Hi, This is a general question. Is HBase fast because it uses hash tables and provides random access, storing the data in indexed HDFS files for faster lookups? Say compared to Spark that is basically a SQL layer relying on different engines (mr, Tez, Spark) to execute the code

Re: Scan a region in parallel

2016-10-21 Thread Anil
Thank you Ram. "So now you are spawning that many scan threads, equal to the number of regions" - YES. There are two ways of scanning a region in parallel: 1. scan a region with a start row and stop row in parallel, with a single scan operation on the server side, and HBase takes care of the parallelism
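To illustrate the general idea of scanning a single region in parallel from the client side, here is a toy Python sketch: the region's [start, stop) key range is split into sub-ranges, each scanned by its own thread, and the results merged. `scan_range` merely filters a local sorted list and stands in for a real HBase client scan; all names and ranges are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "region": a sorted list of row keys standing in for stored rows.
rows = [f"row{i:03d}" for i in range(10)]

def scan_range(start, stop):
    # Stand-in for a client Scan over [start, stop).
    return [r for r in rows if start <= r < stop]

# Split the region's keyspace into sub-ranges and scan them in parallel.
sub_ranges = [("row000", "row005"), ("row005", "row010")]
with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(lambda sr: scan_range(*sr), sub_ranges))
merged = sorted(row for part in parts for row in part)
```

Since the sub-ranges are disjoint and cover the whole region, the merged result equals a single full scan — the trade-off is client complexity versus server-side parallelism.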