Re: Implementation of full table scan using Spark

2017-06-29 Thread Ted Yu
> @Ted Yu If full table scan does not read memstore then why am I getting the recently inserted data? I am pretty sure others may have seen this earlier and may not have noticed.
> @Jingcheng Thanks for your answer. If you are right, then my understanding was wrong. I will …

Re: Implementation of full table scan using Spark

2017-06-28 Thread Sachin Jain
@Ted Yu If full table scan does not read memstore then why am I getting the recently inserted data? I am pretty sure others may have seen this earlier and may not have noticed. @Jingcheng Thanks for your answer. If you are right, then my understanding was wrong. I will try to see the code …
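One empirical way to settle the memstore question, not taken from the thread: flush the table so the memstore is persisted to HFiles, then re-run the Spark scan. If freshly inserted rows only show up after an explicit flush, the scan path is skipping the memstore. A minimal sketch with the standard Admin API ("mytable" is a placeholder):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory

    object FlushBeforeScan {
      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val admin = conn.getAdmin
        // Force the memstore out to HFiles; "mytable" is a placeholder name.
        admin.flush(TableName.valueOf("mytable"))
        admin.close()
        conn.close()
      }
    }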

Re: Implementation of full table scan using Spark

2017-06-28 Thread Jingcheng Du
…table scan works fast because we are reading HFiles directly. I think the fast full table scan is because you run the scan in each region concurrently in Spark.

2017-06-29 11:33 GMT+08:00 Ted Yu:
> TableInputFormat doesn't read memstore.
> bq. I am inserting 10-20 entries only …

Re: Implementation of full table scan using Spark

2017-06-28 Thread Ted Yu
On Jun 28, 2017 at 8:15 PM, Sachin Jain wrote:
> Hi, I have used TableInputFormat and newAPIHadoopRDD defined on sparkContext to do a full table scan and get an RDD from it. Partial piece of code looks like this: …

Implementation of full table scan using Spark

2017-06-28 Thread Sachin Jain
Hi, I have used TableInputFormat and newAPIHadoopRDD defined on sparkContext to do a full table scan and get an RDD from it. Partial piece of code looks like this:

    sparkContext.newAPIHadoopRDD(
      HBaseConfigurationUtil.hbaseConfigurationForReading(table.getName.getNameWithNamespaceInclAsString…
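For reference, a self-contained sketch of the same pattern; the poster's HBaseConfigurationUtil helper is replaced with a plain HBaseConfiguration, and the table name is a placeholder:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object FullTableScan {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hbase-full-scan"))

        val conf = HBaseConfiguration.create()
        conf.set(TableInputFormat.INPUT_TABLE, "mytable") // placeholder table name

        // TableInputFormat yields roughly one input split per region, so the
        // resulting RDD reads all regions concurrently across the executors.
        val rdd = sc.newAPIHadoopRDD(
          conf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        println(s"rows scanned: ${rdd.count()}")
        sc.stop()
      }
    }

The one-partition-per-region behavior is the concurrency Jingcheng Du points to above as the reason the Spark full scan is fast.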

Re: Full table scan cost after deleting Millions of Records from HBase Table

2016-02-09 Thread Billy Watson
> …but just wanted to check that in terms of region-splits or compaction I won't run into issues. Can you think of any problems?
> 2. Let's say there are 6 million records in the table, then do a full table-scan querying a column family that has a single column; the value in …

Full table scan cost after deleting Millions of Records from HBase Table

2016-02-09 Thread houman
…compaction I won't run into issues. Can you think of any problems? 2. Let's say there are 6 million records in the table, then do a full table-scan querying a column family that has a single column; the value in the cell is either 1 or 0. Let's say it takes N seconds. Now I bulk delete …

Re: Full table scan from random starting point?

2014-01-31 Thread Jean-Marc Spaggiari
Hi Robert, You can randomly build your start key, give it to your scanner, scan until the end of the table, then give it as the end key for a new scanner. Doing that, you will scan the table the way you are looking for. Also, this might interest you: https://issues.apache.org/jira/browse/HBASE-9272 JM
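A minimal sketch of that wrap-around scan with the plain client API; the table name, the random key, and the per-row work are placeholders, and withStartRow/withStopRow correspond to setStartRow/setStopRow on older clients:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Result, ResultScanner, Scan}
    import org.apache.hadoop.hbase.util.Bytes

    object WrapAroundScan {
      // Placeholder for whatever per-row work the client does.
      def process(r: Result): Unit = println(Bytes.toString(r.getRow))

      // Iterate a scanner to exhaustion, then close it.
      def drain(scanner: ResultScanner): Unit = {
        var r = scanner.next()
        while (r != null) { process(r); r = scanner.next() }
        scanner.close()
      }

      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("mytable")) // placeholder

        // Randomly built start key (placeholder value).
        val start = Bytes.toBytes("k7f3")

        // Pass 1: from the random key to the end of the table.
        drain(table.getScanner(new Scan().withStartRow(start)))
        // Pass 2: from the beginning of the table up to the same key,
        // so every row is visited exactly once.
        drain(table.getScanner(new Scan().withStopRow(start)))

        table.close(); conn.close()
      }
    }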

Full table scan from random starting point?

2014-01-31 Thread Robert Dyer
Let's say I have one client on each of my regionservers. Each client needs to do a full scan on the same table. The order in which the rows are scanned by clients does not matter. Is it possible to have each client start at a random (or better, the first row located on the local region server) point in the table? …

Re: full table scan

2011-06-21 Thread Stack
Andre: As per Ted in the other thread, because you have 2GB only, are you sure that you are not swapping? Swapping will cause all to slow down. St.Ack

On Tue, Jun 21, 2011 at 12:02 AM, Andre Reiter wrote:
> Hi Stack, thanks a lot for the reply. Each row is about 2k on average; there are only 2 families …

Re: full table scan

2011-06-21 Thread Andre Reiter
Hi Stack, thanks a lot for the reply. Each row is about 2k on average; there are only 2 families.

Hardware:
- CPU: 2x AMD Opteron(tm) Processor 250 (2.4 GHz)
- disk: 500 GB, software RAID 1 (2x WDC WD5000AAKB-00H8A0, ATA)
- memory: 2 GB
- network: 1 Gbps Ethernet

Stack wrote:
> Sounds like …

Re: full table scan

2011-06-20 Thread Stack
Sounds like you are doing about 5k rows/second per server. What size rows? How many column families? What kind of hardware? St.Ack

On Mon, Jun 20, 2011 at 10:13 PM, Andre Reiter wrote:
> Sorry guys, still the same problem... my MR jobs are running not very fast... the job org.apache.hadoop.hbase.mapreduce.RowCounter …

Re: full table scan

2011-06-20 Thread Andre Reiter
Sorry guys, still the same problem... my MR jobs are running not very fast. The job org.apache.hadoop.hbase.mapreduce.RowCounter took 13 minutes to complete, while we do not have many rows, just 3223543 at the moment. We have 3 region servers, and the table is split over 13 regions on those 3 …

Re: full table scan

2011-06-12 Thread Stack
Thanks Ted. I misread.

On Jun 12, 2011, at 2:31, Ted Dunning wrote:
> He said 10^9. Easy to misread.
> On Sat, Jun 11, 2011 at 6:41 PM, Stack wrote:
>> On Sat, Jun 11, 2011 at 1:36 AM, Andre Reiter wrote:
>>> so what time can be expected for processing a full scan of e.g. 1.000.000.000 …

Re: full table scan

2011-06-12 Thread Ted Dunning
He said 10^9. Easy to misread.

On Sat, Jun 11, 2011 at 6:41 PM, Stack wrote:
> On Sat, Jun 11, 2011 at 1:36 AM, Andre Reiter wrote:
>> so what time can be expected for processing a full scan of e.g. 1.000.000.000 rows in an HBase cluster with e.g. 3 region servers?
> I don't think …

Re: full table scan

2011-06-11 Thread Stack
On Sat, Jun 11, 2011 at 1:36 AM, Andre Reiter wrote:
> so what time can be expected for processing a full scan of e.g. 1.000.000.000 rows in an HBase cluster with e.g. 3 region servers?
I don't think three servers and 1M rows (only) are enough data and resources to contrast and compare. Multipl…

Re: full table scan

2011-06-11 Thread Andre Reiter
Jean-Daniel Cryans wrote:
> You expect a MapReduce job to be faster than a Scan on small data; your expectation is wrong.
I never expected a MR job to be faster in every context.
> There's a minimal cost to every MR job, which is a few seconds, and you can't go around it.
For sure there is an …

Re: full table scan

2011-06-10 Thread Jean-Daniel Cryans
You expect a MapReduce job to be faster than a Scan on small data; your expectation is wrong. There's a minimal cost to every MR job, which is a few seconds, and you can't go around it. What other people have been trying to tell you is that you don't have enough data to benefit from the parallelism …

Re: full table scan

2011-06-07 Thread Andre Reiter
Cool, just one change, scan.setCaching(1000), reduced the processing time of my MR job from 60 sec to 10 sec! Nice :-) PS: now looking for other optimizations...

Stack wrote:
> See http://hbase.apache.org/book/performance.html St.Ack
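For context, a sketch of where that one-line change lives in a TableInputFormat MR job; the table name and the trivial row-counting mapper are placeholders, not the poster's actual job:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{TableMapReduceUtil, TableMapper}
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper}
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat

    // Trivial mapper that counts rows via a counter, RowCounter-style.
    class CountMapper extends TableMapper[Text, LongWritable] {
      override def map(key: ImmutableBytesWritable, value: Result,
                       context: Mapper[ImmutableBytesWritable, Result, Text, LongWritable]#Context): Unit =
        context.getCounter("scan", "rows").increment(1)
    }

    object CachedFullScan {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(HBaseConfiguration.create(), "cached-full-scan")
        job.setJarByClass(classOf[CountMapper])

        val scan = new Scan()
        scan.setCaching(1000)      // rows fetched per RPC; the old default of 1 makes scans very chatty
        scan.setCacheBlocks(false) // don't churn the block cache during a one-off full scan

        TableMapReduceUtil.initTableMapperJob("mytable", scan, // placeholder table
          classOf[CountMapper], classOf[Text], classOf[LongWritable], job)
        job.setOutputFormatClass(classOf[NullOutputFormat[Text, LongWritable]])
        job.setNumReduceTasks(0)
        job.waitForCompletion(true)
      }
    }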

Re: full table scan

2011-06-07 Thread Stack
See http://hbase.apache.org/book/performance.html St.Ack

On Tue, Jun 7, 2011 at 1:08 AM, Andre Reiter wrote:
> Now I found out that there are three regions, each on a particular region server (server2, server3, server4). The processing time is still >= 60 sec, which is not very impressive... …

Re: full table scan

2011-06-07 Thread Andre Reiter
Now I found out that there are three regions, each on a particular region server (server2, server3, server4). The processing time is still >= 60 sec, which is not very impressive... What can I do to speed up the table scan? Best regards, Andre

Andreas Reiter wrote:
> Hello everybody, I'm trying to scan my HBase table for reporting purposes …

Re: full table scan

2011-06-06 Thread Ted Yu
I think row counter would help you figure out the number of rows in each region. Refer to the following email thread, especially Stack's answer on Apr 1: "row_counter map reduce job & 0.90.1"

On Mon, Jun 6, 2011 at 3:07 PM, Andre Reiter wrote:
>> Check the web console.
> Ah, ok thanks! At …
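RowCounter reports one total for the whole table; as a client-side alternative, one can count rows per region by scanning each region's key range. A sketch, not the MR job Ted suggests, with a placeholder table name:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import scala.collection.JavaConverters._

    object RowsPerRegion {
      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val name = TableName.valueOf("mytable") // placeholder
        val locator = conn.getRegionLocator(name)
        val table = conn.getTable(name)

        for (loc <- locator.getAllRegionLocations.asScala) {
          val region = loc.getRegion // getRegionInfo on older client versions
          // Scan only this region's key range; empty start/stop keys mean
          // "from the beginning" / "to the end" for the edge regions.
          val scan = new Scan().withStartRow(region.getStartKey).withStopRow(region.getEndKey)
          scan.setCaching(1000)
          val scanner = table.getScanner(scan)
          var count = 0L
          var r = scanner.next()
          while (r != null) { count += 1; r = scanner.next() }
          scanner.close()
          println(s"${region.getRegionNameAsString}: $count rows")
        }
        table.close(); conn.close()
      }
    }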

Re: full table scan

2011-06-06 Thread Andre Reiter
> Check the web console.

Ah, ok, thanks! At port 60010 on the HBase master I actually found a web interface. There was only one region; I played a bit with it and executed the "Split" function twice. Now I have three regions, one on each HBase region server, but still the processing time did …

RE: full table scan

2011-06-06 Thread Doug Meil
Check the web console.

-----Original Message-----
From: Andre Reiter [mailto:a.rei...@web.de]
Sent: Monday, June 06, 2011 5:27 PM
To: user@hbase.apache.org
Subject: Re: full table scan

Good question... I have no idea... I did not explicitly define the number of regions for the table; how can …

Re: full table scan

2011-06-06 Thread Andre Reiter
From: Joey Echeverria
Sent: Mon Jun 06 2011 15:10:29 GMT+0200 (CET)
Subject: Re: full table scan
> How many regions does your table have?

Re: full table scan

2011-06-06 Thread Himanshu Vashishtha
Also, how big is each row? Are you using scanner cache? You're just fetching all the rows to the client, and then? 300k is not big (it seems you have 1-ish region, which could explain the similar timing). Add more data and MapReduce will pick up! Thanks, Himanshu

On Mon, Jun 6, 2011 at 8:59 AM, Christopher …

Re: full table scan

2011-06-06 Thread Christopher Tarnas
How many regions does your table have? If all of the data is still in one region, then you will be rate-limited by how fast that single region can be read. 3 nodes is also pretty small; the more nodes you have the better (at least 5 for dev and test, and 10+ for production, has been my experience).

Re: full table scan

2011-06-06 Thread Joey Echeverria
How many regions does your table have?

On Mon, Jun 6, 2011 at 4:48 AM, Andreas Reiter wrote:
> Hello everybody, I'm trying to scan my HBase table for reporting purposes. The cluster has 4 servers:
> - server1: namenode, secondary namenode, jobtracker, hbase master, zookeeper1
> - server2: …

full table scan

2011-06-06 Thread Andreas Reiter
Hello everybody, I'm trying to scan my HBase table for reporting purposes. The cluster has 4 servers:
- server1: namenode, secondary namenode, jobtracker, hbase master, zookeeper1
- server2: datanode, tasktracker, hbase regionserver, zookeeper2
- server3: datanode, tasktracker, hbase regionserver…

RE: Secondary Index versus Full Table Scan

2010-08-04 Thread Jonathan Gray
…in a couple weeks.

-----Original Message-----
From: Todd Lipcon [mailto:t...@cloudera.com]
Sent: Wednesday, August 04, 2010 2:15 PM
To: user@hbase.apache.org
Subject: Re: Secondary Index versus Full Table Scan

> On Wed, Aug 4, 2010 at 1:14 PM, Luke Forehand wrote: …

Re: Secondary Index versus Full Table Scan

2010-08-04 Thread Todd Lipcon
On Wed, Aug 4, 2010 at 1:14 PM, Luke Forehand <luke.foreh...@networkedinsights.com> wrote:
> Todd Lipcon writes:
>> The above is true if you assume you can only do one get at a time. In fact, you can probably pipeline gets, and there's actually a patch in the works for multiget support …

Re: Secondary Index versus Full Table Scan

2010-08-04 Thread Luke Forehand
Todd Lipcon writes:
> The above is true if you assume you can only do one get at a time. In fact, you can probably pipeline gets, and there's actually a patch in the works for multiget support - HBASE-1845. I don't think it's being actively worked on at the moment, though, so you'll have to …
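Multiget did eventually land in the client API; a sketch of batching many Gets into a single call (table name and row-key scheme are placeholders):

    import java.util.{ArrayList => JArrayList}
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Result}
    import org.apache.hadoop.hbase.util.Bytes

    object MultiGetSketch {
      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("mytable")) // placeholder

        // Batch many row keys into one client call instead of one RPC per Get.
        val gets = new JArrayList[Get]()
        for (i <- 0 until 1000) gets.add(new Get(Bytes.toBytes(f"row-$i%05d")))

        val results: Array[Result] = table.get(gets)
        results.foreach(r => if (!r.isEmpty) println(Bytes.toString(r.getRow)))

        table.close(); conn.close()
      }
    }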

Re: Secondary Index versus Full Table Scan

2010-08-03 Thread Todd Lipcon
> …like a top. Our import rate is now around 3 GB per job, which takes about 10 minutes. This is great. Now we are trying to tackle reading.
> With our current setup, a map reduce job with 24 mappers performing a full table scan of ~150 million records takes ~1 hour. …

Re: Secondary Index versus Full Table Scan

2010-08-03 Thread Luke Forehand
Hegner, Travis writes:
> Going out on a limb, I think it will perform MUCH faster with multiple copies, as the data is already sitting in each mapper's memory, ready to be accessed locally. The time to process per mapper should be very dramatically reduced. With that in mind, you only have …

RE: Secondary Index versus Full Table Scan

2010-08-03 Thread Hegner, Travis
…time. HTH, Travis Hegner http://www.travishegner.com/

-----Original Message-----
From: Luke Forehand [mailto:luke.foreh...@networkedinsights.com]
Sent: Tuesday, August 03, 2010 12:37 PM
To: user@hbase.apache.org
Subject: Re: Secondary Index versus Full Table Scan

Edward Capriolo writes: …

Re: Secondary Index versus Full Table Scan

2010-08-03 Thread Luke Forehand
Edward Capriolo writes:
> Generally speaking: if you are doing full range scans of a table, indexes will not help. Adding indexes will make the performance worse: it will take longer to load your data, and now fetching the data will involve two lookups instead of one.
> If you are doing …

Re: Secondary Index versus Full Table Scan

2010-08-03 Thread Edward Capriolo
> …is great. Now we are trying to tackle reading.
> With our current setup, a map reduce job with 24 mappers performing a full table scan of ~150 million records takes ~1 hour. This won't work for our use case, because not only are we continuing to add more data to …

Secondary Index versus Full Table Scan

2010-08-03 Thread Luke Forehand
…With our current setup, a map reduce job with 24 mappers performing a full table scan of ~150 million records takes ~1 hour. This won't work for our use case, because not only are we continuing to add more data to this table, but we are asking many more questions in a day. To increase performance, the first thought …