@Ted Yu If a full table scan does not read the memstore, then why am I getting
the recently inserted data? I am pretty sure others may have seen this earlier
and may not have noticed.
@Jingcheng Thanks for your answer. If that is true, then my understanding was
wrong. I will try to see the code.
> ... table scan works fast because we are reading HFiles directly.
I think the fast full table scan is because you run the scan in each region
concurrently in Spark.
2017-06-29 11:33 GMT+08:00 Ted Yu :
> TableInputFormat doesn't read memstore.
>
> bq. I am inserting 10-20 entries only
On Jun 28, 2017 at 8:15 PM, Sachin Jain wrote:
Hi,
I have used TableInputFormat and newAPIHadoopRDD defined on sparkContext to
do a full table scan and get an RDD from it.
Partial piece of code looks like this:
sparkContext.newAPIHadoopRDD(
  HBaseConfigurationUtil.hbaseConfigurationForReading(table.getName.getNameWithNamespaceInclAsString ...
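The snippet above is cut off. For a rough idea of the shape of such a call,
here is a minimal sketch with the stock client classes; plain
HBaseConfiguration stands in for the poster's own HBaseConfigurationUtil
helper, and "my_table" is a placeholder table name:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// local[*] just makes the sketch runnable standalone
val sparkContext = new SparkContext(
  new SparkConf().setAppName("hbase-full-scan").setMaster("local[*]"))

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder name

// TableInputFormat creates one split (one Spark partition) per region,
// so each region is scanned concurrently, as Jingcheng points out.
val rdd = sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"rows scanned: ${rdd.count()}")
sparkContext.stop()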
... but just wanted to check that in terms of region-splits or compaction I
won't run into issues. Can you think of any problems?
2. Let's say there are 6 million records in the table, and I do a full
table scan querying a column family that has a single column; the value in
the cell is either 1 or 0. Let's say it takes N seconds. Now I bulk delete ...
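A scan restricted to that one family and column might look like the following
minimal sketch; the table name, family, and qualifier are placeholders, and it
assumes the flag is stored as a single byte:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("my_table")) // placeholder
// Restrict the scan to the single flag column so no other family is read.
val scan = new Scan().addColumn(Bytes.toBytes("flags"), Bytes.toBytes("f"))
val scanner = table.getScanner(scan)
var ones = 0L
val it = scanner.iterator()
while (it.hasNext) {
  val value = it.next().getValue(Bytes.toBytes("flags"), Bytes.toBytes("f"))
  if (value != null && value(0) == 1.toByte) ones += 1 // assumed byte encoding
}
scanner.close()
connection.close()
println(s"rows with flag = 1: $ones")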
Hi Robert,
You can randomly build your start key, give it to your scanner, and scan until
the end of the table; then give the same key as the end key for a new scanner
starting at the beginning of the table. Doing that, you will scan the whole
table in the order you are looking for.
Also, this might interest you:
https://issues.apache.org/jira/browse/HBASE-9272
JM
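A minimal sketch of that wrap-around pattern with the current client API; the
table name is a placeholder, and the random 8-byte start key is just one way
to pick a point in the key space:

import java.security.SecureRandom
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Result, Scan}

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("my_table")) // placeholder

val randomStart = new Array[Byte](8)
new SecureRandom().nextBytes(randomStart) // random point in the key space

def drain(scan: Scan)(handle: Result => Unit): Unit = {
  val scanner = table.getScanner(scan)
  try {
    val it = scanner.iterator()
    while (it.hasNext) handle(it.next())
  } finally scanner.close()
}

// First leg: from the random start key to the end of the table.
drain(new Scan().withStartRow(randomStart))(r => ()) // put real work here
// Second leg: from the beginning of the table up to the same key.
drain(new Scan().withStopRow(randomStart))(r => ())

connection.close()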
Let's say I have one client on each of my regionservers. Each client needs
to do a full scan on the same table. The order in which the rows are
scanned by clients does not matter.
Is it possible to have each client start at a random (or better, the first
row located on the local RS) point in the table?
Andre:
As per Ted in the other thread, because you have 2GB only, are you sure
that you are not swapping? Swapping will cause everything to slow down.
St.Ack
On Tue, Jun 21, 2011 at 12:02 AM, Andre Reiter wrote:
Hi Stack,
thanks a lot for the reply
each row is about 2k on average, there are only 2 families
hardware:
CPU: 2x AMD Opteron(tm) Processor 250 (2.4GHz)
disk: 500 GB, software raid raid1 (2x WDC WD5000AAKB-00H8A0, ATA DISK drive)
memory: 2 GB
network: 1 Gbps Ethernet
Stack wrote:
Sounds like you are doing about 5k rows/second per server.
What size rows? How many column families? What kind of hardware?
St.Ack
On Mon, Jun 20, 2011 at 10:13 PM, Andre Reiter wrote:
sorry guys,
still the same problem... my MR jobs are not running very fast...
the job org.apache.hadoop.hbase.mapreduce.RowCounter took 13 minutes to
complete, while we do not have many rows, just 3223543
at the moment we have 3 region servers, and the table is split over 13
regions on those 3 servers
Thanks Ted. I misread.
On Jun 12, 2011, at 2:31, Ted Dunning wrote:
He said 10^9. Easy to misread.
On Sat, Jun 11, 2011 at 6:41 PM, Stack wrote:
On Sat, Jun 11, 2011 at 1:36 AM, Andre Reiter wrote:
> so what time can be expected for processing a full scan of e.g.
> 1.000.000.000 rows in an hbase cluster with e.g. 3 region servers?
>
I don't think three servers and 1M rows (only) are enough data and
resources for a contrast-and-compare. Multipl...
Jean-Daniel Cryans wrote:
> You expect a MapReduce job to be faster than a Scan on small data,
> your expectation is wrong.
I never expected an MR job to be faster in every context.
> There's a minimal cost to every MR job, which is of a few seconds, and
> you can't go around it.
for sure there is an ...
You expect a MapReduce job to be faster than a Scan on small data;
your expectation is wrong.
There's a minimal cost to every MR job, which is of a few seconds, and
you can't go around it.
What other people have been trying to tell you is that you don't have
enough data to benefit from the parallelism.
cool, just one change
scan.setCaching(1000);
reduced the processing time of my MR job from 60sec to 10sec!
nice :-)
PS: now looking for other optimizations...
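For reference, a minimal sketch of that change with placeholder names; the
same Scan object, with caching configured, is also what an MR job would be
initialized with via TableMapReduceUtil:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}

val scan = new Scan()
scan.setCaching(1000)      // ship 1000 rows per RPC; far fewer round trips
scan.setCacheBlocks(false) // usually recommended for one-off full scans

// For a plain client-side scan, the configured Scan is used directly:
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("my_table")) // placeholder
val scanner = table.getScanner(scan)
val it = scanner.iterator()
var rows = 0L
while (it.hasNext) { it.next(); rows += 1 }
scanner.close()
connection.close()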
See http://hbase.apache.org/book/performance.html
St.Ack
On Tue, Jun 7, 2011 at 1:08 AM, Andre Reiter wrote:
now I found out that there are three regions, each on a particular region
server (server2, server3, server4)
the processing time is still >=60sec, which is not very impressive...
what can I do to speed up the table scan?
best regards
andre
I think the RowCounter job would help you figure out the number of rows in
each region.
Refer to the following email thread, especially Stack's answer on Apr 1:
row_counter map reduce job & 0.90.1
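For reference, RowCounter is launched from the command line like this (the
table name is a placeholder):

hbase org.apache.hadoop.hbase.mapreduce.RowCounter my_table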
On Mon, Jun 6, 2011 at 3:07 PM, Andre Reiter wrote:
> Check the web console.
ah, ok thanks!
at port 60010 on the hbase master I actually found a web interface
there was only one region; I played a bit with it and executed the "Split"
function twice. Now I have three regions, one on each hbase region server,
but still the processing time did not improve.
Check the web console.
-Original Message-
From: Andre Reiter [mailto:a.rei...@web.de]
Sent: Monday, June 06, 2011 5:27 PM
To: user@hbase.apache.org
Subject: Re: full table scan
good question... I have no idea...
I did not define explicitly the number of regions for the table; how can I
find out?
From: Joey Echeverria
Sent: Mon Jun 06 2011 15:10:29 GMT+0200 (CET)
Subject: Re: full table scan
How many regions does your table have?
Also,
how big is each row? Are you using scanner caching? Are you just fetching all
the rows to the client, and then what?
300k is not big (it seems you have ~1 region, which could explain the similar
timing). Add more data and MapReduce will pick up!
Thanks,
Himanshu
On Mon, Jun 6, 2011 at 8:59 AM, Christopher wrote:
How many regions does your table have? If all of the data is still in one
region then you will be rate limited by how fast that single region can be
read. 3 nodes is also pretty small, the more nodes you have the better (at
least 5 for dev and test and 10+ for production has been my experience).
How many regions does your table have?
On Mon, Jun 6, 2011 at 4:48 AM, Andreas Reiter wrote:
hello everybody
I'm trying to scan my hbase table for reporting purposes
the cluster has 4 servers:
- server1: namenode, secondary namenode, jobtracker, hbase master, zookeeper1
- server2: datanode, tasktracker, hbase regionserver, zookeeper2
- server3: datanode, tasktracker, hbase regionserver, zookeeper3
- server4: datanode, tasktracker, hbase regionserver
Todd Lipcon writes:
> The above is true if you assume you can only do one get at a time. In fact,
> you can probably pipeline gets, and there's actually a patch in the works
> for multiget support - HBASE-1845. I don't think it's being actively worked
> on at the moment, though, so you'll have to ...
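Batched gets did later become part of the standard client API; a minimal
sketch with a placeholder table name and made-up row keys:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("my_table")) // placeholder
// Many Gets submitted as one batch instead of one RPC per row.
val gets = (1 to 100).map(i => new Get(Bytes.toBytes(s"row-$i"))).asJava
val results = table.get(gets) // Result[] in the same order as the requests
println(results.count(r => r != null && !r.isEmpty))
connection.close()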
> ... like a top. Our import rate is now around 3 GB per job which takes
> about 10 minutes. This is great. Now we are trying to tackle reading.
>
> With our current setup, a map reduce job with 24 mappers performing a full
> table scan of ~150 million records takes ~1 hour.
Hegner, Travis writes:
> Going out on a limb, I think it will perform MUCH faster with multiple
> copies, as the data is already sitting in each mapper's memory, ready to
> be accessed locally. The time to process per mapper should be very
> dramatically reduced. With that in mind, you only have ...
... time.
HTH,
Travis Hegner
http://www.travishegner.com/
-Original Message-
From: Luke Forehand [mailto:luke.foreh...@networkedinsights.com]
Sent: Tuesday, August 03, 2010 12:37 PM
To: user@hbase.apache.org
Subject: Re: Secondary Index versus Full Table Scan
Edward Capriolo writes:
> Generally speaking: if you are doing full range scans of a table,
> indexes will not help. Adding indexes will make the performance worse:
> it will take longer to load your data, and fetching the data will
> involve two lookups instead of one.
>
> If you are doing fu...
With our current setup, a map reduce job with 24 mappers performing a full
table scan of ~150 million records takes ~1 hour. This won't work for our use
case, because not only are we continuing to add more data to this table, but
we are asking many more questions in a day. To increase performance, the
first thought ...