Hi Jeszy

One option for large data extracts is to create a text table in HDFS, then
use "hdfs dfs -copyToLocal
<https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#copyToLocal>"
to extract the text file(s). For large data dumps, storing the data in HDFS
and then extracting it is likely to be faster than impala-shell or JDBC.
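
The general pattern looks like this (a minimal sketch; the database and
table names are placeholders, and it assumes the default Hive warehouse
location in HDFS):

impala-shell -q "drop table if exists mydb.mytable_export; create table
mydb.mytable_export stored as textfile as select * from mydb.mytable;"
hdfs dfs -copyToLocal /user/hive/warehouse/mydb.db/mytable_export .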

When using impala-shell, it is best practice to pass -B to speed up fetching
the results; -B produces plain delimited output instead of pretty-printed
tables, which is much faster for large result sets.
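
For example (a sketch with placeholder table and file names; with -B the
default delimiter is a tab, and --output_delimiter overrides it):

impala-shell -B --output_delimiter=',' \
    -q "select * from mydb.mytable" \
    -o mytable.csv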

Below are the results of two experiments: storing the results in an HDFS
text table and then copying the file out of HDFS took 14 seconds, compared
to 40 seconds when using impala-shell alone.

The lineitem table was used; the output data size is 390MB, which is
comparable to your use case.

*Write to HDFS then extract*

time (impala-shell -q "drop table if exists tpch_300_parquet.lineitem_3m;
create table tpch_300_parquet.lineitem_3m stored as textfile as select *
from tpch_300_parquet.lineitem limit 3000000;" ; hdfs dfs -copyToLocal
hdfs://vd1315.domain.com:8020/user/hive/warehouse/tpch_300_parquet.db/lineitem_3m .;
du -sh lineitem_3m )
Starting Impala Shell without Kerberos authentication
Connected to server1.domain.com:21000
Server version: impalad version 2.9.0-SNAPSHOT RELEASE (build
dc25acd558e9388290503e0e1100baac2b3c800b)
Query: drop table if exists tpch_300_parquet.lineitem_3m
Query: create table tpch_300_parquet.lineitem_3m stored as textfile as
select * from tpch_300_parquet.lineitem limit 3000000
Query submitted at: 2017-04-28 09:26:23 (Coordinator:
http://server1.domain.com:25000)
Query progress can be monitored at:
http://server1.domain.com:25000/query_plan?query_id=a24f86518c3de1ce:2f35ef9b00000000
+-------------------------+
| summary                 |
+-------------------------+
| Inserted 3000000 row(s) |
+-------------------------+
Fetched 1 row(s) in 11.23s
390M lineitem_3m

*real 0m14.392s*
user 0m3.503s
sys 0m0.655s


*Extract results using impala-shell*

[mmokhtar@server1 ~]$ time (impala-shell -B -q "select * From
tpch_300_parquet.lineitem_3m " > lineitem_3m_impala_shell.txt; du -sh
lineitem_3m_impala_shell.txt; )
Starting Impala Shell without Kerberos authentication
Connected to server1.domain.com:21000
Server version: impalad version 2.9.0-SNAPSHOT RELEASE (build
dc25acd558e9388290503e0e1100baac2b3c800b)
Query: select * From tpch_300_parquet.lineitem_3m
Query submitted at: 2017-04-28 09:27:39 (Coordinator:
http://server1.domain.com:25000)
Query progress can be monitored at:
http://server1.domain.com:25000/query_plan?query_id=ee421b6d4a2226d3:8acbb75f00000000
Fetched 3000000 row(s) in 38.88s
390M lineitem_3m_impala_shell.txt

*real 0m39.152s*
user 0m26.012s
sys 0m0.668s


On Fri, Apr 28, 2017 at 7:26 AM, Jeszy <jes...@gmail.com> wrote:

> Hey,
>
> It looks like all the time is spent waiting for the client to fetch the
> results:
>      - ClientFetchWaitTimer: 17m31s
>
> Try doing:
> impala-shell -B -q '<query>'
>
> HTH
>
> 2017-04-28 14:51 GMT+02:00 吴朱华 <ikew...@gmail.com>:
> > Maybe I will just paste the main parts in the mail, and congratulations
> > on the IPO.
> >
> > Unregister query: 17m42s (17m42s)
> >
> > Fetched 317246 row(s) in 1062.84s
> > Query Runtime Profile:
> > Query (id=8149e2439f43b15a:f08e570d7fbf1085):
> >   Summary:
> >     Session ID: 35436d1112b79287:9045c79c795858a5
> >     Session Type: BEESWAX
> >     Start Time: 2017-04-28 11:50:00.292615000
> >     End Time: 2017-04-28 12:07:43.133484000
> >     Query Type: QUERY
> >     Query State: FINISHED
> >     Query Status: OK
> >     Impala Version: impalad version 2.5.0-cdh5-INTERNAL RELEASE (build
> > 43880282edc04c03c162bbea6fc85b5388e7fdde)
> >     User: impala
> >     Connected User: impala
> >     Delegated User:
> >     Network Address: ::ffff:10.44.10.186:36325
> >     Default Db: sjzy
> >     Sql Statement: select
> > MRECID,UNITID,PCQDM,PCQMC,PCXQDM,PCXQMC,DM,H001,H002,
> > H003,H021,H022,H023,H024,H025,H026A,H026B,H026C,H026D,H026E,
> > H026F,H026G,H027,H028,H029,H030,H031,H032,H033,H034,H035,
> > H036,H037A,H037B,H037C,H038,H039,H040,H041,H042,H043A,
> > H043B,H043C,H043D,H043E,H043F,H043G,H043H,H043I,H043J,H043K,
> > H043L,H044A,H044B,H044C,H044D,H044E,H044F,H044G,H044H,H044I,
> > H050,H051,H052,H053,H054,H055,H056,H061,H062,H063,H064,H065,
> > H066,H070,H071,H072,H073,H074,H075,H080,H100,H111,H112,H113,
> > H120,H200,H201,H202,H203,H204,H205,H206,H207,H208,H209,H210,
> > H211,H300,H320,H321,H322,H323,H324,H400,H401,H402,H403,H404,
> > H405,H406,H500,H600,H601,H602,H603,H604,H605,H606,H607,H608,
> > H609,H610,H611,H612,H613,H614,H615,H616,H621A,H621B,H621C,
> > H621D,H621E,H621F,H622A,H622B,H622C,H801,H802,H803,H804,H901,H902,H903
> > FROM NP_2017_NP601 WHERE DS_AREACODE LIKE '445281%'
> >     Coordinator: node1.sky.org:22000
> >     Query Options (non default):
> >     Plan:
> > ----------------
> > Estimated Per-Host Requirements: Memory=4.50GB VCores=1
> >
> > 01:EXCHANGE [UNPARTITIONED]
> > |  hosts=4 per-host-mem=unavailable
> > |  tuple-ids=0 row-size=1.67KB cardinality=1155911
> > |
> > 00:SCAN HDFS [sjzy.np_2017_np601, RANDOM]
> >    partitions=1/1 files=20 size=1.06GB
> >    predicates: DS_AREACODE LIKE '445281%'
> >    table stats: 11559109 rows total
> >    column stats: all
> >    hosts=4 per-host-mem=4.50GB
> >    tuple-ids=0 row-size=1.67KB cardinality=1155911
> > ----------------
> >     Estimated Per-Host Mem: 4831838208
> >     Estimated Per-Host VCores: 1
> >     Request Pool: default-pool
> >     ExecSummary:
> > Operator       #Hosts  Avg Time  Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
> > --------------------------------------------------------------------------------------------------------
> > 01:EXCHANGE         1  32.314ms  32.314ms  317.25K       1.16M          0        -1.00 B  UNPARTITIONED
> > 00:SCAN HDFS       20   1s137ms   1s348ms  317.25K       1.16M  163.85 MB        4.50 GB  sjzy.np_2017_np601
> >     Planner Timeline: 53.683ms
> >        - Analysis finished: 24.565ms (24.565ms)
> >        - Equivalence classes computed: 26.389ms (1.823ms)
> >        - Single node plan created: 33.607ms (7.218ms)
> >        - Runtime filters computed: 33.684ms (76.568us)
> >        - Distributed plan created: 39.125ms (5.441ms)
> >        - Planning finished: 53.683ms (14.558ms)
> >     Query Timeline: 17m42s
> >        - Start execution: 43.792us (43.792us)
> >        - Planning finished: 60.640ms (60.596ms)
> >        - Ready to start 20 remote fragments: 65.111ms (4.471ms)
> >        - All 20 remote fragments started: 74.572ms (9.461ms)
> >        - Rows available: 744.300ms (669.728ms)
> >        - First row fetched: 790.128ms (45.828ms)
> >        - Unregister query: 17m42s (17m42s)
> >   ImpalaServer:
> >      - ClientFetchWaitTimer: 17m31s
> >      - RowMaterializationTimer: 10s024ms
> >
> > 2017-04-28 19:44 GMT+08:00 Jim Apple <jbap...@cloudera.com>:
> >
> >> dev@ does not appear to accept attachments. You can upload it somewhere
> >> and
> >> post a link, though.
> >>
> >> On Thu, Apr 27, 2017 at 11:35 PM, 吴朱华 <ikew...@gmail.com> wrote:
> >>
> >> > Oops, I just resent it, you know the Chinese network ^_^
> >> >
> >> > 2017-04-28 14:20 GMT+08:00 Mostafa Mokhtar <mmokh...@cloudera.com>:
> >> >
> >> >> Btw the profile wasn't attached.
> >> >> Please resend.
> >> >>
> >> >> On Thu, Apr 27, 2017 at 11:11 PM, 吴朱华 <ikew...@gmail.com> wrote:
> >> >>
> >> >>> Profile is in the attachment, thanks
> >> >>>
> >> >>>
> >> >>> 2017-04-28 13:10 GMT+08:00 Dimitris Tsirogiannis <
> >> >>> dtsirogian...@cloudera.com>:
> >> >>>
> >> >>>> Maybe you also want to post some information about the schema (how
> >> >>>> wide your table is, does it use nested types, etc) as well as the
> >> >>>> profile of the slow query.
> >> >>>>
> >> >>>> Dimitris
> >> >>>>
> >> >>>> On Thu, Apr 27, 2017 at 9:30 PM, 吴朱华 <ikew...@gmail.com> wrote:
> >> >>>>
> >> >>>> > Hi guys:
> >> >>>> > We are facing a big issue when running select * from a big table.
> >> >>>> > The performance is 17 minutes for retrieving 400MB of data. It is
> >> >>>> > even slower over JDBC.
> >> >>>> > Is there any way to improve it?^_^
> >> >>>> >
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >> >
> >>
>
