Re: big issue on retrieving 400MB data

吴朱华 Fri, 28 Apr 2017 23:16:46 -0700

thx，yufeng，your solution save the day^_^

2017-04-29 13:50 GMT+08:00 yu feng <[email protected]>:


> We has faced this problem before with JDBC, this query speed too much time
> on fetching result, and the query is running. which will occupy slot of
> request pool(when we configured admission control), we have to cancel the
> query after a Certain time time such as 5 minutes.
>
> In our solution, JDBC fetch result with default fetchSize = 50, we try to
> set fetchSize = 10000（by call setFetchSize()） and the acceleration is
> obvious（in our test, result size is 2000+W and this can speed up from more
> than 1 hour to 500 seconds.）.
>
> 2017-04-29 0:39 GMT+08:00 Mostafa Mokhtar <[email protected]>:
>
> > Hi Jeszy
> >
> >
> > One option for large data extracts is to create a text table in HDFS then
> > use "hdfs dfs -copyToLocal
> > <https://hadoop.apache.org/docs/current/hadoop-project-
> dist/hadoop-common/
> > FileSystemShell.html#copyToLocal>"
> > to extract the text file(s), for large data dumps storing the data in
> HDFS
> > then extracting it is likely to be faster than impala-shell or JDCB.
> >
> > When using impala-shell it is best practice to use -B to speedup fetching
> > the results.
> >
> > Below are the results for two experiments, storing the results in an HDFS
> > text table and writing the out of HDFS file took 14 seconds compared to
> 40
> > seconds when using just impala-shell.
> >
> > Lineitem table was used, output data size is 390MB which is comparable to
> > your use case.
> >
> > *Write to HDFS then extract*
> >
> > time (impala-shell -q "drop table if exists tpch_300_parquet.lineitem_3m;
> > create table tpch_300_parquet.lineitem_3m stored as textfile as select *
> > from tpch_300_parquet.lineitem limit 3000000;" ; hdfs dfs -copyToLocal
> > hdfs://
> > vd1315.domain.com:8020/user/hive/warehouse/tpch_300_
> parquet.db/lineitem_3m
> > .; du -sh lineitem_3m )
> > Starting Impala Shell without Kerberos authentication
> > Connected to server1.domain.com:21000
> > Server version: impalad version 2.9.0-SNAPSHOT RELEASE (build
> > dc25acd558e9388290503e0e1100baac2b3c800b)
> > Query: drop table if exists tpch_300_parquet.lineitem_3m
> > Query: create table tpch_300_parquet.lineitem_3m stored as textfile as
> > select * from tpch_300_parquet.lineitem limit 3000000
> > Query submitted at: 2017-04-28 09:26:23 (Coordinator:
> > http://server1.domain.com:25000)
> > Query progress can be monitored at:
> > http://server1.domain.com:25000/query_plan?query_id=a24f86518c3de1ce:
> > 2f35ef9b00000000
> > +-------------------------+
> > | summary                 |
> > +-------------------------+
> > | Inserted 3000000 row(s) |
> > +-------------------------+
> > Fetched 1 row(s) in 11.23s
> > 390M lineitem_3m
> >
> > *real 0m14.392s*
> > user 0m3.503s
> > sys 0m0.655s
> >
> >
> > *Extract results using impala-shell*
> >
> > [mmokhtar@server1 ~]$ time (impala-shell -B -q "select * From
> > tpch_300_parquet.lineitem_3m " > lineitem_3m_impala_shell.txt; du -sh
> > lineitem_3m_impala_shell.txt; )
> > Starting Impala Shell without Kerberos authentication
> > Connected to server1.domain.com:21000
> > Server version: impalad version 2.9.0-SNAPSHOT RELEASE (build
> > dc25acd558e9388290503e0e1100baac2b3c800b)
> > Query: select * From tpch_300_parquet.lineitem_3m
> > Query submitted at: 2017-04-28 09:27:39 (Coordinator:
> > http://server1.domain.com:25000)
> > Query progress can be monitored at:
> > http://server1.domain.com:25000/query_plan?query_id=ee421b6d4a2226d3:
> > 8acbb75f00000000
> > Fetched 3000000 row(s) in 38.88s
> > 390M lineitem_3m_impala_shell.txt
> >
> > *real 0m39.152s*
> > user 0m26.012s
> > sys 0m0.668s
> >
> >
> > On Fri, Apr 28, 2017 at 7:26 AM, Jeszy <[email protected]> wrote:
> >
> > > Hey,
> > >
> > > It looks like all the time is spent waiting for the client to fetch the
> > > results:
> > >      - ClientFetchWaitTimer: 17m31s
> > >
> > > Try doing:
> > > impala-shell -B -q '<query>'
> > >
> > > HTH
> > >
> > > 2017-04-28 14:51 GMT+02:00 吴朱华 <[email protected]>:
> > > > Maybe I just paste some main thing on mail , and congratulation on
> IPO
> > > > thing.
> > > >
> > > > Unregister query: 17m42s (17m42s)
> > > >
> > > > Fetched 317246 row(s) in 1062.84s
> > > > Query Runtime Profile:
> > > > Query (id=8149e2439f43b15a:f08e570d7fbf1085):
> > > >   Summary:
> > > >     Session ID: 35436d1112b79287:9045c79c795858a5
> > > >     Session Type: BEESWAX
> > > >     Start Time: 2017-04-28 11:50:00.292615000
> > > >     End Time: 2017-04-28 12:07:43.133484000
> > > >     Query Type: QUERY
> > > >     Query State: FINISHED
> > > >     Query Status: OK
> > > >     Impala Version: impalad version 2.5.0-cdh5-INTERNAL RELEASE
> (build
> > > > 43880282edc04c03c162bbea6fc85b5388e7fdde)
> > > >     User: impala
> > > >     Connected User: impala
> > > >     Delegated User:
> > > >     Network Address: ::ffff:10.44.10.186:36325
> > > >     Default Db: sjzy
> > > >     Sql Statement: select
> > > > MRECID,UNITID,PCQDM,PCQMC,PCXQDM,PCXQMC,DM,H001,H002,
> > > H003,H021,H022,H023,H024,H025,H026A,H026B,H026C,H026D,H026E,
> > > H026F,H026G,H027,H028,H029,H030,H031,H032,H033,H034,H035,
> > > H036,H037A,H037B,H037C,H038,H039,H040,H041,H042,H043A,
> > > H043B,H043C,H043D,H043E,H043F,H043G,H043H,H043I,H043J,H043K,
> > > H043L,H044A,H044B,H044C,H044D,H044E,H044F,H044G,H044H,H044I,
> > > H050,H051,H052,H053,H054,H055,H056,H061,H062,H063,H064,H065,
> > > H066,H070,H071,H072,H073,H074,H075,H080,H100,H111,H112,H113,
> > > H120,H200,H201,H202,H203,H204,H205,H206,H207,H208,H209,H210,
> > > H211,H300,H320,H321,H322,H323,H324,H400,H401,H402,H403,H404,
> > > H405,H406,H500,H600,H601,H602,H603,H604,H605,H606,H607,H608,
> > > H609,H610,H611,H612,H613,H614,H615,H616,H621A,H621B,H621C,
> > > H621D,H621E,H621F,H622A,H622B,H622C,H801,H802,H803,H804,H901,H902,H903
> > > > FROM NP_2017_NP601 WHERE DS_AREACODE LIKE '445281%'
> > > >     Coordinator: node1.sky.org:22000
> > > >     Query Options (non default):
> > > >     Plan:
> > > > ----------------
> > > > Estimated Per-Host Requirements: Memory=4.50GB VCores=1
> > > >
> > > > 01:EXCHANGE [UNPARTITIONED]
> > > > |  hosts=4 per-host-mem=unavailable
> > > > |  tuple-ids=0 row-size=1.67KB cardinality=1155911
> > > > |
> > > > 00:SCAN HDFS [sjzy.np_2017_np601, RANDOM]
> > > >    partitions=1/1 files=20 size=1.06GB
> > > >    predicates: DS_AREACODE LIKE '445281%'
> > > >    table stats: 11559109 rows total
> > > >    column stats: all
> > > >    hosts=4 per-host-mem=4.50GB
> > > >    tuple-ids=0 row-size=1.67KB cardinality=1155911
> > > > ----------------
> > > >     Estimated Per-Host Mem: 4831838208
> > > >     Estimated Per-Host VCores: 1
> > > >     Request Pool: default-pool
> > > >     ExecSummary:
> > > > Operator       #Hosts  Avg Time  Max Time    #Rows  Est. #Rows   Peak
> > Mem
> > > >  Est. Peak Mem  Detail
> > > > ------------------------------------------------------------
> > > -------------------------------------------------
> > > > 01:EXCHANGE         1  32.314ms  32.314ms  317.25K       1.16M
> >   0
> > > >      -1.00 B  UNPARTITIONED
> > > > 00:SCAN HDFS       20   1s137ms   1s348ms  317.25K       1.16M
> 163.85
> > MB
> > > >      4.50 GB  sjzy.np_2017_np601
> > > >     Planner Timeline: 53.683ms
> > > >        - Analysis finished: 24.565ms (24.565ms)
> > > >        - Equivalence classes computed: 26.389ms (1.823ms)
> > > >        - Single node plan created: 33.607ms (7.218ms)
> > > >        - Runtime filters computed: 33.684ms (76.568us)
> > > >        - Distributed plan created: 39.125ms (5.441ms)
> > > >        - Planning finished: 53.683ms (14.558ms)
> > > >     Query Timeline: 17m42s
> > > >        - Start execution: 43.792us (43.792us)
> > > >        - Planning finished: 60.640ms (60.596ms)
> > > >        - Ready to start 20 remote fragments: 65.111ms (4.471ms)
> > > >        - All 20 remote fragments started: 74.572ms (9.461ms)
> > > >        - Rows available: 744.300ms (669.728ms)
> > > >        - First row fetched: 790.128ms (45.828ms)
> > > >        - Unregister query: 17m42s (17m42s)
> > > >   ImpalaServer:
> > > >      - ClientFetchWaitTimer: 17m31s
> > > >      - RowMaterializationTimer: 10s024ms
> > > >
> > > > 2017-04-28 19:44 GMT+08:00 Jim Apple <[email protected]>:
> > > >
> > > >> dev@ does not appear to accept attachments. You can upload it
> > somewhere
> > > >> and
> > > >> post a link, though.
> > > >>
> > > >> On Thu, Apr 27, 2017 at 11:35 PM, 吴朱华 <[email protected]> wrote:
> > > >>
> > > >> > Oops, I just resend it, you know the chinese network^_^
> > > >> >
> > > >> > 2017-04-28 14:20 GMT+08:00 Mostafa Mokhtar <[email protected]
> >:
> > > >> >
> > > >> >> Btw the profile wasn't attached.
> > > >> >> Please resend.
> > > >> >>
> > > >> >> On Thu, Apr 27, 2017 at 11:11 PM, 吴朱华 <[email protected]> wrote:
> > > >> >>
> > > >> >>> Profile is in the attachment, thanks
> > > >> >>>
> > > >> >>>
> > > >> >>> 2017-04-28 13:10 GMT+08:00 Dimitris Tsirogiannis <
> > > >> >>> [email protected]>:
> > > >> >>>
> > > >> >>>> Maybe you also want to post some information about the schema
> > (how
> > > >> wide
> > > >> >>>> your table is, does it use nested types, etc) as well as the
> > > profile
> > > >> of
> > > >> >>>> the
> > > >> >>>> slow query.
> > > >> >>>>
> > > >> >>>> Dimitris
> > > >> >>>>
> > > >> >>>> On Thu, Apr 27, 2017 at 9:30 PM, 吴朱华 <[email protected]>
> wrote:
> > > >> >>>>
> > > >> >>>> > Hi guys:
> > > >> >>>> > we can facing a big issue when select * from a big table.
> > > >> >>>> > The performance is 17minutes for retrieving 400MB data. Even
> > slow
> > > >> >>>> under
> > > >> >>>> > JDBC situation.
> > > >> >>>> > Is there anyway to improve it?^_^
> > > >> >>>> >
> > > >> >>>>
> > > >> >>>
> > > >> >>>
> > > >> >>
> > > >> >
> > > >>
> > >
> >
>

Re: big issue on retrieving 400MB data

Reply via email to