See my response on the other thread you started. The probe side of a join is executed in a single thread per host. Impala can run multiple join builds in parallel, but each build uses only a single thread. A single query might not be able to max out your CPU, but most realistic workloads run several queries concurrently.
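If you want to confirm where the time is going for a single run, here is a rough sketch from impala-shell (illustrative only; EXPLAIN_LEVEL and MT_DOP are the same options that appear in the "set;" dump below, and whether MT_DOP > 0 is accepted for join queries depends on your Impala release):

    -- illustrative sketch, run in impala-shell
    SET EXPLAIN_LEVEL=2;        -- more detailed plan output
    EXPLAIN <your join query>;  -- confirm which side is the build and which is the probe
    -- ... run the query itself, then:
    PROFILE;                    -- per-node timings; check how much time the hash join probe takes
    -- optional and version-dependent: intra-query parallelism (default 0)
    SET MT_DOP=4;               -- hypothetical value; some releases reject MT_DOP > 0 for joins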
On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma <inte...@outlook.com> wrote:
> Thanks LL. Your query options look good.
>
> As Xu Cheng mentioned, I have also noticed that Impala does hash joins
> slowly in some big-data situations. I'm very curious about the root cause.
>
> On 02/11/2017 10:00, 俊杰陈 wrote:
>
> +user list
>
> 2017-11-02 9:57 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:
>
> Hi Mostafa
>
> Cheng already put the profile in the thread.
> Here is another profile for the Impala release version; you can also see
> the attachment.
>
> 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar <mmokh...@cloudera.com>:
>
> Attaching the query profile will be most helpful for investigating this
> issue. If you can capture the profile from the WebUI on the coordinator
> node, that would be great.
>
> On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 <cjjnj...@gmail.com> wrote:
>
> Thanks Hongxu,
>
> Here are the configurations on my cluster; most of them are default values.
> Which item do you think may have an impact?
>
> ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
> ABORT_ON_ERROR: [0]
> ALLOW_UNSUPPORTED_FORMATS: [0]
> APPX_COUNT_DISTINCT: [0]
> BATCH_SIZE: [0]
> COMPRESSION_CODEC: [NONE]
> DEBUG_ACTION: []
> DEFAULT_ORDER_BY_LIMIT: [-1]
> DISABLE_CACHED_READS: [0]
> DISABLE_CODEGEN: [0]
> DISABLE_OUTERMOST_TOPN: [0]
> DISABLE_ROW_RUNTIME_FILTERING: [0]
> DISABLE_STREAMING_PREAGGREGATIONS: [0]
> DISABLE_UNSAFE_SPILLS: [0]
> ENABLE_EXPR_REWRITES: [1]
> EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
> EXPLAIN_LEVEL: [1]
> HBASE_CACHE_BLOCKS: [0]
> HBASE_CACHING: [0]
> MAX_BLOCK_MGR_MEMORY: [0]
> MAX_ERRORS: [100]
> MAX_IO_BUFFERS: [0]
> MAX_NUM_RUNTIME_FILTERS: [10]
> MAX_SCAN_RANGE_LENGTH: [0]
> MEM_LIMIT: [0]
> MT_DOP: [0]
> NUM_NODES: [0]
> NUM_SCANNER_THREADS: [0]
> OPTIMIZE_PARTITION_KEY_SCANS: [0]
> PARQUET_ANNOTATE_STRINGS_UTF8: [0]
> PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
> PARQUET_FILE_SIZE: [0]
> PREFETCH_MODE: [1]
> QUERY_TIMEOUT_S: [0]
> REPLICA_PREFERENCE: [0]
> REQUEST_POOL: []
> RESERVATION_REQUEST_TIMEOUT: [0]
> RM_INITIAL_MEM: [0]
> RUNTIME_BLOOM_FILTER_SIZE: [1048576]
> RUNTIME_FILTER_MAX_SIZE: [16777216]
> RUNTIME_FILTER_MIN_SIZE: [1048576]
> RUNTIME_FILTER_MODE: [2]
> RUNTIME_FILTER_WAIT_TIME_MS: [0]
> S3_SKIP_INSERT_STAGING: [1]
> SCAN_NODE_CODEGEN_THRESHOLD: [1800000]
> SCHEDULE_RANDOM_REPLICA: [0]
> SCRATCH_LIMIT: [-1]
> SEQ_COMPRESSION_MODE: [0]
> STRICT_MODE: [0]
> SUPPORT_START_OVER: [false]
> SYNC_DDL: [0]
> V_CPU_CORES: [0]
>
> 2017-10-31 15:30 GMT+08:00 Hongxu Ma <inte...@outlook.com>:
>
> Hi JJ
>
> Considering it only takes 3 minutes on Spark SQL, maybe there is a mistake
> in the query options.
> Try running "set;" in impala-shell and check all query options, e.g.:
> BATCH_SIZE: [0]
> DISABLE_CODEGEN: [0]
> RUNTIME_FILTER_MODE: GLOBAL
>
> Just a guess, thanks.
>
> On 27/10/2017 10:25, 俊杰陈 wrote:
>
> The profile file is damaged. Here is a screenshot of the exec summary
> [exec summary screenshot attached]
>
> 2017-10-27 10:04 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:
>
> Hi Devs
>
> I ran into a performance issue on a big-table join. The query takes more
> than 3 hours on Impala and only 3 minutes on Spark SQL on the same 5-node
> cluster. When running the query, the left scanner and exchange node are
> very slow. Did I miss some key arguments?
>
> You can see the profile file in the attachment.
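For anyone following Hongxu's suggestion above, a quick impala-shell sketch of checking and adjusting those options (the wait-time value is only an illustration, not a tuned recommendation):

    set;                                     -- dump all current query options
    SET RUNTIME_FILTER_MODE=GLOBAL;          -- "[2]" in the dump above corresponds to GLOBAL
    SET DISABLE_CODEGEN=0;                   -- keep codegen enabled (0 is already the default)
    SET RUNTIME_FILTER_WAIT_TIME_MS=10000;   -- hypothetical value: let filters arrive before scans start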