[Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

2016-11-26 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
..


Patch Set 4:

Just tried out the latest patch and metadata loading is 5.4x faster. 

With the patch metadata loading for 80 partitions with 250K files finished in 
27 seconds compared to 146 seconds without. 

Most of the CPU time is spent in the RemoteIterator, to further speedup 
metadata loading I recommend using a thread pool. 

Stack Trace Sample CountPercentage(%)
org.apache.impala.catalog.HdfsTable.load(boolean, IMetaStoreClient, Table)  
509 74.307
   org.apache.impala.catalog.HdfsTable.load(boolean, IMetaStoreClient, Table, 
boolean, boolean, Set)509 74.307
  org.apache.impala.catalog.HdfsTable.loadAllPartitions(List, Table)
507 74.015
 org.apache.impala.catalog.HdfsTable.loadMetadataAndDiskIds(FileSystem, 
List, HashMap)  497 72.555
org.apache.impala.catalog.HdfsTable.loadBlockMetadata(FileSystem, 
Path, HashMap, Map)   472 68.905
   org.apache.hadoop.fs.FileSystem$5.hasNext()  365 53.285
  
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.hasNext() 
339 49.489
 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.hasNextNoFilter()
  258 37.664
org.apache.hadoop.hdfs.DFSClient.listPaths(String, 
byte[], boolean) 258 37.664

com.sun.proxy.$Proxy21.getListing(String, byte[], boolean)  258 37.664
  
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(Object, Method, 
Object[])258 37.664
 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(Method, 
Object[])   258 37.664
 
java.lang.reflect.Method.invoke(Object, Object[])  258 37.664
 
org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus.makeQualifiedLocated(URI, 
Path)  81  11.825

-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No


[Impala-ASF-CR] Add Apache license header to files in doc directory

2016-11-26 Thread Jim Apple (Code Review)
Jim Apple has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5232

Change subject: Add Apache license header to files in doc directory
..

Add Apache license header to files in doc directory

This now gives a clean RAT check with bin/check-rat-report.py, which
is one way for the Impala community to check compliance with ASF rules
on intellectual property.

Change-Id: I2ad06435f84a65ba126759e42a18fdaf52cd7036
---
M bin/rat_exclude_files.txt
M docs/Cloudera-Impala-Release-Notes.ditamap
M docs/impala.ditamap
M docs/impala_html.ditaval
M docs/impala_pdf.ditaval
M docs/impala_sqlref.ditamap
M docs/shared/ImpalaVariables.xml
M docs/shared/impala_common.xml
M docs/topics/impala.xml
M docs/topics/impala_abort_on_default_limit_exceeded.xml
M docs/topics/impala_abort_on_error.xml
M docs/topics/impala_admin.xml
M docs/topics/impala_admission.xml
M docs/topics/impala_aggregate_functions.xml
M docs/topics/impala_aliases.xml
M docs/topics/impala_allow_unsupported_formats.xml
M docs/topics/impala_alter_function.xml
M docs/topics/impala_alter_table.xml
M docs/topics/impala_alter_view.xml
M docs/topics/impala_analytic_functions.xml
M docs/topics/impala_appx_count_distinct.xml
M docs/topics/impala_appx_median.xml
M docs/topics/impala_array.xml
M docs/topics/impala_auditing.xml
M docs/topics/impala_authentication.xml
M docs/topics/impala_authorization.xml
M docs/topics/impala_avg.xml
M docs/topics/impala_avro.xml
M docs/topics/impala_batch_size.xml
M docs/topics/impala_bigint.xml
M docs/topics/impala_bit_functions.xml
M docs/topics/impala_boolean.xml
M docs/topics/impala_breakpad.xml
M docs/topics/impala_cdh.xml
M docs/topics/impala_char.xml
M docs/topics/impala_cluster_sizing.xml
M docs/topics/impala_cm_installation.xml
M docs/topics/impala_comments.xml
M docs/topics/impala_complex_types.xml
M docs/topics/impala_components.xml
M docs/topics/impala_compression_codec.xml
M docs/topics/impala_compute_stats.xml
M docs/topics/impala_concepts.xml
M docs/topics/impala_conditional_functions.xml
M docs/topics/impala_config.xml
M docs/topics/impala_config_options.xml
M docs/topics/impala_config_performance.xml
M docs/topics/impala_connecting.xml
M docs/topics/impala_conversion_functions.xml
M docs/topics/impala_count.xml
M docs/topics/impala_create_data_source.xml
M docs/topics/impala_create_database.xml
M docs/topics/impala_create_function.xml
M docs/topics/impala_create_role.xml
M docs/topics/impala_create_table.xml
M docs/topics/impala_create_view.xml
M docs/topics/impala_data_sources.xml
M docs/topics/impala_databases.xml
M docs/topics/impala_datatypes.xml
M docs/topics/impala_date.xml
M docs/topics/impala_datetime_functions.xml
M docs/topics/impala_ddl.xml
M docs/topics/impala_debug_action.xml
M docs/topics/impala_decimal.xml
M docs/topics/impala_default_order_by_limit.xml
M docs/topics/impala_delegation.xml
M docs/topics/impala_delete.xml
M docs/topics/impala_describe.xml
M docs/topics/impala_development.xml
M docs/topics/impala_disable_cached_reads.xml
M docs/topics/impala_disable_codegen.xml
M docs/topics/impala_disable_outermost_topn.xml
M docs/topics/impala_disable_row_runtime_filtering.xml
M docs/topics/impala_disable_streaming_preaggregations.xml
M docs/topics/impala_disable_unsafe_spills.xml
M docs/topics/impala_disk_space.xml
M docs/topics/impala_distinct.xml
M docs/topics/impala_dml.xml
M docs/topics/impala_double.xml
M docs/topics/impala_drop_data_source.xml
M docs/topics/impala_drop_database.xml
M docs/topics/impala_drop_function.xml
M docs/topics/impala_drop_role.xml
M docs/topics/impala_drop_stats.xml
M docs/topics/impala_drop_table.xml
M docs/topics/impala_drop_view.xml
M docs/topics/impala_errata.xml
M docs/topics/impala_exec_single_node_rows_threshold.xml
M docs/topics/impala_explain.xml
M docs/topics/impala_explain_level.xml
M docs/topics/impala_explain_plan.xml
M docs/topics/impala_faq.xml
M docs/topics/impala_faq_base.xml
M docs/topics/impala_features.xml
M docs/topics/impala_file_formats.xml
M docs/topics/impala_fixed_issues.xml
M docs/topics/impala_float.xml
M docs/topics/impala_functions.xml
M docs/topics/impala_functions_overview.xml
M docs/topics/impala_glossary.xml
M docs/topics/impala_grant.xml
M docs/topics/impala_group_by.xml
M docs/topics/impala_group_concat.xml
M docs/topics/impala_hadoop.xml
M docs/topics/impala_having.xml
M docs/topics/impala_hbase.xml
M docs/topics/impala_hbase_cache_blocks.xml
M docs/topics/impala_hbase_caching.xml
M docs/topics/impala_hints.xml
M docs/topics/impala_howto_rm.xml
M docs/topics/impala_identifiers.xml
M docs/topics/impala_impala_shell.xml
M docs/topics/impala_incompatible_changes.xml
M docs/topics/impala_insert.xml
M docs/topics/impala_install.xml
M docs/topics/impala_int.xml
M docs/topics/impala_intro.xml
M docs/topics/impala_invalidate_metadata.xml
M docs/topics/impala_isilon.xml
M docs/topics/impala_jdbc.xml
M docs/topics/impala_joins.xml
M docs/topics/impala_kerberos.xml

[Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

2016-11-26 Thread Bharath Vissapragada (Code Review)
Bharath Vissapragada has uploaded a new patch set (#4).

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
..

IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

This patch improves the block metadata loading (locations and disk
storage IDs) for partitioned and un-partitioned tables in the Catalog
server.

Without this patch:
--
We loop throuh each and every file in the table/partition directories
and call getFileBlockLocations() on it to obtain the block metadata.
This results in large no. of RPC calls to namenode, especially with
tables with large no. of files/partitions.

With this patch:
---
We move the block metadata querying to use listStatus() call which
accepts a directory as input and fetches the 'BlockLocation' objects
for every file recursively in that directory. This improves the
metadata loading in the following ways.

- For non-partitioned tables, we query all the BlockLocations in a
single RPC call in the base table directory and load the corresponding
disk IDs.

- For partitioned tables, we query the BlockLocations for all the
partitions residing under the base table directories in a single RPC
and then load every partition with non-default partition directory
separately.

Also, this patch does away with VolumeIds returned by the HDFS NN
and uses the new StorageIDs returned by the BlockLocation class.
These StorageIDs are UUID strings and hence are mapped to a
per-node 0-based index as expected by the backend. In the upcoming
versions of Hadoop APIs, getFileBlockStorageLocations() is deprecated
and instead the listStatus() returns BlockLocations with storage IDs
embedded. This patch makes use to improvement to reduce an addition
RPC to NN to fetch the storage locations.

Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
---
A fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
3 files changed, 385 insertions(+), 250 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/48/5148/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Mostafa Mokhtar 


[Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

2016-11-26 Thread Bharath Vissapragada (Code Review)
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
..


Patch Set 2:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 27:  * This class encapsulates the logic required for mapping HDFS
> Much fluff, simplify to: Maps HDFS storage-UUIDs to per-host 0-based, seque
Done


Line 29:  * by the backend.
> required by Impala backends
Done


Line 31: public class DiskIdMapper {
> Make this a singleton to clarify the intended use.
Done


Line 36: // This is intentionally implemented as a static object to share 
the storageIDs across
> move this part to the class comment
Done


Line 38: // consistent across all the tables, without which, we can 
potentially confuse the
> also mention that the static solution requires the least amount of memory
Done


Line 48: private static ConcurrentHashMap
> Don't think we need a ConcurrentHashMap or AtomicInteger here, because when
Used a stricter locking scheme, but as we discussed offline, this can still 
have contention in the initial metadata load. Added a TODO for that.


http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 294:   // Keeps track of all the parition paths that are filtered to add 
block metadata.
> this comment style for classes and functions
Oops, done.


Line 368: if 
(!pathMd.filteredParitionPaths.contains(fileStatus.getPath())) continue;
> Looks like this contains() and the indexOf() below can become very expensiv
The new PS design revamps the whole thing. We don't do indexOf() searches 
anymore.


Line 423: for (int i=0; i < fileMd.pathFds.size(); ++i) {
> i = 0 (add spaces)
Done


Line 458:   // Only build the storage IDs fs instances that support the 
BlockLocation API.
> ... for fs instances ...
Done


Line 720:   private void loadAllPartitions(
> After reading through this a few times I'm still confused about the high-le
The new PS implements a similar logic. We can sync-up on that if required.


Line 777: if (dirsToLoad.contains(partDir) ||
> positive is more readable to me
Done


http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 275:* Returns true if the filesystem supports BlockLocation API.
> What API is it specifically? LocatedFileStatus.getBlockLocations()?
Yes, all methods returning 'BlockLocation's.


Line 277:   public static boolean supportsBlockLocationAPI(FileSystem fs) {
> supportsBlockLocationApi
Done


Line 406:   public static boolean isPathChild(Path p1, Path p2) {
> isChildPath(Path p, Path parent)
Done


Line 408: Path parentDir = p1.getParent();
> I think this can be improved by using the Path.getDepth(). You can do getPa
good idea. Done.


-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4512: Add a script that builds Impala on stock Ubuntu 14.04.

2016-11-26 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change.

Change subject: IMPALA-4512: Add a script that builds Impala on stock Ubuntu 
14.04.
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5154/2/bin/bootstrap_build.sh
File bin/bootstrap_build.sh:

Line 39: # when done building:
> I think deleting the directory when done will be a bit surprising to people
Yeah, I agree. I was thinking of taking out almost everything except the 
apt-get, export JAVA_HOME, and buildall.sh - that way it can be run from the 
same directory the script is in, assuming the script was acquired by fetching 
an Impala git repo.

What do you think?


PS2, Line 44: gcc
> We could replace gcc, g++ and make with build-essential. I don't feel stron
I thought this way it would be easier to port to other distributions, or maybe 
even other Unixes.


-- 
To view, visit http://gerrit.cloudera.org:8080/5154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If34e398052a61dfda9825b1cf3a918eb61736048
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3200: Implement suballocator for splitting buffers

2016-11-26 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change.

Change subject: IMPALA-3200: Implement suballocator for splitting buffers
..


Patch Set 5:

(31 comments)

http://gerrit.cloudera.org:8080/#/c/4715/4//COMMIT_MSG
Commit Message:

PS4, Line 13:  buddy allocation
> The main use case is to allocate power-of-two buffers, so currently this is
Though the number of elements in HTs is a power of two, the size of the 
allocations could be different if the sizeof a slot changes, yes? For instance, 
if sizeof(slot) is 24, then most allocations will waste a good bit of space?


http://gerrit.cloudera.org:8080/#/c/4715/5/be/src/bufferpool/suballocator-test.cc
File be/src/bufferpool/suballocator-test.cc:

Line 72:   /// Verify that the memory for all of the allocations is writable 
and disjoint by
Maybe replace "Verify" with "ASSERT", for clarity?


PS5, Line 72: allocations
"Suballocations", here and below


PS5, Line 77: list
vector


PS5, Line 77: clear
call std::vector::clear() after the for loop to prevent UB if a caller uses 
(*allocs)[i] for some i.


Line 85:   ObjectPool obj_pool_;
Each of these member variables could use an explanation about why it is here.


Line 101:   ASSERT_OK(pool.RegisterClient("client", _reservation_, 
profile(), ));
The ASSERTs in this file could generally benefit from more info in the "<<" 
section to help diagnose failures from the logs. I have recently seen a backend 
test error that I was able to reproduce, but only on a machine image I had 
limited access to, so I had to ssh in between other people using it, and more 
backend test logs would have saved me some time.


Line 111: ASSERT_OK(allocator.Allocate(ALLOC_SIZE, ()));
I generally think of ASSERT as being for failures in which the test should not 
continue, but  here I think some further parts of the test are interesting. For 
instance, the smaller allocations below are interesting even if this one fails.

I can understand breaking from the loop if this fails, though.

My feeling is that several other ASSERTs should be EXPECTs in this file.


Line 116:   // Attempts to allocate more memory should fail gracefully.
Can we allocate 0 bytes here?


Line 127:   for (unique_ptr& alloc : allocs) {
call FreeAllocations?


PS5, Line 147: odd
It might also be good to test edge cases: (1 << i) - 1


Line 159: memset(alloc->data(), 0, alloc->len()); // Check memory is 
writable.
Add note that ASAN performs this check.


Line 180:   const int num_allocs = 16;
either "NUM_ALLOCS" or "total_mem", above


PS5, Line 184: int64_t curr_alloc_size = TEST_BUFFER_LEN / 8;
 :   while (curr_alloc_size * num_allocs < TOTAL_MEM) {
Could be more readable as a for loop


Line 188: for (int i = 0; i < num_allocs; ++i) {
Can this be range-based, like the loop on 205?


Line 197: // allocations should require an extra page.
Can you add a comment explaining why should they never require more than an 
extra page? That way, if this breaks, the test code will help the fixer 
understand why it was originally true and why it is part of the buddy system 
contract.


Line 206: allocator.Free(move(alloc));
call FreeAllocations()


PS5, Line 215:   const int64_t TOTAL_MEM = TEST_BUFFER_LEN * 100;
 :   global_reservation_.InitRootTracker(nullptr, TOTAL_MEM);
 :   BufferPool pool(TEST_BUFFER_LEN, TOTAL_MEM);
 : 
 :   ReservationTracker reservation;
 :   reservation.InitChildTracker(nullptr, _reservation_, 
nullptr, TOTAL_MEM);
 :   BufferPool::Client client;
 :   ASSERT_OK(pool.RegisterClient("client", , 
profile(), ));
 : 
 :   Suballocator allocator(, , TEST_BUFFER_LEN, 
ExpansionStrategy::NEVER);
This section seems like a good candidate for a member function, as it appears 
very similarly elsewhere


Line 263: /// Do some randomised testing of the allocator. Simulate some 
interesting patterns with
This tests only correctness, not anything about fragmentation, right?


Line 289:   int64_t remaining_mem_per_alloc = (TOTAL_MEM - allocated_mem) / 
num_allocs;
const


PS5, Line 290: 1.0
The first parameter below is 2 - that's the mean, right, so the average is 0.2?


PS5, Line 334: 8
I didn't see this in the header file. I'd suggest that this expectation should 
be in the contract.


Line 336: for (int64_t offset = 0; offset < alloc->len(); offset += 8) {
std::iota


http://gerrit.cloudera.org:8080/#/c/4715/4/be/src/bufferpool/suballocator.cc
File be/src/bufferpool/suballocator.cc:

Line 144: DCHECK_EQ(free_node->len_, 1LL << (i + LOG_MIN_ALLOCATION_BYTES));
> Done
Does not look done. Maybe some changes were left out of PS5 by mistake?


http://gerrit.cloudera.org:8080/#/c/4715/4/be/src/bufferpool/suballocator.h
File be/src/bufferpool/suballocator.h:

Line 37: };
> It can but we end up with Suballocator::ExpansionStrategy everywhere, which
Maybe typedef 

[Impala-ASF-CR] Avoid std::function when possible.

2016-11-26 Thread Jim Apple (Code Review)
Jim Apple has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5230

Change subject: Avoid std::function when possible.
..

Avoid std::function when possible.

std::function does some tricky stuff under the hood, and can in fact
throw in its constructor when it allocates heap memory. This patch
changes ScopeExitTrigger to use templates instead, avoiding the
complexity of std::function. It uses a std::make_pair-like constructor
helper so clients don't have to try to name the types of lambdas.

Change-Id: If9dc034c2e094ea7f87f78d8d9101a71d8d2e295
---
M be/src/service/fe-support.cc
M be/src/util/scope-exit-trigger.h
2 files changed, 22 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/5230/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5230
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: If9dc034c2e094ea7f87f78d8d9101a71d8d2e295
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple