snleee commented on a change in pull request #5221: Add a new server api for download of segments.
URL: https://github.com/apache/incubator-pinot/pull/5221#discussion_r405191232
##########
File path:
pinot-server/src/main/java/org/apache/pinot/server/api/resources/TablesResource.java
##########
@@ -175,4 +183,41 @@ public String getCrcMetadataForTable(
}
}
}
+
+  @GET
+  @Produces(MediaType.APPLICATION_OCTET_STREAM)
+  @Path("/tables/{tableName}/segments/{segmentName}")
+  @ApiOperation(value = "Download a segment", notes = "Download a segment in zipped tar format")
+  public Response downloadSegment(
+      @ApiParam(value = "Name of the table with type REALTIME OR OFFLINE", required = true, example = "myTable_OFFLINE") @PathParam("tableName") String tableName,
+      @ApiParam(value = "Name of the segment", required = true) @PathParam("segmentName") @Encoded String segmentName,
+      @Context HttpHeaders httpHeaders)
+      throws Exception {
+    LOGGER.info("Get a request to download segment {} for table {}", segmentName, tableName);
+    TableDataManager tableDataManager = checkGetTableDataManager(tableName);
+    SegmentDataManager segmentDataManager = tableDataManager.acquireSegment(segmentName);
+    if (segmentDataManager == null) {
+      throw new WebApplicationException(String.format("Table %s segments %s does not exist", tableName, segmentName),
+          Response.Status.NOT_FOUND);
+    }
+    try {
+      String tableDir = tableDataManager.getTableDataDir();
+      String tarFilePath = TarGzCompressionUtils.createTarGzOfDirectory(tableDir + "/" + segmentName);
Review comment:
The current API behavior compresses the segment every time this endpoint is hit, which looks like an expensive operation. Does `TarGzCompressionUtils.createTarGzOfDirectory` use `tar -cvf` or `tar -czvf`? `cvf` simply groups multiple files/directories into a single archive, while `czvf` also compresses them. I guess `TarGzCompressionUtils.createTarGzOfDirectory` probably tries to compress the file.
Depending on the use case, compression may become a performance bottleneck. Imagine a single server receiving download requests for multiple segments at around the same time: compressing multiple files concurrently will consume a lot of CPU.
One way to improve this is to use `tar cvf`-equivalent logic (no compression) and send the file as-is. Another approach is to keep the compressed files in some directory and use it as a cache (then we also need to handle invalidation). We don't need to address this now, but let's at least add a comment here in case someone hits this bottleneck.
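The caching idea above could be sketched roughly as follows. This is a hypothetical illustration, not Pinot code: the class name `SegmentArchiveCache`, the method names, and the use of `GZIPOutputStream` as a stand-in for the real tar+gzip step are all assumptions made for the example. `ConcurrentHashMap.computeIfAbsent` gives per-key atomicity, so concurrent download requests for the same segment compress it only once, and `invalidate` shows where segment replacement would have to evict the stale archive.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of a compressed-archive cache; names are illustrative only.
public class SegmentArchiveCache {
  private final ConcurrentMap<String, Path> cache = new ConcurrentHashMap<>();
  final AtomicInteger compressions = new AtomicInteger(); // exposed for the demo below

  // Return a cached archive if present; otherwise compress exactly once,
  // even under concurrent requests (computeIfAbsent is atomic per key).
  Path getOrCreateArchive(String segmentName, Path segmentData) {
    return cache.computeIfAbsent(segmentName, name -> compress(name, segmentData));
  }

  // Invalidation hook: must be called when a segment is replaced or deleted.
  void invalidate(String segmentName) {
    Path stale = cache.remove(segmentName);
    if (stale != null) {
      try {
        Files.deleteIfExists(stale);
      } catch (IOException ignored) {
      }
    }
  }

  private Path compress(String name, Path segmentData) {
    compressions.incrementAndGet();
    try {
      Path out = Files.createTempFile(name, ".tar.gz");
      try (OutputStream gz = new GZIPOutputStream(Files.newOutputStream(out))) {
        gz.write(Files.readAllBytes(segmentData)); // stand-in for real tar + gzip
      }
      return out;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    SegmentArchiveCache cache = new SegmentArchiveCache();
    Path data = Files.createTempFile("segment", ".data");
    Files.write(data, "segment bytes".getBytes());

    Path first = cache.getOrCreateArchive("seg_0", data);
    Path second = cache.getOrCreateArchive("seg_0", data);
    System.out.println(first.equals(second));     // second hit is served from the cache
    System.out.println(cache.compressions.get()); // compressed only once so far

    cache.invalidate("seg_0");
    cache.getOrCreateArchive("seg_0", data);
    System.out.println(cache.compressions.get()); // recompressed after invalidation
  }
}
```

The trade-off is disk space for CPU: each segment pays the compression cost once per version instead of once per download, at the price of keeping the archives around and wiring invalidation into segment replacement.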
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]