#general


@fabianpaul: @fabianpaul has joined the channel
@kelly.revenaugh: @kelly.revenaugh has joined the channel

#random


@fabianpaul: @fabianpaul has joined the channel
@kelly.revenaugh: @kelly.revenaugh has joined the channel

#troubleshooting


@taranrishit1234: I tried to load this CSV into Pinot to query it, but other than the schema, no data shows up in the query console. What could be wrong? All the related files are in the attachment.
  @npawar: Could you tell us all the steps you took? And could you also share the output of the LaunchDataIngestionJob command?
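For reference, LaunchDataIngestionJob is driven by a job-spec YAML, and "schema shows but no rows" is often a spec whose input pattern or record reader doesn't match the CSV files. A minimal standalone spec of that shape, as a sketch; paths, table name, and controller URI are placeholders:

    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: '/path/to/input/'
    includeFileNamePattern: 'glob:**/*.csv'
    outputDirURI: '/path/to/output/'
    pinotFSSpecs:
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
      tableName: 'myTable'
    pinotClusterSpecs:
      - controllerURI: 'http://localhost:9000'

Launched with: bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job-spec.yaml. Its output is what @npawar is asking to see.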
@ken: I ran into an issue where a segment I created was > 8GB when tarred, and thus failed during the "converting segment" phase:
    Converting segment: /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0 to v3 format
    v3 segment location for segment: crawldata_OFFLINE_2018-10-13_2020-10-11_0 is /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0/v3
    Deleting files in v1 segment directory: /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0
    Computed crc = 1033854200, based on files [/tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0/v3/columns.psf, /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0/v3/index_map, /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0/v3/metadata.properties]
    Driver, record read time : 236809
    Driver, stats collector time : 0
    Driver, indexing time : 122449
    Tarring segment from: /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0 to: /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0.tar.gz
    Failed to generate Pinot segment for file - java.lang.RuntimeException: entry size '8991809155' is too big ( > 8589934591 ).
        at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.failForBigNumber(TarArchiveOutputStream.java:636) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21]
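For context: this failure is the classic tar format's per-entry size cap. The ustar header stores an entry's size in an 11-digit octal field, so the largest representable entry is 8^11 - 1 = 8589934591 bytes (just under 8 GiB), which is exactly the limit quoted in the exception. A quick arithmetic check in Java:

    // The ustar tar header stores an entry's size in an 11-digit octal field,
    // capping entries at 8^11 - 1 bytes: the exact limit in the stack trace above.
    public class TarSizeLimit {
        public static void main(String[] args) {
            long limit = (1L << 33) - 1;              // 11 octal digits = 33 bits
            System.out.println(limit);                // 8589934591
            System.out.println(8991809155L > limit);  // true: this ~8.99 GB segment overflows
        }
    }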
  @g.kishore: Yes please. 8 GB is quite big; can you break it up into smaller sizes?
  @ken: Yes, I can; I need to figure out how to get Flink batch to bucket by day, since that's how I'm segmenting (see the bucketing sketch after this thread).
  @g.kishore: After 2 GB, we typically run into JVM limits on offset length, etc. Also, a segment is the unit of parallelism.
  @ken: Is there any rule of thumb for the target number of segments in a table? As in, say, one active/hot segment per server core?
  @ken: Or is it fine to have a lot more (smaller) segments, to support finer-grained exclusion of segments and thus more efficient querying?
  @g.kishore: 150 MB to 500 MB is the sweet spot
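On the day-bucketing question above: one way is Flink's StreamingFileSink with a custom BucketAssigner keyed on each record's own timestamp, so a historical backfill lands in the right day regardless of when the job runs. A minimal sketch; CrawlRecord, its getTimestampMillis() accessor, and the output path are hypothetical placeholders, not anything from this thread:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;
    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.core.io.SimpleVersionedSerializer;
    import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

    // Routes each record into a per-day directory based on the record's own timestamp,
    // not the processing time, so backfills of old data still bucket correctly.
    public class DayBucketAssigner implements BucketAssigner<CrawlRecord, String> {
        private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

        @Override
        public String getBucketId(CrawlRecord record, Context context) {
            return DAY.format(Instant.ofEpochMilli(record.getTimestampMillis()));
        }

        @Override
        public SimpleVersionedSerializer<String> getSerializer() {
            return SimpleVersionedStringSerializer.INSTANCE;
        }

        // One output directory per day; each day's files can then be fed to the
        // segment-creation job as a separate, smaller segment.
        public static StreamingFileSink<CrawlRecord> buildDailySink() {
            return StreamingFileSink
                .forRowFormat(new Path("hdfs:///crawldata/daily"),
                              new SimpleStringEncoder<CrawlRecord>("UTF-8"))
                .withBucketAssigner(new DayBucketAssigner())
                .build();
        }
    }

    // Hypothetical record type; only the epoch-millis timestamp matters here.
    class CrawlRecord {
        private final long timestampMillis;
        CrawlRecord(long timestampMillis) { this.timestampMillis = timestampMillis; }
        long getTimestampMillis() { return timestampMillis; }
    }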
@ken: I think that, as per the commons-compress documentation, Pinot should be using BIGNUMBER_POSIX for the bigNumberMode.
@ken: Should I file an issue?
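For reference, lifting that limit in commons-compress means switching the tar writer to POSIX (PAX) extended headers. A minimal sketch of tarring one large file that way; this is illustrative, not Pinot's actual packaging code:

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.zip.GZIPOutputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
    import org.apache.commons.compress.utils.IOUtils;

    public class BigTar {
        public static void main(String[] args) throws Exception {
            File input = new File(args[0]);
            try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
                    new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(args[1]))))) {
                // PAX extended headers lift the 8 GiB per-entry cap of the classic format.
                tar.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_POSIX);
                // Also avoids the 100-character path limit of the classic header.
                tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX);
                TarArchiveEntry entry = new TarArchiveEntry(input, input.getName());
                tar.putArchiveEntry(entry);
                try (FileInputStream in = new FileInputStream(input)) {
                    IOUtils.copy(in, tar);
                }
                tar.closeArchiveEntry();
            }
        }
    }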
@fabianpaul: @fabianpaul has joined the channel
@kelly.revenaugh: @kelly.revenaugh has joined the channel
@elon.azoulay: Hi, we had a server go into a GC loop where it wasn't reducing the heap (only 1 server; the other 5 are fine). Then we noticed that 3 out of 6 of our servers had 2x the amount of data for a table (i.e. 300 GB vs 150 GB). I am running a rebalance now. Is there anything we can do to even out the disk usage among all the servers? We also have replicas per partition set to 3, but we have 6 servers; should we increase replicas to 6, or reduce replicas per partition to 2?
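On evening out the disk usage: the controller's rebalance endpoint can be dry-run first to preview the proposed segment assignment before any data moves. A minimal sketch against that REST API; the controller host, port, and table name are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RebalanceDryRun {
        public static void main(String[] args) throws Exception {
            // dryRun=true returns the proposed assignment without moving segments;
            // rerun with dryRun=false to actually apply it.
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://pinot-controller:9000/tables/myTable/rebalance"
                    + "?type=OFFLINE&dryRun=true"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }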