Apache Pinot Daily Email Digest (2020-12-01)

Pinot Slack Email Digest Tue, 01 Dec 2020 18:00:35 -0800

#general

@whatatrip888: I am running a load test against Pinot tables using JMeter. Currently I am getting an exception related to connection pooling. Response message: java.sql.SQLException: Cannot create PoolableConnectionFactory (null)
java.sql.SQLException: Cannot create PoolableConnectionFactory (null)
    at org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:669) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.commons.dbcp2.BasicDataSource.createDataSource(BasicDataSource.java:544) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:753) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.jmeter.protocol.jdbc.config.DataSourceElement.initPool(DataSourceElement.java:308) [ApacheJMeter_jdbc.jar:5.3]
    at org.apache.jmeter.protocol.jdbc.config.DataSourceElement.testStarted(DataSourceElement.java:127) [ApacheJMeter_jdbc.jar:5.3]
    at org.apache.jmeter.engine.StandardJMeterEngine.notifyTestListenersOfStart(StandardJMeterEngine.java:205) [ApacheJMeter_core.jar:5.3]
    at org.apache.jmeter.engine.StandardJMeterEngine.run(StandardJMeterEngine.java:380) [ApacheJMeter_core.jar:5.3]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_271]
@fx19880617: How do you query Pinot? Which client are you using? Did you see any exceptions in the Pinot controller or broker?
@whatatrip888: @fx19880617 @npawar @g.kishore From JMeter, the JDBC client is used to query Pinot. There are no exceptions in the Pinot controller or broker. The error from JMeter looks like this:
2020-12-01 19:55:11,767 INFO o.a.j.e.StandardJMeterEngine: Running the test!
2020-12-01 19:55:11,767 INFO o.a.j.s.SampleEvent: List of sample_variables: []
2020-12-01 19:55:11,769 INFO o.a.p.c.PinotDriver: Initiating connection to database for url: jdbc:
2020-12-01 19:55:11,804 INFO o.a.p.c.Connection: Creating connection to broker list [instance-3:8099]
2020-12-01 19:55:11,812 ERROR o.a.j.p.j.c.DataSourceElement: Error preinitializing the connection pool: demo@494046791
java.sql.SQLException: Cannot create PoolableConnectionFactory (null)
    at org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:669) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.commons.dbcp2.BasicDataSource.createDataSource(BasicDataSource.java:544) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:753) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.jmeter.protocol.jdbc.config.DataSourceElement.initPool(DataSourceElement.java:308) [ApacheJMeter_jdbc.jar:5.3]
    at org.apache.jmeter.protocol.jdbc.config.DataSourceElement.testStarted(DataSourceElement.java:127) [ApacheJMeter_jdbc.jar:5.3]
    at org.apache.jmeter.engine.StandardJMeterEngine.notifyTestListenersOfStart(StandardJMeterEngine.java:205) [ApacheJMeter_core.jar:5.3]
    at org.apache.jmeter.engine.StandardJMeterEngine.run(StandardJMeterEngine.java:380) [ApacheJMeter_core.jar:5.3]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_271]
Caused by: java.sql.SQLFeatureNotSupportedException
    at org.apache.pinot.client.base.AbstractBaseConnection.getAutoCommit(AbstractBaseConnection.java:116) ~[pinot-jdbc-client-0.5.0.jar:0.5.0-d87bbc9032c6efe626eb5f9ef1db4de7aa067179]
    at org.apache.commons.dbcp2.DelegatingConnection.getAutoCommit(DelegatingConnection.java:391) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.commons.dbcp2.PoolableConnectionFactory.activateObject(PoolableConnectionFactory.java:116) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.commons.dbcp2.BasicDataSource.validateConnectionFactory(BasicDataSource.java:117) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    at org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:665) ~[commons-dbcp2-2.7.0.jar:2.7.0]
    ... 7 more
2020-12-01 19:55:11,813 INFO o.a.j.g.u.JMeterMenuBar: setRunning(true, *local*)
2020-12-01 19:55:11,817 INFO o.a.j.e.StandardJMeterEngine: Starting ThreadGroup: 1 : Thread Group
2020-12-01 19:55:11,818 INFO o.a.j.e.StandardJMeterEngine: Starting 1 threads for group Thread Group.
2020-12-01 19:55:11,818 INFO o.a.j.e.StandardJMeterEngine: Thread will continue on error
2020-12-01 19:55:11,818 INFO o.a.j.t.ThreadGroup: Starting thread group... number=1 threads=1 ramp-up=1 delayedStart=false
2020-12-01 19:55:11,818 INFO o.a.j.t.ThreadGroup: Started thread group number 1
2020-12-01 19:55:11,818 INFO o.a.j.e.StandardJMeterEngine: All thread groups have been started
2020-12-01 19:55:11,838 INFO o.a.j.t.JMeterThread: Thread started: Thread Group 1-1
2020-12-01 19:55:11,839 INFO o.a.p.c.PinotDriver: Initiating connection to database for url: jdbc:
2020-12-01 19:55:11,871 INFO o.a.p.c.Connection: Creating connection to broker list [instance-3:8099]
2020-12-01 19:55:11,878 INFO o.a.j.t.JMeterThread: Thread is done: Thread Group 1-1
2020-12-01 19:55:11,879 INFO o.a.j.t.JMeterThread: Thread finished: Thread Group 1-1
2020-12-01 19:55:11,879 INFO o.a.j.e.StandardJMeterEngine: Notifying test listeners of end of test
2020-12-01 19:55:11,879 INFO o.a.j.g.u.JMeterMenuBar: setRunning(false, *local*)
@fx19880617: @kharekartik do you have any idea on this?
@kharekartik: @whatatrip888 Which version of pinot and jdbc are you using?
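The posted trace already names the versions on the client side (pinot-jdbc-client 0.5.0, commons-dbcp2 2.7.0) and shows the chain of events: while setting up the pool, DBCP calls `getAutoCommit` on a fresh connection, the Pinot connection throws `SQLFeatureNotSupportedException` there, and DBCP reports it as "Cannot create PoolableConnectionFactory". A toy Python model of that chain, where every class and function name is hypothetical and only stands in for the Java classes in the trace:

```python
class FeatureNotSupported(Exception):
    """Stand-in for java.sql.SQLFeatureNotSupportedException."""


class ToyConnection:
    """Hypothetical connection whose driver does not implement auto-commit."""

    def get_auto_commit(self):
        raise FeatureNotSupported("getAutoCommit")


class ToyPool:
    """Hypothetical pool that checks one connection before it starts."""

    def __init__(self, connect):
        self._connect = connect

    def init_pool(self):
        conn = self._connect()
        try:
            # The step that fails in the trace: the pool asks for auto-commit state.
            conn.get_auto_commit()
        except FeatureNotSupported as cause:
            # The pool wraps the driver error, which hides the real reason.
            raise RuntimeError("Cannot create PoolableConnectionFactory") from cause
        return conn


def root_cause(exc):
    """Follow the 'Caused by' chain down to the innermost exception."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc
```

In this reading the interesting line is the innermost "Caused by", not the top-level pool error, which is why the driver version matters for the follow-up question.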
@dungnt: @dungnt has joined the channel
@gloetscher: Hey everyone! I’m currently investigating Apache Pinot, and after reading a good chunk of the documentation, I have a couple of questions.
• The documentation says that if you lose all your controllers, your cluster will still be able to answer read queries (but not write queries, obviously). Then, if a new controller is started, it says that the cluster will recover and will then be available again for write queries. That supposes that all cluster state is stored somewhere. I suppose that “somewhere” is ZooKeeper?
• Offline servers are responsible for hosting segments. Let’s say we have only one replica for a given segment, and the offline server hosting it dies. Will Helix discover that and ask another offline server to download the same segment, in order to make it available again to the brokers?
• Where can I find some information about the resource requirements (mainly CPU / memory) for controllers / brokers / realtime servers / offline servers?
Thanks for your help!
@tanmay.movva: > Will Helix discover that and will ask another offline server to download the same segment, in order to make it available again to the brokers I don’t think this happens automatically, when a server has died or left the cluster, then we need to trigger the server rebalancing for that lost table so that the segments are hosted by other servers.
@tanmay.movva: > will be then available again for write queries Rather than seeing it as a write query, I think it is better to see it as new segments not being published. The servers won’t be able to commit the segment (as the controller is involved in the segment completion protocol) even though they would be able to ingest data. You can read about the protocol here -
@gloetscher: > I don’t think this happens automatically, when a server has died or left the cluster, then we need to trigger the server rebalancing for that lost table so that the segments are hosted by other servers. OK so I guess it’ll be managed by replicas, then
@gloetscher: > Rather than seeing it as a write query, I think it is better to see it as new segments won’t be published. Because the servers won’t be able to commit the segment(as controller is involved in segment completion protocol) even though they would be able to ingest data. Thanks for the clarification. Yeah, when I said “write queries”, what I meant was “the possibility to make a segment available in Pinot”. But as long as the “cold” storage remains available, you can still upload fresh data in it. It’ll just not be available in the Pinot cluster.
@g.kishore: When a server dies and if you are running on k8s, a new container is created that will download the segments and start serving
@gloetscher: yeah, but let’s say we have another server available with the correct tags, will it be picked up as the new host for the missing segment ?
@g.kishore: What’s the name of new server (logical)
@gloetscher: sorry, I’m not sure I understand you.
@gloetscher: also, I’m only starting with Pinot, so I may be missing some important points
@g.kishore: When a segment is uploaded, we assign it to one or more servers. That mapping is stored in Helix
@g.kishore: That mapping will only change in the following scenarios
• Add more servers and invoke rebalance
• Untag a server and invoke rebalance
@g.kishore: Untag is a way to say that this server should not host segments for this table anymore
@g.kishore: Note that if a server dies, its tag is still maintained in Helix/ZooKeeper
@g.kishore:
@g.kishore: This video should help
@gloetscher: I’ll have a look, thanks !
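The assignment behaviour described in this thread can be sketched as a toy model: the segment-to-server mapping is computed only when rebalance is invoked, so a dead server (whose tag is kept) or a newly tagged server changes nothing until then. The class, the method names, and the round-robin strategy below are illustrative assumptions, not Pinot's or Helix's actual code:

```python
from itertools import cycle


def assign(segments, servers, replicas=1):
    """Toy strategy: hand out segments round-robin over the tagged servers."""
    ring = cycle(sorted(servers))
    return {seg: [next(ring) for _ in range(replicas)] for seg in segments}


class ToyCluster:
    """Hypothetical stand-in for the mapping that the thread says lives in Helix."""

    def __init__(self, tagged_servers, segments, replicas=1):
        self.tagged = set(tagged_servers)
        self.segments = list(segments)
        self.replicas = replicas
        self.mapping = assign(self.segments, self.tagged, self.replicas)

    def server_dies(self, server):
        # The tag and the mapping are both kept; nothing is reassigned.
        pass

    def add_server(self, server):
        # A new tagged server hosts nothing until rebalance is invoked.
        self.tagged.add(server)

    def untag(self, server):
        # Untagging says "this server should no longer host this table".
        self.tagged.discard(server)

    def rebalance(self):
        # Only an explicit rebalance recomputes the mapping.
        self.mapping = assign(self.segments, self.tagged, self.replicas)
```

With a single replica, the toy makes the earlier point concrete: the segment on the dead server stays mapped to it until someone untags that server and rebalances.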
@srsurya122: I tried to execute the Pinot start controller command and got the following error. Could you please help me with this?
@taranrishit1234: +1
@fx19880617: I think this is an issue on Windows. Can you try running Pinot on a Linux OS, or run it using Docker?
@srsurya122:
@myeole: I am trying to run Pinot in Kubernetes but am seeing the following error. Any suggestions?
helm install -n pinot-quickstart kafka incubator/kafka --set replicas=1
Error: failed to download "incubator/kafka" (hint: running `helm repo update` may help)
@fx19880617: you need to install the repo
@fx19880617: ``` helm repo add incubator ```
@fx19880617: ```helm repo add stable ```
@fx19880617: this is typical init cmd for helm after you install it
@fx19880617:
@joao.comini: Oh, `@here` is so scary :fearful:
@fx19880617: Agreed
@fx19880617: you can go to <#C011C9JHN7R|troubleshooting> channel for questions :wink:
@myeole: sure Thanks
@zjinwei: @zjinwei has joined the channel

#random

@dungnt: @dungnt has joined the channel
@zjinwei: @zjinwei has joined the channel

#troubleshooting

@tanmay.movva: Hello, I have updated the table's indexing config to add indices on some columns. After this I triggered `Reload All Segments` to apply the indexes. When I try to check the `Reload Status`, I get this error on the UI ```Table type : REALTIME not yet supported.```
@tanmay.movva: I have checked the tableIndexConfig in the IDEAL STATE of the table and the indices were present. Is this the source of truth for me to check if the indexing has been applied or not? If not, how can I check if the indexing has been applied to that table?
@jackie.jxt: There are 2 ways:
1. Make a query `select count(*) from table where col = 123` and check the `numEntriesScannedInFilter`. If it is 0, that means the inverted index on `col` is present
2. Log in to the server and check the segment directory under the data directory. In the `index_map` file you should be able to find the inverted index for that column if it exists
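The first check can be automated against the broker's JSON response. A minimal sketch, assuming the response has already been parsed into a dict; the helper name is hypothetical, and only the `numEntriesScannedInFilter` field comes from the thread:

```python
def filter_scanned_nothing(response):
    """True when the response reports zero entries scanned while filtering."""
    return response.get("numEntriesScannedInFilter") == 0


# Usage with two hand-written responses (values are illustrative only).
indexed = {"numEntriesScannedInFilter": 0, "numDocsScanned": 42}
not_indexed = {"numEntriesScannedInFilter": 1000000, "numDocsScanned": 42}
print(filter_scanned_nothing(indexed), filter_scanned_nothing(not_indexed))
```

A zero here is the signal suggested above, not proof on its own; the `index_map` check on the server is the way to confirm it.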
@tanmay.movva: Thanks @jackie.jxt :slightly_smiling_face:
@tanmay.movva: Hello, when I query min/max of a column which is not present in the table, Pinot returns Infinity/-Infinity. Shouldn’t the ideal behaviour be to throw an error saying the column is not present?
@fx19880617: Do you mean querying a column which doesn't exist? I also feel the behavior should be to throw an exception saying the field was not found/evaluated.
@fx19880617: Can you create a github issue for this
@tanmay.movva: Sure.
@fx19880617: This may also be extended to other aggregation functions I guess
@tanmay.movva: Yes. Same scenario for other aggregations also.
@tanmay.movva: @fx19880617
@fx19880617: cool! thanks!
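One plausible reading of the Infinity/-Infinity result, stated as an assumption rather than as a description of Pinot's code: a min aggregation usually starts from +infinity and a max from -infinity, so if no values are ever fed in, the starting value comes back unchanged. A small Python sketch of that pattern, plus the stricter behaviour proposed in the thread (all names are hypothetical):

```python
import math


def agg_min(values):
    """Min with +infinity as the starting value; empty input returns it unchanged."""
    result = math.inf
    for value in values:
        result = min(result, value)
    return result


def agg_max(values):
    """Max with -infinity as the starting value; empty input returns it unchanged."""
    result = -math.inf
    for value in values:
        result = max(result, value)
    return result


def checked_min(table, column):
    """Proposed behaviour: fail loudly when the column is not in the table."""
    if column not in table:
        raise KeyError(f"column not found: {column}")
    return agg_min(table[column])
```

Here `table` is just a dict from column name to a list of values, chosen to keep the example self-contained.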
@srsurya122: @srsurya122 has joined the channel
@zjinwei: @zjinwei has joined the channel
@amitchopra: @amitchopra has joined the channel
@yupeng: hey, any good way to optimize a query like this?
```
SELECT hour_start_timestamp_utc
FROM downtime
WHERE (secondsSinceEpoch > 1606247126)
ORDER BY secondsSinceEpoch DESC, hour_start_timestamp_utc DESC
LIMIT 1
```
It scans the past week of data but returns only 1 record. Since the table is large, it ends up scanning about 100 million records per query and takes seconds. The query output looks like this:
```
{
  "selectionResults": {
    "columns": ["hour_start_timestamp_utc"],
    "results": [ ["2020-12-01 09:00:00"] ]
  },
  "exceptions": [],
  "numServersQueried": 9,
  "numServersResponded": 9,
  "numSegmentsQueried": 1059,
  "numSegmentsProcessed": 1059,
  "numSegmentsMatched": 18,
  "numConsumingSegmentsQueried": 0,
  "numDocsScanned": 142101504,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 284203008,
  "numGroupsLimitReached": false,
  "totalDocs": 7374174837,
  "timeUsedMs": 3522,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0
}
```
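The numbers in this response can be put in proportion with a few ratios. A minimal sketch using only fields copied from the output above; the helper name and the interpretation in the comments are assumptions:

```python
stats = {
    "numSegmentsQueried": 1059,
    "numSegmentsMatched": 18,
    "numDocsScanned": 142101504,
    "numEntriesScannedPostFilter": 284203008,
    "totalDocs": 7374174837,
}


def summarize(stats):
    """Turn raw counters into ratios that are easier to reason about."""
    return {
        # Share of the table that was actually scanned.
        "docs_scanned_fraction": stats["numDocsScanned"] / stats["totalDocs"],
        # Share of queried segments that contained matching rows.
        "segments_matched_fraction": stats["numSegmentsMatched"] / stats["numSegmentsQueried"],
        # Column values read per matched document after filtering.
        "entries_per_doc": stats["numEntriesScannedPostFilter"] / stats["numDocsScanned"],
    }


for name, value in summarize(stats).items():
    print(f"{name}: {value:.4f}")
```

Roughly 2% of the documents are scanned, and the post-filter count is exactly twice the document count, which would be consistent with two column values being read for every matched row to serve the ORDER BY and the selection.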
@mayanks: Perhaps add another predicate on secondsSinceEpoch < xxx
@mayanks: That should allow for pruning more segments. If the code doesn't do that today, would be a good feature to add.
@yupeng: you mean the range indexing?
@mayanks: Well range based pruning (based on metadata)
@mayanks: so if minTime = a and maxTime = b (in metadata), then segments where the query is not between a and b can be pruned out
@mayanks: We do have a time segment pruner, I don't recall if it can handle ranges
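The metadata-based pruning idea can be sketched as follows: each segment carries a min and max time, and a segment is skipped when its range cannot overlap the query's time predicate. This is a toy illustration with hypothetical names, not Pinot's time segment pruner:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMeta:
    name: str
    min_time: int  # smallest time value in the segment (from metadata)
    max_time: int  # largest time value in the segment (from metadata)


def prune(segments, lower=None, upper=None):
    """Keep segments that may hold rows with lower < t < upper.

    Either bound may be None, meaning that side of the range is open.
    """
    kept = []
    for seg in segments:
        if lower is not None and seg.max_time <= lower:
            continue  # every row is at or before the lower bound
        if upper is not None and seg.min_time >= upper:
            continue  # every row is at or after the upper bound
        kept.append(seg)
    return kept
```

In this toy a lower bound alone already prunes older segments; adding an upper bound, as suggested above, narrows the kept set further.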
@yupeng: got it
@yupeng: does it require to have both lower and upper bounds?
@fx19880617: use max(hour_start_timestamp_utc)?
@ken: If the segments are time-based, and you set the table config’s columnMinMaxValueGeneratorMode to something that gave you min/max for the secondsSinceEpoch column, then would Pinot optimize out the segments that contain no possible rows?
@mayanks: @yupeng I assumed your predicate timestamp was for 1 week ago. If so, that would prune out data from older than 1 week, but would still scan data for the last one week
@mayanks: @fx19880617 using `max` would still scan, unless we can somehow prune out segments
@yupeng: @fx19880617 tried max, but got error ``` "message": "QueryExecutionError:\njava.lang.NumberFormatException: For input string: \"2020-10-15 09:00:00\"\n\tat sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)\n\tat sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)\n\tat java.lang.Double.parseDouble(Double.java:538)\n\tat org.apache.pinot.core.segment.index.readers.StringDictionary.getDoubleValue(StringDictionary.java:58)\n\tat org.apache.pinot.core.operator.query.DictionaryBasedAggregationOperator.getNextBlock(DictionaryBasedAggregationOperator.java:66)\n\tat org.apache.pinot.core.operator.query.DictionaryBasedAggregationOperator.getNextBlock(DictionaryBasedAggregationOperator.java:43)\n\tat org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49)\n\tat org.apache.pinot.core.operator.CombineOperator$1.runJob(CombineOperator.java:105)\n\tat org.apache.pinot.core.util.trace.TraceRunnable.run(TraceRunnable.java:40)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)\n\tat com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)"```
@mayanks: Yeah, seems that works on numeric data only
@fx19880617: it’s not for string, if that column is number then should be ok
@fx19880617: then you can use the time_converter function to convert it to a human-readable format
@yupeng: that would lead to scan as well?
@fx19880617: if it’s a full segment match, then we should just read it from dictionary
@yupeng: @mayanks that’s right, we wanted to get latest record, but not sure when it was last generated
@mayanks: I don't think max helps here, unless we find a way to avoid segment scans (pruning)
@yupeng: so use a wider range to be safe
@mayanks: Why do you go back 7 days for the latest record?
@yupeng: to be safe
@yupeng: it’s used to track the most recent down time
@mayanks: Your data is time ordered?
@yupeng: need to check with the user, but i hope we can assume so
@mayanks: If you just say max without any predicates then it will work (I think that is what Xiang was suggesting)?
@mayanks: without any predicates, we don't scan, we just look at the metadata
@yupeng: right
@yupeng: we tried that query, and got the error above
@mayanks: Try time-convert as Xiang suggested
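The suggestion in this thread amounts to taking `max` of the numeric epoch column and formatting the result afterwards. A minimal client-side sketch in Python, assuming the column holds seconds since the epoch in UTC; the function names are hypothetical, and the second function only imitates the failure in the stack trace, where a string dictionary value is parsed as a double:

```python
from datetime import datetime, timezone


def epoch_seconds_to_text(seconds):
    """Format epoch seconds the same way as the string column in the thread."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def max_as_double(dictionary_values):
    """Imitate a numeric max over dictionary values: each one must parse as a number."""
    return max(float(value) for value in dictionary_values)


# Numeric values work; a formatted timestamp string does not.
print(epoch_seconds_to_text(int(max_as_double(["1606247126", "1606240000"]))))
try:
    max_as_double(["2020-10-15 09:00:00"])
except ValueError as err:
    print("cannot parse as a number:", err)
```

The split keeps the aggregation on a numeric column, where it can be answered cheaply, and leaves the formatting as a separate step.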
@graham: @graham has joined the channel
@pradeepgv42: Hi, we are trying NFS as deep store for pinot in an off cloud setting, wondering if there’s anything we should be careful about? Also, did anybody try MinIO as deepstore for pinot?
@mayanks: We use NFS too, works well so far. The only issue is pushes of large data size can take longer since all payload has to go through controller
@pradeepgv42: got it thanks

#presto-pinot-streaming

@dungnt: @dungnt has joined the channel

#pinot-docs

@g.kishore: @fx19880617 we forgot to add docs for case statement
@fx19880617:
@ken: On it doesn’t list BOOLEAN as a data type, and on it doesn’t say what the default value would be for BOOLEANs. Should I fix that? And BOOLEAN gets converted to a STRING, right? Any other similar “alias” types that are missing?
@fx19880617: I don’t think we support boolean, it will be converted to string with dictionary.
@ken: I see the “BOOLEAN” field type described on the page (“Data type of the dimension column. Can be STRING, BOOLEAN, INT, LONG, DOUBLE, FLOAT, BYTES”). Also, I’m using it in my schema. I would assume I’d get an error if it wasn’t supported (which is different from getting auto-converted to a string).
@fx19880617: ah ic, this is for the internal implementation details, basically we treat bool same as string field
@ken: Is any validation done? Or is it just blindly converted to string?
@fx19880617: data validation is there
@ken: On , it’s not clear which (if any) of the properties must exist in the table json structure. E.g. `enableDefaultStarTree` is described, but it’s not in my table json copied from `pinot-quickstart`. And for any that are optional, the default value should be called out. E.g. I assume `enableDefaultStarTree` has a default value of `false`.
@chinmay.cerebro: good point. We didn't have a separate column called default value for this table since the original fields did not really have any. We can either update the description or just create a new column for this
@ken: Is every property optional?
@ken: (in that if the json doesn’t contain it, then there’s a default that gets used)
@chinmay.cerebro: Yes I think every property here is optional
@chinmay.cerebro: there is no default for most of them
@chinmay.cerebro: eg: inverted/sorted/bloom columns are empty if nothing is specified. loadMode, nullHandlingEnabled etc have default values
@ken: Got it, thanks

#presto-pinot-connector

@dungnt: @dungnt has joined the channel

#test-channel

@kennybastani: @kennybastani has joined the channel
@karinwolok1: @karinwolok1 has joined the channel
@kennybastani: test
@kennybastani: Did you get the notification?
@karinwolok1: yes
@karinwolok1: I got it earlier from someone which is why I want to change it
@karinwolok1: I am looking on slack settings and it doesn't show that option, although when I google it - it says there is a way. I am looking for it but it doesn't exist on my end
@kennybastani: Okay, I think that the workspace owner (Kishore or Ananth) has to change this setting
@kennybastani:
@karinwolok1: Ok I will tell Kishore