#general


@octchristmas: Hi team~ What is the difference between 'append' and 'refresh' in segmentPushType for an offline table?
  @mayanks: Append means data push is incremental (say hourly/daily) and is appended to the table.
  @mayanks: REFRESH implies that all data to this table will be refreshed with each push.
  @mayanks: Note, these are ways to specify user intent to Pinot, so it can take care of background stuff like time-boundary management, retention, etc.
  @ken: @mayanks - actually doesn’t REFRESH imply that _segments_ will be refreshed (replaced completely) during the push? Asking because we have a table where we push a subset of segments as updates every day, configured as refresh, and it’s working fine. So it’s not “all data to this table will be refreshed”.
  @ken: @mark.needham - I was looking at the page that @octchristmas referenced in this question *(), and noticed that it said to use `IngestionConfig -> BatchIngestionConfig -> segmentPushType` as of 0.7, but the only link to `IngestionConfig` on that page takes you to Ingestion Transformations (), which doesn’t talk about the configuration settings. I see that `BatchIngestionConfig` is also referenced on , but that’s more like a tutorial versus field documentation.
  @mayanks: @ken I don’t think Pinot does the REFRESH on its own. It relies on the user to overwrite all segments, something that needs to be clarified in the docs if it is not. cc: @mark.needham
  @ken: Hi @mayanks - My comment wasn’t about how much Pinot does on its own. It’s about whether “all segments” need to be overwritten or not. Based on my experience, refresh just means that the user is responsible for updating segments (which can be all segments for a table, or some subset), but it’s not a full table operation.
  @mayanks: You are correct, the user is responsible, and if they only update part of the data Pinot won’t flag it.
  @octchristmas: @mayanks @ken Thank you. I found that the offline table's RetentionManager only deletes segments of type 'append'. So I was curious about the difference between the two. But I still can't tell the difference between 'refresh' and 'append'. When I tested with sample data (transcript data), both types of tables replaced segments when ingesting data in the same time range. I expected that the 'refresh' type replaces the segment, and the 'append' type creates one more segment so the data ends up with 2 copies, but it did not. Can you tell me in more detail what the difference in behavior between these two is in Pinot? And, is it correct to use 'ingestionConfig > batchIngestionConfig > segmentPushType' instead of 'segmentsConfig > segmentPushType'? When I set 'ingestionConfig > batchIngestionConfig > segmentPushType', it is not actually reflected. When I set 'segmentsConfig > segmentPushType', it is actually reflected, and I tested with this setting. tableConfig:
```
{
  "tableName": "transcript_trans",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript_trans",
    "replication": 1,
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "HOURS",
    "retentionTimeValue": 1
  },
  "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentPushType": "APPEND"
    },
    "filterConfig": {
      "filterFunction": "Groovy({(score1 as float) >= 4 && (score1 as float) < 6}, score1)"
    },
    "transformConfigs": [
      {
        "columnName": "fullName",
        "transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
      },
      {
        "columnName": "scoreSum",
        "transformFunction": "Groovy({score1+score2}, score1, score2)"
      },
      {
        "columnName": "datetime",
        "transformFunction": "toDateTime(timestampInEpoch, 'yyyy-MM-dd HH:mm:ss')"
      }
    ]
  },
  "metadata": {}
}
```
-- pinot controller ui, tableConfig:
```
"ingestionConfig": {
  "batchIngestionConfig": {},
  "filterConfig": {
    "filterFunction": "Groovy({(score1 as float) >= 4 && (score1 as float) < 6}, score1)"
  },
```
  @mayanks: From Pinot’s behavior, you are right that the only difference is that the retention manager will delete segments for an APPEND table. Also, time-boundary management happens for APPEND. REFRESH is more for snapshot-type use cases where you want to rewrite the entire data in Pinot with each push.
  @mayanks: In your case, is it a real-time table, or a hybrid table with a time column and incremental pushes to Pinot? If yes, that indicates you need an APPEND table.
  @mayanks: But if it is offline-only and each push overwrites all data in Pinot, then it is a REFRESH use case.
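  For reference, a minimal sketch of the two places the push type can be declared, keeping the table name from the config above. The `segmentsConfig.segmentPushType` location is the older one; the `ingestionConfig.batchIngestionConfig` location is the one documented as of 0.7, and in some versions the field there is named `segmentIngestionType` rather than `segmentPushType`, which might be why setting it there appeared to be ignored. Worth verifying against the docs for your Pinot version:
```
{
  "tableName": "transcript_trans",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "segmentPushType": "APPEND"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
  }
}
```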
@vasanth.reddy: @vasanth.reddy has joined the channel
@omkaunda: @omkaunda has joined the channel
@hughmil3s: @hughmil3s has joined the channel
@casber: @casber has joined the channel
@kiukchung: @kiukchung has joined the channel

#random


@vasanth.reddy: @vasanth.reddy has joined the channel
@omkaunda: @omkaunda has joined the channel
@hughmil3s: @hughmil3s has joined the channel
@casber: @casber has joined the channel
@kiukchung: @kiukchung has joined the channel
@tonya: Hey folks! I wanted to let you know that StarTree will be hosting our first real-time user-facing analytics conference in Aug in SFO. The CFP is now open, and we’d love to see contributions from the Pinot community! :heart: I’m the StarTree Community Manager, so please feel free to reach out with any questions. I hope to see you there. :slightly_smiling_face:
@jt: @jt has left the channel

#troubleshooting


@vasanth.reddy: @vasanth.reddy has joined the channel
@prashant.pandey: Hi team. I have a column with very high cardinality. It is defined as an MV column in my schema, and we have an inverted index defined on it. We want to disable the dictionary on this column since the dictionary is almost 50% of the total segment size, but Pinot does not let me disable it and errors out with the following message: “Cannot create an Inverted index on column tags__KEYS specified in the noDictionaryColumns config”. Can I disable the dict. on this column and still have an inverted index defined on it? Also, when should we use a forward index vs. an inverted index?
  @kharekartik: We have recently added a feature to address this issue in our main branch: the `optimizeDictionaryForMetrics` config. This only works for single-valued columns currently, though. For your second question, forward indexes are created by default. An inverted index should be used when you want a constant-time lookup for a column value.
  @kharekartik: Inverted index unfortunately can't be created on no dictionary columns. So you will need to have a dictionary for it to work. @mayanks Is there a better solution here?
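  For context, a rough sketch of where the `optimizeDictionaryForMetrics` flag mentioned above would likely sit, assuming it lands under `tableIndexConfig` alongside the existing size-ratio threshold (field names and defaults worth double-checking against the main-branch docs once released):
```
{
  "tableIndexConfig": {
    "optimizeDictionaryForMetrics": true,
    "noDictionarySizeRatioThreshold": 0.85
  }
}
```
  The idea, as described above, is that Pinot then decides per single-valued metric column whether storing it raw saves enough space to skip the dictionary, instead of the column having to be listed in `noDictionaryColumns` by hand.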
  @prashant.pandey: > So you will need to have a dictionary for it to work. In our case, our segments are around 800MB, out of which 400MB is the size of the dict due to the cardinality (700k unique values). Can we somehow optimise this?
  @prashant.pandey: Like we want the inverted index without the dict. The dict. anyway wouldn’t be that helpful due to the high cardinality, I presume? The docs themselves say that if there is very little duplicate data, the compression is not that useful.
  @kharekartik: Yeah, having such large dicts doesn't make sense. I am not sure though why the inverted index is not supported on raw columns. Can I ask one thing? If it is a high-cardinality column, do you do a lot of lookups on it, or mostly range filters?
  @prashant.pandey: Yes, we do have a lot of filters. We have two such columns: `tag__KEYS`: stores the keys of our tags. `tag__VALUES`: stores the values of the corresponding tags in `tag__KEYS`. So basically, they look like:
```
tag__KEYS: "key1, key2, key3"
tag__VALUES: "val1, val2, val3"
```
Almost all of our queries have predicates on these two columns.
  @mayanks: Are you using Map type here? Seems like you can use JSON type which can have json index on no dict column
  @mayanks: @prashant.pandey ^^
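  A rough sketch of that suggestion, assuming the key/value pairs were folded into a single JSON string column (hypothetically named `tags` here) instead of the two MV columns; the JSON index can be built on a no-dictionary column, so the large dictionary goes away:
```
{
  "tableIndexConfig": {
    "noDictionaryColumns": ["tags"],
    "jsonIndexColumns": ["tags"]
  }
}
```
  Queries would then filter with Pinot's `JSON_MATCH` predicate on `tags` rather than paired predicates on `tag__KEYS`/`tag__VALUES`.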
  @prashant.pandey: @mayanks Sorry, was AFK for a long time. Yes, true, in fact that is exactly what is on the roadmap. But that’ll take some time. I was looking for a quick solution that could reduce the size of my segments by almost 50% :slightly_smiling_face:.
  @mayanks: Short of that, I don’t think there’s a quick solution there.
@harish.bohara: In my setup i have
```
volumeMounts:
  - name: data
    mountPath: /bitnami/zookeeper

I have no name!@zookeeper-0:/$ df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          50G  3.4G   47G   7% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme0n1p1   50G  3.4G   47G   7% /etc/hosts
/dev/nvme1n1     99G  441M   98G   1% /bitnami/zookeeper
shm              64M     0   64M   0% /dev/shm
tmpfs           3.8G   12K  3.8G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           3.8G     0  3.8G   0% /proc/acpi
tmpfs           3.8G     0  3.8G   0% /sys/firmware

zookeeper-0:/bitnami/zookeeper/data$ ls -la
total 16
drwxrwsr-x 3 1001 1001 4096 Apr 19 07:58 .
drwxrwsr-x 4 root 1001 4096 Apr 19 07:58 ..
-rw-rw-r-- 1 1001 1001    2 Apr 19 07:58 myid
drwxrwsr-x 2 1001 1001 4096 Apr 19 15:08 version-2
```
@omkaunda: @omkaunda has joined the channel
@hughmil3s: @hughmil3s has joined the channel
@srini: hey y’all! @hughmil3s and I were having some trouble getting Pinot connected to Superset and would love any help here. We tried both the URIs in the superset docs () and the Pinot docs () and neither seemed to work for us. I specifically tested with the pinot quickstart docker image and kept getting generic SQLAlchemy errors, which makes it hard to debug! Hugh’s trying to test this PR () in the Superset repo, which improves the Pinot capabilities in Superset :zap: cc @kennybastani @chinmay.cerebro @g.kishore
  @hughmil3s: I’m currently running pinot locally with docker on 9000
  @hughmil3s: and this is the connection string that i’ve created: ``
  @hughmil3s: when i try to connect just using `pinotdb` i’ve been getting this error:
```
~/src/superset/venv/lib/python3.8/site-packages/pinotdb/db.py in check_sufficient_responded(self, query, queried, responded)
    261             needed = fraction
    262         if responded < 0 or responded < needed:
--> 263             raise exceptions.DatabaseError(
    264                 f"Query\n\n{query} timed out: Out of {queried}, only"
    265                 f" {responded} responded, while needed was {needed}"

DatabaseError: Query

select * from baseballStats timed out: Out of -1, only -1 responded, while needed was -1
```
  @hughmil3s: let me know if anyone can help me debug so i can verify my changes work as intended
  @kennybastani: Try `pinot+` for the connection string
  @hughmil3s: same error
  @hughmil3s:
```
2022-04-19 10:21:16,806:INFO:pinotdb.sqlalchemy:Updated pinot dialect args from {'dbapi': <module 'pinotdb' from '/Users/hugh/src/superset/venv/lib/python3.8/site-packages/pinotdb/__init__.py'>, 'debug': False, 'verify_ssl': True}: None and False
2022-04-19 10:21:16,807:INFO:pinotdb.sqlalchemy:Updated pinot dialect args from {'host': 'pinot', 'port': 8000, 'path': 'query', 'scheme': 'http', 'verify_ssl': True, 'username': None, 'password': None, 'debug': False}: and False
2022-04-19 10:21:16,807:DEBUG:superset.stats_logger:[stats_logger] (incr) test_connection_attempt
2022-04-19 10:21:16,837:DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): pinot:8000
2022-04-19 10:21:16,844:DEBUG:superset.stats_logger:[stats_logger] (incr) test_connection_error.DBAPIError
2022-04-19 10:21:16,846:DEBUG:superset.stats_logger:[stats_logger] (incr) DatabaseRestApi.test_connection.error
2022-04-19 10:21:16,847:WARNING:superset.views.base:[SupersetError(message='(builtins.NoneType) None\n(Background on this error at: ', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'Apache Pinot', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
```
  @kennybastani: Can you get the log output from Pinot?
  @kennybastani: Oh hey, sorry my URL was wrong
  @kennybastani: `pinot+`
  @kennybastani: That's what I get from copy and pasting from my old code bases :stuck_out_tongue_winking_eye:
  @kennybastani: Oh still wrong...
  @kennybastani: Okay, it should be good now
  @kennybastani: URL decoded it looks like this: `pinot+`
  @kennybastani: Where did you get the format for your connection string?
  @hughmil3s: from the docs this is what i’m using to test now:
```
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine('pinot+')

places = Table('baseballStats', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=places).scalar())
```
  @chinmay.cerebro: @hughmil3s just to confirm - database connection is fine - but query is timing out correct ?
  @kennybastani: Did that connection string work out?
  @hughmil3s: for what i posted above this is the error i’m getting: ```ConnectionError: HTTPConnectionPool(host='localhost', port=8000): Max retries exceeded with url: /query/sql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10730a790>: Failed to establish a new connection: [Errno 61] Connection refused'))```
  @hughmil3s: With pinotdb:
```
from pinotdb import connect

conn = connect(host='localhost', port=9000, path='/query/sql', scheme='http')
curs = conn.cursor()
curs.execute("""
select * from baseballStats
""")
for row in curs:
    print(row)
```
```
~/src/superset/venv/lib/python3.8/site-packages/pinotdb/db.py in check_sufficient_responded(self, query, queried, responded)
    261             needed = fraction
    262         if responded < 0 or responded < needed:
--> 263             raise exceptions.DatabaseError(
    264                 f"Query\n\n{query} timed out: Out of {queried}, only"
    265                 f" {responded} responded, while needed was {needed}"

DatabaseError: Query

select * from baseballStats timed out: Out of -1, only -1 responded, while needed was -1
```
  @kennybastani: Can you `curl `
  @hughmil3s: curl: (7) Failed to connect to localhost port 8000: Connection refused
  @kennybastani: Okay, so your broker is not available at `localhost:8000`
  @hughmil3s:
  @hughmil3s: for context
  @hughmil3s: how do i figure out what port my broker is living at
  @kennybastani: One sec
  @hughmil3s: is it 8000?
  @kennybastani: It should be
  @kennybastani: Try `netstat -vanp tcp | grep '*.2123\|9000\|8000\|7000'`
  @hughmil3s:
```
✦ ➜ netstat -vanp tcp | grep '*.2123\|9000\|8000\|7000'
tcp6   0  0  ::1.9000   ::1.57711  FIN_WAIT_2  407508  146808    478  0  0x0112  0x0000000c
tcp6   0  0  ::1.57711  ::1.9000   CLOSE_WAIT  407577  146808  57242  0  0x0022  0x00000000
tcp46  0  0  *.9000     *.*        LISTEN      131072  131072    478  0  0x0100  0x00000006
tcp6   0  0  ::1.9000   ::1.63536  TIME_WAIT   403702  146808    478  0  0x2131  0x0000000c
tcp6   0  0  ::1.9000   ::1.63537  TIME_WAIT   400527  146808    478  0  0x2131  0x0000000c
tcp6   0  0  ::1.9000   ::1.63538  TIME_WAIT   404686  146808    478  0  0x2131  0x0000000c
```
  @kennybastani: Well thats odd :stuck_out_tongue_winking_eye:
  @kennybastani: Which version of Pinot are you using?
  @hughmil3s: so i ran this to start up the cluster:
```
docker run \
  -p 9000:9000 \
  apachepinot/pinot:0.9.3 QuickStart \
  -type batch
```
  @hughmil3s: so 0.9.3
  @kennybastani: Ah, so you're running your test script inside the container of the controller
  @kennybastani: Which makes sense why you can't see the broker
  @hughmil3s: got it
  @kennybastani: Let's see
  @hughmil3s: what is the proper command
  @kennybastani: Oh, if it's QuickStart it should be all in one container
  @kennybastani: Try adding the broker and server ports to your docker command
  @hughmil3s:
```
docker run \
  -p 9000:9000 \
  -p 8000:8000 \
  -p 7000:7000 \
  apachepinot/pinot:0.9.3 QuickStart \
  -type batch
```
  @hughmil3s: right?
  @kennybastani: Yeah that looks good
  @kennybastani: That should solve the problem I think
  @hughmil3s: working on it now
  @hughmil3s: and i’ll report back thanks kenny
  @kennybastani: :tada:
  @kennybastani: Sounds good Hugh. Let me know if I can help with anything
  @hughmil3s: getting the same thing
  @hughmil3s:
```
✦ ➜ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED          STATUS         PORTS                                                                                   NAMES
fe8b5fdf9dcb   apachepinot/pinot:latest   "./bin/pinot-admin.s…"   14 minutes ago   Up 2 minutes   0.0.0.0:7000->7000/tcp, 0.0.0.0:8000->8000/tcp, 0.0.0.0:9000->9000/tcp, 8096-8099/tcp   confident_curie
```
  @hughmil3s:
```
✦ ➜ netstat -vanp tcp | grep '*.2123\|9000\|8000\|7000'
tcp46  0  0  *.8000     *.*        LISTEN      131072  131072    478  0  0x0100  0x00000006
tcp46  0  0  *.7000     *.*        LISTEN      131072  131072    478  0  0x0100  0x00000006
tcp46  0  0  *.9000     *.*        LISTEN      131072  131072    478  0  0x0100  0x00000006
tcp6   0  0  ::1.9000   ::1.57711  FIN_WAIT_2  407508  146808    478  0  0x0112  0x0000000c
tcp6   0  0  ::1.57711  ::1.9000   CLOSE_WAIT  407577  146808  57242  0  0x0022  0x00000000
```
  @kennybastani: Okay, that looks good. We might need to update that connection string. It is likely correct in the docs. I took mine from a much older version of Pinot.
  @hughmil3s: waiitttt
  @hughmil3s: we got it!
  @hughmil3s: `pinot+`
  @kennybastani: Woot :slightly_smiling_face:
  @kennybastani: Sweet. If you folks need anything else, ping me. Looking forward to seeing the enhancements.
  @hughmil3s: naw we are good i’m going to test this pr with the new grains then report there thanks kenny
  @kennybastani: You're very welcome
  @hughmil3s: Running Pinot w/ Docker:
```
docker run \
  -p 9000:9000 \
  -p 8000:8000 \
  -p 7000:7000 \
  apachepinot/pinot:0.9.3 QuickStart \
  -type batch
```
Working Connection String: `pinot+`
Python Test Example:
```
from pinotdb import connect

conn = connect(host='localhost', port=8000, path='/query/sql', scheme='http')
curs = conn.cursor()
curs.execute("""
select * from baseballStats
""")
for row in curs:
    print(row)
```
  @kennybastani: Okay I'll track back in the doc changes to see what happened
  @hughmil3s: :thumbsup:
  @ahsen.m: i am able to connect superset with pinot like this ```pinot+```
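  For anyone finding this thread later: the actual connection strings above were stripped by the export, so here is a minimal sketch of the two ways the working setup was exercised. It assumes the quickstart ports from the Docker command above (broker on 8000, controller on 9000) and the generally documented pinotdb SQLAlchemy URI shape; verify both against the pinotdb/Superset docs for your versions:
```
# Sketch only: assumes the Pinot quickstart container maps the broker to
# localhost:8000 and the controller to localhost:9000, as in the docker
# command above, and that the SQLAlchemy URI follows the documented
# pinot+http://<broker>/query/sql?controller=<controller> shape.
from pinotdb import connect
from sqlalchemy import create_engine, text

# Plain DB-API connection straight to the broker's /query/sql endpoint.
conn = connect(host='localhost', port=8000, path='/query/sql', scheme='http')
curs = conn.cursor()
curs.execute('select * from baseballStats limit 5')
for row in curs:
    print(row)

# SQLAlchemy engine via the pinot dialect; the same style of URI is what
# Superset expects in its database connection form (assumed format).
engine = create_engine(
    'pinot+http://localhost:8000/query/sql?controller=http://localhost:9000/'
)
with engine.connect() as c:
    print(c.execute(text('select count(*) from baseballStats')).scalar())
```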
@casber: @casber has joined the channel
@kiukchung: @kiukchung has joined the channel

#getting-started


@vasanth.reddy: @vasanth.reddy has joined the channel
@omkaunda: @omkaunda has joined the channel
@hughmil3s: @hughmil3s has joined the channel
@casber: @casber has joined the channel
@kiukchung: @kiukchung has joined the channel