#general


@fx19880617: Hello Community, We’re happy to announce the release of :wine_glass: Apache Pinot 0.7.1! This release includes several awesome new features :page_with_curl: :earth_americas: :unlock: : ```- JSON index - Lookup-based join support - Geospatial support - TLS support for Pinot connections - New APIs for segment management and offline table push - Various performance optimizations, improvements and bug fixes``` Please also see the full release notes and download links. Additional resources: project website, getting started guide, Pinot developer blogs, intro to Pinot video, Twitter, Meetup.
  @vananth22: ```Lookup-based join support``` is the game changer. Thanks for adding it!!!
  @mailtobuchi: Great features. Would love to take the `Lookup join feature` for a spin.
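For anyone wanting to try it, a minimal sketch of a lookup join query using the `LOOKUP` transform function (the table and column names here are hypothetical, not from the thread; the dimension table must be set up as a dimension/lookup table):
```sql
-- Join each fact row against a lookup dimension table at query time.
-- Argument order: LOOKUP(dimTableName, dimColToLookUp, dimJoinKey, factJoinKeyValue)
SELECT orderId,
       LOOKUP('customers', 'customerName', 'customerId', customerId) AS customerName
FROM orders
LIMIT 10
```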
@havjyan: @havjyan has joined the channel
@gabuglc: Hey guys, what is the correct way to add a table/schema from Kafka via the UI?
@gabuglc:
  @mayanks: The error seems to suggest that the table config is missing the schema name?
  @gabuglc: Yes, I'm creating the schema and the table at the same time. The table is on the left, the schema is on the right
  @mayanks: I mean there is supposed to be a schema field in the table config JSON that refers to the name of the schema on the right
  @gabuglc: Isn't it schemaName on the table conf?
  @mayanks: Ah yes, didn't catch it the first time
  @mayanks: Can you upload the schema first and then create the table?
  @jackie.jxt: FYI, the `schemaName` is not mandatory. By default the table will link to the schema with the same name
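For reference, the linkage being discussed is just this fragment of the table config (the name `myTable` is hypothetical; per the thread, `schemaName` defaults to the table name when omitted):
```json
{
  "tableName": "myTable",
  "tableType": "REALTIME",
  "schemaName": "myTable"
}
```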
  @jackie.jxt: @npawar Can you please take a look and see if it is a bug?
  @npawar: This is a very old version; I don't know how it is supposed to behave
  @npawar: can you upgrade?
  @gabuglc: I'm using 0.6.0.
  @gabuglc: And I only got these options
  @npawar: can you use latest tag?
  @gabuglc: just updated. ty
  @mayanks: @gabuglc Did that solve the issue?
  @gabuglc: Yes, thanks a lot
@aaron: Is there anything I can do to make batch import faster? It seems like most of the time is spent processing the Parquet files I'm importing, but I still don't see very high CPU usage on my machine (particularly, most cores are not busy). I see stuff like this in the logs: ```Apr 14, 2021 3:16:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: time spent so far 0% reading (1854 ms) and 99% processing (311813 ms)``` Is there a setting to use more cores to process segments in parallel or anything like that?
  @dlavoie: What about your disk IO?
  @aaron: Looking at some system stats, Disk I/O seems really low: Writes on the order of 100 MB/sec, reads on the order of 8 MB/sec
  @dlavoie: What kind of disks are we talking about? To some extent, 100 MB/sec could be a bottleneck
  @aaron: Looking into that now! Good call
  @aaron: This is an NVMe under a virtualization layer
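On the original parallelism question: recent Pinot versions expose a `segmentCreationJobParallelism` field in the batch ingestion job spec for the standalone runner; whether the version in use honors it is worth verifying against that version's docs. A sketch with placeholder paths:
```yaml
# Hypothetical ingestion job spec fragment; URIs are placeholders.
executionFrameworkSpec:
  name: 'standalone'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/input/'    # placeholder
outputDirURI: 's3://my-bucket/output/'  # placeholder
segmentCreationJobParallelism: 8        # size to your core count, if supported
```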
@zhong.chen: @zhong.chen has joined the channel

#random


@havjyan: @havjyan has joined the channel
@zhong.chen: @zhong.chen has joined the channel

#troubleshooting


@phuchdh: Hello guys, I'm having some issues with the RealtimeToOfflineSegments task. I created 2 hybrid tables from the same data: 1 for the QC env and 1 for the UAT env. • In table management, it seems the `RealtimeToOffline` task in the QC env has stopped, but I cannot find any error logs. • Another question: will the realtime segments be removed after they are converted to offline table segments?
  @fx19880617: Have you checked the logs in minion?
  @phuchdh: Here are the logs of the minion pods.
  @phuchdh: Sometimes my ZooKeeper pods get restarted because the preemptible VMs in GCloud are reclaimed
  @fx19880617: I see, maybe run 3 pinot-zookeepers for HA?
  @phuchdh: I already set up 3 ZooKeepers for HA
  @fx19880617: ok
  @fx19880617: So if there are no task logs on the minion, it means the minion tasks are not scheduled
  @fx19880617: Can you check the controller log and look for `RealtimeToOffline` entries?
  @phuchdh: Only 1 log line comes up grepping for “realtime”
  @fx19880617: Hmm, is this task scheduled? Can you check the minion APIs through the controller Swagger UI?
  @phuchdh: The minion APIs are the Task section in Swagger?
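For example, via the controller's Task endpoints (paths as they appear in recent Swagger UIs; verify against your version before relying on them):
```sh
# List the registered task types, then the states of the
# RealtimeToOfflineSegmentsTask instances.
curl "http://localhost:9000/tasks/tasktypes"
curl "http://localhost:9000/tasks/RealtimeToOfflineSegmentsTask/taskstates"
```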
  @phuchdh: Could you answer question 2: ```will the realtime segments be removed after they are converted to offline table segments?```
  @fx19880617: I don’t think so. This task requires a hybrid table; it creates segments and pushes them to the offline table. You can set a fairly low retention for the realtime table but a longer one for the offline table.
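A sketch of how that might look in the realtime table config (the table name and periods are made up; the key names follow the RealtimeToOfflineSegmentsTask docs):
```json
{
  "tableName": "myTable_REALTIME",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "5"
  },
  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "bufferTimePeriod": "2d"
      }
    }
  }
}
```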
  @fx19880617:
@laxman: Found the root cause. This is possibly due to a bug in Groovy. From thread dumps we see all message handlers are slowly going into the following state and getting stuck there. ```"HelixTaskExecutor-message_handle_thread" #51 daemon prio=5 os_prio=0 cpu=70457.28ms elapsed=4885.80s tid=0x00007fe4e43d6000 nid=0x6e waiting for monitor entry [0x00007fe4aa6e5000] java.lang.Thread.State: BLOCKED (on object monitor) at org.codehaus.groovy.reflection.ClassInfo$GlobalClassSet.add(ClassInfo.java:477) - waiting to lock <0x0000000702ed2218> (a org.codehaus.groovy.util.ManagedLinkedList) at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:83) at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:79) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$EntryWithValue.<init>(GroovyClassValuePreJava7.java:37) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:64) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:55) at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:120) at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:100) at org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:38) at org.codehaus.groovy.reflection.GroovyClassValuePreJava7.get(GroovyClassValuePreJava7.java:94) at org.codehaus.groovy.reflection.ClassInfo.getClassInfo(ClassInfo.java:144) at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:258) at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:883) at groovy.lang.GroovyObjectSupport.<init>(GroovyObjectSupport.java:34) at groovy.lang.Script.<init>(Script.java:42) at Script1.<init>(Script1.groovy) at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(java.base@11.0.10/Native Method) at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(java.base@11.0.10/NativeConstructorAccessorImpl.java:62) at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(java.base@11.0.10/DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(java.base@11.0.10/Constructor.java:490) at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:431) at groovy.lang.GroovyShell.parse(GroovyShell.java:700) at groovy.lang.GroovyShell.parse(GroovyShell.java:736) at groovy.lang.GroovyShell.parse(GroovyShell.java:727) at org.apache.pinot.core.data.function.GroovyFunctionEvaluator.<init>(GroovyFunctionEvaluator.java:73) at org.apache.pinot.core.data.function.FunctionEvaluatorFactory.getExpressionEvaluator(FunctionEvaluatorFactory.java:91) at org.apache.pinot.core.data.function.FunctionEvaluatorFactory.getExpressionEvaluator(FunctionEvaluatorFactory.java:78) at org.apache.pinot.core.util.IngestionUtils.extractFieldsFromSchema(IngestionUtils.java:61) at org.apache.pinot.core.util.IngestionUtils.getFieldsForRecordExtractor(IngestionUtils.java:50) at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1186) at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:314) at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:133) at 
org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:164) at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:88) at jdk.internal.reflect.GeneratedMethodAccessor723.invoke(Unknown Source) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.10/DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(java.base@11.0.10/Method.java:566) at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404) at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331) - locked <0x000000072defa200> (a org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel) at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) at java.util.concurrent.FutureTask.run(java.base@11.0.10/FutureTask.java:264) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.10/ThreadPoolExecutor.java:1128) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.10/ThreadPoolExecutor.java:628) at java.lang.Thread.run(java.base@11.0.10/Thread.java:834)```
@havjyan: @havjyan has joined the channel
@havjyan: Hello everyone, I am very new to Pinot and having some trouble getting it up and running on Windows. I was wondering if there are any resources or video tutorials that I can follow. In a nutshell, I am trying to recreate whatever was done in this blog post. Thank you in advance.
  @dlavoie: While Pinot is written in Java, there are no packaging wrappers for Windows. The startup scripts are intended to be run on Linux or macOS machines.
  @havjyan: Is it possible to utilize Ubuntu ?
  @dlavoie: Of course
  @dlavoie: If you have Docker on your Windows machine, you should be able to follow Kenny’s tutorial
  @dlavoie: The only caveat is that you will have to run the commands from his bootstrap script manually in a Windows terminal rather than executing the shell script
  @havjyan: I have installed Docker Desktop and was also able to install Docker version 19.03.8 in Ubuntu. Where would I actually run these commands? `docker network create PinotNetwork`, `docker-compose up -d`, `docker-compose logs -f --tail=100`
  @dlavoie: In a Windows terminal
  @dlavoie: Also, you need to install `docker-compose`. I think it is part of Docker Desktop but it’s worth double-checking
  @havjyan: Great! Thanks for your help. I am going to try and not give up on this and get it done :slightly_smiling_face:
  @dlavoie: Learning that toolset will be valuable for you far beyond Pinot :slightly_smiling_face:
  @havjyan: I am getting this error after running `docker-compose up -d` in cmd: "Can't find a suitable configuration file in this directory or any parent. Are you in the right directory? Supported filenames: docker-compose.yml, docker-compose.yaml"
  @dlavoie: I don’t believe you are running the command from the git repository
  @havjyan: I see, so I need to navigate to the climate-change-analysis folder and then run the commands from the cloned repo?
  @dlavoie: Yes
  @havjyan: Thank you so much! Everything seems to be working, but when running `docker-compose logs -f --tail=100`, I didn't do `--tail=100` and it's taking a very long time to print the logs. Should I wait until it finishes or is there a way to stop the command and start the bootstrap?
  @dlavoie: Pinot is not a small system. You ran with `docker-compose up -d`, so the system is running in the background. You can exit your `docker-compose logs` command and it will not kill the processes
  @dlavoie: `docker-compose down` to clean everything. `docker-compose stop` to pause it
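Putting the thread's commands together, the compose lifecycle looks roughly like this (run from the cloned repo directory):
```sh
docker-compose up -d               # start everything in the background
docker-compose logs -f --tail=100  # follow logs; Ctrl+C exits without stopping containers
docker-compose stop                # pause the containers
docker-compose down                # tear everything down
```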
  @kennybastani: @havjyan Glad you were able to get up and running. Thanks @dlavoie
  @havjyan: Thank you guys, I am now able to access Pinot. But for some reason the admin/admin credentials do not work for the Superset login ...
  @havjyan: also just noticed that instead of stormEvents I have baseballStats ...
  @kennybastani: Make sure you run `$ sh ./bootstrap.sh`
  @kennybastani: Baseball stats is fine. It's a quick start. But after you run the bootstrap script, it will add all the storm data and events.
  @kennybastani: It might take a while.
  @kennybastani: Also, hopefully you have at least 16 GB of memory on your machine.
  @havjyan: I am running these commands in the Windows command prompt, so it does not recognize the `sh` command. As Daniel suggested, I ran the commands from your bootstrap script manually...
  @havjyan: docker exec -ti pinot_app_noaa bash -c "sh ./import/import-storm-events.sh"
  @havjyan: this is what I am getting in response `: not foundport-storm-events.sh: 2: ./import/import-storm-events.sh: Adding 'stormEvents' table to Pinot... : not foundport-storm-events.sh: 4: ./import/import-storm-events.sh: " is not a valid option : not foundport-storm-events.sh: 7: ./import/import-storm-events.sh: Downloading CSV files from NOAA server... : not foundport-storm-events.sh: 9: ./import/import-storm-events.sh: --2021-04-14 18:52:34-- => '' Resolving ()... 205.167.25.137, 2610:20:8040:2::137 Connecting to ()|205.167.25.137|:21... connected. Logging in as anonymous ... Logged in! ==> SYST ... done. ==> PWD ... done. ==> TYPE I ... done. ==> CWD (1) /pub/data/swdi/stormevents/csvfiles ... done. ==> PASV ... done. ==> LIST ... done. [ <=> ] 23.05K 140KB/s in 0.2s 2021-04-14 18:52:35 (140 KB/s) - '' saved [23606] No matches on pattern 'StormEvents_details-ftp_v1.0_*.csv.gz\r'. FINISHED --2021-04-14 18:52:35-- Total wall clock time: 0.7s Downloaded: 1 files, 23K in 0.2s (140 KB/s) : not foundport-storm-events.sh: 11: ./import/import-storm-events.sh: ./import/import-storm-events.sh: 15: ./import/import-storm-events.sh: Syntax error: word unexpected (expecting "do")`
  @kennybastani: Can you take a screenshot of your terminal output and paste it here?
  @havjyan:
  @kennybastani: Ah. This is a shell script, which you can only run in a Unix/Linux environment. The Windows terminal uses bat scripts, which is why it's not working.
  @kennybastani: Windows does have a bash-based terminal
  @kennybastani:
  @kennybastani: Definitely a bummer trying to run this stuff and learn on Windows
  @kennybastani: But if you follow that guide, you'll be able to run shell scripts successfully
  @havjyan: I will install WSL and start over. I did learn a lot today though :slightly_smiling_face:. Thanks again for your help. Just one last question: would I be able to set everything up in Ubuntu?
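A sketch of the WSL route (assuming WSL with Ubuntu installed; the repo directory name is from the blog post referenced above):
```sh
# From a WSL (Ubuntu) shell, not cmd.exe:
cd climate-change-analysis   # the cloned repo from the blog post
sh ./bootstrap.sh            # the script from Kenny's tutorial
```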
@aaron: I'm running a SegmentCreationAndTarPush batch ingest that has outputDirURI as a path on S3. I notice that if I drop the table, re-create the table, and run the batch ingest job again with different data (a different inputDirURI), it still ends up pushing all of the old data from the previous batch ingest job. Is this expected? Is there some way I can prevent it from happening?
  @g.kishore: It's because the output directory still contains all the old segments.
  @g.kishore: you can delete that before running the job
  @aaron: Ok thanks
  @aaron: Easy enough :slightly_smiling_face: So should I think of the outputDirURI as a temporary directory for the batch ingestion job?
  @g.kishore: yes
  @g.kishore: This has come up multiple times; maybe we should automatically delete it or detect only the newly generated files
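For an S3 `outputDirURI`, that cleanup might look like this (bucket and prefix are placeholders):
```sh
# Clear the previous job's segments before re-running the batch ingest,
# so stale segments are not pushed again.
aws s3 rm --recursive s3://my-bucket/pinot-output/
```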
@zhong.chen: @zhong.chen has joined the channel

#pinot-dev


@snlee: Hi, it looks like all the PRs are failing at the unit tests ```[INFO] ERROR in /home/runner/work/incubator-pinot/incubator-pinot/pinot-controller/src/main/resources/app/pages/Query.tsx [INFO] ./app/pages/Query.tsx [INFO] [tsl] ERROR in /home/runner/work/incubator-pinot/incubator-pinot/pinot-controller/src/main/resources/app/pages/Query.tsx(238,20) [INFO] TS2345: Argument of type '{ data: any[]; fileName: string; exportType: any; }' is not assignable to parameter of type 'IOption<void>'. [INFO] Property 'fields' is missing in type '{ data: any[]; fileName: string; exportType: any; }' but required in type 'IOption<void>'. [INFO] Child HtmlWebpackCompiler: [INFO] 1 asset [INFO] Entrypoint HtmlWebpackPlugin_0 = __child-HtmlWebpackPlugin_0 [INFO] [0] ./node_modules/html-webpack-plugin/lib/loader.js!./app/index.html 1.42 KiB {0} [built] Error: npm ERR! code ELIFECYCLE Error: npm ERR! errno 2 Error: npm ERR! pinot-controller-ui@1.0.0 build: `webpack --mode production` Error: npm ERR! Exit status 2 Error: npm ERR! Error: npm ERR! Failed at the pinot-controller-ui@1.0.0 build script. Error: npm ERR! This is probably not a problem with npm. There is likely additional logging output above. Error: Error: npm ERR! A complete log of this run can be found in: Error: npm ERR! /home/runner/.npm/_logs/2021-04-14T17_33_03_262Z-debug.log```
@snlee: I see some errors when building the UI npm project. Is someone looking into this?
  @jackie.jxt: Yes, @gaurav is working on a fix now

#feat-partial-upsert


@yupeng: we are having the meeting now, feel free to join
@yupeng:
@jackie.jxt: Regarding this issue, I think we should model partial upsert as read-update-write for a given primary key, similar to mutable databases
@jackie.jxt: All the changes should happen through Kafka (as Kishore said, we treat Kafka as the write API)
@jackie.jxt: The custom update logic can be achieved via the partial upsert API `GenericRow update(GenericRow current, GenericRow new)`
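A minimal Java sketch of what such a pluggable update function might look like. The class name and merge policy are illustrative, not an actual Pinot API; only the `update(current, new)` shape comes from the thread, and the `new` parameter is renamed since it is a Java keyword:
```java
import org.apache.pinot.spi.data.readers.GenericRow;

public class OverwriteNonNullMerger {

  /** Merge an incoming record into the current record for the same primary key. */
  public GenericRow update(GenericRow current, GenericRow incoming) {
    // Overwrite only the columns the incoming record actually carries;
    // untouched columns keep their previously stored values.
    incoming.getFieldToValueMap().forEach((column, value) -> {
      if (value != null) {
        current.putValue(column, value);
      }
    });
    return current;
  }
}
```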
@yupeng: OK. Then the only concern is the performance of backfilling large volumes of data
@jackie.jxt: That should be very rare, and I think performance should be okay. Basically we just replay all the changes from Kafka.
@yupeng: I don’t think it’s rare. There are cases when people want to correct some data without changing the timestamp
@yupeng: In this case, how do you write the Kafka flow?
@jackie.jxt: The update logic can be customized
@jackie.jxt: It might be common to correct data for some primary keys, but it should not be common to do large-scale updates
@yupeng: How about you write what the solution would be like with the two examples in the doc?
@jackie.jxt: ```
(1, a1, b1, c1, -, -, -, t1)    & (1, -, -, -, d1, e1, f1, t0) -> (1, a1, b1, c1, d1, e1, f1, t1)
(1, a1, b1, c1, d1, e1, f1, t1) & (1, -, -, -, d2, e2, f2, t2) -> (1, a1, b1, c1, d2, e2, f2, t2)
(1, a1, b1, c1, d2, e2, f2, t2) & (1, -, -, -, d1, e1, f1, t0) -> (1, a1, b1, c1, d2, e2, f2, t2)
```
@jackie.jxt: • To update an existing field, the new record should have a newer timestamp • To fill in fields without overwriting existing ones, use an older timestamp on the new record (as in the first and third examples above)
@yupeng: Okay. Then when do we need the direct segment push that @tingchen added?
@jackie.jxt: When bootstrapping the table
@jackie.jxt: Actually the bootstrap could also be done via Kafka
@jackie.jxt: The reason we have to do segment replacement is that Pinot is append-only, but that is not the case with upsert
@jackie.jxt: So it might actually make sense to have everything go through Kafka for upsert tables
@yupeng: There is usually a 10-20x throughput difference between direct push and going through Kafka
@yupeng: In our past observations, backfill via Kafka may take days whereas direct push may take just a couple of hours
@jackie.jxt: The initial bootstrap can be done via direct push, then the updates come through Kafka
@jackie.jxt: The problem here is that we should not change the history for an upsert table. Also, it is better to pay the extra cost at write time instead of query time
@jackie.jxt: Also, if we can make the assumption that a primary key won't get any update after a period of time (say 3 days for an order), we might be able to flush it to an offline table
@yupeng: Bootstrap makes sense
@yupeng: If we cannot easily change history, then we should consider a feature to replace an existing table
@yupeng: Think from the user's perspective about what the data correction flow should look like
@jackie.jxt: To correct the record for a primary key, simply put the desired record with the current (latest) timestamp
@jackie.jxt: Hmm, for different scenarios, users might want to put different timestamps on the update message
@jackie.jxt: But the update logic should be general enough to handle all this custom logic
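Concretely, correcting a record by writing the full desired row with the latest timestamp would look like this in the notation above (starred values are the corrections; values hypothetical):
```
(1, a1, b1, c1, d2, e2, f2, t2) & (1, a1*, b1*, c1*, d2, e2, f2, t3) -> (1, a1*, b1*, c1*, d2, e2, f2, t3)
```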