#general


@jacob.branch: @jacob.branch has joined the channel
@piercarlo.paltro: @piercarlo.paltro has joined the channel
@satyam.raj: Hey guys, how should we handle the scenario where multiple kafka topics need to be ingested into pinot and joined to produce the final result? Should there be a pre-aggregate/lookup streaming job that consolidates the data from multiple topics into one topic that pinot ingests, or should we use Presto to do the joins?
  @kharekartik: Do all topics receive data in same format?
  @kharekartik: @walterddr for joins
  @satyam.raj: the topics will have different data, like “app_install” can be one kafka topic, and “app_open” can be another kafka topic. these two need to be joined
  @mayanks: You likely need a stream processing (Flink) job upstream for this. Unless all you want to do is a dimension lookup, in which case refer to
  @satyam.raj: But what if there are lots of such events, any two of which can be joined at query time?
  @mayanks: In lookup join, the dimension table is static (periodic refresh). If you are referring to flink, then that’s what it is made for.
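  For the dimension-lookup case, a rough sketch of what that can look like in Pinot SQL (table and column names here are hypothetical; this assumes the joined-to data is loaded as a Pinot dimension table with a primary key and refreshed periodically):
  ```sql
  -- app_install is the fact stream; userAttributes is a hypothetical dimension table
  -- lookUp(dimTable, dimColumnToFetch, dimJoinKey, factJoinValue)
  SELECT
    userId,
    appId,
    lookUp('userAttributes', 'country', 'userId', userId) AS country
  FROM app_install
  LIMIT 10
  ```
  Anything beyond that kind of static lookup (e.g. joining two event streams like app_install and app_open at query time) is where the upstream Flink job or a query layer like Presto comes in.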
@arawat: Hi Pinot team, our security team flagged our pinot deployment in labs for security vulnerabilities, with the majority coming from `com.fasterxml.jackson`, and all of them are addressed in newer versions of the dependencies. Any thoughts on how we should go about addressing these? Can share the list with you if interested.
  @mayanks: Will dm
@alex.gartner: @alex.gartner has joined the channel
@madison.s204: @madison.s204 has joined the channel
@alex.gartner: Hi all, doing some testing with Pinot lately. Just wondering, is there a "Kibana"-like tool for Pinot that can make it a little bit easier to visualize data, without having to write an application that does so?
  @diogo.baeder: Have you tried Apache Superset?
  @mayanks: Yes, you can refer to:
  @alex.gartner: thank you both! haven't checked superset but it looks perfect
  @mayanks: Glad to assist
@alex.gartner: Another question I've been wondering about is this idea of both realtime and offline tables being queried at once, via the same table name. Does anyone have an interesting use case for when they've used this? I'm trying to wrap my head around one
  @mayanks: Yes this is a very common pattern. What’s your question on this one?
  @alex.gartner: really just trying to imagine where this would be useful. in my case, our streaming data sources are usually so different from our batched data that I'm wondering why I'd want to query them at the same time
  @alex.gartner: do you have an example of this in practice?
  @mayanks: By example you mean a config setup? Or just want to know who is running it?
  @mayanks: If former
  @alex.gartner: latter, just a scenario in which it makes sense
  @mayanks: I think many of LinkedIn’s use cases follow that pattern. For example, “who viewed my profile” that is powered by Pinot follows that
  @mayanks: Real-time ingestion gives you freshness. Offline gives you the opportunity to pre-aggregate, correct stream errors, etc.
  @mayanks: So you get best of both worlds. Does that make sense?
  @alex.gartner: ahhh yeah totally. ty!
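  For anyone reading along, a trimmed-down sketch of that hybrid pattern (table/column names hypothetical, almost all settings omitted): you push two table configs with the same `tableName`, one OFFLINE and one REALTIME, and queries against `profileViews` automatically span both, with Pinot using a time boundary to decide which side serves which time range.
  ```json
  {
    "tableName": "profileViews",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeColumnName": "viewTimeMillis",
      "replication": "2"
    },
    "tableIndexConfig": {},
    "tenants": {},
    "metadata": {}
  }
  ```
  The REALTIME config looks the same except for `"tableType": "REALTIME"` plus the Kafka `streamConfigs`; the broker fans a query on `profileViews` out to both halves.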
@wadodkar: @wadodkar has joined the channel
@kevin.kamel: @kevin.kamel has joined the channel
@carolyn: @carolyn has joined the channel

#random


@jacob.branch: @jacob.branch has joined the channel
@piercarlo.paltro: @piercarlo.paltro has joined the channel
@alex.gartner: @alex.gartner has joined the channel
@madison.s204: @madison.s204 has joined the channel
@wadodkar: @wadodkar has joined the channel
@kevin.kamel: @kevin.kamel has joined the channel
@carolyn: @carolyn has joined the channel

#troubleshooting


@jacob.branch: @jacob.branch has joined the channel
@sowmya.gowda: Hi Team, I'm facing an issue with pinot data types. I have a column jobTitle with the value "Staff RN (Med Surg, Ortho/Neuro, GI/GU floor" in my file, and the schema defines it as a string data type only. But I'm getting an error while loading into the table - `Cannot read single-value from Object[]: [Staff RN (Med Surg, Ortho/Neuro, GI/GU floor] for column: jobTitle`
  @saurabhd336: Can you share your table config, schema JSON, and the data format, data file / data JSON you're trying to ingest? Is this a realtime table or an offline table?
  @sowmya.gowda: It's an offline table ingesting from a CSV file. Sharing a tar file containing the table config, schema, and job_specification file, plus the raw_data/xab.csv file
  @saurabhd336: @sowmya.gowda values like `Staff RN (Med Surg; Ortho/Neuro; GI/GU floor` are the culprits here. The `;` character is the default multi-value separator for the CSVRecordReader configured in the job spec used to ingest the data. I was able to generate the segment correctly with this spec:
  ```
  executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  jobType: SegmentCreationAndTarPush
  inputDirURI: '/Users/saurabh.dubey/Downloads/test2_candidate/raw_data/'
  includeFileNamePattern: 'glob:**/*.csv'
  outputDirURI: '/Users/saurabh.dubey/Downloads/test2_candidate/segments/'
  overwriteOutput: true
  pinotFSSpecs:
    - scheme: file
      className: org.apache.pinot.spi.filesystem.LocalPinotFS
  recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    configs:
      multiValueDelimiter: '\$'
  tableSpec:
    tableName: 'test2_candidate'
    schemaURI: ''
    tableConfigURI: ''
  pinotClusterSpecs:
    - controllerURI: ''
  ```
  Basically it overrides the
  ```
    configs:
      multiValueDelimiter: '\$'
  ```
  part to change the multiValueDelimiter to some other character. This may not always work (if some strings contain the $ character), but basically you should figure out the correct multiValueDelimiter for your data and use that in the ingestion spec. Or change the ingestion format from CSV to something more robust like JSON
  @saurabhd336: ^@kharekartik for more
  @sowmya.gowda: Thank you @saurabhd336 for the quick solution. It helped me a lot!!
@piercarlo.paltro: @piercarlo.paltro has joined the channel
@luisfernandez: hello my friends, my team has been trying to ingest data using the job spec for some weeks now, and it has been quite challenging. We are trying to ingest around 500GB of data, which is 2 years of data for our system, and we are using apache pinot `0.10.0`. We ran into this issue: so we had to create a script to do the imports daily. However, for some reason the pinot servers are exhausting memory (32GB), and before running the job they are mostly at half capacity. What are some of the reasons that our pinot servers would run out of memory from these ingestion jobs? Also, we are using the standalone job and we change the input directory in our script every time the daily run finishes. Would appreciate any help!
  @ken: Can’t you use the `pushFileNamePattern` support to build a segment name that’s composed of the previous directory name and the file name? So you could create something like `2009-movies` as the final name.
  @luisfernandez: oh i have to check that out
  @luisfernandez: another question that i had: how do you tell the script to output the logs somewhere, just so that i can run it as a background task?
  @luisfernandez: do you know?
  @ken: Are you talking about the script that runs the admin tool? If so, then it’s the usual Linux command line thing of adding `>>logfile.txt 2>&1`, see
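  For the background-task part, something along these lines (the log path and use of `nohup` are just one option; the command itself is the one shown further down this thread):
  ```bash
  # run the ingestion job in the background, appending stdout/stderr to a log file
  nohup /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile /opt/pinot/migration/job.yaml >> /opt/pinot/ingestion-job.log 2>&1 &
  ```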
  @luisfernandez: right but that only logs this:
  ```
  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/opt/pinot/lib/pinot-all-0.10.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-environment/pinot-azure/pinot-azure-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-yammer/pinot-yammer-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/opt/pinot/plugins/pinot-metrics/pinot-dropwizard/pinot-dropwizard-0.10.0-shaded.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See for an explanation.
  SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
  WARNING: An illegal reflective access operation has occurred
  WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/opt/pinot/lib/pinot-all-0.10.0-jar-with-dependencies.jar) to method java.lang.Object.finalize()
  WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass
  WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
  WARNING: All illegal access operations will be denied in a future release
  ```
  @luisfernandez: i’m currently running it like this:
  @luisfernandez: ```JAVA_OPTS='-Xms1G -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=/opt/pinot/gc-pinot-controller.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent-0.12.0.jar=7007:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml' /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /opt/pinot/migration/job.yaml```
  @luisfernandez: (this is for one day worth of data)
  @ken: Don’t you also wind up with logs in the `logs/` subdir inside of your `/opt/pinot/` directory?
  @ken: e.g. `pinot-all.log`?
  @luisfernandez: i do have those logs, but i guess my question is how i would differentiate what's logged by what
  @ken: The minimal stdout/stderr logging output is what I often see when slf4j finds multiple bindings. I would just focus on what’s in the logs/ subdir.
  @ken: I made a run at fixing up Pinot logging so you wouldn’t get the issue of multiple bindings, but it’s a giant hairball.
  @luisfernandez: so in the logs/subdir i see the logs for the controller itself and i guess i would see for the job too?
  @ken: In a normal configuration, each process (server, broker, controller) has its own log file(s). So in that case, what gets logged when you run the admin app should just be what it’s logging as part of your request. Note that if you’re using Hadoop or Spark to run a segment generation job, then those systems will have their own logging infrastructure as well.
  @luisfernandez: I'm using the standalone mode, thank you. Now we've got better logging at least
  @luisfernandez: it has been a little harder to get this import process in place
  @luisfernandez: we have year/month/day/severalfilesperday.parquet; because of the bug we are doing the imports daily instead
  @luisfernandez: and it takes us days to do these imports
  @ken: If you do a metadata push it should be pretty fast. We load about 1100 segments from HDFS via this approach in a few hours. This assumes segments have been already built and stored in HDFS, which we do via a Hadoop job that takes about an hour or so.
  @luisfernandez: `SegmentCreationAndMetadataPush` this one right?
  @luisfernandez: we interface with GCS
  @luisfernandez: and we are just doing standalone
  @ken: Just `SegmentMetadataPush` for us, since we create the segments using a scalable Hadoop map-reduce job.
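  For reference, a metadata-push-only spec can look roughly like this (a sketch only: the GCS URIs, bucket, and controller address are placeholders, and it assumes the segment tars already sit in deep store):
  ```yaml
  executionFrameworkSpec:
    name: 'standalone'
    segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
  jobType: SegmentMetadataPush
  inputDirURI: 'gs://my-bucket/pinot-segments/mytable/'   # placeholder deep-store path
  includeFileNamePattern: 'glob:**/*.tar.gz'
  pinotFSSpecs:
    - scheme: gs
      className: org.apache.pinot.plugin.filesystem.GcsPinotFS
      configs:
        projectId: 'my-gcp-project'                       # placeholder
        gcpKey: '/path/to/service-account.json'           # placeholder
  tableSpec:
    tableName: 'mytable'
  pinotClusterSpecs:
    - controllerURI: 'http://pinot-controller:9000'       # placeholder
  ```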
  @luisfernandez: we have a spark process that grabs the data from bigquery and puts it in gcs
  @luisfernandez: and then we use the standalone job to look at the gcs buckets and create segments and do metadata push
  @ken: So you can use a Spark job to also create the segments from the text files you extract from BigQuery.
  @luisfernandez: which that would be one of these guides right?
  @ken: That is scalable and can be much, much faster than trying to do it in a single process via a standalone job
  @ken: Yes, that’s the guide. And yes, you can use this to ingest text, parquet, or avro files.
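  For what it's worth, the Spark flavour mostly just swaps the `executionFrameworkSpec` section of the job spec; a sketch (runner class names are from the pinot-batch-ingestion-spark plugin, worth double-checking against your 0.10.0 jars, and the staging dir is a placeholder):
  ```yaml
  executionFrameworkSpec:
    name: 'spark'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
    segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
    extraConfigs:
      stagingDir: 'gs://my-bucket/pinot-staging/'   # placeholder
  jobType: SegmentCreationAndMetadataPush
  # the rest of the spec (inputDirURI, pinotFSSpecs for GCS, recordReaderSpec for parquet, ...) stays as in the standalone version
  ```
  The job is then submitted via `spark-submit` with the Pinot plugin jars on the classpath, rather than through `pinot-admin.sh`.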
  @luisfernandez: wouldn’t i run into the issue with the version problem that we have with pinot 0.10.0?
  @ken: Are you talking about `pinot servers are exhausting memory (32gbs) and before running the job they are mostly at half capacity what are some of the reasons that our pinot servers would ran out of memory from these ingestion jobs`?
  @luisfernandez: oh nonono, i’m talking about running this with spark instead of the standalone job, which is what we are doing, i also don’t know why that happened ^
  @luisfernandez: we gave the machines more memory but i feel like something else is the root cause
  @ken: In your `tableIndexConfig` make sure you set `"createInvertedIndexDuringSegmentGeneration": true,`
  @ken: This is in the table spec (Json file)
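  i.e. something along these lines in the table config JSON (a fragment only; the column names are made up, and `invertedIndexColumns` should list whatever you actually filter on):
  ```json
  "tableIndexConfig": {
    "createInvertedIndexDuringSegmentGeneration": true,
    "invertedIndexColumns": ["userId", "productId"],
    "loadMode": "MMAP"
  }
  ```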
  @luisfernandez: let me check what it’s set at
  @luisfernandez: oofff what happens if it’s `false`?
  @ken: As per , if it’s false (which is the default) then the indexes are created on the servers when segments are loaded, which can be both a CPU and memory hog
  @luisfernandez: is it safe to change on an existing table?
  @ken: I believe so, yes - it should only impact the segment generation job, not any segments that have been already deployed
  @ken: Generating the segment with the inverted index makes the segment bigger, but if you’re deploying using metadata push that shouldn’t matter much. Note though that currently metadata push requires each segment be downloaded to the machine running the standalone job, so it can be untarred to extract metadata. So you want a fast connection from that server and your deep store.
@luisfernandez: another question kinda related to the above: we are currently running on GKE, and our deep storage is configured with GCS. We have liveness and readiness probes configured on these machines. I think that when the server starts it tries to pull the available data from GCS, and I think this may take longer as more data gets ingested. How do you all manage this? We had 10 min configured for all the data to get onto the server, but now that there is more data on the machines it seems like we need even more wait time for the data to be ready. Any suggestions?
  @mayanks: Every restart should not require a pull from the deep store if you are using EBS
  @luisfernandez: right
  @luisfernandez: i got confused
  @luisfernandez: so right now the issue is that our health/readiness check is not passing within those 10 min
  @luisfernandez: and pod gets restarted
  @luisfernandez: `"message": "null:\n64370 segments [….] unavailable, errorCode: 305` is the error we see in the brokers
  @luisfernandez: and those are all the segments available
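  One way people work around the slow startup on GKE (a sketch, not taken verbatim from the Pinot helm chart; the port/path assume the server admin API on 8097 and the timings are illustrative) is a `startupProbe` with a generous `failureThreshold`, so the liveness probe only kicks in once segment loading has finished:
  ```yaml
  # pinot-server container, illustrative numbers
  startupProbe:
    httpGet:
      path: /health
      port: 8097            # server admin API port (adjust if overridden)
    periodSeconds: 30
    failureThreshold: 120   # allows up to ~60 minutes of segment loading
  livenessProbe:
    httpGet:
      path: /health
      port: 8097
    periodSeconds: 30
    failureThreshold: 5
  ```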
@alex.gartner: @alex.gartner has joined the channel
@madison.s204: @madison.s204 has joined the channel
@wadodkar: @wadodkar has joined the channel
@kevin.kamel: @kevin.kamel has joined the channel
@carolyn: @carolyn has joined the channel

#getting-started


@jacob.branch: @jacob.branch has joined the channel
@piercarlo.paltro: @piercarlo.paltro has joined the channel
@alex.gartner: @alex.gartner has joined the channel
@madison.s204: @madison.s204 has joined the channel
@wadodkar: @wadodkar has joined the channel
@kevin.kamel: @kevin.kamel has joined the channel
@carolyn: @carolyn has joined the channel

#releases


@wadodkar: @wadodkar has joined the channel

#introductions


@jacob.branch: @jacob.branch has joined the channel
@piercarlo.paltro: @piercarlo.paltro has joined the channel
@alex.gartner: @alex.gartner has joined the channel
@madison.s204: @madison.s204 has joined the channel
@wadodkar: @wadodkar has joined the channel
@kevin.kamel: @kevin.kamel has joined the channel
@carolyn: @carolyn has joined the channel