#general


@buntyk91: @buntyk91 has joined the channel
@kishorenaidu712: Hi all, I have a question regarding historical data. How will Pinot handle data that has existed for, say, quite a few years? Will there be any change in performance metrics when such historical data is queried after a very long time?
  @diogo.baeder: To my knowledge, it has no performance impact, because Pinot first finds which segment(s) your data is in, according to the time range in the query, and then fetches the data from them.
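  For reference, the time-based segment pruning described above kicks in when the query filters on the table's time column — a minimal sketch, assuming a hypothetical table `events` with a millisecond time column `eventTimeMillis`:
  ```
  -- Only segments whose time range overlaps the filter are scanned;
  -- older segments are pruned without being touched.
  SELECT COUNT(*)
  FROM events
  WHERE eventTimeMillis BETWEEN 1546300800000 AND 1577836799000
  ```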
  @kishorenaidu712: So will segments continue to exist on the servers, or will they be paged out after a certain time? And how will performance be impacted if a segment is not found on a server and has to be fetched from the segment store?
  @diogo.baeder: It depends on whether you configure data retention or not. If you don't, the default behavior is to keep the data forever, in which case you'll always have the segments available - provided that you configured the deep store properly, of course.
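  For reference, retention is set per table in the `segmentsConfig` section of the table config — a minimal sketch, with the 365-day window as an illustrative value:
  ```
  {
    "segmentsConfig": {
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "365"
    }
  }
  ```
  Omitting these fields keeps segments forever, which is the default behavior mentioned above.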
  @kishorenaidu712: So will there be any upper bound for storage in servers?
  @diogo.baeder: It depends on how you set up the deep store. I have my project's Pinot cluster deep store set up to use S3, which basically gives me theoretically infinite storage. I did bump into upper bounds in the past, though, when I didn't have the deep store set up yet, because I ended up using all of the disk space.
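  As an illustration of the S3 deep-store setup mentioned here, the controller is typically pointed at an S3 data dir via the S3PinotFS plugin — a sketch with placeholder bucket and region values:
  ```
  # controller.conf (bucket and region are placeholders)
  controller.data.dir=s3://my-pinot-deepstore/pinot-data
  pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
  pinot.controller.storage.factory.s3.region=us-east-1
  pinot.controller.segment.fetcher.protocols=file,http,s3
  pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
  ```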
  @kishorenaidu712: Got it, thank you
  @diogo.baeder: No problem :slightly_smiling_face:
  @mayanks: @kishorenaidu712 what’s the data size we are talking about here?
  @kishorenaidu712: The data is about 45-50GB
  @mayanks: Yeah, that is quite small to be thinking about a tiered-storage-like solution
  @kishorenaidu712: Yeah, I realised that after I went through the blog on the tiered storage solution.
  @mayanks: Cool
@munendra.chevuru: @munendra.chevuru has joined the channel
@nisheetkmr: Hi team, I am trying to bootstrap a realtime upsert-enabled table. I have around 2-3 years of data that I want to upload to this realtime table. I was trying to use segment generation with Spark to create segments and then upload those segments to the realtime table. But the initial segment creation job itself fails, as it tries to search for an OFFLINE table in the table config. I couldn't find any better guide/documentation on how to perform this. I was just going through the changes in this PR and trying accordingly.
  @mayanks: @jackie.jxt @tingchen ^^
  @jackie.jxt: @tingchen Could you please share the steps of generating segments for upsert table? Do you use pinot spark job or some custom job?
  @nisheetkmr: I am using spark pinot job. I have some parquet files present in s3 which I am trying to load. I have attached the ingestion yaml file
  @tingchen: cc @yupeng I think yupeng used Pinot Flink connector?
  @yupeng: yes, you need flink for this
  @yupeng:
  @yupeng: take a look at this guide
  @mayanks: @jackie.jxt @tingchen If this is a supported flow (as per ), could we also support spark based push?
@karinwolok1: Hi all! StarTree just announced the FIRST EVER :mega: ** :mega: *We are looking for speakers!* :eyes: So please submit your talks! :speaking_head_in_silhouette: It's happening in San Francisco on August 16th and 17th. You can register now for early bird special pricing. If you can't join us in person, you can also register for access to all the on-demand videos (and the live streamed keynote!) for free! Looking forward to seeing you all!
@karinwolok1:

#random


@buntyk91: @buntyk91 has joined the channel
@munendra.chevuru: @munendra.chevuru has joined the channel

#troubleshooting


@buntyk91: @buntyk91 has joined the channel
@alihaydar.atil: Hello everyone, I know it's a bit of a technical question, but would changing the MAX_DOC_PER_CALL variable from 10000 to 100000 in the DocIdSetPlanNode class cause any problem you could foresee? I am trying to write a custom function which basically does smoothing on a numeric column in order to remove unnecessary (for me) records. I have realized that the accessed record blocks are limited by the MAX_DOC_PER_CALL variable. I am asking this because my smoothing function performs better with more data. My query almost always has a "limit 1000000" option, and bandwidth is an important resource for me. I would appreciate it if you could share your thoughts with me :pray:
  @kharekartik: Hi, do you mean you are not able to get more than 10K records in the result even with a higher limit? I am not sure running such a large scan query is a good idea. Also, are you using the APIs or one of our connectors to run the query?
  @kharekartik: Also, can you provide a sample query so that we can understand the use case better?
@rsivakumar: Hello Pinot team, I'm learning to work with Pinot and have hit a couple of edge cases that I couldn't find the answers to in the docs. I'll post them here as two separate threads. *There's some weirdness around using _id as a column name.* I'm trying to ingest data into Pinot from an OLTP data store, and I wanted the primary key to be a column named "_id". During ingestion, I found that our 32-digit hexadecimal string is converted into a much longer string if the column is named "_id". Renaming the column to "id" works just fine. Is `_id` a reserved name in Pinot? Will attach screenshots with both *_id* and *id* as column names in this thread.
  @rsivakumar:
  @kharekartik: Hi can you also add schema and table config
  @rsivakumar: Here you go @kharekartik. I’ve removed all the other columns and redacted the broker url. I don’t think this matters because the only thing that changed between the transform that worked and the one that didn’t was the name of the `id` field.
  @rsivakumar:
  @kharekartik: thanks
  @kharekartik: `JSONPATHSTRING(fullDocument, '$._id.$oid')` It seems like the value here is not `_id` itself (which is a whole object) but the `oid` field inside the `_id` object. That should be the difference between the two values.
  @rsivakumar: I see what’s going on here. The pre-ingested event already has an `_id` attribute, so it looks like this is being used directly instead of the one described in my transformation. Is there any way to force the ingestion config to use my transformation when there’s a conflict with the keys in the event source? Here’s a quick look at how the input to my pinot ingestion looks like. ```{ "_id": { "_id": { "$oid": "6246a32a8b5a712b500f1eec" }, "copyingData": true }, "operationType": "insert", "documentKey": { "_id": { "$oid": "6246a32a8b5a712b500f1eec" } }, "fullDocument": { "_id": { "$oid": "6246a32a8b5a712b500f1eec" }, "isDeleted": false, "createdAt": { "$date": { "$numberLong": "1648796458544" } }, "updatedAt": { "$date": { "$numberLong": "1648796459023" } } } }```
  @kharekartik: I don't understand what you mean by conflict
  @rsivakumar: The input event that's being ingested already has an `_id` field. ```"_id": { "_id": { "$oid": "6246a32a8b5a712b500f1eec" }``` When I add a transformation like the following, I imagine that the `_id` field is picked up as-is from the event instead of going through the transformation. ```{ "columnName": "_id", "transformFunction": "JSONPATHSTRING(fullDocument, '$._id.$oid')" }``` This is the only way I can explain what's going on. If I set `columnName` to `id` instead of `_id` in the transformation, then the transformation works as expected.
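  For anyone hitting the same thing, the workaround described above can be expressed in the table config's ingestion transforms — a minimal sketch where the destination column is named `id` so it does not collide with the source event's `_id` key (the column name is just an example):
  ```
  {
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "id",
          "transformFunction": "JSONPATHSTRING(fullDocument, '$._id.$oid')"
        }
      ]
    }
  }
  ```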
@rsivakumar: Here's another question. My queries to Pinot tables are failing with either error code *305 (Segment unavailable)* or error code *410 (BrokerResourceMissingError)*. So far, my debugging has uncovered that not all queries are failing: roughly 50% of the segments are healthy and consuming, whereas the other 50% are in a bad state. For the segments in a bad state, I'm unable to see any mapping between the segments and the servers that store the segment data through the UI. I've tried deleting and recreating the failing tables, as well as resetting/refreshing/reloading the segments, but none of this seems to work. I'm trying to understand if there's anything we could have done that affected the configurations within the segments. Has anyone encountered this before, and what are some ways to gracefully recover from such issues?
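When segments show no server mapping like this, comparing the table's ideal state with its external view on the controller usually shows where the assignment is broken — a sketch using the standard controller REST endpoints, with the controller host and table name as placeholders:
```
# Expected segment-to-server assignment
curl http://<controller-host>:9000/tables/myTable/idealstate
# What the servers actually report (look for segments in ERROR/OFFLINE state)
curl http://<controller-host>:9000/tables/myTable/externalview
```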
@munendra.chevuru: @munendra.chevuru has joined the channel
@pavel.stejskal650: Hello! Do you have any working tutorial for Spark batch loading with the latest version of Pinot, after the migration of jars to plugins-external? I cannot make it work at all.
  @kharekartik: Hi, can you describe what error you are getting? Also, what Spark and Hadoop versions are you using?
  @kharekartik: Also, by latest do you mean the master build or 0.10.0?
  @pavel.stejskal650: Hi, I'm trying to follow the official Pinot docs (different from ), which is a pain as well :slightly_smiling_face: Spark 2.4.8 with Hadoop 2.7 (official build), running in local mode. Command:
  ```
  export SPARK_HOME=/spark
  export PINOT_ROOT_DIR=/pinot
  export PINOT_VERSION=0.10.0
  export PINOT_DISTRIBUTION_DIR=$PINOT_ROOT_DIR
  cd ${PINOT_DISTRIBUTION_DIR}
  ${SPARK_HOME}/bin/spark-submit \
    --verbose \
    --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
    --master "local[2]" \
    --deploy-mode client \
    --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins-external" \
    --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
    local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
    -jobSpecFile /app/job-spec-spark.yaml
  ```
  I copied pinot-batch-ingestion-spark-0.10.0-shaded.jar to spark/jars and I am getting this error:
  ```
  Exception in thread "main" java.lang.VerifyError: Bad type on operand stack
  Exception Details:
    Location:
      org/apache/spark/metrics/sink/MetricsServlet.<init>(Ljava/util/Properties;Lcom/codahale/metrics/MetricRegistry;Lorg/apache/spark/SecurityManager;)V @116: invokevirtual
    Reason:
      Type 'com/codahale/metrics/json/MetricsModule' (current frame, stack[2]) is not assignable to 'shaded/com/fasterxml/jackson/databind/Module'
    Current Frame:
      bci: @116
      flags: { }
      locals: { 'org/apache/spark/metrics/sink/MetricsServlet', 'java/util/Properties', 'com/codahale/metrics/MetricRegistry', 'org/apache/spark/SecurityManager' }
      stack: { 'org/apache/spark/metrics/sink/MetricsServlet', 'shaded/com/fasterxml/jackson/databind/ObjectMapper', 'com/codahale/metrics/json/MetricsModule' }
    Bytecode:
      0000000: 2a2b b500 2a2a 2cb5 002f 2a2d b500 5c2a
      0000010: b700 7e2a 1280 b500 322a 1282 b500 342a
      0000020: 03b5 0037 2a2b 2ab6 0084 b600 8ab5 0039
      0000030: 2ab2 008f 2b2a b600 91b6 008a b600 95bb
      0000040: 0014 592a b700 96b6 009c bb00 1659 2ab7
      0000050: 009d b600 a1b8 00a7 b500 3b2a bb00 7159
      0000060: b700 a8bb 00aa 59b2 00b0 b200 b32a b600
      0000070: b5b7 00b8 b600 bcb5 003e b1
  ```
  @pavel.stejskal650: If I don't copy pinot-batch-ingestion-spark-0.10.0-shaded.jar to spark/jars, the class is not found…
  @kharekartik: can you also tell the java version you are using?
  @pavel.stejskal650: And pinot is 0.10.0 official build. OpenJDK 11
  @kharekartik: got it
  @kharekartik: Actually, we recently made the Spark dependency 'provided' in our master branch. Is it possible for you to use that spark-shaded 0.11-SNAPSHOT?
  @kharekartik: I'll move this conversation to DM; we can sort it out there.
  @pavel.stejskal650: This issue is related to the 0.10.0 version; the latest version is working fine with the following command:
  ```
  export SPARK_HOME=/spark
  export PINOT_ROOT_DIR=/pinot
  export PINOT_VERSION=0.11.0-SNAPSHOT
  export PINOT_DISTRIBUTION_DIR=$PINOT_ROOT_DIR
  cd ${PINOT_DISTRIBUTION_DIR}
  ${SPARK_HOME}/bin/spark-submit \
    --verbose \
    --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
    --master "local[2]" \
    --deploy-mode client \
    --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins-external" \
    --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:/pinot/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.11.0-SNAPSHOT-shaded.jar" \
    local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
    -jobSpecFile /app/job-spec-spark.yaml
  ```
  *Note*: this command is different from ! Should be fixed
  @kharekartik: Thanks for pointing out! Updated the documentation.
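  For readers following along, the `-jobSpecFile` referenced in the commands above is a standard batch ingestion job spec — a minimal sketch for Spark-based Parquet ingestion, with the bucket paths, table name, and controller URI as placeholders:
  ```
  executionFrameworkSpec:
    name: 'spark'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  jobType: SegmentCreationAndTarPush
  inputDirURI: 's3://my-bucket/input/'
  includeFileNamePattern: 'glob:**/*.parquet'
  outputDirURI: 's3://my-bucket/segments/'
  overwriteOutput: true
  pinotFSSpecs:
    - scheme: s3
      className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
  recordReaderSpec:
    dataFormat: 'parquet'
    className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
  tableSpec:
    tableName: 'myTable'
    schemaURI: 'http://<controller-host>:9000/tables/myTable/schema'
    tableConfigURI: 'http://<controller-host>:9000/tables/myTable'
  pinotClusterSpecs:
    - controllerURI: 'http://<controller-host>:9000'
  ```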

#pinot-dev


@kharekartik: Can someone help in reviewing this PR? It has been pending for the last few days.
  @walterddr: Left a few comments, please kindly take a look.

#thirdeye-pinot


@mathur.amol: @mathur.amol has joined the channel

#getting-started


@buntyk91: @buntyk91 has joined the channel
@munendra.chevuru: @munendra.chevuru has joined the channel

#flink-pinot-connector


@ysuo: @ysuo has joined the channel
@ysuo: Hi team, is the Flink connector available now?
@ysuo: I mean flink-pinot-connector.

#introductions


@buntyk91: @buntyk91 has joined the channel
@greetbot: Good you are here @buntyk91 :relaxed:
@munendra.chevuru: @munendra.chevuru has joined the channel
@greetbot: @munendra.chevuru rivers know this: there is no hurry. We shall get there some day :national_park:
@mayanks: :wave:Hello new community members, please take a moment to introduce yourself and help us know how we can help you with your Pinot journey.