#general


@prtk.ngm: Hello all, I am following this to configure a Spark job for ingestion. Can we give the input staging directory as an HDFS directory? I don't want to use the Hadoop utility jar to ingest data.
  @ken: I assume you can use HDFS for the staging directory, since we do that for batch segment generation with Hadoop
@raj.swarnim: @raj.swarnim has joined the channel
@sheetalarun.kadam2: Hi all, I want to insert a single record into Pinot from an application. Setting up Kafka for real-time ingestion seems too complicated for very small-volume insert calls. Is there some other way?
  @g.kishore: Pinot does not support a row-level insert API as of now. You can use the batch API
  @npawar: You can try out the SegmentWriter interface. There’s no documentation yet, but this test demonstrates how to use it. In any case, creating segments with a single row isn’t the best idea
  @sheetalarun.kadam2: Thanks, I will check it out. But yes, the single-row thing is what’s troubling me. I use the table for a search box, so I don’t want to batch process
  @ken: Are you trying to use Pinot for near real time search?
  @sheetalarun.kadam2: It’s a normal search bar. It will be a regex query on one of the columns. The reason to use Pinot is that I have some dashboarding needs which require fast aggregations. Having a different database like MySQL for just one table (the search-query one) seemed like an unnecessary layer, so I am thinking of using Pinot for the search
  @ken: OK, but normally adding row-by-row (not batch) means you want near-real-time (NRT) search. As in, soon after data is available you want it to be searchable. Otherwise you could just use batch to generate segments every day (as an example).
  @sheetalarun.kadam2: Oh yes, I want it near real time. Data should be available to all as soon as the insert is done
  @g.kishore: If you are planning to use this in production and expect strong guarantees, it’s better to use Kafka
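(For reference, a regex filter of the kind described above is expressed in Pinot SQL with the `REGEXP_LIKE` function; a minimal sketch, with made-up table and column names:
```
-- Hypothetical search-box query: filter rows whose product_name matches a
-- user-typed pattern. myTable and product_name are placeholder names.
SELECT *
FROM myTable
WHERE REGEXP_LIKE(product_name, '.*shoe.*')
LIMIT 10
```
Note that a regex filter generally has to scan the column's values, so expect its latency to grow with table size.)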
@becca.silverman: @becca.silverman has joined the channel
@leb9882: @leb9882 has joined the channel
@sanipindi: @sanipindi has joined the channel

#random


@raj.swarnim: @raj.swarnim has joined the channel
@becca.silverman: @becca.silverman has joined the channel
@leb9882: @leb9882 has joined the channel
@sanipindi: @sanipindi has joined the channel

#troubleshooting


@raj.swarnim: @raj.swarnim has joined the channel
@dadelcas: Hello, I'm trying to configure Presto to query Pinot tables. The catalog seems fine; I can show the Pinot tables. However, when I issue a query I get the following error: `Query <id> failed: Cannot fetch from cache <table>`. Any hints to fix this error would be appreciated
  @mayanks: Are you able to run the Pinot query directly? Also, can you run EXPLAIN on Presto?
  @dadelcas: The query runs in Pinot; it's a simple statement: `select * from <table> limit 1`. EXPLAIN returns the same error
  @dadelcas: I'm running Pinot 0.8.0 and Presto 0.261
@becca.silverman: @becca.silverman has joined the channel
@anu110195: Is there any way to check slow queries in Pinot?
  @mayanks: You can look at the broker logs for details on whether it was one server, multiple servers, or the broker that caused the issue. You can also look at the response metadata to check how much work the query did (in terms of scanning/selecting docs, etc.).
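(As an illustration of the second point: the Pinot broker returns metadata fields such as `numDocsScanned`, `numEntriesScannedInFilter`, `numEntriesScannedPostFilter` and `timeUsedMs` alongside the results of every query, so a suspect query can be profiled directly. A sketch, with made-up table and column names:
```
-- Run the suspect query, then inspect the broker response metadata
-- (numDocsScanned, numEntriesScannedInFilter, numEntriesScannedPostFilter,
-- timeUsedMs) to see how much work the query did.
SELECT COUNT(*)
FROM myTable
WHERE status = 'ACTIVE'
```
)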
@leb9882: @leb9882 has joined the channel
@raj.swarnim: Can anyone help me understand what this error means: `Error: Could not find or load main class org.apache.pinot.thirdeye.anomaly.ThirdEyeAnomalyApplication`? And how do I fix it? I don't have much experience with Java.
  @mayanks: @pyne.suvodeep ^^
  @pyne.suvodeep: Hi @raj.swarnim, it means that Java is unable to find that class on the classpath. Can you share the steps through which you got to this error?
@sanipindi: @sanipindi has joined the channel
@qianbo.wang: Hi, I have a question about `DATETIMECONVERT`. The docs mention that it buckets the time based on the given granularity, but what is the start of the first bucket? E.g. when running this query:
```
SELECT DATETIMECONVERT(time_col, '1:MILLISECONDS:EPOCH', '1:SECONDS:EPOCH', '30:DAYS') AS new_time_col,
       COUNT(id)
FROM table
WHERE (time_col BETWEEN <epoch_second of 7/16> AND <epoch_second of 9/16>)
GROUP BY new_time_col
ORDER BY new_time_col
```
it returns 3 buckets: 7/1, 7/31 and 8/30. So I wonder how this is being calculated?
  @qianbo.wang: For a bit more context, we are trying to bucket our data into 30-day buckets to categorize its age.
  @qianbo.wang: Same result using `DATETIMECONVERT(time_col, '1:MILLISECONDS:EPOCH', '1:SECONDS:EPOCH', '720:HOURS')`
  @jackie.jxt: The start of the first bucket is Unix epoch time, and we use millis since epoch to calculate the time bucket
  @qianbo.wang: ah, I see. thanks!
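(To make the epoch-anchored bucketing concrete: a `'30:DAYS'` granularity means a bucket size of 30 × 86,400,000 = 2,592,000,000 ms, and each input value is floored to a multiple of that, counted from 1970-01-01 UTC. A sketch along the lines of the query above, with milliseconds as the output format:
```
-- Each time_col value is floored to the start of its 30-day bucket, anchored
-- at the Unix epoch: bucket_start = floor(time_col / 2592000000) * 2592000000.
SELECT DATETIMECONVERT(time_col, '1:MILLISECONDS:EPOCH',
                       '1:MILLISECONDS:EPOCH', '30:DAYS') AS bucket_start_ms,
       COUNT(*)
FROM table
GROUP BY bucket_start_ms
ORDER BY bucket_start_ms
```
This explains the 7/1, 7/31 and 8/30 buckets: they are consecutive 30-day boundaries counted from the epoch, not from the start of the queried range.)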

#thirdeye-pinot


@raj.swarnim: @raj.swarnim has joined the channel

#getting-started


@zineb.raiiss: Hello friends, I want to test the ThirdEye solution for Pinot anomaly detection, so I followed the documentation, but failed to connect to
@zineb.raiiss: Do you have any idea?
  @npawar: @pyne.suvodeep
@tiger: Any tips for debugging slow queries? I was stress testing my cluster and noticed that when I send a bunch of queries at once, the query latency goes from ~100ms to 4-5 seconds. The latency then stays relatively high for a few minutes after the stress test and then returns to ~100ms. I also noticed that sometimes a single server takes significantly longer to process a query, which increases the overall latency by a lot. That one slow server also stays consistently slow for a while, so every query is bottlenecked by that server. Thanks!
  @kulbir.nijjer: This might be a good start: