Re: How to use delta storage format
@Mike I'd appreciate some feedback. Paul Parker schrieb am Di., 31. März 2020, 17:07: > Let me share answers from the delta community: > > Answer to Q1: > Structured streaming queries can do commits every minute, even every 20-30 > seconds. This definitely creates small files. But that is okay, because it > is expected that people will periodically compact the files. The same > timing should work fine for Nifi and any other streaming engine. It does > create 1000-ish versions per day, but that is okay. > > Answer to Q2: > That is up to the sink implementation. Both are okay. In fact, it probably > can be combination of both, as long as we dont commit every second. That > may not scale well. > > Answer to Q3: > You need a primary node which is responsible for managing the Delta table. > That note would be responsible for reading the log, parsing it, updating > it, etc. Unfortunately, we have no good non-spark way to read the log, and > much less write to the log. > there is an experimental uber jar that tries to package delta + spark > together into a single jar ... using which you could read the log. its > available here - https://github.com/delta-io/connectors/ > > What it does is it basically runs a local-mode (in-process) spark to read > the log. This what we are using to build a hive connector, that will allow > hive to read delta files. Now the goal of that was to only read. your goal > is to write which is definitely more complicated, because for that you have > to do much more. Now that uber jar has all the necessary code to do the > writing. ... which you could use. but there has to be a driver node which > has to collect all the parquet files written by other nodes and atomically > commit those parquet files to the Delta log to make them visible to all > readers. > > Does the first orientation help? > > > Mike Thomsen schrieb am So., 29. März 2020, > 20:29: > >> It looks like a lot of their connectors rely on external management by >> Spark. That is true of Hive, and also of Athena/Presto unless I misread the >> documentation. Read some of the fine print near the bottom of this to get >> an idea of what I mean: >> >> https://github.com/delta-io/connectors >> >> Hypothetically, we could build a connector for NiFi, but there are some >> things about the design of Delta Lake that I am not sure about based on my >> own research and what not. Roughly, they are the following: >> >> 1. What is a good timing strategy for doing commits to Delta Lake? >> 2. What would trigger a commit in the first place? >> 3. Is there a good way to trigger commits that would work within the >> masterless cluster design of clustered NiFi instead of requiring a special >> "primary node only" processor for executing commits? >> >> Based on my experimentation, one of the biggest questions around the >> first point is do you really want potentially thousands or tens of >> thousands of time shift events to be created throughout the day? A record >> processor that reads in a ton of small record sets and injects that into >> the Delta Lake would create a ton of these checkpoints, and they'd be >> largely meaningless to people trying to make sense of them for the purpose >> of going back and forth in time between versions. >> >> Do we trigger a commit per record set or set a timer? >> >> Most of us on the NiFi dev side have no real experience here. It would be >> helpful for us to get some ideas to form use cases from the community >> because there are some big gaps on how we'd even start to shape the >> requirements. >> >> On Sun, Mar 29, 2020 at 1:28 PM Paul Parker >> wrote: >> >>> Hi Mike, >>> your alternate suggestion sounds good. But how does it work if I want to >>> keep this running continuously? In other words, the delta table should be >>> continuously updated. Finally, this is one of the biggest advantages of >>> Delta: you can ingest batch and streaming data into one table. >>> >>> I also think about workarounds (Use Athena, Presto or Redshift with >>> Nifi): >>> "Here is the list of integrations that enable you to access Delta >>> tables from external data processing engines. >>> >>>- Presto and Athena to Delta Lake Integration >>><https://docs.delta.io/latest/presto-integration.html> >>>- Redshift Spectrum to Delta Lake Integration >>><https://docs.delta.io/latest/redshift-spectrum-integration.html> >>>- Snowflake to Delta Lake Integration >>><https://docs.delta.io/latest/
Re: How to use delta storage format
Let me share answers from the delta community: Answer to Q1: Structured streaming queries can do commits every minute, even every 20-30 seconds. This definitely creates small files. But that is okay, because it is expected that people will periodically compact the files. The same timing should work fine for Nifi and any other streaming engine. It does create 1000-ish versions per day, but that is okay. Answer to Q2: That is up to the sink implementation. Both are okay. In fact, it probably can be combination of both, as long as we dont commit every second. That may not scale well. Answer to Q3: You need a primary node which is responsible for managing the Delta table. That note would be responsible for reading the log, parsing it, updating it, etc. Unfortunately, we have no good non-spark way to read the log, and much less write to the log. there is an experimental uber jar that tries to package delta + spark together into a single jar ... using which you could read the log. its available here - https://github.com/delta-io/connectors/ What it does is it basically runs a local-mode (in-process) spark to read the log. This what we are using to build a hive connector, that will allow hive to read delta files. Now the goal of that was to only read. your goal is to write which is definitely more complicated, because for that you have to do much more. Now that uber jar has all the necessary code to do the writing. ... which you could use. but there has to be a driver node which has to collect all the parquet files written by other nodes and atomically commit those parquet files to the Delta log to make them visible to all readers. Does the first orientation help? Mike Thomsen schrieb am So., 29. März 2020, 20:29: > It looks like a lot of their connectors rely on external management by > Spark. That is true of Hive, and also of Athena/Presto unless I misread the > documentation. Read some of the fine print near the bottom of this to get > an idea of what I mean: > > https://github.com/delta-io/connectors > > Hypothetically, we could build a connector for NiFi, but there are some > things about the design of Delta Lake that I am not sure about based on my > own research and what not. Roughly, they are the following: > > 1. What is a good timing strategy for doing commits to Delta Lake? > 2. What would trigger a commit in the first place? > 3. Is there a good way to trigger commits that would work within the > masterless cluster design of clustered NiFi instead of requiring a special > "primary node only" processor for executing commits? > > Based on my experimentation, one of the biggest questions around the first > point is do you really want potentially thousands or tens of thousands of > time shift events to be created throughout the day? A record processor that > reads in a ton of small record sets and injects that into the Delta Lake > would create a ton of these checkpoints, and they'd be largely meaningless > to people trying to make sense of them for the purpose of going back and > forth in time between versions. > > Do we trigger a commit per record set or set a timer? > > Most of us on the NiFi dev side have no real experience here. It would be > helpful for us to get some ideas to form use cases from the community > because there are some big gaps on how we'd even start to shape the > requirements. > > On Sun, Mar 29, 2020 at 1:28 PM Paul Parker wrote: > >> Hi Mike, >> your alternate suggestion sounds good. But how does it work if I want to >> keep this running continuously? In other words, the delta table should be >> continuously updated. Finally, this is one of the biggest advantages of >> Delta: you can ingest batch and streaming data into one table. >> >> I also think about workarounds (Use Athena, Presto or Redshift with Nifi): >> "Here is the list of integrations that enable you to access Delta tables >> from external data processing engines. >> >>- Presto and Athena to Delta Lake Integration >><https://docs.delta.io/latest/presto-integration.html> >>- Redshift Spectrum to Delta Lake Integration >><https://docs.delta.io/latest/redshift-spectrum-integration.html> >>- Snowflake to Delta Lake Integration >><https://docs.delta.io/latest/snowflake-integration.html> >>- Apache Hive to Delta Lake Integration >><https://docs.delta.io/latest/hive-integration.html>" >> >> Source: >> https://docs.delta.io/latest/integrations.html >> >> I am looking forward to further ideas from the community. >> >> Mike Thomsen schrieb am So., 29. März 2020, >> 17:23: >> >>> I think there is a connector for Hive that works with Delta. You could >>
Re: How to use delta storage format
Hi Mike, your alternate suggestion sounds good. But how does it work if I want to keep this running continuously? In other words, the delta table should be continuously updated. Finally, this is one of the biggest advantages of Delta: you can ingest batch and streaming data into one table. I also think about workarounds (Use Athena, Presto or Redshift with Nifi): "Here is the list of integrations that enable you to access Delta tables from external data processing engines. - Presto and Athena to Delta Lake Integration <https://docs.delta.io/latest/presto-integration.html> - Redshift Spectrum to Delta Lake Integration <https://docs.delta.io/latest/redshift-spectrum-integration.html> - Snowflake to Delta Lake Integration <https://docs.delta.io/latest/snowflake-integration.html> - Apache Hive to Delta Lake Integration <https://docs.delta.io/latest/hive-integration.html>" Source: https://docs.delta.io/latest/integrations.html I am looking forward to further ideas from the community. Mike Thomsen schrieb am So., 29. März 2020, 17:23: > I think there is a connector for Hive that works with Delta. You could try > setting up Hive to work with Delta and then using NiFi to feed Hive. > Alternatively, you can simply convert the results of the JDBC query into a > Parquet file, push to a desired location and run a Spark job to convert > from Parquet to Delta (that should be pretty fast because Delta is > basically a fork of Parquet). > > On Fri, Mar 27, 2020 at 2:13 PM Paul Parker wrote: > >> We read data via JDBC from a database and want to save the results as a >> delta table and then read them again. How can I realize this with Nifi and >> Hive or Glue Metastore? >> >
How to use delta storage format
We read data via JDBC from a database and want to save the results as a delta table and then read them again. How can I realize this with Nifi and Hive or Glue Metastore?
Re: How write multiple prepends/appens in a single EL expression?
You can go without append and prepend. sftp://${sftp.remote.host}/${path}/${filename} Give it a try. Eric Chaves schrieb am Do., 19. März 2020, 21:05: > Hi folks, > > Sorry for another newbie question. =) I'm trying to write a single EL > expression to perform multiples string manipulation but I keep hitting an > invalid EL expression. > > What would be the correct way to write this expression: > ${sftp.remote.host:prepend("sftp://";):append:("/"):append(${path}):append("/"):append(${filename})} > ? > > Thanks in advance, > > Eric >
Re: Metrics via Prometheus
It would be great if you could share your story as a blog post. Eric Ladner schrieb am Mi., 4. März 2020, 19:45: > Thank you so much for your guidance. I was able to get data flowing into > Prometheus fairly easily once all the pieces were understood. > > Now, I just need to dig into Prometheus queries and make some Grafana > dashboards. > > On Tue, Mar 3, 2020 at 2:54 PM Yolanda Davis > wrote: > >> Sure not a problem! Hopefully below thoughts can help you get started: >> >> As you may know the PrometheusReportingTask is a bit different from other >> tasks in that it actually exposes an endpoint for Prometheus to scrape (vs. >> pushing data directly to Prometheus). When the task is started the >> endpoint is created on the port you designate under “/metrics”; so just >> ensure that you don’t have anything already on the port you select. If you >> want to ensure that you have a secured endpoint for Prometheus to connect, >> be sure to use a SSL Context Service (a controller service that will allow >> the reporting task to use the appropriate key/trust stores for TLS). Also >> you'll want to consider the levels at which you are reporting (Root Group, >> Process Group or All Components), especially in terms of the amount of data >> you are looking to send back. Jvm metrics can be sent as well flow >> specific metrics. Finally consider how often metrics should be refreshed by >> adjusting the Scheduling Strategy in the settings tab for the task. >> >> When starting the task you should be able to go directly to the endpoint >> (without Prometheus) to confirm it’s output (e.g. >> http://locahost:9092/metrics ). You should see a format similar to what >> Prometheus supports for it’s scraping jobs (see example >> https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-example >> ) >> >> On the Prometheus side you’ll want to follow their instructions on how to >> setup a scrape configuration that will point to the newly created metrics >> endpoint . I’d recommend checking out the first steps for help ( >> https://prometheus.io/docs/introduction/first_steps/#configuring-prometheus) >> and then when you need to provide more advanced settings take a look here >> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. >> The key is you’ll want to define a new scrape job that looks at the NiFi >> endpoint for scraping. To start you may want to refer to the cluster >> directly but later add the security credentials or use another method for >> discovering the endpoint. >> >> Once these configurations are in place, and Prometheus is started (or >> restarted) after a few seconds you should begin to see metrics landing when >> querying in Grafana. >> >> I hope this helps! Please let me know if you have any further questions. >> >> -yolanda >> >> On Tue, Mar 3, 2020 at 2:10 PM Eric Ladner wrote: >> >>> Yes, exactly! Reporting Task -> Prometheus -> Grafana for keeping an >>> eye on things running in NiFi. >>> >>> If you have any hints/tips on getting things working, I'd be grateful. >>> >>> On Tue, Mar 3, 2020 at 12:35 PM Yolanda Davis >>> wrote: >>> Hi Eric, Were you looking to use the Prometheus Reporting Task for making metrics available for Prometheus scraping? I don't believe any documentation outside of what is in NiFi exists just yet, but I'm happy to help answer questions you may have (I've used this task recently). -yolanda On Tue, Mar 3, 2020 at 10:51 AM Eric Ladner wrote: > Is there a guide to setting up Nifi and Prometheus anywhere? The nar > docs are a little vague. > > Thanks, > > Eric Ladner > -- -- yolanda.m.da...@gmail.com @YolandaMDavis >>> >>> -- >>> Eric Ladner >>> >> >> >> -- >> -- >> yolanda.m.da...@gmail.com >> @YolandaMDavis >> >> > > -- > Eric Ladner >