Re: How to use delta storage format

2020-04-02 Thread Paul Parker
@Mike I'd appreciate some feedback.

Paul Parker  wrote on Tue., March 31, 2020, 17:07:

> Let me share answers from the delta community:
>
> Answer to Q1:
> Structured streaming queries can commit every minute, even every 20-30
> seconds. This definitely creates small files, but that is okay, because it
> is expected that people will periodically compact the files. The same
> timing should work fine for NiFi and any other streaming engine. It does
> create 1000-ish versions per day, but that is okay.
>
> Answer to Q2:
> That is up to the sink implementation. Both are okay. In fact, it can
> probably be a combination of both, as long as we don't commit every
> second; that may not scale well.
>
> Answer to Q3:
> You need a primary node that is responsible for managing the Delta table.
> That node would be responsible for reading the log, parsing it, updating
> it, etc. Unfortunately, we have no good non-Spark way to read the log,
> much less to write to it.
> There is an experimental uber jar that packages Delta and Spark together
> into a single jar, which you could use to read the log. It is available
> here: https://github.com/delta-io/connectors/
>
> What it basically does is run a local-mode (in-process) Spark to read the
> log. This is what we are using to build a Hive connector that will allow
> Hive to read Delta files. The goal of that was only to read; your goal is
> to write, which is definitely more complicated, because you have to do
> much more. That uber jar has all the necessary code to do the writing,
> which you could use, but there has to be a driver node that collects all
> the Parquet files written by other nodes and atomically commits them to
> the Delta log to make them visible to all readers.
>
> Does this initial orientation help?
>
>
> Mike Thomsen  wrote on Sun., March 29, 2020, 20:29:
>
>> It looks like a lot of their connectors rely on external management by
>> Spark. That is true of Hive, and also of Athena/Presto unless I misread the
>> documentation. Read some of the fine print near the bottom of this to get
>> an idea of what I mean:
>>
>> https://github.com/delta-io/connectors
>>
>> Hypothetically, we could build a connector for NiFi, but there are some
>> things about the design of Delta Lake that I am not sure about based on my
>> own research and what not. Roughly, they are the following:
>>
>> 1. What is a good timing strategy for doing commits to Delta Lake?
>> 2. What would trigger a commit in the first place?
>> 3. Is there a good way to trigger commits that would work within the
>> masterless cluster design of clustered NiFi instead of requiring a special
>> "primary node only" processor for executing commits?
>>
>> Based on my experimentation, one of the biggest questions around the
>> first point is: do you really want potentially thousands or tens of
>> thousands of time-shift events to be created throughout the day? A record
>> processor that reads in a ton of small record sets and injects them into
>> the Delta Lake would create a ton of these checkpoints, and they'd be
>> largely meaningless to people trying to make sense of them for the purpose
>> of going back and forth in time between versions.
>>
>> Do we trigger a commit per record set or set a timer?
>>
>> Most of us on the NiFi dev side have no real experience here. It would be
>> helpful for us to get some ideas from the community to form use cases,
>> because there are some big gaps in how we'd even start to shape the
>> requirements.
>>
>> On Sun, Mar 29, 2020 at 1:28 PM Paul Parker 
>> wrote:
>>
>>> Hi Mike,
>>> Your alternate suggestion sounds good. But how does it work if I want to
>>> keep this running continuously? In other words, the Delta table should be
>>> continuously updated. After all, this is one of the biggest advantages of
>>> Delta: you can ingest batch and streaming data into one table.
>>>
>>> I am also thinking about workarounds (using Athena, Presto or Redshift
>>> with NiFi):
>>> "Here is the list of integrations that enable you to access Delta
>>> tables from external data processing engines.
>>>
>>>- Presto and Athena to Delta Lake Integration
>>><https://docs.delta.io/latest/presto-integration.html>
>>>- Redshift Spectrum to Delta Lake Integration
>>><https://docs.delta.io/latest/redshift-spectrum-integration.html>
>>>- Snowflake to Delta Lake Integration
>>><https://docs.delta.io/latest/

Re: How to use delta storage format

2020-03-31 Thread Paul Parker
Let me share answers from the delta community:

Answer to Q1:
Structured streaming queries can commit every minute, even every 20-30
seconds. This definitely creates small files, but that is okay, because it
is expected that people will periodically compact the files. The same
timing should work fine for NiFi and any other streaming engine. It does
create 1000-ish versions per day, but that is okay.
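
For reference, here is a minimal sketch (not NiFi-specific code, just the
compaction pattern from the Delta documentation) of what such a periodic
compaction job could look like, assuming Spark with the delta-core package
on the classpath; the table path and target file count are hypothetical:

    import org.apache.spark.sql.SparkSession

    object CompactDeltaTable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("compact-delta-table")
          .getOrCreate()

        val tablePath = "s3a://my-bucket/my-delta-table" // hypothetical path
        val numFiles  = 16                               // hypothetical target file count

        // Rewrite the table into fewer, larger files. dataChange = false tells
        // Delta that only the layout changed, not the data, so downstream
        // streaming readers can safely ignore this commit.
        spark.read
          .format("delta")
          .load(tablePath)
          .repartition(numFiles)
          .write
          .option("dataChange", "false")
          .format("delta")
          .mode("overwrite")
          .save(tablePath)
      }
    }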

Answer to Q2:
That is up to the sink implementation. Both are okay. In fact, it can
probably be a combination of both, as long as we don't commit every
second; that may not scale well.
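
Just to illustrate the "combination of both" idea (this is not an existing
NiFi or Delta API, only a rough sketch of the decision logic with made-up
names and thresholds), a sink could commit once enough record sets have
accumulated or once a maximum interval has passed, whichever comes first:

    import java.time.{Duration, Instant}

    // Hypothetical commit policy: flush after maxBatches record sets or
    // after maxWait has elapsed, whichever comes first.
    final case class CommitPolicy(maxBatches: Int, maxWait: Duration)

    final class CommitTrigger(policy: CommitPolicy) {
      private var pendingBatches = 0
      private var lastCommit = Instant.now()

      // Count one incoming record set; returns true when a commit should fire.
      def onRecordSet(): Boolean = {
        pendingBatches += 1
        val waited = Duration.between(lastCommit, Instant.now())
        val shouldCommit =
          pendingBatches >= policy.maxBatches || waited.compareTo(policy.maxWait) >= 0
        if (shouldCommit) {
          pendingBatches = 0
          lastCommit = Instant.now()
        }
        shouldCommit
      }
    }

    // e.g. val trigger = new CommitTrigger(CommitPolicy(maxBatches = 100, maxWait = Duration.ofSeconds(60)))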

Answer to Q3:
You need a primary node that is responsible for managing the Delta table.
That node would be responsible for reading the log, parsing it, updating
it, etc. Unfortunately, we have no good non-Spark way to read the log,
much less to write to it.
There is an experimental uber jar that packages Delta and Spark together
into a single jar, which you could use to read the log. It is available
here: https://github.com/delta-io/connectors/

What it basically does is run a local-mode (in-process) Spark to read the
log. This is what we are using to build a Hive connector that will allow
Hive to read Delta files. The goal of that was only to read; your goal is
to write, which is definitely more complicated, because you have to do
much more. That uber jar has all the necessary code to do the writing,
which you could use, but there has to be a driver node that collects all
the Parquet files written by other nodes and atomically commits them to
the Delta log to make them visible to all readers.
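
To make that last part a bit more concrete, here is a rough sketch (not the
connector's actual code) of what such a driver/primary node could do with a
local-mode, in-process Spark and the delta-core dependency on the classpath:
take Parquet files that the other nodes have staged somewhere and append
them to the Delta table in a single atomic commit. The paths and the staging
convention are hypothetical:

    import org.apache.spark.sql.SparkSession

    object CommitStagedParquet {
      def main(args: Array[String]): Unit = {
        // Local-mode (in-process) Spark, similar to what the uber jar does for Hive.
        val spark = SparkSession.builder()
          .appName("delta-primary-node-commit")
          .master("local[*]")
          .getOrCreate()

        val stagingDir = "s3a://my-bucket/staging/batch-0001/" // written by the other nodes (hypothetical)
        val tablePath  = "s3a://my-bucket/my-delta-table"      // hypothetical table location

        // Reading the staged Parquet and writing it through the Delta writer
        // produces one atomic commit in the Delta log, so the new data becomes
        // visible to all readers at once.
        spark.read
          .parquet(stagingDir)
          .write
          .format("delta")
          .mode("append")
          .save(tablePath)
      }
    }

Note that this re-writes the staged data through Spark rather than adding the
already-written files to the log directly, which is what a real connector
would presumably aim to do.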

Does this initial orientation help?


Mike Thomsen  wrote on Sun., March 29, 2020, 20:29:

> It looks like a lot of their connectors rely on external management by
> Spark. That is true of Hive, and also of Athena/Presto unless I misread the
> documentation. Read some of the fine print near the bottom of this to get
> an idea of what I mean:
>
> https://github.com/delta-io/connectors
>
> Hypothetically, we could build a connector for NiFi, but there are some
> things about the design of Delta Lake that I am not sure about based on my
> own research and what not. Roughly, they are the following:
>
> 1. What is a good timing strategy for doing commits to Delta Lake?
> 2. What would trigger a commit in the first place?
> 3. Is there a good way to trigger commits that would work within the
> masterless cluster design of clustered NiFi instead of requiring a special
> "primary node only" processor for executing commits?
>
> Based on my experimentation, one of the biggest questions around the first
> point is: do you really want potentially thousands or tens of thousands of
> time-shift events to be created throughout the day? A record processor that
> reads in a ton of small record sets and injects them into the Delta Lake
> would create a ton of these checkpoints, and they'd be largely meaningless
> to people trying to make sense of them for the purpose of going back and
> forth in time between versions.
>
> Do we trigger a commit per record set or set a timer?
>
> Most of us on the NiFi dev side have no real experience here. It would be
> helpful for us to get some ideas from the community to form use cases,
> because there are some big gaps in how we'd even start to shape the
> requirements.
>
> On Sun, Mar 29, 2020 at 1:28 PM Paul Parker  wrote:
>
>> Hi Mike,
>> Your alternate suggestion sounds good. But how does it work if I want to
>> keep this running continuously? In other words, the Delta table should be
>> continuously updated. After all, this is one of the biggest advantages of
>> Delta: you can ingest batch and streaming data into one table.
>>
>> I am also thinking about workarounds (using Athena, Presto or Redshift with NiFi):
>> "Here is the list of integrations that enable you to access Delta tables
>> from external data processing engines.
>>
>>- Presto and Athena to Delta Lake Integration
>><https://docs.delta.io/latest/presto-integration.html>
>>- Redshift Spectrum to Delta Lake Integration
>><https://docs.delta.io/latest/redshift-spectrum-integration.html>
>>- Snowflake to Delta Lake Integration
>><https://docs.delta.io/latest/snowflake-integration.html>
>>- Apache Hive to Delta Lake Integration
>><https://docs.delta.io/latest/hive-integration.html>"
>>
>> Source:
>> https://docs.delta.io/latest/integrations.html
>>
>> I am looking forward to further ideas from the community.
>>
>> Mike Thomsen  wrote on Sun., March 29, 2020, 17:23:
>>
>>> I think there is a connector for Hive that works with Delta. You could
>>

Re: How to use delta storage format

2020-03-29 Thread Paul Parker
Hi Mike,
Your alternate suggestion sounds good. But how does it work if I want to
keep this running continuously? In other words, the Delta table should be
continuously updated. After all, this is one of the biggest advantages of
Delta: you can ingest batch and streaming data into one table.

I am also thinking about workarounds (using Athena, Presto or Redshift with NiFi):
"Here is the list of integrations that enable you to access Delta tables
from external data processing engines.

   - Presto and Athena to Delta Lake Integration
   <https://docs.delta.io/latest/presto-integration.html>
   - Redshift Spectrum to Delta Lake Integration
   <https://docs.delta.io/latest/redshift-spectrum-integration.html>
   - Snowflake to Delta Lake Integration
   <https://docs.delta.io/latest/snowflake-integration.html>
   - Apache Hive to Delta Lake Integration
   <https://docs.delta.io/latest/hive-integration.html>"

Source:
https://docs.delta.io/latest/integrations.html

I am looking forward to further ideas from the community.

Mike Thomsen  wrote on Sun., March 29, 2020, 17:23:

> I think there is a connector for Hive that works with Delta. You could try
> setting up Hive to work with Delta and then using NiFi to feed Hive.
> Alternatively, you can simply convert the results of the JDBC query into a
> Parquet file, push it to a desired location, and run a Spark job to convert
> from Parquet to Delta (that should be pretty fast because Delta stores its
> data as Parquet files plus a transaction log).
>
> On Fri, Mar 27, 2020 at 2:13 PM Paul Parker  wrote:
>
>> We read data via JDBC from a database and want to save the results as a
>> Delta table and then read them again. How can I accomplish this with NiFi
>> and Hive or Glue Metastore?
>>
>


How to use delta storage format

2020-03-27 Thread Paul Parker
We read data via JDBC from a database and want to save the results as a
Delta table and then read them again. How can I accomplish this with NiFi
and Hive or Glue Metastore?


Re: How to write multiple prepends/appends in a single EL expression?

2020-03-19 Thread Paul Parker
You can do this without append and prepend:

sftp://${sftp.remote.host}/${path}/${filename}

Give it a try.

Eric Chaves  wrote on Thu., March 19, 2020, 21:05:

> Hi folks,
>
> Sorry for another newbie question. =) I'm trying to write a single EL
> expression to perform multiple string manipulations, but I keep hitting an
> invalid EL expression error.
>
> What would be the correct way to write this expression: 
> ${sftp.remote.host:prepend("sftp://";):append:("/"):append(${path}):append("/"):append(${filename})}
> ?
>
> Thanks in advance,
>
> Eric
>


Re: Metrics via Prometheus

2020-03-05 Thread Paul Parker
It would be great if you could share your story as a blog post.

Eric Ladner  wrote on Wed., March 4, 2020, 19:45:

> Thank you so much for your guidance.  I was able to get data flowing into
> Prometheus fairly easily once all the pieces were understood.
>
> Now, I just need to dig into Prometheus queries and make some Grafana
> dashboards.
>
> On Tue, Mar 3, 2020 at 2:54 PM Yolanda Davis 
> wrote:
>
>> Sure, not a problem!  Hopefully the thoughts below can help you get started:
>>
>> As you may know the PrometheusReportingTask is a bit different from other
>> tasks in that it actually exposes an endpoint for Prometheus to scrape (vs.
>> pushing data directly to Prometheus).  When the task is started the
>> endpoint is created on the port you designate under “/metrics”; so just
>> ensure that you don’t have anything already on the port you select. If you
>> want to ensure that you have a secured endpoint for Prometheus to connect,
>> be sure to use an SSL Context Service (a controller service that will allow
>> the reporting task to use the appropriate key/trust stores for TLS). Also
>> you'll want to consider the levels at which you are reporting (Root Group,
>> Process Group or All Components), especially in terms of the amount of data
>> you are looking to send back.  Jvm metrics can be sent as well flow
>> specific metrics. Finally consider how often metrics should be refreshed by
>> adjusting the Scheduling Strategy in the settings tab for the task.
>>
>> When starting the task you should be able to go directly to the endpoint
>> (without Prometheus) to confirm its output (e.g.
>> http://localhost:9092/metrics). You should see a format similar to what
>> Prometheus supports for its scraping jobs (see example
>> https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-example
>> )
>>
>> On the Prometheus side you’ll want to follow their instructions on how to
>> set up a scrape configuration that will point to the newly created metrics
>> endpoint. I’d recommend checking out the first steps for help (
>> https://prometheus.io/docs/introduction/first_steps/#configuring-prometheus)
>> and then when you need to provide more advanced settings take a look here
>> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config.
>> The key is you’ll want to define a new scrape job that looks at the NiFi
>> endpoint for scraping.  To start you may want to refer to the cluster
>> directly but later add the security credentials or use another method for
>> discovering the endpoint.
>>
>> Once these configurations are in place and Prometheus is started (or
>> restarted), after a few seconds you should begin to see metrics landing when
>> querying in Grafana.
>>
>> I hope this helps!  Please let me know if you have any further questions.
>>
>> -yolanda
>>
>> On Tue, Mar 3, 2020 at 2:10 PM Eric Ladner  wrote:
>>
>>> Yes, exactly!   Reporting Task -> Prometheus -> Grafana for keeping an
>>> eye on things running in NiFi.
>>>
>>> If you have any hints/tips on getting things working, I'd be grateful.
>>>
>>> On Tue, Mar 3, 2020 at 12:35 PM Yolanda Davis 
>>> wrote:
>>>
 Hi Eric,

 Were you looking to use the Prometheus Reporting Task for making
 metrics available for Prometheus scraping? I don't believe any
 documentation outside of what is in NiFi exists just yet, but I'm happy to
 help answer questions you may have (I've used this task recently).

 -yolanda

 On Tue, Mar 3, 2020 at 10:51 AM Eric Ladner 
 wrote:

> Is there a guide to setting up NiFi and Prometheus anywhere?  The NAR
> docs are a little vague.
>
> Thanks,
>
> Eric Ladner
>


 --
 --
 yolanda.m.da...@gmail.com
 @YolandaMDavis


>>>
>>> --
>>> Eric Ladner
>>>
>>
>>
>> --
>> --
>> yolanda.m.da...@gmail.com
>> @YolandaMDavis
>>
>>
>
> --
> Eric Ladner
>