Web Based Nifi for Python

2023-02-21 Thread Darren Govoni
Hi,
   This looks like a promising spiritual successor to Java NiFi. Pure Python. 
Processors scale individually, directly on the metal.
It also runs purely in the browser for testing.

https://elasticcode.ai

Pretty neat!


Fwd: Please do not use the Apache NiFi mailing lists to advertise PyFi

2021-09-09 Thread Darren Govoni
Sorry folks. The PMC have spoken. Anyone wanting to keep up with this free and 
generous spiritual offshoot of NiFi will have to follow the git pages.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/AAb9ysg>

From: Joe Witt 
Sent: Thursday, September 9, 2021 1:34:21 PM
To: Darren Govoni 
Subject: Please do not use the Apache NiFi mailing lists to advertise PyFi

Darren

On behalf of the Apache NiFi PMC I am writing to advise that we
appreciate your work and efforts related to the PYFI concept but are
concerned about your usage of the Apache NiFi mailing lists.

You've initiated at least 9 e-mail threads in July, August, and
September telling people on NiFi's fairly large users list about PyFi.
It isn't to build/communicate with the Apache NiFi community about
work being done within Apache NiFi.  Rather, it is to direct interested
folks to your Python-based rewrite heavily inspired by NiFi. The last
seven threads also have zero replies, and the content directs people to
your GitHub, not to proposals for contributions within the Apache NiFi
community. Your readme suggests you might
within the Apache NiFi community. Your readme suggests you might
provide code at a later date and that you're looking for 'sponsors' to
continue the work.  This means the usage of the Apache NiFi mailing
lists at this point is an advertisement for something other than
Apache NiFi.

You are certainly doing interesting work and developing an interesting
tool.  Provided you honor the Apache Software Foundation trademarks
and such and provided you work within the Apache License version 2
you're good there.  The challenge is the use of the Apache NiFi
mailing list to direct people to your project. Please discontinue
doing so.  Should you decide you wish to help build a python based
NiFi implementation within the NiFi community or want to bring some of
the ideas/concepts/implementations into the NiFi community then of
course you're welcome to engage with the lists for such things.

Thanks
Apache NiFi PMC


PYFI Network Layer Architecture

2021-09-02 Thread Darren Govoni
Hi!
   I've added some more description and some diagrams that detail the 
distributed, networked architecture of PYFI here[1], which is currently 
functional.

I'll probably be doing an initial release of the stack and CLI soon-ish.

[1] https://github.com/radiantone/pyfi#network-layers

Darren


PYFI Logical Processors

2021-08-29 Thread Darren Govoni
Hi!
   I've added some more information about PYFI logical processors (derived from 
the notion of NiFi processors) here:

https://github.com/radiantone/pyfi#logical-processors

One cool feature of PYFI processors is that they can be scaled or moved from 
one server to another (even while running), and queued messages/data/functions 
still arrive reliably wherever clusters of those processors exist, even after 
server or process restarts.

It's not in the README yet, but each processor streams logs on various channels 
in real-time (websockets). You can listen to a specific processor, for example, 
using the CLI and get streaming telemetry.

$ pyfi listen --name pyfi.queue1.proc1 --server localhost --channel task

Cheers!


PYFI At-Scale Design

2021-08-24 Thread Darren Govoni
Hi!
  I've added a diagram[1] detailing the at-scale properties of PYFI for those 
who are interested. In a nutshell, PYFI scales at the processor level (vs. the 
flow), allowing you to have, say, 50 processors of the same type running on 50 
CPUs, whereas another (less intensive) processor in your flow might only 
require 10 CPUs.

At-Scale means there is a 1-1 correspondence between logical and physical 
compute units.

[1] https://github.com/radiantone/pyfi#at-scale-design

If a data ingress or queue is particularly demanding, then scale only that 
processor. Furthermore, there are auto-scaling, elastic qualities as well: 
PYFI will scale a processor based on temporary demand.
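
For illustration only, a hypothetical CLI invocation for that kind of scaling 
might look like this (this is not a documented pyfi command; the actual verbs 
and flags may differ):

$ pyfi proc scale --name pyfi.queue1.proc1 --workers 50   # hypothetical example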

This design is inherently load-balanced at the processor level and has the 
following additional qualities:

  *   hardware redundancy
  *   high-availability
  *   fault-tolerance
  *   fail-over
  *   performance
  *   ease of maintenance


Cheers!
Darren


PYFI Tech Stack & Security Model

2021-08-19 Thread Darren Govoni
Hi!
I've added some details for the PYFI tech stack[1] and security model[2] 
for anyone curious. I will slowly be trickling out updates on the main PYFI 
github[3] before eventually releasing the initial reference stack (docker 
compose file) and python packages for the CLI.

Cheers!
Darren

PS. Thanks to nifi moderators for letting me share!

[1] https://github.com/radiantone/pyfi#tech-stack NOTE: A reference 
implementation
[2] https://github.com/radiantone/pyfi#security-model
[3] https://github.com/radiantone/pyfi



PYFI Git Repo & Design Information

2021-08-10 Thread Darren Govoni
Hi!
  I've started documenting the PYFI system on this repo here.
https://github.com/radiantone/pyfi
I have a working stack that accomplishes the core design goals described in the 
repo, but without integration with pyfi-ui 
(https://github.com/radiantone/pyfi-ui), which will be a straightforward but 
somewhat heavy lift. Anyone interested in sponsoring or joining the effort, 
please contact me. Full access to the source code can be provided, with the 
goal of opening it once it has the proper support behind it.

There will also be default processors that integrate with Apache NiFi for 
ingress and egress.

Cheers!
Darren







PYFI CLI, DevOps & Automation

2021-07-30 Thread Darren Govoni
Hello!
Just a quick note about PYFI (the Python NiFi-like platform I'm 
developing). Soon I will post a repo for the pyfi CLI.
You will be able to construct your PYFI mesh networks and manage all aspects of 
the system, including flows, processors, scaling, start/stop, etc., using the 
CLI without the UI.
This is an important design goal, as it will allow for improved devops, 
automation, and maintenance without a human-driven GUI. Quick look below.

Cheers!
Darren

$ pyfi
Usage: pyfi [OPTIONS] COMMAND [ARGS]...

  Pyfi CLI for managing the pyfi network

Options:
  --debug / --no-debug
  -d, --db TEXT  Database URI
  --help         Show this message and exit.

Commands:
  add     Add an object to the database
  agent   Run pyfi agent
  api     API server admin
  db      Database operations
  get     Get unique row
  ls      List database objects
  node    Node management operations
  proc    Run or manage processors
  task    Pyfi task management
  update  Update a database object
  web     Web server admin

$ pyfi add
Usage: pyfi add [OPTIONS] COMMAND [ARGS]...

  Add an object to the database

Options:
  --id TEXT  ID of object being added
  --help     Show this message and exit.

Commands:
  agent  Add agent object to the database
  outlet Add outlet to a processor
  plug   Add plug to a processor
  processor  Add processor to the database
  queue  Add queue object to the database
  role   Add role object to the database
  user   Add user object to the database

$ pyfi proc
Usage: pyfi proc [OPTIONS] COMMAND [ARGS]...

  Run or manage processors

Options:
  --id TEXT  ID of processor
  --help     Show this message and exit.

Commands:
  remove   Remove a processor
  restart  Restart a processor
  start    Start a processor
  stop     Stop a processor


Re: PYFI UI updates

2021-07-19 Thread Darren Govoni
Oh, forgot. Yeah

https://github.com/radiantone/pyfi-ui

This is just a functional NodeJS UI right now; the backend will come later.

From: Aaron Rich 
Sent: Monday, July 19, 2021 10:43 AM
To: users@nifi.apache.org 
Subject: Re: PYFI UI updates

Is this project posted anywhere to take a look at? Looks pretty cool.

Thanks.

On Mon, Jul 19, 2021, 05:17 Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi,
  Made some useful updates to PYFI UI if anyone is curious. Essentially 
real-time charting & scripting for processors. PYFI will store all the data 
usage metrics over time so they can be mined for predictive AI models.

Each row (e.g. In, Read/Write, Out, Tasks) has a new sparkline real-time chart 
showing usage history. Clicking on an individual row shows the large data mine 
for that metric.


Scripting processors is easy. Just get the data from an input port and write 
whatever you want to an output port (you control the data typing). The data is 
sent to transactional queues connected to other processors (fully distributed).
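
As a purely illustrative sketch of that scripting model (the real PYFI scripting 
API may differ; the function and port accessors below are assumptions, not PYFI 
code):

    # hypothetical processor script: read from an input port, write to an output port
    def process(input_port, output_port):
        data = input_port.get()            # assumed accessor; whatever the upstream processor sent
        result = {"length": len(data)}     # you control the data typing
        output_port.write(result)          # assumed accessor; queued to downstream processors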


Re: PYFI Python Nifi Clone

2021-07-13 Thread Darren Govoni
Hello!

Here is my GitHub repo for the PYFI NodeJS/Vue implementation of the NiFi UI. Enjoy!

https://github.com/radiantone/pyfi-ui

Darren

From: Darren Govoni
Sent: Saturday, July 10, 2021 12:38 PM
To: users@nifi.apache.org 
Subject: PYFI Python Nifi Clone

Hi!
   Just sharing a fun project I'll post on GitHub soon. I'm creating a pure 
Python clone of NiFi that separates the UI (Vue/NodeJS implementation) from the 
backend distributed messaging layer (RabbitMQ, Redis, AMQP, SQS). It will allow 
for runtime scripting of processors using Python and leverage a variety of 
transactional message brokers and distributed topologies (e.g. AMQP).
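
The transactional-queue plumbing is the kind of thing shown below. This is only 
a generic illustration using the RabbitMQ client (pika), not PYFI code; the 
queue and payload names are made up, and a local broker is assumed:

    import pika  # RabbitMQ/AMQP client

    # Publish a task payload to a durable queue so it survives broker restarts.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="pyfi.queue1", durable=True)     # example queue name
    channel.basic_publish(
        exchange="",
        routing_key="pyfi.queue1",
        body=b'{"task": "proc1", "payload": "hello"}',
        properties=pika.BasicProperties(delivery_mode=2),        # persist the message
    )
    connection.close()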

Here is a sneak peek at my port of the UI to Vue/NodeJS, which I'll share on 
GitHub soon (minified). It's a fully MVC/Node/Vue reactive and responsive UI 
that adheres to the Material Design 2.0 standard. It also uses a webpack build 
and is minified, etc.

It makes a number of improvements, such as tabs for multiple flow renders, and 
will interface directly with git for flow versioning.

Cheers!
Darren


Re: PYFI Python Nifi Clone

2021-07-10 Thread Darren Govoni
Ahh right! Thanks for reminder!

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/AAb9ysg>


From: Joe Witt 
Sent: Saturday, July 10, 2021 12:44:53 PM
To: users@nifi.apache.org 
Subject: Re: PYFI Python Nifi Clone

Sounds fun and looks cool.

But do not violate the marks such as do not use the Apache NiFi logo.

Thanks

On Sat, Jul 10, 2021 at 9:38 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi!,
   Just sharing a fun project I'll post on github soon. I'm creating a pure 
python clone of Nifi that separates the UI (Vue/NodeJS implementation) from the 
backend distributed messaging layer (RabbitMQ, Redis, AMQP, SQS). It will allow 
for runtime scripting of processors using python and leverage a variety of 
transactional message brokers and distributed topologies (e.g. AMQP).

Here is a sneak peek at my port of the UI to Vue/NodeJS which I'll share on 
github soon (minified). It's a fully MVC/Node/Vue reactive and responsive UI 
that adheres to Material Design 2.0 standard. Also uses webpack build and is 
minified, etc.

Makes a number of improvements such as tabs for multiple flow renders and will 
interface directly with git for flow versioning.

Cheers!

Darren


Re: Hive & Hadoop 3.1.2

2021-04-09 Thread Darren Govoni
Does that also come with a HiveController3? I looked but didn't see one.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/AAb9ysg>


From: Doutre, Mark 
Sent: Friday, April 9, 2021 9:12:08 AM
To: users@nifi.apache.org 
Subject: RE: Hive & Hadoop 3.1.2


Use the Hive3 processors. They were removed from the NiFi distribution and you 
have to install them yourself – the NARs are in Maven.

Be aware there is a bad memory leak in Hive3Streaming.



From: Darren Govoni 
Sent: 09 April 2021 14:00
To: users@nifi.apache.org
Subject: Hive & Hadoop 3.1.2



Hi

   I was trying to update the version of hadoop-common that Hive uses from 2.6.2 
to 3.1.2, which compiled fine, but the Hive shim loader complained about the 
version and threw exceptions.



Is the version of Hive used in NiFi stuck on Hadoop 2.6.2, or is there a 
different path to upgrading it?



Darren



Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/AAb9ysg>


Hive & Hadoop 3.1.2

2021-04-09 Thread Darren Govoni
Hi
   I was trying to update the version of hadoop-common that Hive uses from 2.6.2 
to 3.1.2, which compiled fine, but the Hive shim loader complained about the 
version and threw exceptions.

Is the version of Hive used in NiFi stuck on Hadoop 2.6.2, or is there a 
different path to upgrading it?

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android


pkinit CLI?

2020-12-14 Thread Darren Govoni
Hi,
  Is there a CLI script or command in Kerby for pkinit? How is it performed?

thanks,
Darren


Re: Secure Mode & Kerberos

2020-12-14 Thread Darren Govoni
I see. Thank you.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Bryan Bende 
Sent: Monday, December 14, 2020 2:26:54 PM
To: users@nifi.apache.org 
Subject: Re: Secure Mode & Kerberos

It refers to what I said earlier about providing a core-site.xml to the 
processor that has:


<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>


It means the core-site you provided doesn't have that, which indicates HDFS is 
not kerberized, but you filled in the kerberos properties on the processor, so 
it is telling you they won't be used for anything since core-site doesn't say 
that kerberos is enabled.

On Mon, Dec 14, 2020 at 2:12 PM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Gotcha. Thanks.

The only reason I got down this road is because the HDFS processors were logging

"Configuration does not have security enabled, keytab and principal will be 
ignored."

This is a bit vague and left me thinking I needed to run NiFi in secure mode. 
The processors were configured for Kerberos. Still not sure what the message 
refers to, though.

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Bryan Bende mailto:bbe...@gmail.com>>
Sent: Monday, December 14, 2020 1:56:01 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Secure Mode & Kerberos

Ok so you are authenticating with a client cert, so this has nothing to do with 
kerberos.

Put the DN from the client cert as the initial admin in authorizers.xml and it 
generates the policies in authorizations.xml for you.

You likely need to delete users.xml and authorizations.xml in order for it to 
be a fresh setup and trigger the seeding of the initial admin.
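
For reference, a trimmed sketch of the relevant authorizers.xml entries (the DN 
value is only an example; use the DN from the client cert):

    <userGroupProvider>
        <identifier>file-user-group-provider</identifier>
        <class>org.apache.nifi.authorization.FileUserGroupProvider</class>
        <property name="Users File">./conf/users.xml</property>
        <property name="Initial User Identity 1">CN=admin, OU=NIFI</property>
    </userGroupProvider>
    <accessPolicyProvider>
        <identifier>file-access-policy-provider</identifier>
        <class>org.apache.nifi.authorization.FileAccessPolicyProvider</class>
        <property name="Authorizations File">./conf/authorizations.xml</property>
        <property name="User Group Provider">file-user-group-provider</property>
        <property name="Initial Admin Identity">CN=admin, OU=NIFI</property>
    </accessPolicyProvider>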



On Mon, Dec 14, 2020 at 1:51 PM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
I see this error in the browser
[cid:17662b30953cb971f161]
Along with the exception in the log: Kerberos ticket login not supported by 
this NiFi

That is just with adding the /etc/krb5.conf to nifi.properties per your 
suggestion.

I do have a browser cert it prompted me to select.

I had started to add the cert CN to authorizers.xml (i.e. add it to initial 
admin field), but it requires populating authorizations.xml as well and I'm not 
sure how to do that.


From: Bryan Bende mailto:bbe...@gmail.com>>
Sent: Monday, December 14, 2020 1:04 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Secure Mode & Kerberos

I'm confused: how are you trying to authenticate to NiFi, and what is
the error you are getting in the NiFi UI when you attempt to access
it?

You said you didn't want to authenticate via kerberos, so the warning
should not matter.

On Mon, Dec 14, 2020 at 11:26 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
>
> Thanks Bryan.
>
> I'm seeing in AccessResource.java that it will throw this exception if spnego 
> is not configured or keberosService is null, which it is in my nifi.
>
> Doing a quick search for setKeberosService callers doesnt turn anything up in 
> the code. And this exception prevents me accessing the app.
>
> Do i need to configure anything in authorizers.xml or users.xml?
>
> I set the krb file in nifi.properties already.
>
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
> 
> From: Bryan Bende mailto:bbe...@gmail.com>>
> Sent: Monday, December 14, 2020 11:19:28 AM
> To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
> mailto:users@nifi.apache.org>>
> Subject: Re: Secure Mode & Kerberos
>
> That is just a warning that prints every time you refresh the UI, the
> UI makes a call to see if SPNEGO is enabled, it shouldn't impact
> anything, same case for OIDC.
>
> On Mon, Dec 14, 2020 at 10:15 AM Darren Govoni 
> mailto:dar...@ontrenet.com>> wrote:
> >
> > When I remove the SPNEGO properties and set the krb5 file
> >
> > # kerberos #
> > nifi.kerberos.krb5.file=/etc/krb5.conf
> >
> >
> > 020-12-14 10:09:44,477 WARN [NiFi Web Server-19] 
> > o.a.n.w.a.c.IllegalStateExceptionMapper java.lang.IllegalStateException: 
> > Kerberos ticket login not supported by this NiFi.. Returning Conflict 
> > response.
> > java.lang.IllegalStateException: Kerberos ticket login not supported by 
> > this NiFi.
> >
> > Also threw exception about OpenID Connect not configured.
> >
> > Nifi 1.11.4
> >
> > 
> > From: Darren Govoni mailto:dar...@ontrenet.com>>
> > Sent: Mo

Re: Secure Mode & Kerberos

2020-12-14 Thread Darren Govoni
Gotcha. Thanks.

The only reason I got down this road is because the HDFS processors were logging

"Configuration does not have security enabled, keytab and principal will be 
ignored."

This is a bit vague and left me thinking I needed to run NiFi in secure mode. 
The processors were configured for Kerberos. Still not sure what the message 
refers to, though.

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Bryan Bende 
Sent: Monday, December 14, 2020 1:56:01 PM
To: users@nifi.apache.org 
Subject: Re: Secure Mode & Kerberos

Ok so you are authenticating with a client cert, so this has nothing to do with 
kerberos.

Put the DN from the client cert as the initial admin in authorizers.xml and it 
generates the policies in authorizations.xml for you.

You likely need to delete users.xml and authorizations.xml in order for it to 
be a fresh setup and trigger the seeding of the initial admin.



On Mon, Dec 14, 2020 at 1:51 PM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
I see this error in the browser
Along with the exception in the log: Kerberos ticket login not supported by 
this NiFi

That is just with adding the /etc/krb5.conf to nifi.properties per your 
suggestion.

I do have a browser cert it prompted me to select.

I had started to add the cert CN to authorizers.xml (i.e. add it to initial 
admin field), but it requires populating authorizations.xml as well and I'm not 
sure how to do that.


From: Bryan Bende mailto:bbe...@gmail.com>>
Sent: Monday, December 14, 2020 1:04 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Secure Mode & Kerberos

I'm confused: how are you trying to authenticate to NiFi, and what is
the error you are getting in the NiFi UI when you attempt to access
it?

You said you didn't want to authenticate via kerberos, so the warning
should not matter.

On Mon, Dec 14, 2020 at 11:26 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
>
> Thanks Bryan.
>
> I'm seeing in AccessResource.java that it will throw this exception if spnego 
> is not configured or keberosService is null, which it is in my nifi.
>
> Doing a quick search for setKeberosService callers doesnt turn anything up in 
> the code. And this exception prevents me accessing the app.
>
> Do i need to configure anything in authorizers.xml or users.xml?
>
> I set the krb file in nifi.properties already.
>
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
> 
> From: Bryan Bende mailto:bbe...@gmail.com>>
> Sent: Monday, December 14, 2020 11:19:28 AM
> To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
> mailto:users@nifi.apache.org>>
> Subject: Re: Secure Mode & Kerberos
>
> That is just a warning that prints every time you refresh the UI, the
> UI makes a call to see if SPNEGO is enabled, it shouldn't impact
> anything, same case for OIDC.
>
> On Mon, Dec 14, 2020 at 10:15 AM Darren Govoni 
> mailto:dar...@ontrenet.com>> wrote:
> >
> > When I remove the SPNEGO properties and set the krb5 file
> >
> > # kerberos #
> > nifi.kerberos.krb5.file=/etc/krb5.conf
> >
> >
> > 020-12-14 10:09:44,477 WARN [NiFi Web Server-19] 
> > o.a.n.w.a.c.IllegalStateExceptionMapper java.lang.IllegalStateException: 
> > Kerberos ticket login not supported by this NiFi.. Returning Conflict 
> > response.
> > java.lang.IllegalStateException: Kerberos ticket login not supported by 
> > this NiFi.
> >
> > Also threw exception about OpenID Connect not configured.
> >
> > Nifi 1.11.4
> >
> > 
> > From: Darren Govoni mailto:dar...@ontrenet.com>>
> > Sent: Monday, December 14, 2020 10:00 AM
> > To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
> > mailto:users@nifi.apache.org>>
> > Subject: Re: Secure Mode & Kerberos
> >
> > Hi Bryan
> >
> > I did do that but still got the warning/error. But I will go back and 
> > verify this.
> >
> > Darren
> >
> > Sent from my Verizon, Samsung Galaxy smartphone
> > Get Outlook for Android
> >
> > 
> > From: Bryan Bende mailto:bbe...@gmail.com>>
> > Sent: Monday, December 14, 2020 9:37:33 AM
> > To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
> > mailto:users@nifi.apache.org>>
> > Subject: Re: Secure Mode & Kerberos
> >
> > You don't need to have NiFi secured with Kerberos in order to

Re: Secure Mode & Kerberos

2020-12-14 Thread Darren Govoni
I see this error in the browser
Along with the exception in the log: Kerberos ticket login not supported by 
this NiFi

That is just with adding the /etc/krb5.conf to nifi.properties per your 
suggestion.

I do have a browser cert it prompted me to select.

I had started to add the cert CN to authorizers.xml (i.e. add it to initial 
admin field), but it requires populating authorizations.xml as well and I'm not 
sure how to do that.


From: Bryan Bende 
Sent: Monday, December 14, 2020 1:04 PM
To: users@nifi.apache.org 
Subject: Re: Secure Mode & Kerberos

I'm confused: how are you trying to authenticate to NiFi, and what is
the error you are getting in the NiFi UI when you attempt to access
it?

You said you didn't want to authenticate via kerberos, so the warning
should not matter.

On Mon, Dec 14, 2020 at 11:26 AM Darren Govoni  wrote:
>
> Thanks Bryan.
>
> I'm seeing in AccessResource.java that it will throw this exception if spnego 
> is not configured or keberosService is null, which it is in my nifi.
>
> Doing a quick search for setKeberosService callers doesnt turn anything up in 
> the code. And this exception prevents me accessing the app.
>
> Do i need to configure anything in authorizers.xml or users.xml?
>
> I set the krb file in nifi.properties already.
>
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
> 
> From: Bryan Bende 
> Sent: Monday, December 14, 2020 11:19:28 AM
> To: users@nifi.apache.org 
> Subject: Re: Secure Mode & Kerberos
>
> That is just a warning that prints every time you refresh the UI, the
> UI makes a call to see if SPNEGO is enabled, it shouldn't impact
> anything, same case for OIDC.
>
> On Mon, Dec 14, 2020 at 10:15 AM Darren Govoni  wrote:
> >
> > When I remove the SPNEGO properties and set the krb5 file
> >
> > # kerberos #
> > nifi.kerberos.krb5.file=/etc/krb5.conf
> >
> >
> > 020-12-14 10:09:44,477 WARN [NiFi Web Server-19] 
> > o.a.n.w.a.c.IllegalStateExceptionMapper java.lang.IllegalStateException: 
> > Kerberos ticket login not supported by this NiFi.. Returning Conflict 
> > response.
> > java.lang.IllegalStateException: Kerberos ticket login not supported by 
> > this NiFi.
> >
> > Also threw exception about OpenID Connect not configured.
> >
> > Nifi 1.11.4
> >
> > 
> > From: Darren Govoni 
> > Sent: Monday, December 14, 2020 10:00 AM
> > To: users@nifi.apache.org 
> > Subject: Re: Secure Mode & Kerberos
> >
> > Hi Bryan
> >
> > I did do that but still got the warning/error. But I will go back and 
> > verify this.
> >
> > Darren
> >
> > Sent from my Verizon, Samsung Galaxy smartphone
> > Get Outlook for Android
> >
> > 
> > From: Bryan Bende 
> > Sent: Monday, December 14, 2020 9:37:33 AM
> > To: users@nifi.apache.org 
> > Subject: Re: Secure Mode & Kerberos
> >
> > You don't need to have NiFi secured with Kerberos in order to use HDFS
> > processors talking to kerberized HDFS.
> >
> > You just need to specify the krb5.conf in nifi.properties, and you
> > need to provide the HDFS processors with a core-site.xml that has
> > security set to kerberos.
> >
> > On Mon, Dec 14, 2020 at 9:28 AM Darren Govoni  wrote:
> > >
> > > Hi,
> > >   I want to test the HDFS processors using Kerberos, but they trigger a 
> > > warning saying Nifi is not running in secure mode, so it ignores kerberos.
> > >
> > > In order to get Nifi into secure mode I had to enable SPNEGO which it 
> > > seems to want a kerberos header to allow me into the app now.
> > >
> > > Is there a way to allow processors to run securely with kerberos without 
> > > having to auth myself into the app via kerberos? Which I'm not sure how 
> > > to do.
> > >
> > > Darren
> > >
> > > PS. I do have a Apache Kerby KDC running locally if that can help me auth 
> > > into Nifi.


Re: Secure Mode & Kerberos

2020-12-14 Thread Darren Govoni
Thanks Bryan.

I'm seeing in AccessResource.java that it will throw this exception if SPNEGO 
is not configured or kerberosService is null, which it is in my NiFi.

Doing a quick search for setKerberosService callers doesn't turn anything up in 
the code. And this exception prevents me from accessing the app.

Do I need to configure anything in authorizers.xml or users.xml?

I set the krb file in nifi.properties already.

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Bryan Bende 
Sent: Monday, December 14, 2020 11:19:28 AM
To: users@nifi.apache.org 
Subject: Re: Secure Mode & Kerberos

That is just a warning that prints every time you refresh the UI; the
UI makes a call to see if SPNEGO is enabled. It shouldn't impact
anything; same case for OIDC.

On Mon, Dec 14, 2020 at 10:15 AM Darren Govoni  wrote:
>
> When I remove the SPNEGO properties and set the krb5 file
>
> # kerberos #
> nifi.kerberos.krb5.file=/etc/krb5.conf
>
>
> 020-12-14 10:09:44,477 WARN [NiFi Web Server-19] 
> o.a.n.w.a.c.IllegalStateExceptionMapper java.lang.IllegalStateException: 
> Kerberos ticket login not supported by this NiFi.. Returning Conflict 
> response.
> java.lang.IllegalStateException: Kerberos ticket login not supported by this 
> NiFi.
>
> Also threw exception about OpenID Connect not configured.
>
> Nifi 1.11.4
>
> 
> From: Darren Govoni 
> Sent: Monday, December 14, 2020 10:00 AM
> To: users@nifi.apache.org 
> Subject: Re: Secure Mode & Kerberos
>
> Hi Bryan
>
> I did do that but still got the warning/error. But I will go back and verify 
> this.
>
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
> 
> From: Bryan Bende 
> Sent: Monday, December 14, 2020 9:37:33 AM
> To: users@nifi.apache.org 
> Subject: Re: Secure Mode & Kerberos
>
> You don't need to have NiFi secured with Kerberos in order to use HDFS
> processors talking to kerberized HDFS.
>
> You just need to specify the krb5.conf in nifi.properties, and you
> need to provide the HDFS processors with a core-site.xml that has
> security set to kerberos.
>
> On Mon, Dec 14, 2020 at 9:28 AM Darren Govoni  wrote:
> >
> > Hi,
> >   I want to test the HDFS processors using Kerberos, but they trigger a 
> > warning saying Nifi is not running in secure mode, so it ignores kerberos.
> >
> > In order to get Nifi into secure mode I had to enable SPNEGO which it seems 
> > to want a kerberos header to allow me into the app now.
> >
> > Is there a way to allow processors to run securely with kerberos without 
> > having to auth myself into the app via kerberos? Which I'm not sure how to 
> > do.
> >
> > Darren
> >
> > PS. I do have a Apache Kerby KDC running locally if that can help me auth 
> > into Nifi.


Re: Secure Mode & Kerberos

2020-12-14 Thread Darren Govoni
When I remove the SPNEGO properties and set the krb5 file

# kerberos #
nifi.kerberos.krb5.file=/etc/krb5.conf


020-12-14 10:09:44,477 WARN [NiFi Web Server-19] 
o.a.n.w.a.c.IllegalStateExceptionMapper java.lang.IllegalStateException: 
Kerberos ticket login not supported by this NiFi.. Returning Conflict response.
java.lang.IllegalStateException: Kerberos ticket login not supported by this 
NiFi.

Also threw exception about OpenID Connect not configured.

Nifi 1.11.4


From: Darren Govoni 
Sent: Monday, December 14, 2020 10:00 AM
To: users@nifi.apache.org 
Subject: Re: Secure Mode & Kerberos

Hi Bryan

I did do that but still got the warning/error. But I will go back and verify 
this.

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Bryan Bende 
Sent: Monday, December 14, 2020 9:37:33 AM
To: users@nifi.apache.org 
Subject: Re: Secure Mode & Kerberos

You don't need to have NiFi secured with Kerberos in order to use HDFS
processors talking to kerberized HDFS.

You just need to specify the krb5.conf in nifi.properties, and you
need to provide the HDFS processors with a core-site.xml that has
security set to kerberos.
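
For reference, those two pieces look roughly like this (paths are examples):

    # nifi.properties
    nifi.kerberos.krb5.file=/etc/krb5.conf

    # core-site.xml handed to the HDFS processors
    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>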

On Mon, Dec 14, 2020 at 9:28 AM Darren Govoni  wrote:
>
> Hi,
>   I want to test the HDFS processors using Kerberos, but they trigger a 
> warning saying Nifi is not running in secure mode, so it ignores kerberos.
>
> In order to get Nifi into secure mode I had to enable SPNEGO which it seems 
> to want a kerberos header to allow me into the app now.
>
> Is there a way to allow processors to run securely with kerberos without 
> having to auth myself into the app via kerberos? Which I'm not sure how to do.
>
> Darren
>
> PS. I do have a Apache Kerby KDC running locally if that can help me auth 
> into Nifi.


Re: Secure Mode & Kerberos

2020-12-14 Thread Darren Govoni
Hi Bryan

I did do that but still got the warning/error. But I will go back and verify 
this.

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Bryan Bende 
Sent: Monday, December 14, 2020 9:37:33 AM
To: users@nifi.apache.org 
Subject: Re: Secure Mode & Kerberos

You don't need to have NiFi secured with Kerberos in order to use HDFS
processors talking to kerberized HDFS.

You just need to specify the krb5.conf in nifi.properties, and you
need to provide the HDFS processors with a core-site.xml that has
security set to kerberos.

On Mon, Dec 14, 2020 at 9:28 AM Darren Govoni  wrote:
>
> Hi,
>   I want to test the HDFS processors using Kerberos, but they trigger a 
> warning saying Nifi is not running in secure mode, so it ignores kerberos.
>
> In order to get Nifi into secure mode I had to enable SPNEGO which it seems 
> to want a kerberos header to allow me into the app now.
>
> Is there a way to allow processors to run securely with kerberos without 
> having to auth myself into the app via kerberos? Which I'm not sure how to do.
>
> Darren
>
> PS. I do have a Apache Kerby KDC running locally if that can help me auth 
> into Nifi.


Secure Mode & Kerberos

2020-12-14 Thread Darren Govoni
Hi,
  I want to test the HDFS processors using Kerberos, but they trigger a warning 
saying Nifi is not running in secure mode, so it ignores kerberos.

In order to get NiFi into secure mode I had to enable SPNEGO, which now seems to 
want a Kerberos header to allow me into the app.

Is there a way to allow processors to run securely with Kerberos without having 
to auth myself into the app via Kerberos (which I'm not sure how to do)?

Darren

PS. I do have an Apache Kerby KDC running locally if that can help me auth into 
NiFi.


Re: Tls-toolkit.sh?

2020-12-11 Thread Darren Govoni
Yeah, I did put the same cert into the browser, but maybe it needs a redo. 
Usually the browser prompts me to choose the cert, but here it doesn't.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Etienne Jouvin 
Sent: Friday, December 11, 2020 12:40:28 PM
To: users@nifi.apache.org 
Subject: Re: Tls-toolkit.sh?

Hi.

You have to generate a certificate for the client, to import into the browser.
I never did it; I always set up accounts with LDAPs.

But there is a nice guide :
https://nifi.apache.org/docs/nifi-docs/html/walkthroughs.html#securing-nifi-with-provided-certificates

Etienne



Le ven. 11 déc. 2020 à 18:32, Darren Govoni 
mailto:dar...@ontrenet.com>> a écrit :
Hi

So I used tls-toolkit.sh standalone -n 'localhost' -C 'CN=localhost,OU=NIFI'

It generated the keystore/truststore and nifi.properties. When I use Chrome or 
Firefox, both reject the cert and won't load NiFi.

Am I missing something?

Darren

From: Andrew Grande mailto:apere...@gmail.com>>
Sent: Friday, December 11, 2020 12:02:46 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Tls-toolkit.sh?

NiFi toolkit link here https://nifi.apache.org/download.html

Enjoy :)

On Fri, Dec 11, 2020, 8:59 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi

I want to setup a secure local nifi and the online docs refer to this script 
but i cant find it anywhere.

Any clues?

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


Re: Tls-toolkit.sh?

2020-12-11 Thread Darren Govoni
Hi

So I used tls-toolkit.sh standalone -n 'localhost' -C 'CN=localhost,OU=NIFI'

It generated the keystore/truststore and nifi.properties. When I use Chrome or 
Firefox, both reject the cert and won't load NiFi.

Am I missing something?

Darren
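
A sketch of the step that is probably missing, assuming the toolkit's default 
output file names (they may differ by version): alongside the keystore/truststore, 
the standalone mode with -C also writes a client keystore and its password, e.g.

    CN=localhost_OU=NIFI.p12        # client certificate/key to import into the browser
    CN=localhost_OU=NIFI.password   # password the browser asks for during import
    nifi-cert.pem                   # the generated CA; trusting it stops the cert warnings

Importing the .p12 into the browser (and, optionally, nifi-cert.pem as a trusted 
authority) should make the browser present a client cert instead of rejecting 
the connection.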

From: Andrew Grande 
Sent: Friday, December 11, 2020 12:02:46 PM
To: users@nifi.apache.org 
Subject: Re: Tls-toolkit.sh?

NiFi toolkit link here https://nifi.apache.org/download.html

Enjoy :)

On Fri, Dec 11, 2020, 8:59 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi

I want to setup a secure local nifi and the online docs refer to this script 
but i cant find it anywhere.

Any clues?

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


Re: Tls-toolkit.sh?

2020-12-11 Thread Darren Govoni
Ah. Separate download. Got it. Thank you.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Andrew Grande 
Sent: Friday, December 11, 2020 12:02:46 PM
To: users@nifi.apache.org 
Subject: Re: Tls-toolkit.sh?

NiFi toolkit link here https://nifi.apache.org/download.html

Enjoy :)

On Fri, Dec 11, 2020, 8:59 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi

I want to setup a secure local nifi and the online docs refer to this script 
but i cant find it anywhere.

Any clues?

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


Tls-toolkit.sh?

2020-12-11 Thread Darren Govoni
Hi

I want to set up a secure local NiFi, and the online docs refer to this script, 
but I can't find it anywhere.

Any clues?

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android


Setting properties in kadmin

2020-12-10 Thread Darren Govoni
Hi
  I'm new to Directory. I am running kadmin.sh and need to set some of the config 
properties the command lists when you run it, but I don't see how.

For example, how do I set: useTicketCache true

Thank you!
Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android


Re: Authorization Framework

2020-11-04 Thread Darren Govoni
Sure thing Joe. Let me provide a more clear use case.

As I mentioned, our identities are established at the enterprise level. So 
while I mentioned the existing auth(entication, yes) I see in processors now 
(basic auth, Kerberos), where the general use case is probably to authenticate 
to a remote service and then you're good, our environment separates 
authentication and authorization, and those are managed by entirely different 
systems at different points in time.

Servers will have identifying certs that are issued and managed automatically, 
and those certs will be used to identify the requesting service's (e.g. NiFi's) 
access to another enterprise layer (e.g. HDFS). In our authorization system, an 
admin might grant specific privileges to a principal (e.g. NiFi), allowing it 
only to read, write, or access this data set or that data set in another 
service. This is above and beyond anything the service endpoint itself might 
expose.

In our environment there is no need to trigger an authentication of the 
principal (X.509 cert) for that particular server/service as the plumbing for 
that is "out of band" for nifi processors as best as I can tell. But I'm still 
learning these things. 

In the spirit of test-driven development, we (our team and company) will have 
to understand the exact needs before finalizing a viable framework or 
implementation of PKInit (as our company refers to it), so it's still early, 
but I want to implement these business needs in a way that makes sense for us, 
and for the direction NiFi is going, in the event others might find our 
solution useful. For now, it's mainly exploring/proving patterns and having 
open discussions. 

Once we get some more concrete workflow decided on our end, I can share it and 
we can talk about how it can best be accommodated.

Darren



From: Joe Witt 
Sent: Wednesday, November 4, 2020 3:48 PM
To: users@nifi.apache.org 
Subject: Re: Authorization Framework

Darren

It's difficult to get at what you have in mind, as you keep saying authorization 
but then giving examples of authentication protocols (Kerberos/keytabs, basic 
auth).

Let's focus, though, on your later comment about HDFS processors.  Take, for 
example, PutHDFS... it connects to an HDFS cluster to put data.  In terms of 
the actual dataflow, we get to authenticate/convey our identity to HDFS and 
where we want to write data.  HDFS then gets to accept or reject that.  *That* 
is authorization.  Now then, speaking in terms of flow administration in NiFi, 
we do have authorization scenarios, like who can view the processor, start it, 
stop it, and so on.  This kind of authorization in NiFi IS something that can 
be extended/altered to meet some awesome and complex needs.

Let's keep circling closer to your intent here.

thanks

On Wed, Nov 4, 2020 at 1:38 PM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi Bryan,
   Thanks for the input. Right now, I'm really exploring how better to 
accommodate migrating from the use of keytabs to our corporate mandate for 
pkinit support. Observing that the current authorizations in processors (basic 
auth, kerberos etc) are tightly wired, it suggested to me an opportunity to 
move security more into an "aspect" of processors rather than woven into the 
processor specific code. Of course, there are interactions that take place 
throughout the behavior of the processor.

This is a fairly common approach to security, noting that for any given 
behavior, it can be done all the same securely or without security. So I would 
think it should be possible to abstract the authorizations as needed, and there 
are a variety of patterns to pull those details out of the components needing 
them. 

I suppose the difference is more subtle these days, but in my mind 
Authentication does one thing. Decide if principal (e.g.username) is who they 
say they are, using the provided credentials (e.g. password). Once this is 
established the authentication service will return an identifying token, cert 
etc. That's it. Now, some services will inline this activity along with 
authorization.

For us, this is already handled and the identity is established as a digitally 
signed X.509 certificate. Thus, our primary need is to consult our security 
services which will decide if that principal is allowed to do something - such 
as use HDFSProcessor, Query Solr, etc.

In looking at the current code in the processors (and I haven't studied them 
all but will look more closely at HDFS), it didn't seem like a good approach to 
layer another authorization (PKInit) into that existing code and it will 
certainly get crowded in processors doing that, which should focus on 
processing. Just my opinions so far! Subject to change.

Darren


From: Bryan Bende mailto:bbe...@gmail.com>>
Sent: Wednesday, November 4, 2020 3:22 PM

To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mail

Re: Authorization Framework

2020-11-04 Thread Darren Govoni
Hi Bryan,
   Thanks for the input. Right now, I'm really exploring how better to 
accommodate migrating from the use of keytabs to our corporate mandate for 
pkinit support. Observing that the current authorizations in processors (basic 
auth, Kerberos, etc.) are tightly wired, it suggested to me an opportunity to 
move security more into an "aspect" of processors rather than having it woven 
into the processor-specific code. Of course, there are interactions that take 
place throughout the behavior of the processor.

This is a fairly common approach to security, noting that for any given 
behavior, it can be done all the same securely or without security. So I would 
think it should be possible to abstract the authorizations as needed, and there 
are a variety of patterns to pull those details out of the components needing 
them. 

I suppose the difference is more subtle these days, but in my mind 
authentication does one thing: decide whether a principal (e.g. a username) is 
who it says it is, using the provided credentials (e.g. a password). Once this 
is established, the authentication service will return an identifying token, 
cert, etc. That's it. Now, some services will inline this activity along with 
authorization.

For us, this is already handled and the identity is established as a digitally 
signed X.509 certificate. Thus, our primary need is to consult our security 
services which will decide if that principal is allowed to do something - such 
as use HDFSProcessor, Query Solr, etc.

In looking at the current code in the processors (and I haven't studied them 
all, but will look more closely at HDFS), it didn't seem like a good approach 
to layer another authorization (PKInit) into that existing code, and processors 
will certainly get crowded doing that when they should focus on processing. 
Just my opinions so far! Subject to change.

Darren


From: Bryan Bende 
Sent: Wednesday, November 4, 2020 3:22 PM
To: users@nifi.apache.org 
Subject: Re: Authorization Framework

Darren,

I also thought you were talking about authentication. Processors don't really 
perform authorization; they provide credentials to some system, which is 
authentication. The system then decides if they authenticated successfully, and 
then some systems may also perform authorization to determine whether the 
authenticated identity is allowed to perform the action. The examples you gave 
of basic auth and Kerberos are both authentication mechanisms.

I think it will be very hard not to have this logic embedded in processors, 
since many times it is specific to the client library being used. For example, 
HDFS processors use the UserGroupInformation class from hadoop-common for 
Kerberos authentication, whereas Kafka processors use the Kafka client, which 
takes a JAAS config string.

The parts that can be factored out are usually common things like credential 
holders, such as SSLContextService or KeytabCredentialService, both of which 
don’t really do anything besides hold values that are then used in different 
ways by various processors.

If we are missing what you are talking about, let us know.

Thanks,

Bryan

On Nov 4, 2020, at 2:45 PM, Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:

Thanks Joe.

Just looking to see where community might be going down the road with respect 
to processor security so we can keep our efforts aligned.

In regards to your question I primarily mean authorization. Our company already 
has a SSO that establishes identity credentials so these are then used to 
authorize specific functions and access to certain infrastructure systems when 
constructing flows.

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Joe Witt mailto:joe.w...@gmail.com>>
Sent: Wednesday, November 4, 2020 12:29:35 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Authorization Framework

Darren

You will want this thread on dev list to get traction.

Also please clarify if you mean authorization or whether you  mean 
authentication.   I read all usages as meaning to discuss authentication.

thanks

On Wed, Nov 4, 2020 at 9:53 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Greetings!

We have an internal need to move to a specific PK based authorization for all 
our nifi processors. Currently, authorizations such as basic auth and kerberos 
seem to be wired directly inside the processors. My design approach to 
addressing our need also seeks to factor authorization out of processors where 
specific authorization handlers can be composed and config/run time and lighten 
the responsibilities inside processor classes.

Towards this end, my initial design goals for this framework are thus:

1) Allow various kinds of authorization handlers to be written and added to 
processors without necessarily recoding the proce

Re: Authorization Framework

2020-11-04 Thread Darren Govoni
Thanks Joe.

Just looking to see where the community might be going down the road with 
respect to processor security so we can keep our efforts aligned.

In regard to your question, I primarily mean authorization. Our company already 
has an SSO that establishes identity credentials, so these are then used to 
authorize specific functions and access to certain infrastructure systems when 
constructing flows.

Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Joe Witt 
Sent: Wednesday, November 4, 2020 12:29:35 PM
To: users@nifi.apache.org 
Subject: Re: Authorization Framework

Darren

You will want this thread on dev list to get traction.

Also, please clarify whether you mean authorization or authentication. I read 
all usages as meaning to discuss authentication.

thanks

On Wed, Nov 4, 2020 at 9:53 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Greetings!

We have an internal need to move to a specific PK based authorization for all 
our nifi processors. Currently, authorizations such as basic auth and kerberos 
seem to be wired directly inside the processors. My design approach to 
addressing our need also seeks to factor authorization out of processors where 
specific authorization handlers can be composed and config/run time and lighten 
the responsibilities inside processor classes.

Towards this end, my initial design goals for this framework are thus:

1) Allow various kinds of authorization handlers to be written and added to 
processors without necessarily recoding the processor.
2) Allow for a pipeline effect where one or more authorizers might need to 
operate at the same time.
3) Do not disrupt existing processors that rely on their internal coding for 
authorization
4) Use appropriate design patterns to allow for flexible implementations of 
principals, credentials and other authorization assets.
5) Secure any clear text assets (usernames and passwords) in existing 
authorizations when moving them inside the framework.

How does the community conduct initial design reviews of such changes? We would 
be quite a ways from contributing anything back but want to keep in sync with 
community practices and expectations to make such an offering immediately 
useful.

Regards,
Darren



Authorization Framework

2020-11-04 Thread Darren Govoni
Greetings!

We have an internal need to move to a specific PK-based authorization for all 
our NiFi processors. Currently, authorizations such as basic auth and Kerberos 
seem to be wired directly inside the processors. My design approach to 
addressing our need also seeks to factor authorization out of processors, so 
that specific authorization handlers can be composed at config/run time and the 
responsibilities inside processor classes are lightened.
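
Purely as an illustrative sketch of that "composed handlers" idea (this is not 
NiFi's API and not working code for any existing framework; every name below is 
made up):

    # Hypothetical sketch: a chain of authorization handlers consulted before a
    # processor action runs, composed at config/run time instead of hard-coded.
    class AuthorizationHandler:
        def authorize(self, principal, action, resource):
            raise NotImplementedError

    class CertDnAllowList(AuthorizationHandler):
        def __init__(self, allowed_dns):
            self.allowed_dns = set(allowed_dns)

        def authorize(self, principal, action, resource):
            return principal in self.allowed_dns

    class AuthorizationPipeline(AuthorizationHandler):
        # One or more authorizers operate together; all must approve.
        def __init__(self, handlers):
            self.handlers = handlers

        def authorize(self, principal, action, resource):
            return all(h.authorize(principal, action, resource) for h in self.handlers)

    pipeline = AuthorizationPipeline([CertDnAllowList(["CN=nifi-node1, OU=NIFI"])])
    assert pipeline.authorize("CN=nifi-node1, OU=NIFI", "write", "hdfs:///data/set1")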

Towards this end, my initial design goals for this framework are thus:

1) Allow various kinds of authorization handlers to be written and added to 
processors without necessarily recoding the processor.
2) Allow for a pipeline effect where one or more authorizers might need to 
operate at the same time.
3) Do not disrupt existing processors that rely on their internal coding for 
authorization
4) Use appropriate design patterns to allow for flexible implementations of 
principals, credentials and other authorization assets.
5) Secure any clear text assets (usernames and passwords) in existing 
authorizations when moving them inside the framework.

How does the community conduct initial design reviews of such changes? We would 
be quite a ways from contributing anything back but want to keep in sync with 
community practices and expectations to make such an offering immediately 
useful.

Regards,
Darren



Re: Run Nifi in IntelliJ to debug?

2020-10-27 Thread Darren Govoni
That did the trick! Thank you for everyone's input, and if I missed an obvious 
suggestion along the way, my regrets.

Should be good now!

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Bryan Bende 
Sent: Tuesday, October 27, 2020 12:30:31 PM
To: users@nifi.apache.org 
Subject: Re: Run Nifi in IntelliJ to debug?

I haven't fully read this thread, but there is already a line in
nifi's bootstrap.conf that you can uncomment:

# Enable Remote Debugging
#java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000

Change the port if desired, and then create a Remote Debug
configuration in IntelliJ for that port.

This will debug the main application, not bootstrap.
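
Uncommented, and with the port changed as an example, that bootstrap.conf line 
would read:

    java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

and the Remote Debug configuration in IntelliJ then points at localhost:5005.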

On Tue, Oct 27, 2020 at 12:08 PM Darren Govoni  wrote:
>
> Hello,
>So i was able to get intelliJ to debug nifi but only inside the bootstrap 
> process. It looks like nifi spawns a new process and that process does not 
> run the debug options.
>
> Is there a way to instruct nifi to enable debug port on its main process? 
> That will have the actual app code i need to trace.
>
> Thanks for any tips. Much appreciated!
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
> 
> From: Mike Thomsen 
> Sent: Monday, October 26, 2020 10:15:33 PM
> To: users@nifi.apache.org 
> Subject: Re: Run Nifi in IntelliJ to debug?
>
> Are you using a binary derived from the source code in your IDE? Like
> a 1.12.1 binary and the source code from the release?
>
> On Mon, Oct 26, 2020 at 7:47 PM Russell Bateman  wrote:
> >
> > Hmmm... It's rare that I debug NiFi code. And it's also rare that I debug 
> > my own in that context since the NiFi test runner allows me to fend off 
> > most surprises via my JUnit tests.
> >
> > I think back in 2016, I was debugging a start-up problem involving NiFi 
> > start-up and incompatibility with the Java Flight Recorder. As I recall, I 
> > downloaded the relevant NiFi code sources matching the version of NiFi I 
> > was debugging remotely. I remember ultimately making a slight (and only 
> > temporary) change to NiFi start-up that fixed the problem. At that point I 
> > must have been building my own copy to have seen it fixed.. It had to do 
> > with the order in which NiFi was getting command-line arguments making it 
> > so the JFR wasn't running. I'd have to dig back to figure out what I was 
> > doing, but it's probably not too relevant to what you need to do.
> >
> > What do you need to see in this?
> >
> > Russ
> >
> > On 10/26/20 5:38 PM, Darren Govoni wrote:
> >
> > Correct. Primarily the nifi-web-api module and AccessResource class. For 
> > starters.
> >
> > Sent from my Verizon, Samsung Galaxy smartphone
> > Get Outlook for Android
> >
> > 
> > From: Russell Bateman 
> > Sent: Monday, October 26, 2020 7:37:13 PM
> > To: Darren Govoni ; users@nifi.apache.org 
> > 
> > Subject: Re: Run Nifi in IntelliJ to debug?
> >
> > Darren,
> >
> > This is just Apache NiFi code out of NARs you want to step through or is it 
> > yours? You haven't stripped debug information or anything, right?
> >
> > Russ
> >
> > On 10/26/20 5:30 PM, Darren Govoni wrote:
> >
> > Kevin/Russel
> >
> > Thanks for the info. I did set things up this way.
> >
> > IntelliJ does connect to the nifi jvm and nifi runs and works but intellij 
> > isnt breaking on code it should.
> >
> > I did set the module where the code/classes are located (in the remote 
> > connection dialog) and i see the exception im tracking print on the console 
> > output but intellij never breaks.
> >
> > Is there an extra step needed? Generate sources?
> >
> > For future it would be nice if there was a maven goal for debug.
> >
> > Much appreciated!
> > Darren
> >
> > Sent from my Verizon, Samsung Galaxy smartphone
> > Get Outlook for Android
> > 
> > From: Russell Bateman 
> > Sent: Monday, October 26, 2020 4:09:50 PM
> > To: users@nifi.apache.org ; Darren Govoni 
> > 
> > Subject: Re: Run Nifi in IntelliJ to debug?
> >
> > Darren,
> >
> > I was out this morning and didn't see your plea until I got in just now. 
> > Here's a step by step I wrote up for both IntelliJ IDEA and Eclipse (I'm 
> > more an IntelliJ guy). It also covers using an IP tunnel.
> >
> > https://www.javahotchocolate.com/notes/nifi.html#20160323
> >
> > On 10/26/20 9:52 AM, Darren Govoni wrote:
> >
> > Hi
> >Is it possible to run Nifi from inside IntelliJ with debugging such that 
> > I can hit the app from my browser and trigger breakpoints?
> >
> > If anyone has done this can you please share any info?
> >
> > Thanks in advance!
> > Darren
> >
> > Sent from my Verizon, Samsung Galaxy smartphone
> > Get Outlook for Android
> >
> >
> >
> >


Re: Run Nifi in IntelliJ to debug?

2020-10-27 Thread Darren Govoni
Hello,
   So I was able to get IntelliJ to debug NiFi, but only inside the bootstrap 
process. It looks like NiFi spawns a new process, and that process does not run 
with the debug options.

Is there a way to instruct NiFi to enable a debug port on its main process? 
That will have the actual app code I need to trace.

Thanks for any tips. Much appreciated!
Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Mike Thomsen 
Sent: Monday, October 26, 2020 10:15:33 PM
To: users@nifi.apache.org 
Subject: Re: Run Nifi in IntelliJ to debug?

Are you using a binary derived from the source code in your IDE? Like
a 1.12.1 binary and the source code from the release?

On Mon, Oct 26, 2020 at 7:47 PM Russell Bateman  wrote:
>
> Hmmm... It's rare that I debug NiFi code. And it's also rare that I debug my 
> own in that context since the NiFi test runner allows me to fend off most 
> surprises via my JUnit tests.
>
> I think back in 2016, I was debugging a start-up problem involving NiFi 
> start-up and incompatibility with the Java Flight Recorder. As I recall, I 
> downloaded the relevant NiFi code sources matching the version of NiFi I was 
> debugging remotely. I remember ultimately making a slight (and only 
> temporary) change to NiFi start-up that fixed the problem. At that point I 
> must have been building my own copy to have seen it fixed.. It had to do with 
> the order in which NiFi was getting command-line arguments making it so the 
> JFR wasn't running. I'd have to dig back to figure out what I was doing, but 
> it's probably not too relevant to what you need to do.
>
> What do you need to see in this?
>
> Russ
>
> On 10/26/20 5:38 PM, Darren Govoni wrote:
>
> Correct. Primarily the nifi-web-api module and AccessResource class. For 
> starters.
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
> ________
> From: Russell Bateman 
> Sent: Monday, October 26, 2020 7:37:13 PM
> To: Darren Govoni ; users@nifi.apache.org 
> 
> Subject: Re: Run Nifi in IntelliJ to debug?
>
> Darren,
>
> This is just Apache NiFi code out of NARs you want to step through or is it 
> yours? You haven't stripped debug information or anything, right?
>
> Russ
>
> On 10/26/20 5:30 PM, Darren Govoni wrote:
>
> Kevin/Russel
>
> Thanks for the info. I did set things up this way.
>
> IntelliJ does connect to the nifi jvm and nifi runs and works but intellij 
> isn't breaking on code where it should.
>
> I did set the module where the code/classes are located (in the remote 
> connection dialog) and I see the exception I'm tracking printed on the console 
> output but intellij never breaks.
>
> Is there an extra step needed? Generate sources?
>
> For the future it would be nice if there was a maven goal for debug.
>
> Much appreciated!
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
> 
> From: Russell Bateman 
> Sent: Monday, October 26, 2020 4:09:50 PM
> To: users@nifi.apache.org ; Darren Govoni 
> 
> Subject: Re: Run Nifi in IntelliJ to debug?
>
> Darren,
>
> I was out this morning and didn't see your plea until I got in just now. 
> Here's a step by step I wrote up for both IntelliJ IDEA and Eclipse (I'm more 
> an IntelliJ guy). It also covers using an IP tunnel.
>
> https://www.javahotchocolate.com/notes/nifi.html#20160323
>
> On 10/26/20 9:52 AM, Darren Govoni wrote:
>
> Hi
>Is it possible to run Nifi from inside IntelliJ with debugging such that I 
> can hit the app from my browser and trigger breakpoints?
>
> If anyone has done this can you please share any info?
>
> Thanks in advance!
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
>
>
>


Re: Run Nifi in IntelliJ to debug?

2020-10-26 Thread Darren Govoni
Correct. Primarily the nifi-web-api module and AccessResource class. For 
starters.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Russell Bateman 
Sent: Monday, October 26, 2020 7:37:13 PM
To: Darren Govoni ; users@nifi.apache.org 

Subject: Re: Run Nifi in IntelliJ to debug?

Darren,

This is just Apache NiFi code out of NARs you want to step through or is it 
yours? You haven't stripped debug information or anything, right?

Russ

On 10/26/20 5:30 PM, Darren Govoni wrote:
Kevin/Russel

Thanks for the info. I did set things up this way.

IntelliJ does connect to the nifi jvm and nifi runs and works but intellij isn't 
breaking on code where it should.

I did set the module where the code/classes are located (in the remote 
connection dialog) and I see the exception I'm tracking printed on the console 
output but intellij never breaks.

Is there an extra step needed? Generate sources?

For the future it would be nice if there was a maven goal for debug.

Much appreciated!
Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>

From: Russell Bateman <mailto:r...@windofkeltia.com>
Sent: Monday, October 26, 2020 4:09:50 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
<mailto:users@nifi.apache.org>; Darren Govoni 
<mailto:dar...@ontrenet.com>
Subject: Re: Run Nifi in IntelliJ to debug?

Darren,

I was out this morning and didn't see your plea until I got in just now. Here's 
a step by step I wrote up for both IntelliJ IDEA and Eclipse (I'm more an 
IntelliJ guy). It also covers using an IP tunnel.

https://www.javahotchocolate.com/notes/nifi.html#20160323

On 10/26/20 9:52 AM, Darren Govoni wrote:
Hi
   Is it possible to run Nifi from inside IntelliJ with debugging such that I 
can hit the app from my browser and trigger breakpoints?

If anyone has done this can you please share any info?

Thanks in advance!
Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>




Re: Run Nifi in IntelliJ to debug?

2020-10-26 Thread Darren Govoni
Kevin/Russel

Thanks for the info. I did set things up this way.

IntelliJ does connect to the nifi jvm and nifi runs and works but intellij isn't 
breaking on code where it should.

I did set the module where the code/classes are located (in the remote 
connection dialog) and I see the exception I'm tracking printed on the console 
output but intellij never breaks.

Is there an extra step needed? Generate sources?

For the future it would be nice if there was a maven goal for debug.

Much appreciated!
Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>

From: Russell Bateman 
Sent: Monday, October 26, 2020 4:09:50 PM
To: users@nifi.apache.org ; Darren Govoni 

Subject: Re: Run Nifi in IntelliJ to debug?

Darren,

I was out this morning and didn't see your plea until I got in just now. Here's 
a step by step I wrote up for both IntelliJ IDEA and Eclipse (I'm more an 
IntelliJ guy). It also covers using an IP tunnel.

https://www.javahotchocolate.com/notes/nifi.html#20160323

On 10/26/20 9:52 AM, Darren Govoni wrote:
Hi
   Is it possible to run Nifi from inside IntelliJ with debugging such that I 
can hit the app from my browser and trigger breakpoints?

If anyone has done this can you please share any info?

Thanks in advance!
Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>



Re: Run Nifi in IntelliJ to debug?

2020-10-26 Thread Darren Govoni
Thanks Matt. I think if I can attach remotely and step through the code that 
will satisfy my needs. Let me give it a try.

I also found how to run mvnDebug and attach to that from intellij. Just need to 
find a maven goal that runs nifi, but I haven't seen one yet.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android<https://aka.ms/ghei36>


From: Matt Burgess 
Sent: Monday, October 26, 2020 12:05:03 PM
To: users@nifi.apache.org 
Subject: Re: Run Nifi in IntelliJ to debug?

Sorry I misread the part where you wanted to run NiFi inside IntelliJ,
I was talking about running it externally (from the command-line,
e.g.) and connecting the IntelliJ debugger. I haven't run NiFi itself
using IntelliJ, maybe someone else can chime in for that.

On Mon, Oct 26, 2020 at 12:03 PM Matt Burgess  wrote:
>
> Yes, that's a pretty common operation amongst NiFi developers. In
> conf/bootstrap.conf there's a section called Enable Remote Debugging
> and a commented-out line something like:
>
> java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
>
> You can remove the comment from that line and set things like the
> address to the desired port, whether to suspend the JVM until a
> debugger connects, etc. Then in IntelliJ you can create a new
> configuration of type Remote, point it at the port you set in the
> above line, and connect the debugger. It will then stop at breakpoints
> and you can do all the debugging stuff like add Watches, execute
> expressions (to change values at runtime), etc.
>
> Regards,
> Matt
>
> On Mon, Oct 26, 2020 at 11:52 AM Darren Govoni  wrote:
> >
> > Hi
> >Is it possible to run Nifi from inside IntelliJ with debugging such that 
> > I can hit the app from my browser and trigger breakpoints?
> >
> > If anyone has done this can you please share any info?
> >
> > Thanks in advance!
> > Darren
> >
> > Sent from my Verizon, Samsung Galaxy smartphone
> > Get Outlook for Android


Run Nifi in IntelliJ to debug?

2020-10-26 Thread Darren Govoni
Hi
   Is it possible to run Nifi from inside IntelliJ with debugging such that I 
can hit the app from my browser and trigger breakpoints?

If anyone has done this can you please share any info?

Thanks in advance!
Darren

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android


Re: Build Problem 1.11.4 on MacOS

2020-10-20 Thread Darren Govoni
Ok, thanks for the input! Will try that.
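
One way to try it on macOS, as a sketch (assumes AdoptOpenJDK 8u265 is already 
installed; the exact version string accepted by java_home may differ on your machine):

    export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_265)
    java -version                    # confirm 1.8.0_265 is now active
    mvn clean install -DskipTests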

From: Mike Thomsen 
Sent: Tuesday, October 20, 2020 7:39 PM
To: users@nifi.apache.org 
Subject: Re: Build Problem 1.11.4 on MacOS

I had build problems on macOS for a long time, and when I switched to
u265 everything seemed to build again.

On Tue, Oct 20, 2020 at 12:47 PM Joe Witt  wrote:
>
> Darren,
>
> I believe there were gremlins in that JDK release.. Can you please try 
> something like 265?
>
> On Tue, Oct 20, 2020 at 8:52 AM Darren Govoni  wrote:
>>
>> Hi,
>>   Seem to have this recurring problem trying to build on MacOS with 
>> nifi-utils. Anyone have a workaround or fix for this?
>>
>> Thanks in advance!
>>
>> [ERROR] Failed to execute goal 
>> org.apache.maven.plugins:maven-compiler-plugin:3.8.1:testCompile 
>> (groovy-tests) on project nifi-utils: Compilation failure
>> [ERROR] Failure executing groovy-eclipse compiler:
>> [ERROR] Annotation processing got disabled, since it requires a 1.6 
>> compliant JVM
>> [ERROR] Exception in thread "main" java.lang.NoClassDefFoundError: Could not 
>> initialize class org.codehaus.groovy.vmplugin.v7.Java7
>> [ERROR] at 
>> org.codehaus.groovy.vmplugin.VMPluginFactory.(VMPluginFactory.java:43)
>>
>> AFAIK my jvm is compliant
>>
>> dgovoni@C02RN8AHG8WP nifi % java -version
>> openjdk version "1.8.0_262"
>> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_262-b10)
>> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.262-b10, mixed mode)
>> dgovoni@C02RN8AHG8WP nifi %
>>
>>


Build Problem 1.11.4 on MacOS

2020-10-20 Thread Darren Govoni
Hi,
  Seem to have this recurring problem trying to build on MacOS with nifi-utils. 
Anyone have a workaround or fix for this?

Thanks in advance!


[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.8.1:testCompile (groovy-tests) 
on project nifi-utils: Compilation failure
[ERROR] Failure executing groovy-eclipse compiler:
[ERROR] Annotation processing got disabled, since it requires a 1.6 compliant 
JVM
[ERROR] Exception in thread "main" java.lang.NoClassDefFoundError: Could not 
initialize class org.codehaus.groovy.vmplugin.v7.Java7
[ERROR] at 
org.codehaus.groovy.vmplugin.VMPluginFactory.(VMPluginFactory.java:43)

AFAIK my jvm is compliant

dgovoni@C02RN8AHG8WP nifi % java -version
openjdk version "1.8.0_262"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_262-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.262-b10, mixed mode)
dgovoni@C02RN8AHG8WP nifi %



Re: Data Provenance Stops Working

2020-08-10 Thread Darren Govoni
I also use 1.11.4 and out of the box there IS NO provenance data whatsoever. It 
just doesn't work if you install and run nifi as is.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android


From: Shawn Weeks 
Sent: Monday, August 10, 2020 2:23:19 PM
To: users@nifi.apache.org 
Subject: Re: Data Provenance Stops Working


It sounds like if I expand the retention time a lot, say 30 days the issue 
should be less bad?
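
For reference, the retention knobs live in conf/nifi.properties; a sketch with 
illustrative values (the size cap applies alongside the time cap):

    nifi.provenance.repository.max.storage.time=30 days
    nifi.provenance.repository.max.storage.size=10 GB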



Thanks

Shawn



From: Mark Payne 
Reply-To: "users@nifi.apache.org" 
Date: Monday, August 10, 2020 at 12:37 PM
To: "users@nifi.apache.org" 
Subject: Re: Data Provenance Stops Working



Shawn / Wyll,



I think you’re probably running into NIFI-7346 [1], which basically says 
there’s a case in which NiFi may “age off” old data even when it’s still the 
file that’s being actively written to. In Linux/OSX this results in simply 
deleting the file, and then anything else written to it disappears into the 
ether. Of course, now the file never exceeds the max size, since it’ll be 0 
bytes forever, so it never rolls over again. So when this happens, no more 
provenance data gets created until NiFi is restarted.



It’s also possible that you’re hitting NIFI-7375 [2]. This Jira only affects 
you if you get to provenance by right-clicking on a Processor and clicking View 
Provenance (i.e., not if you go to the Hamburger Menu in the top-right corner 
and go to Provenance from there and search that way). If this is the problem, 
once you right-click and go to View Provenance, you can actually click the 
Search icon (magnifying glass) in that empty Provenance Results panel and click 
Search and then it will actually bring back the results. So that’s obnoxious 
but it’s a workaround that may help.



The good news is that both of these have been addressed for 1.12.0, which 
sounds like it should be coming out very soon!



Thanks

-Mark



[1] https://issues.apache.org/jira/browse/NIFI-7346

[2] https://issues.apache.org/jira/browse/NIFI-7375





On Aug 10, 2020, at 1:26 PM, Joe Witt 
mailto:joe.w...@gmail.com>> wrote:



shawn - i believe it is related to our default settings and have phoned a 
friend to jump in here when able. but default retention and default sharding i 
*think* can result in this.  You can generate a thread dump before and after 
the locked state to see what it is stuck/sitting on.  That will help here



Thanks



On Mon, Aug 10, 2020 at 10:24 AM Shawn Weeks 
mailto:swe...@weeksconsulting.us>> wrote:

Out of the box even the initial admin user has to be granted permission, I 
think; mine worked fine for several months since 1.11.1 was released and just 
started having issues a couple of weeks ago. I'm increasing the retention 
time a bit to see if that improves the situation.



Thanks

Shawn Weeks



From: Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>>
Reply-To: "users@nifi.apache.org" 
mailto:users@nifi.apache.org>>
Date: Monday, August 10, 2020 at 12:22 PM
To: "users@nifi.apache.org" 
mailto:users@nifi.apache.org>>
Subject: Re: Data Provenance Stops Working



I run 1.11.4 in a cluster on AWS also and have a similar issue with the 
provenance data, I can't ever view it.  It's probably somehow misconfigured but 
I haven't been able to figure it out.



From: Andy LoPresto mailto:alopre...@apache.org>>
Sent: Monday, August 10, 2020 1:11 PM
To: users@nifi.apache.org 
mailto:users@nifi.apache.org>>
Subject: Re: Data Provenance Stops Working



Shawn,



I don’t know if this is specifically related, but there were a number of 
critical issues discovered in the 1.11.x release line that have been fixed in 
1.11.4. I would not recommend running any prior version on that release line.



1.12.0 should be coming imminently, so if you are going to upgrade anyway, you 
may want to wait a week or so and get the newest bits with hundreds of new 
features, but for stability alone, I would strongly recommend 1.11.4.



https://cwiki.apache.org/confluence/display/NIFI/Release+Notes





Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
He/Him

PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69



On Aug 10, 2020, at 10:00 AM, Shawn Weeks 
mailto:swe...@weeksconsulting.us>> wrote:



I’m running a three node NiFi Cluster on AWS EC2s using integrated Zookeeper 
and SSL Enabled. Version is 1.11.1, OS is RedHat 7.7, Java is 1.8.0_242. For 
some reason after a period of time Data Provenance goes blank, old records are 
no longer queryable and new data provenance doesn’t appear to get written. No 
Lucene or other exceptions are logged and restarting the node causes data 
provenance to go back to being written however old data provenance does not 
re-appear. No exceptions appear when querying data 

Re: MergeContent resulting in corrupted JSON

2020-06-30 Thread Darren Govoni
Run the nifi jvm in a runtime profiler/analyzer like appdynamics and see if it 
detects any memory leaks or dangling unclosed file buffers/io. Throwing darts 
but the problem could be as deep as the Linux kernel or confined inside the jvm 
for your specific scenario.

Sent from my Verizon, Samsung Galaxy smartphone
Get Outlook for Android


From: Jason Iannone 
Sent: Tuesday, June 30, 2020 10:36:02 PM
To: users@nifi.apache.org 
Subject: Re: MergeContent resulting in corrupted JSON

Previous spotting of the issue was a red herring. We removed our custom code 
and are still facing random "org.codehaus.jackson.JsonParseException: Illegal 
Character" during PutDatabaseRecord due to a flowfile containing malformed JSON 
post MergeContent. The error never occurs immediately; it usually appears only once we've 
processed several million records. We did a NOOP run, which was ConsumeKafka -> 
UpdateCounter and everything seemed ok.

Here's the current form of the flow:

  1.  ConsumeKafka_2_0 - Encoding headers as ISO-8859-1 due to some containing 
binary data
 *   I have a fork of nifi with changes to allow base64 and hex encoding of 
select nifi headers.
 *   Next test will be without pulling any headers
  2.  RouteOnAttribute - Validate attributes
  3.  Base64EncodeContent - Content is binary, converting to a format we can 
store to later process
  4.  ExtractText - Copy Base64 encoded content to attribute
  5.  AttributesToJson - Provenance shows output as being fine
  6.  MergeContent - Provenance shows output of malformed JSON being written in 
the combined flowflle.
  7.  PutDatabaseRecord - Schema specified as Schema Text

Since we've removed all traces of custom code, what are people's thoughts on 
possible causes? Could this be an OS issue, or are there any known issues with 
specific versions of RHEL?

Logically I think it makes sense to remove JSON from the equation as a whole.

Thanks,
Jason

On Wed, Jun 24, 2020 at 2:54 PM Jason Iannone 
mailto:bread...@gmail.com>> wrote:
Exactly my thought, and we've been combing through the code but nothing 
significant has jumped out. Something that does stand out is a set of NiFi JIRAs: NIFI-6923, 
NIFI-6924, and NIFI-6846. Considering we're on 1.10.0 I've requested upgrading 
to 1.11.4.

Thanks,
Jason

On Tue, Jun 23, 2020 at 9:05 AM Mark Payne 
mailto:marka...@hotmail.com>> wrote:
It should be okay to create a buffer like that. Assuming the FlowFile is small. 
Typically we try to avoid buffering the content of a FlowFile into memory. But 
if it’s a reasonably small FlowFile, that’s probably fine.

To be honest, if the issue is intermittent and doesn’t always happen on the 
same input, it sounds like a threading/concurrency bug. Do you have a buffer or 
anything like that as a member variable?

On Jun 22, 2020, at 10:02 PM, Jason Iannone 
mailto:bread...@gmail.com>> wrote:

I'm now thinking it's due to how we handled reading the flowfile content into a 
buffer.

Previous:
session.read(flowFile, in -> {
  atomicVessel.set(ByteStreams.toByteArray(in));
});

Current:
final byte[] buffer = new byte[(int) flowFile.getSize()];
session.read(flowFile, in -> StreamUtils.fillBuffer(in, buffer, true));

Making this change reduced the occurrences of the data corruption, but we still 
saw it occur. What I'm now wondering is if sizing the byte array based on 
flowFile.getSize() is ideal? The contents of the file are raw bytes coming from 
ConsumeKafka_2_0.

Thanks,
Jason

On Mon, Jun 22, 2020 at 4:51 PM Mark Payne 
mailto:marka...@hotmail.com>> wrote:
Jason,

Glad to hear it. This is where the data provenance becomes absolutely 
invaluable. So now you should be able to trace the lineage of that FlowFile 
back to the start using data provenance. You can see exactly what it looked 
like when it was received. If it looks wrong there, the provenance shows 
exactly where it was received from so you know where to look next. If it looks 
good on receipt, you can trace the data through the flow and see exactly what 
the data looked like before and after each processor. And when you see which 
processor resulted in corruption, you can easily download the data as it looks 
when it went into the processor to make it easy to re-ingest and test.

Thanks
-Mark


On Jun 22, 2020, at 4:46 PM, Jason Iannone 
mailto:bread...@gmail.com>> wrote:

I spoke too soon, and must be the magic of sending an email! We found what 
appears to be corrupted content and captured the binary, hoping to play it 
through the code and see what's going on.

Thanks,
Jason

On Mon, Jun 22, 2020 at 4:35 PM Jason Iannone 
mailto:bread...@gmail.com>> wrote:
Hey Mark,

We hit the issue again, and when digging into the lineage we can see the 
content is fine coming into MergeContent but is corrupt on output of Join. Any 
other suggestions?

Thanks,
Jason

On Wed, Jun 10, 2020 at 2:26 PM Mark Payne 
mailto:marka...@hotmail.com>> wrote:
Jason,

Control characters should not cause any problem 

Re: initiating a machine learning script on a remote server

2020-06-25 Thread Darren Govoni
Quick answer: you could just execute an ssh command to run the script on the remote 
machine.

If you need flowfiles to go remote, nifi supports remote processor groups.
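
A rough sketch of the ssh idea using ExecuteStreamCommand (the user, host, and 
script path are placeholders, and key-based auth is assumed so no password prompt 
blocks the processor):

    ExecuteStreamCommand (illustrative settings)
      Command Path        /usr/bin/ssh
      Command Arguments   mluser@ml-host;/opt/ml/run_training.sh
      Argument Delimiter  ;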

Sent from my Verizon, Samsung Galaxy smartphone



Re: MergeContent resulting in corrupted JSON

2020-06-24 Thread Darren Govoni
Just lurking this thread. You wrote earlier.

The next time that a processor needs to write to the content of a FlowFile, it 
may end up appending to that same file on disk, but the FlowFile that the 
content corresponds to will keep track of the byte offset into the file where 
its content begins and how many bytes in that file belong to that FlowFile.

All of these activities, including when the append has completed, should be 
fully synchronized with one another.

Tracking the current offset and appending to it will not be thread safe unless 
synchronized, so that all threads have a fully consistent view of a given file's 
current offset after all appends to that file have completed.

Just something to double check.

Sent from my Verizon, Samsung Galaxy smartphone

From: Jason Iannone 
Sent: Wednesday, June 24, 2020 2:54:26 PM
To: users@nifi.apache.org 
Subject: Re: MergeContent resulting in corrupted JSON

Exactly my thought, and we've been combing through the code but nothing 
significant has jumped out. Something that does stand out is a set of NiFi JIRAs: NIFI-6923, 
NIFI-6924, and NIFI-6846. Considering we're on 1.10.0 I've requested upgrading 
to 1.11.4.

Thanks,
Jason

On Tue, Jun 23, 2020 at 9:05 AM Mark Payne 
mailto:marka...@hotmail.com>> wrote:
It should be okay to create a buffer like that. Assuming the FlowFile is small. 
Typically we try to avoid buffering the content of a FlowFile into memory. But 
if it’s a reasonably small FlowFile, that’s probably fine.

To be honest, if the issue is intermittent and doesn’t always happen on the 
same input, it sounds like a threading/concurrency bug. Do you have a buffer or 
anything like that as a member variable?

On Jun 22, 2020, at 10:02 PM, Jason Iannone 
mailto:bread...@gmail.com>> wrote:

I'm now thinking it's due to how we handled reading the flowfile content into a 
buffer.

Previous:
session.read(flowFile, in -> {
  atomicVessel.set(ByteStreams.toByteArray(in));
});

Current:
final byte[] buffer = new byte[(int) flowFile.getSize()];
session.read(flowFile, in -> StreamUtils.fillBuffer(in, buffer, true));

Making this change reduced the occurrences of the data corruption, but we still 
saw it occur. What I'm now wondering is if sizing the byte array based on 
flowFile.getSize() is ideal? The contents of the file are raw bytes coming from 
ConsumeKafka_2_0.

Thanks,
Jason

On Mon, Jun 22, 2020 at 4:51 PM Mark Payne 
mailto:marka...@hotmail.com>> wrote:
Jason,

Glad to hear it. This is where the data provenance becomes absolutely 
invaluable. So now you should be able to trace the lineage of that FlowFile 
back to the start using data provenance. You can see exactly what it looked 
like when it was received. If it looks wrong there, the provenance shows 
exactly where it was received from so you know where to look next. If it looks 
good on receipt, you can trace the data through the flow and see exactly what 
the data looked like before and after each processor. And when you see which 
processor resulted in corruption, you can easily download the data as it looks 
when it went into the processor to make it easy to re-ingest and test.

Thanks
-Mark


On Jun 22, 2020, at 4:46 PM, Jason Iannone 
mailto:bread...@gmail.com>> wrote:

I spoke too soon, and must be the magic of sending an email! We found what 
appears to be corrupted content and captured the binary, hoping to play it 
through the code and see what's going on.

Thanks,
Jason

On Mon, Jun 22, 2020 at 4:35 PM Jason Iannone 
mailto:bread...@gmail.com>> wrote:
Hey Mark,

We hit the issue again, and when digging into the lineage we can see the 
content is fine coming into MergeContent but is corrupt on output of Join. Any 
other suggestions?

Thanks,
Jason

On Wed, Jun 10, 2020 at 2:26 PM Mark Payne 
mailto:marka...@hotmail.com>> wrote:
Jason,

Control characters should not cause any problem with MergeContent. MergeContent 
just copies bytes from one stream to another. It’s also worth noting that 
attributes don’t really come into play here. MergeContent is combining the 
FlowFile content, so even if it has some weird attributes, those won’t cause a 
problem in the output content. NiFi stores attributes as a mapping of String to 
String key/value pairs (i.e., Map<String, String>). So the processor is 
assuming that if you want to convert a message header to an attribute, that 
header must be a string.

Content in the repository is stored using “slabs” or “blocks.” One processor at 
a time has the opportunity to write to a file in the content repository. When 
the processor finishes writing and transfers the FlowFile to the next 
processor, NiFi keeps track of which file its content was written to, the byte 
offset where its content starts, and the length of the content. The next time 
that a processor needs to write to the content of a FlowFile, it may end up 
appending to that same file on disk, but the FlowFile that the content 
corresponds to will keep track of the byte offset into the file where 

Re: Kerberos - Ticket Cache and JAAS config

2020-06-16 Thread Darren Govoni
I would use some kind of SSO-style proxy service and have your Nifi processors 
request authorization from it; the proxy service performs the authentication 
against the backend service you are protecting and returns to Nifi only the 
token needed to interact with it.

For this approach you'll probably need a single JAAS implementation pointed at 
the proxy, and the token payloads can be whatever underlying implementation the 
remote service requires.

Not sure off hand which SSO proxy might just drop into your scenario but a 
custom JAAS impl will probably be needed in Nifi regardless.

What you don't want Nifi to do is juggle and manage white box awareness of all 
these different remote services. Rather just request authorization and pass 
session tokens onward.

As they say, though, the devil is in the details.
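
For the ticket-cache side of the question, the kind of JAAS stanza involved looks 
roughly like this (the entry name and the option set are illustrative, not a 
recommendation):

    ProxyClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useTicketCache=true
      renewTGT=true
      doNotPrompt=true;
    };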

Darren

Sent from my Verizon, Samsung Galaxy smartphone



Re: Can Nifi load balance flowfiles?

2020-04-24 Thread Darren Govoni
Thanks Joe.

I'm shooting for [CASE1].

The part I missed was setting the number of relationships. I just dragged 
connections from it thinking the strategy applied.

My regrets and thank you for the clarification!
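
For reference, a sketch of the configuration being described (the strategy value 
is illustrative):

    DistributeLoad
      Number of Relationships   2
      Distribution Strategy     round robin
    Connections
      relationship 1  ->  downstream processor A
      relationship 2  ->  downstream processor B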

From: Joe Witt 
Sent: Friday, April 24, 2020 2:31 PM
To: users@nifi.apache.org 
Subject: Re: Can Nifi load balance flowfiles?

Darren,

It isn't quite clear to me what you want to do.  Is the pattern

FlowFile A and B enter DistributeLoad which has relationships 1 and 2.

[CASE1] A goes to 1, B goes to 2.

OR

[CASE2] is it A goes to 1, B goes to 1 and CopyA goes to 2, CopyB goes to 2?

If it is CASE1, you want DistributeLoad: set the 'number of 
relationships' property to 2 and the strategy you want.  Then you set 
relationship 1 to some target processor and relationship 2 to some other 
processor.

If it is CASE2 you want then you just use whatever source processor you have 
before DistributeLoad and you use the same relationship (for example success) 
more than once and we'll make flowfile copies for you.

Hopefully that helps.

Thanks

On Fri, Apr 24, 2020 at 2:19 PM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi Joe,
   I've tried to use DistributeLoad, but perhaps I am doing something wrong.

I have 2 flowfiles coming into DistributeLoad and I want that processor to 
distribute them evenly across its outbound connections.

I have 2 connections outbound from DistributeLoad. I expected to see 1 flowfile 
in each outbound queue, but instead I see 2 in each, like a normal processor 
would distribute flowfiles.

Is there a special way to use this processor to have it evenly distribute the 
received flowfiles across outbound queues?

thanks!!

From: Joe Witt mailto:joe.w...@gmail.com>>
Sent: Friday, April 24, 2020 8:08 AM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Can Nifi load balance flowfiles?

Take a look at DistributeLoad.

thanks

On Fri, Apr 24, 2020 at 7:05 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi

Let's say I have a splitjson processor. I want to connect 10 processors to it 
such that it will send one output to one processor in an evenly distributed 
manner.

Can Nifi do this?

Darren

Sent from my Verizon, Samsung Galaxy smartphone


Re: Can Nifi load balance flowfiles?

2020-04-24 Thread Darren Govoni
Hi Joe,
   I've tried to use DistributeLoad, but perhaps I am doing something wrong.

I have 2 flowfiles coming into DistributeLoad and I want that processor to 
distribute them evenly across its outbound connections.

I have 2 connections outbound from DistributeLoad. I expected to see 1 flowfile 
in each outbound queue, but instead I see 2 in each, like a normal processor 
would distribute flowfiles.

Is there a special way to use this processor to have it evenly distribute the 
received flowfiles across outbound queues?

thanks!!

From: Joe Witt 
Sent: Friday, April 24, 2020 8:08 AM
To: users@nifi.apache.org 
Subject: Re: Can Nifi load balance flowfiles?

Take a look at DistributeLoad.

thanks

On Fri, Apr 24, 2020 at 7:05 AM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi

Let's say I have a splitjson processor. I want to connect 10 processors to it 
such that it will send one output to one processor in an evenly distributed 
manner.

Can Nifi do this?

Darren

Sent from my Verizon, Samsung Galaxy smartphone


Can Nifi load balance flowfiles?

2020-04-24 Thread Darren Govoni
Hi

Let's say I have a splitjson processor. I want to connect 10 processors to it 
such that it will send one output to one processor in an evenly distributed 
manner.

Can Nifi do this?

Darren

Sent from my Verizon, Samsung Galaxy smartphone


Re: Not Seeing Provenance data

2020-04-11 Thread Darren Govoni
Yes. Didn't change anything. Just unzipped the nifi distro onto a big partition 
and ran it as is.

Sent from my Verizon, Samsung Galaxy smartphone


From: Patrick Timmins 
Sent: Saturday, April 11, 2020 10:20:12 AM
To: users@nifi.apache.org 
Subject: Re: Not Seeing Provenance data


Is the underlying storage for the four repositories (provenance, database, 
flowfile, and content) consistent within a node?

Are all three nodes in the cluster using the same type of underlying 
storage/device for the various NiFi repositories?


On 4/11/2020 8:45 AM, Wyllys Ingersoll wrote:
Nope, already checked that.

On Fri, Apr 10, 2020 at 8:23 PM Patrick Timmins 
mailto:ptimm...@cox.net>> wrote:

No issues here.  Sounds like a timezone / system clock / clock drift issue (in 
a cluster).

On 4/10/2020 11:59 AM, Joe Witt wrote:
The provenance repo is in large scale use by many many users so fundamentally 
it does work.  There are conditions that apparently need improving.  In the 
past couple days these items have been flagged by folks on this list, JIRAs and 
PRs raised and merged, all good. If you can help by creating a build of the 
latest and confirm it fixes your case then please do so.

Thanks

On Fri, Apr 10, 2020 at 12:48 PM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
It would seem the feature is either broken completely or only works in specific 
conditions.

Can the Nifi team put a fix on their road map for this?
It's a rather central feature of Nifi.

Sent from my Verizon, Samsung Galaxy smartphone


From: Wyllys Ingersoll 
mailto:wyllys.ingers...@keepertech.com>>
Sent: Friday, April 10, 2020 11:17:42 AM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Not Seeing Provenance data

I have a similar problem with viewing provenance.  I have a 3-node cluster in a 
kubernetes environment, the provenance_repository directory for each node is on 
a persistent data store so it is not deleted or lost between container restarts 
(which are not very common).  My nifi.provenance.repository.max.storage.time is 
24 hours.

Whenever I try to view any provenance, nothing is ever shown.  If I manually 
inspect the provenance_repository directory, there is a lucene index and TOC 
being created.

I see log messages like these:

Submitting query +processorId:882133fe-b684-148b-ad88-7850437ca591 with 
identifier 64a703fe-0171-1000--65abd91a against index directories 
[./provenance_repository/lucene-8-index-1560864819888]
Returning the following list of index locations because they were finished 
being written to before 1586531601311: []
Found no events in the Provenance Repository. In order to perform maintenace of 
the indices, will assume that the first event time is now (1586531601311)


Any suggestions?

-Wyllys Ingersoll



On Thu, Apr 9, 2020 at 11:25 AM Dobbernack, Harald (Key-Work) 
mailto:harald.dobbern...@key-work.de>> wrote:

Hey Mark,



great news and thank you very much!



Happy Holidays!

Harald



From: Mark Payne mailto:marka...@hotmail.com>>
Sent: Thursday, April 9, 2020 17:18
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: Re: Not Seeing Provenance data



Thanks Harald,



I have created a Jira [1] for this. There’s currently a PR up for it as well.



Thanks

-Mark



[1] https://issues.apache.org/jira/browse/NIFI-7346



On Apr 9, 2020, at 11:14 AM, Dobbernack, Harald (Key-Work) 
mailto:harald.dobbern...@key-work.de>> wrote:



Hi Mark,



I can confirm after testing that if no provenance event has been generated in a 
time greater than the set nifi.provenance.repository.max.storage.time then as 
expected the last recorded provenance events don’t exist anymore but also from 
then on any new provenance events are also not searchable, the provenance 
Search remains completely empty regardless of how many flows are active.  As 
described also *.prov file is then missing in provenance repository. After 
restart of Nifi new prov File will be generated and provenance will work again, 
but only showing stuff generated since last NiFi Start.



So yes, I’d say your Idea

‘If so, then I think that would understand why it deleted the data. It’s 
trying to age off old data

 but unfortunately it doesn’t perform a check to first determine whether or 
not the “old file”

 that it’s about to delete is also the “active file”.’

fits very nicely to my test.



As a workaround we’re going to set a greater 
nifi.provenance.repository.max.storage.time until this can be resolved.



Thanks again for looking into this.

Harald





From: Dobbernack, Harald (Key-Work)
Sent: Thursday, April 9, 2020 15:22
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: Re: Not Seeing Provenance data



Hi Mark,



thank you for looking into this.



The nifi.provenance.repository.max.storage.time setting 

Re: Not Seeing Provenance data

2020-04-11 Thread Darren Govoni
It never has worked for me with a simple, out of the box install on one machine 
in EC2.

But there should be a configuration that keeps the last X hours of provenance. 
NOT based on wall clock time.

For example, I want the last 24 hours of provenance regardless of whether the last time 
a processor ran was 3 days ago. So it would be relative to the latest logged 
data.

Sent from my Verizon, Samsung Galaxy smartphone


From: Wyllys Ingersoll 
Sent: Saturday, April 11, 2020 9:45:57 AM
To: users@nifi.apache.org 
Subject: Re: Not Seeing Provenance data

Nope, already checked that.

On Fri, Apr 10, 2020 at 8:23 PM Patrick Timmins 
mailto:ptimm...@cox.net>> wrote:

No issues here.  Sounds like a timezone / system clock / clock drift issue (in 
a cluster).

On 4/10/2020 11:59 AM, Joe Witt wrote:
The provenance repo is in large scale use by many many users so fundamentally 
it does work.  There are conditions that apparently need improving.  In the 
past couple days these items have been flagged by folks on this list, JIRAs and 
PRs raised and merged, all good. If you can help by creating a build of the 
latest and confirm it fixes your case then please do so.

Thanks

On Fri, Apr 10, 2020 at 12:48 PM Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
It would seem the feature is either broken completely or only works in specific 
conditions.

Can the Nifi team put a fix on their road map for this?
It's a rather central feature of Nifi.

Sent from my Verizon, Samsung Galaxy smartphone


From: Wyllys Ingersoll 
mailto:wyllys.ingers...@keepertech.com>>
Sent: Friday, April 10, 2020 11:17:42 AM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Not Seeing Provenance data

I have a similar problem with viewing provenance.  I have a 3-node cluster in a 
kubernetes environment, the provenance_repository directory for each node is on 
a persistent data store so it is not deleted or lost between container restarts 
(which are not very common).  My nifi.provenance.repository.max.storage.time is 
24 hours.

Whenever I try to view any provenance, nothing is ever shown.  If I manually 
inspect the provenance_repository directory, there is a lucene index and TOC 
being created.

I see log messages like these:

Submitting query +processorId:882133fe-b684-148b-ad88-7850437ca591 with 
identifier 64a703fe-0171-1000--65abd91a against index directories 
[./provenance_repository/lucene-8-index-1560864819888]
Returning the following list of index locations because they were finished 
being written to before 1586531601311: []
Found no events in the Provenance Repository. In order to perform maintenace of 
the indices, will assume that the first event time is now (1586531601311)


Any suggestions?

-Wyllys Ingersoll



On Thu, Apr 9, 2020 at 11:25 AM Dobbernack, Harald (Key-Work) 
mailto:harald.dobbern...@key-work.de>> wrote:

Hey Mark,



great news and thank you very much!



Happy Holidays!

Harald



From: Mark Payne mailto:marka...@hotmail.com>>
Sent: Thursday, April 9, 2020 17:18
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: Re: Not Seeing Provenance data



Thanks Harald,



I have created a Jira [1] for this. There’s currently a PR up for it as well.



Thanks

-Mark



[1] https://issues.apache.org/jira/browse/NIFI-7346



On Apr 9, 2020, at 11:14 AM, Dobbernack, Harald (Key-Work) 
mailto:harald.dobbern...@key-work.de>> wrote:



Hi Mark,



I can confirm after testing that if no provenance event has been generated in a 
time greater than the set nifi.provenance.repository.max.storage.time then as 
expected the last recorded provenance events don’t exist anymore but also from 
then on any new provenance events are also not searchable, the provenance 
Search remains completely empty regardless of how many flows are active.  As 
described also *.prov file is then missing in provenance repository. After 
restart of Nifi new prov File will be generated and provenance will work again, 
but only showing stuff generated since last NiFi Start.



So yes, I’d say your Idea

‘If so, then I think that would understand why it deleted the data. It’s 
trying to age off old data

 but unfortunately it doesn’t perform a check to first determine whether or 
not the “old file”

 that it’s about to delete is also the “active file”.’

fits very nicely to my test.



As a workaround we’re going to set a greater 
nifi.provenance.repository.max.storage.time until this can be resolved.



Thanks again for looking into this.

Harald





From: Dobbernack, Harald (Key-Work)
Sent: Thursday, April 9, 2020 15:22
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: Re: Not Seeing Provenance data



Hi Mark,



thank you for looking into this.



The nifi.provenance.repository.max.storage.time setting might explain why I 
haven

Re: Not Seeing Provenance data

2020-04-10 Thread Darren Govoni
 would understand why it deleted the data. It’s trying 
to age off old data but unfortunately it doesn’t perform a check to first 
determine whether or not the “old file” that it’s about to delete is also the 
“active file”.



Can you confirm whether or not you would expect to see 24 hours pass without 
any provenance data?



Thanks

-Mark







On Apr 9, 2020, at 4:32 AM, Dobbernack, Harald (Key-Work) 
mailto:harald.dobbern...@key-work.de>> wrote:



What I noticed is that as long as provenance is working there will be *.prov 
files in the directory. When Provenance isn’t working these files are not to be 
seen. Maybe some Cleaning Process deletes those files prematurely or the 
process building them doesn’t work any more?



From: Dobbernack, Harald (Key-Work) 
mailto:harald.dobbern...@key-work.de>>
Sent: Thursday, April 9, 2020 10:27
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: Re: Not Seeing Provenance data



This is something I experience too from time to time. My quick and dirty 
workaround is stop nifi, delete everything in the provenance directory, 
restart….  Then Provenance is usable again (of course only with data since the 
delete) . I’m hoping very much there is a better way, someone can show us 
better settings or a potential bug can be discovered…



From: Darren Govoni mailto:dar...@ontrenet.com>>
Sent: Wednesday, April 8, 2020 20:31
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: Not Seeing Provenance data



Hi,

  When I go to "View data provenance" in Nifi, I never see any logs for my 
flow. Am I missing some configuration setting somewhere?



thanks,

Darren





Harald Dobbernack
Key-Work Consulting GmbH | Kriegsstr. 100 | 76133 | Karlsruhe | Germany | 
https://www.key-work.de | 
Datenschutz<https://www.key-work.de/de/footer/datenschutz.html>
Fon: +49-721-78203-264 | E-Mail: 
harald.dobbern...@key-work.de<mailto:harald.dobbern...@key-work.de> | Fax: 
+49-721-78203-10

Key-Work Consulting GmbH, Karlsruhe, HRB 108695, HRG Mannheim
Geschäftsführer: Andreas Stappert, Tobin Wotring




Not Seeing Provenance data

2020-04-08 Thread Darren Govoni
Hi,
  When I go to "View data provenance" in Nifi, I never see any logs for my 
flow. Am I missing some configuration setting somewhere?

thanks,
Darren



Re: Adding Nested Properties/JSON

2020-03-31 Thread Darren Govoni
Sure. Thank you.

Processor #1 creates this JSON

{
   "name":"this and that",
   "field":"value"
}

passes to Processor #2 which adds a record to a sub-field

{
   "name":"this and that",
   "field":"value",
   "others": [
  {"name":"here and there"}
]
}

passes to Processor #3 which also adds a record to "others".

{
   "name":"this and that",
   "field":"value",
   "others": [
  {"name":"here and there"},
  {"name":"one and two"},
]
}

which is the final output. So it's more building a JSON than transforming, 
sorta.

From: Etienne Jouvin 
Sent: Tuesday, March 31, 2020 9:37 AM
To: users@nifi.apache.org 
Subject: Re: Adding Nested Properties/JSON

Can you post example of input and expected result.

For adding, you can use default or modify-overwrite-beta


On Tue, Mar 31, 2020 at 15:30, Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi. Thank you.

In looking at the Jolt docs these are the operations:

shift, sort, cardinality, modify-default-beta, modify-overwrite-beta, 
modify-define-beta, or remove

I primarily need "add" such that I can add nested elements or add elements to 
an array already in the JSON.

Can a single Jolt processor do this? Or do I need to merge two inputs to join 
them into a single JSON?

thanks in advance!
Darren



From: Etienne Jouvin mailto:lapinoujou...@gmail.com>>
Sent: Tuesday, March 31, 2020 8:52 AM
To: users@nifi.apache.org<mailto:users@nifi.apache.org> 
mailto:users@nifi.apache.org>>
Subject: Re: Adding Nested Properties/JSON

Hello.

Jolt transformation.

Etienne

On Tue, Mar 31, 2020 at 14:40, Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi,
   I want to use Nifi to design a flow that modifies, updates, etc a nested 
JSON document (or that can finally output one at the end).

For example:

{
   "name":"this and that",
   "field":"value",
   "others": [
   {"name":"here and there"},
   ...
   ]
}

What's the best approach to this using Nifi?

Thanks in advance!
Darren



Re: Adding Nested Properties/JSON

2020-03-31 Thread Darren Govoni
Hi. Thank you.

In looking at the Jolt docs these are the operations:

shift, sort, cardinality, modify-default-beta, modify-overwrite-beta, 
modify-define-beta, or remove

I primarily need "add" such that I can add nested elements or add elements to 
an array already in the JSON.

Can a single Jolt processor do this? Or do I need to merge two inputs to join 
them into a single JSON?

thanks in advance!
Darren



From: Etienne Jouvin 
Sent: Tuesday, March 31, 2020 8:52 AM
To: users@nifi.apache.org 
Subject: Re: Adding Nested Properties/JSON

Hello.

Jolt transformation.

Etienne

On Tue, Mar 31, 2020 at 14:40, Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi,
   I want to use Nifi to design a flow that modifies, updates, etc a nested 
JSON document (or that can finally output one at the end).

For example:

{
   "name":"this and that",
   "field":"value",
   "others": [
   {"name":"here and there"},
   ...
   ]
}

What's the best approach to this using Nifi?

Thanks in advance!
Darren


Adding Nested Properties/JSON

2020-03-31 Thread Darren Govoni
Hi,
   I want to use Nifi to design a flow that modifies, updates, etc a nested 
JSON document (or that can finally output one at the end).

For example:

{
   "name":"this and that",
   "field":"value",
   "others": [
   {"name":"here and there"},
   ...
   ]
}

What's the best approach to this using Nifi?

Thanks in advance!
Darren


Re: Suggestions for Flow Development Lifestyle

2020-02-25 Thread Darren Govoni
You could try using git with pull requests, merges, and a code review process. You'd just 
have to export and import your flows as templates.
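
A rough sketch of that export/import round trip over the REST API (the host, IDs, 
and file name are placeholders; add whatever authentication your instance requires):

    # export a template so it can be committed and reviewed in git
    curl -sk https://nifi-host:8443/nifi-api/templates/<template-id>/download -o my-flow.xml

    # import it into another instance / process group
    curl -sk -X POST -F template=@my-flow.xml \
      https://other-nifi:8443/nifi-api/process-groups/<pg-id>/templates/upload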

Alternatively, if Nifi registry was built on git API (local or remote repos 
etc) then it would all "just work" the way you describe.

From: Eric Secules 
Sent: Tuesday, February 25, 2020 1:31 AM
To: users@nifi.apache.org 
Subject: Suggestions for Flow Development Lifestyle

Hello everyone,

I'm starting to use nifi and nifi registry on my development team and we're 
running into issues working together on the same versioned process groups.
The nifi registry doesn't support branching, merging and code review natively, so 
we all have ended up developing on the same branch on the same registry 
instance and doing in-person peer review. Is there a better way for teams to 
develop process groups for nifi?

What we've tried:
I tried to set up my own registry on my local machine where I do development 
and make incremental changes. Then when I am ready to "merge" I import the 
process group to the central registry from my local registry. The main issue 
with this is that there's no mechanism for merge if the central registry and my 
local registry have diverged. The other issue is when a versioned process group 
containing other versioned process groups is moved from local-reg to 
central-reg the inner PGs still say their source is local-reg despite the 
containing PG moving from local-reg to central-reg. This becomes a problem in 
production environments which would only be connected to central-reg. Tldr; 
Moving nested versioned flows between registries is complicated.

I've also tried backing up my local registry to a separate branch in git and 
manually merging it with the branch that central-reg backs up to, but these git 
branches are glorified backups and the registry doesn't seem to be built to 
pull updates from them. On top of that doing a code review on the generated 
JSON describing a process group is difficult and I ran into several merge 
conflicts testing out a very simple merge where the target branch diverged 
slightly from the feature branch.

Does anyone have any other approaches that have succeeded on teams with 
multiple people developing on the same set of process groups?

Thanks,
Eric


Re: InvokeHTTP & AttributesToJSON Help

2020-02-02 Thread Darren Govoni
Ok, thanks!

From: Etienne Jouvin 
Sent: Saturday, February 1, 2020 4:22 PM
To: users@nifi.apache.org 
Subject: Re: InvokeHTTP & AttributesToJSON Help

Hi.

If I understand correctly, you can configure the InvokeHTTP to always output the 
response.
In this case, the response will be routed to the response relation in any case.
Then you can do a RouteOnAttribute to check the response status: if the HTTP 
code (from the attribute) is 500, go to a specific relation; if 200, to another; and so 
on.
But be careful: the retry and error relations are still active, so you can auto-terminate 
them and just work from the response relation.

Etienne
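
A sketch of that wiring (the RouteOnAttribute property names are made up; with 
Always Output Response set, the 500 response body arrives on the Response 
relationship as the FlowFile content, so AttributesToJSON isn't needed):

    InvokeHTTP
      Always Output Response   true

    RouteOnAttribute (dynamic properties)
      server_error   ${invokehttp.status.code:equals('500')}
      ok             ${invokehttp.status.code:equals('200')}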


On Sat, Feb 1, 2020 at 15:26, Darren Govoni 
mailto:dar...@ontrenet.com>> wrote:
Hi,
  I have probably 2 easy problems I can't seem to solve (still new).


  1.  I want to route a status 500 to Failure. Not retry. The response contains 
a JSON message.
  2.  Currently, I am routing the InvokeHTTP Retry with code 500 to 
AttributesToJSON to pull the response JSON from "invokehttp.response.body" and 
put it as the flow file. However, it does not work the way I expect.
 *   I want the response body to become the flow file. Instead I get
*   { "invokehttp.response.body": "my json encoded json response" }
*   I do not want the outer "invokehttp.response.body" field
  3.  I then tried to unwrap this using SplitJSON, but I cannot seem to use 
this JSON path
 *   $.invokehttp.response.body - Because the dot notation used by Nifi has 
different semantics to JSONPath.

Any easy fixes to these conundrums?

thank you!
D


InvokeHTTP & AttributesToJSON Help

2020-02-01 Thread Darren Govoni
Hi,
  I have probably 2 easy problems I can't seem to solve (still new).


  1.  I want to route a status 500 to Failure. Not retry. The response contains 
a JSON message.
  2.  Currently, I am routing the InvokeHTTP Retry with code 500 to 
AttributesToJSON to pull the response JSON from "invokehttp.response.body" and 
put it as the flow file. However, it does not work the way I expect.
 *   I want the response body to become the flow file. Instead I get
*   { "invokehttp.response.body": "my json encoded json response" }
*   I do not want the outer "invokehttp.response.body" field
  3.  I then tried to unwrap this using SplitJSON, but I cannot seem to use 
this JSON path
 *   $.invokehttp.response.body - Because the dot notation used by Nifi has 
different semantics to JSONPath.

Any easy fixes to these conundrums?

thank you!
D


Unsubscribe

2019-06-19 Thread Darren Govoni


Get Outlook for Android



Re: [Twisted-Python] Thread Consumption Problem in Daemon?

2018-11-23 Thread Darren Govoni
Thanks. I added Tipper to my program and will see what it shows when I ping
the process.

https://pypi.org/project/tipper/
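
Since the pool size is configurable via the API but not the twistd command line 
(per Glyph's note below), a minimal sketch of capping it from the application 
module itself (the module name and the value 10 are illustrative):

    # my/app.py -- runs at import time when twistd loads the WSGI app
    from twisted.internet import reactor

    reactor.suggestThreadPoolSize(10)  # cap the reactor thread pool used for WSGI requests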

On Thu, Nov 22, 2018 at 6:43 AM Chris Withers  wrote:

> On 22/11/2018 02:30, Glyph wrote:
> >
> >> On Nov 19, 2018, at 6:16 AM, Darren Govoni  >> <mailto:darrenus...@gmail.com>> wrote:
> >>
> >> I tried to find out if there is a way to limit the thread pool size
> >> from command line for twisted web and found nothing. Does it exist?
> >
> > The thread pool is limited to 10. While this is configurable via the
> > API, no command line option is exposed to tune it.  (This would be a
> > great contribution if you were so inclined!)
> >
> > It seems likely to me that Flask is spawning background threads for some
> > reason; given the way Twisted's threadpool works, leaks like this are
> > not common.  However, anything is possible: you probably want to gather
> > some information about what all those threads are doing.
>
> Some ideas on this front:
>
> - pstree/ps and strace will tell you at a low level
>
> - http://pyrasite.com/ and then use Python's thread introspection stuff.
>
> cheers,
>
> Chris
>
> ___
> Twisted-Python mailing list
> Twisted-Python@twistedmatrix.com
> https://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
>
___
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
https://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python


Re: [Twisted-Python] Thread Consumption Problem in Daemon?

2018-11-19 Thread Darren Govoni
I tried to find out if there is a way to limit the thread pool size from
command line for twisted web and found nothing. Does it exist?

On Mon, Nov 19, 2018 at 8:30 AM Jean-Paul Calderone <
exar...@twistedmatrix.com> wrote:

> On Mon, Nov 19, 2018 at 8:26 AM Maarten ter Huurne 
> wrote:
>
>> On maandag 19 november 2018 12:40:20 CET Darren Govoni wrote:
>> > Hi,
>> >   I am using twisted to run my Flask app via WSGI like so.
>> >
>> > twistd --pidfile $PORT/pidfile -l $PORT/logfile  -n web --port
>> > tcp:$PORT --wsgi my.app
>> >
>> > Naturally, I have functions representing routes that enter and exit
>> > just fine.
>> >
>> > However, I notice the twisted daemon process is :"gathering threads".
>> > Eventually system runs out of them.
>> >
>> >  Here's a full status for one twisted server. 504 threads???
>>
>> I have a server running inside twistd which uses exactly 1 thread after
>> running for a few weeks, so the problem may not be in twistd itself.
>>
>> I'm using a reverse-proxy HTTP setup though, not WSGI. Maybe the problem
>> is specific to WSGI, Flask or your application?
>>
>
>
> Twisted's WSGI support definitely uses threads (as this is essentially a
> requirement of WSGI).  It uses the reactor thread pool (if you launch it
> from the CLI with twistd) which used to be limited to 10 threads.  I don't
> know if the same limit is in place these days.
>
> Jean-Paul
>
> ___
> Twisted-Python mailing list
> Twisted-Python@twistedmatrix.com
> https://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
>
___
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
https://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python


[Twisted-Python] Thread Consumption Problem in Daemon?

2018-11-19 Thread Darren Govoni
Hi,
  I am using twisted to run my Flask app via WSGI like so.

twistd --pidfile $PORT/pidfile -l $PORT/logfile  -n web --port tcp:$PORT
--wsgi my.app

Naturally, I have functions representing routes that enter and exit just
fine.

However, I notice the twisted daemon process is "gathering threads".
Eventually the system runs out of them.

 Here's a full status for one twisted server. 504 threads???

Name:   twistd
Umask:  0077
State:  S (sleeping)
Tgid:   54855
Ngid:   35415
Pid:54855
PPid:   1
TracerPid:  0
Uid:    4052    4052    4052    4052
Gid:    4052    4052    4052    4052
FDSize: 256
Groups: 4052
VmPeak: 34240104 kB
VmSize: 34239336 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM:   1942708 kB
VmRSS:   1871884 kB
RssAnon: 1834800 kB
RssFile:   37080 kB
RssShmem:  4 kB
VmData: 33310576 kB
VmStk:   284 kB
VmExe: 4 kB
VmLib:234176 kB
VmPTE:  8876 kB
VmSwap:0 kB
Threads:504
SigQ:   1/1546652
SigPnd: 
ShdPnd: 
SigBlk: 
SigIgn: 01001007
SigCgt: 0001800146e8
CapInh: 
CapPrm: 
CapEff: 
CapBnd: 001f
CapAmb: 
Seccomp:0
Speculation_Store_Bypass:   thread vulnerable
Cpus_allowed:   ff,
Cpus_allowed_list:  0-55
Mems_allowed:
 
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0003
Mems_allowed_list:  0-1
voluntary_ctxt_switches:358534596
nonvoluntary_ctxt_switches: 31738
___
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
https://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python


[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2017-12-06 Thread Darren Govoni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280271#comment-16280271
 ] 

Darren Govoni commented on SPARK-17788:
---

I'm also running into this error on spark 2.1.0
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in 
stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 
(TID 7544, bdr-itwp-hdfs-2.dev.uspto.gov, executor 2): 
java.lang.IllegalArgumentException: Cannot allocate a page with more than 
17179869176 bytes


> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
> 
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E
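As a rough illustration only (not a fix for the underlying JIRA), here is a PySpark sketch of the failing pattern plus one mitigation that is sometimes tried: raising the shuffle partition count so that each sorted range handled by a single task is smaller (assumptions: a SparkSession, a registered MY_TABLE, and a made-up column name):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("skewed-sort-sketch").getOrCreate()
    field = "my_double_field"  # hypothetical column name

    # More shuffle partitions -> smaller ranges per task during the global sort.
    spark.conf.set("spark.sql.shuffle.partitions", "2000")  # default is 200

    sorted_df = spark.table("MY_TABLE").select(field).orderBy(col(field).desc())
    sorted_df.write.mode("overwrite").parquet("/tmp/sorted_output")  # placeholder path

This only spreads the skewed range over more tasks; it does not remove the skew itself.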



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2017-12-06 Thread Darren Govoni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280271#comment-16280271
 ] 

Darren Govoni edited comment on SPARK-17788 at 12/6/17 2:57 PM:


I'm also running into this error on Spark 2.1.0:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in 
stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 
(TID 7544,xxx.xxx.xxx.xxx.xx, executor 2): java.lang.IllegalArgumentException: 
Cannot allocate a page with more than 17179869176 bytes



was (Author: sesshomurai):
I'm also running into this error on spark 2.1.0
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in 
stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 
(TID 7544, bdr-itwp-hdfs-2.dev.uspto.gov, executor 2): 
java.lang.IllegalArgumentException: Cannot allocate a page with more than 
17179869176 bytes


> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
> 
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Running Spark on EMR

2017-01-15 Thread Darren Govoni
So what was the answer?


Sent from my Verizon, Samsung Galaxy smartphone
 Original message 
From: Andrew Holway
Date: 1/15/17 11:37 AM (GMT-05:00)
To: Marco Mistroni
Cc: Neil Jonkers, User
Subject: Re: Running Spark on EMR 
Darn. I didn't respond to the list. Sorry.


On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni  wrote:
thanks Neil. I followed the original suggestion from Andrew and everything is 
working fine now
kr
On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers  wrote:
Hello,
Can you drop the url:
 spark://master:7077
The url is used when running Spark in standalone mode.
Regards

 Original message 
From: Marco Mistroni
Date: 15/01/2017 16:34 (GMT+02:00)
To: User
Subject: Running Spark on EMR 

hi all, could anyone assist here? i am trying to run spark 2.0.0 on an EMR 
cluster, but i am having issues connecting to the master node. So, below is a 
snippet of what i am doing

sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate()

sparkHost is passed as an input parameter. that was thought so that i can run the 
script locally on my spark local instance as well as submitting scripts on any 
cluster i want

Now i have:
1 - setup a cluster on EMR
2 - connected to masternode
3 - launched the command: spark-submit myscripts.py spark://master:7077
But that results in a connection refused exception. Then i have tried to remove 
the .master call above and launch the script with the following command
spark-submit --master spark://master:7077 myscript.py
but still i am getting a connection refused exception

I am using Spark 2.0.0, could anyone advise on how i should build the spark 
session and how i can submit a python script to the cluster?
kr marco  
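A small sketch of the direction Neil suggests: leave the standalone spark:// URL out entirely and let the cluster manager come from spark-submit, which on EMR means YARN (assumptions: Spark 2.0 SparkSession; the app name is reused from the snippet above):

    # myscript.py -- submitted with, e.g.:  spark-submit --master yarn myscript.py
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("DataProcess")
             .getOrCreate())   # no .master(...): the cluster manager is supplied by spark-submit

    print(spark.range(10).count())  # trivial smoke test
    spark.stop()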





-- 
Otter Networks UG
http://otternetworks.de
Gotenstraße 17
10829 Berlin



Spark in docker over EC2

2017-01-10 Thread Darren Govoni
Anyone got a good guide for getting the spark master to talk to remote workers 
inside docker containers? I followed the tips I found by searching but it still 
doesn't work. Spark 1.6.2.
I exposed all the ports and tried to set the local IP inside the container to the 
host IP, but spark complains it can't bind the UI ports.
Thanks in advance!


Sent from my Verizon, Samsung Galaxy smartphone

RE: AMQP extension for Apache Spark Streaming (messaging/IoT)

2016-07-03 Thread Darren Govoni


This is fantastic news.


Sent from my Verizon 4G LTE smartphone

 Original message 
From: Paolo Patierno  
Date: 7/3/16  4:41 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: AMQP extension for Apache Spark Streaming (messaging/IoT) 

Hi all,

I'm working on an AMQP extension for Apache Spark Streaming, developing a 
reliable receiver for that. 

After MQTT support (I see it in the Apache Bahir repository), another 
messaging/IoT protocol could be very useful for the Apache Spark Streaming 
ecosystem. Out there, a lot of brokers (with a "store and forward" mechanism) 
support AMQP as a first-citizen protocol, other than the Apache Qpid Dispatch 
Router that is based on it for message routing.
Currently the source code is in my own GitHub account and it's in an early 
stage; the first step was just having something working end-to-end. I'm going 
to add features like QoS and flow control in AMQP terms very soon. I was 
inspired by the spark-packages directory structure, using Scala (as main 
language) and SBT (as build tool).

https://github.com/ppatierno/dstream-amqp

What do you think about that?

Looking forward to hearing from you.

Thanks,
Paolo.
  

Re: Build error zengine

2016-06-21 Thread Darren Govoni

  
  
H. It worked the 3rd time. Not sure what the hiccup was.

On 06/21/2016 11:46 AM, Hyung Sung Shim wrote:

hi.
I just got build success on my ubuntu machine as your build command.
Did you install the prerequisite things[1] to build zeppelin?
And can you share your build log?

[1] https://github.com/apache/zeppelin

2016-06-21 23:47 GMT+09:00 Vinay Shukla <vinayshu...@gmail.com>:

What is the exact build failure?

On Tuesday, June 21, 2016, Darren Govoni <dar...@ontrenet.com> wrote:

Ubuntu 15.10

mvn clean package -Pspark-1.6 -Phadoop-2.4 -Pyarn -Ppyspark

On 06/21/2016 09:03 AM, Hyung Sung Shim wrote:

hi.
What is your build command and please tell me your environments.

2016-06-21 21:45 GMT+09:00 Darren Govoni <dar...@ontrenet.com>:

Hi

I am trying to build git repo but zengine fails. Any tips on this?

Thanks

Sent from my Verizon Wireless 4G LTE smartphone


Build error zengine

2016-06-21 Thread Darren Govoni


Hi
I am trying to build git repo but zengine fails. Any tips on this?
Thanks 


Sent from my Verizon Wireless 4G LTE smartphone

Re: Dockerfile?

2016-06-05 Thread Darren Govoni


Thanks. I will share my dockerfile once I get it working too.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Luciano Resende <luckbr1...@gmail.com> 
Date: 06/05/2016  1:48 PM  (GMT-05:00) 
To: users@zeppelin.apache.org 
Subject: Re: Dockerfile? 

This is not Ubuntu based, but can give you some help:

Spark base:
https://github.com/lresende/docker-spark

Zeppelin (using the above spark base):
https://github.com/lresende/docker-systemml-notebook

This is from before the R dependencies, so I still need to update it with R.



On Sun, Jun 5, 2016 at 10:07 AM, Darren Govoni <dar...@ontrenet.com> wrote:


Hi
Does anyone know of an updated docker file that builds the latest zeppelin, spark, 
hadoop etc.? Ubuntu based?
Thanks
Darren


Sent from my Verizon Wireless 4G LTE smartphone


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/



Re: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni


Well that could be the problem. A SQL database is essentially a big synchronizer. 
If you have a lot of spark tasks all bottlenecking on a single database socket 
(is the database clustered or colocated with spark workers?) then you will have 
blocked threads on the database server.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Malcolm Lockyer <malcolm.lock...@hapara.com> 
Date: 05/30/2016  10:40 PM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Re: Spark + Kafka processing trouble 

On Tue, May 31, 2016 at 1:56 PM, Darren Govoni <dar...@ontrenet.com> wrote:
> So you are calling a SQL query (to a single database) within a spark
> operation distributed across your workers?

Yes, but currently with very small sets of data (1-10,000) and on a
single (dev) machine right now.





(sorry didn't reply to the list)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni


So you are calling a SQL query (to a single database) within a spark operation 
distributed across your workers? 


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Malcolm Lockyer  
Date: 05/30/2016  9:45 PM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Spark + Kafka processing trouble 

Hopefully this is not off topic for this list, but I am hoping to
reach some people who have used Kafka + Spark before.

We are new to Spark and are setting up our first production
environment and hitting a speed issue that maybe configuration related
- and we have little experience in configuring Spark environments.

So we've got a Spark streaming job that seems to take an inordinate
amount of time to process. I realize that without specifics, it is
difficult to trace - however the most basic primitives in Spark are
performing horribly. The lazy nature of Spark is making it difficult
for me to understand what is happening - any suggestions are very much
appreciated.

Environment is MBP 2.2 i7. Spark master is "local[*]". We are using
Kafka and PostgreSQL, both local. The job is designed to:

a) grab some data from Kafka
b) correlate with existing data in PostgreSQL
c) output data to Kafka

I am isolating timings by calling System.nanoTime() before and after
something that forces calculation, for example .count() on a
DataFrame. It seems like every operation has a MASSIVE fixed overhead
and that is stacking up making each iteration on the RDD extremely
slow. Slow operations include pulling a single item from the Kafka
queue, running a simple query against PostgreSQL, and running a Spark
aggregation on a RDD with a handful of rows.
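A tiny PySpark sketch of that measurement pattern (assumptions: SQLContext-era API, time.time() standing in for System.nanoTime(), and a stand-in DataFrame instead of the one built from Kafka):

    import time

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="timing-sketch")
    sqlContext = SQLContext(sc)
    df = sqlContext.range(10000)      # stand-in for the DataFrame built from the Kafka RDD

    start = time.time()
    n = df.count()                    # an action: forces the lazy plan to actually run
    print("count() -> %d rows in %.2f s" % (n, time.time() - start))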

The machine is not maxing out on memory, disk or CPU. The machine
seems to be doing nothing for a high percentage of the execution time.
We have reproduced this behavior on two other machines. So we're
suspecting a configuration issue

As a concrete example, we have a DataFrame produced by running a JDBC
query by mapping over an RDD from Kafka. Calling count() (I guess
forcing execution) on this DataFrame when there is *1* item/row (Note:
SQL database is EMPTY at this point so this is not a factor) takes 4.5
seconds, calling count when there are 10,000 items takes 7 seconds.

Can anybody offer experience of something like this happening for
them? Any suggestions on how to understand what is going wrong?

I have tried tuning the number of Kafka partitions - increasing this
seems to increase the concurrency and ultimately number of things
processed per minute, but to get something half decent, I'm going to
need running with 1024 or more partitions. Is 1024 partitions a
reasonable number? What do you use in you environments?

I've tried different options for batchDuration. The calculation seems
to be batchDuration * Kafka partitions for number of items per
iteration, but this is always still extremely slow (many per iteration
vs. very few doesn't seem to really improve things). Can you suggest a
list of the Spark configuration parameters related to speed that you
think are key - preferably with the values you use for those
parameters?

I'd really really appreciate any help or suggestions as I've been
working on this speed issue for 3 days without success and my head is
starting to hurt. Thanks in advance.



Thanks,

--

Malcolm Lockyer

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Submit python egg?

2016-05-18 Thread Darren Govoni


Hi
I have a python egg with a __main__.py in it. I am able to execute the egg 
by itself fine.
Is there a way to just submit the egg to spark and have it run? It seems an 
external .py script is needed which would be unfortunate if true.
Thanks
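One commonly suggested arrangement, sketched below (assumptions: the egg is named myjob.egg and exposes a main() function; the thin run_egg.py wrapper is hypothetical): keep a very small .py driver and ship the egg alongside it with --py-files.

    # run_egg.py -- submitted with:  spark-submit --py-files myjob.egg run_egg.py
    from myjob import main   # hypothetical entry point packaged inside the egg

    if __name__ == "__main__":
        main()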


Sent from my Verizon Wireless 4G LTE smartphone

Re: Python interpreter without spark?

2016-03-26 Thread Darren Govoni


Thanks moon.
Hopefully someone will want to work on it. Definitely would be useful.

Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: moon soo Lee <m...@apache.org> 
Date: 03/26/2016  11:39 AM  (GMT-05:00) 
To: users@zeppelin.incubator.apache.org 
Subject: Re: Python interpreter without spark? 

Hi,
There is an open issue for python interpreter. 
https://issues.apache.org/jira/browse/ZEPPELIN-502
I think it'll really help python users. Big +1 for this issue.

Thanks,
moon
On Sat, Mar 26, 2016 at 7:36 AM Darren Govoni <dar...@ontrenet.com> wrote:


Hi
Is there a standalone python interpreter without spark? Or maybe an Ipython 
interpreter?
Looking to run python notes without a big backend.
Thanks


Sent from my Verizon Wireless 4G LTE smartphone


Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni


Our data is made up of single text documents scraped off the web. We store 
these in an RDD. A Dataframe or similar structure makes no sense at that point. 
And the RDD is transient.
So my point is: Dataframes should not replace the plain old rdd, since rdds allow 
for more flexibility, and sql etc. is not even usable on our data while it's in an rdd. 
So all those nice dataframe apis aren't usable until the data is structured, which is 
the core problem anyway.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Nicholas Chammas <nicholas.cham...@gmail.com> 
Date: 03/02/2016  5:43 PM  (GMT-05:00) 
To: Darren Govoni <dar...@ontrenet.com>, Jules Damji <dmat...@comcast.net>, 
Joshua Sorrell <jsor...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: Does pyspark still lag far behind the Scala API in terms of 
features 

Plenty of people get their data in Parquet, Avro, or ORC files; or from a 
database; or do their initial loading of un- or semi-structured data using one 
of the various data source libraries which help with type-/schema-inference.
All of these paths help you get to a DataFrame very quickly.
Nick
On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:


Dataframes are essentially structured tables with schemas. So where does the 
non typed data sit before it becomes structured if not in a traditional RDD?
For us almost all the processing comes before there is structure to it.




Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Nicholas Chammas <nicholas.cham...@gmail.com> 
Date: 03/02/2016  5:13 PM  (GMT-05:00) 
To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: Does pyspark still lag far behind the Scala API in terms of 
features 

> However, I believe, investing (or having some members of your group) learn 
>and invest in Scala is worthwhile for a few reasons. One, you will get the 
>performance gain, especially now with Tungsten (not sure how it relates to 
>Python, but some other knowledgeable people on the list, please chime in).
The more your workload uses DataFrames, the less of a difference there will be 
between the languages (Scala, Java, Python, or R) in terms of performance.
One of the main benefits of Catalyst (which DFs enable) is that it 
automatically optimizes DataFrame operations, letting you focus on _what_ you 
want while Spark will take care of figuring out _how_.
Tungsten takes things further by tightly managing memory using the type 
information made available to it via DataFrames. This benefit comes into play 
regardless of the language used.
So in short, DataFrames are the "new RDD"--i.e. the new base structure you 
should be using in your Spark programs wherever possible. And with DataFrames, 
what language you use matters much less in terms of performance.
Nick
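A small PySpark sketch of the hand-off being discussed, going from an unstructured RDD of raw text to a DataFrame once some structure has been derived (assumptions: Spark 1.x SQLContext and a two-field schema made up purely for illustration):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="rdd-to-dataframe")
    sqlContext = SQLContext(sc)

    docs = sc.parallelize(["first scraped page ...", "second scraped page ..."])  # stand-in for raw documents
    rows = docs.map(lambda text: Row(length=len(text), preview=text[:20]))
    df = sqlContext.createDataFrame(rows)   # structure (and the Catalyst/Tungsten benefits) start here
    df.filter(df.length > 10).show()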
On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
Hello Joshua,
comments are inline...

On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
I haven't used Spark in the last year and a half. I am about to start a project 
with a new team, and we need to decide whether to use pyspark or Scala.
Indeed, good questions, and they do come up a lot in trainings that I have 
attended, where this inevitable question is raised. I believe it depends on 
your level of comfort zone or adventure into newer things.
True, for the most part the Apache Spark committers have been committed to 
keeping the APIs at parity across all the language offerings, even though in some 
cases, in particular Python, they have lagged by a minor release. The extent to 
which they're committed to level-parity is a good sign. It might be 
the case with some experimental APIs, where they lag behind, but for the most 
part, they have been admirably consistent. 
With Python there’s a minor performance hit, since there’s an extra level of 
indirection in the architecture and an additional Python PID that the executors 
launch to execute your pickled Python lambdas. Other than that it boils down to 
your comfort zone. I recommend looking at Sameer’s slides on (Advanced Spark 
for DevOps Training) where he walks through the pySpark and Python 
architecture. 

We are NOT a java shop. So some of the build tools/procedures will require some 
learning overhead if we go the Scala route. What I want to know is: is the 
Scala version of Spark still far enough ahead of pyspark to be well worth any 
initial training overhead?  
If you are a very advanced Python shop and if you’ve in-house libraries that 
you have written in Python that don’t exist in Scala or some ML libs that don’t 
exist in the Scala version and will require fair amount of porting and gap is 
too large, then perhaps it makes sense to stay put with Python.
However, I believe, investing (or having some members of your 

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni


Dataframes are essentially structured tables with schemas. So where does the 
non typed data sit before it becomes structured if not in a traditional RDD?
For us almost all the processing comes before there is structure to it.




Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Nicholas Chammas  
Date: 03/02/2016  5:13 PM  (GMT-05:00) 
To: Jules Damji , Joshua Sorrell  
Cc: user@spark.apache.org 
Subject: Re: Does pyspark still lag far behind the Scala API in terms of 
features 

> However, I believe, investing (or having some members of your group) learn 
>and invest in Scala is worthwhile for a few reasons. One, you will get the 
>performance gain, especially now with Tungsten (not sure how it relates to 
>Python, but some other knowledgeable people on the list, please chime in).
The more your workload uses DataFrames, the less of a difference there will be 
between the languages (Scala, Java, Python, or R) in terms of performance.
One of the main benefits of Catalyst (which DFs enable) is that it 
automatically optimizes DataFrame operations, letting you focus on _what_ you 
want while Spark will take care of figuring out _how_.
Tungsten takes things further by tightly managing memory using the type 
information made available to it via DataFrames. This benefit comes into play 
regardless of the language used.
So in short, DataFrames are the "new RDD"--i.e. the new base structure you 
should be using in your Spark programs wherever possible. And with DataFrames, 
what language you use matters much less in terms of performance.
Nick
On Tue, Mar 1, 2016 at 12:07 PM Jules Damji  wrote:
Hello Joshua,
comments are inline...

On Mar 1, 2016, at 5:03 AM, Joshua Sorrell  wrote:
I haven't used Spark in the last year and a half. I am about to start a project 
with a new team, and we need to decide whether to use pyspark or Scala.
Indeed, good questions, and they do come up a lot in trainings that I have 
attended, where this inevitable question is raised. I believe it depends on 
your level of comfort zone or adventure into newer things.
True, for the most part the Apache Spark committers have been committed to 
keeping the APIs at parity across all the language offerings, even though in some 
cases, in particular Python, they have lagged by a minor release. The extent to 
which they're committed to level-parity is a good sign. It might be 
the case with some experimental APIs, where they lag behind, but for the most 
part, they have been admirably consistent. 
With Python there’s a minor performance hit, since there’s an extra level of 
indirection in the architecture and an additional Python PID that the executors 
launch to execute your pickled Python lambdas. Other than that it boils down to 
your comfort zone. I recommend looking at Sameer’s slides on (Advanced Spark 
for DevOps Training) where he walks through the pySpark and Python 
architecture. 

We are NOT a java shop. So some of the build tools/procedures will require some 
learning overhead if we go the Scala route. What I want to know is: is the 
Scala version of Spark still far enough ahead of pyspark to be well worth any 
initial training overhead?  
If you are a very advanced Python shop and if you’ve in-house libraries that 
you have written in Python that don’t exist in Scala or some ML libs that don’t 
exist in the Scala version and will require fair amount of porting and gap is 
too large, then perhaps it makes sense to stay put with Python.
However, I believe, investing (or having some members of your group) learn and 
invest in Scala is worthwhile for a few reasons. One, you will get the 
performance gain, especially now with Tungsten (not sure how it relates to 
Python, but some other knowledgeable people on the list, please chime in). Two, 
since Spark is written in Scala, it gives you an enormous advantage to read 
sources (which are well documented and highly readable) should you have to 
consult or learn nuances of certain API method or action not covered 
comprehensively in the docs. And finally, there’s a long term benefit in 
learning Scala for reasons other than Spark. For example, writing other 
scalable and distributed applications.

Particularly, we will be using Spark Streaming. I know a couple of years ago 
that practically forced the decision to use Scala.  Is this still the case?
You’ll notice that certain APIs call are not available, at least for now, in 
Python. http://spark.apache.org/docs/latest/streaming-programming-guide.html

CheersJules
--
The Best Ideas Are Simple
Jules S. Damji
e-mail:dmat...@comcast.net
e-mail:jules.da...@gmail.com




RE: [DISCUSS] Update Roadmap

2016-02-27 Thread Darren Govoni


Looks fantastic moon.
Anything in the community with regards to easier debugging with specific 
backends? E.g. spark.
Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: moon soo Lee  
Date: 02/27/2016  3:48 PM  (GMT-05:00) 
To: us...@zeppelin.incubator.apache.org, dev@zeppelin.incubator.apache.org 
Subject: [DISCUSS] Update Roadmap 

Hi Zeppelin users and developers,
The roadmap we have published at 
https://cwiki.apache.org/confluence/display/ZEPPELIN/Zeppelin+Roadmap is 
almost 9 months old, and it doesn't reflect where the community is going anymore. 
It's time to update.
Based on the mailing list, jira issues, pull requests, feedback from users, 
conferences and meetings, I could summarize the major interests of users and 
developers into 7 categories: Enterprise ready, Usability improvement, 
Pluggability, Documentation, Backend integration, Notebook storage, and 
Visualization.
And I could list related subjects under each category:

Enterprise ready
- Authentication: Shiro authentication ZEPPELIN-548
- Authorization: Notebook authorization PR-681
- Security
- Multi-tenancy
- Stability

Usability Improvement
- UX improvement
- Better Table data support
  - Download data as csv, etc: PR-725, PR-714, PR-6, PR-89
  - Featureful table data display (pagination, etc)

Pluggability ZEPPELIN-533
- Pluggable visualization
- Dynamic Interpreter, notebook, visualization loading
- Repository and registry for pluggable components

Improve documentation
- Improve contents and readability
- More tutorials, examples

Interpreter
- Generic JDBC Interpreter (spark)
- R Interpreter
- Cluster manager for interpreter (Proposal)
- More interpreters

Notebook storage
- Versioning ZEPPELIN-540
- More notebook storages

Visualization
- More visualizations: PR-152, PR-728, PR-336, PR-321
- Customize graph (show/hide label, color, etc)

It will help anyone quickly get an overview of the project's interests and direction. 
And based on this roadmap, we can discuss and re-define the next release 0.6.0 
scope and its schedule.
What do you think? Any feedback would be appreciated.
Thanks,
moon



RE: How could I do this algorithm in Spark?

2016-02-25 Thread Darren Govoni


This might be hard to do. One generalization of this problem is 
https://en.m.wikipedia.org/wiki/Longest_path_problem
Given a node (e.g. A), find the longest path. All interior relations are transitive 
and can be inferred.
But finding a distributed Spark way of doing it in P time would be interesting.

Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Guillermo Ortiz  
Date: 02/24/2016  5:26 PM  (GMT-05:00) 
To: user  
Subject: How could I do this algorithm in Spark? 

I want to do some algorithm in Spark. I know how to do it in a single machine 
where all data are together, but I don't know a good way to do it in Spark. 
If someone has an idea...

I have some data like this:
a , b
x , y
b , c
y , y
c , d

I want something like:
a , d
b , d
c , d
x , y
y , y

I need to know that a->b->c->d, so a->d, b->d and c->d. I don't want the code, 
just an idea of how I could deal with it. 
Any idea?
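One possible RDD-only sketch for data shaped like the example above (assumptions: a local SparkContext; chains are simple as in the example, with no cycles other than the y , y self-loop): iteratively advance each pair's right-hand endpoint along the edge list until nothing changes.

    from pyspark import SparkContext

    sc = SparkContext(appName="chain-endpoints")
    edges = sc.parallelize([("a", "b"), ("x", "y"), ("b", "c"), ("y", "y"), ("c", "d")]).cache()

    furthest = edges   # (start, current endpoint)
    while True:
        advanced = (furthest.map(lambda kv: (kv[1], kv[0]))          # key by the current endpoint
                            .leftOuterJoin(edges)                    # (end, (start, next-or-None))
                            .map(lambda kv: (kv[1][0],
                                             kv[1][1] if kv[1][1] is not None else kv[0])))
        if advanced.subtract(furthest).isEmpty():                    # fixed point reached
            break
        furthest = advanced

    print(sorted(furthest.collect()))
    # -> [('a', 'd'), ('b', 'd'), ('c', 'd'), ('x', 'y'), ('y', 'y')]

Longer cycles in the data would keep this from converging, so they would need separate handling.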


RE: Unusually large deserialisation time

2016-02-16 Thread Darren Govoni


I meant to write 'last task in stage'.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Darren Govoni <dar...@ontrenet.com> 
Date: 02/16/2016  6:55 AM  (GMT-05:00) 
To: Abhishek Modi <abshkm...@gmail.com>, user@spark.apache.org 
Subject: RE: Unusually large deserialisation time 



I think this is part of the bigger issue of serious deadlock conditions 
occurring in spark many of us have posted on.
Would the task in question be the past task of a stage by chance?


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Abhishek Modi <abshkm...@gmail.com> 
Date: 02/16/2016  4:12 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Unusually large deserialisation time 

I'm doing a mapPartitions on a rdd cached in memory followed by a reduce. Here 
is my code snippet 

// myRdd is an rdd consisting of Tuple2[Int,Long] 
myRdd.mapPartitions(rangify).reduce( (x,y) => (x._1+y._1,x._2 ++ y._2)) 

//The rangify function 
def rangify(l: Iterator[ Tuple2[Int,Long] ]) : Iterator[ Tuple2[Long, List [ 
ArrayBuffer[ Tuple2[Long,Long] ] ] ] ]= { 
  var sum=0L 
  val mylist=ArrayBuffer[ Tuple2[Long,Long] ]() 

  if(l.isEmpty) 
    return List( (0L,List [ ArrayBuffer[ Tuple2[Long,Long] ] ] ())).toIterator 

  var prev= -1000L 
  var begin= -1000L 

  for (x <- l){ 
    sum+=x._1 

    if(prev<0){ 
      prev=x._2 
      begin=x._2 
    } 

    else if(x._2==prev+1) 
      prev=x._2 

    else { 
      mylist+=((begin,prev)) 
      prev=x._2 
      begin=x._2 
    } 
  } 

  mylist+= ((begin,prev)) 

  List((sum, List(mylist) ) ).toIterator 
} 


The rdd is cached in memory. I'm using 20 executors with 1 core for each 
executor. The cached rdd has 60 blocks. The problem is for every 2-3 runs of 
the job, there is a task which has an abnormally large deserialisation time. 
Screenshot attached 

Thank you,
Abhishek




RE: Unusually large deserialisation time

2016-02-16 Thread Darren Govoni


I think this is part of the bigger issue of serious deadlock conditions 
occurring in spark many of us have posted on.
Would the task in question be the past task of a stage by chance?


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Abhishek Modi  
Date: 02/16/2016  4:12 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Unusually large deserialisation time 

I'm doing a mapPartitions on a rdd cached in memory followed by a reduce. Here 
is my code snippet 

// myRdd is an rdd consisting of Tuple2[Int,Long] 
myRdd.mapPartitions(rangify).reduce( (x,y) => (x._1+y._1,x._2 ++ y._2)) 

//The rangify function 
def rangify(l: Iterator[ Tuple2[Int,Long] ]) : Iterator[ Tuple2[Long, List [ 
ArrayBuffer[ Tuple2[Long,Long] ] ] ] ]= { 
  var sum=0L 
  val mylist=ArrayBuffer[ Tuple2[Long,Long] ]() 

  if(l.isEmpty) 
    return List( (0L,List [ ArrayBuffer[ Tuple2[Long,Long] ] ] ())).toIterator 

  var prev= -1000L 
  var begin= -1000L 

  for (x <- l){ 
    sum+=x._1 

    if(prev<0){ 
      prev=x._2 
      begin=x._2 
    } 

    else if(x._2==prev+1) 
      prev=x._2 

    else { 
      mylist+=((begin,prev)) 
      prev=x._2 
      begin=x._2 
    } 
  } 

  mylist+= ((begin,prev)) 

  List((sum, List(mylist) ) ).toIterator 
} 


The rdd is cached in memory. I'm using 20 executors with 1 core for each 
executor. The cached rdd has 60 blocks. The problem is for every 2-3 runs of 
the job, there is a task which has an abnormally large deserialisation time. 
Screenshot attached 

Thank you,
Abhishek




Re: Launching EC2 instances with Spark compiled for Scala 2.11

2016-01-25 Thread Darren Govoni


Why not deploy it. Then build a custom distribution with Scala 2.11 and just 
overlay it.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Nuno Santos  
Date: 01/25/2016  7:38 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Re: Launching EC2 instances with Spark compiled for Scala 2.11 

Hello, 

Any updates on this question? I'm also very interested in a solution, as I'm
trying to use Spark on EC2 but need Scala 2.11 support. The scripts in the
ec2 directory of the Spark distribution install use Scala 2.10 by default
and I can't see any obvious option to change to Scala 2.11. 

Regards, 
Nuno



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Launching-EC2-instances-with-Spark-compiled-for-Scala-2-11-tp24979p26059.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: 10hrs of Scheduler Delay

2016-01-25 Thread Darren Govoni


Yeah. I have screenshots and stack traces. I will post them to the ticket. 
Nothing informative.
I should also mention I'm using pyspark but I think the deadlock is inside the 
Java scheduler code.



Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B" <sande...@rose-hulman.edu> 
Date: 01/25/2016  8:59 AM  (GMT-05:00) 
To: Ted Yu <yuzhih...@gmail.com> 
Cc: Darren Govoni <dar...@ontrenet.com>, Renu Yadav <yren...@gmail.com>, Muthu 
Jayakumar <bablo...@gmail.com>, user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 



Is the thread dump the stack trace you are talking about? If so, I will see if 
I can capture the few different stages I have seen it in.



Thanks for the help, I was able to do it for 0.1% of my data. I will create the 
JIRA.



Thanks,
Isaac


On Jan 25, 2016, at 8:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:







Opening a JIRA is fine. 



See if you can capture stack trace during the hung stage and attach to JIRA so 
that we have more clue. 



Thanks


On Jan 25, 2016, at 4:25 AM, Darren Govoni <dar...@ontrenet.com> wrote:






Probably we should open a ticket for this.
There's definitely a deadlock situation occurring in spark under certain 
conditions.



The only clue I have is it always happens on the last stage. And it does seem 
sensitive to scale. If my job has 300mb of data I'll see the deadlock. But if I 
only run 10mb of it, it will succeed. This suggests a serious fundamental scaling 
problem.



Workers have plenty of resources.










Sent from my Verizon Wireless 4G LTE smartphone





 Original message 

From: "Sanders, Isaac B" <sande...@rose-hulman.edu>


Date: 01/24/2016 2:54 PM (GMT-05:00) 

To: Renu Yadav <yren...@gmail.com> 

Cc: Darren Govoni <dar...@ontrenet.com>, Muthu Jayakumar <bablo...@gmail.com>, 
Ted Yu <yuzhih...@gmail.com>,
user@spark.apache.org 

Subject: Re: 10hrs of Scheduler Delay 



I am not getting anywhere with any of the suggestions so far. :(



Trying some more outlets, I will share any solution I find.



- Isaac




On Jan 23, 2016, at 1:48 AM, Renu Yadav <yren...@gmail.com> wrote:



If you turn on spark.speculation on then that might help. it worked  for me




On Sat, Jan 23, 2016 at 3:21 AM, Darren Govoni 
<dar...@ontrenet.com> wrote:



Thanks for the tip. I will try it. But this is the kind of thing spark is 
supposed to figure out and handle. Or at least not get stuck forever.











Sent from my Verizon Wireless 4G LTE smartphone





 Original message ----



From: Muthu Jayakumar <bablo...@gmail.com>


Date: 01/22/2016 3:50 PM (GMT-05:00) 

To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" 
<sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com>


Cc: user@spark.apache.org


Subject: Re: 10hrs of Scheduler Delay 



Does increasing the number of partitions help? You could try out something 3 
times what you currently have. 
Another trick i used was to partition the problem into multiple dataframes and 
run them sequentially, persist the result, and then run a union on the 
results. 



Hope this helps. 




On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:




Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.












Sent from my Verizon Wireless 4G LTE smartphone





 Original message 


From: "Sanders, Isaac B" <sande...@rose-hulman.edu>


Date: 01/21/2016 11:18 PM (GMT-05:00) 

To: Ted Yu <yuzhih...@gmail.com>


Cc: user@spark.apache.org


Subject: Re: 10hrs of Scheduler Delay 




I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/Distance

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Darren Govoni


Probably we should open a ticket for this. There's definitely a deadlock 
situation occurring in spark under certain conditions.
The only clue I have is it always happens on the last stage. And it does seem 
sensitive to scale. If my job has 300mb of data I'll see the deadlock. But if I 
only run 10mb of it, it will succeed. This suggests a serious fundamental scaling 
problem.
Workers have plenty of resources.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B" <sande...@rose-hulman.edu> 
Date: 01/24/2016  2:54 PM  (GMT-05:00) 
To: Renu Yadav <yren...@gmail.com> 
Cc: Darren Govoni <dar...@ontrenet.com>, Muthu Jayakumar <bablo...@gmail.com>, 
Ted Yu <yuzhih...@gmail.com>, user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 






I am not getting anywhere with any of the suggestions so far. :(



Trying some more outlets, I will share any solution I find.



- Isaac




On Jan 23, 2016, at 1:48 AM, Renu Yadav <yren...@gmail.com> wrote:



If you turn on spark.speculation on then that might help. it worked  for me




On Sat, Jan 23, 2016 at 3:21 AM, Darren Govoni 
<dar...@ontrenet.com> wrote:



Thanks for the tip. I will try it. But this is the kind of thing spark is 
supposed to figure out and handle. Or at least not get stuck forever.











Sent from my Verizon Wireless 4G LTE smartphone





 Original message 



From: Muthu Jayakumar <bablo...@gmail.com>


Date: 01/22/2016 3:50 PM (GMT-05:00) 

To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" 
<sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com>


Cc: user@spark.apache.org


Subject: Re: 10hrs of Scheduler Delay 



Does increasing the number of partitions help? You could try out something 3 
times what you currently have. 
Another trick i used was to partition the problem into multiple dataframes and 
run them sequentially, persist the result, and then run a union on the 
results. 



Hope this helps. 




On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:




Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.












Sent from my Verizon Wireless 4G LTE smartphone





 Original message 


From: "Sanders, Isaac B" <sande...@rose-hulman.edu>


Date: 01/21/2016 11:18 PM (GMT-05:00) 

To: Ted Yu <yuzhih...@gmail.com>


Cc: user@spark.apache.org


Subject: Re: 10hrs of Scheduler Delay 




I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala



- Isaac




On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have noticed the following - did this indicate prolonged computation in 
your code ?




Re: 10hrs of Scheduler Delay

2016-01-22 Thread Darren Govoni


Thanks for the tip. I will try it. But this is the kind of thing spark is 
supposed to figure out and handle. Or at least not get stuck forever.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Muthu Jayakumar <bablo...@gmail.com> 
Date: 01/22/2016  3:50 PM  (GMT-05:00) 
To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" 
<sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 

Does increasing the number of partitions help? You could try out something 3 
times what you currently have. Another trick i used was to partition the 
problem into multiple dataframes and run them sequentially, persist the 
result, and then run a union on the results. 
Hope this helps. 
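A small PySpark sketch that combines the two suggestions from this thread (assumptions: placeholder app name and input path; spark.speculation re-launches straggler tasks, and the repartition roughly triples the current parallelism):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("dbscan-tuning")                     # placeholder name
            .set("spark.speculation", "true"))               # re-run straggler tasks speculatively
    sc = SparkContext(conf=conf)

    points = sc.textFile("hdfs:///data/points.csv")          # placeholder input
    points = points.repartition(points.getNumPartitions() * 3)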

On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:


Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B" <sande...@rose-hulman.edu> 
Date: 01/21/2016  11:18 PM  (GMT-05:00) 
To: Ted Yu <yuzhih...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 


I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala



- Isaac




On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have noticed the following - did this indicate prolonged computation in 
your code ?


org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)




On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



Hadoop is: HDP 2.3.2.0-2950



Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b



Thanks







On Jan 21, 2016, at 7:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:



Looks like you were running on YARN.



What hadoop version are you using ?



Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?



Thanks



On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 







On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Darren Govoni


Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B"  
Date: 01/21/2016  11:18 PM  (GMT-05:00) 
To: Ted Yu  
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 


I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu  wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
 wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala



- Isaac




On Jan 21, 2016, at 10:08 PM, Ted Yu  wrote:



You may have noticed the following - did this indicate prolonged computation in 
your code ?


org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)




On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
 wrote:



Hadoop is: HDP 2.3.2.0-2950



Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b



Thanks







On Jan 21, 2016, at 7:44 PM, Ted Yu  wrote:



Looks like you were running on YARN.



What hadoop version are you using ?



Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?



Thanks



On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
 wrote:



The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 







On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
 wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.



I have tried:

- Increasing heap sizes and numbers of cores

- More/less executors with different amounts of resources.

- Kryo Serialization

- FAIR Scheduling



It doesn’t seem like it should require this much. Any ideas?



- Isaac





















































Re: 10hrs of Scheduler Delay

2016-01-21 Thread Darren Govoni


I've experienced this same problem. Always the last stage hangs. Indeterminate. 
No errors in logs. I run spark 1.5.2. Can't find an explanation. But it's 
definitely a showstopper.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Ted Yu  
Date: 01/21/2016  7:44 PM  (GMT-05:00) 
To: "Sanders, Isaac B"  
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 

Looks like you were running on YARN.
What hadoop version are you using ?
Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?
Thanks
On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B  
wrote:





The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 





On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
 wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.



I have tried:

- Increasing heap sizes and numbers of cores

- More/less executors with different amounts of resources.

- Kryo Serialization

- FAIR Scheduling



It doesn’t seem like it should require this much. Any ideas?



- Isaac















Re: Docker/Mesos with Spark

2016-01-19 Thread Darren Govoni


I also would be interested in some best practice for making this work.
Where will the writeup be posted? On mesosphere website?


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Sathish Kumaran Vairavelu  
Date: 01/19/2016  7:00 PM  (GMT-05:00) 
To: Tim Chen  
Cc: John Omernik , user  
Subject: Re: Docker/Mesos with Spark 

Thank you! Looking forward for it..

On Tue, Jan 19, 2016 at 4:03 PM Tim Chen  wrote:
Hi Sathish,
Sorry about that, I think that's a good idea and I'll write up a section in the 
Spark documentation page to explain how it can work. We (Mesosphere) have been 
doing this for our DCOS spark for our past releases and has been working well 
so far.
Thanks!
Tim
On Tue, Jan 19, 2016 at 12:28 PM, Sathish Kumaran Vairavelu 
 wrote:
Hi Tim

Do you have any materials/blog for running Spark in a container in Mesos 
cluster environment? I have googled it but couldn't find info on it. Spark 
documentation says it is possible, but no details provided.. Please help


Thanks 

Sathish



On Mon, Sep 21, 2015 at 11:54 AM Tim Chen  wrote:
Hi John,
There is no other blog post yet. I'm thinking of doing a series of posts but so 
far haven't had time to do that yet.
Running Spark in docker containers makes distributing spark versions easy; it's 
simple to upgrade, and the image automatically caches on the slaves so the same image just 
runs right away. Most of the docker perf cost is usually related to network and 
filesystem overheads, but I think with the recent changes in Spark to make the Mesos 
sandbox the default temp dir, the filesystem won't be a big concern as it's mostly 
writing to the mounted-in Mesos sandbox. Also, Mesos uses the host network by 
default, so the network isn't affected much.
Most of the cluster mode limitation is that you need to make the spark job 
files available somewhere that all the slaves can access remotely (http, s3, 
hdfs, etc) or available on all slaves locally by path. 
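A sketch of the configuration side of this, using Spark's documented Mesos Docker settings (assumptions: the image name, the Spark home inside the image, and the Mesos master URL are all placeholders):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("mesos://zk://zk1:2181/mesos")                        # placeholder Mesos master
            .setAppName("dockerized-spark")
            .set("spark.mesos.executor.docker.image", "myrepo/spark:1.5.2")  # placeholder image
            .set("spark.mesos.executor.home", "/opt/spark"))                 # Spark location inside the image
    sc = SparkContext(conf=conf)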
I'll try to make more doc efforts once I get my existing patches and testing 
infra work done.
Let me know if you have more questions,
Tim
On Sat, Sep 19, 2015 at 5:42 AM, John Omernik  wrote:
I was searching in the 1.5.0 docs on the Docker on Mesos capabilities and just 
found you CAN run it this way.  Are there any user posts, blog posts, etc on 
why and how you'd do this? 
Basically, at first I was questioning why you'd run spark in a docker 
container, i.e., if you run with tar balled executor, what are you really 
gaining?  And in this setup, are you losing out on performance somehow? (I am 
guessing smarter people than I have figured that out).  
Then I came along a situation where I wanted to use a python library with 
spark, and it had to be installed on every node, and I realized one big 
advantage of dockerized spark would be that spark apps that needed other 
libraries could be contained and built well.   
OK, that's huge, let's do that.  For my next question, there are a lot of 
questions I have about how this actually works.  Does cluster mode/client mode 
apply here? If so, how?  Is there a good walkthrough on getting this set up? 
Limitations? Gotchas?  Should I just dive in and start working with it? Has 
anyone done any stories/rough documentation? This seems like a really helpful 
feature to scaling out spark, and letting developers truly build what they need 
without tons of admin overhead, so I really want to explore. 
Thanks!
John








Re: rdd.foreach return value

2016-01-18 Thread Darren Govoni


What's the rationale behind that? It certainly limits the kind of flow logic we 
can do in one statement.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: David Russell  
Date: 01/18/2016  10:44 PM  (GMT-05:00) 
To: charles li  
Cc: user@spark.apache.org 
Subject: Re: rdd.foreach return value 

The foreach operation on RDD has a void (Unit) return type. See attached. So 
there is no return value to the driver.
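A minimal PySpark illustration of the point (not the original snippet, which 
was not preserved in the archive): foreach returns None on the driver, so side 
effects happen on the workers and values have to come back via an action such 
as collect or take.

from pyspark import SparkContext

sc = SparkContext("local[2]", "foreach-demo")
rdd = sc.parallelize(range(5))

def show(x):
    print(x)                     # executes on the worker, not the driver

result = rdd.foreach(show)       # returns None (Unit) to the driver
print(result)                    # -> None

squares = rdd.map(lambda x: x * x).collect()   # use map + collect to get values back
print(squares)                   # -> [0, 1, 4, 9, 16]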

David

"All that is gold does not glitter, Not all those who wander are lost."


 Original Message 
Subject: rdd.foreach return value
Local Time: January 18 2016 10:34 pm
UTC Time: January 19 2016 3:34 am
From: charles.up...@gmail.com
To: user@spark.apache.org

code snippet



the 'print' actually prints info on the worker node, but I'm confused about where 
the 'return' value goes, because I get nothing on the driver node.
-- 
--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao



Task hang problem

2015-12-29 Thread Darren Govoni


Hi,

  I've had this nagging problem where a task will hang and the
entire job hangs. Using pyspark. Spark 1.5.1



The job output looks like this, and hangs after the last task:



..

15/12/29 17:00:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in
memory on 10.65.143.174:34385 (size: 5.8 KB, free: 2.1 GB)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 15.0 in stage
0.0 (TID 15) in 11668 ms on 10.65.143.174 (29/32)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 23.0 in stage
0.0 (TID 23) in 11684 ms on 10.65.143.174 (30/32)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 7.0 in stage
0.0 (TID 7) in 11717 ms on 10.65.143.174 (31/32)

{nothing here for a while, ~6mins}





Here is the executor status, from UI.





  

  Index  ID  Attempt  Status   Locality       Executor ID / Host   Launch Time           Duration
  31     31  0        RUNNING  PROCESS_LOCAL  2 / 10.65.143.174    2015/12/29 17:00:28   6.8 min
  (remaining timing/size columns: 0 ms, 0 ms, 60 ms, 0 ms, 0 ms, 0.0 B)

  



Here is executor 2 from 10.65.143.174. I never see task 31 get to the
executor. Any ideas?



.

15/12/29 17:00:38 INFO TorrentBroadcast: Started reading broadcast
variable 0

15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(5979) called
with curMem=0, maxMem=2223023063

15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 5.8 KB, free 2.1 GB)

15/12/29 17:00:38 INFO TorrentBroadcast: Reading broadcast variable
0 took 208 ms

15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(8544) called
with curMem=5979, maxMem=2223023063

15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 8.3 KB, free 2.1 GB)

15/12/29 17:00:39 INFO PythonRunner: Times: total = 913, boot = 747,
init = 166, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 15.0 in stage 0.0
(TID 15). 967 bytes result sent to driver

15/12/29 17:00:39 INFO PythonRunner: Times: total = 955, boot = 735,
init = 220, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 23.0 in stage 0.0
(TID 23). 967 bytes result sent to driver

15/12/29 17:00:39 INFO PythonRunner: Times: total = 970, boot = 812,
init = 158, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 7.0 in stage 0.0 (TID
7). 967 bytes result sent to driver

[root@ip-10-65-143-174 2]$ 


Sent from my Verizon Wireless 4G LTE smartphone

Re: Task hang problem

2015-12-29 Thread Darren Govoni

  

  
  
here's executor trace.

  

  
  
Thread 58: Executor task launch
worker-3 (RUNNABLE)

  
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:152)
java.net.SocketInputStream.read(SocketInputStream.java:122)
java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
java.io.BufferedInputStream.read(BufferedInputStream.java:254)
java.io.DataInputStream.readInt(DataInputStream.java:387)
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:88)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
  

  
  
Thread 41: BLOCK_MANAGER cleanup
timer (WAITING)

  
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.util.TimerThread.mainLoop(Timer.java:526)
java.util.TimerThread.run(Timer.java:505)
  

  
  
Thread 42: BROADCAST_VARS cleanup
timer (WAITING)

  
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.util.TimerThread.mainLoop(Timer.java:526)
java.util.TimerThread.run(Timer.java:505)
  

  
  
Thread 54: driver-heartbeater
(TIMED_WAITING)

  
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
  

  
  
Thread 3: Finalizer (WAITING)

  
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
  

  
  
Thread 25:
ForkJoinPool-3-worker-15 (WAITING)

  
sun.misc.Unsafe.park(Native Method)
scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  

  
  
Thread 35: Hashed wheel timer #2
(TIMED_WAITING)

  
java.lang.Thread.sleep(Native Method)
org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:483)
org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:392)
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
java.lang.Thread.run(Thread.java:745)
  

  
  
Thread 68: Idle Worker Monitor
for /usr/bin/python2.7 (TIMED_WAITING)

  
java.lang.Thread.sleep(Native Method)
org.apache.spark.api.python.PythonWorkerFactory$MonitorThread.run(PythonWorkerFactory.scala:229)
  

  
  
Thread 1: main (WAITING)

  
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:819)

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Darren Govoni


I'll throw a thought in here.
DataFrames are nice if your data is uniform and clean with a consistent schema.
However, in many big data problems this is seldom the case. 


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Chris Fregly  
Date: 12/28/2015  5:22 PM  (GMT-05:00) 
To: Richard Eggert  
Cc: Daniel Siegmann , Divya Gehlot 
, "user @spark"  
Subject: Re: DataFrame Vs RDDs ... Which one to use When ? 

here's a good article that sums it up, in my opinion: 
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
basically, building apps with RDDs is like building apps with primitive 
JVM bytecode.  haha.
@richard:  remember that even if you're currently writing RDDs in Java/Scala, 
you're not gaining the code gen/rewrite performance benefits of the Catalyst 
optimizer.
i agree with @daniel who suggested that you start with DataFrames and revert to 
RDDs only when DataFrames don't give you what you need.
the only time i use RDDs directly these days is when i'm dealing with a Spark 
library that has not yet moved to DataFrames - ie. GraphX - and it's kind of 
annoying switching back and forth.
almost everything you need should be in the DataFrame API.
Datasets are similar to RDDs, but give you strong compile-time typing, tabular 
structure, and Catalyst optimizations.
hopefully Datasets is the last API we see from Spark SQL...  i'm getting tired 
of re-writing slides and book chapters!  :)
On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert  
wrote:
One advantage of RDDs over DataFrames is that RDDs allow you to use your own 
data types, whereas DataFrames are backed by RDDs of Row objects, which are 
pretty flexible but don't give you much in the way of compile-time type 
checking. If you have an RDD of case class elements or JSON, then Spark SQL can 
automatically figure out how to convert it into an RDD of Row objects (and 
therefore a DataFrame), but there's no way to automatically go the other way 
(from DataFrame/Row back to custom types).
In general, you can ultimately do more with RDDs than DataFrames, but 
DataFrames give you a lot of niceties (automatic query optimization, table 
joins, SQL-like syntax, etc.) for free, and can avoid some of the runtime 
overhead associated with writing RDD code in a non-JVM language (such as Python 
or R), since the query optimizer is effectively creating the required JVM code 
under the hood. There's little to no performance benefit if you're already 
writing Java or Scala code, however (and RDD-based code may actually perform 
better in some cases, if you're willing to carefully tune your code).
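A short PySpark sketch of the round trip described above (names are 
illustrative): RDD-to-DataFrame conversion is automatic for structured records, 
while going back to your own types is a manual map over Row objects.

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[2]", "rdd-vs-dataframe")
sqlContext = SQLContext(sc)

people_rdd = sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=29)])
people_df = sqlContext.createDataFrame(people_rdd)   # schema inferred from the Rows

people_df.filter(people_df.age > 30).show()          # goes through the Catalyst optimizer

# DataFrame -> RDD of Rows; mapping back to custom types is up to you.
names = people_df.rdd.map(lambda row: row.name).collect()
print(names)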
On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann  
wrote:
DataFrames are a higher level API for working with tabular data - RDDs are used 
underneath. You can use either and easily convert between them in your code as 
necessary.

DataFrames provide a nice abstraction for many cases, so it may be easier to 
code against them. Though if you're used to thinking in terms of collections 
rather than tables, you may find RDDs more natural. DataFrames can also be 
faster, since Spark will do some optimizations under the hood - if you are 
using PySpark, this avoids much of the Python-to-JVM serialization overhead. 
DataFrames may also perform better if you're reading structured data, such as a Hive table or Parquet files.

I recommend you prefer data frames, switching over to RDDs as necessary (when 
you need to perform an operation not supported by data frames / Spark SQL).

HOWEVER (and this is a big one), Spark 1.6 will have yet another API - 
datasets. The release of Spark 1.6 is currently being finalized and I would 
expect it in the next few days. You will probably want to use the new API once 
it's available.


On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot  wrote:
Hi,
I am new bee to spark and a bit confused about RDDs and DataFames in Spark.
Can somebody explain me with the use cases which one to use when ?

Would really appreciate the clarification .

Thanks,
Divya 






-- 
Rich




-- 

Chris Fregly | Principal Data Solutions Engineer | IBM Spark Technology Center, 
San Francisco, CA
http://spark.tc | http://advancedspark.com



Question & Suggestions

2015-12-17 Thread Darren Govoni

Hi,
  Fantastic tool. Some suggestions. Please excuse if they are addressed 
already.


1. Can output from a note be forced to scroll (that is have a scrollbar) 
beyond a certain size?

2. Any planned support to drag notes and relocate them?
3. Can I write an interpreter as a 'service' and simply register it? Or 
does it have to be a compiled server-side class?
If so, it would be nice to define an interpreter JSON/REST protocol so I 
can write one in say Python or whatever.

4. Any plans to allow notes to be 'stacked'?

Question

1. I saw the theme guide, but it wasn't clear how to create my own 
header/theme bar. How can I do this?


thanks!
Darren


Re: Scala VS Java VS Python

2015-12-16 Thread Darren Govoni


I use Python too. I'm actually surprised it's not the primary language, since it 
is by far more used in data science than Java and Scala combined.
If I had a second choice of scripting language for general apps I'd want Groovy 
over Scala.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Daniel Lopes  
Date: 12/16/2015  4:16 PM  (GMT-05:00) 
To: Daniel Valdivia  
Cc: user  
Subject: Re: Scala VS Java VS Python 

For me Scala is better since Spark is written in Scala, and I like Python because I 
always used Python for data science. :)
On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia  
wrote:
Hello,



This is more of a "survey" question for the community, you can reply to me 
directly so we don't flood the mailing list.



I'm having a hard time learning Spark using Python since the API seems to be 
slightly incomplete, so I'm looking at my options to start doing all my apps in 
either Scala or Java. Being a Java developer, Java 1.8 looks like the logical 
way; however, I'd like to ask here what's the most common (Scala or Java), since 
I'm observing mixed results in the social documentation, however Scala seems to 
be the predominant language for spark examples.



Thank for the advice

-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org






-- 
Daniel Lopes, B.Eng | Data Scientist - BankFacil | CREA/SP 5069410560
Mob +55 (18) 99764-2733 | Ph +55 (11) 3522-8009 | http://about.me/dannyeuu
Av. Nova Independência, 956, São Paulo, SP | Bairro Brooklin Paulista | CEP 04570-001
https://www.bankfacil.com.br




Re: Pyspark submitted app just hangs

2015-12-02 Thread Darren Govoni

The pyspark app stdout/err log shows this oddity.

Traceback (most recent call last):
  File "/root/spark/notebooks/ingest/XXX.py", line 86, in 
print pdfRDD.collect()[:5]
  File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 773, 
in collect
  File 
"/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
536, in __call__
  File 
"/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
364, in send_command
  File 
"/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
473, in send_command

  File "/usr/lib64/python2.7/socket.py", line 430, in readline
data = recv(1)
KeyboardInterrupt
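For context on where the interrupt lands: collect() materializes the entire RDD 
on the driver before the slice is taken. A lighter-weight way to peek at a few 
records is take(), shown below purely as an illustration of the API, not as the 
fix for this hang.

from pyspark import SparkContext

sc = SparkContext("local[2]", "peek-demo")
pdfRDD = sc.parallelize(range(100))   # stand-in for the RDD in the traceback

print(pdfRDD.collect()[:5])   # ships every record to the driver, then slices
print(pdfRDD.take(5))         # fetches only enough partitions to return 5 records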


On 12/02/2015 08:57 PM, Jim Lohse wrote:
Is this the stderr output from a worker? Are any files being written? 
Can you run in debug and see how far it's getting?


This by itself doesn't give me a direction to look without the actual logs 
from $SPARK_HOME or the stderr from the worker UI.


Just IMHO, but maybe someone knows what this means; it seems like it 
could be caused by a lot of things.


On 12/2/2015 6:48 PM, Darren Govoni wrote:

Hi all,
  Wondering if someone can provide some insight why this pyspark app 
is just hanging. Here is output.


...
15/12/03 01:47:05 INFO TaskSetManager: Starting task 21.0 in stage 
0.0 (TID 21, 10.65.143.174, PROCESS_LOCAL, 1794787 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 22.0 in stage 
0.0 (TID 22, 10.97.144.52, PROCESS_LOCAL, 1801814 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 23.0 in stage 
0.0 (TID 23, 10.65.67.146, PROCESS_LOCAL, 1823921 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 24.0 in stage 
0.0 (TID 24, 10.144.176.22, PROCESS_LOCAL, 1820713 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 25.0 in stage 
0.0 (TID 25, 10.65.143.174, PROCESS_LOCAL, 1850492 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 26.0 in stage 
0.0 (TID 26, 10.97.144.52, PROCESS_LOCAL, 1845557 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 27.0 in stage 
0.0 (TID 27, 10.65.67.146, PROCESS_LOCAL, 1876187 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 28.0 in stage 
0.0 (TID 28, 10.144.176.22, PROCESS_LOCAL, 2054748 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 29.0 in stage 
0.0 (TID 29, 10.65.143.174, PROCESS_LOCAL, 1967659 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 30.0 in stage 
0.0 (TID 30, 10.97.144.52, PROCESS_LOCAL, 1977909 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 31.0 in stage 
0.0 (TID 31, 10.65.67.146, PROCESS_LOCAL, 2084044 bytes)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 10.65.143.174:39356 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 10.144.176.22:40904 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 10.97.144.52:35646 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 10.65.67.146:44110 (size: 5.2 KB, free: 4.1 GB)


...

In the spark console, it says 0/32 tasks and just sits there. No 
movement.


Thanks in advance,
D

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Pyspark submitted app just hangs

2015-12-02 Thread Darren Govoni

Hi all,
  Wondering if someone can provide some insight why this pyspark app is 
just hanging. Here is output.


...
15/12/03 01:47:05 INFO TaskSetManager: Starting task 21.0 in stage 0.0 
(TID 21, 10.65.143.174, PROCESS_LOCAL, 1794787 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 22.0 in stage 0.0 
(TID 22, 10.97.144.52, PROCESS_LOCAL, 1801814 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 23.0 in stage 0.0 
(TID 23, 10.65.67.146, PROCESS_LOCAL, 1823921 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 24.0 in stage 0.0 
(TID 24, 10.144.176.22, PROCESS_LOCAL, 1820713 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 25.0 in stage 0.0 
(TID 25, 10.65.143.174, PROCESS_LOCAL, 1850492 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 26.0 in stage 0.0 
(TID 26, 10.97.144.52, PROCESS_LOCAL, 1845557 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 27.0 in stage 0.0 
(TID 27, 10.65.67.146, PROCESS_LOCAL, 1876187 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 28.0 in stage 0.0 
(TID 28, 10.144.176.22, PROCESS_LOCAL, 2054748 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 29.0 in stage 0.0 
(TID 29, 10.65.143.174, PROCESS_LOCAL, 1967659 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 30.0 in stage 0.0 
(TID 30, 10.97.144.52, PROCESS_LOCAL, 1977909 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 31.0 in stage 0.0 
(TID 31, 10.65.67.146, PROCESS_LOCAL, 2084044 bytes)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 10.65.143.174:39356 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 10.144.176.22:40904 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 10.97.144.52:35646 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 10.65.67.146:44110 (size: 5.2 KB, free: 4.1 GB)


...

In the spark console, it says 0/32 tasks and just sits there. No movement.

Thanks in advance,
D

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Managing flows

2015-11-11 Thread Darren Govoni

Hi again,
   Sorry for the noob questions. I am reading all the online material 
as much as possible.

But what hasn't jumped out at me yet is how flows are managed?

Are they saved, loaded, etc? I access my nifi and build a flow. Now I 
want to save it and work on another flow.

Lastly, will the flow be running even if I exit the webapp?

thanks for any tips. If I missed something obvious, regrets.

D


Re: Managing flows

2015-11-11 Thread Darren Govoni

Thanks Joe.

And it seems all the different flows would be seen on the one canvas, 
just not connected?


On 11/11/2015 10:02 AM, Joe Witt wrote:

Darren,

A single NiFi instance (on one node or a cluster of 10+) can handle
*many* different flows.

Thanks
Joe

On Wed, Nov 11, 2015 at 10:00 AM, Darren Govoni <dar...@ontrenet.com> wrote:

Mark,
Thanks for the tips. Appreciate it.

So when I run nifi on a single server, it is essentially "one flow"?
If I wanted to have say 2 or 3 active flows, I would (reasonably) have to
run more instances of nifi with appropriate
configuration to not conflict. Is that right?

Darren


On 11/11/2015 09:54 AM, Mark Petronic wrote:

Look in your Nifi conf directory. The active flow is there as an aptly
named .gz file. Guessing you could just rename that and restart Nifi,
which would create a new blank one. Build up another flow, then you
could repeat the same "copy to new file name" and restore some other
one to continue on some previous flow. I'm pretty new to Nifi, too,
so maybe there is another way. Also, you can create point-in-time
backups of your flow from the "Settings" dialog in the DFM. There is a
link that shows up in there to click. It will copy your master flow gz
to your conf/archive directory. You can create multiple snapshots of
your flow to retain change history. I actually gunzip my backups and
commit them to Git for a more formal change history tracking
mechanism.

Hope that helps.
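A small sketch of the backup idea described above (not an official NiFi tool): 
snapshot conf/flow.xml.gz with a timestamp before experimenting with a new 
flow. The NiFi install path is an assumption for illustration.

# Assumes NiFi's conf directory lives at /opt/nifi/conf; adjust as needed.
import gzip
import shutil
from datetime import datetime
from pathlib import Path

NIFI_CONF = Path("/opt/nifi/conf")                      # assumed install location
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")

src = NIFI_CONF / "flow.xml.gz"
backup = NIFI_CONF / "archive" / ("flow-%s.xml.gz" % stamp)
backup.parent.mkdir(exist_ok=True)
shutil.copy2(src, backup)                               # point-in-time copy

# Optionally gunzip the copy so plain-text diffs work nicely under Git.
with gzip.open(backup, "rb") as zipped, open(str(backup)[:-3], "wb") as plain:
    shutil.copyfileobj(zipped, plain)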

On Wed, Nov 11, 2015 at 9:45 AM, Darren Govoni <dar...@ontrenet.com>
wrote:

Hi again,
 Sorry for the noob questions. I am reading all the online material as
much as possible.
But what hasn't jumped out at me yet is how flows are managed?

Are they saved, loaded, etc? I access my nifi and build a flow. Now I
want
to save it and work on another flow.
Lastly, will the flow be running even if I exit the webapp?

thanks for any tips. If I missed something obvious, regrets.

D






Python Kafka support?

2015-11-10 Thread Darren Govoni

Hi,
 I read on this page 
http://spark.apache.org/docs/latest/streaming-kafka-integration.html 
about Python support for "receiverless" Kafka integration (Approach 2), 
but it says it's incomplete as of version 1.4.


Has this been updated in version 1.5.1?

Darren
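For anyone finding this thread later: the direct (receiverless) integration is 
exposed to Python through pyspark.streaming.kafka, roughly as in the sketch 
below (details vary by Spark version; broker and topic names are placeholders).

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "kafka-direct-demo")
ssc = StreamingContext(sc, batchDuration=5)

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],                                     # placeholder topic
    kafkaParams={"metadata.broker.list": "broker1:9092"})  # placeholder broker

stream.map(lambda kv: kv[1]).pprint()                      # print message values

ssc.start()
ssc.awaitTermination()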

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2015-11-04 Thread Darren Govoni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990509#comment-14990509
 ] 

Darren Govoni commented on SPARK-3789:
--

I think the reasons for this are dominated mostly by lack of awareness and maturity 
of GraphX. While the Scala and Java interfaces are there, those languages are 
not really "data science" languages, so without Python you might be seeing an 
inadvertent barrier or a perception that it (GraphX) is "not fully 
supported", thus turning people away.

> [GRAPHX] Python bindings for GraphX
> ---
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
> Attachments: PyGraphX_design_doc.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2015-11-04 Thread Darren Govoni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989807#comment-14989807
 ] 

Darren Govoni commented on SPARK-3789:
--

That the ticket has been open for over a year says nothing about its 
validity. It may mean the right talent has not been available to produce it. 
But as mentioned above, the work has been done by 3rd parties.

Also, it is an error to assume that the lack of activity on this ticket is some 
kind of litmus test for the value of this capability. Back in the real world, the 
rest of us find alternative solutions to get this capability but would very 
much like to have it unified into Spark. Python is the fastest-growing data 
science language right now, in a field that is also growing.

The entire world of ecommerce on the internet is driven by graph analytics 
(page rank, suggestions, also-viewed, etc.). While arcane to some, it is a 
very important and growing field of computer science, ESPECIALLY where big data 
is concerned. So it's best to set aside any conclusions drawn from the activity on 
this ticket.

The pull request is there; let's get it in.

> [GRAPHX] Python bindings for GraphX
> ---
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
> Attachments: PyGraphX_design_doc.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2015-11-04 Thread Darren Govoni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989807#comment-14989807
 ] 

Darren Govoni edited comment on SPARK-3789 at 11/4/15 4:14 PM:
---

That the ticket has been open for over a year says nothing about its 
validity. It may mean the right talent has not been available to produce it. 
But as mentioned above, the work has been done by 3rd parties.

Also, it is an error to assume that the lack of activity on this ticket is some 
kind of litmus test for the value of this capability. Back in the real world, the 
rest of us find alternative solutions to get this capability but would very 
much like to have it unified into Spark. Python is the fastest-growing data 
science language right now, in a field that is also growing.

The entire world of ecommerce on the internet is driven by graph analytics 
(page rank, suggestions, also-viewed, etc.). While arcane to some, it is a 
very important and growing field of computer science and web analytics, 
ESPECIALLY where big data is concerned. So it's best to set aside any 
conclusions drawn from the activity on this ticket.

The pull request is there; let's get it in.


was (Author: sesshomurai):
Because the ticket has been open for over a year says nothing about the 
validity of it. It may mean the correct talent is not available to produce it. 
But as mentioned above, the work has been done by 3rd parties.

Also, it is in error to assume that the lack of activity on this ticket is some 
kind of litmus for the value of this capability. Back in the real world, the 
rest of us find alternative solutions to get this capability but would very 
much like to have it unified into Spark. Python is the fast growing data 
science language right now in a field that is also growing.

The entire world of ecommerce on the internet is driven by graph analytics 
(page rank, suggestions, also viewed, etc. etc.) While arcane to some, is a 
very important and growing field of computer science. ESPECIALLY where big data 
is concerned. So its best to set aside any conclusions from the activity of 
this ticket.

The pull request is there, let's get it in.

> [GRAPHX] Python bindings for GraphX
> ---
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
> Attachments: PyGraphX_design_doc.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2015-11-04 Thread Darren Govoni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989940#comment-14989940
 ] 

Darren Govoni commented on SPARK-3789:
--

I guess by that logic, why does GraphX even exist? The issue here is really 
whether a Python binding is pertinent or not. If GraphX has a purpose in life 
then a Python binding is useful, the same as it is for Scala or Java. Otherwise the 
whole case for language independence falls apart too, e.g. how many people use 
Scala for data analytics?

> [GRAPHX] Python bindings for GraphX
> ---
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
> Attachments: PyGraphX_design_doc.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


