Re: Common data exchange formats and tabular data

2016-01-01 Thread Joe Witt
Toivo - this thread seems important and does not appear to have come
to a resolution.  Do you want to pick this back up, or are you
comfortable with where it is for now?



Re: Common data exchange formats and tabular data

2015-12-02 Thread dcave
Adding support for multiple input and output formats would complicate the
usability and ongoing maintenance of the SQL/NoSQL processors.
Additionally, as you suggested, it is impossible to select a "correct" format
or set of formats that can handle all potential needs.

A simpler and more streamlined solution is to put the emphasis on having
Convert processors available that handle specific cases as they come up,
as your last comment suggested.  This also keeps each processor focused on
one specific task, rather than having Get/Put/Convert hybrids that can lead
to unneeded complexity and code bloat.

This is also consistent with Benjamin's line of work.





Re: Common data exchange formats and tabular data

2015-11-08 Thread Toivo Adams
All,

Benjamin has already done a lot of good work, and it would be very helpful
if we could agree on how to move forward.
https://issues.apache.org/jira/browse/NIFI-901

My first post was naive; there are many more things to consider.

It is probably impossible to select a single "correct" data exchange format
that all processors should use.

But can we agree on one or two preferred data formats that the SQL and NoSQL
processors should support, with all other formats supported via converter
processors?

In my opinion, the preferred data exchange format should:

1. Support a schema in one way or another.

2. Support streaming.

3. Support different data types (strings, numeric types, date/time, binary).

4. Be fast and efficient to serialize and deserialize.

5. Be widely used and have strong supporters.

6. Be usable for transformations, filtering, joins, splits, etc.

7. Be relatively easy to convert to and from other formats.

Nice to have:

1. Nested data structures. For example, orders can contain order rows (see
the example schema below).
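
For illustration, a minimal sketch of what such nesting could look like as
an Avro schema; the record and field names here are invented for the example:

{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "customer", "type": "string"},
    {"name": "rows", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "OrderRow",
        "fields": [
          {"name": "product", "type": "string"},
          {"name": "quantity", "type": "int"},
          {"name": "price", "type": "double"}
        ]
      }
    }}
  ]
}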


Or maybe we should recommend that all SQL and NoSQL processors support two
or more input/output formats, with the user selecting the format via
configuration? Or separate sets of processors for different formats?


Thanks
Toivo






Re: Common data exchange formats and tabular data

2015-11-08 Thread Toivo Adams
Matt,

Good overview.

>Avro is similar, but the schema must always be provided with the data. In
>the case of NiFi DataFlows, it's likely more efficient to send the schema
>once as an initialization packet (I can't remember the real term in NiFi),
>then the rows can be streamed individually, in batches of user-defined size,
>sampled, etc.

Do you mean "Initial Information Packet" or "IIP"?
Mr. Morrison's classical FBP includes such functionality; it is often used
for configuration.

As far as I know, NiFi doesn't have such a concept.

But NiFi's ExecuteSQL uses Avro with an embedded schema for query results.
The result is one big FlowFile that includes both the schema and all rows:
the processor creates the schema from the JDBC metadata, writes it to the
Avro container, and then writes all rows to the same container. Writing and
reading such a file is done using streaming, so the result can be very big.
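
For illustration, a minimal sketch of that pattern (not the actual
ExecuteSQL source; for brevity every column is treated as a nullable string,
whereas real code would map JDBC types to proper Avro types):

import java.io.OutputStream;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class JdbcToAvroSketch {

    // Derive a simple Avro schema from the JDBC result set metadata.
    static Schema schemaFor(ResultSetMetaData meta) throws Exception {
        SchemaBuilder.FieldAssembler<Schema> fields =
                SchemaBuilder.record("QueryResult").fields();
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            fields = fields.optionalString(meta.getColumnName(i));
        }
        return fields.endRecord();
    }

    // Write the schema once into the container header, then stream rows.
    static long writeRows(ResultSet rs, OutputStream out) throws Exception {
        Schema schema = schemaFor(rs.getMetaData());
        long count = 0;
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out); // schema goes into the file header once
            while (rs.next()) {         // rows are appended one at a time
                GenericRecord rec = new GenericData.Record(schema);
                for (Schema.Field f : schema.getFields()) {
                    Object v = rs.getObject(f.name());
                    rec.put(f.name(), v == null ? null : v.toString());
                }
                writer.append(rec);
                count++;
            }
        }
        return count;
    }
}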

Thanks
Toivo






Re: Common data exchange formats and tabular data

2015-11-02 Thread Matthew Burgess
Hello all,

I am new to the NiFi community, but I have a good amount of experience with
ETL tools and applications that process lots of tabular data. In my
experience, JSON is only useful as the common format for tabular data if it
has a "flat" schema, in which case there aren't any advantages for JSON over
other formats such as CSV. However, I've seen lots of "CSV" files that don't
seem to adhere to any standard, so I would presume NiFi would need a rigid
schema such as RFC 4180 (http://www.rfc-base.org/txt/rfc-4180.txt).

However, CSV isn't a natural way to express the schema of the rows, so JSON
or YAML is probably a better choice. There's a format called Tabular Data
Package that combines CSV and JSON for tabular data serialization:
http://dataprotocols.org/tabular-data-package/
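
For a concrete picture, a minimal datapackage.json descriptor for such a
package might look roughly like this (the file and field names are invented
for the example):

{
  "name": "orders",
  "resources": [
    {
      "path": "orders.csv",
      "schema": {
        "fields": [
          {"name": "order_id", "type": "integer"},
          {"name": "customer", "type": "string"},
          {"name": "ordered_at", "type": "datetime"},
          {"name": "total", "type": "number"}
        ]
      }
    }
  ]
}

The CSV file carries the rows; the JSON descriptor carries the schema.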

Avro is similar, but the schema must always be provided with the data. In
the case of NiFi DataFlows, it's likely more efficient to send the schema
once as an initialization packet (I can't remember the real term in NiFi),
then the rows can be streamed individually, in batches of user-defined size,
sampled, etc.

Having said all that, there are projects like Apache Drill that can handle
non-flat JSON files and still present them in tabular format. They have
functions like KVGEN and FLATTEN to transform the document(s) into tabular
format. In the use cases you present below, you already know the data is
tabular, and as such the extra data model transformation is not needed. If
this is desired, it should be apparent that a streaming JSON processor would
be necessary; otherwise, for large tabular datasets you'd have to read the
whole JSON file into memory to parse individual rows.
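
For illustration, a minimal sketch of that streaming approach using
Jackson's streaming API (assumes jackson-core on the classpath; the file
name and the flat-row assumption are just for the example):

import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class StreamingJsonRows {
    public static void main(String[] args) throws Exception {
        // Expects a top-level JSON array of flat row objects, e.g.
        // [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, ...]
        try (JsonParser p = new JsonFactory().createParser(new File("rows.json"))) {
            if (p.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("expected a top-level JSON array of rows");
            }
            // One row object is materialized at a time, so memory use stays
            // constant no matter how many rows the file contains.
            while (p.nextToken() == JsonToken.START_OBJECT) {
                Map<String, String> row = new LinkedHashMap<>();
                while (p.nextToken() != JsonToken.END_OBJECT) {
                    String field = p.getCurrentName();
                    p.nextToken(); // advance to the value
                    row.put(field, p.getValueAsString());
                }
                System.out.println(row); // hand the row to downstream processing here
            }
        }
    }
}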

Regards,
Matt

From:  Toivo Adams <toivo.ad...@gmail.com>
Reply-To:  <dev@nifi.apache.org>
Date:  Monday, November 2, 2015 at 5:12 AM
To:  <dev@nifi.apache.org>
Subject:  Common data exchange formats and tabular data

All,
Some processors get/put data in tabular form (PutSQL, ExecuteSQL, and soon
Cassandra).
It would be very nice to be able to use such processors in a pipeline, where
the previous processor's output is the next processor's input. To achieve
this, processors should use a common data exchange format.

JSON is the most widely used; it's simple and readable. But JSON lacks a
schema, and a schema can be very useful for automating data inserts and
updates.
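
For illustration, a minimal sketch of how a schema can drive automated
insert generation, here using an Avro schema (the table and field names are
invented for the example):

import java.util.stream.Collectors;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class SchemaToInsertSketch {

    // Build a parameterized INSERT statement mechanically from the schema.
    static String insertFor(String table, Schema schema) {
        String cols = schema.getFields().stream()
                .map(Schema.Field::name)
                .collect(Collectors.joining(", "));
        String params = schema.getFields().stream()
                .map(f -> "?")
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + params + ")";
    }

    public static void main(String[] args) {
        Schema schema = SchemaBuilder.record("Order").fields()
                .requiredLong("order_id")
                .requiredString("customer")
                .endRecord();
        // Prints: INSERT INTO orders (order_id, customer) VALUES (?, ?)
        System.out.println(insertFor("orders", schema));
    }
}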

Avro has a schema, but it is somewhat more complicated and not (yet?) widely
used.

Please see also:

https://issues.apache.org/jira/browse/NIFI-978

https://issues.apache.org/jira/browse/NIFI-901

Opinions?

Thanks
Toivo



