Hi everyone,

This is an announcement for Gallia <https://github.com/galliaproject/gallia-core/blob/init/README.md>, a new library for data manipulation that maintains a schema throughout transformations and may process data at scale by wrapping Spark RDDs <https://github.com/galliaproject/gallia-core/blob/master/README.md#spark-rdds>.

Here’s a very basic example of usage on an individual object:

    """{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
      .read() // will infer schema if none is provided
        .toUpperCase('foo)
        .increment  ('bar)
        .remove     ('qux)
        .nest       ('baz).under('parent)
        .flip       ('parent |> 'baz)
      .printJson()
      // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}

Trying to manipulate 'parent |> 'baz as anything other than a boolean results in a type failure at runtime (but before the data is seen):

        .square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
        // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz

SQL-like processing looks like the following:

    "/data/people.jsonl.gz2"
        // case class Person(name: String, ...)
      .stream[Person]
        // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...
      /* 1. WHERE            */ .filterBy('age).matches(_ < 25)
      /* 2. SELECT           */ .retain('name, 'age)
      /* 3. GROUP BY + COUNT */ .countBy('age)
      .printJsonl()
      // OUTPUT: {"age": 21, "_count": 10}\n{"age": 22, ...

More examples:

- reduction <https://github.com/galliaproject/gallia-core/blob/master/README.md#reduction>
- aggregations <https://github.com/galliaproject/gallia-core/blob/master/README.md#aggregations>
- pivoting <https://github.com/galliaproject/gallia-core/blob/master/README.md#pivoting>

It’s also possible - but not required - to process data at scale by leveraging Spark RDDs <https://github.com/galliaproject/gallia-core/blob/master/README.md#spark-rdds>.
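
On the earlier point that a type failure is reported before the data is seen, the idea can be illustrated with a minimal plain-Scala sketch. This is not Gallia's API or its implementation: the names SchemaCheckSketch, FieldType, Step, and validate are all hypothetical, and the sketch only assumes that each step declares which field it touches and which type it requires, so a pipeline can be checked against a schema without reading a single record.

```scala
// Hypothetical sketch (NOT Gallia's API): check a pipeline against a schema
// before any data is read.
object SchemaCheckSketch {
  sealed trait FieldType
  case object Str  extends FieldType
  case object Num  extends FieldType
  case object Bool extends FieldType

  // A schema maps field names to their types (flat fields only, for brevity).
  type Schema = Map[String, FieldType]

  // Each step names the field it touches and the type it requires.
  sealed trait Step { def field: String; def expected: FieldType }
  final case class ToUpperCase(field: String) extends Step { val expected: FieldType = Str }
  final case class Increment(field: String)   extends Step { val expected: FieldType = Num }
  final case class Flip(field: String)        extends Step { val expected: FieldType = Bool }
  final case class Square(field: String)      extends Step { val expected: FieldType = Num }

  // Walks the steps using the schema only; stops at the first error found.
  def validate(schema: Schema, steps: List[Step]): Either[String, Unit] =
    steps.foldLeft[Either[String, Unit]](Right(())) { (acc, step) =>
      acc.flatMap { _ =>
        schema.get(step.field) match {
          case None =>
            Left(s"NoSuchField: ${step.field}")
          case Some(actual) if actual != step.expected =>
            Left(s"TypeMismatch ($actual, expected ${step.expected}): ${step.field}")
          case Some(_) =>
            Right(())
        }
      }
    }

  def main(args: Array[String]): Unit = {
    val schema: Schema = Map("foo" -> Str, "bar" -> Num, "baz" -> Bool)
    // Valid pipeline: every step matches the type of its field.
    println(validate(schema, List(ToUpperCase("foo"), Increment("bar"), Flip("baz"))))
    // Invalid pipeline: squaring a boolean is rejected with no data involved.
    println(validate(schema, List(Square("baz"))))
  }
}
```

The point of the design is that validate takes only a schema and a list of steps, so the mismatch surfaces when the pipeline is assembled rather than partway through processing the input.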

A much more thorough tour can be found at https://github.com/galliaproject/gallia-core/blob/init/README.md

I would love to hear whether this is an effort worth pursuing!

Anthony (@anthony_cros <https://twitter.com/anthony_cros>)
-- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/