Introducing Gallia: a Scala+Spark library for data manipulation

galliaproject Mon, 08 Feb 2021 09:43:38 -0800

Hi everyone,
This is an announcement for Gallia
<https://github.com/galliaproject/gallia-core/blob/init/README.md>, a new
library for data manipulation that maintains a schema throughout
transformations and can process data at scale by wrapping Spark RDDs
<https://github.com/galliaproject/gallia-core/blob/master/README.md#spark-rdds>.

Here’s a very basic example of usage on an individual object:

  """{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
    .read() // will infer schema if none is provided
      .toUpperCase('foo)
      .increment  ('bar)
      .remove     ('qux)
      .nest       ('baz).under('parent)
      .flip       ('parent |> 'baz)
    .printJson()
    // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}

Trying to manipulate 'parent |> 'baz as anything other than a boolean
results in a type failure at runtime (but before the data is seen):

      .square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
      // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz

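One way to picture a failure "at runtime, but before the data is seen" is a validation pass that checks each planned step against the schema before any record is read. The following is a minimal plain-Scala sketch of that idea only. It is not Gallia's implementation or API: FieldType, Step and validate are hypothetical names, and nested paths such as 'parent |> 'baz are simplified to flat field names.

```scala
// Hypothetical sketch: schema-level validation of planned steps.
// None of these names come from Gallia.
object SchemaCheckSketch {
  sealed trait FieldType
  case object Str  extends FieldType
  case object Num  extends FieldType
  case object Bool extends FieldType

  type Schema = Map[String, FieldType]

  // A step names the field it touches and the type it requires there.
  final case class Step(name: String, field: String, expected: FieldType)

  // Checks every step against the schema alone; no data is consulted.
  // Stops at the first problem and reports it as a Left.
  def validate(schema: Schema, steps: Seq[Step]): Either[String, Unit] =
    steps.foldLeft[Either[String, Unit]](Right(())) { (acc, step) =>
      acc.flatMap { _ =>
        schema.get(step.field) match {
          case None =>
            Left(s"NoSuchField: ${step.field}")
          case Some(actual) if actual != step.expected =>
            Left(s"TypeMismatch ($actual, expected ${step.expected}): ${step.field}")
          case Some(_) =>
            Right(())
        }
      }
    }

  // Schema for a record shaped like {"foo": "...", "bar": 1, "baz": true}.
  val schema: Schema = Map("foo" -> Str, "bar" -> Num, "baz" -> Bool)

  // "increment" on a number and "flip" on a boolean both pass.
  val ok: Either[String, Unit] =
    validate(schema, Seq(Step("increment", "bar", Num), Step("flip", "baz", Bool)))

  // "square" on a boolean is rejected without reading any record.
  val bad: Either[String, Unit] =
    validate(schema, Seq(Step("square", "baz", Num)))
}
```

Here SchemaCheckSketch.bad is a Left describing the mismatch on baz, produced from the schema and the list of steps alone.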
SQL-like processing looks like the following:

  "/data/people.jsonl.gz2" // case class Person(name: String, ...)
    .stream[Person] // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...
      /* 1. WHERE            */ .filterBy('age).matches(_ < 25)
      /* 2. SELECT           */ .retain('name, 'age)
      /* 3. GROUP BY + COUNT */ .countBy('age)
    .printJsonl()
    // OUTPUT: {"age": 21, "_count": 10}\n{"age": 22, ...

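For readers who want to map the three numbered steps onto something familiar, here is a rough equivalent using only standard Scala collections. It is a sketch of the semantics, not of how Gallia evaluates the pipeline. The sample rows other than John are made up for illustration, and the Person fields beyond name are assumed from the INPUT comment.

```scala
// Plain-Scala analogue of WHERE / SELECT / GROUP BY + COUNT.
// Illustration only; does not use Gallia.
object SqlLikeSketch {
  // Fields assumed from the INPUT comment in the example above.
  final case class Person(name: String, age: Int, city: String)

  // Made-up sample rows (only John appears in the original post).
  val people: Seq[Person] = Seq(
    Person("John", 20, "Toronto"),
    Person("Jane", 21, "Montreal"),
    Person("Mike", 21, "Toronto"),
    Person("Lucy", 30, "Ottawa")
  )

  val countsByAge: Seq[(Int, Int)] =
    people
      .filter(_.age < 25)                           // 1. WHERE age < 25
      .map(p => (p.name, p.age))                    // 2. SELECT name, age
      .groupBy { case (_, age) => age }             // 3. GROUP BY age
      .map { case (age, rows) => (age, rows.size) } //    ... + COUNT
      .toSeq
      .sortBy { case (age, _) => age }              // stable order for display
}
```

With these sample rows, countsByAge holds one (age, count) pair per age under 25.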
More examples:
  reduction <https://github.com/galliaproject/gallia-core/blob/master/README.md#reduction>
  aggregations <https://github.com/galliaproject/gallia-core/blob/master/README.md#aggregations>
  pivoting <https://github.com/galliaproject/gallia-core/blob/master/README.md#pivoting>

It’s also possible, but not required, to process data at scale by
leveraging Spark RDDs
<https://github.com/galliaproject/gallia-core/blob/master/README.md#spark-rdds>.
A much more thorough tour can be found at
<https://github.com/galliaproject/gallia-core/blob/init/README.md>

I would love to hear whether this is an effort worth pursuing!
Anthony (@anthony_cros <https://twitter.com/anthony_cros>)





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
