Re: Introducing Gallia: a Scala+Spark library for data manipulation

2021-03-31 Thread galliaproject
I posted another update on the Scala mailing list:
https://users.scala-lang.org/t/introducing-gallia-a-library-for-data-manipulation/7112/11

It notably pertains to:
- A full *RDD*-powered example (via EMR):
  https://github.com/galliaproject/gallia-genemania-spark#description
- A new license (*BSL*): https://github.com/galliaproject/gallia-core/bsl.md;
  basically, it's *free for essential or small* entities






Re: Introducing Gallia: a Scala+Spark library for data manipulation

2021-02-16 Thread galliaproject
I posted a quick update on the Scala mailing list, which mostly discusses
Scala 2.13 support, additional examples, and licensing.






Introducing Gallia: a Scala+Spark library for data manipulation

2021-02-08 Thread galliaproject
Hi everyone,
This is an announcement for Gallia
<https://github.com/galliaproject/gallia-core/blob/init/README.md>, a new
library for data manipulation that maintains a schema throughout
transformations and can process data at scale by wrapping Spark RDDs
<https://github.com/galliaproject/gallia-core/blob/master/README.md#spark-rdds>.
Here’s a very basic example of usage on an individual object:

  """{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
    .read()     // will infer schema if none is provided
    .toUpperCase('foo)
    .increment  ('bar)
    .remove     ('qux)
    .nest       ('baz).under('parent)
    .flip       ('parent | 'baz)
    .printJson()

  // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}
Trying to manipulate 'parent | 'baz as anything other than a boolean
results in a type failure at runtime (but before the data is seen):
  .square ('parent | 'baz ~ 'BAZ) // instead of "flip" earlier 
// ERROR: TypeMismatch (Boolean, expected Number): 'parent | 'baz
SQL-like processing looks like the following:

  "/data/people.jsonl.gz2"      // case class Person(name: String, ...)
    .stream[Person]             // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...
    /* 1. WHERE             */ .filterBy('age).matches(_ < 25)
    /* 2. SELECT            */ .retain('name, 'age)
    /* 3. GROUP BY + COUNT  */ .countBy('age)
    .printJsonl()               // OUTPUT: {"age": 21, "_count": 10}\n{"age": 22, ...
More examples:
- reduction:
  https://github.com/galliaproject/gallia-core/blob/master/README.md#reduction
- aggregations:
  https://github.com/galliaproject/gallia-core/blob/master/README.md#aggregations
- pivoting:
  https://github.com/galliaproject/gallia-core/blob/master/README.md#pivoting
It’s also possible, but not required, to process data at scale by leveraging
Spark RDDs
<https://github.com/galliaproject/gallia-core/blob/master/README.md#spark-rdds>.
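
For readers who know the plain Spark API, here is roughly what the
WHERE / SELECT / GROUP BY + COUNT pipeline above could look like written
directly against an RDD. This is only a contrast sketch in plain Spark, not
Gallia's actual RDD integration (see the linked README section for that); the
Person fields, input path, and filter threshold are carried over from the
example above, and the SparkSession setup is assumed.

  // Contrast sketch (plain Spark, not Gallia): the same WHERE / SELECT /
  // GROUP BY + COUNT pipeline, written directly against an RDD.
  import org.apache.spark.sql.SparkSession

  // age as Long because spark.read.json infers whole numbers as bigint
  case class Person(name: String, age: Long, city: String)

  object PlainSparkComparison {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("plain-spark-comparison").getOrCreate()
      import spark.implicits._

      // One JSON object per line, as in the example above (path carried over from it)
      val people = spark.read.json("/data/people.jsonl.gz2").as[Person].rdd

      val counts =
        people
          .filter(_.age < 25)                  // 1. WHERE
          .map(p => (p.name, p.age))           // 2. SELECT
          .map { case (_, age) => (age, 1L) }  // key by age...
          .reduceByKey(_ + _)                  // 3. ...GROUP BY + COUNT

      counts
        .collect()
        .foreach { case (age, n) => println(s"""{"age": $age, "_count": $n}""") }

      spark.stop()
    }
  }

The Gallia version above keeps the same steps in a single schema-aware chain,
whether it runs in memory or over an RDD.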
A much more thorough tour can be found at:
https://github.com/galliaproject/gallia-core/blob/init/README.md
I would love to hear whether this is an effort worth pursuing!
Anthony (@anthony_cros <https://twitter.com/anthony_cros>)



