Re: Introducing Gallia: a Scala+Spark library for data manipulation

2021-03-31 Thread galliaproject
I posted another update on the scala mailing list:
https://users.scala-lang.org/t/introducing-gallia-a-library-for-data-manipulation/7112/11

It notably pertains to:
- A full *RDD*-powered example (via EMR): https://github.com/galliaproject/gallia-genemania-spark#description
- A new license (*BSL*): https://github.com/galliaproject/gallia-core/bsl.md; basically it's *free for essential or small* entities



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Introducing Gallia: a Scala+Spark library for data manipulation

2021-02-16 Thread galliaproject
I posted a quick update on the scala mailing list, which mostly discusses Scala 2.13 support, additional examples, and licensing.






Introducing Gallia: a Scala+Spark library for data manipulation

2021-02-08 Thread galliaproject

Hi everyone,
This is an announcement for Gallia, a new library for data manipulation that maintains a schema throughout transformations and may process data at scale by wrapping Spark RDDs.
Here’s a very basic example of usage on an individual object:

  """{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
    .read()             // will infer schema if none is provided
    .toUpperCase('foo)
    .increment  ('bar)
    .remove     ('qux)
    .nest       ('baz).under('parent)
    .flip       ('parent |> 'baz)
    .printJson()

  // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}
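For contrast, here is a minimal plain-Scala sketch (not Gallia) of the same transformation done by hand on an untyped `Map[String, Any]`; every step has to re-assert the value's type, which is exactly the bookkeeping the library's schema tracking takes over. The Map-based representation is purely illustrative.

```scala
// Hand-rolled version of the pipeline above on an untyped Map (illustrative
// only; this is not the Gallia API). Each step needs an asInstanceOf cast.
val input: Map[String, Any] =
  Map("foo" -> "hello", "bar" -> 1, "baz" -> true, "qux" -> "world")

val result: Map[String, Any] = {
  val upper  = input.updated("foo", input("foo").asInstanceOf[String].toUpperCase)  // toUpperCase('foo)
  val incr   = upper.updated("bar", upper("bar").asInstanceOf[Int] + 1)             // increment('bar)
  val pruned = incr - "qux"                                                         // remove('qux)
  // nest 'baz under 'parent, then flip the boolean
  (pruned - "baz").updated("parent",
    Map("baz" -> !pruned("baz").asInstanceOf[Boolean]))
}

println(result)
```

Note that nothing here catches a wrong cast until the data is actually traversed, whereas Gallia reports the mismatch up front.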
Trying to manipulate 'parent |> 'baz as anything other than a boolean results in a type failure at runtime (but before the data is seen):

  .square('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
  // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz
SQL-like processing looks like the following:

  "/data/people.jsonl.gz2"   // case class Person(name: String, ...)
    .stream[Person]          // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...
    /* 1. WHERE            */ .filterBy('age).matches(_ < 25)
    /* 2. SELECT           */ .retain('name, 'age)
    /* 3. GROUP BY + COUNT */ .countBy('age)
    .printJsonl()            // OUTPUT: {"age": 21, "_count": 10}\n{"age": 22, ...
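For comparison, the same WHERE / SELECT / GROUP BY + COUNT pipeline can be sketched over in-memory collections in plain Scala. The `Person` case class and the sample rows are made up for illustration; none of this is Gallia API.

```scala
// Plain-Scala collections sketch of the pipeline above (illustrative only).
case class Person(name: String, age: Int, city: String)

val people = Seq(
  Person("John", 20, "Toronto"),
  Person("Jane", 22, "Montreal"),
  Person("Jack", 30, "Vancouver"))

val counts: Map[Int, Int] =
  people
    .filter(_.age < 25)                            // 1. WHERE
    .map(p => (p.name, p.age))                     // 2. SELECT
    .groupBy(_._2)                                 // 3. GROUP BY age
    .map { case (age, rows) => age -> rows.size }  //    + COUNT
```

Unlike Gallia's version, this sketch materializes everything in memory and carries no schema that could be checked before the data is seen.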
More examples:
- reduction
- aggregations
- pivoting
It’s also possible - but not required - to process data at scale by leveraging Spark RDDs.
A much more thorough tour can be found at https://github.com/galliaproject/gallia-core/blob/init/README.md
I would love to hear whether this is an effort worth pursuing!
Anthony (@anthony_cros)



