Announcing Hyperspace v0.4.0 - an indexing subsystem for Apache Spark™

2021-02-08 Thread Terry Kim
Hi,

We are happy to announce that Hyperspace v0.4.0 - an indexing subsystem for
Apache Spark™ - has been released
!

Here are some of the highlights:

   - Delta Lake support: Hyperspace v0.4.0 supports creating indexes on
   Delta Lake tables. Please refer to the user guide for more info.
   - Support for Databricks: A known issue when Hyperspace was run on
   Databricks has been addressed. Hyperspace v0.4.0 can now run on Databricks
   Runtime 5.5 LTS & 6.4!
   - Globbing patterns for indexes: Globbing patterns can be used to
   specify a subset of the source data to create/maintain an index on.
   Please refer to the user guide for usage details.
   - Hybrid Scan improvements: Hyperspace v0.4.0 brings several
   improvements to Hybrid Scan, such as a better mechanism to
   enable/disable the feature, ranking algorithm improvements, quick
   index refresh, etc.
   - Pluggable source provider: This release introduces an (evolving)
   pluggable source provider API set so that different source formats can
   be plugged in. This enabled the Delta Lake source to be plugged in,
   and there is an on-going PR to support Iceberg tables.
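As a rough, Hyperspace-independent illustration of how a glob pattern narrows the source data to a subset of files (the paths and pattern below are made up, and Spark/Hadoop path globbing may differ in details from Python's fnmatch):

```python
from fnmatch import fnmatch

# Hypothetical partitioned layout; suppose only the 2021 partitions
# should be covered by the index.
paths = [
    "data/sales/year=2020/part-0.parquet",
    "data/sales/year=2021/part-0.parquet",
    "data/sales/year=2021/part-1.parquet",
]
pattern = "data/sales/year=2021/*"

# Shell-style matching keeps just the files under the 2021 partition.
subset = [p for p in paths if fnmatch(p, pattern)]
print(subset)
# ['data/sales/year=2021/part-0.parquet', 'data/sales/year=2021/part-1.parquet']
```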

We would like to thank the community for the great feedback and all those
who contributed to this release.

Thanks,
Terry Kim on behalf of the Hyperspace team


Getting : format(target_id, ".", name), value) .. error

2021-02-08 Thread shahab
Hello,

I am getting this unclear error message when I read a Parquet file. It
seems something is wrong with the data, but what? I googled a lot but did
not find any clue. I hope some Spark experts can help me with this.

best,
Shahab


Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 94, in rdd
    jrdd = self._jdf.javaToPython()
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o622.javaToPython.
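The traceback above stops at the Py4J boundary; the real cause usually lives in the JVM-side exception. One way to surface it is sketched below (this assumes a live SparkSession named `spark`; the function name and path handling are illustrative, not part of any library):

```python
def read_with_full_error(spark, path):
    """Read a Parquet file, printing the JVM-side cause if Py4J raises."""
    # Imported lazily so the function can be defined without a running gateway.
    from py4j.protocol import Py4JJavaError

    try:
        # Accessing .rdd triggers javaToPython, the call in the traceback above.
        return spark.read.parquet(path).rdd
    except Py4JJavaError as e:
        # e.java_exception is the JVM-side Throwable; its string form
        # typically reveals the real cause (corrupt footer, schema
        # mismatch, missing file, etc.).
        print(e.java_exception.toString())
        raise
```

Printing the full Java stack trace this way is often enough to see which file or column Spark was choking on.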


Introducing Gallia: a Scala+Spark library for data manipulation

2021-02-08 Thread galliaproject

Hi everyone,

This is an announcement for Gallia, a new library for data manipulation
that maintains a schema throughout transformations and may process data
at scale by wrapping Spark RDDs.
Here's a very basic example of usage on an individual object:

  """{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
    .read()        // will infer schema if none is provided
    .toUpperCase('foo)
    .increment  ('bar)
    .remove     ('qux)
    .nest       ('baz).under('parent)
    .flip       ('parent |> 'baz)
    .printJson()
  // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}
Trying to manipulate 'parent |> 'baz as anything other than a boolean
results in a type failure at runtime (but before the data is seen):

  .square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
  // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz
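This fail-fast behavior, validating an operation against the declared schema before any record is read, can be mimicked in plain Python (a toy sketch for illustration only, not Gallia's implementation):

```python
# Toy schema-first check: operations are validated against a declared
# schema up front, so a type error surfaces before any data is touched.
schema = {"foo": str, "bar": int, "parent.baz": bool}

def check_op(schema, field, expected_type):
    """Reject an operation whose field type does not match the schema."""
    actual = schema[field]
    if actual is not expected_type:
        raise TypeError(
            f"TypeMismatch ({actual.__name__}, expected {expected_type.__name__}): {field}"
        )

check_op(schema, "bar", int)  # increment('bar): fine, bar is a number

try:
    check_op(schema, "parent.baz", int)  # squaring a boolean: rejected early
except TypeError as e:
    print(e)  # TypeMismatch (bool, expected int): parent.baz
```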
SQL-like processing looks like the following:

  "/data/people.jsonl.gz2"      // case class Person(name: String, ...)
    .stream[Person]             // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...
    /* 1. WHERE            */   .filterBy('age).matches(_ < 25)
    /* 2. SELECT           */   .retain('name, 'age)
    /* 3. GROUP BY + COUNT */   .countBy('age)
    .printJsonl()               // OUTPUT: {"age": 21, "_count": 10}\n{"age": 22, ...
More examples: reduction, aggregations, pivoting.
It's also possible - but not required - to process data at scale by
leveraging Spark RDDs.

A much more thorough tour can be found at
https://github.com/galliaproject/gallia-core/blob/init/README.md
I would love to hear whether this is an effort worth pursuing!
Anthony (@anthony_cros)




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/