Talk info share - "Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio"

2021-04-25 Thread Jasmine Wang
Hi guys,

I wanted to share an upcoming free online tech talk by NVIDIA, "Advancing GPU
Analytics with RAPIDS Accelerator for Spark and Alluxio", on Tuesday, April 27th
at 10 AM PT.
There will be a live Q&A after the talk in case anyone is interested.
Registration is here: https://go.alluxio.io/community-alluxio-day-2021

Cheers,

Jasmine


Re: pyspark sql load with path of special character

2021-04-25 Thread Stephen Coy
It probably does not like the colons in the path name “…20:04:27+00:00/…”, 
especially if you’re running on a Windows box.
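
If renaming the directory is an option, one possible workaround (a rough sketch
only; the paths below are hypothetical, and the table name is taken from the
LOAD DATA example in your message) is to stage the files under a colon-free
directory first and load from there:

```
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical paths: copy the timestamped directory to a name without
# colons before handing it to LOAD DATA LOCAL INPATH.
src = "/data/exports/2021-03-02T20:04:27+00:00"
dst = "/tmp/staging/2021-03-02T20-04-27"
shutil.copytree(src, dst)

spark.sql(f"LOAD DATA LOCAL INPATH '{dst}' OVERWRITE INTO TABLE test_load")
```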

On 24 Apr 2021, at 1:29 am, Regin Quinoa <sweatr...@gmail.com> wrote:

Hi, I am using pyspark sql to load files into a table following
```LOAD DATA LOCAL INPATH '/user/hive/warehouse/students' OVERWRITE INTO TABLE 
test_load;```
 
https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html

It complains with pyspark.sql.utils.AnalysisException: load data input path does not
exist when the path string has a timestamp in the directory structure, like
XX/XX/2021-03-02T20:04:27+00:00/file.parquet

It works with paths that don't contain a timestamp. How can I work around it?




Re: Is a Hive installation necessary for Spark SQL?

2021-04-25 Thread Dennis Suhari
Hi, 

you can also load other data sources without Hive by using Spark's read formats into a
Spark DataFrame. From there you can also combine the results using the
DataFrame world.

The use case for Hive is to have a common abstraction layer when you want to do
data tagging and access management under one roof, using tools like Apache Ranger /
Apache Atlas.
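
For example, a minimal sketch (the paths, formats, and join column are hypothetical,
and it assumes the Delta connector is configured) of reading a non-Hive source and
combining it with a Delta table purely at the DataFrame level:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sources: a Delta table and a plain CSV file, no Hive involved.
orders = spark.read.format("delta").load("/data/delta/orders")
customers = (spark.read.format("csv")
             .option("header", "true")
             .load("/data/raw/customers.csv"))

# Combine the results in the DataFrame world.
report = orders.join(customers, on="customer_id", how="inner")
report.show()
```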



Br,

Dennis 

Sent from my iPhone

> On 25.04.2021 at 15:37, krchia wrote:
> 
> Does it make sense to keep a Hive installation when your parquet files come
> with a transactional metadata layer like Delta Lake / Apache Iceberg?
> 
> My understanding from this:
> https://github.com/delta-io/delta/issues/85
> 
> is that Hive is no longer necessary other than for discovering where the table
> is stored. Hence, we can simply do something like:
> ```
> df = spark.read.format("delta").load(location)  # location = path to the Delta table
> df.createOrReplaceTempView("myTable")
> res = spark.sql("select * from myTable")
> ```
> and does this approach still get all the benefits of having the metadata for
> partition discovery / SQL optimization? With Delta, the Hive metastore
> should only store a pointer from the table name to the path of the table,
> and all other metadata will come from the Delta log, which is processed
> in Spark.
> 
> One reason I can think of for keeping Hive is to keep track of other data
> sources that don't necessarily have a Delta / Iceberg transactional metadata
> layer. But I'm not sure if it's still worth it; are there any use cases I
> might have missed for keeping a Hive installation after migrating to
> Delta / Iceberg?
> 
> Please correct me if I've used any terms incorrectly.
> 
> 
> 



Is a Hive installation necessary for Spark SQL?

2021-04-25 Thread krchia
Does it make sense to keep a Hive installation when your parquet files come
with a transactional metadata layer like Delta Lake / Apache Iceberg?

My understanding from this:
https://github.com/delta-io/delta/issues/85

is that Hive is no longer necessary other than for discovering where the table
is stored. Hence, we can simply do something like:
```
df = spark.read.format("delta").load(location)  # location = path to the Delta table
df.createOrReplaceTempView("myTable")
res = spark.sql("select * from myTable")
```
and does this approach still get all the benefits of having the metadata for
partition discovery / SQL optimization? With Delta, the Hive metastore
should only store a pointer from the table name to the path of the table,
and all other metadata will come from the Delta log, which is processed
in Spark.
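
For illustration, a minimal sketch of that pointer-only registration (the table name
and location are hypothetical, and it assumes Delta's SQL support, i.e. Delta 0.7+ on
Spark 3):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The metastore only records that "events" lives at this path; all other
# metadata (schema, partitions, versions) comes from the Delta log.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events
    USING DELTA
    LOCATION '/data/delta/events'
""")

spark.sql("SELECT * FROM events LIMIT 10").show()
```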

One reason I can think of for keeping Hive is to keep track of other data
sources that don't necessarily have a Delta / Iceberg transactional metadata
layer. But I'm not sure if it's still worth it; are there any use cases I
might have missed for keeping a Hive installation after migrating to
Delta / Iceberg?

Please correct me if I've used any terms incorrectly.


