Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread ayan guha
Hi
We used spark.sql to create a table using DELTA. We also have a Hive
metastore attached to the Spark session, so the table gets created in
the Hive metastore. We then tried to query the table from Hive and ran
into the following issues:

   1. The SerDe is SequenceFile; it should have been Parquet.
   2. Schema fields are not passed.

Essentially, the Hive DDL looks like:

CREATE TABLE `TABLE NAME`(
  `col` array COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='WASB PATH')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  'WASB PATH'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.0',
  'spark.sql.sources.provider'='DELTA',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[]}',
  'transient_lastDdlTime'='1556544657')
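
The table properties above can also be inspected programmatically. The following is a minimal sketch, assuming the properties are exactly as shown in the DDL; the helper name schema_fields is hypothetical and not part of Spark, Hive, or Delta Lake. It reassembles the JSON schema that Spark stores across the spark.sql.sources.schema.part.N properties and lists the field names. For this table the list is empty, which would be consistent with Hive seeing no real columns.

```python
import json

# Schema-related table properties copied from the DDL above.
TBL_PROPERTIES = {
    "spark.sql.sources.provider": "DELTA",
    "spark.sql.sources.schema.numParts": "1",
    "spark.sql.sources.schema.part.0": '{"type":"struct","fields":[]}',
}


def schema_fields(props):
    """Join the schema parts stored in the table properties and return the field names.

    Hypothetical helper for illustration only.
    """
    num_parts = int(props.get("spark.sql.sources.schema.numParts", "0"))
    raw = "".join(
        props[f"spark.sql.sources.schema.part.{i}"] for i in range(num_parts)
    )
    if not raw:
        return []
    return [field["name"] for field in json.loads(raw)["fields"]]


fields = schema_fields(TBL_PROPERTIES)
print(fields)  # prints [] for the properties above
```

An empty result here only describes what the metastore entry records; it says nothing about the schema kept in the Delta table's own transaction log at the table path.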

Is this expected? And will the use case be supported in future releases?


We are now experimenting

Best

Ayan

On Fri, Jun 21, 2019 at 11:06 AM Liwen Sun  wrote:

> Hi James,
>
> Right now we don't have plans for having a catalog component as part of
> Delta Lake, but we are looking to support Hive metastore and also DDL
> commands in the near future.
>
> Thanks,
> Liwen
>
> On Thu, Jun 20, 2019 at 4:46 AM James Cotrotsios <
> jamescotrots...@gmail.com> wrote:
>
>> Is there a plan to have a business catalog component for the Data Lake?
>> If not how would someone make a proposal to create an open source project
>> related to that. I would be interested in building out an open source data
>> catalog that would use the Hive metadata store as a baseline for technical
>> metadata.
>>
>>
>> On Wed, Jun 19, 2019 at 3:04 PM Liwen Sun 
>> wrote:
>>
>>> We are delighted to announce the availability of Delta Lake 0.2.0!
>>>
>>> To try out Delta Lake 0.2.0, please follow the Delta Lake Quickstart:
>>> https://docs.delta.io/0.2.0/quick-start.html
>>>
>>> To view the release notes:
>>> https://github.com/delta-io/delta/releases/tag/v0.2.0
>>>
>>> This release introduces two main features:
>>>
>>> *Cloud storage support*
>>> In addition to HDFS, you can now configure Delta Lake to read and write
>>> data on cloud storage services such as Amazon S3 and Azure Blob Storage.
>>> For configuration instructions, please see:
>>> https://docs.delta.io/0.2.0/delta-storage.html
>>>
>>> *Improved concurrency*
>>> Delta Lake now allows concurrent append-only writes while still ensuring
>>> serializability. For concurrency control in Delta Lake, please see:
>>> https://docs.delta.io/0.2.0/delta-concurrency.html
>>>
>>> We have also greatly expanded the test coverage as part of this release.
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this release.
>>>
>>> Best regards,
>>> Liwen Sun
>>>
>>>

-- 
Best Regards,
Ayan Guha
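
The "concurrent append-only writes" mentioned in the quoted announcement can be pictured as optimistic commits to an ordered log. What follows is only a toy sketch under that assumption, not Delta Lake's implementation: the function commit_append, the retry limit, and the use of a local directory are all hypothetical choices for illustration, and the commit body is simplified to a single JSON list. Each writer tries to create the next numbered commit file with put-if-absent semantics; if another writer got there first, an append has nothing to re-validate and simply retries at the next version.

```python
import json
import os
import tempfile


def commit_append(log_dir, actions, max_retries=50):
    """Write `actions` as the next numbered commit file in `log_dir`.

    Hypothetical helper: retries at the next version when another
    writer has already created the file for the version we picked.
    """
    for _ in range(max_retries):
        versions = [
            int(name.split(".")[0])
            for name in os.listdir(log_dir)
            if name.endswith(".json")
        ]
        version = max(versions, default=-1) + 1
        path = os.path.join(log_dir, f"{version:020d}.json")
        try:
            # O_EXCL makes creation fail if the file already exists,
            # so at most one writer can own a given version.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # lost the race; try the next version
        with os.fdopen(fd, "w") as handle:
            json.dump(actions, handle)
        return version
    raise RuntimeError("too many concurrent commits")


log_dir = tempfile.mkdtemp()
v0 = commit_append(log_dir, [{"add": {"path": "part-0000.parquet"}}])
v1 = commit_append(log_dir, [{"add": {"path": "part-0001.parquet"}}])
print(v0, v1)  # prints: 0 1
```

Because every commit lands at exactly one version number, the appends end up in a single total order, which is the intuition behind allowing them to run concurrently. For the actual rules, see the concurrency documentation linked in the announcement.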


Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread Gourav Sengupta
Hi Liwen,

Thanks a ton. I think there is a difference between a storage class and a
metastore, just as there is a difference between a database and a file
system, or between coffee and a cup.

It will be wonderful to keep the focus on the fantastic opportunity that
Delta creates for us :)

Regards,
Gourav Sengupta



Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread Liwen Sun
Hi James,

Right now we don't have plans for having a catalog component as part of
Delta Lake, but we are looking to support Hive metastore and also DDL
commands in the near future.

Thanks,
Liwen



Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread Li Gao
Lyft recently open sourced a data discovery tool called Amundsen that can
serve many of the data catalog needs.

https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
https://github.com/lyft/amundsenmetadatalibrary

You still need HMS to store the data schema though.





Spark-cluster slowness

2019-06-20 Thread Amit Sharma
I have a Spark cluster in each of two data centers. Cluster B is 6 times
slower than cluster A: I ran the same job on both clusters and the run
time differs by a factor of 6. I used the same config on both, and both
run Spark 2.3.3. The Spark UI displays the slave nodes, and under the
Executors tab I see all the nodes, but I do not see any active tasks even
though the task status is active. Please help me find the root cause.

Thanks
Amit


Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread James Cotrotsios
Is there a plan to have a business catalog component for the Data Lake? If
not, how would someone make a proposal to create an open source project
related to that? I would be interested in building out an open source data
catalog that would use the Hive metadata store as a baseline for technical
metadata.

