[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081030#comment-17081030 ]

Yanjia Gary Li commented on HUDI-773:
-

Surprisingly easy... I tried the following test using a Spark 2.4 HDInsight cluster 
with Azure Data Lake Storage V2. Hudi ran out of the box; no extra config 
needed.
{code:java}
// Initial Batch
val outputPath = "/Test/HudiWrite"
val df1 = Seq(
  ("0", "year=2019", "test1", "pass", "201901"),
  ("1", "year=2019", "test1", "pass", "201901"),
  ("2", "year=2020", "test1", "pass", "201901"),
  ("3", "year=2020", "test1", "pass", "201901")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
val bulk_insert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
df1.write.format("org.apache.hudi").options(bulk_insert_ops).mode(SaveMode.Overwrite).save(outputPath)

// Upsert
val upsert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
val df2 = Seq(
  ("0", "year=2019", "test1", "pass", "201910"),
  ("1", "year=2019", "test1", "pass", "201910"),
  ("2", "year=2020", "test1", "pass", "201910"),
  ("3", "year=2020", "test1", "pass", "201910")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
df2.write.format("org.apache.hudi").options(upsert_ops).mode(SaveMode.Append).save(outputPath)

// Read as hudi format
val df_read = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load(outputPath)
assert(df_read.count() == 4)
{code}
 

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081032#comment-17081032 ]

Yanjia Gary Li commented on HUDI-773:
-

Any extra tests needed? What tests have you guys done for AWS and GCP? 
[~vinoth] [~vbalaji]



[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Vinoth Chandar (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081081#comment-17081081 ]

Vinoth Chandar commented on HUDI-773:
-

Yeah, it should be fine.. A few things to check:

1. Do we have all the Azure storage schemes supported? (StorageSchemes class)
2. Docs on Azure support.
3. There are some schemes that support appends; we need to classify them 
properly in the same StorageSchemes class.

If you can verify and close, that would be awesome :) 
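For item 1, one way to see how a given Hudi build classifies a scheme is to poke at the StorageSchemes class from a Spark shell. This is only a sketch under assumptions: the package path and the isSchemeSupported/isAppendSupported helpers may differ across Hudi versions, so verify against the source you are running.
{code:java}
// Hypothetical check from a Scala/Spark shell with Hudi on the classpath.
// Package path and helper names are assumptions -- confirm for your Hudi version.
import org.apache.hudi.common.fs.StorageSchemes

// "abfs"/"abfss" are the ADLS Gen2 schemes; "wasb"/"wasbs" are classic Blob storage.
val azureSchemes = Seq("abfs", "abfss", "wasb", "wasbs")
azureSchemes.foreach { s =>
  if (StorageSchemes.isSchemeSupported(s)) {
    // ADLS Gen2 does support appends, so this classification matters for item 3.
    println(s"$s appendSupported=${StorageSchemes.isAppendSupported(s)}")
  } else {
    println(s"$s is not listed in StorageSchemes yet")
  }
}
{code}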

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>






[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-16 Thread Sasikumar Venkatesh (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085380#comment-17085380 ]

Sasikumar Venkatesh commented on HUDI-773:
--

Hi [~vinoth] and [~garyli1019]

I am experimenting with Hudi on ADLS Gen 2.
{code:java}
outputPath = abfss://tpch@.dfs.core.windows.net/hudi-tables

df.write.format("hudi").options(hudiOptions)...
.save(outputPath){code}
I changed the output dir to the above path, but when I write using the Hudi 
format it fails with the error _*Configuration property <>.dfs.core.windows.net 
not found*_. I have configured the ADLS FS and FSImpl classes in Spark using OAuth, 
and I can write to ADLS Gen2 using the code below. 
{code:java}
df.write.format("parquet").save(outputPath){code}
Is there any specific config related to ADLS Gen2 for connecting to the storage 
account using OAuth while using Hudi?



Thanks and regards,

Sasikumar



[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-16 Thread Yanjia Gary Li (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085409#comment-17085409 ]

Yanjia Gary Li commented on HUDI-773:
-

Hello [~sasikumar.venkat], thanks for sharing!

I am able to write Hudi data without OAuth. We are probably among the first few 
people in the community using Hudi on Azure, so I believe we need to figure this out :)

I will try to reproduce your issue and update here once I have. 



[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-17 Thread Sasikumar Venkatesh (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085499#comment-17085499 ]

Sasikumar Venkatesh commented on HUDI-773:
--

Thank you for the prompt reply [~garyli1019]. Let me know if there are any Jira 
issues related to ADLS; I can contribute. 



[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-17 Thread Yanjia Gary Li (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086087#comment-17086087 ]

Yanjia Gary Li commented on HUDI-773:
-

Hello [~sasikumar.venkat], I am very new to Azure.

How is your cluster set up? Are you using HDInsight or Databricks? Is your Spark 
cluster attached to the storage account, or does it access it through an API?



[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-20 Thread Sasikumar Venkatesh (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087672#comment-17087672 ]

Sasikumar Venkatesh commented on HUDI-773:
--

My cluster is set up on Databricks, and I have attached my storage account to the 
cluster. 

I have tried:
 # Adding my container in ADLS as a mount point in the Databricks cluster.
 # Configuring a Service Principal in Azure to access it via OAuth. 

I think both methods go through an API. 



[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-20 Thread Vinoth Chandar (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087869#comment-17087869 ]

Vinoth Chandar commented on HUDI-773:
-

[~sasikumar.venkat] Happy to work with you and get this ironed out. Could you 
please paste the entire stack trace for the error? I'm not super familiar with 
Azure, but that can help start some troubleshooting.



[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-20 Thread Yanjia Gary Li (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087994#comment-17087994 ]

Yanjia Gary Li commented on HUDI-773:
-

[~sasikumar.venkat] I haven't tried Databricks Spark myself, but one of my 
colleagues tried it before and had some issues with the Hudi write, probably 
related to yours. As Vinoth mentioned, any debugging info would be helpful. I 
will also try it myself later.



[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-22 Thread Sasikumar Venkatesh (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089852#comment-17089852 ]

Sasikumar Venkatesh commented on HUDI-773:
--

[~vinoth] and [~garyli1019]

My setup is as follows. 

I have a file system in ADLS Gen2 (similar to an S3 bucket) and set up an 
OAuth-based connection:
{code:java}
spark.conf.set("fs.azure.account.auth.type.<>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<>.dfs.core.windows.net",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<>.dfs.core.windows.net", "**redacted")
spark.conf.set("fs.azure.account.oauth2.client.secret.<>.dfs.core.windows.net", "**redacted")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<>.dfs.core.windows.net",
  "https://login.microsoftonline.com/<>/oauth2/token")
{code}
The Hudi write is as follows:
{code:java}
val tableName = "customer"
customerDF.write.format("org.apache.hudi")
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save("abfss://<>.dfs.core.windows.net/hudi-tables/customer")
{code}
I can see that when I try to write in the Hudi format, it tries to load the ADLS 
credentials from org.apache.hadoop.fs.azurebfs.AbfsConfiguration, even though my 
credential properties are set under `fs.azure.account.*`.

*I am wondering whether there is any special config I need to add to my Spark 
conf to write to ADLS in the Hudi format.* 

The stack trace of the error is given below:
{code:java}
Configuration property <>.dfs.core.windows.net not found.
  at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:392)
  at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1008)
  at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:151)
  at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:106)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:81)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:147)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:188)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:184)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:118)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:116)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:113)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:242)
  at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:99)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:172)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:306)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:292)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:235)
  at line8353df56311d44ef989b6a6d378b55bd92.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-4384834320483321:4)
  at line8353df56311d44ef989b6a6d378b55bd92.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-4384834320483321:73)
  at line8353df56311d44ef989b6a6d378b55bd92.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw
{code}

[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-23 Thread Yanjia Gary Li (Jira)


[ https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091042#comment-17091042 ]

Yanjia Gary Li commented on HUDI-773:
-

Hello [~sasikumar.venkat], could you try the following:

mount your storage account to Databricks
{code:java}
dbutils.fs.mount(
  source = "abfss://x...@xxx.dfs.core.windows.net",
  mountPoint = "/mountpoint",
  extraConfigs = configs)
{code}
When writing to Hudi, use the abfss URL:
{code:java}
save("abfss://<>.dfs.core.windows.net/hudi-tables/customer"){code}
When reading Hudi data, use the mount point:
{code:java}
load("/mountpoint/hudi-tables/customer")
{code}
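Another thing that might be worth ruling out (an assumption on my part, not a confirmed fix): spark.conf.set only populates the Spark session conf, while the stack trace shows Hudi obtaining the filesystem via Path.getFileSystem on the Hadoop Configuration. Setting the same OAuth properties directly on the SparkContext's hadoopConfiguration would make them visible on that code path:
{code:java}
// Hedged sketch: mirror the OAuth properties onto the underlying Hadoop
// configuration, which FileSystem.get / Path.getFileSystem read directly.
// "<>" placeholders are kept as in the original config above.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.azure.account.auth.type.<>.dfs.core.windows.net", "OAuth")
hadoopConf.set("fs.azure.account.oauth.provider.type.<>.dfs.core.windows.net",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hadoopConf.set("fs.azure.account.oauth2.client.id.<>.dfs.core.windows.net", "**redacted")
hadoopConf.set("fs.azure.account.oauth2.client.secret.<>.dfs.core.windows.net", "**redacted")
hadoopConf.set("fs.azure.account.oauth2.client.endpoint.<>.dfs.core.windows.net",
  "https://login.microsoftonline.com/<>/oauth2/token")
{code}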
I believe this error could be related to Databricks' internal setup.
