[ANNOUNCE] .NET for Apache Spark™ 2.1 released

2022-02-02 Thread Terry Kim
Hi,

We are happy to announce that .NET for Apache Spark™ v2.1 has been released
<https://github.com/dotnet/spark/releases/tag/v2.1.0>! The release note
<https://github.com/dotnet/spark/blob/main/docs/release-notes/2.1.0/release-2.1.0.md>
includes the full list of features/improvements of this release.

Here are some of the highlights:

   - Support for Apache Spark 3.2
   - Exposing new SQL function APIs introduced in Spark 3.2

We would like to thank the community for the great feedback and all those
who contributed to this release.

Thanks,
Terry Kim on behalf of the .NET for Apache Spark™ team


Announcing Hyperspace v0.4.0 - an indexing subsystem for Apache Spark™

2021-02-08 Thread Terry Kim
Hi,

We are happy to announce that Hyperspace v0.4.0 - an indexing subsystem for
Apache Spark™ - has been released
<https://github.com/microsoft/hyperspace/releases/tag/v0.4.0>!

Here are some of the highlights:

   - Delta Lake support: Hyperspace v0.4.0 supports creating indexes on
   Delta Lake tables (a short Scala sketch follows this list). Please refer
   to the user guide
   <https://microsoft.github.io/hyperspace/docs/ug-supported-data-formats/#delta-lake>
   for more info.
   - Support for Databricks: A known issue when Hyperspace was run on
   Databricks has been addressed. Hyperspace v0.4.0 can now run on Databricks
   Runtime 5.5 LTS & 6.4!
   - Globbing patterns for indexes: Globbing patterns can be used to specify
   a subset of the source data to create and maintain an index on. Please
   refer to the user guide
   <https://microsoft.github.io/hyperspace/docs/ug-quick-start-guide/#supporting-globbing-patterns-on-hyperspace-since-040>
   for usage details.
   - Hybrid Scan improvements: Hyperspace v0.4.0 brings several improvements
   to Hybrid Scan, such as a better mechanism
   <https://microsoft.github.io/hyperspace/docs/ug-mutable-dataset/#how-to-enable>
   to enable/disable the feature, rank algorithm improvements, quick index
   refresh, etc.
   - Pluggable source provider: This release introduces an (evolving)
   pluggable source provider API set so that different source formats can be
   plugged in. This enabled the Delta Lake source to be plugged in, and there
   is an ongoing PR to support Iceberg tables.
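
As a quick illustration, here is a minimal Scala sketch of indexing a Delta
Lake table (the path, index name, and column names are hypothetical; see the
user guide above for the exact options):

import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._

// Load the Delta Lake table and create a Hyperspace index over it.
val deltaDf = spark.read.format("delta").load("/data/events")
val hs = new Hyperspace(spark)
hs.createIndex(deltaDf, IndexConfig("eventIdIdx", indexedColumns = Seq("eventId"), includedColumns = Seq("eventType")))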

We would like to thank the community for the great feedback and all those
who contributed to this release.

Thanks,
Terry Kim on behalf of the Hyperspace team


Re: [Spark SQL]HiveQL and Spark SQL producing different results

2021-01-12 Thread Terry Kim
Ying,
Can you share a query that produces different results?
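
For instance, a minimal windowed query along these lines (table and column
names are hypothetical), together with the output from both engines, would
help narrow it down. One common source of divergence worth ruling out first
is a window ORDER BY that does not uniquely order rows: tie-breaking is not
deterministic, so the two engines may legitimately disagree on tied rows.

spark.sql("""
  SELECT dept, name, salary,
         row_number() OVER (PARTITION BY dept ORDER BY salary) AS rn
  FROM employees
""").show()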

Thanks,
Terry

On Sun, Jan 10, 2021 at 1:48 PM Ying Zhou  wrote:

> Hi,
>
> I run some SQL using both Hive and Spark. Usually we get the same results.
> However, when a window function is in the script, Hive and Spark can
> produce different results. Is this intended behavior, or does either Hive
> or Spark have a bug?
>
> Thanks,
> Ying
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Announcing Hyperspace v0.3.0 - an indexing subsystem for Apache Spark™

2020-11-17 Thread Terry Kim
Hi,

We are happy to announce that Hyperspace v0.3.0 - an indexing subsystem for
Apache Spark™ - has just been released
<https://github.com/microsoft/hyperspace/releases/tag/v0.3.0>!

Here are some of the highlights:

   - Mutable dataset support: Hyperspace v0.3.0 supports mutable datasets,
   where users can append to or delete from the source data.
      - Hybrid scan: Prior to v0.3.0, any change in the original dataset
      content required a full refresh to make the index usable again, which
      could be a costly operation. With Hybrid Scan, the existing index can
      be utilized along with newly appended and/or deleted source files,
      without an explicit refresh operation. Please check out the Hybrid
      Scan doc
      <https://microsoft.github.io/hyperspace/docs/ug-mutable-dataset/#hybrid-scan>
      for more detail.
      - Incremental refresh: v0.3.0 introduces an "incremental" mode for
      refreshing indexes (see the sketch after this list). In this mode,
      index files are created only for the newly appended source files;
      deleted source files are handled by removing them from the existing
      index files. Please check out the Incremental Refresh doc
      <https://microsoft.github.io/hyperspace/docs/ug-mutable-dataset/#refresh-index---incremental-mode>
      for more detail.
   - Optimize index: The number of index files can grow due to incremental
   refreshes, possibly degrading performance. The new "optimizeIndex" API
   optimizes existing indexes by merging index files into an optimal number
   of files. Please check out the Optimize Index doc
   <https://microsoft.github.io/hyperspace/docs/ug-optimize-index/> for more
   detail.
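
As a quick illustration, here is a minimal Scala sketch of an incremental
refresh followed by an optimize (the index name is hypothetical, and the
exact refresh-mode argument may differ; see the linked docs):

import com.microsoft.hyperspace._

val hs = new Hyperspace(spark)
// Update the index only for newly appended/deleted source files.
hs.refreshIndex("myIndex", "incremental")
// Merge the small index files produced by incremental refreshes.
hs.optimizeIndex("myIndex")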

We would like to thank the community for the great feedback and all those
who contributed to this release.

Thanks,
Terry Kim on behalf of the Hyperspace team


Announcing .NET for Apache Spark™ 1.0

2020-11-06 Thread Terry Kim
Hi,

We are happy to announce that .NET for Apache Spark™ v1.0 has been released
<https://github.com/dotnet/spark/releases/tag/v1.0.0>! Please check out the
official blog
<https://cloudblogs.microsoft.com/opensource/2020/10/30/announcing-net-apache-spark-1/>.
The release note
<https://github.com/dotnet/spark/blob/master/docs/release-notes/1.0.0/release-1.0.0.md>
includes the full list of features/improvements of this release.

Here are some of the highlights:

   - Support for Apache Spark 3.0
   - Exposing new DataFrame / SQL function APIs introduced in Spark 3.0
   - Support for all the complex types in Spark SQL
   - Support for Delta Lake <https://github.com/delta-io/delta> v0.7 and
   Hyperspace <https://github.com/microsoft/hyperspace> v0.2

We would like to thank the community for the great feedback and all those
who contributed to this release.

Thanks,
Terry Kim on behalf of the .NET for Apache Spark™ team


Re: Renaming a DataFrame column makes Spark lose partitioning information

2020-08-04 Thread Terry Kim
This is fixed in Spark 3.0 by https://github.com/apache/spark/pull/26943:

scala> :paste
// Entering paste mode (ctrl-D to finish)

Seq((1, 2))
  .toDF("a", "b")
  .repartition($"b")
  .withColumnRenamed("b", "c")
  .repartition($"c")
  .explain()

// Exiting paste mode, now interpreting.

== Physical Plan ==
*(1) Project [a#7, b#8 AS c#11]
+- Exchange hashpartitioning(b#8, 200), false, [id=#12]
   +- LocalTableScan [a#7, b#8]
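
(Note there is now a single Exchange, on b#8: the second repartition on the
renamed column c is satisfied by the existing partitioning, so no additional
shuffle is added.)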

Thanks,
Terry

On Tue, Aug 4, 2020 at 6:26 AM Antoine Wendlinger 
wrote:

> Hi,
>
> When renaming a DataFrame column, it looks like Spark is forgetting the
> partition information:
>
> Seq((1, 2))
>   .toDF("a", "b")
>   .repartition($"b")
>   .withColumnRenamed("b", "c")
>   .repartition($"c")
>   .explain()
>
> Gives the following plan:
>
> == Physical Plan ==
> Exchange hashpartitioning(c#40, 10)
> +- *(1) Project [a#36, b#37 AS c#40]
>+- Exchange hashpartitioning(b#37, 10)
>   +- LocalTableScan [a#36, b#37]
>
> As you can see, two shuffles are done, but the second one is unnecessary.
> Is there a reason I don't know for this behavior ? Is there a way to work
> around it (other than not renaming my columns) ?
>
> I'm using Spark 2.4.3.
>
>
> Thanks for your help,
>
> Antoine
>


Re: Future timeout

2020-07-20 Thread Terry Kim
"spark.sql.broadcastTimeout" is the config you can use:
https://github.com/apache/spark/blob/fe07521c9efd9ce0913eee0d42b0ffd98b1225ec/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L863
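
For example, a quick way to raise it (value in seconds; the default is 300,
and 600 below is just an illustration):

// In code:
spark.conf.set("spark.sql.broadcastTimeout", "600")
// Or at submit time:
// spark-submit --conf spark.sql.broadcastTimeout=600 ...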

Thanks,
Terry

On Mon, Jul 20, 2020 at 11:20 AM Amit Sharma  wrote:

> Please help on this.
>
>
> Thanks
> Amit
>
> On Fri, Jul 17, 2020 at 9:10 AM Amit Sharma  wrote:
>
>> Hi, sometimes my spark streaming job throws this exception: Futures timed
>> out after [300 seconds].
>> I am not sure where the default timeout configuration is. Can I increase
>> it? Please help.
>>
>>
>>
>> Thanks
>> Amit
>>
>>
>>
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [300 seconds]
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at
>> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>> at
>> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
>> at
>> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:372)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>> at
>> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>> at
>> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
>> at
>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:116)
>> at
>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:257)
>> at
>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:101)
>> at
>> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:186)
>> at
>> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
>> at
>> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
>> at
>> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:186)
>> at
>> org.apache.spark.sql.execution.SerializeFromObjectExec.consume(objects.scala:101)
>> at
>> org.apache.spark.sql.execution.SerializeFromObjectExec.doConsume(objects.scala:121)
>> at
>> org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:213)
>> at
>> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:184)
>> at
>> org.apache.spark.sql.execution.MapElementsExec.consume(objects.scala:200)
>> at
>> org.apache.spark.sql.execution.MapElementsExec.doConsume(objects.scala:224)
>> at
>> org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:213)
>> at
>> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:184)
>> at
>> org.apache.spark.sql.execution.DeserializeToObjectExec.consume(objects.scala:68)
>>
>


Announcing .NET for Apache Spark™ 0.12

2020-07-02 Thread Terry Kim
We are happy to announce that .NET for Apache Spark™ v0.12 has been released
<https://github.com/dotnet/spark/releases>! Thanks to the community for the
great feedback. The release note
<https://github.com/dotnet/spark/blob/master/docs/release-notes/0.12/release-0.12.md>
includes the full list of features/improvements of this release.

Here are some of the highlights:

   - Ability to write UDFs using complex types such as Row, Array, Map,
   Date, Timestamp, etc.
   - Ability to write UDFs using .NET DataFrame
   <https://devblogs.microsoft.com/dotnet/an-introduction-to-dataframe/>
   (backed by Apache Arrow)
   - Enhanced structured streaming support with ForeachBatch/Foreach APIs
   - .NET binding for Delta Lake <https://github.com/delta-io/delta> v0.6
   and Hyperspace <https://github.com/microsoft/hyperspace> v0.1
   - Support for Apache Spark™ 2.4.6 (3.0 support is on the way!)
   - SparkSession.CreateDataFrame, Broadcast variable
   - Preliminary support for MLlib (TF-IDF, Word2Vec, Bucketizer, etc.)
   - Support for .NET Core 3.1

We would like to thank all those who contributed to this release.

Thanks,
Terry Kim on behalf of the .NET for Apache Spark™ team


Hyperspace v0.1 is now open-sourced!

2020-07-02 Thread Terry Kim
Hi all,

We are happy to announce the open-sourcing of Hyperspace v0.1, an indexing
subsystem for Apache Spark™:

   - Code: https://github.com/microsoft/hyperspace
   - Blog Article: https://aka.ms/hyperspace-blog
   - Spark Summit Talk:
   
https://databricks.com/session_na20/hyperspace-an-indexing-subsystem-for-apache-spark
   - Docs: https://aka.ms/hyperspace
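
To give a flavor of the API, here is a minimal Scala sketch based on the
quick-start docs (paths, index name, and column names are hypothetical; see
the docs above for the exact API):

import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._

val df = spark.read.parquet("/data/orders")   // hypothetical dataset
val hs = new Hyperspace(spark)
hs.createIndex(df, IndexConfig("orderIdIdx", indexedColumns = Seq("orderId"), includedColumns = Seq("amount")))
spark.enableHyperspace()   // let the optimizer consider Hyperspace indexes
spark.read.parquet("/data/orders").filter("orderId = 1").select("amount").show()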

This project would not have been possible without the outstanding work from
the Apache Spark™ community. Thank you everyone and we look forward to
collaborating with the community towards evolving Hyperspace.

Thanks,
Terry Kim on behalf of the Hyperspace team


Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Terry Kim
Is the following what you are trying to do?

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("x", "y")
val df2 = (0 until 100).map(i => (i % 7, i % 11)).toDF("x", "y")
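// Bucket both tables by the full join keys (x, y) with the same number of buckets so the join can avoid a shuffle.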
df1.write.format("parquet").bucketBy(8, "x", "y").saveAsTable("t1")
df2.write.format("parquet").bucketBy(8, "x", "y").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val joined = t1.join(t2, Seq("x", "y"))
joined.explain

I see no exchange:

== Physical Plan ==
*(3) Project [x#342, y#343]
+- *(3) SortMergeJoin [x#342, y#343], [x#346, y#347], Inner
   :- *(1) Sort [x#342 ASC NULLS FIRST, y#343 ASC NULLS FIRST], false, 0
   :  +- *(1) Project [x#342, y#343]
   : +- *(1) Filter (isnotnull(x#342) && isnotnull(y#343))
   :+- *(1) FileScan parquet default.t1[x#342,y#343] Batched: true,
Format: Parquet, Location: InMemoryFileIndex[file:/], PartitionFilters: [],
PushedFilters: [IsNotNull(x), IsNotNull(y)], ReadSchema:
struct, SelectedBucketsCount: 8 out of 8
   +- *(2) Sort [x#346 ASC NULLS FIRST, y#347 ASC NULLS FIRST], false, 0
  +- *(2) Project [x#346, y#347]
 +- *(2) Filter (isnotnull(x#346) && isnotnull(y#347))
+- *(2) FileScan parquet default.t2[x#346,y#347] Batched: true,
Format: Parquet, Location: InMemoryFileIndex[file:/], PartitionFilters: [],
PushedFilters: [IsNotNull(x), IsNotNull(y)], ReadSchema:
struct, SelectedBucketsCount: 8 out of 8

On Sun, May 31, 2020 at 2:38 PM Patrick Woody 
wrote:

> Hey Terry,
>
> Thanks for the response! I'm not sure that it ends up working though - the
> bucketing still seems to require the exchange before the join. Both tables
> below are saved bucketed by "x":
> *(5) Project [x#29, y#30, z#31, z#37]
> +- *(5) SortMergeJoin [x#29, y#30], [x#35, y#36], Inner
>:- *(2) Sort [x#29 ASC NULLS FIRST, y#30 ASC NULLS FIRST], false, 0
> *   :  +- Exchange hashpartitioning(x#29, y#30, 200)*
>: +- *(1) Project [x#29, y#30, z#31]
>:+- *(1) Filter (isnotnull(x#29) && isnotnull(y#30))
>:   +- *(1) FileScan parquet default.ax[x#29,y#30,z#31]
> Batched: true, Format: Parquet, Location:
> InMemoryFileIndex[file:/home/pwoody/tm/spark-2.4.5-bin-hadoop2.7/spark-warehouse/ax],
> PartitionFilters: [], PushedFilters: [IsNotNull(x), IsNotNull(y)],
> ReadSchema: struct, SelectedBucketsCount: 200 out of 200
>+- *(4) Sort [x#35 ASC NULLS FIRST, y#36 ASC NULLS FIRST], false, 0
> *  +- Exchange hashpartitioning(x#35, y#36, 200)*
>  +- *(3) Project [x#35, y#36, z#37]
> +- *(3) Filter (isnotnull(x#35) && isnotnull(y#36))
>+- *(3) FileScan parquet default.bx[x#35,y#36,z#37]
> Batched: true, Format: Parquet, Location:
> InMemoryFileIndex[file:/home/pwoody/tm/spark-2.4.5-bin-hadoop2.7/spark-warehouse/bx],
> PartitionFilters: [], PushedFilters: [IsNotNull(x), IsNotNull(y)],
> ReadSchema: struct, SelectedBucketsCount: 200 out of 200
>
> Best,
> Pat
>
>
>
> On Sun, May 31, 2020 at 3:15 PM Terry Kim  wrote:
>
>> You can use bucketBy to avoid shuffling in your scenario. This test suite
>> has some examples:
>> https://github.com/apache/spark/blob/45cf5e99503b00a6bd83ea94d6d92761db1a00ab/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L343
>>
>> Thanks,
>> Terry
>>
>> On Sun, May 31, 2020 at 7:43 AM Patrick Woody 
>> wrote:
>>
>>> Hey all,
>>>
>>> I have one large table, A, and two medium sized tables, B & C, that I'm
>>> trying to complete a join on efficiently. The result is multiplicative on A
>>> join B, so I'd like to avoid shuffling that result. For this example, let's
>>> just assume each table has three columns, x, y, z. The below is all being
>>> tested on Spark 2.4.5 locally.
>>>
>>> I'd like to perform the following join:
>>> A.join(B, Seq("x", "y")).join(C, Seq("x", "z"))
>>> This outputs the following physical plan:
>>> == Physical Plan ==
>>> *(6) Project [x#32, z#34, y#33, z#74, y#53]
>>> +- *(6) SortMergeJoin [x#32, z#34], [x#52, z#54], Inner
>>>:- *(4) Sort [x#32 ASC NULLS FIRST, z#34 ASC NULLS FIRST], false, 0
>>>:  +- Exchange hashpartitioning(x#32, z#34, 200)
>>>: +- *(3) Project [x#32, y#33, z#34, z#74]
>>>:+- *(3) SortMergeJoin [x#32, y#33], [x#72, y#73], Inner
>>>:   :- *(1) Sort [x#32 ASC NULLS FIRST, y#33 ASC NULLS
>>> FIRST], false, 0
>>>  

Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Terry Kim
You can use bucketBy to avoid shuffling in your scenario. This test suite
has some examples:
https://github.com/apache/spark/blob/45cf5e99503b00a6bd83ea94d6d92761db1a00ab/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L343

Thanks,
Terry

On Sun, May 31, 2020 at 7:43 AM Patrick Woody 
wrote:

> Hey all,
>
> I have one large table, A, and two medium sized tables, B & C, that I'm
> trying to complete a join on efficiently. The result is multiplicative on A
> join B, so I'd like to avoid shuffling that result. For this example, let's
> just assume each table has three columns, x, y, z. The below is all being
> tested on Spark 2.4.5 locally.
>
> I'd like to perform the following join:
> A.join(B, Seq("x", "y")).join(C, Seq("x", "z"))
> This outputs the following physical plan:
> == Physical Plan ==
> *(6) Project [x#32, z#34, y#33, z#74, y#53]
> +- *(6) SortMergeJoin [x#32, z#34], [x#52, z#54], Inner
>:- *(4) Sort [x#32 ASC NULLS FIRST, z#34 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(x#32, z#34, 200)
>: +- *(3) Project [x#32, y#33, z#34, z#74]
>:+- *(3) SortMergeJoin [x#32, y#33], [x#72, y#73], Inner
>:   :- *(1) Sort [x#32 ASC NULLS FIRST, y#33 ASC NULLS FIRST],
> false, 0
>:   :  +- Exchange hashpartitioning(x#32, y#33, 200)
>:   : +- LocalTableScan [x#32, y#33, z#34]
>:   +- *(2) Sort [x#72 ASC NULLS FIRST, y#73 ASC NULLS FIRST],
> false, 0
>:  +- Exchange hashpartitioning(x#72, y#73, 200)
>: +- LocalTableScan [x#72, y#73, z#74]
>+- *(5) Sort [x#52 ASC NULLS FIRST, z#54 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(x#52, z#54, 200)
>  +- LocalTableScan [x#52, y#53, z#54]
>
>
> I may be misremembering, but in the past I thought you had the ability to
> pre-partition each table by "x" and it would satisfy the requirements of
> the join since it is already clustered by the key on both sides using the
> same hash function (this assumes numPartitions lines up obviously). However
> it seems like it will insert another exchange:
>
> A.repartition($"x").join(B.repartition($"x"), Seq("x",
> "y")).join(C.repartition($"x"), Seq("x", "z"))
> *(6) Project [x#32, z#34, y#33, z#74, y#53]
> +- *(6) SortMergeJoin [x#32, z#34], [x#52, z#54], Inner
>:- *(4) Sort [x#32 ASC NULLS FIRST, z#34 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(x#32, z#34, 200)
>: +- *(3) Project [x#32, y#33, z#34, z#74]
>:+- *(3) SortMergeJoin [x#32, y#33], [x#72, y#73], Inner
>:   :- *(1) Sort [x#32 ASC NULLS FIRST, y#33 ASC NULLS FIRST],
> false, 0
>:   :  +- Exchange hashpartitioning(x#32, y#33, 200)
>:   : +- Exchange hashpartitioning(x#32, 200)
>:   :+- LocalTableScan [x#32, y#33, z#34]
>:   +- *(2) Sort [x#72 ASC NULLS FIRST, y#73 ASC NULLS FIRST],
> false, 0
>:  +- Exchange hashpartitioning(x#72, y#73, 200)
>: +- Exchange hashpartitioning(x#72, 200)
>:+- LocalTableScan [x#72, y#73, z#74]
>+- *(5) Sort [x#52 ASC NULLS FIRST, z#54 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(x#52, z#54, 200)
>  +- ReusedExchange [x#52, y#53, z#54], Exchange
> hashpartitioning(x#32, 200).
>
> Note, that using this "strategy" with groupBy("x", "y") works fine though
> I assume that is because it doesn't have to consider the other side of the
> join.
>
> Did this used to work or am I simply confusing it with groupBy? Either way
> - any thoughts on how I can avoid shuffling the bulk of the join result?
>
> Thanks,
> Pat
>
>
>
>
>


Re: [Spark SQL]: Does namespace name is always needed in a query for tables from a user defined catalog plugin

2019-12-01 Thread Terry Kim
Hi Xufei,
I also noticed the same behavior while looking into relation resolution
(see Appendix A in this doc). I created SPARK-30094 and will follow up.
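
For reference, the behavior being described boils down to the following
(catalog/namespace/table names are taken from the report; a condensed
sketch, not a full reproduction):

spark.sql("USE example_catalog.test")
spark.sql("SELECT * FROM example_catalog.test.t")  // works
spark.sql("SELECT * FROM test.t")                  // works
spark.sql("SELECT * FROM t")                       // reported to fail with a table-not-found error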

Thanks,
Terry

On Sun, Dec 1, 2019 at 7:12 PM xufei  wrote:

> Hi,
>
> I'm trying to write a catalog plugin based on spark-3.0-preview, and I
> found that even when I use 'use catalog.namespace' to set the current
> catalog and namespace, I still need to use a qualified name in the query.
>
> For example, I add a catalog named 'example_catalog', there is a database
> named 'test' in 'example_catalog', and a table 't' in
> 'example_catalog.test'. I can query the table using 'select * from
> example_catalog.test.t' under default catalog(which is spark_catalog).
> After I use 'use example_catalog.test' to change the current catalog to
> 'example_catalog', and the current namespace to 'test', I can query the
> table using 'select * from test.t', but 'select * from t' failed due to
> table_not_found exception.
>
> I want to know if this is an expected behavior?  If yes, it sounds a
> little weird since I think after 'use example_catalog.test', all the
> un-qualified identifiers should be interpreted as
> 'example_catalog.test.identifier'.
>
> Attachment is a test file that you can use to reproduce the problem I met.
>
> Thanks.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Announcing .NET for Apache Spark 0.5.0

2019-09-30 Thread Terry Kim
We are thrilled to announce that .NET for Apache Spark 0.5.0 has just been
released!



Some of the highlights of this release include:

   - Delta Lake's DeltaTable APIs
   - UDF improvements
   - Support for Spark 2.3.4/2.4.4

The release notes include the full list of features/improvements of this
release.



We would like to thank all those who contributed to this release.



Thanks,

Terry


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Terry Kim
Can the following be included?

[SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in
EpochTracker (to support Python UDFs)


Thanks,
Terry

On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan  wrote:

> +1
>
> On Wed, Aug 14, 2019 at 12:52 PM Holden Karau 
> wrote:
>
>> +1
>> Does anyone have any critical fixes they’d like to see in 2.4.4?
>>
>> On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:
>>
>>> Seems fine to me if there are enough valuable fixes to justify another
>>> release. If there are any other important fixes imminent, it's fine to
>>> wait for those.
>>>
>>>
>>> On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > Spark 2.4.3 was released three months ago (8th May).
>>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>>> `branch-24` since 2.4.3.
>>> >
>>> > It would be great if we can have Spark 2.4.4.
>>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>>> >
>>> > Last time, there was a request for K8s issue and now I'm waiting for
>>> SPARK-27900.
>>> > Please let me know if there is another issue.
>>> >
>>> > Thanks,
>>> > Dongjoon.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Announcing .NET for Apache Spark 0.4.0

2019-07-31 Thread Terry Kim
We are thrilled to announce that .NET for Apache Spark 0.4.0 has just been
released!



Some of the highlights of this release include:

   - Apache Arrow backed UDFs (Vector UDF, Grouped Map UDF)
   - Robust UDF-related assembly loading
   - Local UDF debugging



The release notes include the full list of features/improvements of this
release.



We would like to thank all those who contributed to this release.



Thanks,

Terry