[spark] branch branch-3.0 updated: [MINOR][DOCS] Corrected spacing in structured streaming programming

2021-11-01 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 615e525  [MINOR][DOCS] Corrected spacing in structured streaming 
programming
615e525 is described below

commit 615e5257887e8e7a0879ccca43bfbe0ebf161f28
Author: mans2singh 
AuthorDate: Tue Nov 2 11:01:57 2021 +0900

[MINOR][DOCS] Corrected spacing in structured streaming programming

### What changes were proposed in this pull request?
There is no space between `with` and `spark.sql.streaming.fileSource.cleaner.numThreads`, as shown below:

`... configured withspark.sql.streaming.fileSource.cleaner.numThreads ...`

### Why are the changes needed?
Added the missing space so the configuration name reads correctly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only documentation was changed; no code was changed.

Closes #34458 from mans2singh/structured_streaming_programming_guide_space.

Authored-by: mans2singh 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 675071a38e47dc2c55cf4f71de7ad0bebc1b4f2b)
Signed-off-by: Hyukjin Kwon 
---
 docs/structured-streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 31b1ca9..84296e0 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -553,7 +553,7 @@ Here are the details of all the sources in Spark.
 For example, suppose you provide '/hello?/spark/*' as source pattern, 
'/hello1/spark/archive/dir' cannot be used as the value of "sourceArchiveDir", 
as '/hello?/spark/*' and '/hello1/spark/archive' will be matched. 
'/hello1/spark' cannot be also used as the value of "sourceArchiveDir", as 
'/hello?/spark' and '/hello1/spark' will be matched. '/archived/here' would be 
OK as it doesn't match.
 Spark will move source files respecting their own path. For example, 
if the path of source file is /a/b/dataset.txt and the path of 
archive directory is /archived/here, file will be moved to 
/archived/here/a/b/dataset.txt.
 NOTE: Both archiving (via moving) or deleting completed files will 
introduce overhead (slow down, even if it's happening in separate thread) in 
each micro-batch, so you need to understand the cost for each operation in your 
file system before enabling this option. On the other hand, enabling this 
option will reduce the cost to list source files which can be an expensive 
operation.
-Number of threads used in completed file cleaner can be configured withspark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
+Number of threads used in completed file cleaner can be configured with spark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
 NOTE 2: The source path should not be used from multiple sources or 
queries when enabling this option. Similarly, you must ensure the source path 
doesn't match to any files in output directory of file stream sink.
 NOTE 3: Both delete and move actions are best effort. Failing to 
delete or move files will not fail the streaming query. Spark may not clean up 
some source files in some circumstances - e.g. the application doesn't shut 
down gracefully, too many files are queued to clean up.
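
The hunk above documents the completed-file cleaner. As a point of reference, here is a minimal sketch of how those settings fit together in a streaming read; the paths and thread count are illustrative assumptions, while the option and config names come from the guide text quoted above:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming example paths and an example thread count.
val spark = SparkSession.builder()
  .appName("file-source-cleaner-example")
  // Number of threads used by the completed-file cleaner (default: 1).
  .config("spark.sql.streaming.fileSource.cleaner.numThreads", "2")
  .getOrCreate()

val lines = spark.readStream
  .format("text")
  .option("cleanSource", "archive")             // archive (move) completed files
  .option("sourceArchiveDir", "/archived/here") // must not match the source pattern
  .load("/hello?/spark/*")                      // source glob from the example above
```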
 

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated: [MINOR][DOCS] Corrected spacing in structured streaming programming

2021-11-01 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new e9bead7  [MINOR][DOCS] Corrected spacing in structured streaming 
programming
e9bead7 is described below

commit e9bead79f8555faa8ba6a3b2ca9925a28022bee9
Author: mans2singh 
AuthorDate: Tue Nov 2 11:01:57 2021 +0900

[MINOR][DOCS] Corrected spacing in structured streaming programming

### What changes were proposed in this pull request?
There is no space between `with` and `spark.sql.streaming.fileSource.cleaner.numThreads`, as shown below:

`... configured withspark.sql.streaming.fileSource.cleaner.numThreads ...`

### Why are the changes needed?
Added the missing space so the configuration name reads correctly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only documentation was changed; no code was changed.

Closes #34458 from mans2singh/structured_streaming_programming_guide_space.

Authored-by: mans2singh 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 675071a38e47dc2c55cf4f71de7ad0bebc1b4f2b)
Signed-off-by: Hyukjin Kwon 
---
 docs/structured-streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index d88cf91b..28d312e 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -553,7 +553,7 @@ Here are the details of all the sources in Spark.
 For example, suppose you provide '/hello?/spark/*' as source pattern, 
'/hello1/spark/archive/dir' cannot be used as the value of "sourceArchiveDir", 
as '/hello?/spark/*' and '/hello1/spark/archive' will be matched. 
'/hello1/spark' cannot be also used as the value of "sourceArchiveDir", as 
'/hello?/spark' and '/hello1/spark' will be matched. '/archived/here' would be 
OK as it doesn't match.
 Spark will move source files respecting their own path. For example, 
if the path of source file is /a/b/dataset.txt and the path of 
archive directory is /archived/here, file will be moved to 
/archived/here/a/b/dataset.txt.
 NOTE: Both archiving (via moving) or deleting completed files will 
introduce overhead (slow down, even if it's happening in separate thread) in 
each micro-batch, so you need to understand the cost for each operation in your 
file system before enabling this option. On the other hand, enabling this 
option will reduce the cost to list source files which can be an expensive 
operation.
-Number of threads used in completed file cleaner can be configured withspark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
+Number of threads used in completed file cleaner can be configured with spark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
 NOTE 2: The source path should not be used from multiple sources or 
queries when enabling this option. Similarly, you must ensure the source path 
doesn't match to any files in output directory of file stream sink.
 NOTE 3: Both delete and move actions are best effort. Failing to 
delete or move files will not fail the streaming query. Spark may not clean up 
some source files in some circumstances - e.g. the application doesn't shut 
down gracefully, too many files are queued to clean up.
 

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [MINOR][DOCS] Corrected spacing in structured streaming programming

2021-11-01 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new a63d2d2  [MINOR][DOCS] Corrected spacing in structured streaming 
programming
a63d2d2 is described below

commit a63d2d2c31af6180a15c13098a1345523a0712c6
Author: mans2singh 
AuthorDate: Tue Nov 2 11:01:57 2021 +0900

[MINOR][DOCS] Corrected spacing in structured streaming programming

### What changes were proposed in this pull request?
There is no space between `with` and `spark.sql.streaming.fileSource.cleaner.numThreads`, as shown below:

`... configured withspark.sql.streaming.fileSource.cleaner.numThreads ...`

### Why are the changes needed?
Added the missing space so the configuration name reads correctly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only documentation was changed; no code was changed.

Closes #34458 from mans2singh/structured_streaming_programming_guide_space.

Authored-by: mans2singh 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 675071a38e47dc2c55cf4f71de7ad0bebc1b4f2b)
Signed-off-by: Hyukjin Kwon 
---
 docs/structured-streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 4642d44..3aa0d4c 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -553,7 +553,7 @@ Here are the details of all the sources in Spark.
 For example, suppose you provide '/hello?/spark/*' as source pattern, 
'/hello1/spark/archive/dir' cannot be used as the value of "sourceArchiveDir", 
as '/hello?/spark/*' and '/hello1/spark/archive' will be matched. 
'/hello1/spark' cannot be also used as the value of "sourceArchiveDir", as 
'/hello?/spark' and '/hello1/spark' will be matched. '/archived/here' would be 
OK as it doesn't match.
 Spark will move source files respecting their own path. For example, 
if the path of source file is /a/b/dataset.txt and the path of 
archive directory is /archived/here, file will be moved to 
/archived/here/a/b/dataset.txt.
 NOTE: Both archiving (via moving) or deleting completed files will 
introduce overhead (slow down, even if it's happening in separate thread) in 
each micro-batch, so you need to understand the cost for each operation in your 
file system before enabling this option. On the other hand, enabling this 
option will reduce the cost to list source files which can be an expensive 
operation.
-Number of threads used in completed file cleaner can be configured withspark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
+Number of threads used in completed file cleaner can be configured with spark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
 NOTE 2: The source path should not be used from multiple sources or 
queries when enabling this option. Similarly, you must ensure the source path 
doesn't match to any files in output directory of file stream sink.
 NOTE 3: Both delete and move actions are best effort. Failing to 
delete or move files will not fail the streaming query. Spark may not clean up 
some source files in some circumstances - e.g. the application doesn't shut 
down gracefully, too many files are queued to clean up.
 

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (675071a -> 320fa07)

2021-11-01 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 675071a  [MINOR][DOCS] Corrected spacing in structured streaming 
programming
 add 320fa07  [SPARK-37159][SQL][TESTS] Change 
HiveExternalCatalogVersionsSuite to be able to test with Java 17

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala  | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (cf7fbc1 -> 675071a)

2021-11-01 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from cf7fbc1  [SPARK-36554][SQL][PYTHON] Expose make_date expression in 
functions.scala
 add 675071a  [MINOR][DOCS] Corrected spacing in structured streaming 
programming

No new revisions were added by this update.

Summary of changes:
 docs/structured-streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (11de0fd -> cf7fbc1)

2021-11-01 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 11de0fd  [MINOR][DOCS] Add import for MultivariateGaussian to Docs
 add cf7fbc1  [SPARK-36554][SQL][PYTHON] Expose make_date expression in 
functions.scala

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.sql.rst   |  1 +
 python/pyspark/sql/functions.py| 29 ++
 python/pyspark/sql/tests/test_functions.py | 10 +++-
 .../scala/org/apache/spark/sql/functions.scala |  9 +++
 4 files changed, 48 insertions(+), 1 deletion(-)
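
For context, here is a minimal sketch of how the newly exposed Scala function can be called; the DataFrame and column names are illustrative, not from the commit:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, make_date}

val spark = SparkSession.builder().appName("make-date-demo").master("local[*]").getOrCreate()
import spark.implicits._

// make_date(year, month, day) assembles a DateType column from three integer columns.
val df = Seq((2021, 11, 1)).toDF("y", "m", "d")
df.select(make_date(col("y"), col("m"), col("d")).as("date")).show()
// +----------+
// |      date|
// +----------+
// |2021-11-01|
// +----------+
```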

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (70fde44 -> 11de0fd)

2021-11-01 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 70fde44  [SPARK-37062][SS] Introduce a new data source for providing 
consistent set of rows per microbatch
 add 11de0fd  [MINOR][DOCS] Add import for MultivariateGaussian to Docs

No new revisions were added by this update.

Summary of changes:
 python/pyspark/ml/stat.py | 1 +
 1 file changed, 1 insertion(+)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-37062][SS] Introduce a new data source for providing consistent set of rows per microbatch

2021-11-01 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 70fde44  [SPARK-37062][SS] Introduce a new data source for providing 
consistent set of rows per microbatch
70fde44 is described below

commit 70fde44e930926cbcd1fc95fa7cfb915c25cff9c
Author: Jungtaek Lim 
AuthorDate: Mon Nov 1 20:04:10 2021 +0900

[SPARK-37062][SS] Introduce a new data source for providing consistent set 
of rows per microbatch

### What changes were proposed in this pull request?

This PR proposes to introduce a new data source with the short name 
"rate-micro-batch", which produces input rows similar to "rate" (incrementing 
long values with timestamps), but ensures that each micro-batch has a 
"predictable" set of input rows.

"rate-micro-batch" data source receives a config to specify the number of 
rows per micro-batch, which defines the set of input rows for further 
micro-batches. For example, if the number of rows per micro-batch is set to 
1000, the first batch would have 1000 rows having value range as `0~999`, the 
second batch would have 1000 rows having value range as `1000~1999`, and so on. 
This characteristic brings different use cases compared to rate data source, as 
we can't predict the input rows [...]

For generated time (the timestamp column), the data source applies the same 
mechanism to make the value of the column predictable. The `startTimestamp` 
option defines the starting value of the generated time, and the 
`advanceMillisPerBatch` option defines how much the generated time advances per 
micro-batch. All input rows in the same micro-batch will have the same timestamp.

This source supports the following options:

* `rowsPerBatch` (e.g. 100): How many rows should be generated per 
micro-batch.
* `numPartitions` (e.g. 10, default: Spark's default parallelism): The 
partition number for the generated rows.
* `startTimestamp` (e.g. 1000, default: 0): starting value of generated time
* `advanceMillisPerBatch` (e.g. 1000, default: 1000): the amount of time 
being advanced in generated time on each micro-batch.
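
Putting the options together, a minimal spark-shell sketch (the option values below are examples, not from the PR):

```scala
// Each micro-batch contains exactly rowsPerBatch rows with a deterministic range:
// batch 0 -> values 0~999, batch 1 -> values 1000~1999, and so on.
val stream = spark.readStream
  .format("rate-micro-batch")
  .option("rowsPerBatch", 1000)
  .option("numPartitions", 3)
  .option("startTimestamp", 0L)          // starting value of the generated time
  .option("advanceMillisPerBatch", 1000) // generated time advances 1000 ms per batch
  .load()

stream.writeStream.format("console").start()
```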

### Why are the changes needed?

The "rate" data source has been known to be used as a benchmark for 
streaming query.

While this helps to put the query to the limit (how many rows the query 
could process per second), the rate data source doesn't provide consistent rows 
per batch into stream, which leads two environments be hard to compare with.

For example, in many cases, you may want to compare the metrics in the 
batches between test environments (like running the same streaming query with 
different options). These metrics are strongly affected if the distribution of 
input rows across batches changes, especially when a micro-batch has lagged 
(for any reason) and the rate data source produces more input rows for the next batch.

Also, when you test against streaming aggregation, you may want the data 
source to produce the same set of input rows per batch (deterministic), so that 
you can plan how these input rows will be aggregated and how state rows will be 
evicted, and craft the test query based on that plan.

### Does this PR introduce _any_ user-facing change?

Yes, end users can leverage a new data source in the micro-batch mode of a 
streaming query for testing/benchmarking.

### How was this patch tested?

New UTs, and manually tested via below query in spark-shell:

```
spark.readStream.format("rate-micro-batch").option("rowsPerBatch", 20).option("numPartitions", 3).load().writeStream.format("console").start()
```

Closes #34333 from HeartSaVioR/SPARK-37062.

Authored-by: Jungtaek Lim 
Signed-off-by: Jungtaek Lim 
---
 docs/structured-streaming-programming-guide.md |  13 ++
 ...org.apache.spark.sql.sources.DataSourceRegister |   1 +
 .../sources/RatePerMicroBatchProvider.scala| 127 +
 .../sources/RatePerMicroBatchStream.scala  | 175 ++
 .../sources/RatePerMicroBatchProviderSuite.scala   | 204 +
 5 files changed, 520 insertions(+)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index b36cdc7..6237d47 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -517,6 +517,8 @@ There are a few built-in sources.
 
   - **Rate source (for testing)** - Generates data at the specified number of 
rows per second, each output row contains a `timestamp` and `value`. Where 
`timestamp` is a `Timestamp` type containing the time of message dispatch, and 
`value` is of `Long` type containing the message count, starting from 0 as the 
first row. This source is intended for testing and benchmarking.

[spark] branch master updated (13c372d -> d43a678)

2021-11-01 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 13c372d  [SPARK-37150][SQL] Migrate DESCRIBE NAMESPACE to use V2 
command by default
 add d43a678  [SPARK-37161][SQL] RowToColumnConverter support 
AnsiIntervalType

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/execution/Columnar.scala  |  4 +--
 .../execution/vectorized/ColumnarBatchSuite.scala  | 37 ++
 2 files changed, 39 insertions(+), 2 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org