Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When I run this job in local mode  spark-submit --master local[4]

with

spark = SparkSession.builder \
.appName("tests") \
.enableHiveSupport() \
.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
df3.explain(extended=True)

and no caching

I see this plan

== Parsed Logical Plan ==
'Join UsingJoin(Inner, [index])
:- Relation [index#0,0#1] csv
+- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS avg(0)#7]
   +- Relation [index#11,0#12] csv

== Analyzed Logical Plan ==
index: string, 0: string, avg(0): double
Project [index#0, 0#1, avg(0)#7]
+- Join Inner, (index#0 = index#11)
   :- Relation [index#0,0#1] csv
   +- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
avg(0)#7]
  +- Relation [index#11,0#12] csv

== Optimized Logical Plan ==
Project [index#0, 0#1, avg(0)#7]
+- Join Inner, (index#0 = index#11)
   :- Filter isnotnull(index#0)
   :  +- Relation [index#0,0#1] csv
   +- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
avg(0)#7]
  +- Filter isnotnull(index#11)
 +- Relation [index#11,0#12] csv

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [index#0, 0#1, avg(0)#7]
   +- BroadcastHashJoin [index#0], [index#11], Inner, BuildRight, false
  :- Filter isnotnull(index#0)
  :  +- FileScan csv [index#0,0#1] Batched: false, DataFilters:
[isnotnull(index#0)], Format: CSV, Location: InMemoryFileIndex(1
paths)[hdfs://rhes75:9000/tmp/df1.csv], PartitionFilters: [],
PushedFilters: [IsNotNull(index)], ReadSchema: struct
  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0,
string, true]),false), [plan_id=174]
 +- HashAggregate(keys=[index#11], functions=[avg(cast(0#12 as
double))], output=[index#11, avg(0)#7])
+- Exchange hashpartitioning(index#11, 200),
ENSURE_REQUIREMENTS, [plan_id=171]
   +- HashAggregate(keys=[index#11],
functions=[partial_avg(cast(0#12 as double))], output=[index#11, sum#28,
count#29L])
  +- Filter isnotnull(index#11)
 +- FileScan csv [index#11,0#12] Batched: false,
DataFilters: [isnotnull(index#11)], Format: CSV, Location:
InMemoryFileIndex(1 paths)[hdfs://rhes75:9000/tmp/df1.csv],
PartitionFilters: [], PushedFilters: [IsNotNull(index)], ReadSchema:
struct


So there are two in-memory file scans for the csv file. So it caches the data
already, given the small result set. Do you see this?

HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 7 May 2023 at 17:48, Nitin Siwach  wrote:

> Thank you for the help Mich :)
>
> I have not started with a pandas DF. I used pandas only to create a dummy
> .csv, dumped to disk, that I intend to use to showcase my pain point.
> Providing the pandas code was just to make the example end-to-end runnable
> and to minimize the effort for anyone trying to help me out.
>
> I don't think Spark validating the file's existence qualifies as an action
> in Spark parlance. Sure, there would be an analysis exception if the file
> is not found at the location provided; however, if you provide a schema and
> a valid path, then no job shows up on the Spark UI, which confirms (IMO)
> that no action has been taken (one action necessarily equals at least one
> job). If you don't provide the schema, then a job is triggered (an action)
> to infer the schema for subsequent logical planning.
>
> Since I am just demonstrating my lack of understanding I have chosen local
> mode. Otherwise, I do use google buckets to host all the data
>
> That being said, I think my question is something entirely different: a
> single action (df3.count()) is reading the same csv twice. I do not
> understand that. So far, I have always thought that data should be
> persisted only when a DAG subset is to be reused by several actions.
>
>
> On Sun, May 7, 2023 at 9:47 PM Mich Talebzadeh 
> wrote:
>
>> You have started with panda DF which won't scale outside of the driver
>> itself.
>>
>> Let us put that aside.
>> df1.to_csv("./df1.csv",index_label = "index")  ## write the dataframe to
>> the underlying file system
>>
>> starting with spark
>>
>> df1 = spark.read.csv("./df1.csv", header=True, schema = schema) ## read
>> the dataframe from the underlying file system
>>
>> That is your first action, because spark needs to validate that the file
>> exists and check the schema. What will happen if that file does not exist?
>>
>> 

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
Hi, Mich:

Thanks for your reply, but maybe I didn't make my question clear.

I am looking for a solution to compute the count of each element in an array, 
without "exploding" the array, and output a Map structure as a column.
For example, for an array as ('a', 'b', 'a'), I want to output a column as 
Map('a' -> 2, 'b' -> 1).
I think that the "aggregate" function should be able to do this, following the
example shown in the link of my original email, as

SELECT aggregate(array('a', 'b', 'a'),
   map(),
   (acc, x) -> ???,
   acc -> acc) AS feq_cnt

Here are my questions:

  *   Is using "map()" above the best way? The "start" structure in this case
should be Map.empty[String, Int], but of course that won't work in pure Spark
SQL, so the best solution I can think of is "map()", and I want a mutable Map.
  *   How to implement the logic in the "???" place? If I did it in Scala, I
would write "acc.update(x, acc.getOrElse(x, 0) + 1)", which means: if the
element already exists, add one to its value; otherwise, insert the element
with a count of 1. Of course, that code won't work in Spark SQL. (One possible
translation is sketched right after this list.)
  *   As I said, I am NOT running in either a Scala or a PySpark session, but
in pure Spark SQL.
  *   Is it possible to do the above logic in Spark SQL, without using
"exploding"?

Thanks


From: Mich Talebzadeh 
Sent: Saturday, May 6, 2023 4:52 PM
To: Yong Zhang 
Cc: user@spark.apache.org 
Subject: Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map 
of element of count?

You can create a DataFrame from your SQL result set and work with it in Python
the way you want:

## you don't need all these
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col, current_timestamp, lit
from pyspark.sql.types import *

## in the pyspark shell `spark` already exists; otherwise build one first
spark = SparkSession.builder.appName("aggregate_example").getOrCreate()

sqltext = """
SELECT aggregate(array(1, 2, 3, 4),
                 named_struct('sum', 0, 'cnt', 0),
                 (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
                 acc -> acc.sum / acc.cnt) AS avg
"""
df = spark.sql(sqltext)
df.printSchema()

root
 |-- avg: double (nullable = true)
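
For reference, df.show() on this returns 2.5, i.e. (1 + 2 + 3 + 4) / 4: the
merge lambda accumulates the sum and the count, and the finish lambda divides
one by the other.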


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


 
   view my Linkedin 
profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 5 May 2023 at 20:33, Yong Zhang <java8...@hotmail.com> wrote:
Hi, This is on Spark 3.1 environment.

For some reason, I can ONLY do this in Spark SQL, instead of either Scala or 
PySpark environment.

I want to aggregate an array into a Map of element counts within that array,
but in Spark SQL.
I know that there is an aggregate function available, like

aggregate(expr, start, merge [, finish])

But I want to know if this can be done in the Spark SQL only, and:

  *   How to represent an empty Map as the "start" element above
  *   How to merge each element (a String) into the Map (incrementing the count
if the element already exists in the Map, or adding (element -> 1) as a new
entry if it does not)

Like the following example -> 
https://docs.databricks.com/sql/language-manual/functions/aggregate.html

SELECT aggregate(array(1, 2, 3, 4),
   named_struct('sum', 0, 'cnt', 0),
   (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
   acc -> acc.sum / acc.cnt) AS avg

I wonder:
select
  aggregate(
    array('a','b','a'),
    map('', 0),
    (acc, x) -> ???,
    acc -> acc) as output

How do I write the logic after "(acc, x) -> ", so that I can output a map of
the count of each element in the array?
I know I can "explode", then groupBy + count, but since I have multiple array
columns that need to be transformed, I want to do it in a higher-order-function
way, and in pure Spark SQL.

Thanks



Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
I do not think InMemoryFileIndex means it is caching the data. The caches
get shown as InMemoryTableScan. InMemoryFileIndex is just for partition
discovery and partition pruning.
Any read will always show up as a scan from InMemoryFileIndex. It is not
cached data. It is a cached file index. Please correct my understanding if
I am wrong

Even the following code shows a scan from an InMemoryFileIndex
```
df1 = spark.read.csv("./df1.csv", header=True, schema = schema)
df1.explain(mode = "extended")
```

output:
```

== Parsed Logical Plan ==
Relation [index#50,0#51] csv

== Analyzed Logical Plan ==
index: string, 0: string
Relation [index#50,0#51] csv

== Optimized Logical Plan ==
Relation [index#50,0#51] csv

== Physical Plan ==
FileScan csv [index#50,0#51] Batched: false, DataFilters: [], Format:
CSV, Location: InMemoryFileIndex(1
paths)[file:/home/nitin/work/df1.csv], PartitionFilters: [],
PushedFilters: [], ReadSchema: struct

```
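
For contrast, a quick sketch (not from the thread; it reuses the same ./df1.csv
path and schema assumed above) of what an explicitly cached DataFrame looks
like: once the DataFrame is marked for caching, the plan shows an
InMemoryTableScan over an InMemoryRelation, which is the signal that the data
itself, not just the file listing, is cached.

```
# A sketch under the same assumptions as the snippet above (path and schema).
df1 = spark.read.csv("./df1.csv", header=True, schema=schema)
df1.cache()     # marks the DataFrame for caching; nothing is materialized yet
df1.count()     # the first action populates the cache
df1.explain()   # the physical plan now shows InMemoryTableScan / InMemoryRelation,
                # not just "FileScan csv ... InMemoryFileIndex"
```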

On Mon, May 8, 2023 at 1:07 AM Mich Talebzadeh 
wrote:

> When I run this job in local mode  spark-submit --master local[4]
>
> with
>
> spark = SparkSession.builder \
> .appName("tests") \
> .enableHiveSupport() \
> .getOrCreate()
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> df3.explain(extended=True)
>
> and no caching
>
> I see this plan
>
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner, [index])
> :- Relation [index#0,0#1] csv
> +- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS avg(0)#7]
>+- Relation [index#11,0#12] csv
>
> == Analyzed Logical Plan ==
> index: string, 0: string, avg(0): double
> Project [index#0, 0#1, avg(0)#7]
> +- Join Inner, (index#0 = index#11)
>:- Relation [index#0,0#1] csv
>+- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
> avg(0)#7]
>   +- Relation [index#11,0#12] csv
>
> == Optimized Logical Plan ==
> Project [index#0, 0#1, avg(0)#7]
> +- Join Inner, (index#0 = index#11)
>:- Filter isnotnull(index#0)
>:  +- Relation [index#0,0#1] csv
>+- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
> avg(0)#7]
>   +- Filter isnotnull(index#11)
>  +- Relation [index#11,0#12] csv
>
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [index#0, 0#1, avg(0)#7]
>+- BroadcastHashJoin [index#0], [index#11], Inner, BuildRight, false
>   :- Filter isnotnull(index#0)
>   :  +- FileScan csv [index#0,0#1] Batched: false, DataFilters:
> [isnotnull(index#0)], Format: CSV, Location: InMemoryFileIndex(1
> paths)[hdfs://rhes75:9000/tmp/df1.csv], PartitionFilters: [],
> PushedFilters: [IsNotNull(index)], ReadSchema:
> struct
>   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0,
> string, true]),false), [plan_id=174]
>  +- HashAggregate(keys=[index#11], functions=[avg(cast(0#12 as
> double))], output=[index#11, avg(0)#7])
> +- Exchange hashpartitioning(index#11, 200),
> ENSURE_REQUIREMENTS, [plan_id=171]
>+- HashAggregate(keys=[index#11],
> functions=[partial_avg(cast(0#12 as double))], output=[index#11, sum#28,
> count#29L])
>   +- Filter isnotnull(index#11)
>  +- FileScan csv [index#11,0#12] Batched: false,
> DataFilters: [isnotnull(index#11)], Format: CSV, Location:
> InMemoryFileIndex(1 paths)[hdfs://rhes75:9000/tmp/df1.csv],
> PartitionFilters: [], PushedFilters: [IsNotNull(index)], ReadSchema:
> struct
>
>
> so two in memory file scans for the csv file. So it caches the data
> already given the small result set. Do you see this?
>
> HTH
>
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 7 May 2023 at 17:48, Nitin Siwach  wrote:
>
>> Thank you for the help Mich :)
>>
>> I have not started with a pandas DF. I have used pandas to create a dummy
>> .csv which I dump on the disk that I intend to use to showcase my pain
>> point. Providing pandas code was to ensure an end-to-end runnable example
>> is provided and the effort on anyone trying to help me out is minimized
>>
>> I don't think Spark validating the file existence qualifies as an action
>> according to Spark parlance. Sure there would be an analysis exception in
>> case the file is not found as per the location provided, however, if you
>> provided a schema and a valid path then no job would show up on the spark
>> UI validating (IMO) that no action has been taken. (1 Action necessarily
>> equals at 


Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When you run this in yarn mode, it uses a Broadcast Hash Join for the join
operation, as shown in the following output. The datasets here are the same
size, so it broadcasts one dataset to all of the executors and then reads
the same dataset again and does a hash join.

It is typical of joins; no surprises here. It has to read the data twice to
perform this operation. The hash join was not invented by Spark; it has been
around in databases for years, along with nested-loop and merge joins.

[image: image.png]
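
If the aim is to scan the file only once, one option (a sketch, not from this
thread; the path, schema and column names are assumed from the plans above) is
to persist the source DataFrame before deriving the aggregate and joining it,
so that both branches of the join are served from the cache:

```
# A sketch under assumed names: df1 is the csv source, "index" the join key,
# "0" the value column, and schema is defined elsewhere.
from pyspark import StorageLevel

df1 = spark.read.csv("hdfs://rhes75:9000/tmp/df1.csv", header=True, schema=schema)
df1.persist(StorageLevel.MEMORY_AND_DISK)   # one scan of the csv feeds the cache
df2 = df1.groupBy("index").avg("0")         # aggregate side of the join
df3 = df1.join(df2, "index")
df3.explain()                               # both branches now show InMemoryTableScan
df3.count()
df1.unpersist()                             # release the cache when finished
```

The trade-off is executor memory for the cached data; for a file this small
the double read is usually harmless anyway.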

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 8 May 2023 at 09:38, Nitin Siwach  wrote:

> I do not think InMemoryFileIndex means it is caching the data. The caches
> get shown as InMemoryTableScan. InMemoryFileIndex is just for partition
> discovery and partition pruning.
> Any read will always show up as a scan from InMemoryFileIndex. It is not
> cached data. It is a cached file index. Please correct my understanding if
> I am wrong
>
> Even the following code shows a scan from an InMemoryFileIndex
> ```
> df1 = spark.read.csv("./df1.csv", header=True, schema = schema)
> df1.explain(mode = "extended")
> ```
>
> output:
> ```
>
> == Parsed Logical Plan ==
> Relation [index#50,0#51] csv
>
> == Analyzed Logical Plan ==
> index: string, 0: string
> Relation [index#50,0#51] csv
>
> == Optimized Logical Plan ==
> Relation [index#50,0#51] csv
>
> == Physical Plan ==
> FileScan csv [index#50,0#51] Batched: false, DataFilters: [], Format: CSV, 
> Location: InMemoryFileIndex(1 paths)[file:/home/nitin/work/df1.csv], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>
> ```
>
> On Mon, May 8, 2023 at 1:07 AM Mich Talebzadeh 
> wrote:
>
>> When I run this job in local mode  spark-submit --master local[4]
>>
>> with
>>
>> spark = SparkSession.builder \
>> .appName("tests") \
>> .enableHiveSupport() \
>> .getOrCreate()
>> spark.conf.set("spark.sql.adaptive.enabled", "true")
>> df3.explain(extended=True)
>>
>> and no caching
>>
>> I see this plan
>>
>> == Parsed Logical Plan ==
>> 'Join UsingJoin(Inner, [index])
>> :- Relation [index#0,0#1] csv
>> +- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS avg(0)#7]
>>+- Relation [index#11,0#12] csv
>>
>> == Analyzed Logical Plan ==
>> index: string, 0: string, avg(0): double
>> Project [index#0, 0#1, avg(0)#7]
>> +- Join Inner, (index#0 = index#11)
>>:- Relation [index#0,0#1] csv
>>+- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
>> avg(0)#7]
>>   +- Relation [index#11,0#12] csv
>>
>> == Optimized Logical Plan ==
>> Project [index#0, 0#1, avg(0)#7]
>> +- Join Inner, (index#0 = index#11)
>>:- Filter isnotnull(index#0)
>>:  +- Relation [index#0,0#1] csv
>>+- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
>> avg(0)#7]
>>   +- Filter isnotnull(index#11)
>>  +- Relation [index#11,0#12] csv
>>
>> == Physical Plan ==
>> AdaptiveSparkPlan isFinalPlan=false
>> +- Project [index#0, 0#1, avg(0)#7]
>>+- BroadcastHashJoin [index#0], [index#11], Inner, BuildRight, false
>>   :- Filter isnotnull(index#0)
>>   :  +- FileScan csv [index#0,0#1] Batched: false, DataFilters:
>> [isnotnull(index#0)], Format: CSV, Location: InMemoryFileIndex(1
>> paths)[hdfs://rhes75:9000/tmp/df1.csv], PartitionFilters: [],
>> PushedFilters: [IsNotNull(index)], ReadSchema:
>> struct
>>   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0,
>> string, true]),false), [plan_id=174]
>>  +- HashAggregate(keys=[index#11], functions=[avg(cast(0#12 as
>> double))], output=[index#11, avg(0)#7])
>> +- Exchange hashpartitioning(index#11, 200),
>> ENSURE_REQUIREMENTS, [plan_id=171]
>>+- HashAggregate(keys=[index#11],
>> functions=[partial_avg(cast(0#12 as double))], output=[index#11, sum#28,
>> count#29L])
>>   +- Filter isnotnull(index#11)
>>  +- FileScan csv [index#11,0#12] Batched: false,
>> DataFilters: [isnotnull(index#11)], Format: CSV, Location:
>> InMemoryFileIndex(1 paths)[hdfs://rhes75:9000/tmp/df1.csv],
>> PartitionFilters: [], PushedFilters: [IsNotNull(index)], ReadSchema:
>> struct
>>
>>
>> so two in memory file scans for the csv file. So it caches the data
>> already given the small result set. Do you see this?
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 

unsubscribe

2023-05-09 Thread Balakumar iyer S