Re: How to share a dataset file across nodes

2023-03-09 Thread Mich Talebzadeh
Try something like the below.

1) Put your CSV file, say cities.csv, in HDFS:
hdfs dfs -put cities.csv /data/stg/test
2) Read it into a DataFrame in PySpark:
csv_file = "hdfs://:PORT/data/stg/test/cities.csv"
# read it in Spark
listing_df = (
    spark.read.format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(csv_file)
)
listing_df.printSchema()
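
For what it is worth, the two steps above could be wrapped up roughly as follows. This is only a sketch under assumptions: the helper names (hdfs_put_command, hdfs_uri, read_csv_from_hdfs, main) are made up for illustration, and namenode.example.com and port 8020 are placeholders, so substitute whatever your cluster's fs.defaultFS points at. It uses the built-in spark.read.csv reader, and it assumes the hdfs CLI is on the PATH of the machine where the driver runs.

```python
import subprocess


def hdfs_put_command(local_path: str, hdfs_dir: str) -> list:
    """Return the argv for `hdfs dfs -put <local_path> <hdfs_dir>` (step 1)."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def hdfs_uri(host: str, port: int, path: str) -> str:
    """Build a fully qualified hdfs:// URI. Host and port are placeholders."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, got: " + path)
    return f"hdfs://{host}:{port}{path}"


def read_csv_from_hdfs(spark, uri: str):
    """Read a CSV that has a header row, letting Spark infer types (step 2)."""
    return spark.read.csv(uri, header=True, inferSchema=True)


def main():
    # Imported here so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    # Step 1: copy the local file into HDFS so every node can reach it.
    subprocess.run(hdfs_put_command("cities.csv", "/data/stg/test"), check=True)

    # Step 2: read it back through Spark. Host and port are hypothetical.
    spark = SparkSession.builder.appName("cities-from-hdfs").getOrCreate()
    uri = hdfs_uri("namenode.example.com", 8020, "/data/stg/test/cities.csv")
    listing_df = read_csv_from_hdfs(spark, uri)
    listing_df.printSchema()
    spark.stop()
```

Calling main() from the driver script should run both steps in order. Because the path is an hdfs:// URI, the driver and the executors all resolve the same file, which a file:// path on the driver's local disk cannot offer.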


HTH


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.






Re: How to share a dataset file across nodes

2023-03-09 Thread Sean Owen
Put the file on HDFS, if you have a Hadoop cluster?



How to share a dataset file across nodes

2023-03-09 Thread sam smith
Hello,

I use YARN client mode to submit my driver program to Hadoop. The dataset I
load is on the local file system, and when I invoke load("file://path") Spark
complains that the CSV file cannot be found. I understand why: the dataset
is not on any of the workers or the ApplicationMaster, only on the machine
where the driver program resides.
I tried to share the file using either of these configurations:

> *spark.yarn.dist.files* OR *spark.files*

but neither works.
My question is: how do I share the CSV dataset across the nodes at the
specified path?

Thanks.