[ https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chendihao updated SPARK-30328:
------------------------------
    Description: 
We found that incorrect Hadoop configuration files cause saving an RDD to the 
local file system to fail. This is unexpected because we have specified a local 
URL, and the DataFrame.write.text API does not have this issue. It is easy to 
reproduce and verify with Spark 2.3.0.

1. Do not set the `HADOOP_CONF_DIR` environment variable.

2. Install pyspark and run the following Python script locally. It works and 
saves the files to the local file system.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}
3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop 
configuration files there. Make sure `core-site.xml` is well-formed but 
contains an unresolvable host name, for example as shown below.
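A minimal `core-site.xml` of this shape is enough to reproduce the problem (the 
host name `fake-namenode` is a made-up placeholder that does not resolve):
{code:xml}
<?xml version="1.0"?>
<configuration>
  <!-- Well-formed XML, but the default file system points at a host
       that cannot be resolved -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://fake-namenode:9000</value>
  </property>
</configuration>
{code}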

4. Run the same Python script again. It tries to connect to HDFS, fails to 
resolve the host name, and a Java exception is thrown.

We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS, 
whether or not `HADOOP_CONF_DIR` is set. In fact, the following DataFrame code 
works with the same incorrect Hadoop configuration files.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
# Sample rows so the snippet is self-contained
rows = [("a", "1"), ("b", "2")]
df = spark.createDataFrame(rows, ["attribute", "value"])
df.write.parquet("file:///tmp/df.parquet")
{code}
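As a possible workaround until this is fixed (a sketch, not a verified fix): 
Spark copies any `spark.hadoop.*` properties into the Hadoop Configuration, 
which should let a job override the bad `fs.defaultFS` loaded from 
`core-site.xml`:
{code:python}
from pyspark.sql import SparkSession

# Override the unresolvable fs.defaultFS from core-site.xml with the
# local file system via Spark's spark.hadoop.* passthrough.
spark = (SparkSession.builder
         .master("local")
         .config("spark.hadoop.fs.defaultFS", "file:///")
         .getOrCreate())

rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}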

> Fail to write local files with RDD.saveAsTextFile when setting the incorrect 
> Hadoop configuration files
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30328
>                 URL: https://issues.apache.org/jira/browse/SPARK-30328
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: chendihao
>            Priority: Major
>


