[ https://issues.apache.org/jira/browse/SPARK-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070862#comment-15070862 ]
Furcy Pin commented on SPARK-9600:
----------------------------------

You are right that I forgot to wipe the metastore and warehouse between the two runs. However, doing so gets me the same error. I updated the code to make sure that each run starts with a fresh warehouse and metastore:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

import scala.reflect.io.Path

object Spark9600 {

  case class Id(id: Int)

  def main(args: Array[String]): Unit = {
    // Start each run from a clean embedded metastore and warehouse
    Path("metastore_db").deleteRecursively()
    Path("warehouse").deleteRecursively()

    val conf: SparkConf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("name")
      .setSparkHome("sparkHome")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)
    hc.setConf("hive.metastore.warehouse.dir", s"file://${Path("warehouse").toAbsolute}")

    hc.sql("CREATE TABLE t1 (id INT)")

    val df = hc.createDataFrame(sc.parallelize(Seq(Id(1), Id(2), Id(3))))
    df.write.insertInto("t1")
    df.write.saveAsTable("t2")  // fails: still targets file:/user/hive/warehouse
  }
}
{code}

And I get exactly the same error:

{code}
Mkdirs failed to create file:/user/hive/warehouse/t2/_temporary/0/_temporary/attempt_201512241135_0001_m_000000_0
{code}

How do you explain that {{hc.sql("CREATE TABLE t1 (id INT)")}} works while {{df.write.saveAsTable("t2")}} doesn't? (A workaround sketch follows the quoted issue description below.)

> DataFrameWriter.saveAsTable always writes data to "/user/hive/warehouse"
> ------------------------------------------------------------------------
>
>                 Key: SPARK-9600
>                 URL: https://issues.apache.org/jira/browse/SPARK-9600
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Cheng Lian
>            Assignee: Sudhakar Thota
>            Priority: Blocker
>         Attachments: SPARK-9600-fl1.txt
>
>
> Get a clean Spark 1.4.1 build:
> {noformat}
> $ git checkout v1.4.1
> $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly
> {noformat}
> Stop any running local Hadoop instance and unset all Hadoop environment variables, so that we force Spark to run with the local file system only:
> {noformat}
> $ unset HADOOP_CONF_DIR
> $ unset HADOOP_PREFIX
> $ unset HADOOP_LIBEXEC_DIR
> $ unset HADOOP_CLASSPATH
> {noformat}
> This way we also ensure that the default Hive warehouse location points to the local file system, {{file:///user/hive/warehouse}}. Now we create warehouse directories for testing:
> {noformat}
> $ sudo rm -rf /user  # !! WARNING: IT'S /user RATHER THAN /usr !!
> $ sudo mkdir -p /user/hive/{warehouse,warehouse_hive13}
> $ sudo chown -R lian:staff /user
> $ tree /user
> /user
> └── hive
>     ├── warehouse
>     └── warehouse_hive13
> {noformat}
> Create a minimal {{hive-site.xml}} that only overrides the warehouse location, and put it under {{$SPARK_HOME/conf}}:
> {noformat}
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
>   <property>
>     <name>hive.metastore.warehouse.dir</name>
>     <value>file:///user/hive/warehouse_hive13</value>
>   </property>
> </configuration>
> {noformat}
> Now run our test snippets with {{pyspark}}:
> {noformat}
> $ ./bin/pyspark
> In [1]: sqlContext.range(10).coalesce(1).write.saveAsTable("ds")
> {noformat}
> Check the warehouse directories:
> {noformat}
> $ tree /user
> /user
> └── hive
>     ├── warehouse
>     │   └── ds
>     │       ├── _SUCCESS
>     │       ├── _common_metadata
>     │       ├── _metadata
>     │       └── part-r-00000-46e4b32a-5c4d-4dba-b8d6-8d30ae910dc9.gz.parquet
>     └── warehouse_hive13
>         └── ds
> {noformat}
> Here you may notice the weird part: we have {{ds}} under both {{warehouse}} and {{warehouse_hive13}}, but the data is only written into the former.
> Now let's try HiveQL:
> {noformat}
> In [2]: sqlContext.range(10).coalesce(1).registerTempTable("t")
> In [3]: sqlContext.sql("CREATE TABLE ds_ctas AS SELECT * FROM t")
> {noformat}
> Check the directories again:
> {noformat}
> $ tree /user
> /user
> └── hive
>     ├── warehouse
>     │   └── ds
>     │       ├── _SUCCESS
>     │       ├── _common_metadata
>     │       ├── _metadata
>     │       └── part-r-00000-46e4b32a-5c4d-4dba-b8d6-8d30ae910dc9.gz.parquet
>     └── warehouse_hive13
>         ├── ds
>         └── ds_ctas
>             ├── _SUCCESS
>             └── part-00000
> {noformat}
> So HiveQL works fine. (Hive never writes Parquet summary files, so {{_common_metadata}} and {{_metadata}} are missing in {{ds_ctas}}.)
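
In the meantime, a possible workaround (just a sketch, not verified against 1.4.1 or 1.5.0) is to write the data to an explicit location and point an external table at it, so that the default warehouse path is never consulted. This continues from the {{df}}, {{hc}} and {{Path}} of the repro above; the table and path names are only illustrative:

{code}
// Hedged workaround sketch: write the files to an explicit local path, then
// register an external table over that path, bypassing the warehouse default.
val t2Path = s"file://${Path("warehouse").toAbsolute}/t2"
df.write.parquet(t2Path)  // the files land exactly where we ask
hc.sql(s"CREATE EXTERNAL TABLE t2 (id INT) STORED AS PARQUET LOCATION '$t2Path'")
{code}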
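
Also, as a quick sanity check (again only a sketch, assuming the {{hc}} from the repro above), it may be worth printing the warehouse directory the context reports before writing, to confirm whether the {{hive-site.xml}} / {{setConf}} override is visible to Spark SQL at all:

{code}
// Sanity-check sketch: print the warehouse directory currently held in the
// SQL conf; a mismatch with where saveAsTable actually writes would point at
// the write path rather than the configuration itself.
println(hc.getConf("hive.metastore.warehouse.dir"))
{code}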