[jira] [Commented] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"

Marcelo Vanzin (JIRA) Thu, 24 Sep 2015 10:47:45 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906723#comment-14906723
 ]


Marcelo Vanzin commented on SPARK-10804:
----------------------------------------

This is really a Hive issue, which Spark just inherits since it calls the Hive 
code directly to handle that statement.

> "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
> ------------------------------------------------
>
>                 Key: SPARK-10804
>                 URL: https://issues.apache.org/jira/browse/SPARK-10804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Antonio Piccolboni
>
> When connecting to a remote thriftserver with a custom JDBC client or beeline, 
> LOAD DATA LOCAL INPATH fails. The HiveServer2 docs explain in a brief comment 
> that LOCAL now means local to the server. I think this is just a 
> rationalization for a bug. When a user types "local" 
> # it needs to be local to that user, not to some server 
> # Failing 1., one needs to have a way to determine what local means and 
> create a "local" item under the new definition. 
> With the thriftserver, I have a host to connect to, but I don't have any way 
> to create a file local to that host, at least in Spark. It may not be 
> desirable to create user directories on the thriftserver host or to run file 
> transfer services like scp. Moreover, it appears that this syntax is unique 
> to Hive and Spark, but its origin can be traced to LOAD DATA LOCAL INFILE in 
> Oracle, which was adopted by MySQL. In the latter docs we can read "If LOCAL is 
> specified, the file is read by the client program on the client host and sent 
> to the server. The file can be given as a full path name to specify its exact 
> location. If given as a relative path name, the name is interpreted relative 
> to the directory in which the client program was started". This is not to say 
> that the Spark or Hive teams are bound by what Oracle and MySQL do, but to 
> support the idea that the meaning of LOCAL is settled. For instance, the 
> Impala documentation says: "Currently, the Impala LOAD DATA statement only 
> imports files from HDFS, not from the local filesystem. It does not support 
> the LOCAL keyword of the Hive LOAD DATA statement." I think this is a better 
> solution. The way things are in thriftserver, I developed a client under the 
> assumption that I could use LOAD DATA LOCAL INPATH, and all tests were 
> passing in standalone mode, only to find with the first distributed test that 
> # LOCAL means "local to server", a.k.a. "remote"
> # INSERT INTO ... VALUES is not supported
> # There is really no workaround unless one assumes access to whatever data store 
> Spark is running against, such as HDFS, and that the user can upload data to 
> it. 
> As workarounds go, it is not terrible, but if you are trying to 
> write a self-contained Spark package, that's a defeat, and it makes writing 
> tests particularly hard.
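
For reference, the workaround described in the quoted report (upload the file to the shared data store, then load it without the LOCAL keyword) might look roughly like the sketch below. This is only an illustration under stated assumptions: `load_via_shared_store`, `quote_hive_string`, the `upload` and `execute` callables, and the staging URI are all hypothetical names, not part of Spark or Hive. The caller has to supply its own way to copy the file to a location the server can read and its own way to run a statement over the JDBC or beeline connection.

```python
from typing import Callable


def quote_hive_string(value: str) -> str:
    """Escape backslashes and single quotes for a single-quoted SQL literal.

    Assumption: backslash escaping is accepted by the server's parser.
    """
    return value.replace("\\", "\\\\").replace("'", "\\'")


def load_via_shared_store(
    client_path: str,
    staging_uri: str,
    table: str,
    upload: Callable[[str, str], None],
    execute: Callable[[str], None],
) -> str:
    """Hypothetical helper: stage a client-side file, then load it.

    `upload` copies client_path to staging_uri (for example an HDFS URI);
    `execute` runs one statement on the remote thriftserver connection.
    Both are supplied by the caller and are not real Spark or Hive APIs.
    """
    # Step 1: copy the client-side file to storage the server can read.
    upload(client_path, staging_uri)

    # Step 2: omit LOCAL, so the path is not looked up on the server
    # host's local disk. The table name is assumed to be trusted input.
    statement = (
        f"LOAD DATA INPATH '{quote_hive_string(staging_uri)}' "
        f"INTO TABLE {table}"
    )
    execute(statement)
    return statement
```

Note that this still requires the client to have write access to the shared store, which is exactly the assumption the report objects to. In Hive, LOAD DATA INPATH without LOCAL also moves the staged file into the table location rather than copying it.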


