using UDF (defined in Java) in scala through spark-shell

2015-09-29 Thread ogoh
Hello,
I have a UDF defined in Java, but I'd like to call it from spark-shell, which
only supports Scala.
Since I am new to Scala, I couldn't figure out how to register the Java
UDF using sqlContext.udf.register in Scala.
Below is what I tried.
I appreciate any help.
Thanks,

= my UDF in java
public class ArrayStringOfJson implements UDF1 {
public ArrayType call(String input) throws Exception {
..
  }
}
= using in spark-shell
scala> import org.apache.spark.sql.api.java.UDF1
scala> import com.mycom.event.udfs.ArrayStringOfJson
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val instance: ArrayStringOfJson = new ArrayStringOfJson
scala> sqlContext.udf.register("arraystring", instance,
org.apache.spark.sql.types.ArrayType)
:28: error: overloaded method value register with alternatives:
  (name: String,f: org.apache.spark.sql.api.java.UDF22[_, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _],returnType:
org.apache.spark.sql.types.DataType)Unit 
  ...
  (name: String,f: org.apache.spark.sql.api.java.UDF1[_, _],returnType:
org.apache.spark.sql.types.DataType)Unit
 cannot be applied to (String, com.mycom.event.udfs.ArrayStringOfJson,
org.apache.spark.sql.types.ArrayType.type)
  sqlContext.udf.register("arraystring", instance,
org.apache.spark.sql.types.ArrayType)
 ^
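
For reference, a minimal sketch of a registration call that typechecks, assuming the
UDF takes a JSON string and returns a sequence of strings (the real element type of
ArrayStringOfJson is not shown above, and the call body below is only a placeholder).
Two things matter: the UDF1 needs concrete type parameters, and the third argument must
be a DataType instance such as ArrayType(StringType), not the ArrayType companion object.

scala> import org.apache.spark.sql.api.java.UDF1
scala> import org.apache.spark.sql.types.{ArrayType, StringType}
scala> val arrayStringOfJson = new UDF1[String, Seq[String]] {
     |   // placeholder body; the real class would parse the JSON input here
     |   def call(input: String): Seq[String] = Seq(input)
     | }
scala> sqlContext.udf.register("arraystring", arrayStringOfJson, ArrayType(StringType))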





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-UDF-defined-in-Java-in-scala-through-scala-tp24880.html



SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread ogoh
Hello,
I am using SparkSQL along with the ThriftServer so that we can access it using Hive
queries.
With Spark 1.3.1, I can register a UDF function, but Spark 1.4.0 doesn't work
for that. The jar of the UDF is the same.
Below are the logs.
I appreciate any advice.


== With Spark 1.4
Beeline version 1.4.0 by Apache Hive

0: jdbc:hive2://localhost:1> add jar
hdfs:///user/hive/lib/dw-udf-2015.06.06-SNAPSHOT.jar;

0: jdbc:hive2://localhost:1> create temporary function parse_trace as
'com.mycom.dataengine.udf.GenericUDFParseTraceAnnotation';

15/07/14 23:49:43 DEBUG transport.TSaslTransport: writing data length: 206

15/07/14 23:49:43 DEBUG transport.TSaslTransport: CLIENT: reading data
length: 201

Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED:
Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)


== With Spark 1.3.1:

Beeline version 1.3.1 by Apache Hive

0: jdbc:hive2://localhost:10001> add jar
hdfs:///user/hive/lib/dw-udf-2015.06.06-SNAPSHOT.jar;

+-+

| Result  |

+-+

+-+

No rows selected (1.313 seconds)

0: jdbc:hive2://localhost:10001> create temporary function parse_trace as
'com.mycom.dataengine.udf.GenericUDFParseTraceAnnotation';

+-+

| result  |

+-+

+-+

No rows selected (0.999 seconds)


=== The logs of ThriftServer of Spark 1.4.0

15/07/14 23:49:43 INFO SparkExecuteStatementOperation: Running query 'create
temporary function parse_trace as
'com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation''

15/07/14 23:49:43 INFO ParseDriver: Parsing command: create temporary
function parse_trace as
'com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation'

15/07/14 23:49:43 INFO ParseDriver: Parse Completed

15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=Driver.run
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=TimeToSubmit
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO Driver: Concurrency mode is disabled, not creating a
lock manager

15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=compile
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=parse
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO ParseDriver: Parsing command: create temporary
function parse_trace as
'com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation'

15/07/14 23:49:43 INFO ParseDriver: Parse Completed

15/07/14 23:49:43 INFO PerfLogger: /PERFLOG method=parse
start=1436917783106 end=1436917783106 duration=0
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=semanticAnalyze
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO HiveMetaStore: 2: get_database: default

15/07/14 23:49:43 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default

15/07/14 23:49:43 INFO HiveMetaStore: 2: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore

15/07/14 23:49:43 INFO ObjectStore: ObjectStore, initialize called

15/07/14 23:49:43 INFO MetaStoreDirectSql: MySQL check failed, assuming we
are not on mysql: Lexical error at line 1, column 5.  Encountered: @ (64),
after : .

15/07/14 23:49:43 INFO Query: Reading in results for query
org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is
closing

15/07/14 23:49:43 INFO ObjectStore: Initialized ObjectStore

15/07/14 23:49:43 INFO FunctionSemanticAnalyzer: analyze done

15/07/14 23:49:43 INFO Driver: Semantic Analysis Completed

15/07/14 23:49:43 INFO PerfLogger: /PERFLOG method=semanticAnalyze
start=1436917783106 end=1436917783114 duration=8
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO Driver: Returning Hive schema:
Schema(fieldSchemas:null, properties:null)

15/07/14 23:49:43 INFO PerfLogger: /PERFLOG method=compile
start=1436917783106 end=1436917783114 duration=8
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=Driver.execute
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO Driver: Starting command: create temporary function
parse_trace as 'com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation'

15/07/14 23:49:43 INFO PerfLogger: /PERFLOG method=TimeToSubmit
start=1436917783105 end=1436917783115 duration=10
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=runTasks
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=task.FUNCTION.Stage-0
from=org.apache.hadoop.hive.ql.Driver

15/07/14 23:49:43 ERROR Task: FAILED: Class
com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation not found

15/07/14 23:49:43 INFO FunctionTask: create function:
java.lang.ClassNotFoundException:
com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation

at java.net.URLClassLoader$1.run(URLClassLoader.java:372)

at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
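
For what it's worth, a hedged sketch of the same registration done from spark-shell
instead of Beeline, assuming the UDF jar is put on the classpath at launch (e.g.
spark-shell --jars with a local copy of dw-udf-2015.06.06-SNAPSHOT.jar) rather than
added at runtime from hdfs://; the ClassNotFoundException above suggests the add jar
never reached the classloader running the FunctionTask in 1.4. The class name is the
one from the Beeline command above.

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.sql("CREATE TEMPORARY FUNCTION parse_trace AS " +
     |   "'com.mycom.dataengine.udf.GenericUDFParseTraceAnnotation'")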




Re: Error when connecting to Spark SQL via Hive JDBC driver

2015-06-18 Thread ogoh

Hello,
I am not sure what is wrong.
But, in my case, I followed the instructions from
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html.
It worked fine with SQuirreL SQL Client
(http://squirrel-sql.sourceforge.net/) and SQL Workbench/J
(http://www.sql-workbench.net/).
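
For completeness, a minimal sketch of a plain JDBC connection to the Spark ThriftServer
from Scala, assuming the default port 10000 and the Hive JDBC driver on the classpath;
this is roughly what SQuirreL and SQL Workbench/J do under the hood:

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "anonymous", "")
val rs = conn.createStatement().executeQuery("show tables")
while (rs.next()) println(rs.getString(1))
conn.close()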




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-connecting-to-Spark-SQL-via-Hive-JDBC-driver-tp23397p23403.html



SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-04 Thread ogoh

Hello,
I tested some custom UDFs on SparkSQL's ThriftServer & Beeline (Spark 1.3.1).
Some UDFs work fine (taking an array parameter and returning an int or string
type).
But my UDF returning a map type throws an error:
Error: scala.MatchError: interface java.util.Map (of class java.lang.Class)
(state=,code=0)

I converted the code into Hive's GenericUDF since I suspected that taking a
complex type parameter (array of map) and returning a complex type (map) might
only be supported by Hive's GenericUDF rather than the simple UDF.
But SparkSQL doesn't seem to support GenericUDF (error message: Error:
java.lang.IllegalAccessException: Class
org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).

Below is my example UDF code returning a MAP type.
I appreciate any advice.
Thanks

--

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class ArrayToMap extends UDF {

    public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
        // add code to handle all index problems

        Map<String, String> map = new HashMap<String, String>();

        int count = 0;
        for (String element : arrayOfString) {
            // use the element position (as a string) for the key
            map.put(count + "", element);
            count++;
        }
        return map;
    }
}
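
As a hedged workaround sketch (usable from spark-shell or a Spark application, not from
an already running ThriftServer): the same array-to-map logic can be registered as a
Scala UDF, which avoids the Hive UDF bridge that raises the MatchError here. The sample
query is illustrative only.

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
// key = element position as a string, value = the element, mirroring the Java UDF above
hc.udf.register("array_to_map", (arr: Seq[String]) =>
  arr.zipWithIndex.map { case (element, i) => i.toString -> element }.toMap)
hc.sql("select array_to_map(array('a', 'b', 'c'))").show()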






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html



SparkSQL's performance gets degraded depending on the number of partitions of Hive tables... is it normal?

2015-06-01 Thread ogoh

Hello, 
I posted this question a while back but am posting it again to get your
attention.

I am using SparkSQL 1.3.1 and Hive 0.13.1 on AWS & YARN (tested under both
1.3.0 & 1.3.1).
My Hive table is partitioned.
I noticed that the query response time degrades with the number of
partitions even though the query targets only a small subset of them. TRACE
level logs (the ThriftServer's) showed that it runs commands like getFileInfo,
getListing, and getBlockLocation for each and every partition (it also runs
getBlockLocation for each and every file) even though they are not part of the
queried partitions.

I don't know why that is necessary. Is it a bug in SparkSQL? Is there a way to
avoid it?
Below are the details of the issue, including logs.

Thanks,


--

My Hive table, an external table, is partitioned by date and hour.
I expected that a query over certain partitions would read only the data
files of those partitions.
I turned on TRACE level logging for the ThriftServer since the query response
time even for narrowed partitions was very long.
And I found that all the available partitions are checked during some steps.

The logs showed an execution flow such as:
== 
Step 1: Contacted HiveMetastore to get partition info  (cmd :
get_partitions) 

Step 2: Came up with an execution rule 

Step 3: Contact the NameNode to make at least 4 calls (data is in HDFS) for all
available partitions of the table:
   getFileInfo once, getListing once, and then repeat them again for each
partition.

Step 4: Contact the NameNode to find the block locations of all the partitions

Step 5: Contact the DataNode for each file of all the partitions

Step 6: Contact the NameNode again for all the partitions

Step 7: SparkSQL generated some optimal plan 

Step 8: Contacted corresponding datanodes for the narrowed partitions (it
seems) 
And more. 
=== 

Why should Steps 3, 4, 5, and 6 check all partitions?
After removing partitions from the table, the query was much quicker while
processing the same volume of data.
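
The partition paths in the logs below end in .parquet, and in Spark 1.3 the native
Parquet code path appears to gather file metadata for every partition of the table
when it converts the metastore relation. A hedged sketch of one experiment (not a
confirmed fix) is to fall back to the Hive SerDe read path and compare timings;
spark.sql.hive.convertMetastoreParquet is an existing Spark SQL setting, the rest is
illustrative:

// run from spark-shell / a Spark app against the same metastore
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.setConf("spark.sql.hive.convertMetastoreParquet", "false")
hc.sql("select count(*) from api_search where pdate='2015-05-23'").collect()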

I don't know if this is normal, a Hive issue, a SparkSQL issue, or my
configuration issue.
I added some logs below for some of the steps.

I appreciate any of your advice. 

Thanks a lot, 
Okehee 

 some logs of some steps 

Query: select count(*) from api_search where pdate='2015-05-23'; 
( 

Step 2: 

2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 - 

=== Applying Rule
org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates === 

!'Project [COUNT(1) AS _c0#25L]Aggregate [],
[COUNT(1) AS _c0#25L] 

  Filter (pdate#26 = 2015-05-23)Filter (pdate#26 =
2015-05-23) 

   MetastoreRelation api_hdfs_perf, api_search, None MetastoreRelation
api_hdfs_perf, api_search, None 
.. 

Step 3: 

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00} 

2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 6 ms. 

2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000 

2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh sending #151 

2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to
/10.128.193.211:9000 from ogoh: starting, having connections 2 

2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh got value #151 

2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
13ms 

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response <-
/10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path:  length: 0
permission { perm: 493 } owner: hadoop group: supergroup
modification_time: 1432364487906 access_time: 0 block_replication: 0
blocksize: 0 fileId: 100602 childrenNum: 2 }} 

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00} 

2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh sending #152 

2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh got value #152 

2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
2ms. 
.. 


Step 4: 

2015-05-25 16:37:47 TRACE ProtobufRpcEngine:206 - 89: Call ->
/10.128.193.211:9000: getBlockLocations {src:
/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-1.parquet
offset: 0 length: 1342177280} 
... 


Step 5: 

2015-05-25 16:37:48 DEBUG DFSClient:951 - Connecting to datanode
10.191.137.197:9200 

2015-05-25 16:37:48 TRACE BlockReaderFactory:653 -
BlockReaderFactory(fileName=/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-2.parquet,
block=BP-1843960649-10.128.193.211-1427923845046:blk_1073758677_981812

Re: Spark 1.3.0 - 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-30 Thread ogoh

I had the same issue on AWS EMR with Spark 1.3.1.e (the AWS build) when it was
installed with the '-h' parameter (a bootstrap action parameter for Spark).
I don't see the problem with Spark 1.3.1.e when that parameter is not passed.
I am not sure about your env.
Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-0-1-3-1-produces-java-lang-NoSuchFieldError-NO-FILTER-tp22897p23090.html



SparkSQL's performance: contacting namenode and datanode to unnecessarily check all partitions for a query of specific partitions

2015-05-25 Thread ogoh
Hello,

I am using SparkSQL 1.3.0 and Hive 0.13.1 on AWS & YARN.

My Hive table, an external table, is partitioned by date and hour.
I expected that a query over certain partitions would read only the data
files of those partitions.
I turned on TRACE level logging for the ThriftServer since the query response
time even for narrowed partitions was very long.
And I found that all the available partitions are checked during some steps.

The logs showed an execution flow such as:
==
Step 1: Contacted HiveMetastore to get partition info  (cmd :
get_partitions)

Step 2: Came up with an execution rule

Step 3: Contact the NameNode to make at least 4 calls (data is in HDFS) for all
available partitions of the table:
   getFileInfo once, getListing once, and then repeat them again for each
partition.

Step 4: Contact the NameNode to find the block locations of all the partitions

Step 5: Contact the DataNode for each file of all the partitions

Step 6: Contact the NameNode again for all the partitions

Step 7: SparkSQL generated some optimal plan

Step 8: Contacted corresponding datanodes for the narrowed partitions (it
seems)
And more.
===

Why should Steps 3, 4, 5, and 6 check all partitions?
After removing partitions from the table, the query was much quicker while
processing the same volume of data.

I don't know if this is normal, a Hive issue, a SparkSQL issue, or my
configuration issue.
I added some logs below for some of the steps.

I appreciate any of your advice.

Thanks a lot,
Okehee

 some logs of some steps

Query: select count(*) from api_search where pdate='2015-05-23';
(

Step 2:

2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 -

=== Applying Rule
org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===

!'Project [COUNT(1) AS _c0#25L]Aggregate [],
[COUNT(1) AS _c0#25L]

  Filter (pdate#26 = 2015-05-23)Filter (pdate#26 =
2015-05-23)

   MetastoreRelation api_hdfs_perf, api_search, None MetastoreRelation
api_hdfs_perf, api_search, None
..

Step 3:

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00}

2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 6 ms.

2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000

2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh sending #151

2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to
/10.128.193.211:9000 from ogoh: starting, having connections 2

2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh got value #151

2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
13ms

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response <-
/10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path:  length: 0
permission { perm: 493 } owner: hadoop group: supergroup
modification_time: 1432364487906 access_time: 0 block_replication: 0
blocksize: 0 fileId: 100602 childrenNum: 2 }}

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00}

2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh sending #152

2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh got value #152

2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
2ms.
..


Step 4:

2015-05-25 16:37:47 TRACE ProtobufRpcEngine:206 - 89: Call ->
/10.128.193.211:9000: getBlockLocations {src:
/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-1.parquet
offset: 0 length: 1342177280}
...


Step 5:

2015-05-25 16:37:48 DEBUG DFSClient:951 - Connecting to datanode
10.191.137.197:9200

2015-05-25 16:37:48 TRACE BlockReaderFactory:653 -
BlockReaderFactory(fileName=/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-2.parquet,
block=BP-1843960649-10.128.193.211-1427923845046:blk_1073758677_981812):
trying to create a remote block reader from a TCP socket
...

Step 6:

2015-05-25 16:37:56 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00}
...


Step 7:

=== Applying Rule
org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions ===

== Optimized Logical Plan ==

Aggregate [], [COUNT(1) AS _c0#25L]

 Project []

  Filter (pdate#111 = 2015-05-23)

  
Relation[timestamp#84,request_id#85,request_timestamp#86,response_timestamp#87,request_query_url#88,request_query_params#89,response_status#90,q#91,session_id#92,partner_id#93,partner_name#94,partner_ip#95,partner_useragent#96,search_id#97,user_id#98,client_ip#99,client_country#100

SparkSQL can't read S3 path for hive external table

2015-05-23 Thread ogoh

Hello,
I am using Spark 1.3 in AWS.
SparkSQL can't recognize a Hive external table on S3.
The following is the error message.
I appreciate any help.
Thanks,
Okehee
--  
15/05/24 01:02:18 ERROR thriftserver.SparkSQLDriver: Failed in [select
count(*) from api_search where pdate='2015-05-08']
java.lang.IllegalArgumentException: Wrong FS:
s3://test-emr/datawarehouse/api_s3_perf/api_search/pdate=2015-05-08/phour=00,
expected: hdfs://10.128.193.211:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:467)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
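
For reference, a hedged diagnostic sketch (not a fix for the Wrong FS error itself):
reading the same S3 partition path directly, bypassing the metastore table, to check
whether the Parquet data is reachable at all. The path is copied from the error above.

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val df = hc.parquetFile("s3://test-emr/datawarehouse/api_s3_perf/api_search/pdate=2015-05-08/phour=00")
df.count()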





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-can-t-read-S3-path-for-hive-external-table-tp23002.html



SparkSQL failing while writing into S3 for 'insert into table'

2015-05-22 Thread ogoh

Hello, 
I am using Spark 1.3 & Hive 0.13.1 in AWS.
From Spark SQL, when running a Hive query to export the query result into AWS
S3, it failed with the following message:
==
org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:
s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1
has nested
directorys3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1/_temporary

at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)

at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)

at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)

at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)

at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)

at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
==

The queries tested are:

spark-sql> create external table s3_dwserver_sql_t1 (q string) location
's3://test-dev/s3_dwserver_sql_t1')

spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search
where pdate='2015-05-12' limit 100;
==

It seems the query results are first written into a tmp dir and then
renamed into the final folder, but it failed while renaming.
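
In the meantime, a hedged workaround sketch (run from spark-shell or a Spark
application rather than Beeline) that bypasses the Hive insert/rename path and writes
the query result to S3 through the DataFrame API; the output location below is
hypothetical:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hc.sql("select q from api_search where pdate='2015-05-12' limit 100")
result.saveAsParquetFile("s3://test-dev/s3_dwserver_sql_t1_direct")  // hypothetical output path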

I appreciate any advice.
Thanks,
Okehee

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-failing-while-writing-into-S3-for-insert-into-table-tp23000.html



beeline that comes with spark 1.3.0 doesn't work with --hiveconf or --hivevar, which substitute variables in hive scripts.

2015-04-22 Thread ogoh

Hello, 
I am using Spark 1.3 for SparkSQL (Hive) & ThriftServer & Beeline.
Beeline doesn't work with --hiveconf or --hivevar, which substitute
variables in Hive scripts.
I found the following JIRAs saying that Hive 0.13 resolved that issue.
I wonder if this is a well-known issue?

https://issues.apache.org/jira/browse/HIVE-4568 Beeline needs to support
resolving variables
https://issues.apache.org/jira/browse/HIVE-6173 Beeline doesn't accept
--hiveconf option as Hive CLI does

Thanks,
Okehee



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/beeline-that-comes-with-spark-1-3-0-doesn-t-work-with-hiveconf-or-hivevar-which-substitutes-variable-tp22615.html



Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread ogoh

Hello,
My ETL uses SparkSQL to generate Parquet files which are served through the
Thriftserver using Hive QL.
It defines the schema programmatically since the schema can only be
known at runtime.
With Spark 1.2.1, it worked fine (I followed
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema).

I am trying to migrate to Spark 1.3.0, but the API changes are confusing.
I am not sure if the example at
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
is still valid for Spark 1.3.0.
For example, DataType.StringType is not there any more.
Instead, I found DataTypes.StringType etc. So I migrated as below and it
builds fine.
But at runtime, it throws an exception.

I appreciate any help.
Thanks,
Okehee

== Exception thrown 
java.lang.reflect.InvocationTargetException
scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String;
java.lang.NoSuchMethodError:
scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String;

 my code's snippet
import org.apache.spark.sql.types.DataTypes;
DataTypes.createStructField(property, DataTypes.IntegerType, true)
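
For comparison, a minimal Spark 1.3 sketch of the same schema construction in Scala;
the "name" field is hypothetical, only "property" comes from the snippet above. The
NoSuchMethodError on scala.reflect.NameTransformer usually points at mismatched Scala /
scala-reflect versions on the classpath rather than at the DataTypes API itself, so the
build of the job jar may be worth checking.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("property", IntegerType, nullable = true),  // mirrors the Java snippet above
  StructField("name", StringType, nullable = true)        // hypothetical extra field
))
// the schema can then be applied with sqlContext.createDataFrame(rowRDD, schema)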



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Generating-a-schema-in-Spark-1-3-failed-while-using-DataTypes-tp22362.html



SparkSQL supports hive insert overwrite directory?

2015-03-06 Thread ogoh
Hello,
I am using Spark 1.2.1 along with Hive 0.13.1.
I run some Hive queries by using Beeline and the Thriftserver.
The queries I tested so far worked well except the following:
I want to export the query output into a file on either HDFS or the local fs
(ideally the local fs).
Is that not yet supported?
The Spark GitHub repo already has unit tests using insert overwrite directory,
e.g.
https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/insert_overwrite_local_directory_1.q.

$ insert overwrite directory 'hdfs directory name' select * from temptable;
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
temptable
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
'/user/ogoh/table'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for:
 TOK_DESTINATION
  TOK_DIR
'/user/bob/table'

$ insert overwrite local directory 'hdfs directory name' select * from
temptable;
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
temptable
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
/user/bob/table
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for:
 TOK_DESTINATION
  TOK_LOCAL_DIR
/user/ogoh/table
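
Until such parse rules exist, a hedged workaround sketch under Spark 1.2.1 is to run
the query through a HiveContext and write the rows out directly (tab-separated text
here; the path comes from the parse trees above, and a file:// URI could be used for
the local fs):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("select * from temptable")
  .map(_.mkString("\t"))
  .saveAsTextFile("/user/ogoh/table")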

Thanks,
Okehee



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-supports-hive-insert-overwrite-directory-tp21951.html



Hive on Spark vs. SparkSQL using Hive ?

2015-01-28 Thread ogoh

Hello,
This question was probably already asked, but I'd still like to confirm with
Spark users.

The following blog shows 'Hive on Spark':
http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/.
How is it different from using Hive as the data storage of SparkSQL
(http://spark.apache.org/docs/latest/sql-programming-guide.html)?
Also, is there any update about SparkSQL's next release (the current one is
still alpha)?

Thanks,
OGoh 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-on-Spark-vs-SparkSQL-using-Hive-tp21412.html