using a UDF (defined in Java) in Scala through spark-shell
Hello, I have a UDF declared in Java, but I'd like to call it from spark-shell, which only supports Scala. Since I am new to Scala, I couldn't figure out how to register the Java UDF using sqlContext.udf.register in Scala. Below is how I tried. I appreciate any help. Thanks,

= my UDF in Java

public class ArrayStringOfJson implements UDF1<String, ArrayType> {
  public ArrayType call(String input) throws Exception { .. }
}

= using it in spark-shell

scala> import org.apache.spark.sql.api.java.UDF1
scala> import com.mycom.event.udfs.ArrayStringOfJson
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val instance: ArrayStringOfJson = new ArrayStringOfJson
scala> sqlContext.udf.register("arraystring", instance, org.apache.spark.sql.types.ArrayType)
<console>:28: error: overloaded method value register with alternatives:
  (name: String, f: org.apache.spark.sql.api.java.UDF22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _], returnType: org.apache.spark.sql.types.DataType)Unit
  ...
  (name: String, f: org.apache.spark.sql.api.java.UDF1[_, _], returnType: org.apache.spark.sql.types.DataType)Unit
 cannot be applied to (String, com.mycom.event.udfs.ArrayStringOfJson, org.apache.spark.sql.types.ArrayType.type)
       sqlContext.udf.register("arraystring", instance, org.apache.spark.sql.types.ArrayType)
                      ^
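From the error message, the third argument has to be a DataType instance rather than the ArrayType companion object, so a register call along these lines should compile (a sketch only: it assumes the UDF turns a JSON string into an array of strings, hence ArrayType(StringType); the UDF's call method would then also need to return the array value itself, e.g. a String[] or java.util.List<String>, rather than an ArrayType):

scala> import org.apache.spark.sql.types.{ArrayType, StringType}
scala> val instance = new com.mycom.event.udfs.ArrayStringOfJson
scala> sqlContext.udf.register("arraystring", instance, ArrayType(StringType))  // returnType must be a concrete DataType value
scala> sqlContext.sql("select arraystring(json_col) from events")               // hypothetical column/table names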
SparkSQL 1.4 can't accept registration of UDF?
Hello, I am using SparkSQL along with the ThriftServer so that we can access it using Hive queries. With Spark 1.3.1 I can register the UDF, but Spark 1.4.0 doesn't work for that. The UDF jar is the same. Logs are below. I appreciate any advice.

== With Spark 1.4
Beeline version 1.4.0 by Apache Hive
0: jdbc:hive2://localhost:1> add jar hdfs:///user/hive/lib/dw-udf-2015.06.06-SNAPSHOT.jar;
0: jdbc:hive2://localhost:1> create temporary function parse_trace as 'com.mycom.dataengine.udf.GenericUDFParseTraceAnnotation';
15/07/14 23:49:43 DEBUG transport.TSaslTransport: writing data length: 206
15/07/14 23:49:43 DEBUG transport.TSaslTransport: CLIENT: reading data length: 201
Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)

== With Spark 1.3.1:
Beeline version 1.3.1 by Apache Hive
0: jdbc:hive2://localhost:10001> add jar hdfs:///user/hive/lib/dw-udf-2015.06.06-SNAPSHOT.jar;
+---------+
| Result  |
+---------+
+---------+
No rows selected (1.313 seconds)
0: jdbc:hive2://localhost:10001> create temporary function parse_trace as 'com.mycom.dataengine.udf.GenericUDFParseTraceAnnotation';
+---------+
| result  |
+---------+
+---------+
No rows selected (0.999 seconds)

=== The logs of the ThriftServer of Spark 1.4.0
15/07/14 23:49:43 INFO SparkExecuteStatementOperation: Running query 'create temporary function parse_trace as 'com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation''
15/07/14 23:49:43 INFO ParseDriver: Parsing command: create temporary function parse_trace as 'com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation'
15/07/14 23:49:43 INFO ParseDriver: Parse Completed
15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO Driver: Concurrency mode is disabled, not creating a lock manager
15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO ParseDriver: Parsing command: create temporary function parse_trace as 'com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation'
15/07/14 23:49:43 INFO ParseDriver: Parse Completed
15/07/14 23:49:43 INFO PerfLogger: /PERFLOG method=parse start=1436917783106 end=1436917783106 duration=0 from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO HiveMetaStore: 2: get_database: default
15/07/14 23:49:43 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_database: default
15/07/14 23:49:43 INFO HiveMetaStore: 2: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/07/14 23:49:43 INFO ObjectStore: ObjectStore, initialize called
15/07/14 23:49:43 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: @ (64), after : .
15/07/14 23:49:43 INFO Query: Reading in results for query org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is closing
15/07/14 23:49:43 INFO ObjectStore: Initialized ObjectStore
15/07/14 23:49:43 INFO FunctionSemanticAnalyzer: analyze done
15/07/14 23:49:43 INFO Driver: Semantic Analysis Completed
15/07/14 23:49:43 INFO PerfLogger: /PERFLOG method=semanticAnalyze start=1436917783106 end=1436917783114 duration=8 from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
15/07/14 23:49:43 INFO PerfLogger: /PERFLOG method=compile start=1436917783106 end=1436917783114 duration=8 from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO Driver: Starting command: create temporary function parse_trace as 'com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation'
15/07/14 23:49:43 INFO PerfLogger: /PERFLOG method=TimeToSubmit start=1436917783105 end=1436917783115 duration=10 from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 INFO PerfLogger: PERFLOG method=task.FUNCTION.Stage-0 from=org.apache.hadoop.hive.ql.Driver
15/07/14 23:49:43 ERROR Task: FAILED: Class com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation not found
15/07/14 23:49:43 INFO FunctionTask: create function: java.lang.ClassNotFoundException: com.quixey.dataengine.udf.GenericUDFParseTraceAnnotation
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
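For what it's worth, one possible way to narrow this down (just a sketch, not verified here) is to put the UDF jar on the ThriftServer's classpath at startup instead of relying on ADD JAR, since the ClassNotFoundException suggests the class never reaches the session's classloader under 1.4.0:

./sbin/start-thriftserver.sh --jars hdfs:///user/hive/lib/dw-udf-2015.06.06-SNAPSHOT.jar

If create temporary function then succeeds, the regression would be in how ADD JAR propagates the jar rather than in the UDF itself.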
Re: Error when connecting to Spark SQL via Hive JDBC driver
Hello, I am not sure what is wrong. But in my case, I followed the instructions from http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html. It worked fine with SQuirreL SQL Client (http://squirrel-sql.sourceforge.net/) and SQL Workbench/J (http://www.sql-workbench.net/).
SparkSQL: using a Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)
Hello, I tested some custom UDFs on SparkSQL's ThriftServer/Beeline (Spark 1.3.1). Some UDFs work fine (taking an array parameter and returning an int or string type), but my UDF returning a map type throws an error:

Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

I converted the code into Hive's GenericUDF since I worried that a complex-type parameter (array of map) and a complex return type (map) might only be supported by Hive's GenericUDF rather than a simple UDF. But SparkSQL doesn't seem to support GenericUDF (error message: Error: java.lang.IllegalAccessException: Class org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..). Below is my example UDF code returning a MAP type. I appreciate any advice. Thanks

--
public final class ArrayToMap extends UDF {
  public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
    // add code to handle index problems
    Map<String, String> map = new HashMap<String, String>();
    int count = 0;
    for (String element : arrayOfString) {
      map.put(count + "", element);
      count++;
    }
    return map;
  }
}
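For comparison, the same conversion written as a native Scala UDF avoids java.util.Map altogether, since Spark maps scala.collection.Map straight to the SQL MAP type. A minimal sketch (it assumes a spark-shell or embedded HiveContext named sqlContext rather than the Beeline session, and the column/table names are made up):

// register a Scala function: Seq[String] -> Map[String, String]
sqlContext.udf.register("array_to_map", (arr: Seq[String]) =>
  arr.zipWithIndex.map { case (element, i) => i.toString -> element }.toMap)

sqlContext.sql("SELECT array_to_map(some_array_col) FROM some_table")  // hypothetical names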
SparkSQL's performance degrades depending on the number of partitions of Hive tables... is it normal?
Hello, I posted this question a while back but am posting it again to get your attention. I am using SparkSQL 1.3.1 and Hive 0.13.1 on AWS YARN (tested under both 1.3.0 and 1.3.1). My Hive table is partitioned. I noticed that the query response time gets worse as the number of partitions grows, even though the query targets only a small subset of the partitions. TRACE-level logs (the ThriftServer's) showed that it runs commands like getFileInfo, getListing, and getBlockLocations for each and every partition (and getBlockLocations for each and every file), even though they are not part of the queried partitions. I don't know why that is necessary. Is it a bug in SparkSQL? Is there a way to avoid it? Below are the details of this issue, including logs. Thanks,
--
My Hive table, an external table, is partitioned by date and hour. I expected that a query over certain partitions would read only the data files of those partitions. I turned on TRACE-level logging for the ThriftServer since the query response time even for narrowed partitions was very long, and I found that all the available partitions are checked during some steps. The logs showed an execution flow such as:
==
Step 1: Contacted the HiveMetastore to get partition info (cmd: get_partitions)
Step 2: Came up with an execution rule
Step 3: Contacted the namenode to make at least 4 calls (data is in HDFS) for all available partitions of the table: getFileInfo once, getListing once, then repeated them again for each partition.
Step 4: Contacted the NameNode to find the block locations of all the partitions
Step 5: Contacted the DataNode for each file of all the partitions
Step 6: Contacted the NameNode again for all the partitions
Step 7: SparkSQL generated some optimal plan
Step 8: Contacted the corresponding datanodes for the narrowed partitions (it seems)
And more.
===
Why should Steps 3, 4, 5, and 6 check all partitions? After removing partitions from the table, the query was much quicker while processing the same volume of data. I don't know if this is normal, a Hive issue, a SparkSQL issue, or a configuration issue on my side. I added some logs below for some steps. I appreciate any of your advice.
Thanks a lot,
Okehee

some logs of some steps
Query: select count(*) from api_search where pdate='2015-05-23';

Step 2:
2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 - === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
!'Project [COUNT(1) AS _c0#25L]    Aggregate [], [COUNT(1) AS _c0#25L]
 Filter (pdate#26 = 2015-05-23)    Filter (pdate#26 = 2015-05-23)
  MetastoreRelation api_hdfs_perf, api_search, None    MetastoreRelation api_hdfs_perf, api_search, None
..
Step 3:
2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call - /10.128.193.211:9000: getFileInfo {src: /user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00}
2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 6 ms.
2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000
2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh sending #151
2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh: starting, having connections 2
2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh got value #151
2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took 13ms
2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response - /10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path: length: 0 permission { perm: 493 } owner: hadoop group: supergroup modification_time: 1432364487906 access_time: 0 block_replication: 0 blocksize: 0 fileId: 100602 childrenNum: 2 }}
2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call - /10.128.193.211:9000: getFileInfo {src: /user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00}
2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh sending #152
2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh got value #152
2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took 2ms.
..
Step 4:
2015-05-25 16:37:47 TRACE ProtobufRpcEngine:206 - 89: Call - /10.128.193.211:9000: getBlockLocations {src: /user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-1.parquet offset: 0 length: 1342177280}
...
Step 5:
2015-05-25 16:37:48 DEBUG DFSClient:951 - Connecting to datanode 10.191.137.197:9200
2015-05-25 16:37:48 TRACE BlockReaderFactory:653 - BlockReaderFactory(fileName=/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-2.parquet, block=BP-1843960649-10.128.193.211-1427923845046:blk_1073758677_981812
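For what it's worth, one way to check whether the partition filter actually prunes the scan is to look at the plans SparkSQL produces for the same query (a sketch for spark-shell; it assumes a HiveContext built on the same metastore, which is what the ThriftServer uses internally):

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
// prints the parsed/analyzed/optimized/physical plans; the physical plan should show
// whether only the pdate='2015-05-23' partition is scanned or all partitions are listed
hiveContext.sql(
  "EXPLAIN EXTENDED SELECT COUNT(*) FROM api_search WHERE pdate = '2015-05-23'"
).collect().foreach(println)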
Re: Spark 1.3.0 - 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER
I had the same issue on AWS EMR with Spark 1.3.1.e (the AWS version) when the '-h' parameter was passed (it is a bootstrap-action parameter for Spark). I don't see the problem with Spark 1.3.1.e when not passing that parameter. I am not sure about your env. Thanks,
SparkSQL's performance: contacting the namenode and datanode to unnecessarily check all partitions for a query of specific partitions
Hello, I am using SparkSQL 1.3.0 and Hive 0.13.1 on AWS YARN. My Hive table, an external table, is partitioned by date and hour. I expected that a query over certain partitions would read only the data files of those partitions. I turned on TRACE-level logging for the ThriftServer since the query response time even for narrowed partitions was very long, and I found that all the available partitions are checked during some steps. The logs showed an execution flow such as:
==
Step 1: Contacted the HiveMetastore to get partition info (cmd: get_partitions)
Step 2: Came up with an execution rule
Step 3: Contacted the namenode to make at least 4 calls (data is in HDFS) for all available partitions of the table: getFileInfo once, getListing once, then repeated them again for each partition.
Step 4: Contacted the NameNode to find the block locations of all the partitions
Step 5: Contacted the DataNode for each file of all the partitions
Step 6: Contacted the NameNode again for all the partitions
Step 7: SparkSQL generated some optimal plan
Step 8: Contacted the corresponding datanodes for the narrowed partitions (it seems)
And more.
===
Why should Steps 3, 4, 5, and 6 check all partitions? After removing partitions from the table, the query was much quicker while processing the same volume of data. I don't know if this is normal, a Hive issue, a SparkSQL issue, or a configuration issue on my side. I added some logs below for some steps. I appreciate any of your advice.
Thanks a lot,
Okehee

some logs of some steps
Query: select count(*) from api_search where pdate='2015-05-23';

Step 2:
2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 - === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
!'Project [COUNT(1) AS _c0#25L]    Aggregate [], [COUNT(1) AS _c0#25L]
 Filter (pdate#26 = 2015-05-23)    Filter (pdate#26 = 2015-05-23)
  MetastoreRelation api_hdfs_perf, api_search, None    MetastoreRelation api_hdfs_perf, api_search, None
..
Step 3:
2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call - /10.128.193.211:9000: getFileInfo {src: /user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00}
2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 6 ms.
2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000
2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh sending #151
2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh: starting, having connections 2
2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh got value #151
2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took 13ms
2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response - /10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path: length: 0 permission { perm: 493 } owner: hadoop group: supergroup modification_time: 1432364487906 access_time: 0 block_replication: 0 blocksize: 0 fileId: 100602 childrenNum: 2 }}
2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call - /10.128.193.211:9000: getFileInfo {src: /user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00}
2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh sending #152
2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection to /10.128.193.211:9000 from ogoh got value #152
2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took 2ms.
..
Step 4:
2015-05-25 16:37:47 TRACE ProtobufRpcEngine:206 - 89: Call - /10.128.193.211:9000: getBlockLocations {src: /user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-1.parquet offset: 0 length: 1342177280}
...
Step 5:
2015-05-25 16:37:48 DEBUG DFSClient:951 - Connecting to datanode 10.191.137.197:9200
2015-05-25 16:37:48 TRACE BlockReaderFactory:653 - BlockReaderFactory(fileName=/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-2.parquet, block=BP-1843960649-10.128.193.211-1427923845046:blk_1073758677_981812): trying to create a remote block reader from a TCP socket
...
Step 6:
2015-05-25 16:37:56 TRACE ProtobufRpcEngine:206 - 84: Call - /10.128.193.211:9000: getFileInfo {src: /user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00}
...
Step 7:
=== Applying Rule org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions ===
== Optimized Logical Plan ==
Aggregate [], [COUNT(1) AS _c0#25L]
 Project []
  Filter (pdate#111 = 2015-05-23)
   Relation[timestamp#84,request_id#85,request_timestamp#86,response_timestamp#87,request_query_url#88,request_query_params#89,response_status#90,q#91,session_id#92,partner_id#93,partner_name#94,partner_ip#95,partner_useragent#96,search_id#97,user_id#98,client_ip#99,client_country#100
SparkSQL can't read S3 path for hive external table
Hello, I am using Spark 1.3 in AWS. SparkSQL can't recognize a Hive external table on S3. The following is the error message. I appreciate any help.
Thanks, Okehee
--
15/05/24 01:02:18 ERROR thriftserver.SparkSQLDriver: Failed in [select count(*) from api_search where pdate='2015-05-08']
java.lang.IllegalArgumentException: Wrong FS: s3://test-emr/datawarehouse/api_s3_perf/api_search/pdate=2015-05-08/phour=00, expected: hdfs://10.128.193.211:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:467)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
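Since the stack trace goes through ParquetRelation2, one guess (untested, just a sketch) is that Spark's built-in Parquet conversion for metastore tables is qualifying the S3 paths against the default HDFS filesystem. Disabling that conversion makes the table go through Hive's own input format instead, which would at least show whether the conversion path is the culprit (sqlContext here is assumed to be the HiveContext in use; the equivalent "SET spark.sql.hive.convertMetastoreParquet=false;" can be issued from the spark-sql shell):

// fall back to the Hive SerDe path for metastore Parquet tables
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
sqlContext.sql("select count(*) from api_search where pdate='2015-05-08'")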
SparkSQL failing while writing into S3 for 'insert into table'
Hello, I am using Spark 1.3 and Hive 0.13.1 in AWS. From Spark-SQL, when running a Hive query to export the query result into AWS S3, it failed with the following message:
==
org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths: s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1 has nested directory s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1/_temporary
        at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)
        at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)
        at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)
        at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
==
The queries tested are:
spark-sql> create external table s3_dwserver_sql_t1 (q string) location 's3://test-dev/s3_dwserver_sql_t1';
spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search where pdate='2015-05-12' limit 100;
==
It seems it generates the query results into a tmp dir first and tries to rename them into the right folder at the end, but it fails while renaming. I appreciate any advice.
Thanks, Okehee
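As an aside, a different route to getting results onto S3 that skips Hive's staging/rename step is to write the result DataFrame directly (a sketch from a HiveContext-based spark-shell; only the table and filter come from the query above, the output path is made up):

// write the query result straight to S3 as Parquet instead of
// going through INSERT INTO and the rename-from-staging step
val result = sqlContext.sql(
  "select q from api_search where pdate='2015-05-12' limit 100")
result.saveAsParquetFile("s3://test-dev/s3_dwserver_sql_t1_parquet")  // hypothetical output path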
beeline that comes with Spark 1.3.0 doesn't work with --hiveconf or --hivevar, which substitute variables in hive scripts.
Hello, I am using Spark 1.3 for SparkSQL (Hive) with the ThriftServer and Beeline. Beeline doesn't work with --hiveconf or --hivevar, which substitute variables in hive scripts. I found the following JIRAs saying that Hive 0.13 resolved that issue. I wonder if this is a well-known issue?
https://issues.apache.org/jira/browse/HIVE-4568 Beeline needs to support resolving variables
https://issues.apache.org/jira/browse/HIVE-6173 Beeline doesn't accept --hiveconf option as Hive CLI does
Thanks, Okehee
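For context, the usage that fails to substitute is the standard Hive-style variable substitution, e.g. (a sketch; the port, table, and variable names are just examples):

beeline -u jdbc:hive2://localhost:10001 --hivevar pdate=2015-05-23
0: jdbc:hive2://localhost:10001> select count(*) from api_search where pdate = '${hivevar:pdate}';

The ${hivevar:pdate} reference is expected to be resolved on the client before the statement is sent to the ThriftServer.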
Generating a schema in Spark 1.3 failed while using DataTypes.
Hello, My ETL uses SparkSQL to generate parquet files which are served through the ThriftServer using Hive QL. In particular, it defines a schema programmatically, since the schema is only known at runtime. With Spark 1.2.1 it worked fine (following https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema). I am trying to migrate to Spark 1.3.0, but the API is confusing. I am not sure if the example at https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema is still valid for Spark 1.3.0? For example, DataType.StringType is not there any more; instead, I found DataTypes.StringType, etc. So I migrated as below and it builds fine, but at runtime it throws an exception. I appreciate any help.
Thanks, Okehee

== exception thrown
java.lang.reflect.InvocationTargetException
java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String;

== my code's snippet
import org.apache.spark.sql.types.DataTypes;
DataTypes.createStructField(property, DataTypes.IntegerType, true)
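For reference, the programmatic-schema approach from the guide still works in 1.3 when written in Scala; a minimal sketch (field names are made up, and rowRDD stands for whatever RDD[Row] the ETL produces). The NoSuchMethodError on scala.reflect.NameTransformer, though, usually points to a Scala version mismatch between the application build and the Spark cluster rather than to the DataTypes API itself, so that may be worth checking first.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// build the schema at runtime from (name, type) pairs known only then
val schema = StructType(Seq(
  StructField("property", IntegerType, nullable = true),   // made-up field names
  StructField("name", StringType, nullable = true)))

// createDataFrame is the 1.3 replacement for 1.2's applySchema
val df = sqlContext.createDataFrame(rowRDD, schema)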
Does SparkSQL support Hive's insert overwrite directory?
Hello, I am using Spark 1.2.1 along with Hive 0.13.1. I run some Hive queries using Beeline and the ThriftServer. The queries I tested so far worked well except the following: I want to export query output into a file on either HDFS or the local fs (ideally the local fs). Is that not yet supported? The Spark github already has unit tests using insert overwrite directory in https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/insert_overwrite_local_directory_1.q.

$ insert overwrite directory 'hdfs directory name' select * from temptable;
TOK_QUERY
  TOK_FROM
    TOK_TABREF
      TOK_TABNAME
        temptable
  TOK_INSERT
    TOK_DESTINATION
      TOK_DIR
        '/user/ogoh/table'
    TOK_SELECT
      TOK_SELEXPR
        TOK_ALLCOLREF
scala.NotImplementedError: No parse rules for: TOK_DESTINATION TOK_DIR '/user/bob/table'

$ insert overwrite local directory 'hdfs directory name' select * from temptable;
TOK_QUERY
  TOK_FROM
    TOK_TABREF
      TOK_TABNAME
        temptable
  TOK_INSERT
    TOK_DESTINATION
      TOK_LOCAL_DIR
        /user/bob/table
    TOK_SELECT
      TOK_SELEXPR
        TOK_ALLCOLREF
scala.NotImplementedError: No parse rules for: TOK_DESTINATION TOK_LOCAL_DIR /user/ogoh/table

Thanks, Okehee
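Until those destination tokens are supported, one workaround for the HDFS case is to write the result out from a HiveContext in Scala instead of through Beeline (a sketch; in 1.2.1 sql() returns a SchemaRDD, which is an RDD of Rows, and the delimiter/output path here are just examples):

// dump the query result to an HDFS directory as delimited text,
// roughly what "insert overwrite directory" would have produced
val result = sqlContext.sql("select * from temptable")
result.map(_.mkString("\u0001"))          // Hive's default field delimiter
      .saveAsTextFile("hdfs:///user/ogoh/table")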
Hive on Spark vs. SparkSQL using Hive ?
Hello, probably this question was already asked, but I'd still like to confirm with Spark users. The following blog shows 'Hive on Spark': http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/. How is it different from using Hive as the data storage of SparkSQL (http://spark.apache.org/docs/latest/sql-programming-guide.html)? Also, is there any update about SparkSQL's next release (the current one is still alpha)?
Thanks, OGoh