sparkR third-party library
Hi, I am using spark.lapply to execute an existing R script in standalone mode. This script calls the function 'rbga' from the third-party library 'genalg'. The rbga function works fine in the SparkR environment when I call it directly, but when I call it through spark.lapply I get the error: could not find function "rbga" at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108) at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala Any ideas/suggestions? BR, Patcharee
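A likely cause (an assumption, not verified on this cluster) is that the closure passed to spark.lapply runs in a fresh R worker process that has not attached 'genalg', so the package has to be loaded inside the function itself and must be installed on every worker node. A minimal sketch; the rbga arguments and the fitness function are placeholders for illustration only:

results <- spark.lapply(1:10, function(i) {
  library(genalg)                       # attach the third-party package on the worker
  evalFunc <- function(x) sum(x)        # placeholder fitness function
  rbga(stringMin = rep(0, 5),
       stringMax = rep(10, 5),
       evalFunc  = evalFunc,
       iters     = 20)
})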
simple application on tez + llap
Hi, I found examples of simple applications like wordcount running on Tez - https://github.com/apache/tez/tree/master/tez-examples/src/main/java/org/apache/tez/examples. However, how can I run these on Tez + LLAP? Any suggestions? BR, Patcharee
Re: import sql file
I exported a SQL table into a .sql file and would like to import it into Hive. Best, Patcharee On 23. nov. 2016 10:40, Markovitz, Dudu wrote: Hi Patcharee The question is not clear. Dudu -Original Message- From: patcharee [mailto:patcharee.thong...@uni.no] Sent: Wednesday, November 23, 2016 11:37 AM To: user@hive.apache.org Subject: import sql file Hi, How can I import a .sql file into Hive? Best, Patcharee
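If the .sql file contains statements that Hive can parse (DDL/DML dumped from another database usually needs editing first), one option is simply to run the script file through the Hive CLI or Beeline. A sketch; the file name and connection string are placeholders:

hive -f table_dump.sql
beeline -u jdbc:hive2://hiveserver-host:10000/default -f table_dump.sql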
import sql file
Hi, How can I import a .sql file into Hive? Best, Patcharee
Re: hiveserver2 java heap space
It works on the Hive CLI. Patcharee On 10/24/2016 11:51 AM, Mich Talebzadeh wrote: does this work ok through Hive cli? Dr Mich Talebzadeh LinkedIn /https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/ http://talebzadehmich.wordpress.com On 24 October 2016 at 10:43, Patcharee Thongtra <patcharee.thong...@uni.no> wrote: Hi, I tried to query an ORC file via beeline and a Java program using JDBC ("select * from orcfileTable limit 1"). Both failed with: Caused by: java.lang.OutOfMemoryError: Java heap space. The HiveServer2 heap size is 1024m. I guess I need to increase the HiveServer2 heap size? However, I wonder why I get this error at all, since I query just ONE line. Any ideas? Thanks, Patcharee
hiveserver2 java heap space
Hi, I tried to query an ORC file via beeline and a Java program using JDBC ("select * from orcfileTable limit 1"). Both failed with: Caused by: java.lang.OutOfMemoryError: Java heap space. The HiveServer2 heap size is 1024m. I guess I need to increase the HiveServer2 heap size? However, I wonder why I get this error at all, since I query just ONE line. Any ideas? Thanks, Patcharee
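For reference, one way to raise the heap for Hive services (a sketch; the exact mechanism varies by distribution, and 4096 MB is just an example value) is through hive-env.sh before restarting HiveServer2, as hinted in hive-env.sh.template:

export HADOOP_HEAPSIZE=4096    # heap size in MB for JVMs started by the hive scripts

One possible reason for the failure despite LIMIT 1 (an assumption, not verified here) is that the server still loads table/partition metadata and ORC file footers to plan the query, and that cost grows with the number of partitions and columns rather than with the number of rows returned.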
hiveserver2 GC overhead limit exceeded
Hi, I use beeline to connect to HiveServer2. I tested with a simple command and got the error "GC overhead limit exceeded": 0: jdbc:hive2://service-10-1:10010/default> drop table testhivedrivertable; Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. GC overhead limit exceeded (state=08S01,code=1) How can I solve this? How can I identify whether this error comes from the client (beeline) or from HiveServer2? Thanks, Patcharee
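Since the message is returned inside the statement result (FAILED: Execution Error ... from DDLTask), it most likely originates on the server side rather than in beeline, and the HiveServer2 log should contain the matching stack trace. To rule out the client anyway, its JVM heap can be raised before starting beeline (a sketch; 2g is an arbitrary value):

export HADOOP_CLIENT_OPTS="-Xmx2g"
beeline -u jdbc:hive2://service-10-1:10010/default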
Re: Spark DataFrame Plotting
Hi Moon, When I generate an extra column (the schema will be Index:Int, A:Double, B:Double), what SQL command will generate a graph with two lines (Index as the X-axis, BOTH A and B as the Y-axis)? Do I need to group by? Thanks! Patcharee On 07. sep. 2016 16:58, moon soo Lee wrote: You will need to generate an extra column which can be used as an X-axis for columns A and B. On Wed, Sep 7, 2016 at 2:34 AM Abhisar Mohapatra <abhisar.mohapa...@inmobi.com> wrote: I am not sure, but can you try the groupBy function in Zeppelin? If you can group by columns then I guess that would do the trick, or pass the data to another HTML block and use a d3 comparison chart to plot it. On Wed, Sep 7, 2016 at 2:56 PM, patcharee <patcharee.thong...@uni.no> wrote: A normal select * gives me one column on the X-axis and another on the Y-axis. I cannot get both A:Double and B:Double displayed on the Y-axis. How can I do that? Patcharee On 07. sep. 2016 11:05, Abhisar Mohapatra wrote: You can do a normal select * on the dataframe and it would be automatically interpreted. On Wed, Sep 7, 2016 at 2:29 PM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I have a dataframe with this schema: A:Double, B:Double. How can I plot this dataframe as two lines (comparing A and B at each step)? Best, Patcharee
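For the record, one way to add the index column (a minimal sketch for the Zeppelin Spark interpreter, assuming the dataframe is called df with columns A then B; the temp-table name is made up):

import sqlContext.implicits._
val indexed = df.rdd.zipWithIndex.map { case (row, i) =>
  (i, row.getDouble(0), row.getDouble(1))      // (Index, A, B)
}.toDF("Index", "A", "B")
indexed.registerTempTable("ab_indexed")

Then in a %sql paragraph run select Index, A, B from ab_indexed and, in the chart settings, choose Index as the key and both A and B as values to get the two lines; no group-by is needed.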
Re: Spark DataFrame Plotting
A normal select * gives me one column on the X-axis and another on the Y-axis. I cannot get both A:Double and B:Double displayed on the Y-axis. How can I do that? Patcharee On 07. sep. 2016 11:05, Abhisar Mohapatra wrote: You can do a normal select * on the dataframe and it would be automatically interpreted. On Wed, Sep 7, 2016 at 2:29 PM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I have a dataframe with this schema: A:Double, B:Double. How can I plot this dataframe as two lines (comparing A and B at each step)? Best, Patcharee
Spark DataFrame Plotting
Hi, I have a dataframe with this schema A:Double, B:Double. How can I plot this dataframe as two lines (comparing A and B at each step)? Best, Patcharee
What contributes to Task Deserialization Time
Hi, I'm running a simple job (reading a sequential file and collecting data at the driver) in yarn-client mode. Looking at the history server UI, the Task Deserialization Time varies a lot between tasks (5 ms to 5 s). What contributes to this Task Deserialization Time? Thank you in advance! Patcharee
Re: Failed to stream on Yarn cluster
Hi again, Actually it works well! I just realized from looking at the Yarn application log that the Flink streaming result is printed to taskmanager.out. When I sent the question to the mailing list I was looking at the screen where I issued the command, and there was no streaming result there. Where is taskmanager.out and how can I change it? Best, Patcharee On 28. april 2016 13:18, Maximilian Michels wrote: Hi Patcharee, What do you mean by "nothing happened"? There is no output? Did you check the logs? Cheers, Max On Thu, Apr 28, 2016 at 12:10 PM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I am testing the streaming wiki example - https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html It works fine in local mode (mvn exec:java -Dexec.mainClass=wikiedits.WikipediaAnalysis). But when I run the jar in Yarn cluster mode, nothing happens. Any ideas? I tested the word count example reading from an HDFS file on the Yarn cluster and it worked fine. Best, Patcharee
Failed to stream on Yarn cluster
Hi, I am testing the streaming wiki example - https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html It works fine in local mode (mvn exec:java -Dexec.mainClass=wikiedits.WikipediaAnalysis). But when I run the jar in Yarn cluster mode, nothing happens. Any ideas? I tested the word count example reading from an HDFS file on the Yarn cluster and it worked fine. Best, Patcharee
Re: pyspark split pair rdd to multiple
I can also use a dataframe. Any suggestions? Best, Patcharee On 20. april 2016 10:43, Gourav Sengupta wrote: Is there any reason why you are not using data frames? Regards, Gourav On Tue, Apr 19, 2016 at 8:51 PM, pth001 <patcharee.thong...@uni.no> wrote: Hi, How can I split a pair RDD [K, V] into a map [K, Array(V)] efficiently in PySpark? Best, Patcharee
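A minimal PySpark sketch of both routes (assuming a SparkContext/SQLContext is available, e.g. the pyspark shell; names and sample data are made up):

from pyspark.sql import functions as F

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

# RDD route: group the values per key
grouped = pairs.groupByKey().mapValues(list)     # [('a', [1, 2]), ('b', [3])]

# DataFrame route: the same grouping with collect_list
df = pairs.toDF(["k", "v"])
df.groupBy("k").agg(F.collect_list("v").alias("vs")).show()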
Re: build r-intepreter
Yes, I had not installed R. Stupid me. Thanks for your guidance! BR, Patcharee On 04/13/2016 08:23 PM, Eric Charles wrote: Can you post the full stacktrace you have (look also at the log file)? Did you install R on your machine? SPARK_HOME is optional. On 13/04/16 15:39, Patcharee Thongtra wrote: Hi, When I ran the R notebook example, I got these errors in the logs: - Caused by: org.apache.zeppelin.interpreter.InterpreterException: sparkr is not responding - Caused by: org.apache.thrift.transport.TTransportException I have not configured SPARK_HOME so far, and intended to use the embedded Spark for testing first. BR, Patcharee On 04/13/2016 02:52 PM, Patcharee Thongtra wrote: Hi, I have been struggling with the R interpreter / SparkR interpreter. Is the command below the right one to build Zeppelin with the R / SparkR interpreter? mvn clean package -Pspark-1.6 -Phadoop-2.6 -Pyarn -Ppyspark -Psparkr BR, Patcharee
executor running time vs getting result from jupyter notebook
Hi, I am running a Jupyter notebook with PySpark. I noticed from the history server UI that some tasks spend a lot of time on either - executor running time - getting result, while other tasks finish both steps very quickly. All tasks, however, have very similar input sizes. What can be the cause of the time spent on these steps? BR, Patcharee
Re: build r-intepreter
Hi, When I ran the R notebook example, I got these errors in the logs: - Caused by: org.apache.zeppelin.interpreter.InterpreterException: sparkr is not responding - Caused by: org.apache.thrift.transport.TTransportException I have not configured SPARK_HOME so far, and intended to use the embedded Spark for testing first. BR, Patcharee On 04/13/2016 02:52 PM, Patcharee Thongtra wrote: Hi, I have been struggling with the R interpreter / SparkR interpreter. Is the command below the right one to build Zeppelin with the R / SparkR interpreter? mvn clean package -Pspark-1.6 -Phadoop-2.6 -Pyarn -Ppyspark -Psparkr BR, Patcharee
ExclamationTopology workers executors vs tasks
Hi, I am new to Storm. I am running the storm-starter example ExclamationTopology on a Storm cluster (version 0.10). The code snippet is below: TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("word", new TestWordSpout(), 10); builder.setBolt("exclaim1", new ExclamationBolt(), 3).shuffleGrouping("word"); builder.setBolt("exclaim2", new ExclamationBolt(), 2).shuffleGrouping("exclaim1"); However, I do not understand why the Num executors and Num tasks of this topology (as seen in the Storm UI) are both 18 (instead of 15). Also in the Storm UI, the Num executors and Num tasks of the spout word and the bolts exclaim1 and exclaim2 are 10, 3 and 2 respectively (the same as defined in the code). Thanks, Patcharee
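A likely explanation (an assumption, not verified against this cluster): the topology-level totals include the system acker bolt, which by default gets one executor per worker, and the storm-starter ExclamationTopology requests 3 workers, which would account for 15 + 3 = 18. One way to confirm is to disable ackers and watch the totals drop back to 15; a sketch using the builder from the snippet above (Java, Storm 0.10 package names):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;

Config conf = new Config();
conf.setNumWorkers(3);
conf.setNumAckers(0);   // no acker executors; note this also disables tuple acking
StormSubmitter.submitTopology("exclamation", conf, builder.createTopology());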
kafka streaming topic partitions vs executors
Hi, I am working on a streaming application integrated with Kafka via the createDirectStream API. The application streams a topic which contains 10 partitions (on Kafka), and it executes with 10 executors (--num-executors 10). When it reads data from Kafka/ZooKeeper, Spark creates 10 tasks (the same as the number of topic partitions). However, some executors are given more than 1 task and work on these tasks sequentially. Why does Spark not distribute these 10 tasks to the 10 executors? How can I make it do that? Thanks, Patcharee
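One thing worth checking (a sketch, not a guaranteed fix; 0s is just an example value): the scheduler may be waiting for a preferred, data-local executor to become free before handing a task to an idle one, which is governed by spark.locality.wait. Lowering it makes Spark give the Kafka-partition tasks to any free executor sooner:

spark-submit --num-executors 10 --conf spark.locality.wait=0s ...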
Re: streaming textFileStream problem - got only ONE line
I move them into the monitored directory every interval. Patcharee On 25. jan. 2016 22:30, Shixiong(Ryan) Zhu wrote: Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or write into it directly? `textFileStream` requires that files must be written to the monitored directory by "moving" them from another location within the same file system. On Mon, Jan 25, 2016 at 6:30 AM, patcharee <patcharee.thong...@uni.no> wrote: Hi, My streaming application receives data from the file system and just prints the input count every 1 sec interval, as in the code below: val sparkConf = new SparkConf() val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms)) val lines = ssc.textFileStream(args(0)) lines.count().print() The problem is that sometimes the data received from ssc.textFileStream is ONLY ONE line, while in fact there are multiple lines in the new file found in that interval. See the log below, which shows three intervals. In the 2nd interval, the new file is hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This file contains 6288 lines, but ssc.textFileStream returns ONLY ONE line (the header). Any ideas/suggestions what the problem is? - SPARK LOG - 16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that were older than 1453731011000 ms: 145373101 ms 16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that were older than 1453731011000 ms: 16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms 16/01/25 15:11:12 INFO FileInputDStream: New files at time 1453731072000 ms: hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt --- Time: 1453731072000 ms --- 6288 16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that were older than 1453731012000 ms: 1453731011000 ms 16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that were older than 1453731012000 ms: 16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms 16/01/25 15:11:13 INFO FileInputDStream: New files at time 1453731073000 ms: hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt --- Time: 1453731073000 ms --- 1 16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that were older than 1453731013000 ms: 1453731012000 ms 16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that were older than 1453731013000 ms: 16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms 16/01/25 15:11:14 INFO FileInputDStream: New files at time 1453731074000 ms: hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt --- Time: 1453731074000 ms --- 6288 Thanks, Patcharee
Pyspark filter not empty
Hi, In PySpark, how do I filter rows where a dataframe column is not empty? I tried: dfNotEmpty = df.filter(df['msg']!='') It did not work. Thanks, Patcharee
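One possible cause (an assumption) is that the "empty" values are actually nulls or whitespace rather than empty strings, which the inequality test alone does not cover. A minimal sketch, assuming df has a string column 'msg':

from pyspark.sql import functions as F

# keep rows whose msg is non-null and not the empty string
df_not_empty = df.filter(df['msg'].isNotNull() & (df['msg'] != ''))

# or, stricter: keep rows whose trimmed msg has a positive length
df_not_empty2 = df.filter(F.length(F.trim(F.col('msg'))) > 0)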
spark streaming input rate strange
Hi, I have a streaming application with - a 1 sec batch interval - data accepted from a simulation through a MulticastSocket The simulation sends out data using multiple clients/threads every 1 sec interval. The input rate accepted by the streaming application looks strange: - When clients = 10,000, the event rate rises to 10,000, stays at 10,000 for a while and then drops to about 7000-8000. - When clients = 20,000, the event rate rises to 20,000, stays at 20,000 for a while and then drops to about 15000-17000 - the same pattern. The processing time is just about 400 ms. Any ideas/suggestions? Thanks, Patcharee
visualize data from spark streaming
Hi, How can I visualize realtime data (in a graph/chart) from Spark Streaming? Any tools? Best, Patcharee
bad performance on PySpark - big text file
Hi, I am very new to PySpark. I have a PySpark app working on text files of different sizes (100M - 100G), and each task handles the same size of input split. However, workers spend much longer on some input splits, especially when the splits belong to a big file. See the logs of these two input splits below (check python.PythonRunner: Times: total ...): 15/12/08 07:37:15 INFO rdd.NewHadoopRDD: Input split: hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/budisansblog.blogspot.com.html:39728447488+134217728 15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4335010, boot = -140, init = 282, finish = 4334868 15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(125163) called with curMem=227636200, maxMem=4341293383 15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_3_1772 stored as bytes in memory (estimated size 122.2 KB, free 3.8 GB) 15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4, boot = 1, init = 0, finish = 3 15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(126595) called with curMem=227761363, maxMem=4341293383 15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_9_1772 stored as bytes in memory (estimated size 123.6 KB, free 3.8 GB) 15/12/08 08:49:30 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1 15/12/08 08:49:30 INFO datasources.DynamicPartitionWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter 15/12/08 08:49:30 INFO output.FileOutputCommitter: Saved output of task 'attempt_201512080849_0002_m_001772_0' to hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512080849_0002_m_001772 15/12/08 08:49:30 INFO mapred.SparkHadoopMapRedUtil: attempt_201512080849_0002_m_001772_0: Committed 15/12/08 08:49:30 INFO executor.Executor: Finished task 1772.0 in stage 2.0 (TID 1770). 16216 bytes result sent to driver 15/12/07 20:52:24 INFO rdd.NewHadoopRDD: Input split: hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/bcnn1wp.wordpress.com.html:1476395008+134217728 15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 41776, boot = -425, init = 432, finish = 41769 15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1434614) called with curMem=167647961, maxMem=4341293383 15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_3_994 stored as bytes in memory (estimated size 1401.0 KB, free 3.9 GB) 15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 40, boot = -20, init = 21, finish = 39 15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1463477) called with curMem=169082575, maxMem=4341293383 15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_9_994 stored as bytes in memory (estimated size 1429.2 KB, free 3.9 GB) 15/12/07 20:53:06 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1 15/12/07 20:53:06 INFO datasources.DynamicPartitionWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter 15/12/07 20:53:06 INFO output.FileOutputCommitter: Saved output of task 'attempt_201512072053_0002_m_000994_0' to hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512072053_0002_m_000994 15/12/07 20:53:06 INFO mapred.SparkHadoopMapRedUtil: attempt_201512072053_0002_m_000994_0: Committed 15/12/07 20:53:06 INFO executor.Executor: Finished task 994.0 in stage 2.0 (TID 990). 9386 bytes result sent to driver Any suggestions please? Thanks, Patcharee
Spark UI - Streaming Tab
Hi, We tried to get the streaming tab interface on the Spark UI - https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html Tested on versions 1.5.1 and 1.6.0-snapshot, but there is no such interface for streaming applications at all. Any suggestions? Do we need to configure the history UI somehow to get this interface? Thanks, Patcharee
Spark applications metrics
Hi, How can I see the summary of data read/write, shuffle read/write, etc. for an application as a whole, not per stage? Thanks, Patcharee
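One option (a sketch; the host, port and application id are placeholders) is the monitoring REST API, which exposes per-executor totals that can be summed for an application-wide view:

curl http://history-server:18080/api/v1/applications/app-20151201123456-0001/executors

Each entry in the response includes fields such as totalInputBytes, totalShuffleRead and totalShuffleWrite.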
Re: Spark UI - Streaming Tab
I ran streaming jobs, but no streaming tab appeared for those jobs. Patcharee On 04. des. 2015 18:12, PhuDuc Nguyen wrote: I believe the "Streaming" tab is dynamic - it appears once you have a streaming job running, not when the cluster is simply up. It does not depend on 1.6 and has been in there since at least 1.0. HTH, Duc On Fri, Dec 4, 2015 at 7:28 AM, patcharee <patcharee.thong...@uni.no> wrote: Hi, We tried to get the streaming tab interface on the Spark UI - https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html Tested on versions 1.5.1 and 1.6.0-snapshot, but there is no such interface for streaming applications at all. Any suggestions? Do we need to configure the history UI somehow to get this interface? Thanks, Patcharee
Re: Spark Streaming - History UI
I meant there is no streaming tab at all. It looks like I need version 1.6. Patcharee On 02. des. 2015 11:34, Steve Loughran wrote: The history UI doesn't update itself for live apps (SPARK-7889) - though I'm working on it. Are you trying to view a running streaming job? On 2 Dec 2015, at 05:28, patcharee <patcharee.thong...@uni.no> wrote: Hi, On my history server UI, I cannot see the "streaming" tab for any streaming jobs. I am using version 1.5.1. Any ideas? Thanks, Patcharee
Spark Streaming - History UI
Hi, On my history server UI, I cannot see the "streaming" tab for any streaming jobs. I am using version 1.5.1. Any ideas? Thanks, Patcharee
custom inputformat recordreader
Hi, In Python, how can I use a custom InputFormat / RecordReader with Spark? Thanks, Patcharee
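In PySpark, Hadoop InputFormats can be read with sc.newAPIHadoopFile (or sc.hadoopFile for the old mapred API). A minimal sketch; the class names below are stock Hadoop ones used only for illustration, and a custom InputFormat would be shipped with --jars and referenced by its fully qualified name:

rdd = sc.newAPIHadoopFile(
    "hdfs:///path/to/data",
    inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")

Custom key/value types usually also need a Converter passed via the keyConverter/valueConverter arguments.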
data local read counter
Hi, Is there a counter for data-local reads? I assumed the locality level counter was it, but it seems not. Thanks, Patcharee
Re: query orc file by hive
Hi, It works with a non-partitioned ORC file, but does not work with a (2-column) partitioned ORC file. Thanks, Patcharee On 09. nov. 2015 10:55, Elliot West wrote: Hi, You can create a table and point the location property to the folder containing your ORC file: CREATE EXTERNAL TABLE orc_table ( ) STORED AS ORC LOCATION '/hdfs/folder/containing/orc/file' ; https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable Thanks - Elliot. On 9 November 2015 at 09:44, patcharee <patcharee.thong...@uni.no> wrote: Hi, How can I query an orc file (*.orc) with Hive? This ORC file is created by other applications, like Spark or MapReduce. Thanks, Patcharee
Re: query orc file by hive
Hi, It works after I ran ALTER TABLE ... ADD PARTITION. Thanks! My partitioned ORC file (directory) was created by Spark, therefore Hive was not aware of the partitions automatically. Best, Patcharee On 13. nov. 2015 13:08, Elliot West wrote: Have you added the partitions to the metastore? ALTER TABLE ... ADD PARTITION ... If using Spark, I believe it has good support to do this automatically with the HiveContext, although I have not used it myself. Elliot. On Friday, 13 November 2015, patcharee <patcharee.thong...@uni.no> wrote: Hi, It works with a non-partitioned ORC file, but does not work with a (2-column) partitioned ORC file. Thanks, Patcharee On 09. nov. 2015 10:55, Elliot West wrote: Hi, You can create a table and point the location property to the folder containing your ORC file: CREATE EXTERNAL TABLE orc_table ( ) STORED AS ORC LOCATION '/hdfs/folder/containing/orc/file' ; https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable Thanks - Elliot. On 9 November 2015 at 09:44, patcharee <patcharee.thong...@uni.no> wrote: Hi, How can I query an orc file (*.orc) with Hive? This ORC file is created by other applications, like Spark or MapReduce. Thanks, Patcharee
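For reference, a sketch of the two usual ways to register partitions that were written outside Hive (the table, partition and path names are placeholders):

ALTER TABLE orc_table ADD IF NOT EXISTS PARTITION (zone=2, z=1) LOCATION '/hdfs/folder/containing/orc/file/zone=2/z=1';
MSCK REPAIR TABLE orc_table;   -- scans the table location and registers all missing partitions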
query orc file by hive
Hi, How can I query an orc file (*.orc) with Hive? This ORC file is created by other applications, like Spark or MapReduce. Thanks, Patcharee
[jira] [Issue Comment Deleted] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patcharee updated SPARK-11087: -- Comment: was deleted (was: I found a scenario where the problem exists) > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int > heightfloat > u float > v float > w float > phfloat > phb float > t float > p float > pbfloat > qvaporfloat > qgraupfloat > qnice float > qnrainfloat > tke_pbl float > el_pblfloat > qcloudfloat > > # Partition Information > # col_namedata_type comment > > zone int > z int > year int > month int > > # Detailed Table Information > Database: default > Owner:patcharee > CreateTime: Thu Jul 09 16:46:54 CEST 2015 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D > > Table Type: EXTERNAL_TABLE > Table Parameters: > EXTERNALTRUE > comment this table is imported from rwf_data/*/wrf/* > last_modified_bypatcharee > last_modified_time 1439806692 > orc.compressZLIB > transient_lastDdlTime 1439806692 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde > InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat > > Compressed: No > Num Buckets: -1 > Bucket Columns: [] >
[jira] [Issue Comment Deleted] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patcharee updated SPARK-11087: -- Comment: was deleted (was: Hi [~zzhan], the problem actually happens when I generates orc file by "saveAsTable()" method (because I need my orc file to be accessible through hive). See below>> hive> create external table peopletable(name string, address string, phone string) partitioned by(age int) stored as orc location '/user/patcharee/peopletable'; On spark shell local mode>> 2501 sql("set hive.exec.dynamic.partition.mode=nonstrict") 2502 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2503 case class Person(name: String, age: Int, address: String, phone: String) 2504 val records = (1 to 100).map { i => Person(s"name_$i", i, s"address_$i", s"phone_$i" ) } 2505 sc.parallelize(records).toDF().write.format("orc").mode("Append").partitionBy("age").saveAsTable("peopletable") 2506 val people = sqlContext.read.format("orc").load("peopletable") 2507 people.registerTempTable("people") 2508 sqlContext.sql("SELECT * FROM people WHERE age = 20 and name = 'name_20'").count It is true that if the orc file is generated by "save()" method, the predicate will be generated. But it is not for the case "saveAsTable()" method. [~zzhan] can you please suggest how to fix this?) > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int > heightfloat > u float > v float > w float > phfloat > phb float > t float > p float > pbfloat > qvaporfloat > qgraupfloat > qnice float > qnrainfloat > tke_pbl float > el_pblfloat > qcloudfloat > > # Partition Information > # col_namedata_type comment > > zone int > z int > year int > month int > > # Detailed Table Information > Database:
[jira] [Reopened] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patcharee reopened SPARK-11087: --- I found a scenario where the problem exists
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993398#comment-14993398 ] patcharee commented on SPARK-11087: --- Hi [~zzhan], the problem actually happens when I generates orc file by "saveAsTable()" method (because I need my orc file to be accessible through hive). See below>> hive> create external table peopletable(name string, address string, phone string) partitioned by(age int) stored as orc location '/user/patcharee/peopletable'; On spark shell local mode>> 2501 sql("set hive.exec.dynamic.partition.mode=nonstrict") 2502 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2503 case class Person(name: String, age: Int, address: String, phone: String) 2504 val records = (1 to 100).map { i => Person(s"name_$i", i, s"address_$i", s"phone_$i" ) } 2505 sc.parallelize(records).toDF().write.format("orc").mode("Append").partitionBy("age").saveAsTable("peopletable") 2506 val people = sqlContext.read.format("orc").load("peopletable") 2507 people.registerTempTable("people") 2508 sqlContext.sql("SELECT * FROM people WHERE age = 20 and name = 'name_20'").count It is true that if the orc file is generated by "save()" method, the predicate will be generated. But it is not for the case "saveAsTable()" method. [~zzhan] can you please suggest how to fix this? > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int > heightfloat > u float > v float > w float > phfloat > phb float > t float > p float > pbfloat > qvaporfloat > qgraupfloat > qnice float > qnrainfloat > tke_pbl float > el_pblfloat > qcloudfloat > > # Partition Information > # col_namedata_type comment > > zone int > z int > year int > month int > > # Detailed Table Information > Database:
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993771#comment-14993771 ] patcharee commented on SPARK-11087: --- Hi, I found a scenario where the predicate does not work again. [~zzhan] Can you please have a look? First create a hive table >> hive> create table people(name string, address string, phone string) partitioned by(age int) stored as orc; Then use spark shell local mode to insert data and then query >> 120 import org.apache.spark.sql.Row 121 import org.apache.spark.{SparkConf, SparkContext} 122 import org.apache.spark.sql.types._ 123 import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,FloatType} 124 sqlContext.setConf("hive.exec.dynamic.partition.mode","nonstrict") 125 val records = (1 to 10).map( i => Row(s"name_$i", s"address_$i", s"phone_$i", i )) 126 val schemaString = "name address phone age" 127 val schema = StructType(schemaString.split(" ").map(fieldName => if (fieldName.equals("age")) StructField(fieldName, IntegerType, true) else StructField(fieldName, StringType, true))) 128 val x = sc.parallelize(records) 129 val rDF = sqlContext.createDataFrame(x, schema) 130 rDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("age").saveAsTable("people") 131 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 132 val people = sqlContext.read.format("orc").load("/user/hive/warehouse/people") 133 people.registerTempTable("people") 134 sqlContext.sql("SELECT * FROM people WHERE age = 3 and name = 'name_3'").count 15/11/06 15:40:36 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0:0+453 15/11/06 15:40:36 DEBUG OrcInputFormat: No ORC pushdown predicate 15/11/06 15:40:36 DEBUG OrcInputFormat: No ORC pushdown predicate 15/11/06 15:40:36 INFO OrcRawRecordMerger: min key = null, max key = null 15/11/06 15:40:36 INFO OrcRawRecordMerger: min key = null, max key = null 15/11/06 15:40:36 INFO ReaderImpl: Reading ORC rows from hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0 with {include: [true, true, false, false], offset: 0, length: 9223372036854775807} 15/11/06 15:40:36 INFO ReaderImpl: Reading ORC rows from hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0 with {include: [true, true, false, false], offset: 0, length: 9223372036854775807} 15/11/06 15:40:36 INFO RecordReaderFactory: Schema is not specified on read. Using file schema. 15/11/06 15:40:36 INFO RecordReaderFactory: Schema is not specified on read. Using file schema. 15/11/06 15:40:36 DEBUG RecordReaderImpl: chunks = [range start: 111 end: 126] 15/11/06 15:40:36 DEBUG RecordReaderImpl: chunks = [range start: 111 end: 126] 15/11/06 15:40:36 DEBUG RecordReaderImpl: merge = [data range [111, 126), size: 15 type: array-backed] 15/11/06 15:40:36 DEBUG RecordReaderImpl: merge = [data range [111, 126), size: 15 type: array-backed] 15/11/06 15:40:36 INFO GeneratePredicate: Code generated in 5.063287 ms > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). 
I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int
[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993771#comment-14993771 ] patcharee edited comment on SPARK-11087 at 11/6/15 2:56 PM: Hi, I found a scenario where the predicate does not work again. [~zzhan] Can you please have a look? First create a hive table >> hive> create table people(name string, address string, phone string) partitioned by(age int) stored as orc; Then use spark shell local mode to insert data and then query >> 120 import org.apache.spark.sql.Row 121 import org.apache.spark.{SparkConf, SparkContext} 122 import org.apache.spark.sql.types._ 123 import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,FloatType} 124 sqlContext.setConf("hive.exec.dynamic.partition.mode","nonstrict") 125 val records = (1 to 10).map( i => Row(s"name_$i", s"address_$i", s"phone_$i", i )) 126 val schemaString = "name address phone age" 127 val schema = StructType(schemaString.split(" ").map(fieldName => if (fieldName.equals("age")) StructField(fieldName, IntegerType, true) else StructField(fieldName, StringType, true))) 128 val x = sc.parallelize(records) 129 val rDF = sqlContext.createDataFrame(x, schema) 130 rDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("age").saveAsTable("people") 131 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 132 val people = sqlContext.read.format("orc").load("/user/hive/warehouse/people") 133 people.registerTempTable("people") 134 sqlContext.sql("SELECT * FROM people WHERE age = 3 and name = 'name_3'").count Below is a part of the log message from the last command >> 15/11/06 15:40:36 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0:0+453 15/11/06 15:40:36 DEBUG OrcInputFormat: No ORC pushdown predicate 15/11/06 15:40:36 DEBUG OrcInputFormat: No ORC pushdown predicate 15/11/06 15:40:36 INFO OrcRawRecordMerger: min key = null, max key = null 15/11/06 15:40:36 INFO OrcRawRecordMerger: min key = null, max key = null 15/11/06 15:40:36 INFO ReaderImpl: Reading ORC rows from hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0 with {include: [true, true, false, false], offset: 0, length: 9223372036854775807} 15/11/06 15:40:36 INFO ReaderImpl: Reading ORC rows from hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0 with {include: [true, true, false, false], offset: 0, length: 9223372036854775807} 15/11/06 15:40:36 INFO RecordReaderFactory: Schema is not specified on read. Using file schema. 15/11/06 15:40:36 INFO RecordReaderFactory: Schema is not specified on read. Using file schema. 15/11/06 15:40:36 DEBUG RecordReaderImpl: chunks = [range start: 111 end: 126] 15/11/06 15:40:36 DEBUG RecordReaderImpl: chunks = [range start: 111 end: 126] 15/11/06 15:40:36 DEBUG RecordReaderImpl: merge = [data range [111, 126), size: 15 type: array-backed] 15/11/06 15:40:36 DEBUG RecordReaderImpl: merge = [data range [111, 126), size: 15 type: array-backed] 15/11/06 15:40:36 INFO GeneratePredicate: Code generated in 5.063287 ms was (Author: patcharee): Hi, I found a scenario where the predicate does not work again. [~zzhan] Can you please have a look? 
First create a hive table >> hive> create table people(name string, address string, phone string) partitioned by(age int) stored as orc; Then use spark shell local mode to insert data and then query >> 120 import org.apache.spark.sql.Row 121 import org.apache.spark.{SparkConf, SparkContext} 122 import org.apache.spark.sql.types._ 123 import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,FloatType} 124 sqlContext.setConf("hive.exec.dynamic.partition.mode","nonstrict") 125 val records = (1 to 10).map( i => Row(s"name_$i", s"address_$i", s"phone_$i", i )) 126 val schemaString = "name address phone age" 127 val schema = StructType(schemaString.split(" ").map(fieldName => if (fieldName.equals("age")) StructField(fieldName, IntegerType, true) else StructField(fieldName, StringType, true))) 128 val x = sc.parallelize(records) 129 val rDF = sqlContext.createDataFrame(x, schema) 130 rDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("age").saveAsTable("people") 131 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 132 val people = sqlContext.read.format("orc").load("/user/hive/warehouse/people") 133 people.regist
How to run parallel on each DataFrame group
Hi, I need suggestions on my code. I would like to split a DataFrame (rowDF) by a column (depth) into groups, then sort each group, repartition it and save the output of each group into one file. See the code below: val rowDF = sqlContext.createDataFrame(rowRDD, schema).cache() for (i <- 0 to 16) { val filterDF = rowDF.filter("depth="+i) val finalDF = filterDF.sort("xy").coalesce(1) finalDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("depth").saveAsTable(args(3)) } The problem is that each filtered group is handled by the executors one group at a time. How can I change the code so that the groups run in parallel? I looked at groupBy, but it seems to be only for aggregation. Thanks, Patcharee
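One pattern that may help (a sketch under the assumption that the bottleneck is the sequential driver loop rather than cluster capacity): submit the per-depth jobs from concurrent driver threads, e.g. with Scala futures, since Spark's scheduler accepts jobs from multiple threads. Note that the concurrent Append writes into the same table are not coordinated here, so this would need testing before being relied on:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val jobs = (0 to 16).map { i =>
  Future {
    rowDF.filter("depth=" + i).sort("xy").coalesce(1)
      .write.format("org.apache.spark.sql.hive.orc.DefaultSource")
      .mode(org.apache.spark.sql.SaveMode.Append)
      .partitionBy("depth").saveAsTable(args(3))
  }
}
jobs.foreach(Await.result(_, Duration.Inf))   // wait for all groups to finish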
execute native system commands in Spark
Hi, Is it possible to execute native system commands (in parallel) from Spark, like scala.sys.process? Best, Patcharee
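Yes: two common patterns are RDD.pipe, which streams each partition through an external program, and calling scala.sys.process inside a closure so the command runs on the executors in parallel. A minimal sketch (the commands are illustrative and must exist on every worker node):

import scala.sys.process._

// run a command once per partition, on the executor that owns it
val hosts = sc.parallelize(1 to 4, 4).mapPartitions { part =>
  val name = Seq("hostname").!!.trim
  part.map(i => (i, name))
}
hosts.collect().foreach(println)

// or stream the records of each partition through an external program
val piped = sc.parallelize(Seq("a", "b", "c")).pipe("cat")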
Min-Max Index vs Bloom filter
Hi, For the ORC format, in which scenarios is a bloom filter better than the min-max index? Best, Patcharee
[jira] [Closed] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patcharee closed SPARK-11087. - Resolution: Not A Problem The predicate is indeed generated and can be found in the executor log
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970786#comment-14970786 ] patcharee commented on SPARK-11087: --- [~zzhan] I found the predicate generated in the executor log for the case using dataframe (not hiveContext.sql). Sorry for my mistake, and thanks for your help!
the column names removed after insert select
Hi, I inserted into a table from a select (insert into table newtable select date, hh, x, y from oldtable). After the insert, the column names of the table have been replaced, as shown by hive --orcfiledump - Type: struct<_col0:int,_col1:int,_col2:int,_col3:int> while it is supposed to be - Type: struct<date:int,hh:int,x:int,y:int> Any ideas how this happened and how I can fix it? Please advise. BR, Patcharee
the number of files after merging
Hi, I am using the alter command below to merge the partitioned ORC files of one partition: alter table X partition(zone=1,z=1,year=2009,month=1) CONCATENATE; - How can I control the number of files after merging? I would like to get only one file per partition. - Is it possible to concatenate the whole table, not one partition at a time? Thanks, Patcharee
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967661#comment-14967661 ] patcharee commented on SPARK-11087: --- Hi [~zzhan] Which versions of Hive and of the ORC file format are you using? Can I get your Hive configuration file?
[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960296#comment-14960296 ] patcharee edited comment on SPARK-11087 at 10/19/15 3:34 AM: - [~zzhan] Below is my test. Please check. I tried to change "hive.exec.orc.split.strategy" also, but none of them given " OrcInputFormat [INFO] ORC pushdown predicate" as same as your result 2508 case class Contact(name: String, phone: String) 2509 case class Person(name: String, age: Int, contacts: Seq[Contact]) 2510 val records = (1 to 100).map { i => Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") } ) 2511 } 2512 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2513 sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned") 2514 val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned") 2515 peoplePartitioned.registerTempTable("peoplePartitioned") scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL") 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count 15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:10:53 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:10:53 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc 15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc 15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0 15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0 15/10/16 09:10:53 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1 15/10/16 09:10:53 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 INFO PerfLogger: res5: Long = 1 scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI") 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG 
VariableSubstitution: Substitution is on: BI scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count 15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:11:19 INFO PerfLogger: 15/10/16 09:11:19 INFO PerfLogger: 15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:11:19 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:11:19 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peo
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960296#comment-14960296 ] patcharee commented on SPARK-11087: --- [~zhazhan] Below is my test. Please check. I tried to change "hive.exec.orc.split.strategy" also, but none of them given " OrcInputFormat [INFO] ORC pushdown predicate" as same as your result 2508 case class Contact(name: String, phone: String) 2509 case class Person(name: String, age: Int, contacts: Seq[Contact]) 2510 val records = (1 to 100).map { i => Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") } ) 2511 } 2512 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2513 sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned") 2514 val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned") 2515 peoplePartitioned.registerTempTable("peoplePartitioned") scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL") 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count 15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:10:53 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:10:53 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc 15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc 15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0 15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0 15/10/16 09:10:53 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1 15/10/16 09:10:53 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 INFO PerfLogger: res5: Long = 1 scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI") 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG VariableSubstitution: 
Substitution is on: BI scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count 15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:11:19 INFO PerfLogger: 15/10/16 09:11:19 INFO PerfLogger: 15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:11:19 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:11:19 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7
Re: sql query orc slow
Hi Zhan Zhang, Is my problem (which is ORC predicate is not generated from WHERE clause even though spark.sql.orc.filterPushdown=true) can be related to some factors below ? - orc file version (File Version: 0.12 with HIVE_8732) - hive version (using Hive 1.2.1.2.3.0.0-2557) - orc table is not sorted / indexed - the split strategy hive.exec.orc.split.strategy BR, Patcharee On 10/09/2015 08:01 PM, Zhan Zhang wrote: That is weird. Unfortunately, there is no debug info available on this part. Can you please open a JIRA to add some debug information on the driver side? Thanks. Zhan Zhang On Oct 9, 2015, at 10:22 AM, patcharee <patcharee.thong...@uni.no <mailto:patcharee.thong...@uni.no>> wrote: I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"). But from the log No ORC pushdown predicate for my query with WHERE clause. 15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate I did not understand what wrong with this. BR, Patcharee On 09. okt. 2015 19:10, Zhan Zhang wrote: In your case, you manually set an AND pushdown, and the predicate is right based on your setting, : leaf-0 = (EQUALS x 320) The right way is to enable the predicate pushdown as follows. sqlContext.setConf("spark.sql.orc.filterPushdown", "true”) Thanks. Zhan Zhang On Oct 9, 2015, at 9:58 AM, patcharee <patcharee.thong...@uni.no> wrote: Hi Zhan Zhang Actually my query has WHERE clause "select date, month, year, hh, (u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 and z <= 8", column "x", "y" is not partition column, the others are partition columns. I expected the system will use predicate pushdown. I turned on the debug and found pushdown predicate was not generated ("DEBUG OrcInputFormat: No ORC pushdown predicate") Then I tried to set the search argument explicitly (on the column "x" which is not partition column) val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build() hiveContext.setConf("hive.io.file.readcolumn.names", "x") hiveContext.setConf("sarg.pushdown", xs.toKryo()) this time in the log pushdown predicate was generated but results was wrong (no results at all) 15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320) expr = leaf-0 Any ideas What wrong with this? Why the ORC pushdown predicate is not applied by the system? BR, Patcharee On 09. okt. 2015 18:31, Zhan Zhang wrote: Hi Patcharee, >From the query, it looks like only the column pruning will be applied. Partition pruning and predicate pushdown does not have effect. Do you see big IO difference between two methods? The potential reason of the speed difference I can think of may be the different versions of OrcInputFormat. The hive path may use NewOrcInputFormat, but the spark path use OrcInputFormat. Thanks. Zhan Zhang On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no> wrote: Yes, the predicate pushdown is enabled, but still take longer time than the first method BR, Patcharee On 08. okt. 2015 18:43, Zhan Zhang wrote: Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I am using spark sql 1.5 to query a hive table stored as partitioned orc file. We have the total files is about 6000 files and each file size is about 245MB. What is the difference between these two query methods below: 1. 
Using query on hive table directly hiveContext.sql("select col1, col2 from table1") 2. Reading from orc file, register temp table and query from the temp table val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1") c.registerTempTable("regTable") hiveContext.sql("select col1, col2 from regTable") When the number of files is large (query all from the total 6000 files) , the second case is much slower then the first one. Any ideas why? BR, - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org <mailto:user-unsubscr...@spark.apache.org> For additional commands, e-mail: <mailto:user-h...@spark.apache.org>user-h...@spark.apache.org
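For reference, a minimal sketch of the two read paths compared in this thread, with ORC predicate pushdown enabled before either query runs (table name and path are the ones from the thread; the WHERE clause is only illustrative):

// enable ORC predicate pushdown before running the queries
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

// 1. query the Hive metastore table directly
val viaHiveTable = hiveContext.sql("select col1, col2 from table1 where col1 = 320")

// 2. load the ORC files, register a temp table, then query it
val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
c.registerTempTable("regTable")
val viaOrcFiles = hiveContext.sql("select col1, col2 from regTable where col1 = 320")

With debug logging on, a successful pushdown shows up as "ORC pushdown predicate: ..." from OrcInputFormat, while "No ORC pushdown predicate" means the filter never reached the reader, which is the symptom reported in this thread.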
orc table with sorted field
Hi, How can I create a partitioned orc table with sorted field(s)? I tried to use the SORTED BY keyword, but got a parse exception> CREATE TABLE peoplesort (name string, age int) partition by (bddate int) SORTED BY (age) stored as orc Is it possible to have some sorted columns? From the hive DDL page, it seems only bucketed tables can be sorted. Any suggestions please. BR, Patcharee
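For what it's worth, Hive's DDL only accepts SORTED BY together with CLUSTERED BY (bucketing), so a variant of the statement above that should at least parse is the following sketch (the bucket count is arbitrary, and whether a bucketed-sorted ORC table is the right fit is a separate question):

CREATE TABLE peoplesort (name string, age int)
PARTITIONED BY (bddate int)
CLUSTERED BY (age) SORTED BY (age ASC) INTO 4 BUCKETS
STORED AS ORC;

When inserting from Hive, hive.enforce.bucketing and hive.enforce.sorting typically also need to be set to true for the written files to actually come out bucketed and sorted.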
Re: sql query orc slow
Hi Zhan Zhang, Here is the issue https://issues.apache.org/jira/browse/SPARK-11087 BR, Patcharee On 10/13/2015 06:47 PM, Zhan Zhang wrote: Hi Patcharee, I am not sure which side is wrong, driver or executor. If it is executor side, the reason you mentioned may be possible. But if the driver side didn’t set the predicate at all, then somewhere else is broken. Can you please file a JIRA with a simple reproduce step, and let me know the JIRA number? Thanks. Zhan Zhang On Oct 13, 2015, at 1:01 AM, Patcharee Thongtra <patcharee.thong...@uni.no <mailto:patcharee.thong...@uni.no>> wrote: Hi Zhan Zhang, Is my problem (which is ORC predicate is not generated from WHERE clause even though spark.sql.orc.filterPushdown=true) can be related to some factors below ? - orc file version (File Version: 0.12 with HIVE_8732) - hive version (using Hive 1.2.1.2.3.0.0-2557) - orc table is not sorted / indexed - the split strategy hive.exec.orc.split.strategy BR, Patcharee On 10/09/2015 08:01 PM, Zhan Zhang wrote: That is weird. Unfortunately, there is no debug info available on this part. Can you please open a JIRA to add some debug information on the driver side? Thanks. Zhan Zhang On Oct 9, 2015, at 10:22 AM, patcharee <patcharee.thong...@uni.no> wrote: I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"). But from the log No ORC pushdown predicate for my query with WHERE clause. 15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate I did not understand what wrong with this. BR, Patcharee On 09. okt. 2015 19:10, Zhan Zhang wrote: In your case, you manually set an AND pushdown, and the predicate is right based on your setting, : leaf-0 = (EQUALS x 320) The right way is to enable the predicate pushdown as follows. sqlContext.setConf("spark.sql.orc.filterPushdown", "true”) Thanks. Zhan Zhang On Oct 9, 2015, at 9:58 AM, patcharee <patcharee.thong...@uni.no> wrote: Hi Zhan Zhang Actually my query has WHERE clause "select date, month, year, hh, (u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 and z <= 8", column "x", "y" is not partition column, the others are partition columns. I expected the system will use predicate pushdown. I turned on the debug and found pushdown predicate was not generated ("DEBUG OrcInputFormat: No ORC pushdown predicate") Then I tried to set the search argument explicitly (on the column "x" which is not partition column) val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build() hiveContext.setConf("hive.io.file.readcolumn.names", "x") hiveContext.setConf("sarg.pushdown", xs.toKryo()) this time in the log pushdown predicate was generated but results was wrong (no results at all) 15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320) expr = leaf-0 Any ideas What wrong with this? Why the ORC pushdown predicate is not applied by the system? BR, Patcharee On 09. okt. 2015 18:31, Zhan Zhang wrote: Hi Patcharee, >From the query, it looks like only the column pruning will be applied. Partition pruning and predicate pushdown does not have effect. Do you see big IO difference between two methods? The potential reason of the speed difference I can think of may be the different versions of OrcInputFormat. The hive path may use NewOrcInputFormat, but the spark path use OrcInputFormat. Thanks. 
Zhan Zhang On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no> wrote: Yes, the predicate pushdown is enabled, but still take longer time than the first method BR, Patcharee On 08. okt. 2015 18:43, Zhan Zhang wrote: Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I am using spark sql 1.5 to query a hive table stored as partitioned orc file. We have the total files is about 6000 files and each file size is about 245MB. What is the difference between these two query methods below: 1. Using query on hive table directly hiveContext.sql("select col1, col2 from table1") 2. Reading from orc file, register temp table and query from the temp table val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1") c.registerTempTable("regTable") hiveContext.sql("select col1, col2 from regTable") When the number of files is large (query all from the total 6000 files) , the second case is much slower then the first one. Any ideas why? BR, - To unsubscribe, e-mail: <mailto:user-unsubscr...@spark.apache.org>user-unsubscr...@spark.apache.org For additional commands, e-mail: <mailto:user-h...@spark.apache.org>user-h...@spark.apache.org
[jira] [Created] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
patcharee created SPARK-11087: - Summary: spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate Key: SPARK-11087 URL: https://issues.apache.org/jira/browse/SPARK-11087 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Environment: orc file version 0.12 with HIVE_8732 hive version 1.2.1.2.3.0.0-2557 Reporter: patcharee Priority: Minor I have an external hive table stored as partitioned orc file (see the table schema below). I tried to query from the table with where clause> hiveContext.setConf("spark.sql.orc.filterPushdown", "true") hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")). But from the log file with debug logging level on, the ORC pushdown predicate was not generated. Unfortunately my table was not sorted when I inserted the data, but I expected the ORC pushdown predicate should be generated (because of the where clause) though Table schema hive> describe formatted 4D; OK # col_name data_type comment dateint hh int x int y int height float u float v float w float ph float phb float t float p float pb float qvapor float qgraup float qnice float qnrain float tke_pbl float el_pbl float qcloud float # Partition Information # col_name data_type comment zoneint z int yearint month int # Detailed Table Information Database: default Owner: patcharee CreateTime: Thu Jul 09 16:46:54 CEST 2015 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D Table Type: EXTERNAL_TABLE Table Parameters: EXTERNALTRUE comment this table is imported from rwf_data/*/wrf/* last_modified_bypatcharee last_modified_time 1439806692 orc.compressZLIB transient_lastDdlTime 1439806692 # Storage Information SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde InputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat Compressed: No Num Buckets:-1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: serialization.format1 Time taken: 0.388 seconds, Fetched: 58 row(s) Data was inserted into this table by another spark job> df.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("4D") -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
Re: sql query orc slow
Yes, the predicate pushdown is enabled, but still take longer time than the first method BR, Patcharee On 08. okt. 2015 18:43, Zhan Zhang wrote: Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I am using spark sql 1.5 to query a hive table stored as partitioned orc file. We have the total files is about 6000 files and each file size is about 245MB. What is the difference between these two query methods below: 1. Using query on hive table directly hiveContext.sql("select col1, col2 from table1") 2. Reading from orc file, register temp table and query from the temp table val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1") c.registerTempTable("regTable") hiveContext.sql("select col1, col2 from regTable") When the number of files is large (query all from the total 6000 files) , the second case is much slower then the first one. Any ideas why? BR, - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: sql query orc slow
I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"). But from the log No ORC pushdown predicate for my query with WHERE clause. 15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate I did not understand what wrong with this. BR, Patcharee On 09. okt. 2015 19:10, Zhan Zhang wrote: In your case, you manually set an AND pushdown, and the predicate is right based on your setting, : leaf-0 = (EQUALS x 320) The right way is to enable the predicate pushdown as follows. sqlContext.setConf("spark.sql.orc.filterPushdown", "true”) Thanks. Zhan Zhang On Oct 9, 2015, at 9:58 AM, patcharee <patcharee.thong...@uni.no <mailto:patcharee.thong...@uni.no>> wrote: Hi Zhan Zhang Actually my query has WHERE clause "select date, month, year, hh, (u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 and z <= 8", column "x", "y" is not partition column, the others are partition columns. I expected the system will use predicate pushdown. I turned on the debug and found pushdown predicate was not generated ("DEBUG OrcInputFormat: No ORC pushdown predicate") Then I tried to set the search argument explicitly (on the column "x" which is not partition column) val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build() hiveContext.setConf("hive.io.file.readcolumn.names", "x") hiveContext.setConf("sarg.pushdown", xs.toKryo()) this time in the log pushdown predicate was generated but results was wrong (no results at all) 15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320) expr = leaf-0 Any ideas What wrong with this? Why the ORC pushdown predicate is not applied by the system? BR, Patcharee On 09. okt. 2015 18:31, Zhan Zhang wrote: Hi Patcharee, >From the query, it looks like only the column pruning will be applied. Partition pruning and predicate pushdown does not have effect. Do you see big IO difference between two methods? The potential reason of the speed difference I can think of may be the different versions of OrcInputFormat. The hive path may use NewOrcInputFormat, but the spark path use OrcInputFormat. Thanks. Zhan Zhang On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no <mailto:patcharee.thong...@uni.no>> wrote: Yes, the predicate pushdown is enabled, but still take longer time than the first method BR, Patcharee On 08. okt. 2015 18:43, Zhan Zhang wrote: Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no <mailto:patcharee.thong...@uni.no>> wrote: Hi, I am using spark sql 1.5 to query a hive table stored as partitioned orc file. We have the total files is about 6000 files and each file size is about 245MB. What is the difference between these two query methods below: 1. Using query on hive table directly hiveContext.sql("select col1, col2 from table1") 2. Reading from orc file, register temp table and query from the temp table val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1") c.registerTempTable("regTable") hiveContext.sql("select col1, col2 from regTable") When the number of files is large (query all from the total 6000 files) , the second case is much slower then the first one. Any ideas why? BR, - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org <mailto:user-unsubscr...@spark.apache.org> For additional commands, e-mail: user-h...@spark.apache.org <mailto:user-h...@spark.apache.org>
Re: sql query orc slow
Hi Zhan Zhang Actually my query has WHERE clause "select date, month, year, hh, (u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 and z <= 8", column "x", "y" is not partition column, the others are partition columns. I expected the system will use predicate pushdown. I turned on the debug and found pushdown predicate was not generated ("DEBUG OrcInputFormat: No ORC pushdown predicate") Then I tried to set the search argument explicitly (on the column "x" which is not partition column) val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build() hiveContext.setConf("hive.io.file.readcolumn.names", "x") hiveContext.setConf("sarg.pushdown", xs.toKryo()) this time in the log pushdown predicate was generated but results was wrong (no results at all) 15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320) expr = leaf-0 Any ideas What wrong with this? Why the ORC pushdown predicate is not applied by the system? BR, Patcharee On 09. okt. 2015 18:31, Zhan Zhang wrote: Hi Patcharee, >From the query, it looks like only the column pruning will be applied. Partition pruning and predicate pushdown does not have effect. Do you see big IO difference between two methods? The potential reason of the speed difference I can think of may be the different versions of OrcInputFormat. The hive path may use NewOrcInputFormat, but the spark path use OrcInputFormat. Thanks. Zhan Zhang On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no> wrote: Yes, the predicate pushdown is enabled, but still take longer time than the first method BR, Patcharee On 08. okt. 2015 18:43, Zhan Zhang wrote: Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I am using spark sql 1.5 to query a hive table stored as partitioned orc file. We have the total files is about 6000 files and each file size is about 245MB. What is the difference between these two query methods below: 1. Using query on hive table directly hiveContext.sql("select col1, col2 from table1") 2. Reading from orc file, register temp table and query from the temp table val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1") c.registerTempTable("regTable") hiveContext.sql("select col1, col2 from regTable") When the number of files is large (query all from the total 6000 files) , the second case is much slower then the first one. Any ideas why? BR, - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
hiveContext sql number of tasks
Hi, I run a sql query on about 10,000 partitioned orc files. Because of the partition schema the files cannot be merged any further (to reduce the total number). From this command hiveContext.sql(sqlText), 10K tasks were created, one per file. Is it possible to use fewer tasks? How can I force spark sql to use fewer tasks? BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
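One partial workaround, sketched below: the scan itself still creates roughly one task per ORC split, but coalescing the DataFrame right after the scan lets every later stage run on far fewer partitions (the target of 200 is arbitrary, not a recommendation):

val df = hiveContext.sql(sqlText).coalesce(200)  // later stages see 200 partitions instead of ~10K

The number of scan tasks themselves is determined by the input splits, so only merging the files (or combining splits) would reduce those.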
Idle time between jobs
Hi, I am using Spark 1.5. I have a spark application which is divided into several jobs. I noticed from the Event Timeline - Spark History UI that there was idle time between jobs. See below: job 1 was submitted at 11:20:49 and finished at 11:20:52, but job 2 was submitted 16s later (at 11:21:08). I wonder what is going on during those 16s? Any suggestions? Job Id Description Submitted Duration 2 saveAsTextFile at GenerateHistogram.scala:143 2015/09/16 11:21:08 0.7 s 1 collect at GenerateHistogram.scala:132 2015/09/16 11:20:49 2 s 0 count at GenerateHistogram.scala:129 2015/09/16 11:20:41 9 s Below is the log 15/09/16 11:20:52 INFO DAGScheduler: Job 1 finished: collect at GenerateHistogram.scala:132, took 2.221756 s 15/09/16 11:21:08 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/09/16 11:21:08 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/09/16 11:21:08 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/09/16 11:21:08 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/09/16 11:21:08 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/09/16 11:21:08 INFO FileOutputCommitter: File Output Committer Algorithm version is 1 15/09/16 11:21:08 INFO SparkContext: Starting job: saveAsTextFile at GenerateHistogram.scala:143 15/09/16 11:21:08 INFO DAGScheduler: Got job 2 (saveAsTextFile at GenerateHistogram.scala:143) with 1 output partitions 15/09/16 11:21:08 INFO DAGScheduler: Final stage: ResultStage 2(saveAsTextFile at GenerateHistogram.scala:143) BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark performance - executor computing time
Hi, I was running a job (on Spark 1.5 + Yarn + java 8). In a stage that does a lookup (org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:873)) there was one executor whose computing time was more than 6 times the median. This executor had almost the same shuffle read size and similarly low GC time as the others. What can impact the executor computing time? Any suggestions on what parameters I should monitor/configure? BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
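When a single straggler dominates a stage even though its shuffle read and GC look normal, speculative execution is one setting worth experimenting with; a sketch of the relevant knobs (values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")           // re-launch suspiciously slow tasks on another executor
  .set("spark.speculation.multiplier", "2")   // a task counts as slow when it runs longer than 2x the median
  .set("spark.speculation.quantile", "0.75")  // only start checking once 75% of the tasks have finished

Beyond that, per-task time in a lookup stage can also be skewed by key distribution, HDFS locality, or a slow node, so comparing the same stage across runs and hosts in the UI helps narrow it down.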
spark 1.5 sort slow
Hi, I found that sorting in spark 1.5 is very slow compared to spark 1.4. Below is my code snippet val sqlRDD = sql("select date, u, v, z from fino3_hr3 where zone == 2 and z >= 2 and z <= order by date, z") println("sqlRDD " + sqlRDD.count()) The fino3_hr3 table (in the sql command) is a hive table in orc format, partitioned by zone and z. Spark 1.5 takes 4.5 mins to execute this sql, while spark 1.4 takes 1.5 mins. I noticed that, unlike spark 1.4, when spark 1.5 sorted the data was shuffled into only a few tasks instead of being divided across all tasks. Do I need to set any configuration explicitly? Any suggestions? BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
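One configuration worth checking for the skewed shuffle described above is spark.sql.shuffle.partitions, which controls how many partitions Spark SQL uses for the exchange that feeds the sort (default 200); a sketch reusing the thread's query (the value 400 and the upper bound on z are illustrative):

hiveContext.setConf("spark.sql.shuffle.partitions", "400")   // illustrative; default is 200
val sqlRDD = sql("select date, u, v, z from fino3_hr3 where zone == 2 and z >= 2 and z <= 8 order by date, z")
println("sqlRDD " + sqlRDD.count())

This does not by itself explain the 1.4-to-1.5 regression, but it determines how many tasks the sort can spread the data over.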
embedded pig in the cluster
Hi, I am using pig 0.14. How can I run embedded pig (with PigServer) on the cluster so that the classpath and default configuration for pig, hadoop, yarn and hdfs are all picked up? With the simple approach java -cp myjar.jar mymainclass, it will definitely throw ClassNotFoundException or other exceptions. BR, Patcharee
Re: character '' not supported here
Hi, I created a hive table stored as orc file (partitioned and compressed by ZLIB) from Hive CLI, added data into this table by a Spark application. After adding I was able to query data and everything looked fine. Then I concatenated the table from Hive CLI. After that I am not able to query data, like select count(*) from Table, any more, just got error line 1:1 character '' not supported here, no matter Tez or MR engine. How can you solve the problem in your case? BR, Patcharee On 18. juli 2015 21:26, Nitin Pawar wrote: can you tell exactly what steps you did/? also did you try running the query with processing to MR instead of tez? not sure this issue with orc file formats .. i had once faced issues on alter table for orc backed tabled on adding a new column On Sun, Jul 19, 2015 at 12:05 AM, pth001 patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote: Hi, The query result 11236119012.64043-5.970886 8.5592070.0 0.00.0-19.686993 1308.804799848.00.006196644 0.00.0 301.274750.382470460.0NULL11 20081 11236122012.513598-6.3671713 7.3927946 0.00.00.0-22.300392 1441.054799848.0 0.00508465060.00.0 112.207870.304595230.0 NULL11 20081 5122503682415.1955.172235 4.9027147 -0.0244086120.023590.553 -38.96928-1130.0469 74660.542.5969802E-4 9.706164E-1123054.2680.0 0.241967370.0 NULL1120081 9121449412.25196412.081688 -9.594620.0 0.00.0-25.93576258.65625 99848.00.0021708217 0.00.01.2963213 1.15602660.0NULL11 20081 9121458412.3020987.752461 -12.1834630.0 0.00.0-24.983763 351.195399848.00.0023723599 0.00.0 1.41373750.992398860.0NULL11 20081 I stored table in orc format, partitioned and compressed by ZLIB. The problem happened just after I concatenate table. BR, Patcharee On 18/07/15 12:46, Nitin Pawar wrote: select * without where will work because it does not involve file processing I suspect the problem is with field delimiter so i asked for records so that we can see whats the data in each column are you using csv file with columns delimited by some char and it has numeric data in quotes ? On Sat, Jul 18, 2015 at 3:58 PM, patcharee patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote: This select * from table limit 5; works, but not others. So? Patcharee On 18. juli 2015 12:08, Nitin Pawar wrote: can you do select * from table limit 5; On Sat, Jul 18, 2015 at 3:35 PM, patcharee patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote: Hi, I am using hive 0.14 with Tez engine. Found a weird problem. Any suggestions? hive select count(*) from 4D; line 1:1 character '' not supported here line 1:2 character '' not supported here line 1:3 character '' not supported here line 1:4 character '' not supported here line 1:5 character '' not supported here line 1:6 character '' not supported here line 1:7 character '' not supported here line 1:8 character '' not supported here line 1:9 character '' not supported here ... ... 
line 1:131 character '' not supported here line 1:132 character '' not supported here line 1:133 character '' not supported here line 1:134 character '' not supported here line 1:135 character '' not supported here line 1:136 character '' not supported here line 1:137 character '' not supported here line 1:138 character '' not supported here line 1:139 character '' not supported here line 1:140 character '' not supported here line 1:141 character '' not supported here line 1:142 character '' not supported here line 1:143 character '' not supported here line 1:144 character '' not supported here line 1:145 character '' not supported here line 1:146 character '' not supported here BR, Patcharee -- Nitin Pawar -- Nitin Pawar -- Nitin Pawar
character '' not supported here
Hi, I am using hive 0.14 with Tez engine. Found a weird problem. Any suggestions? hive select count(*) from 4D; line 1:1 character '' not supported here line 1:2 character '' not supported here line 1:3 character '' not supported here line 1:4 character '' not supported here line 1:5 character '' not supported here line 1:6 character '' not supported here line 1:7 character '' not supported here line 1:8 character '' not supported here line 1:9 character '' not supported here ... ... line 1:131 character '' not supported here line 1:132 character '' not supported here line 1:133 character '' not supported here line 1:134 character '' not supported here line 1:135 character '' not supported here line 1:136 character '' not supported here line 1:137 character '' not supported here line 1:138 character '' not supported here line 1:139 character '' not supported here line 1:140 character '' not supported here line 1:141 character '' not supported here line 1:142 character '' not supported here line 1:143 character '' not supported here line 1:144 character '' not supported here line 1:145 character '' not supported here line 1:146 character '' not supported here BR, Patcharee
Re: character '' not supported here
This select * from table limit 5; works, but not others. So? Patcharee On 18. juli 2015 12:08, Nitin Pawar wrote: can you do select * from table limit 5; On Sat, Jul 18, 2015 at 3:35 PM, patcharee patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote: Hi, I am using hive 0.14 with Tez engine. Found a weird problem. Any suggestions? hive select count(*) from 4D; line 1:1 character '' not supported here line 1:2 character '' not supported here line 1:3 character '' not supported here line 1:4 character '' not supported here line 1:5 character '' not supported here line 1:6 character '' not supported here line 1:7 character '' not supported here line 1:8 character '' not supported here line 1:9 character '' not supported here ... ... line 1:131 character '' not supported here line 1:132 character '' not supported here line 1:133 character '' not supported here line 1:134 character '' not supported here line 1:135 character '' not supported here line 1:136 character '' not supported here line 1:137 character '' not supported here line 1:138 character '' not supported here line 1:139 character '' not supported here line 1:140 character '' not supported here line 1:141 character '' not supported here line 1:142 character '' not supported here line 1:143 character '' not supported here line 1:144 character '' not supported here line 1:145 character '' not supported here line 1:146 character '' not supported here BR, Patcharee -- Nitin Pawar
Re: fails to alter table concatenate
Actually it works on mr. So the problem is from tez. thanks! BR, Patcharee On 30. juni 2015 10:23, Nitin Pawar wrote: can you try doing same by changing the query engine from tez to mr1? not sure if its hive bug or tez bug On Tue, Jun 30, 2015 at 1:46 PM, patcharee patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote: Hi, I am using hive 0.14. It fails to alter table concatenate occasionally (see the exception below). It is strange that it fails from time to time not predictable. Is there any suggestion/clue? hive alter table 4dim partition(zone=2,z=15,year=2005,month=4) CONCATENATE; VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED File MergeFAILED -1 00 -1 0 0 VERTICES: 00/01 [--] 0% ELAPSED TIME: 1435651968.00 s Status: Failed Vertex failed, vertexName=File Merge, vertexId=vertex_1435307579867_0041_1_00, diagnostics=[Vertex vertex_1435307579867_0041_1_00 [File Merge] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: [hdfs://service-10-0.local:8020/apps/hive/warehouse/wrf_tables/4dim/zone=2/z=15/year=2005/month=4] initializer failed, vertex=vertex_1435307579867_0041_1_00 [File Merge], java.lang.NullPointerException at org.apache.hadoop.hive.ql.io http://org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:265) at org.apache.hadoop.hive.ql.io http://org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:452) at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:441) at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:295) at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:124) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ] DAG failed due to vertex failure. failedVertices:1 killedVertices:0 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.DDLTask BR, Patcharee -- Nitin Pawar
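For anyone hitting the same NullPointerException, a session-level workaround consistent with the observation above is to drop back to MR just for the concatenate (a sketch from the Hive CLI; the partition spec is the one from this thread):

hive> set hive.execution.engine=mr;
hive> alter table 4dim partition(zone=2,z=15,year=2005,month=4) CONCATENATE;
hive> set hive.execution.engine=tez;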
fails to alter table concatenate
Hi, I am using hive 0.14. It fails to alter table concatenate occasionally (see the exception below). It is strange that it fails from time to time not predictable. Is there any suggestion/clue? hive alter table 4dim partition(zone=2,z=15,year=2005,month=4) CONCATENATE; VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED File MergeFAILED -1 00 -1 0 0 VERTICES: 00/01 [--] 0%ELAPSED TIME: 1435651968.00 s Status: Failed Vertex failed, vertexName=File Merge, vertexId=vertex_1435307579867_0041_1_00, diagnostics=[Vertex vertex_1435307579867_0041_1_00 [File Merge] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: [hdfs://service-10-0.local:8020/apps/hive/warehouse/wrf_tables/4dim/zone=2/z=15/year=2005/month=4] initializer failed, vertex=vertex_1435307579867_0041_1_00 [File Merge], java.lang.NullPointerException at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:265) at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:452) at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:441) at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:295) at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:124) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ] DAG failed due to vertex failure. failedVertices:1 killedVertices:0 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.DDLTask BR, Patcharee
Re: Kryo serialization of classes in additional jars
Hi, I am having this problem on spark 1.4. Do you have any ideas how to solve it? I tried to use spark.executor.extraClassPath, but it did not help. BR, Patcharee On 04. mai 2015 23:47, Imran Rashid wrote: Oh, this seems like a real pain. You should file a jira, I didn't see an open issue -- if nothing else just to document the issue. As you've noted, the problem is that the serializer is created immediately in the executors, right when the SparkEnv is created, but the other jars aren't downloaded until later. I think you could workaround with some combination of pushing the jars to the cluster manually, and then using spark.executor.extraClassPath On Wed, Apr 29, 2015 at 6:42 PM, Akshat Aranya aara...@gmail.com mailto:aara...@gmail.com wrote: Hi, Is it possible to register kryo serialization for classes contained in jars that are added with spark.jars? In my experiment it doesn't seem to work, likely because the class registration happens before the jar is shipped to the executor and added to the classloader. Here's the general idea of what I want to do: val sparkConf = new SparkConf(true) .set("spark.jars", "foo.jar") .setAppName("foo") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // register classes contained in foo.jar sparkConf.registerKryoClasses(Array( classOf[com.foo.Foo], classOf[com.foo.Bar]))
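A sketch of the workaround Imran describes, assuming the application jar has first been copied manually to the same local path on every node (the /opt/jars path is hypothetical; the class names are the ones from this thread). The point is that extraClassPath makes the classes visible when the executor's SparkEnv, and therefore the Kryo serializer, is created, whereas jars shipped via spark.jars arrive too late for registration:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf(true)
  .setAppName("foo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // jar pushed to every node beforehand, e.g. to /opt/jars/foo.jar (hypothetical path)
  .set("spark.driver.extraClassPath", "/opt/jars/foo.jar")
  .set("spark.executor.extraClassPath", "/opt/jars/foo.jar")

// register classes contained in foo.jar
sparkConf.registerKryoClasses(Array(
  classOf[com.foo.Foo],
  classOf[com.foo.Bar]))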
Re: HiveContext saveAsTable create wrong partition
I found if I move the partitioned columns in schemaString and in Row to the end of the sequence, then it works correctly... On 16. juni 2015 11:14, patcharee wrote: Hi, I am using spark 1.4 and HiveContext to append data into a partitioned hive table. I found that the data insert into the table is correct, but the partition(folder) created is totally wrong. Below is my code snippet --- val schemaString = zone z year month date hh x y height u v w ph phb t p pb qvapor qgraup qnice qnrain tke_pbl el_pbl val schema = StructType( schemaString.split( ).map(fieldName = if (fieldName.equals(zone) || fieldName.equals(z) || fieldName.equals(year) || fieldName.equals(month) || fieldName.equals(date) || fieldName.equals(hh) || fieldName.equals(x) || fieldName.equals(y)) StructField(fieldName, IntegerType, true) else StructField(fieldName, FloatType, true) )) val pairVarRDD = sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(), 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(), 0.0.floatValue(),0.1.floatValue(),0.0.floatValue())) )) val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema) partitionedTestDF2.write.format(org.apache.spark.sql.hive.orc.DefaultSource) .mode(org.apache.spark.sql.SaveMode.Append).partitionBy(zone,z,year,month).saveAsTable(test4DimBySpark) --- The table contains 23 columns (longer than Tuple maximum length), so I use Row Object to store raw data, not Tuple. Here is some message from spark when it saved data 15/06/16 10:39:22 INFO metadata.Hive: Renaming src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest: hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true 15/06/16 10:39:22 INFO metadata.Hive: New loading path = hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0 with partSpec {zone=13195, z=0, year=0, month=0} From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month = 3. But spark created a partition {zone=13195, z=0, year=0, month=0}. When I queried from hive hive select * from test4dimBySpark; OK 242200931.00.0218.0365.09989.497 29.62711319.0717930.11982734-3174.681297735.2 16.389032-96.6289125135.3652.6476808E-50.0 13195 000 hive select zone, z, year, month from test4dimBySpark; OK 13195000 hive dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*; Found 2 items -rw-r--r-- 3 patcharee hdfs 1411 2015-06-16 10:39 /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1 The data stored in the table is correct zone = 2, z = 42, year = 2009, month = 3, but the partition created was wrong zone=13195/z=0/year=0/month=0 Is this a bug or what could be wrong? Any suggestion is appreciated. BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
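For completeness, a sketch of the reordering described above, applied to the snippet quoted in this thread: the partition columns (zone, z, year, month) go to the tail of both the schema string and each Row, so partitionBy picks up the right values. The data are the thread's own example values; the float conversions are simplified to plain float literals:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, FloatType}

// data columns first, partition columns (zone, z, year, month) last
val schemaString = "date hh x y height u v w ph phb t p pb qvapor qgraup qnice qnrain tke_pbl el_pbl zone z year month"
val schema = StructType(schemaString.split(" ").map(fieldName =>
  if (Seq("zone", "z", "year", "month", "date", "hh", "x", "y").contains(fieldName))
    StructField(fieldName, IntegerType, true)
  else
    StructField(fieldName, FloatType, true)))

// same record as in the thread, with the partition values (2, 42, 2009, 3) moved to the end
val pairVarRDD = sc.parallelize(Seq(Row(
  1, 0, 218, 365,
  9989.497f, 29.627113f, 19.071793f, 0.11982734f, 3174.6812f, 97735.2f, 16.389032f,
  -96.62891f, 25135.365f, 2.6476808E-5f, 0.0f, 13195.351f, 0.0f, 0.1f, 0.0f,
  2, 42, 2009, 3)))

val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
partitionedTestDF2.write
  .format("org.apache.spark.sql.hive.orc.DefaultSource")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .partitionBy("zone", "z", "year", "month")
  .saveAsTable("test4DimBySpark")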
HiveContext saveAsTable create wrong partition
Hi, I am using spark 1.4 and HiveContext to append data into a partitioned hive table. I found that the data insert into the table is correct, but the partition(folder) created is totally wrong. Below is my code snippet --- val schemaString = zone z year month date hh x y height u v w ph phb t p pb qvapor qgraup qnice qnrain tke_pbl el_pbl val schema = StructType( schemaString.split( ).map(fieldName = if (fieldName.equals(zone) || fieldName.equals(z) || fieldName.equals(year) || fieldName.equals(month) || fieldName.equals(date) || fieldName.equals(hh) || fieldName.equals(x) || fieldName.equals(y)) StructField(fieldName, IntegerType, true) else StructField(fieldName, FloatType, true) )) val pairVarRDD = sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(), 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(), 0.0.floatValue(),0.1.floatValue(),0.0.floatValue())) )) val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema) partitionedTestDF2.write.format(org.apache.spark.sql.hive.orc.DefaultSource) .mode(org.apache.spark.sql.SaveMode.Append).partitionBy(zone,z,year,month).saveAsTable(test4DimBySpark) --- The table contains 23 columns (longer than Tuple maximum length), so I use Row Object to store raw data, not Tuple. Here is some message from spark when it saved data 15/06/16 10:39:22 INFO metadata.Hive: Renaming src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest: hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true 15/06/16 10:39:22 INFO metadata.Hive: New loading path = hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0 with partSpec {zone=13195, z=0, year=0, month=0} From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month = 3. But spark created a partition {zone=13195, z=0, year=0, month=0}. When I queried from hive hive select * from test4dimBySpark; OK 242200931.00.0218.0365.09989.497 29.62711319.0717930.11982734-3174.681297735.2 16.389032-96.6289125135.3652.6476808E-50.0 131950 00 hive select zone, z, year, month from test4dimBySpark; OK 13195000 hive dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*; Found 2 items -rw-r--r-- 3 patcharee hdfs 1411 2015-06-16 10:39 /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1 The data stored in the table is correct zone = 2, z = 42, year = 2009, month = 3, but the partition created was wrong zone=13195/z=0/year=0/month=0 Is this a bug or what could be wrong? Any suggestion is appreciated. BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
sql.catalyst.ScalaReflection scala.reflect.internal.MissingRequirementError
Hi, I use spark 0.14. I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath [file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] and parent being primordial classloader with boot classpath [/usr/jdk64/jdk1.7.0_67/jre/lib/resources.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/rt.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/sunrsasign.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/jsse.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/jce.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/charsets.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/jfr.jar:/usr/jdk64/jdk1.7.0_67/jre/classes] not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at no.uni.computing.etl.LoadWrfV14$$typecreator1$1.apply(LoadWrfV14.scala:91) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:335) BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
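For what it's worth, the classpath dumped in the error above only lists the Spark assembly and conf directories, not the application jar that holds etl.Record4Dim_2, so one thing to check is whether that jar actually reaches the driver's classloader (for example by submitting it with spark-submit rather than hand-building the classpath). A sketch of the intended call with the quoting restored (column names as in the post; the jar name below is hypothetical):

// submitted e.g. as: spark-submit --class no.uni.computing.etl.LoadWrfV14 my-etl.jar   (jar name hypothetical)
import sqlContext.implicits._

val partitionedTestDF2 = pairVarRDD.toDF("column1", "column2", "column3")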
Re: hiveContext.sql NullPointerException
Hi, Does df.write.partitionBy(partitions).format(format).mode(overwrite).saveAsTable(tbl) support orc file? I tried df.write.partitionBy(zone, z, year, month).format(orc).mode(overwrite).saveAsTable(tbl), but after the insert my table tbl schema has been changed to something I did not expected .. -- FROM -- CREATE EXTERNAL TABLE `4dim`(`u` float, `v` float) PARTITIONED BY (`zone` int, `z` int, `year` int, `month` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' TBLPROPERTIES ( 'orc.compress'='ZLIB', 'transient_lastDdlTime'='1433016475') -- TO -- CREATE TABLE `4dim`(`col` arraystring COMMENT 'from deserializer') PARTITIONED BY (`zone` int COMMENT '', `z` int COMMENT '', `year` int COMMENT '', `month` int COMMENT '') ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat' TBLPROPERTIES ( 'EXTERNAL'='FALSE', 'spark.sql.sources.provider'='orc', 'spark.sql.sources.schema.numParts'='1', 'spark.sql.sources.schema.part.0'='{\type\:\struct\,\fields\:[{\name\:\u\,\type\:\float\,\nullable\:true,\metadata\:{}},{\name\:\v\,\type\:\float\,\nullable\:true,\metadata\:{}},{\name\:\zone\,\type\:\integer\,\nullable\:true,\metadata\:{}},{\name\:\z\,\type\:\integer\,\nullable\:true,\metadata\:{}},{\name\:\year\,\type\:\integer\,\nullable\:true,\metadata\:{}},{\name\:\month\,\type\:\integer\,\nullable\:true,\metadata\:{}}]}', 'transient_lastDdlTime'='1434055247') I noticed there are files stored in hdfs as *.orc, but when I tried to query from hive I got nothing. How can I fix this? Any suggestions please BR, Patcharee On 07. juni 2015 16:40, Cheng Lian wrote: Spark SQL supports Hive dynamic partitioning, so one possible workaround is to create a Hive table partitioned by zone, z, year, and month dynamically, and then insert the whole dataset into it directly. In 1.4, we also provides dynamic partitioning support for non-Hive environment, and you can do something like this: df.write.partitionBy(zone, z, year, month).format(parquet).mode(overwrite).saveAsTable(tbl) Cheng On 6/7/15 9:48 PM, patcharee wrote: Hi, How can I expect to work on HiveContext on the executor? If only the driver can see HiveContext, does it mean I have to collect all datasets (very large) to the driver and use HiveContext there? It will be memory overload on the driver and fail. BR, Patcharee On 07. juni 2015 11:51, Cheng Lian wrote: Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side. However, the closure passed to RDD.foreach is executed on executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee wrote: Hi, I try to insert data into a partitioned hive table. The groupByKey is to combine dataset into a partition of the hive table. After the groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). But the hiveContext.sql throws NullPointerException, see below. Any suggestions? What could be wrong? Thanks! 
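A hedged alternative sketch for the post above (not verified against 1.4): keep the externally created Hive ORC table as the source of truth and append with insertInto, which writes into the existing table definition instead of re-registering the table as a Spark-managed data source table the way saveAsTable does here. It assumes the DataFrame df from the post has the data columns first and the partition columns last, and that dynamic partitioning is enabled.

import org.apache.spark.sql.SaveMode

// hedged sketch, assuming the 4dim ORC table from the post already exists
sqlContext.sql("SET hive.exec.dynamic.partition=true")
sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
df.select("u", "v", "zone", "z", "year", "month")   // partition columns trail the data columns
  .write
  .mode(SaveMode.Append)
  .insertInto("4dim")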
val varWHeightFlatRDD = varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey() .foreach( x = { val zone = x._1._1 val z = x._1._2 val year = x._1._3 val month = x._1._4 val df_table_4dim = x._2.toList.toDF() df_table_4dim.registerTempTable(table_4Dim) hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + zone + ,z= + z + ,year= + year + ,month= + month + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim); }) java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203
Re: hiveContext.sql NullPointerException
Hi, Thanks for your guidelines. I will try it out. Btw how do you know HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side? Where can I find document? BR, Patcharee On 07. juni 2015 16:40, Cheng Lian wrote: Spark SQL supports Hive dynamic partitioning, so one possible workaround is to create a Hive table partitioned by zone, z, year, and month dynamically, and then insert the whole dataset into it directly. In 1.4, we also provides dynamic partitioning support for non-Hive environment, and you can do something like this: df.write.partitionBy(zone, z, year, month).format(parquet).mode(overwrite).saveAsTable(tbl) Cheng On 6/7/15 9:48 PM, patcharee wrote: Hi, How can I expect to work on HiveContext on the executor? If only the driver can see HiveContext, does it mean I have to collect all datasets (very large) to the driver and use HiveContext there? It will be memory overload on the driver and fail. BR, Patcharee On 07. juni 2015 11:51, Cheng Lian wrote: Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side. However, the closure passed to RDD.foreach is executed on executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee wrote: Hi, I try to insert data into a partitioned hive table. The groupByKey is to combine dataset into a partition of the hive table. After the groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). But the hiveContext.sql throws NullPointerException, see below. Any suggestions? What could be wrong? Thanks! val varWHeightFlatRDD = varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey() .foreach( x = { val zone = x._1._1 val z = x._1._2 val year = x._1._3 val month = x._1._4 val df_table_4dim = x._2.toList.toDF() df_table_4dim.registerTempTable(table_4Dim) hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + zone + ,z= + z + ,year= + year + ,month= + month + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim); }) java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: hiveContext.sql NullPointerException
Hi, How can I expect to work on HiveContext on the executor? If only the driver can see HiveContext, does it mean I have to collect all datasets (very large) to the driver and use HiveContext there? It will be memory overload on the driver and fail. BR, Patcharee On 07. juni 2015 11:51, Cheng Lian wrote: Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side. However, the closure passed to RDD.foreach is executed on executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee wrote: Hi, I try to insert data into a partitioned hive table. The groupByKey is to combine dataset into a partition of the hive table. After the groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). But the hiveContext.sql throws NullPointerException, see below. Any suggestions? What could be wrong? Thanks! val varWHeightFlatRDD = varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey() .foreach( x = { val zone = x._1._1 val z = x._1._2 val year = x._1._3 val month = x._1._4 val df_table_4dim = x._2.toList.toDF() df_table_4dim.registerTempTable(table_4Dim) hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + zone + ,z= + z + ,year= + year + ,month= + month + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim); }) java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
write multiple outputs by key
Hi, How can I write to multiple outputs for each key? I tried to create a custom partitioner and to set the number of partitions, but it does not work. Only a few tasks/partitions (as many as the number of distinct key combinations) get large datasets; the data is not spread across all tasks/partitions. The job failed because those few tasks handled far too large datasets. Below is my code snippet.

val varWFlatRDD = varWRDD.map(FlatMapUtilClass().flatKeyFromWrf).groupByKey()  // keys are (zone, z, year, month)
  .foreach( x => {
    val z = x._1._1
    val year = x._1._2
    val month = x._1._3
    val df_table_4dim = x._2.toList.toDF()
    df_table_4dim.registerTempTable("table_4Dim")
    hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
      "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim");
  })

From the spark history UI, at groupByKey there are 1000 tasks (equal to the parent's number of partitions). At foreach there are 1000 tasks as well, but only 50 tasks (the same as the number of distinct key combinations) get datasets. How can I fix this problem? Any suggestions are appreciated. BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
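A hedged restructuring sketch for the skew described above, following the dynamic-partitioning suggestion quoted elsewhere in this archive: skip groupByKey entirely and let the writer split the output by the key columns. It assumes flatKeyFromWrf can emit flat records that carry zone, z, year and month as ordinary columns, which is not what the pair-producing version above does.

import org.apache.spark.sql.SaveMode
import hiveContext.implicits._

// hedged sketch: no per-key shuffle; the writer creates one partition directory per key
val df = varWRDD.map(FlatMapUtilClass().flatKeyFromWrf).toDF()
df.write
  .format("orc")
  .mode(SaveMode.Overwrite)
  .partitionBy("zone", "z", "year", "month")
  .saveAsTable("4dim")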
hiveContext.sql NullPointerException
Hi, I try to insert data into a partitioned hive table. The groupByKey is to combine dataset into a partition of the hive table. After the groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). But the hiveContext.sql throws NullPointerException, see below. Any suggestions? What could be wrong? Thanks! val varWHeightFlatRDD = varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey() .foreach( x = { val zone = x._1._1 val z = x._1._2 val year = x._1._3 val month = x._1._4 val df_table_4dim = x._2.toList.toDF() df_table_4dim.registerTempTable(table_4Dim) hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + zone + ,z= + z + ,year= + year + ,month= + month + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim); }) java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
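A hedged driver-side sketch of the same job, for reference alongside the replies above: keep every hiveContext.sql call on the driver by registering the whole dataset once and looping over the distinct keys there. It assumes flatKeyFromWrf can produce flat rows with the key fields as columns, as in the later ArrayIndexOutOfBoundsException post.

import hiveContext.implicits._

// hedged sketch: hiveContext is only ever touched on the driver
val df = varWHeightRDD.map(FlatMapUtilClass().flatKeyFromWrf).toDF()
df.registerTempTable("table_4Dim")

val keys = df.select("zone", "z", "year", "month").distinct().collect()
keys.foreach { k =>
  val (zone, z, year, month) = (k.getInt(0), k.getInt(1), k.getInt(2), k.getInt(3))
  hiveContext.sql(
    "INSERT OVERWRITE table 4dim partition (zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
    "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl " +
    "from table_4Dim where zone=" + zone + " and z=" + z + " and year=" + year + " and month=" + month)
}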
Re: FetchFailed Exception
Hi, I has this problem before, and in my case it is because the executor/container was killed by yarn when it used more memory than allocated. You can check if your case is the same by checking yarn node manager log. Best, Patcharee On 05. juni 2015 07:25, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: I see this Is this a problem with my code or the cluster ? Is there any way to fix it ? FetchFailed(BlockManagerId(2, phxdpehdc9dn2441.stratus.phx.ebay.com http://phxdpehdc9dn2441.stratus.phx.ebay.com, 59574), shuffleId=1, mapId=80, reduceId=20, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 http://phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to connect to phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 http://phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 at 
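If the node manager log does confirm that the container was killed for exceeding its memory limit, a hedged config sketch that usually helps on YARN is to raise the off-heap overhead budget in addition to the executor heap; the values below are only examples, not recommendations for this workload.

import org.apache.spark.SparkConf

// hedged sketch: the overhead is allocated on top of the executor heap inside the YARN container
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "1024")   // MB; example value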
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) ... 3 more Caused by: java.net.ConnectException: Connection refused: phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 http://phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized
NullPointerException SQLConf.setConf
Hi, I am using Hive 0.14 and spark 0.13. I got java.lang.NullPointerException when inserted into hive. Any suggestions please. hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + ZONE + ,z= + zz + ,year= + YEAR + ,month= + MONTH + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim where z= + zz); java.lang.NullPointerException at org.apache.spark.sql.SQLConf.setConf(SQLConf.scala:196) at org.apache.spark.sql.SQLContext.setConf(SQLContext.scala:74) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251) at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:110) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:107) at scala.collection.immutable.Range.foreach(Range.scala:141) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
MetaException(message:java.security.AccessControlException: Permission denied
Hi, I was running a spark job to insert overwrite hive table and got Permission denied. My question is why spark job did the insert by using user 'hive', not myself who ran the job? How can I fix the problem? val hiveContext = new HiveContext(sc) import hiveContext.implicits._ hiveContext.sql(INSERT OVERWRITE table 4dim ... ) Caused by: MetaException(message:java.security.AccessControlException: Permission denied: user=hive, access=WRITE, inode=/apps/hive/warehouse/wrf_tables/4dim/zone=2/z=1/year=2009/month=1:patcharee:hdfs:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) ) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partition_result$alter_partition_resultStandardScheme.read(ThriftHiveMetastore.java) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partition_result$alter_partition_resultStandardScheme.read(ThriftHiveMetastore.java) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partition_result.read(ThriftHiveMetastore.java) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_partition(ThriftHiveMetastore.java:2033) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.alter_partition(ThriftHiveMetastore.java:2018) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_partition(HiveMetaStoreClient.java:1091) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy37.alter_partition(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.alterPartition(Hive.java:469) ... 
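A hedged workaround sketch for the permission error above, assuming the metastore really does perform the ALTER PARTITION as the hive service user and that this cannot be changed quickly: grant that user access to the affected warehouse path with HDFS ACLs. This requires dfs.namenode.acls.enabled=true on the cluster; the path is taken from the error message. It papers over the symptom rather than fixing the impersonation setup.

# hedged workaround sketch, not a root-cause fix
hdfs dfs -setfacl -R -m user:hive:rwx /apps/hive/warehouse/wrf_tables/4dim
hdfs dfs -setfacl -R -m default:user:hive:rwx /apps/hive/warehouse/wrf_tables/4dim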
26 more BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: ERROR cluster.YarnScheduler: Lost executor
(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ... 1 more ) I am using spark 1.3.1, is the problem from the https://issues.apache.org/jira/browse/SPARK-4516? Best, Patcharee On 03. juni 2015 10:11, Akhil Das wrote: Which version of spark? Looks like you are hitting this one https://issues.apache.org/jira/browse/SPARK-4516 Thanks Best Regards On Wed, Jun 3, 2015 at 1:06 PM, patcharee patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote: This is log I can get 15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (2/3) for 4 outstanding blocks after 5000 ms 15/06/02 16:37:36 INFO client.TransportClientFactory: Found inactive connection to compute-10-3.local/10.10.255.238:33671 http://10.10.255.238:33671, creating a new one. 15/06/02 16:37:36 WARN server.TransportChannelHandler: Exception in connection from /10.10.255.238:35430 http://10.10.255.238:35430 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:744) 15/06/02 16:37:36 ERROR server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1033433133943, chunkIndex=1}, buffer=FileSegmentManagedBuffer{file=/hdisk3/hadoop/yarn/local/usercache/patcharee/appcache/application_1432633634512_0213/blockmgr-12d59e6b-0895-4a0e-9d06-152d2f7ee855/09/shuffle_0_56_0.data, offset=896, length=1132499356}} to /10.10.255.238:35430 http://10.10.255.238:35430; closing connection java.nio.channels.ClosedChannelException 15/06/02 16:37:38 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 4 outstanding blocks (after 2 retries) java.io.IOException: Failed to connect to compute-10-3.local/10.10.255.238:33671 http://10.10.255.238:33671 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.net.ConnectException: Connection refused: compute-10-3.local/10.10.255.238:33671 http://10.10.255.238:33671 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208) at io.netty.channel.nio.AbstractNioChannel
Re: ERROR cluster.YarnScheduler: Lost executor
This is log I can get 15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (2/3) for 4 outstanding blocks after 5000 ms 15/06/02 16:37:36 INFO client.TransportClientFactory: Found inactive connection to compute-10-3.local/10.10.255.238:33671, creating a new one. 15/06/02 16:37:36 WARN server.TransportChannelHandler: Exception in connection from /10.10.255.238:35430 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:744) 15/06/02 16:37:36 ERROR server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1033433133943, chunkIndex=1}, buffer=FileSegmentManagedBuffer{file=/hdisk3/hadoop/yarn/local/usercache/patcharee/appcache/application_1432633634512_0213/blockmgr-12d59e6b-0895-4a0e-9d06-152d2f7ee855/09/shuffle_0_56_0.data, offset=896, length=1132499356}} to /10.10.255.238:35430; closing connection java.nio.channels.ClosedChannelException 15/06/02 16:37:38 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 4 outstanding blocks (after 2 retries) java.io.IOException: Failed to connect to compute-10-3.local/10.10.255.238:33671 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.net.ConnectException: Connection refused: compute-10-3.local/10.10.255.238:33671 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208) at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ... 1 more Best, Patcharee On 03. juni 2015 09:21, Akhil Das wrote: You need to look into your executor/worker logs to see whats going on. Thanks Best Regards On Wed, Jun 3, 2015 at 12:01 PM, patcharee patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote: Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org mailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
ERROR cluster.YarnScheduler: Lost executor
Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Insert overwrite to hive - ArrayIndexOutOfBoundsException
Hi, I am using spark 1.3.1. I tried to insert (a new partition) into an existing partitioned hive table, but got an ArrayIndexOutOfBoundsException. Below is a code snippet and the debug log. Any suggestions please.

+
case class Record4Dim(key: String, date: Int, hh: Int, x: Int, y: Int, z: Int, height: Float, u: Float, v: Float, w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, qvapor: Float, qgraup: Float, qnice: Float, qnrain: Float, tke_pbl: Float, el_pbl: Float)

def flatKeyFromWrf(x: (String, (Map[String,Float], Float))): Record4Dim = { }

val varWHeightFlatRDD = varWHeightRDD.map(FlatMapUtilClass().flatKeyFromWrf).toDF()
varWHeightFlatRDD.registerTempTable("table_4Dim")
for (zz <- 1 to 51)
  hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
    "select date, hh, x, y, height, u, v, w, ph, phb, t, pb, pb, qvapor, qgraup, qnice, tke_pbl, el_pbl from table_4Dim where z=" + zz);
+

15/06/01 21:07:20 DEBUG YarnHistoryService: Enqueue [1433192840040]: SparkListenerTaskEnd(4,0,ResultTask,ExceptionFailure(java.lang.ArrayIndexOutOfBoundsException,18,[Ljava.lang.StackTraceElement;@5783ce22,java.lang.ArrayIndexOutOfBoundsException: 18 at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:79) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:103) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:100) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:100) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:82) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:82) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
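A hedged observation on the post above: the SELECT list contains 18 expressions (pb appears twice, while p and qnrain are missing), and the exception is thrown exactly at index 18, which suggests each produced row is one column short of the 19 data columns this table is given elsewhere in these posts. A corrected statement would look roughly like the sketch below; it assumes 4dim really has those 19 data columns.

// hedged sketch with the full 19-column select list
for (zz <- 1 to 51)
  hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
    "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl " +
    "from table_4Dim where z=" + zz)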
pig performance on reading/filtering orc file
Hi, I am using pig 0.14 to work on partitioned orc file. I tried to improve my pig performance. However I am curious why using filter at the beginning (approach 1) does not help and takes even longer times than replicated join (approach 2). This filter is supposed to cut down a lot of data to be taken from orc file. Is this related to how I partition the orc file? Any guidelines/suggestions are appreciated. --- Approach 1 --- coordinate = LOAD 'coordinate' USING org.apache.hive.hcatalog.pig.HCatLoader(); coordinate_zone = FILTER coordinate BY zone == 2; coordinate_xy = LIMIT coordinate_zone 1; rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader(); rawdata_u_1 = foreach rawdata_u generate date,hh,(double)xlong_u,(double)xlat_u,height,u,zone,year,month; u_filter = FILTER rawdata_u_1 by zone == 2; / HERE I try to filter and expect to get better performance, but it is not / u_filter = FILTER u_filter by xlong_u == coordinate_xy.xlong_u and xlat_u == coordinate_xy.xlat_u; --- Approach 2 --- coordinate = LOAD 'coordinate' USING org.apache.hive.hcatalog.pig.HCatLoader(); coordinate_zone = FILTER coordinate BY zone == 2; coordinate_xy = LIMIT coordinate_zone 1; rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader(); u_filter = FILTER rawdata_u by zone == 2 join_u_coordinate_cossin = join u_filter by (xlong_u, xlat_u), coordinate_xy by (xlong_u, xlat_u) USING 'replicated'; Best, Patcharee
cast column float
Hi, I queried a table based on the values of two float columns:

select count(*) from u where xlong_u = 7.1578474 and xlat_u = 55.192524;
select count(*) from u where xlong_u = cast(7.1578474 as float) and xlat_u = cast(55.192524 as float);

Both queries returned 0 records, even though there are records that match the condition. What can be wrong? I am using Hive 0.14. BR, Patcharee
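A hedged sketch of one way around this, assuming the zero matches come from FLOAT equality being sensitive to how the decimal literal widens to a binary double: compare within a small tolerance instead of with =. The tolerance value is only an example and should sit below the data's real resolution.

-- hedged sketch, tolerance value is illustrative
select count(*)
from u
where abs(xlong_u - 7.1578474) < 0.0001
  and abs(xlat_u - 55.192524) < 0.0001;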
EOFException - TezJob - Cannot submit DAG
Hi, I ran a pig script on tez and got the EOFException. Check at http://wiki.apache.org/hadoop/EOFException I have no ideas at all how I can fix it. However I did not get the exception when I executed this pig script on MR. I am using HadoopVersion: 2.6.0.2.2.4.2-2, PigVersion: 0.14.0.2.2.4.2-2, TezVersion: 0.5.2.2.2.4.2-2 I will appreciate any suggestions. Thanks. 2015-05-22 14:44:13,638 [PigTezLauncher-0] ERROR org.apache.pig.backend.hadoop.executionengine.tez.TezJob - Cannot submit DAG - Application id: application_143223768_0133 org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException: java.io.EOFException: End of File Exception between local host is: compute-10-0.local/10.10.255.241; destination host is: compute-10-3.local:47111; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:415) at org.apache.tez.client.TezClient.submitDAG(TezClient.java:351) at org.apache.pig.backend.hadoop.executionengine.tez.TezJob.run(TezJob.java:162) at org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher$1.run(TezLauncher.java:167) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: com.google.protobuf.ServiceException: java.io.EOFException: End of File Exception between local host is: compute-10-0.local/10.10.255.241; destination host is: compute-10-3.local:47111; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:246) at com.sun.proxy.$Proxy31.submitDAG(Unknown Source) at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:408) ... 8 more Caused by: java.io.EOFException: End of File Exception between local host is: compute-10-0.local/10.10.255.241; destination host is: compute-10-3.local:47111; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1473) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) ... 10 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
saveasorcfile on partitioned orc
Hi, I followed the information on https://www.mail-archive.com/reviews@spark.apache.org/msg141113.html to save orc file with spark 1.2.1. I can save data to a new orc file. I wonder how to save data to an existing and partitioned orc file? Any suggestions? BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: conflict from apache commons codec
Thanks for your input. There are two weird points found: - In my cluster (also the resources being localized for the tasks) there are two versions; commons-codec-1.4.jar and commons-codec-1.7.jar. I think commons-codec since version 1.4 shows be fine, but I still got the exception (NoSuchMethodError org.apache.commons.codec.binary.Base64.decodeBase64(Ljava/lang/String;)[B). Any ideas? - I tried to run the same pig script+same environment on small number of files and large number of files. The former (small number of files) did not throw the exception, but the latter did. What can be wrong for the latter? BR, Patcharee On 20. mai 2015 09:37, Siddharth Seth wrote: My best guess would be that an older version of commons-codec is also on the classpath for the running task. If you have access to the local-dirs configured under YARN - you could find the application dir in the local-dirs and see what exists in the classpath for a container. Alternately, set tez.generate.debug.artifacts to true. This should give you access to the dag plan in text form via the YARN UI - which will list out the resources being localized for tasks. On Tue, May 19, 2015 at 2:19 AM, patcharee patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote: Hi, I am using Pig version 0.14 with tez version 0.5.2. I have apache commons-codec-1.4.jar on the machine. However, I got the common codec exception when I executed a Pig job with tez. 2015-05-19 11:01:04,784 INFO [AsyncDispatcher event handler] history.HistoryEventHandler: [HISTORY][DAG:dag_1431972385685_0021_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=scope-384, taskAttemptId=attempt_1431972385685_0021_1_00_04_0, startTime=1432025450752, finishTime=1432026064784, timeTaken=614032, status=FAILED, diagnostics=Error: Fatal Error cause TezChild exit.:java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.decodeBase64(Ljava/lang/String;)[B at org.apache.hadoop.yarn.util.AuxiliaryServiceHelper.getServiceDataFromEnv(AuxiliaryServiceHelper.java:37) at org.apache.tez.runtime.api.impl.TezTaskContextImpl.getServiceProviderMetaData(TezTaskContextImpl.java:175) at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.generateEventsOnClose(OrderedPartitionedKVOutput.java:187) at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:148) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:348) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:178) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) What could be wrong? BR, Patcharee
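A hedged diagnostic sketch for narrowing this down, following the classpath suggestion in the reply: list every commons-codec jar visible on the gateway classpath and in a task's localized container directory, since Base64.decodeBase64(String) only exists from commons-codec 1.4 on. The bracketed parts are placeholders, and the local-dirs root must be taken from yarn.nodemanager.local-dirs on the node.

# hedged diagnostic sketch; <user>, <application_id> and <yarn-local-dir> are placeholders
hadoop classpath | tr ':' '\n' | grep -i commons-codec
find <yarn-local-dir>/usercache/<user>/appcache/<application_id> -name 'commons-codec*.jar'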
hive on Tez - merging orc files
Hi, Is there anyone using the Hortonworks sandbox 2.2? I am trying to use Hive on Tez on the sandbox. I set the execution engine in hive-site.xml to Tez:

<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>

Then I ran the script that alters a table to merge small orc files (alter table orc_merge5a partition(st=0.8) concatenate;). The merging feature worked, but Hive did not use Tez, it used MapReduce, so weird! Another point: I tried to run the same script on the production cluster, which always runs on Tez, and there the merging feature sometimes worked, sometimes did not. I would appreciate any suggestions. BR, Patcharee
Re: merge small orc files
Hi Gopal, Thanks for your explanation. What could be the case that SET hive.merge.orcfile.stripe.level=true alter table table concatenate do not work? I have a dynamic partitioned table (stored as orc). I tried to alter concatenate, but it did not work. See my test result. hive SET hive.merge.orcfile.stripe.level=true; hive alter table orc_merge5a partition(st=0.8) concatenate; Starting Job = job_1424363133313_0053, Tracking URL = http://service-test-1-2.testlocal:8088/proxy/application_1424363133313_0053/ Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job -kill job_1424363133313_0053 Hadoop job information for null: number of mappers: 0; number of reducers: 0 2015-04-21 12:32:56,165 null map = 0%, reduce = 0% 2015-04-21 12:33:05,964 null map = 100%, reduce = 0% Ended Job = job_1424363133313_0053 Loading data to table default.orc_merge5a partition (st=0.8) Moved: 'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/00_0' to trash at: hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current Moved: 'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/02_0' to trash at: hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current Partition default.orc_merge5a{st=0.8} stats: [numFiles=2, numRows=0, totalSize=1067, rawDataSize=0] MapReduce Jobs Launched: Stage-null: HDFS Read: 0 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Time taken: 22.839 seconds hive dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/; Found 2 items -rw-r--r-- 3 patcharee hdfs534 2015-04-21 12:33 /apps/hive/warehouse/orc_merge5a/st=0.8/00_0 -rw-r--r-- 3 patcharee hdfs533 2015-04-21 12:33 /apps/hive/warehouse/orc_merge5a/st=0.8/01_0 It seems nothing happened when I altered table concatenate. Any ideas? BR, Patcharee On 21. april 2015 04:41, Gopal Vijayaraghavan wrote: Hi, How to set the configuration hive-site.xml to automatically merge small orc file (output from mapreduce job) in hive 0.14 ? Hive cannot add work-stages to a map-reduce job. Hive follows merge.mapfiles=true when Hive generates a plan, by adding more work to the plan as a conditional task. -rwxr-xr-x 1 root hdfs 29072 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-0 This looks like it was written by an MRv2 Reducer and not by the Hive FileSinkOperator handled by the MR outputcommitter instead of the Hive MoveTask. But 0.14 has an option which helps ³hive.merge.orcfile.stripe.level². If that is true (like your setting), then do ³alter table table concatenate² which effectively concatenates ORC blocks (without decompressing them), while maintaining metadata linkage of start/end offsets in the footer. Cheers, Gopal
Re: merge small orc files
Hi Gopal, The table created is not a bucketed table, but a dynamic partitioned table. I took the script test from https://svn.apache.org/repos/asf/hive/trunk/ql/src/test/queries/clientpositive/orc_merge7.q - create table orc_merge5 (userid bigint, string1 string, subtype double, decimal1 decimal, ts timestamp) stored as orc; - create table orc_merge5a (userid bigint, string1 string, subtype double, decimal1 decimal, ts timestamp) partitioned by (st double) stored as orc; I sent you the desc formatted table and application log. I just found out that there are some TezException which could be the cause of the problem. Please let me know how to fix it. BR, Patcharee On 21. april 2015 13:10, Gopal Vijayaraghavan wrote: alter table table concatenate do not work? I have a dynamic partitioned table (stored as orc). I tried to alter concatenate, but it did not work. See my test result. ORC fast concatenate does work on partitioned tables, but it doesn¹t work on bucketed tables. Bucketed tables cannot merge files, since the file count is capped by the numBuckets parameter. hive dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/; Found 2 items -rw-r--r-- 3 patcharee hdfs534 2015-04-21 12:33 /apps/hive/warehouse/orc_merge5a/st=0.8/00_0 -rw-r--r-- 3 patcharee hdfs533 2015-04-21 12:33 /apps/hive/warehouse/orc_merge5a/st=0.8/01_0 Is this a bucketed table? When you look at the point of view of split generation cluster parallelism, bucketing is an anti-pattern, since in most query schemas it significantly slows down the slowest task. Making the fastest task faster isn¹t often worth it, if the overall query time goes up. Also if you want to, you can send me the yarn logs -applicationId app-id and the desc formatted of the table, which will help me understand what¹s happening better. 
Cheers, Gopal Container: container_1424363133313_0082_01_03 on compute-test-1-2.testlocal_45454 === LogType:stderr Log Upload Time:21-Apr-2015 14:17:54 LogLength:0 Log Contents: LogType:stdout Log Upload Time:21-Apr-2015 14:17:54 LogLength:2124 Log Contents: 0.294: [GC [PSYoungGen: 3642K-490K(6656K)] 3642K-1308K(62976K), 0.0071100 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 0.600: [GC [PSYoungGen: 6110K-496K(12800K)] 6929K-1992K(69120K), 0.0058540 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 1.061: [GC [PSYoungGen: 10217K-496K(12800K)] 11714K-3626K(69120K), 0.0077230 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 1.477: [GC [PSYoungGen: 8914K-512K(25088K)] 12045K-5154K(81408K), 0.0095740 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 2.361: [GC [PSYoungGen: 14670K-512K(25088K)] 19313K-6827K(81408K), 0.0106680 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 3.476: [GC [PSYoungGen: 22967K-3059K(51712K)] 29282K-9958K(108032K), 0.0201770 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 5.538: [GC [PSYoungGen: 50438K-3568K(52224K)] 57336K-15383K(108544K), 0.0374340 secs] [Times: user=0.04 sys=0.01, real=0.04 secs] 6.811: [GC [PSYoungGen: 29358K-6331K(61440K)] 41173K-18282K(117760K), 0.0421300 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 7.689: [GC [PSYoungGen: 28530K-6401K(61440K)] 40482K-19476K(117760K), 0.0443730 secs] [Times: user=0.03 sys=0.00, real=0.05 secs] Heap PSYoungGen total 61440K, used 28333K [0xfbb8, 0x0001, 0x0001) eden space 54784K, 40% used [0xfbb8,0xfdf463f8,0xff10) lgrp 0 space 2K, 49% used [0xfbb8,0xfc95add8,0xfd7b6000) lgrp 1 space 25896K, 29% used [0xfd7b6000,0xfdf463f8,0xff10) from space 6656K, 96% used [0xff10,0xff740400,0xff78) to space 8704K, 0% used [0xff78,0xff78,0x0001) ParOldGen total 56320K, used 13075K [0xd9a0, 0xdd10, 0xfbb8) object space 56320K, 23% used [0xd9a0,0xda6c4e68,0xdd10) PSPermGen total 28672K, used 28383K [0xd480, 0xd640, 0xd9a0) object space 28672K, 98% used [0xd480,0xd63b7f78,0xd640) LogType:syslog Log Upload Time:21-Apr-2015 14:17:54 LogLength:1355 Log Contents: 2015-04-21 14:17:40,208 INFO [main] task.TezChild: TezChild starting 2015-04-21 14:17:41,856 INFO [main] task.TezChild: PID, containerIdentifier: 15169, container_1424363133313_0082_01_03 2015-04-21 14:17:41,985 INFO [main] impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s). 2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: TezTask metrics system started 2015-04-21 14:17:42,355 INFO [TezChild] task.ContainerReporter: Attempting to fetch new
merge small orc files
Hi, How can I configure hive-site.xml to automatically merge small ORC files (output from a MapReduce job) in Hive 0.14? This is my current configuration: <property><name>hive.merge.mapfiles</name><value>true</value></property> <property><name>hive.merge.mapredfiles</name><value>true</value></property> <property><name>hive.merge.orcfile.stripe.level</name><value>true</value></property> However, the output from a MapReduce job, which is stored as ORC files, was not merged. This is the output -rwxr-xr-x 1 root hdfs 0 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/_SUCCESS -rwxr-xr-x 1 root hdfs 29072 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-0 -rwxr-xr-x 1 root hdfs 29049 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-1 -rwxr-xr-x 1 root hdfs 29075 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-2 Any ideas? BR, Patcharee
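As a side note, the hive.merge.* flags above are usually paired with two size thresholds; a minimal sketch, with illustrative values rather than settings tested on this cluster:

<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <!-- trigger an extra merge pass when the average output file is smaller than ~16 MB -->
  <value>16000000</value>
</property>
<property>
  <name>hive.merge.size.per.task</name>
  <!-- target size of each merged file, ~256 MB -->
  <value>256000000</value>
</property>

Note, however, that these settings only kick in at the end of jobs that Hive itself launches; as far as I can tell they do not merge part-r-* files written directly into the warehouse directory by an external MapReduce job, so an explicit ALTER TABLE ... CONCATENATE (or re-writing the partition through Hive with INSERT OVERWRITE) may still be needed for output like the listing above.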
override log4j.properties
Hello, How can I override log4j.properties for a specific Spark job? BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
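A common approach, sketched below under the assumption that the custom file is called my-log4j.properties and the job runs on YARN (the application class and jar are placeholders), is to ship the file with --files and point both the driver and executor JVMs at it via extraJavaOptions:

# ship a per-job log4j.properties and tell the driver and executors to use it
# (my-log4j.properties, com.example.MyApp and my-app.jar are placeholders)
spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --files my-log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=my-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my-log4j.properties" \
  my-app.jar

In yarn-cluster mode the file distributed with --files lands in each container's working directory, so the bare file name resolves; in client mode the driver side may need an absolute file:/ URL instead.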
Re: Spark Job History Server
I turned it on. But it failed to start. In the log: Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp :/root/spark-1.3.0-bin-hadoop2.4/sbin/../conf:/root/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/etc/hadoop/conf -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.history.HistoryServer 15/03/18 10:23:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.history.YarnHistoryProvider at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:191) at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:183) at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) Patcharee On 18. mars 2015 11:35, Akhil Das wrote: You can simply turn it on using: ./sbin/start-history-server.sh Read more here: http://spark.apache.org/docs/1.3.0/monitoring.html. Thanks Best Regards On Wed, Mar 18, 2015 at 4:00 PM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I am using Spark 1.3. I would like to use the Spark Job History Server. I added the following lines into conf/spark-defaults.conf: spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider spark.yarn.historyServer.address sandbox.hortonworks.com:19888 But got: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.history.YarnHistoryProvider What class is really needed? How do I fix it? Br, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
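A quick way to confirm whether the provider class is actually bundled (a sketch, reusing the assembly jar path from the Spark Command above) is to list the jar contents:

# look for the YARN history provider; no output means the ClassNotFoundException is expected
jar tf /root/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar | grep 'deploy/yarn/history'
# the stock history server classes should show up, for comparison
jar tf /root/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar | grep 'deploy/history/HistoryServer'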
Spark Job History Server
Hi, I am using Spark 1.3. I would like to use the Spark Job History Server. I added the following lines into conf/spark-defaults.conf: spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider spark.yarn.historyServer.address sandbox.hortonworks.com:19888 But got: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.history.YarnHistoryProvider What class is really needed? How do I fix it? Br, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Job History Server
Hi, My Spark was compiled with the yarn profile, and I can run Spark on YARN without problems. For the Spark Job History Server problem, I checked spark-assembly-1.3.0-hadoop2.4.0.jar and found that the package org.apache.spark.deploy.yarn.history is missing. I don't know why. BR, Patcharee On 18. mars 2015 11:43, Akhil Das wrote: The yarn package is not on your classpath. You need to build your Spark with yarn. You can read these docs: http://spark.apache.org/docs/1.3.0/running-on-yarn.html Thanks Best Regards On Wed, Mar 18, 2015 at 4:07 PM, patcharee <patcharee.thong...@uni.no> wrote: I turned it on. But it failed to start. In the log: Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp :/root/spark-1.3.0-bin-hadoop2.4/sbin/../conf:/root/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/etc/hadoop/conf -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.history.HistoryServer 15/03/18 10:23:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.history.YarnHistoryProvider at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:191) at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:183) at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) Patcharee On 18. mars 2015 11:35, Akhil Das wrote: You can simply turn it on using: ./sbin/start-history-server.sh Read more here: http://spark.apache.org/docs/1.3.0/monitoring.html. Thanks Best Regards On Wed, Mar 18, 2015 at 4:00 PM, patcharee <patcharee.thong...@uni.no> wrote: Hi, I am using Spark 1.3. I would like to use the Spark Job History Server. I added the following lines into conf/spark-defaults.conf: spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider spark.yarn.historyServer.address sandbox.hortonworks.com:19888 But got: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.history.YarnHistoryProvider What class is really needed? How do I fix it? Br, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
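For what it's worth, org.apache.spark.deploy.yarn.history.YarnHistoryProvider does not appear to be part of the stock Apache Spark 1.3 assembly (it ships with certain vendor builds that integrate with the YARN timeline server), so its absence from a plain spark-assembly-1.3.0-hadoop2.4.0.jar is expected. A minimal sketch of conf/spark-defaults.conf for the filesystem-backed history server that is included in Apache Spark 1.3 follows; hdfs:///spark-history is a placeholder directory that must exist and be writable by the jobs:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-history
spark.history.provider           org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory    hdfs:///spark-history

With that in place, ./sbin/start-history-server.sh starts the server, which listens on port 18080 by default.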