PySpark: split a pair RDD [K, V] into a map [K, Array(V)]
Hi, How can I efficiently split a pair RDD [K, V] into a map [K, Array(V)] in PySpark? Best, Patcharee
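A common approach is groupByKey (a minimal sketch; the variable names are made up, and list() materializes the iterable that groupByKey yields per key):

    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    grouped = pairs.groupByKey().mapValues(list)  # RDD of (K, [V, ...])
    as_map = grouped.collectAsMap()               # plain dict {K: [V, ...]} on the driver

Note that grouping ships every value across the network; if the values will ultimately be reduced anyway, reduceByKey or aggregateByKey is usually cheaper.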
DataFrame access to Hive complex types
Hi, How can a DataFrame access Hive complex types (struct, array, map), and which API should I use? Thanks, Patcharee
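For example (a sketch, assuming sqlContext is a HiveContext and a Hive table t with columns s struct<a:int>, arr array<double> and m map<string,int>; the names are made up):

    val df = sqlContext.table("t")
    // Struct fields resolve with dot notation; array elements and map values with getItem.
    df.select(df("s.a"), df("arr").getItem(0), df("m").getItem("somekey")).show()
    // An array can also be flattened into one row per element:
    df.select(org.apache.spark.sql.functions.explode(df("arr"))).show()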
OrcNewOutputFormat: write a partitioned ORC file
Hi, How can I write a partitioned ORC file using OrcNewOutputFormat in MapReduce? Thanks, Patcharee
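One pattern that should work (a sketch, not a tested job; the schema, field names and paths are made up): serialize each row with OrcSerde against an ObjectInspector, set OrcNewOutputFormat as the job's output format, and route rows into Hive-style partition directories through MultipleOutputs base output paths:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Map-only job: each input line is "zone,value"; rows are serialized with
    // OrcSerde and written into partition subdirectories (zone=.../part-*).
    // The driver sets job.setOutputFormatClass(OrcNewOutputFormat.class) and
    // the usual FileOutputFormat.setOutputPath(...).
    public class OrcPartitionMapper
        extends Mapper<LongWritable, Text, NullWritable, Writable> {

      private final OrcSerde serde = new OrcSerde();
      private final StructObjectInspector inspector =
          ObjectInspectorFactory.getStandardStructObjectInspector(
              Arrays.asList("zone", "value"),
              Arrays.<ObjectInspector>asList(
                  PrimitiveObjectInspectorFactory.javaIntObjectInspector,
                  PrimitiveObjectInspectorFactory.javaDoubleObjectInspector));
      private MultipleOutputs<NullWritable, Writable> out;

      @Override
      protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, Writable>(context);
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        List<Object> row = new ArrayList<Object>();
        row.add(Integer.valueOf(fields[0]));   // zone (the partition column)
        row.add(Double.valueOf(fields[1]));    // value
        // The base output path picks the partition directory for this record.
        out.write(NullWritable.get(), serde.serialize(row, inspector),
            "zone=" + fields[0] + "/part");
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        out.close();
      }
    }

The ORC files land under zone=<value>/ inside the job output directory; the partitions still have to be registered in the Hive metastore afterwards (for example with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION).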
override log4j level
Hi, How can I override the log4j level using --hiveconf? I want to use ERROR level for some tasks. Thanks, Patcharee
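For the Hive CLI, the usual knob (assuming the stock hive-log4j configuration, which reads this property) is:

    hive --hiveconf hive.root.logger=ERROR,console

This applies to that invocation only; other sessions keep the configured default.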
Re: character '' not supported here
Hi, Here is the query result, one row per line (the field delimiters were lost when pasting, so the columns run together):

11236119012.64043-5.9708868.5592070.0 0.0 0.0-19.6869931308.804799848.00.006196644 0.00.0 301.274750.382470460.0NULL11 20081
11236122012.513598-6.36717137.3927946 0.0 0.00.0-22.3003921441.054799848.0 0.00508465060.0 0.0112.207870.304595230.0 NULL1120081
5122503682415.1955.1722354.9027147 -0.0244086120.023590.553-38.96928-1130.0469 74660.54 2.5969802E-49.706164E-1123054.2680.0 0.241967370.0 NULL1120081
9121449412.25196412.081688-9.594620.0 0.0 0.0-25.93576258.6562599848.00.0021708217 0.00.0 1.29632131.15602660.0NULL11 20081
9121458412.3020987.752461-12.183463 0.0 0.00.0-24.983763351.195399848.0 0.00237235990.0 0.01.41373750.992398860.0 NULL1120081

I stored the table in ORC format, partitioned and compressed by ZLIB. The problem happened just after I concatenated the table. BR, Patcharee

On 18/07/15 12:46, Nitin Pawar wrote:
select * without a where clause works because it does not involve file processing. I suspect the problem is with the field delimiter, so I asked for records so that we can see what the data in each column is. Are you using a csv file with columns delimited by some char, and does it have numeric data in quotes?

On Sat, Jul 18, 2015 at 3:58 PM, patcharee <patcharee.thong...@uni.no> wrote:
This "select * from table limit 5;" works, but not the others. So? Patcharee

On 18. juli 2015 12:08, Nitin Pawar wrote:
can you do select * from table limit 5;

On Sat, Jul 18, 2015 at 3:35 PM, patcharee <patcharee.thong...@uni.no> wrote:
Hi, I am using Hive 0.14 with the Tez engine and found a weird problem. Any suggestions?

hive> select count(*) from 4D;
line 1:1 character '' not supported here
line 1:2 character '' not supported here
line 1:3 character '' not supported here
... (repeats for every character position) ...
line 1:145 character '' not supported here
line 1:146 character '' not supported here

BR, Patcharee

--
Nitin Pawar
alter table on multiple partitions
Hi, I have a table partitioned by columns a, b, c and d, and I want to concatenate it with ALTER TABLE. Is it possible to use a wildcard in the alter command to alter several partitions at a time? For example:

alter table TestHive partition (a=1, b=*, c=2, d=*) CONCATENATE;

BR, Patcharee
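As far as I know, CONCATENATE accepts only a fully specified partition, so each partition is altered separately (a sketch; the partition values are made up):

    ALTER TABLE TestHive PARTITION (a=1, b=1, c=2, d=1) CONCATENATE;
    ALTER TABLE TestHive PARTITION (a=1, b=1, c=2, d=2) CONCATENATE;

The statements can be generated from the output of SHOW PARTITIONS TestHive; and run as a script.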
How to use KryoSerializer: ClassNotFoundException
Hi, I am using Spark 1.4. I wanted to serialize with KryoSerializer but got a ClassNotFoundException. The configuration and exception are below. When I submitted the job I also provided --jars mylib.jar, which contains WRFVariableZ.

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[WRFVariableZ]))

Exception in thread "main" org.apache.spark.SparkException: Failed to register classes with Kryo
 at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:114)
Caused by: java.lang.ClassNotFoundException: no.uni.computing.io.WRFVariableZ

How can I configure it? BR, Patcharee
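One thing worth checking (a sketch of a workaround, not a verified fix): --jars ships the jar to the executors, but the class also has to be on the driver's own classpath at the point where Kryo registers it. Registering by name and adding the jar to the driver classpath avoids referencing classOf in the driver at all:

    spark-submit \
      --jars mylib.jar \
      --driver-class-path mylib.jar \
      --conf spark.kryo.classesToRegister=no.uni.computing.io.WRFVariableZ \
      myapp.jar   # application jar; name made up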
memory needed for each executor
Hi, How can I estimate the memory each executor (with one core) needs to execute a job? If there are many cores per executor, is the requirement simply (memory needed by a one-core executor) * (number of cores)? Any suggestions/guidelines? BR, Patcharee
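As a rough model (a sketch, not an official formula): an executor's heap is shared by all of its concurrently running tasks, so multiplying a per-task estimate by the core count is a reasonable starting point, plus a fixed per-JVM overhead. For example, on YARN:

    # 4 concurrent tasks sharing a 4 GB heap: roughly 1 GB of heap per task.
    # The YARN container additionally reserves spark.yarn.executor.memoryOverhead
    # (off-heap) on top of --executor-memory.
    spark-submit --executor-cores 4 --executor-memory 4g ...

In practice the per-task need depends on partition size, so increasing the number of partitions is often easier than growing executors.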
Re: DataFrame write: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
I got it. Thanks! Patcharee

On 13/06/15 23:00, Will Briggs wrote:
The context that is created by spark-shell is actually an instance of HiveContext. If you want to use it programmatically in your driver, you need to make sure that your context is a HiveContext, and not a SQLContext. https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables Hope this helps, Will

On June 13, 2015, at 3:36 PM, pth001 <patcharee.thong...@uni.no> wrote:
Hi, I am using Spark 1.4. I try to insert data into a Hive table (in ORC format) from a DataFrame:

partitionedTestDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .partitionBy("zone", "z", "year", "month")
  .saveAsTable("testorc")

When this job is submitted by spark-submit I get:

Exception in thread "main" java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

But the job works fine in spark-shell. What can be wrong? BR, Patcharee
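A minimal sketch of the fix Will describes, for a standalone driver (the app name is made up; spark-shell does this implicitly):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf().setAppName("OrcWriter")
    val sc = new SparkContext(conf)
    // HiveContext, not SQLContext: saveAsTable on a persistent (non-temporary)
    // table goes through the Hive metastore, which a plain SQLContext cannot reach.
    val sqlContext = new HiveContext(sc)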
DataFrame write: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
Hi, I am using Spark 1.4. I try to insert data into a Hive table (in ORC format) from a DataFrame:

partitionedTestDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .partitionBy("zone", "z", "year", "month")
  .saveAsTable("testorc")

When this job is submitted by spark-submit I get:

Exception in thread "main" java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

But the job works fine in spark-shell. What can be wrong? BR, Patcharee
ERROR 2135: Received error from store function. Premature EOF: no length prefix available
Hi, My Pig job on Tez (storing a dataset into a partitioned Hive table) throws the following exception. What can be wrong? How can I fix it?

2015-06-09 10:59:57,268 ERROR [TezChild] runtime.PigProcessor: Encountered exception while processing:
org.apache.pig.backend.executionengine.ExecException: ERROR 2135: Received error from store function. Premature EOF: no length prefix available
 at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:141)
 at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:316)
 at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:195)
 at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
 at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
 at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
 at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: Premature EOF: no length prefix available
 at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2208)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1440)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1362)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:589)

BR, Patcharee
filter by query result
Hi, I am new to Pig. First I queried a Hive table (x = LOAD 'x' USING org.apache.hive.hcatalog.pig.HCatLoader();) and got a single record/value. How can I use this single value to filter in another query? I hope to get better performance by filtering as early as possible. BR, Patcharee
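One way that may fit is Pig's scalar projection, where a relation known to hold a single record can be referenced like a scalar inside another relation's expression (a sketch; the table, field and alias names are made up):

    x = LOAD 'x' USING org.apache.hive.hcatalog.pig.HCatLoader();
    -- x holds exactly one record with a field named maxval
    big = LOAD 'big' USING org.apache.hive.hcatalog.pig.HCatLoader();
    filtered = FILTER big BY value > (double) x.maxval;  -- scalar projection

Pig fails at runtime if x turns out to contain more than one record.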
create a pipeline
Hi, How can I create a pipeline (containing a sequence of Pig scripts)? BR, Patcharee
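A lightweight option is a driver script that invokes the stages with exec, which runs each script to completion in batch mode (a sketch; the script names and parameter are made up):

    -- pipeline.pig
    exec -param run_date=20150601 stage1_extract.pig
    exec -param run_date=20150601 stage2_aggregate.pig

Using run instead of exec executes a script in the current Grunt session, so later stages can see earlier aliases. For scheduling and dependency handling beyond a simple sequence, an external workflow manager such as Oozie is the usual choice.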