sparkR 3rd library

2017-09-04 Thread patcharee

Hi,

I am using spark.lapply to execute an existing R script in standalone 
mode. The script calls the function 'rbga' from a 3rd-party library, 'genalg'. 
This rbga function works fine in the SparkR environment when I call it directly, but 
when I call it through spark.lapply I get the error


could not find function "rbga"
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala

Any ideas/suggestions?

BR, Patcharee





simple application on tez + llap

2017-02-24 Thread Patcharee Thongtra

Hi,

I found an example of simple applications, like wordcount, running on Tez - 
https://github.com/apache/tez/tree/master/tez-examples/src/main/java/org/apache/tez/examples. 
However, how can I run this on Tez + LLAP? Any suggestions?


BR,

Patcharee



Re: import sql file

2016-11-23 Thread patcharee

I exported a SQL table into a .sql file and would like to import it into Hive.

Best, Patcharee

On 23. nov. 2016 10:40, Markovitz, Dudu wrote:

Hi Patcharee
The question is not clear.

Dudu

-Original Message-
From: patcharee [mailto:patcharee.thong...@uni.no]
Sent: Wednesday, November 23, 2016 11:37 AM
To: user@hive.apache.org
Subject: import sql file

Hi,

How can I import a .sql file into Hive?

Best, Patcharee





import sql file

2016-11-23 Thread patcharee

Hi,

How can I import a .sql file into Hive?

Best, Patcharee



Re: hiveserver2 java heap space

2016-10-24 Thread Patcharee Thongtra

It works on Hive cli

Patcharee

On 10/24/2016 11:51 AM, Mich Talebzadeh wrote:

does this work ok through Hive cli?

Dr Mich Talebzadeh

LinkedIn 
/https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/


http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk.Any and all responsibility for 
any loss, damage or destruction of data or any other property which 
may arise from relying on this email's technical content is explicitly 
disclaimed. The author will in no case be liable for any monetary 
damages arising from such loss, damage or destruction.



On 24 October 2016 at 10:43, Patcharee Thongtra 
<patcharee.thong...@uni.no <mailto:patcharee.thong...@uni.no>> wrote:


Hi,

I tried to query an ORC file via beeline and via a Java program using JDBC
("select * from orcfileTable limit 1"). Both failed with Caused
by: java.lang.OutOfMemoryError: Java heap space. The HiveServer2 heap
size is 1024m. I guess I need to increase this HiveServer2 heap
size? However, I wonder why I get this error, since I query just
ONE line. Any ideas?

Thanks,

Patcharee







hiveserver2 java heap space

2016-10-24 Thread Patcharee Thongtra

Hi,

I tried to query an ORC file via beeline and via a Java program using JDBC 
("select * from orcfileTable limit 1"). Both failed with Caused by: 
java.lang.OutOfMemoryError: Java heap space. The HiveServer2 heap size is 
1024m. I guess I need to increase this HiveServer2 heap size? However, I 
wonder why I get this error, since I query just ONE line. Any ideas?
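
For reference, a minimal sketch of the kind of JDBC client described above (Scala; the HiveServer2 host, port and credentials are hypothetical, not taken from this thread):

import java.sql.DriverManager

// Hypothetical HiveServer2 URL and credentials; the query mirrors the one above.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("select * from orcfileTable limit 1")
while (rs.next()) {
  println(rs.getString(1))
}
rs.close(); stmt.close(); conn.close()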


Thanks,

Patcharee




hiveserver2 GC overhead limit exceeded

2016-10-23 Thread patcharee

Hi,

I use beeline to connect to HiveServer2. I tested with a simple command 
and got a "GC overhead limit exceeded" error:


0: jdbc:hive2://service-10-1:10010/default> drop table testhivedrivertable;
Error: Error while processing statement: FAILED: Execution Error, return 
code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. GC overhead limit 
exceeded (state=08S01,code=1)


How can I solve this? And how can I tell whether the error comes from the 
client (beeline) or from HiveServer2?


Thanks,

Patcharee



Re: Spark DataFrame Plotting

2016-09-08 Thread patcharee

Hi Moon,

When I generate an extra column (so the schema becomes Index:Int, A:Double, 
B:Double), what SQL command generates a graph with 2 lines (Index as the 
X-axis, BOTH A and B on the Y-axis)? Do I need to group by?


Thanks!

Patcharee


On 07. sep. 2016 16:58, moon soo Lee wrote:
You will need to generate an extra column which can be used as the X-axis 
for columns A and B.
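
For reference, a sketch of that suggestion (not from this thread; Scala, assuming the dataframe is called df and using a hypothetical temp table name "plotme"), after which a %sql paragraph such as "select Index, A, B from plotme" can use Index as the key and A, B as values:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, LongType, StructField, StructType}

// Zip each row with its position to obtain an Index column, then expose the
// result as a temp table for the %sql paragraph.
val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row(idx, row.getDouble(0), row.getDouble(1))
}
val indexedSchema = StructType(Seq(
  StructField("Index", LongType),
  StructField("A", DoubleType),
  StructField("B", DoubleType)))
sqlContext.createDataFrame(indexedRows, indexedSchema).registerTempTable("plotme")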


On Wed, Sep 7, 2016 at 2:34 AM Abhisar Mohapatra 
<abhisar.mohapa...@inmobi.com <mailto:abhisar.mohapa...@inmobi.com>> 
wrote:


I am not sure, but can you try the groupBy function in
Zeppelin? If you can group by columns then I guess that would do
the trick, or pass the data to another HTML block and use a d3
comparison chart to plot it.

On Wed, Sep 7, 2016 at 2:56 PM, patcharee
<patcharee.thong...@uni.no <mailto:patcharee.thong...@uni.no>> wrote:

A normal select * gives me one column on the X-axis and another on
the Y-axis. I cannot get both A:Double and B:Double displayed on the
Y-axis. How can I do that?

Patcharee


On 07. sep. 2016 11:05, Abhisar Mohapatra wrote:

You can do a normal select * on the dataframe and it would be
automatically interpreted.

On Wed, Sep 7, 2016 at 2:29 PM, patcharee
<patcharee.thong...@uni.no
<mailto:patcharee.thong...@uni.no>> wrote:

Hi,

I have a dataframe with this schema A:Double, B:Double.
How can I plot this dataframe as two lines (comparing A
and B at each step)?

Best,

Patcharee












Re: Spark DataFrame Plotting

2016-09-07 Thread patcharee
A normal select * gives me one column on the X-axis and another on the Y-axis. I 
cannot get both A:Double and B:Double displayed on the Y-axis. How can I do that?


Patcharee


On 07. sep. 2016 11:05, Abhisar Mohapatra wrote:
You can do a normal select * on the dataframe and it would be 
automatically interpreted.


On Wed, Sep 7, 2016 at 2:29 PM, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:


Hi,

I have a dataframe with this schema A:Double, B:Double. How can I
plot this dataframe as two lines (comparing A and B at each step)?

    Best,

    Patcharee







Spark DataFrame Plotting

2016-09-07 Thread patcharee

Hi,

I have a dataframe with this schema A:Double, B:Double. How can I plot 
this dataframe as two lines (comparing A and B at each step)?


Best,

Patcharee



what contribute to Task Deserialization Time

2016-07-21 Thread patcharee

Hi,

I'm running a simple job (reading a sequence file and collecting data at 
the driver) in yarn-client mode. Looking at the history server 
UI, the Task Deserialization Time of tasks varies quite a lot (5 ms to 5 
s). What contributes to this Task Deserialization Time?


Thank you in advance!

Patcharee






Re: Failed to stream on Yarn cluster

2016-04-28 Thread patcharee

Hi again,

Actually it works well! I just realized from looking at the Yarn application 
log that the Flink streaming result is printed in taskmanager.out. When 
I sent the question to the mailing list I was looking at the screen where I 
issued the command, and there was no streaming result there.


Where is taskmanager.out and how can I change it?

Best,
Patcharee


On 28. april 2016 13:18, Maximilian Michels wrote:

Hi Patcharee,

What do you mean by "nothing happened"? There is no output? Did you
check the logs?

Cheers,
Max

On Thu, Apr 28, 2016 at 12:10 PM, patcharee <patcharee.thong...@uni.no> wrote:

Hi,

I am testing the streaming wiki example -
https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html

It works fine in local mode (mvn exec:java
-Dexec.mainClass=wikiedits.WikipediaAnalysis). But when I run the jar in
Yarn cluster mode, nothing happened. Any ideas? I tested the word count
example from an hdfs file on the Yarn cluster and it worked fine.

Best,
Patcharee






Failed to stream on Yarn cluster

2016-04-28 Thread patcharee

Hi,

I am testing the streaming wiki example - 
https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html


It works fine in local mode (mvn exec:java 
-Dexec.mainClass=wikiedits.WikipediaAnalysis). But when I run the jar in 
Yarn cluster mode, nothing happened. Any ideas? I tested the word count 
example from an hdfs file on the Yarn cluster and it worked fine.


Best,
Patcharee




Re: pyspark split pair rdd to multiple

2016-04-20 Thread patcharee

I can also use a dataframe. Any suggestions?

Best,
Patcharee

On 20. april 2016 10:43, Gourav Sengupta wrote:

Is there any reason why you are not using data frames?


Regards,
Gourav

On Tue, Apr 19, 2016 at 8:51 PM, pth001 <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:


Hi,

How can I split a pair RDD [K, V] into a map [K, Array(V)] efficiently
in PySpark?

    Best,
    Patcharee







Re: build r-intepreter

2016-04-14 Thread Patcharee Thongtra

Yes, I had not installed R. Stupid me.
Thanks for your guidance!

BR,
Patcharee

On 04/13/2016 08:23 PM, Eric Charles wrote:

Can you post the full stacktrace you have (look also at the log file)?
Did you install R on your machine?

SPARK_HOME is optional.


On 13/04/16 15:39, Patcharee Thongtra wrote:

Hi,

When I ran R notebook example, I got these errors in the logs:

- Caused by: org.apache.zeppelin.interpreter.InterpreterException:
sparkr is not responding

- Caused by: org.apache.thrift.transport.TTransportException

I did not config SPARK_HOME so far, and intended to use the embedded
spark for testing first.

BR,
Patcharee


On 04/13/2016 02:52 PM, Patcharee Thongtra wrote:

Hi,

I have been struggling with the R interpreter / SparkR interpreter. Is
the command below the right one to build Zeppelin with the R / SparkR
interpreter?

mvn clean package -Pspark-1.6 -Phadoop-2.6 -Pyarn -Ppyspark -Psparkr

BR,
Patcharee










executor running time vs getting result from jupyter notebook

2016-04-14 Thread Patcharee Thongtra

Hi,

I am running a Jupyter notebook - pyspark. I noticed from the history 
server UI that some tasks spend a lot of time on either

- executor running time
- getting result

But some tasks finish both steps very quickly. All tasks, however, have 
very similar input sizes.


What factors determine the time spent on these steps?

BR,
Patcharee




Re: build r-intepreter

2016-04-13 Thread Patcharee Thongtra

Hi,

When I ran R notebook example, I got these errors in the logs:

- Caused by: org.apache.zeppelin.interpreter.InterpreterException: 
sparkr is not responding


- Caused by: org.apache.thrift.transport.TTransportException

I did not config SPARK_HOME so far, and intended to use the embedded 
spark for testing first.


BR,
Patcharee


On 04/13/2016 02:52 PM, Patcharee Thongtra wrote:

Hi,

I have been struggling with the R interpreter / SparkR interpreter. Is 
the command below the right one to build Zeppelin with the R / SparkR 
interpreter?


mvn clean package -Pspark-1.6 -Phadoop-2.6 -Pyarn -Ppyspark -Psparkr

BR,
Patcharee








ExclamationTopology workers executors vs tasks

2016-03-01 Thread patcharee

Hi,

I am new to Storm. I am running the storm-starter example 
ExclamationTopology on a Storm cluster (version 0.10). The code snippet is 
below:


TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("word", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclamationBolt(), 3).shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclamationBolt(), 2).shuffleGrouping("exclaim1");


However, I do not understand why the Num executors and Num tasks of this 
topology (seen from the Storm UI) are both 18 (instead of 15)? Also, 
from the Storm UI, the Num executors and Num tasks of the Spout word and 
the Bolts exclaim1 and exclaim2 are 10, 3 and 2 respectively (the same as 
defined in the code).


Thanks,
Patcharee


kafka streaming topic partitions vs executors

2016-02-26 Thread patcharee

Hi,

I am working on a streaming application integrated with Kafka through the 
createDirectStream API. The application streams a topic which contains 10 
partitions (on Kafka) and executes with 10 executors (--num-executors 10). 
When it reads data from Kafka/ZooKeeper, Spark creates 10 tasks (the same 
as the topic's partitions). However, some executors are given more than one 
task and work on those tasks sequentially.


Why does Spark not distribute these 10 tasks across the 10 executors? How can 
I make it do that?
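
For context, a minimal sketch of the setup described above (Spark 1.x Kafka direct API; the broker list and topic name are hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-direct-example")
val ssc = new StreamingContext(conf, Seconds(1))
// One RDD partition (and hence one task) is created per Kafka partition,
// so a 10-partition topic yields 10 tasks per batch.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))
stream.foreachRDD { rdd =>
  // How these tasks map onto executors is up to the scheduler's locality
  // preferences; there is no fixed 1:1 partition-to-executor assignment.
  println(s"partitions in this batch: ${rdd.partitions.length}")
}
ssc.start()
ssc.awaitTermination()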


Thanks,
Patcharee




Re: streaming textFileStream problem - got only ONE line

2016-01-29 Thread patcharee

I moved them every interval to the monitored directory.

Patcharee

On 25. jan. 2016 22:30, Shixiong(Ryan) Zhu wrote:
Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", 
or write into it directly? `textFileStream` requires that files must 
be written to the monitored directory by "moving" them from another 
location within the same file system.
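
For reference, a small sketch of the "move into the monitored directory" pattern described above (Scala, Hadoop FileSystem API; the staging path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Write the file to a staging directory first, then rename (move) it into
// the monitored directory so textFileStream sees a complete file appear.
val fs = FileSystem.get(new Configuration())
val staged    = new Path("hdfs://helmhdfs/user/patcharee/staging/datetime_19617.txt")
val monitored = new Path("hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt")
fs.rename(staged, monitored)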


On Mon, Jan 25, 2016 at 6:30 AM, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:


Hi,

My streaming application is receiving data from file system and
just prints the input count every 1 sec interval, as the code below:

val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms))
val lines = ssc.textFileStream(args(0))
lines.count().print()

The problem is that sometimes the data received from ssc.textFileStream
is ONLY ONE line, while in fact there are multiple lines in the new
file found in that interval. See the log below, which shows three
intervals. In the 2nd interval, the new file is
    hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This
file contains 6288 lines, yet ssc.textFileStream returns ONLY ONE
line (the header).

Any ideas/suggestions what the problem is?


-
SPARK LOG

-

16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that
were older than 1453731011000 ms: 145373101 ms
16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that
were older than 1453731011000 ms:
16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms
16/01/25 15:11:12 INFO FileInputDStream: New files at time
1453731072000 ms:
    hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt
---
Time: 1453731072000 ms
---
6288

16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that
were older than 1453731012000 ms: 1453731011000 ms
16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that
were older than 1453731012000 ms:
16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms
16/01/25 15:11:13 INFO FileInputDStream: New files at time
1453731073000 ms:
    hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt
---
Time: 1453731073000 ms
---
1

16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that
were older than 1453731013000 ms: 1453731012000 ms
16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that
were older than 1453731013000 ms:
16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms
16/01/25 15:11:14 INFO FileInputDStream: New files at time
1453731074000 ms:
    hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt
---
Time: 1453731074000 ms
---
6288


Thanks,
Patcharee






Pyspark filter not empty

2016-01-29 Thread patcharee

Hi,

In PySpark, how can I filter rows where a column of a dataframe is not empty?

I tried:

dfNotEmpty = df.filter(df['msg']!='')

It did not work.

Thanks,
Patcharee




spark streaming input rate strange

2016-01-22 Thread patcharee

Hi,

I have a streaming application with
- a 1 sec batch interval
- data accepted from a simulation through a MulticastSocket

The simulation sends out data using multiple clients/threads every 1 sec 
interval. The input rate accepted by the streaming application looks strange.
- When clients = 10,000, the event rate rises to 10,000, stays at 
10,000 for a while and then drops to about 7000-8000.
- When clients = 20,000, the event rate rises to 20,000, stays at 
20,000 for a while and then drops to about 15000-17000. The same pattern.


Processing time is just about 400 ms.

Any ideas/suggestions?

Thanks,
Patcharee


visualize data from spark streaming

2016-01-20 Thread patcharee

Hi,

How can I visualize realtime data (in a graph/chart) from Spark Streaming? 
Any tools?


Best,
Patcharee




bad performance on PySpark - big text file

2015-12-08 Thread patcharee

Hi,

I am very new to PySpark. I have a PySpark app working on text files 
of different sizes (100M - 100G). Each task handles the 
same size of input split, but workers spend much longer on 
some input splits, especially when the input splits belong to a big 
file. See the log of these two input splits (check python.PythonRunner: 
Times: total ... )


15/12/08 07:37:15 INFO rdd.NewHadoopRDD: Input split: 
hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/budisansblog.blogspot.com.html:39728447488+134217728
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4335010, boot 
= -140, init = 282, finish = 4334868
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(125163) 
called with curMem=227636200, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_3_1772 stored as 
bytes in memory (estimated size 122.2 KB, free 3.8 GB)
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4, boot = 1, 
init = 0, finish = 3
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(126595) 
called with curMem=227761363, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_9_1772 stored as 
bytes in memory (estimated size 123.6 KB, free 3.8 GB)
15/12/08 08:49:30 INFO output.FileOutputCommitter: File Output Committer 
Algorithm version is 1
15/12/08 08:49:30 INFO datasources.DynamicPartitionWriterContainer: 
Using output committer class 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/08 08:49:30 INFO output.FileOutputCommitter: Saved output of task 
'attempt_201512080849_0002_m_001772_0' to 
hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512080849_0002_m_001772
15/12/08 08:49:30 INFO mapred.SparkHadoopMapRedUtil: 
attempt_201512080849_0002_m_001772_0: Committed
15/12/08 08:49:30 INFO executor.Executor: Finished task 1772.0 in stage 
2.0 (TID 1770). 16216 bytes result sent to driver



15/12/07 20:52:24 INFO rdd.NewHadoopRDD: Input split: 
hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/bcnn1wp.wordpress.com.html:1476395008+134217728
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 41776, boot = 
-425, init = 432, finish = 41769
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1434614) 
called with curMem=167647961, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_3_994 stored as 
bytes in memory (estimated size 1401.0 KB, free 3.9 GB)
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 40, boot = 
-20, init = 21, finish = 39
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1463477) 
called with curMem=169082575, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_9_994 stored as 
bytes in memory (estimated size 1429.2 KB, free 3.9 GB)
15/12/07 20:53:06 INFO output.FileOutputCommitter: File Output Committer 
Algorithm version is 1
15/12/07 20:53:06 INFO datasources.DynamicPartitionWriterContainer: 
Using output committer class 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/07 20:53:06 INFO output.FileOutputCommitter: Saved output of task 
'attempt_201512072053_0002_m_000994_0' to 
hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512072053_0002_m_000994
15/12/07 20:53:06 INFO mapred.SparkHadoopMapRedUtil: 
attempt_201512072053_0002_m_000994_0: Committed
15/12/07 20:53:06 INFO executor.Executor: Finished task 994.0 in stage 
2.0 (TID 990). 9386 bytes result sent to driver


Any suggestions please

Thanks,
Patcharee







Spark UI - Streaming Tab

2015-12-04 Thread patcharee

Hi,

We tried to get the streaming tab interface on the Spark UI - 
https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html


We tested on versions 1.5.1 and 1.6.0-snapshot, but there is no such interface for 
streaming applications at all. Any suggestions? Do we need to configure 
the history UI somehow to get this interface?


Thanks,
Patcharee




Spark applications metrics

2015-12-04 Thread patcharee

Hi

How can I see the summary of data read / write, shuffle read / write, 
etc. for a whole application, not per stage?


Thanks,
Patcharee




Re: Spark UI - Streaming Tab

2015-12-04 Thread patcharee

I ran streaming jobs, but no streaming tab appeared for those jobs.

Patcharee


On 04. des. 2015 18:12, PhuDuc Nguyen wrote:
I believe the "Streaming" tab is dynamic - it appears once you have a 
streaming job running, not when the cluster is simply up. It does not 
depend on 1.6 and has been in there since at least 1.0.


HTH,
Duc

On Fri, Dec 4, 2015 at 7:28 AM, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:


Hi,

We tried to get the streaming tab interface on Spark UI -

https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html

Tested on version 1.5.1, 1.6.0-snapshot, but no such interface for
streaming applications at all. Any suggestions? Do we need to
configure the history UI somehow to get such interface?

    Thanks,
Patcharee







Re: Spark Streaming - History UI

2015-12-02 Thread patcharee

I meant there is no streaming tab at all. It looks like I need version 1.6

Patcharee

On 02. des. 2015 11:34, Steve Loughran wrote:

The history UI doesn't update itself for live apps (SPARK-7889) -though I'm 
working on it

Are you trying to view a running streaming job?


On 2 Dec 2015, at 05:28, patcharee <patcharee.thong...@uni.no> wrote:

Hi,

On my history server UI, I cannot see "streaming" tab for any streaming jobs? I 
am using version 1.5.1. Any ideas?

Thanks,
Patcharee












Spark Streaming - History UI

2015-12-01 Thread patcharee

Hi,

On my history server UI, I cannot see the "streaming" tab for any streaming 
jobs. I am using version 1.5.1. Any ideas?


Thanks,
Patcharee




custom inputformat recordreader

2015-11-26 Thread Patcharee Thongtra

Hi,

In Python, how can I use a custom InputFormat / RecordReader?

Thanks,
Patcharee





data local read counter

2015-11-25 Thread Patcharee Thongtra

Hi,

Is there a counter for data-local reads? I assumed it was the locality 
level counter, but it seems not to be.


Thanks,
Patcharee




Re: query orc file by hive

2015-11-13 Thread patcharee

Hi,

It works with non-partitioned ORC, but does not work with (2-column) 
partitioned ORC.


Thanks,
Patcharee


On 09. nov. 2015 10:55, Elliot West wrote:

Hi,

You can create a table and point the location property to the folder 
containing your ORC file:


CREATE EXTERNAL TABLE orc_table (
  
)
STORED AS ORC
LOCATION '/hdfs/folder/containing/orc/file'
;


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Thanks - Elliot.

On 9 November 2015 at 09:44, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:


Hi,

How can I query an orc file (*.orc) by Hive? This orc file is
created by other apps, like spark, mr.

Thanks,
    Patcharee






Re: query orc file by hive

2015-11-13 Thread patcharee

Hi,

It works after I ran ALTER TABLE ... ADD PARTITION. Thanks!

My partitioned ORC file (directory) is created by Spark; therefore Hive 
is not aware of the partitions automatically.
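
For reference, a hedged sketch of registering such Spark-written partitions with the metastore, as suggested in the quoted reply below (Scala with a HiveContext; the table name, partition values and location are hypothetical):

// One ADD PARTITION statement per partition directory that Spark wrote,
// so the Hive metastore learns about it.
hiveContext.sql(
  """ALTER TABLE orc_table
    |ADD IF NOT EXISTS PARTITION (zone=1, z=1)
    |LOCATION '/hdfs/folder/containing/orc/file/zone=1/z=1'""".stripMargin)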


Best,
Patcharee

On 13. nov. 2015 13:08, Elliot West wrote:

Have you added the partitions to the meta store?

ALTER TABLE ... ADD PARTITION ...

If using Spark, I believe it has good support to do this automatically 
with the HiveContext, although I have not used it myself.


Elliot.

On Friday, 13 November 2015, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:


Hi,

It works with non-partitioned ORC, but does not work with (2-column)
partitioned ORC.

Thanks,
    Patcharee


On 09. nov. 2015 10:55, Elliot West wrote:

Hi,

You can create a table and point the location property to the
folder containing your ORC file:

CREATE EXTERNAL TABLE orc_table (

)
STORED AS ORC
LOCATION '/hdfs/folder/containing/orc/file'
;



https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Thanks - Elliot.

On 9 November 2015 at 09:44, patcharee <patcharee.thong...@uni.no
<javascript:_e(%7B%7D,'cvml','patcharee.thong...@uni.no');>> wrote:

Hi,

How can I query an orc file (*.orc) by Hive? This orc file is
created by other apps, like spark, mr.

Thanks,
Patcharee








query orc file by hive

2015-11-09 Thread patcharee

Hi,

How can I query an ORC file (*.orc) with Hive? This ORC file is created by 
other apps, like Spark or MR.


Thanks,
Patcharee


[jira] [Issue Comment Deleted] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-11-06 Thread patcharee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

patcharee updated SPARK-11087:
--
Comment: was deleted

(was: I found a scenario where the problem exists)

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name            data_type   comment
>
> date                  int
> hh                    int
> x                     int
> y                     int
> height                float
> u                     float
> v                     float
> w                     float
> ph                    float
> phb                   float
> t                     float
> p                     float
> pb                    float
> qvapor                float
> qgraup                float
> qnice                 float
> qnrain                float
> tke_pbl               float
> el_pbl                float
> qcloud                float
>
> # Partition Information
> # col_name            data_type   comment
>
> zone                  int
> z                     int
> year                  int
> month                 int
>
> # Detailed Table Information
> Database:             default
> Owner:                patcharee
> CreateTime:           Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:       UNKNOWN
> Protect Mode:         None
> Retention:            0
> Location:             hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:           EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL                TRUE
>   comment                 this table is imported from rwf_data/*/wrf/*
>   last_modified_by        patcharee
>   last_modified_time      1439806692
>   orc.compress            ZLIB
>   transient_lastDdlTime   1439806692
>
> # Storage Information
> SerDe Library:        org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:          org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:         org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:           No
> Num Buckets:          -1
> Bucket Columns:       []
>

[jira] [Issue Comment Deleted] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-11-06 Thread patcharee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

patcharee updated SPARK-11087:
--
Comment: was deleted

(was: Hi [~zzhan], the problem actually happens when I generates orc file by 
"saveAsTable()" method (because I need my orc file to be accessible through 
hive). See below>>

hive> create external table peopletable(name string, address string, phone 
string) partitioned by(age int) stored as orc location 
'/user/patcharee/peopletable';

On spark shell local mode>>
2501  sql("set hive.exec.dynamic.partition.mode=nonstrict")
2502  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
2503  case class Person(name: String, age: Int, address: String, phone: String)
2504   val records = (1 to 100).map { i => Person(s"name_$i", i, s"address_$i", 
s"phone_$i" ) }
2505  
sc.parallelize(records).toDF().write.format("orc").mode("Append").partitionBy("age").saveAsTable("peopletable")
2506  val people = sqlContext.read.format("orc").load("peopletable")
2507  people.registerTempTable("people")
2508  sqlContext.sql("SELECT * FROM people WHERE age = 20 and name = 
'name_20'").count

It is true that if the orc file is generated by "save()" method, the predicate 
will be generated. But it is not for the case "saveAsTable()" method.

[~zzhan] can you please suggest how to fix this?)

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone      int     
> z int 
> year  int 
> month int 
>
> # Detailed Table Information   
> Database: 

[jira] [Reopened] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-11-06 Thread patcharee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

patcharee reopened SPARK-11087:
---

I found a scenario where the problem exists

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone  int 
> z int 
> year  int     
> month int 
>
> # Detailed Table Information   
> Database: default  
> Owner:patcharee
> CreateTime:   Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D   
>  
> Table Type:   EXTERNAL_TABLE   
> Table Parameters:  
>   EXTERNALTRUE
>   comment this table is imported from rwf_data/*/wrf/*
>   last_modified_bypatcharee   
>   last_modified_time  1439806692  
>   orc.compressZLIB
>   transient_lastDdlTime   1439806692  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
> OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>  
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
>

[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-11-06 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993398#comment-14993398
 ] 

patcharee commented on SPARK-11087:
---

Hi [~zzhan], the problem actually happens when I generate the orc file by the 
"saveAsTable()" method (because I need my orc file to be accessible through 
hive). See below >>

hive> create external table peopletable(name string, address string, phone 
string) partitioned by(age int) stored as orc location 
'/user/patcharee/peopletable';

On spark shell local mode>>
2501  sql("set hive.exec.dynamic.partition.mode=nonstrict")
2502  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
2503  case class Person(name: String, age: Int, address: String, phone: String)
2504   val records = (1 to 100).map { i => Person(s"name_$i", i, s"address_$i", 
s"phone_$i" ) }
2505  
sc.parallelize(records).toDF().write.format("orc").mode("Append").partitionBy("age").saveAsTable("peopletable")
2506  val people = sqlContext.read.format("orc").load("peopletable")
2507  people.registerTempTable("people")
2508  sqlContext.sql("SELECT * FROM people WHERE age = 20 and name = 
'name_20'").count

It is true that if the orc file is generated by the "save()" method, the predicate 
will be generated. But that is not the case for the "saveAsTable()" method.

[~zzhan] can you please suggest how to fix this?

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone      int     
> z int 
> year  int 
> month int 
>
> # Detailed Table Information   
> Database: 

[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-11-06 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993771#comment-14993771
 ] 

patcharee commented on SPARK-11087:
---

Hi, I found a scenario where the predicate does not work again. [~zzhan] Can 
you please have a look? 

First create a hive table >>
hive> create table people(name string, address string, phone string) 
partitioned by(age int) stored as orc;

Then use spark shell local mode to insert data and then query >>
120  import org.apache.spark.sql.Row
121  import org.apache.spark.{SparkConf, SparkContext}
122  import org.apache.spark.sql.types._
123  import 
org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,FloatType}
124  sqlContext.setConf("hive.exec.dynamic.partition.mode","nonstrict")
125  val records = (1 to 10).map( i => Row(s"name_$i", s"address_$i", 
s"phone_$i", i ))
126  val schemaString = "name address phone age"
127  val schema = StructType(schemaString.split(" ").map(fieldName => if 
(fieldName.equals("age")) StructField(fieldName, IntegerType, true) else 
StructField(fieldName, StringType, true)))
128  val x = sc.parallelize(records)
129  val rDF = sqlContext.createDataFrame(x, schema)
130  
rDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("age").saveAsTable("people")
131  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
132  val people = 
sqlContext.read.format("orc").load("/user/hive/warehouse/people")
133  people.registerTempTable("people")
134  sqlContext.sql("SELECT * FROM people WHERE age = 3 and name = 
'name_3'").count

15/11/06 15:40:36 INFO HadoopRDD: Input split: 
hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0:0+453
15/11/06 15:40:36 DEBUG OrcInputFormat: No ORC pushdown predicate
15/11/06 15:40:36 DEBUG OrcInputFormat: No ORC pushdown predicate
15/11/06 15:40:36 INFO OrcRawRecordMerger: min key = null, max key = null
15/11/06 15:40:36 INFO OrcRawRecordMerger: min key = null, max key = null
15/11/06 15:40:36 INFO ReaderImpl: Reading ORC rows from 
hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0 with 
{include: [true, true, false, false], offset: 0, length: 9223372036854775807}
15/11/06 15:40:36 INFO ReaderImpl: Reading ORC rows from 
hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0 with 
{include: [true, true, false, false], offset: 0, length: 9223372036854775807}
15/11/06 15:40:36 INFO RecordReaderFactory: Schema is not specified on read. 
Using file schema.
15/11/06 15:40:36 INFO RecordReaderFactory: Schema is not specified on read. 
Using file schema.
15/11/06 15:40:36 DEBUG RecordReaderImpl: chunks = [range start: 111 end: 126]
15/11/06 15:40:36 DEBUG RecordReaderImpl: chunks = [range start: 111 end: 126]
15/11/06 15:40:36 DEBUG RecordReaderImpl: merge = [data range [111, 126), size: 
15 type: array-backed]
15/11/06 15:40:36 DEBUG RecordReaderImpl: merge = [data range [111, 126), size: 
15 type: array-backed]
15/11/06 15:40:36 INFO GeneratePredicate: Code generated in 5.063287 ms



> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int  

[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-11-06 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993771#comment-14993771
 ] 

patcharee edited comment on SPARK-11087 at 11/6/15 2:56 PM:


Hi, I found a scenario where the predicate does not work again. [~zzhan] Can 
you please have a look? 

First create a hive table >>
hive> create table people(name string, address string, phone string) 
partitioned by(age int) stored as orc;

Then use spark shell local mode to insert data and then query >>
120  import org.apache.spark.sql.Row
121  import org.apache.spark.{SparkConf, SparkContext}
122  import org.apache.spark.sql.types._
123  import 
org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,FloatType}
124  sqlContext.setConf("hive.exec.dynamic.partition.mode","nonstrict")
125  val records = (1 to 10).map( i => Row(s"name_$i", s"address_$i", 
s"phone_$i", i ))
126  val schemaString = "name address phone age"
127  val schema = StructType(schemaString.split(" ").map(fieldName => if 
(fieldName.equals("age")) StructField(fieldName, IntegerType, true) else 
StructField(fieldName, StringType, true)))
128  val x = sc.parallelize(records)
129  val rDF = sqlContext.createDataFrame(x, schema)
130  
rDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("age").saveAsTable("people")
131  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
132  val people = 
sqlContext.read.format("orc").load("/user/hive/warehouse/people")
133  people.registerTempTable("people")
134  sqlContext.sql("SELECT * FROM people WHERE age = 3 and name = 
'name_3'").count

Below is a part of the log message from the last command >>
15/11/06 15:40:36 INFO HadoopRDD: Input split: 
hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0:0+453
15/11/06 15:40:36 DEBUG OrcInputFormat: No ORC pushdown predicate
15/11/06 15:40:36 DEBUG OrcInputFormat: No ORC pushdown predicate
15/11/06 15:40:36 INFO OrcRawRecordMerger: min key = null, max key = null
15/11/06 15:40:36 INFO OrcRawRecordMerger: min key = null, max key = null
15/11/06 15:40:36 INFO ReaderImpl: Reading ORC rows from 
hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0 with 
{include: [true, true, false, false], offset: 0, length: 9223372036854775807}
15/11/06 15:40:36 INFO ReaderImpl: Reading ORC rows from 
hdfs://localhost:9000/user/hive/warehouse/people/age=3/part-0 with 
{include: [true, true, false, false], offset: 0, length: 9223372036854775807}
15/11/06 15:40:36 INFO RecordReaderFactory: Schema is not specified on read. 
Using file schema.
15/11/06 15:40:36 INFO RecordReaderFactory: Schema is not specified on read. 
Using file schema.
15/11/06 15:40:36 DEBUG RecordReaderImpl: chunks = [range start: 111 end: 126]
15/11/06 15:40:36 DEBUG RecordReaderImpl: chunks = [range start: 111 end: 126]
15/11/06 15:40:36 DEBUG RecordReaderImpl: merge = [data range [111, 126), size: 
15 type: array-backed]
15/11/06 15:40:36 DEBUG RecordReaderImpl: merge = [data range [111, 126), size: 
15 type: array-backed]
15/11/06 15:40:36 INFO GeneratePredicate: Code generated in 5.063287 ms




was (Author: patcharee):
Hi, I found a scenario where the predicate does not work again. [~zzhan] Can 
you please have a look? 

First create a hive table >>
hive> create table people(name string, address string, phone string) 
partitioned by(age int) stored as orc;

Then use spark shell local mode to insert data and then query >>
120  import org.apache.spark.sql.Row
121  import org.apache.spark.{SparkConf, SparkContext}
122  import org.apache.spark.sql.types._
123  import 
org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,FloatType}
124  sqlContext.setConf("hive.exec.dynamic.partition.mode","nonstrict")
125  val records = (1 to 10).map( i => Row(s"name_$i", s"address_$i", 
s"phone_$i", i ))
126  val schemaString = "name address phone age"
127  val schema = StructType(schemaString.split(" ").map(fieldName => if 
(fieldName.equals("age")) StructField(fieldName, IntegerType, true) else 
StructField(fieldName, StringType, true)))
128  val x = sc.parallelize(records)
129  val rDF = sqlContext.createDataFrame(x, schema)
130  
rDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("age").saveAsTable("people")
131  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
132  val people = 
sqlContext.read.format("orc").load("/user/hive/warehouse/people")
133  people.regist

How to run parallel on each DataFrame group

2015-11-05 Thread patcharee

Hi,

I need suggestions on my code. I would like to split a DataFrame (rowDF) 
by a column (depth) into groups, then sort each group, repartition, and 
save the output of each group into one file. See the code below:


val rowDF = sqlContext.createDataFrame(rowRDD, schema).cache()
for (i <- 0 to 16) {
  val filterDF = rowDF.filter("depth=" + i)
  val finalDF = filterDF.sort("xy").coalesce(1)
  finalDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .partitionBy("depth")
    .saveAsTable(args(3))
}

The problem is that the groups, after filtering, are handled one by one. How 
can I change the code so that the groups run in parallel?


I looked at groupBy, but it seems to be only for aggregation.
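
One common approach (not from this thread) is to submit the per-group jobs from separate driver threads, since the Spark scheduler accepts concurrent job submissions; a sketch reusing the names from the code above:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future submits one independent Spark job; the scheduler can then run
// them concurrently, subject to available executors.
// Note: concurrent appends to the same Hive table may need extra care with
// the metastore.
val jobs = (0 to 16).map { i =>
  Future {
    rowDF.filter("depth=" + i)
      .sort("xy")
      .coalesce(1)
      .write.format("org.apache.spark.sql.hive.orc.DefaultSource")
      .mode(org.apache.spark.sql.SaveMode.Append)
      .partitionBy("depth")
      .saveAsTable(args(3))
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)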

Thanks,
Patcharee




execute native system commands in Spark

2015-11-02 Thread patcharee

Hi,

Is it possible to execute native system commands (in parallel) in Spark, 
e.g. via scala.sys.process?
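
For reference, a minimal sketch of running shell commands inside tasks with scala.sys.process (the commands here are just hypothetical examples):

import scala.sys.process._

// Each element becomes one task; the command runs on whichever executor
// receives the task, and the captured stdout is returned to the driver.
val commands = Seq("hostname", "uname -a", "df -h")
val outputs = sc.parallelize(commands, commands.size)
  .map(cmd => Seq("bash", "-c", cmd).!!)
  .collect()
outputs.foreach(println)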


Best,
Patcharee




Min-Max Index vs Bloom filter

2015-11-02 Thread patcharee

Hi,

For the ORC format, in which scenarios is a bloom filter better than the 
min-max index?
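
For context, ORC writes min-max column statistics for every column by default, while bloom filters have to be requested per column; a hedged sketch of the relevant table properties (Scala with a HiveContext; table and column names hypothetical):

// Hypothetical table: request a bloom filter only on the column used in
// selective equality predicates (here user_id).
hiveContext.sql(
  """CREATE TABLE events (id BIGINT, user_id STRING, amount DOUBLE)
    |STORED AS ORC
    |TBLPROPERTIES ("orc.bloom.filter.columns" = "user_id",
    |               "orc.bloom.filter.fpp" = "0.05")""".stripMargin)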


Best,
Patcharee


[jira] [Closed] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-23 Thread patcharee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

patcharee closed SPARK-11087.
-
Resolution: Not A Problem

The predicate is indeed generated and can be found in the executor log

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone  int 
> z int 
> year  int     
> month int 
>
> # Detailed Table Information   
> Database: default  
> Owner:patcharee
> CreateTime:   Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D   
>  
> Table Type:   EXTERNAL_TABLE   
> Table Parameters:  
>   EXTERNALTRUE
>   comment this table is imported from rwf_data/*/wrf/*
>   last_modified_bypatcharee   
>   last_modified_time  1439806692  
>   orc.compressZLIB
>   transient_lastDdlTime   1439806692  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
> OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>  
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
>

[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-23 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970786#comment-14970786
 ] 

patcharee commented on SPARK-11087:
---

[~zzhan] I found the predicate generated in the executor log for the case using 
dataframe (not hiveContext.sql). Sorry for my mistake, and thanks for your help!

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone  int 
> z int 
> year  int     
> month int 
>
> # Detailed Table Information   
> Database: default  
> Owner:patcharee
> CreateTime:   Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D   
>  
> Table Type:   EXTERNAL_TABLE   
> Table Parameters:  
>   EXTERNALTRUE
>   comment this table is imported from rwf_data/*/wrf/*
>   last_modified_bypatcharee   
>   last_modified_time  1439806692  
>   orc.compressZLIB
>   transient_lastDdlTime   1439806692  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
> OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>  
> Compressed:   No  

the column names removed after insert select

2015-10-23 Thread patcharee

Hi

I inserted into a table from a select (insert into table newtable select date, 
hh, x, y from oldtable). After the insert the column names of the table 
have been removed; see the output below from hive --orcfiledump:


- Type: struct<_col0:int,_col1:int,_col2:int,_col3:int>

while it is supposed to be

- Type: struct<date:int,hh:int,x:int,y:int>

Any ideas how this happened and how I can fix it? Please advise.

BR,
Patcharee


the number of files after merging

2015-10-22 Thread patcharee

Hi,

I am using the alter command below to merge a partitioned ORC file, one 
partition at a time:


alter table X partition(zone=1,z=1,year=2009,month=1) CONCATENATE;

- How can I control the number of files after merging? I would like to 
get only one file per partition.

- Is it possible to concatenate the whole table, rather than one partition at a time?

Thanks,
Patcharee



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-21 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967661#comment-14967661
 ] 

patcharee commented on SPARK-11087:
---

Hi [~zzhan]

What versions of Hive and the ORC file format are you using? Can I get your Hive 
configuration file?


> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_namedata_type   comment 
>
> date  int 
> hhint 
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_namedata_type   comment 
>
> zone  int 
> z int 
> year  int     
> month int 
>
> # Detailed Table Information   
> Database: default  
> Owner:patcharee
> CreateTime:   Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D   
>  
> Table Type:   EXTERNAL_TABLE   
> Table Parameters:  
>   EXTERNALTRUE
>   comment this table is imported from rwf_data/*/wrf/*
>   last_modified_bypatcharee   
>   last_modified_time  1439806692  
>   orc.compressZLIB
>   transient_lastDdlTime   1439806692  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
> OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>  
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Colum

[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-18 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960296#comment-14960296
 ] 

patcharee edited comment on SPARK-11087 at 10/19/15 3:34 AM:
-

[~zzhan]

Below is my test, please check. I also tried changing 
"hive.exec.orc.split.strategy", but none of the settings produced an "OrcInputFormat 
[INFO] ORC pushdown predicate" line like in your result.

2508  case class Contact(name: String, phone: String)
2509  case class Person(name: String, age: Int, contacts: Seq[Contact])
2510  val records = (1 to 100).map { i => Person(s"name_$i", i, (0 to 1).map { 
m => Contact(s"contact_$m", s"phone_$m") } )
2511  }
2512  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
2513  
sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned")
2514  val peoplePartitioned = 
sqlContext.read.format("orc").load("peoplePartitioned")
2515   peoplePartitioned.registerTempTable("peoplePartitioned")

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL")
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 INFO PerfLogger: 
res5: Long = 1

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI")
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 INFO PerfLogger: 
15/10/16 09:11:19 INFO PerfLogger: 
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peo

[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-16 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960296#comment-14960296
 ] 

patcharee commented on SPARK-11087:
---

[~zhazhan]

Below is my test, please check. I also tried changing 
"hive.exec.orc.split.strategy", but none of the settings produced an "OrcInputFormat 
[INFO] ORC pushdown predicate" line like in your result.

2508  case class Contact(name: String, phone: String)
2509  case class Person(name: String, age: Int, contacts: Seq[Contact])
2510  val records = (1 to 100).map { i => Person(s"name_$i", i, (0 to 1).map { 
m => Contact(s"contact_$m", s"phone_$m") } )
2511  }
2512  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
2513  
sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned")
2514  val peoplePartitioned = 
sqlContext.read.format("orc").load("peoplePartitioned")
2515   peoplePartitioned.registerTempTable("peoplePartitioned")

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL")
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 INFO PerfLogger: 
res5: Long = 1

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI")
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 INFO PerfLogger: 
15/10/16 09:11:19 INFO PerfLogger: 
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7

Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra

Hi Zhan Zhang,

Could my problem (the ORC predicate is not generated from the WHERE clause 
even though spark.sql.orc.filterPushdown=true) be related to any of the 
factors below?


- orc file version (File Version: 0.12 with HIVE_8732)
- hive version (using Hive 1.2.1.2.3.0.0-2557)
- orc table is not sorted / indexed
- the split strategy hive.exec.orc.split.strategy

BR,
Patcharee


On 10/09/2015 08:01 PM, Zhan Zhang wrote:
That is weird. Unfortunately, there is no debug info available on this 
part. Can you please open a JIRA to add some debug information on the 
driver side?


Thanks.

Zhan Zhang

On Oct 9, 2015, at 10:22 AM, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:


I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"). 
But from the log No ORC pushdown predicate for my query with WHERE 
clause.


15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I did not understand what wrong with this.

BR,
Patcharee

On 09. okt. 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate is 
right based on your setting, : leaf-0 = (EQUALS x 320)


The right way is to enable the predicate pushdown as follows.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true”)

Thanks.

Zhan Zhang







On Oct 9, 2015, at 9:58 AM, patcharee <patcharee.thong...@uni.no> wrote:


Hi Zhan Zhang

Actually my query has WHERE clause "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z 
from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and z 
>= 2 and z <= 8", column "x", "y" is not partition column, the 
others are partition columns. I expected the system will use 
predicate pushdown. I turned on the debug and found pushdown 
predicate was not generated ("DEBUG OrcInputFormat: No ORC pushdown 
predicate")


Then I tried to set the search argument explicitly (on the column 
"x" which is not partition column)


   val xs = 
SearchArgumentFactory.newBuilder().startAnd().equals("x", 
320).end().build()

   hiveContext.setConf("hive.io.file.readcolumn.names", "x")
   hiveContext.setConf("sarg.pushdown", xs.toKryo())

this time in the log pushdown predicate was generated but results 
was wrong (no results at all)


15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: 
leaf-0 = (EQUALS x 320)

expr = leaf-0

Any ideas What wrong with this? Why the ORC pushdown predicate is 
not applied by the system?


BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:

Hi Patcharee,

From the query, it looks like only the column pruning will be 
applied. Partition pruning and predicate pushdown does not have 
effect. Do you see big IO difference between two methods?


The potential reason of the speed difference I can think of may be 
the different versions of OrcInputFormat. The hive path may use 
NewOrcInputFormat, but the spark path use OrcInputFormat.


Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no> 
wrote:


Yes, the predicate pushdown is enabled, but still take longer 
time than the first method


BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:

Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee 
<patcharee.thong...@uni.no> wrote:



Hi,

I am using spark sql 1.5 to query a hive table stored as 
partitioned orc file. We have the total files is about 6000 
files and each file size is about 245MB.


What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from orc file, register temp table and query from 
the temp table


val c = 
hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")

c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (query all from the total 
6000 files) , the second case is much slower then the first 
one. Any ideas why?


BR,




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


















orc table with sorted field

2015-10-13 Thread Patcharee Thongtra

Hi,

How can I create a partitioned ORC table with sorted field(s)? I tried 
to use the SORTED BY keyword, but it failed with a parse exception:


CREATE TABLE peoplesort (name string, age int) partition by (bddate int) 
SORTED BY (age) stored as orc


Is it possible to have some sorted columns? From the Hive DDL page, it seems 
only bucketed tables can be sorted.


Any suggestions please

BR,
Patcharee
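
For reference, Hive ties SORTED BY to bucketing, so a statement along the following 
lines should at least parse; the bucket count of 4 is an arbitrary assumption, and 
older Hive releases may also need hive.enforce.bucketing / hive.enforce.sorting set to 
true for inserts to respect the declaration. A minimal sketch, issued through a 
HiveContext to stay consistent with the other snippets in these threads:

// Sorted data in Hive requires CLUSTERED BY ... SORTED BY ... INTO n BUCKETS.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("""
  CREATE TABLE peoplesort (name string, age int)
  PARTITIONED BY (bddate int)
  CLUSTERED BY (age) SORTED BY (age ASC) INTO 4 BUCKETS
  STORED AS ORC
""")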



Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra

Hi Zhan Zhang,

Here is the issue https://issues.apache.org/jira/browse/SPARK-11087

BR,
Patcharee

On 10/13/2015 06:47 PM, Zhan Zhang wrote:

Hi Patcharee,

I am not sure which side is wrong, driver or executor. If it is 
executor side, the reason you mentioned may be possible. But if the 
driver side didn’t set the predicate at all, then somewhere else is 
broken.


Can you please file a JIRA with a simple reproduce step, and let me 
know the JIRA number?


Thanks.

Zhan Zhang

On Oct 13, 2015, at 1:01 AM, Patcharee Thongtra 
<patcharee.thong...@uni.no <mailto:patcharee.thong...@uni.no>> wrote:



Hi Zhan Zhang,

Is my problem (which is ORC predicate is not generated from WHERE 
clause even though spark.sql.orc.filterPushdown=true) can be related 
to some factors below ?


- orc file version (File Version: 0.12 with HIVE_8732)
- hive version (using Hive 1.2.1.2.3.0.0-2557)
- orc table is not sorted / indexed
- the split strategy hive.exec.orc.split.strategy

BR,
Patcharee


On 10/09/2015 08:01 PM, Zhan Zhang wrote:
That is weird. Unfortunately, there is no debug info available on 
this part. Can you please open a JIRA to add some debug information 
on the driver side?


Thanks.

Zhan Zhang

On Oct 9, 2015, at 10:22 AM, patcharee <patcharee.thong...@uni.no> 
wrote:


I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"). 
But from the log No ORC pushdown predicate for my query with WHERE 
clause.


15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I did not understand what wrong with this.

BR,
Patcharee

On 09. okt. 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate 
is right based on your setting, : leaf-0 = (EQUALS x 320)


The right way is to enable the predicate pushdown as follows.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true”)

Thanks.

Zhan Zhang







On Oct 9, 2015, at 9:58 AM, patcharee <patcharee.thong...@uni.no> 
wrote:



Hi Zhan Zhang

Actually my query has WHERE clause "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z 
from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and 
z >= 2 and z <= 8", column "x", "y" is not partition column, the 
others are partition columns. I expected the system will use 
predicate pushdown. I turned on the debug and found pushdown 
predicate was not generated ("DEBUG OrcInputFormat: No ORC 
pushdown predicate")


Then I tried to set the search argument explicitly (on the column 
"x" which is not partition column)


   val xs = 
SearchArgumentFactory.newBuilder().startAnd().equals("x", 
320).end().build()

   hiveContext.setConf("hive.io.file.readcolumn.names", "x")
   hiveContext.setConf("sarg.pushdown", xs.toKryo())

this time in the log pushdown predicate was generated but results 
was wrong (no results at all)


15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: 
leaf-0 = (EQUALS x 320)

expr = leaf-0

Any ideas What wrong with this? Why the ORC pushdown predicate is 
not applied by the system?


BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:

Hi Patcharee,

From the query, it looks like only the column pruning will be 
applied. Partition pruning and predicate pushdown does not have 
effect. Do you see big IO difference between two methods?


The potential reason of the speed difference I can think of may 
be the different versions of OrcInputFormat. The hive path may 
use NewOrcInputFormat, but the spark path use OrcInputFormat.


Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee 
<patcharee.thong...@uni.no> wrote:


Yes, the predicate pushdown is enabled, but still take longer 
time than the first method


BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:

Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee 
<patcharee.thong...@uni.no> wrote:



Hi,

I am using spark sql 1.5 to query a hive table stored as 
partitioned orc file. We have the total files is about 6000 
files and each file size is about 245MB.


What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from orc file, register temp table and query from 
the temp table


val c = 
hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")

c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (query all from the total 
6000 files) , the second case is much slower then the first 
one. Any ideas why?


BR,




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






















[jira] [Created] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-13 Thread patcharee (JIRA)
patcharee created SPARK-11087:
-

 Summary: spark.sql.orc.filterPushdown does not work, No ORC 
pushdown predicate
 Key: SPARK-11087
 URL: https://issues.apache.org/jira/browse/SPARK-11087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
 Environment: orc file version 0.12 with HIVE_8732
hive version 1.2.1.2.3.0.0-2557
Reporter: patcharee
Priority: Minor


I have an external hive table stored as a partitioned ORC file (see the table 
schema below). I tried to query the table with a where clause:

hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")). 

But from the log file with debug logging level on, the ORC pushdown predicate 
was not generated. 

Unfortunately my table was not sorted when I inserted the data, but I expected 
the ORC pushdown predicate to be generated anyway (because of the where clause).

Table schema

hive> describe formatted 4D;
OK
# col_name  data_type   comment 
 
dateint 
hh  int 
x   int 
y   int 
height  float   
u   float   
v   float   
w   float   
ph  float   
phb float   
t   float   
p   float   
pb  float   
qvapor  float   
qgraup  float   
qnice   float   
qnrain  float   
tke_pbl float   
el_pbl  float   
qcloud  float   
 
# Partition Information  
# col_name  data_type   comment 
 
zoneint 
z   int 
yearint 
month   int 
 
# Detailed Table Information 
Database:   default      
Owner:  patcharee
CreateTime: Thu Jul 09 16:46:54 CEST 2015
LastAccessTime: UNKNOWN  
Protect Mode:   None 
Retention:  0
Location:   hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D   
 
Table Type: EXTERNAL_TABLE   
Table Parameters:
EXTERNALTRUE
comment this table is imported from rwf_data/*/wrf/*
    last_modified_bypatcharee   
last_modified_time  1439806692  
orc.compressZLIB
transient_lastDdlTime   1439806692  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
 
Compressed: No   
Num Buckets:-1   
Bucket Columns: []   
Sort Columns:   []   
Storage Desc Params: 
serialization.format1   
Time taken: 0.388 seconds, Fetched: 58 row(s)



Data was inserted into this table by another spark job>

df.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("4D")




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-

Re: sql query orc slow

2015-10-09 Thread patcharee
Yes, the predicate pushdown is enabled, but it still takes longer than 
the first method.


BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:

Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no> wrote:


Hi,

I am using spark sql 1.5 to query a hive table stored as partitioned orc file. 
We have the total files is about 6000 files and each file size is about 245MB.

What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from orc file, register temp table and query from the temp table

val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (query all from the total 6000 files) , the 
second case is much slower then the first one. Any ideas why?

BR,




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
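
For reference, a minimal sketch of the second access pattern discussed in this thread 
with the pushdown flag set before the load; this is only an illustration of the ordering, 
not a guarantee that the predicate will then appear in the log (the filter on col1 is a 
hypothetical non-partition-column condition):

// Enable ORC filter pushdown before reading, then check the OrcInputFormat debug output.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
c.registerTempTable("regTable")
// With debug logging on, a generated predicate is reported as
// "ORC pushdown predicate: leaf-0 = ..." instead of "No ORC pushdown predicate".
hiveContext.sql("select col1, col2 from regTable where col1 = 320").count()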



Re: sql query orc slow

2015-10-09 Thread patcharee
I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"). But 
the log shows No ORC pushdown predicate for my query with a WHERE clause.


15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I do not understand what is wrong with this.

BR,
Patcharee

On 09. okt. 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate is 
right based on your setting, : leaf-0 = (EQUALS x 320)


The right way is to enable the predicate pushdown as follows.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true”)

Thanks.

Zhan Zhang







On Oct 9, 2015, at 9:58 AM, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:



Hi Zhan Zhang

Actually my query has WHERE clause "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 
4D where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 
and z <= 8", column "x", "y" is not partition column, the others are 
partition columns. I expected the system will use predicate pushdown. 
I turned on the debug and found pushdown predicate was not generated 
("DEBUG OrcInputFormat: No ORC pushdown predicate")


Then I tried to set the search argument explicitly (on the column "x" 
which is not partition column)


   val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 
320).end().build()

   hiveContext.setConf("hive.io.file.readcolumn.names", "x")
   hiveContext.setConf("sarg.pushdown", xs.toKryo())

this time in the log pushdown predicate was generated but results was 
wrong (no results at all)


15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 
= (EQUALS x 320)

expr = leaf-0

Any ideas What wrong with this? Why the ORC pushdown predicate is not 
applied by the system?


BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:

Hi Patcharee,

From the query, it looks like only the column pruning will be 
applied. Partition pruning and predicate pushdown does not have 
effect. Do you see big IO difference between two methods?


The potential reason of the speed difference I can think of may be 
the different versions of OrcInputFormat. The hive path may use 
NewOrcInputFormat, but the spark path use OrcInputFormat.


Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:


Yes, the predicate pushdown is enabled, but still take longer time 
than the first method


BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:

Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no 
<mailto:patcharee.thong...@uni.no>> wrote:



Hi,

I am using spark sql 1.5 to query a hive table stored as 
partitioned orc file. We have the total files is about 6000 files 
and each file size is about 245MB.


What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from orc file, register temp table and query from the 
temp table


val c = 
hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")

c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (query all from the total 6000 
files) , the second case is much slower then the first one. Any 
ideas why?


BR,




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org














Re: sql query orc slow

2015-10-09 Thread patcharee

Hi Zhan Zhang

Actually my query has a WHERE clause: "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D 
where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 and z 
<= 8". The columns "x" and "y" are not partition columns, the others are partition 
columns. I expected the system to use predicate pushdown. I turned on 
debug logging and found that the pushdown predicate was not generated ("DEBUG 
OrcInputFormat: No ORC pushdown predicate").


Then I tried to set the search argument explicitly (on the column "x" 
which is not partition column)


val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 
320).end().build()

hiveContext.setConf("hive.io.file.readcolumn.names", "x")
hiveContext.setConf("sarg.pushdown", xs.toKryo())

This time the pushdown predicate was generated in the log, but the results were 
wrong (no results at all):


15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = 
(EQUALS x 320)

expr = leaf-0

Any ideas what is wrong with this? Why is the ORC pushdown predicate not 
applied by the system?


BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:

Hi Patcharee,

From the query, it looks like only the column pruning will be applied. 
Partition pruning and predicate pushdown does not have effect. Do you see big IO 
difference between two methods?

The potential reason of the speed difference I can think of may be the 
different versions of OrcInputFormat. The hive path may use NewOrcInputFormat, 
but the spark path use OrcInputFormat.

Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee <patcharee.thong...@uni.no> wrote:


Yes, the predicate pushdown is enabled, but still take longer time than the 
first method

BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:

Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee <patcharee.thong...@uni.no> wrote:


Hi,

I am using spark sql 1.5 to query a hive table stored as partitioned orc file. 
We have the total files is about 6000 files and each file size is about 245MB.

What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from orc file, register temp table and query from the temp table

val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (query all from the total 6000 files) , the 
second case is much slower then the first one. Any ideas why?

BR,




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



hiveContext sql number of tasks

2015-10-07 Thread patcharee

Hi,

I do a sql query on about 10,000 partitioned orc files. Because of the 
partition schema the files cannot be merged any longer (to reduce the 
total number).


From this command, hiveContext.sql(sqlText), 10K tasks were created, 
one per file. Is it possible to use fewer tasks? How can I force 
Spark SQL to use fewer tasks?


BR,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
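
For reference, a minimal sketch of one way to shrink the parallelism after the scan: 
coalesce the DataFrame returned by hiveContext.sql. This is an assumption about what 
helps here rather than a confirmed answer; the scan itself still creates roughly one 
task per split, but the later stages run with the smaller partition count. Combining 
files into fewer splits at scan time is a separate knob (for example the 
hive.exec.orc.split.strategy setting discussed elsewhere in these threads).

// Reduce the number of partitions used downstream of the ORC scan.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val sqlText = "select col1, col2 from some_partitioned_orc_table"  // hypothetical query
val df = hiveContext.sql(sqlText)
val fewer = df.coalesce(200)          // 200 is an arbitrary example value
fewer.registerTempTable("narrowed")   // subsequent queries run over 200 partitions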



Idle time between jobs

2015-09-16 Thread patcharee

Hi,

I am using Spark 1.5. I have a Spark application which is divided into 
several jobs. I noticed from the Event Timeline in the Spark History UI that 
there was idle time between jobs. See below: job 1 was submitted at 
11:20:49 and finished at 11:20:52, but job 2 was submitted 16s 
later (at 11:21:08). I wonder what is going on during those 16 seconds? Any suggestions?


Job Id  Description                                    Submitted            Duration
2       saveAsTextFile at GenerateHistogram.scala:143  2015/09/16 11:21:08  0.7 s
1       collect at GenerateHistogram.scala:132         2015/09/16 11:20:49  2 s
0       count at GenerateHistogram.scala:129           2015/09/16 11:20:41  9 s

Below is log

15/09/16 11:20:52 INFO DAGScheduler: Job 1 finished: collect at 
GenerateHistogram.scala:132, took 2.221756 s
15/09/16 11:21:08 INFO deprecation: mapred.tip.id is deprecated. 
Instead, use mapreduce.task.id
15/09/16 11:21:08 INFO deprecation: mapred.task.id is deprecated. 
Instead, use mapreduce.task.attempt.id
15/09/16 11:21:08 INFO deprecation: mapred.task.is.map is deprecated. 
Instead, use mapreduce.task.ismap
15/09/16 11:21:08 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
15/09/16 11:21:08 INFO deprecation: mapred.job.id is deprecated. 
Instead, use mapreduce.job.id
15/09/16 11:21:08 INFO FileOutputCommitter: File Output Committer 
Algorithm version is 1
15/09/16 11:21:08 INFO SparkContext: Starting job: saveAsTextFile at 
GenerateHistogram.scala:143
15/09/16 11:21:08 INFO DAGScheduler: Got job 2 (saveAsTextFile at 
GenerateHistogram.scala:143) with 1 output partitions
15/09/16 11:21:08 INFO DAGScheduler: Final stage: ResultStage 
2(saveAsTextFile at GenerateHistogram.scala:143)


BR,
Patcharee


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark performance - executor computing time

2015-09-15 Thread patcharee

Hi,

I was running a job (on Spark 1.5 + YARN + Java 8). In a stage that 
performs a lookup 
(org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:873)) 
there was one executor whose computing time was more than 6 times the 
median. This executor had almost the same shuffle read size and low GC 
time as the others.


What can impact the executor computing time? Any suggestions what 
parameters I should monitor/configure?


BR,
Patcharee



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark 1.5 sort slow

2015-09-01 Thread patcharee

Hi,

I found that Spark 1.5 sorting is very slow compared to Spark 1.4. Below is 
my code snippet:


val sqlRDD = sql("select date, u, v, z from fino3_hr3 where zone == 
2 and z >= 2 and z <= order by date, z")

println("sqlRDD " + sqlRDD.count())

The fino3_hr3 (in the sql command) is a hive table in orc format, 
partitioned by zone and z.


Spark 1.5 takes 4.5 mins to execute this sql, while Spark 1.4 takes 1.5 
mins. I noticed that, unlike Spark 1.4, when Spark 1.5 sorted the data 
was shuffled into only a few tasks instead of being divided across all tasks. 
Do I need to set any configuration explicitly? Any suggestions?


BR,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
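
For reference, a minimal sketch of the first knob worth checking here: the shuffle width 
used by the SQL ORDER BY is controlled by spark.sql.shuffle.partitions, so making it 
explicit shows whether the skew comes from the setting or from the range partitioning of 
the sort keys. This is an assumption about the cause, not a confirmed diagnosis, and the 
upper bound z <= 8 below is only an illustrative value since the original query text is 
truncated.

// Make the shuffle width of the ORDER BY explicit (value is an arbitrary example).
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("spark.sql.shuffle.partitions", "400")

val sqlRDD = hiveContext.sql(
  "select date, u, v, z from fino3_hr3 where zone == 2 and z >= 2 and z <= 8 order by date, z")
println("sqlRDD " + sqlRDD.count())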



embedded pig in the custer

2015-07-22 Thread patcharee

Hi,

I am using Pig 0.14. How can I run embedded Pig (with PigServer) on the 
cluster so that the full classpath and the default configuration for 
Pig, Hadoop, YARN and HDFS are picked up?
With the simple approach java -cp myjar.jar mymainclass it will 
definitely throw ClassNotFoundException or other exceptions.


BR,
Patcharee
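
For reference, a sketch of one common approach, offered as an assumption rather than a 
verified recipe: keep the application jar thin and launch it with hadoop jar so the 
cluster's Hadoop/YARN configuration and jars come from the node, adding the Pig jar via 
HADOOP_CLASSPATH; PigServer then picks up the *-site.xml files from that classpath. The 
driver class below is a hypothetical example, and a Scala build would also need the 
Scala runtime on the classpath.

// Hypothetical embedded Pig driver. Launch example:
//   export HADOOP_CLASSPATH=$PIG_HOME/pig-0.14.0-core-h2.jar   (jar name is an example)
//   hadoop jar myjar.jar MyPigDriver
import org.apache.pig.{ExecType, PigServer}

object MyPigDriver {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / yarn-site.xml etc. from the classpath provided by 'hadoop jar'
    val pig = new PigServer(ExecType.MAPREDUCE)
    pig.registerQuery("A = LOAD '/tmp/input' AS (line:chararray);")
    pig.store("A", "/tmp/output")
    pig.shutdown()
  }
}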



Re: character '' not supported here

2015-07-20 Thread patcharee

Hi,

I created a hive table stored as an ORC file (partitioned and compressed with 
ZLIB) from the Hive CLI, and added data into this table from a Spark application. 
After adding the data I was able to query it and everything looked fine. Then I 
concatenated the table from the Hive CLI. After that I can no longer query 
the data (e.g. select count(*) from the table); I just get the error line 1:1 
character '' not supported here, with both the Tez and MR engines.


How did you solve the problem in your case?

BR,
Patcharee


On 18. juli 2015 21:26, Nitin Pawar wrote:

can you tell exactly what steps you did/?
also did you try running the query with processing to MR instead of tez?
not sure this issue with orc file formats .. i had once faced issues 
on alter table for orc backed tabled on adding a new column


On Sun, Jul 19, 2015 at 12:05 AM, pth001 patcharee.thong...@uni.no 
mailto:patcharee.thong...@uni.no wrote:


Hi,

The query result

11236119012.64043-5.970886 8.5592070.0   
0.00.0-19.686993 1308.804799848.00.006196644   
0.00.0 301.274750.382470460.0NULL11 20081
11236122012.513598-6.3671713 7.3927946   
0.00.00.0-22.300392 1441.054799848.0   
0.00508465060.00.0 112.207870.304595230.0   
NULL11 20081
5122503682415.1955.172235 4.9027147   
-0.0244086120.023590.553 -38.96928-1130.0469   
74660.542.5969802E-4 9.706164E-1123054.2680.0   
0.241967370.0 NULL1120081
9121449412.25196412.081688 -9.594620.0   
0.00.0-25.93576258.65625 99848.00.0021708217   
0.00.01.2963213 1.15602660.0NULL11   
20081
9121458412.3020987.752461 -12.1834630.0   
0.00.0-24.983763 351.195399848.00.0023723599   
0.00.0 1.41373750.992398860.0NULL11 20081


I stored table in orc format, partitioned and compressed by ZLIB.
The problem happened just after I concatenate table.

BR,
Patcharee

On 18/07/15 12:46, Nitin Pawar wrote:

select * without where will work because it does not involve file
processing
I suspect the problem is with field delimiter so i asked for
records so that we can see whats the data in each column

are you using csv file with columns delimited by some char and it
has numeric data in quotes ?

On Sat, Jul 18, 2015 at 3:58 PM, patcharee
patcharee.thong...@uni.no mailto:patcharee.thong...@uni.no wrote:

This select * from table limit 5; works, but not others. So?

Patcharee


On 18. juli 2015 12:08, Nitin Pawar wrote:

can you do select * from table limit 5;

On Sat, Jul 18, 2015 at 3:35 PM, patcharee
patcharee.thong...@uni.no
mailto:patcharee.thong...@uni.no wrote:

Hi,

I am using hive 0.14 with Tez engine. Found a weird
problem. Any suggestions?

hive select count(*) from 4D;
line 1:1 character '' not supported here
line 1:2 character '' not supported here
line 1:3 character '' not supported here
line 1:4 character '' not supported here
line 1:5 character '' not supported here
line 1:6 character '' not supported here
line 1:7 character '' not supported here
line 1:8 character '' not supported here
line 1:9 character '' not supported here
...
...
line 1:131 character '' not supported here
line 1:132 character '' not supported here
line 1:133 character '' not supported here
line 1:134 character '' not supported here
line 1:135 character '' not supported here
line 1:136 character '' not supported here
line 1:137 character '' not supported here
line 1:138 character '' not supported here
line 1:139 character '' not supported here
line 1:140 character '' not supported here
line 1:141 character '' not supported here
line 1:142 character '' not supported here
line 1:143 character '' not supported here
line 1:144 character '' not supported here
line 1:145 character '' not supported here
line 1:146 character '' not supported here

BR,
Patcharee





-- 
Nitin Pawar





-- 
Nitin Pawar





--
Nitin Pawar




character '' not supported here

2015-07-18 Thread patcharee

Hi,

I am using hive 0.14 with the Tez engine and found a weird problem. Any 
suggestions?


hive select count(*) from 4D;
line 1:1 character '' not supported here
line 1:2 character '' not supported here
line 1:3 character '' not supported here
line 1:4 character '' not supported here
line 1:5 character '' not supported here
line 1:6 character '' not supported here
line 1:7 character '' not supported here
line 1:8 character '' not supported here
line 1:9 character '' not supported here
...
...
line 1:131 character '' not supported here
line 1:132 character '' not supported here
line 1:133 character '' not supported here
line 1:134 character '' not supported here
line 1:135 character '' not supported here
line 1:136 character '' not supported here
line 1:137 character '' not supported here
line 1:138 character '' not supported here
line 1:139 character '' not supported here
line 1:140 character '' not supported here
line 1:141 character '' not supported here
line 1:142 character '' not supported here
line 1:143 character '' not supported here
line 1:144 character '' not supported here
line 1:145 character '' not supported here
line 1:146 character '' not supported here

BR,
Patcharee




Re: character '' not supported here

2015-07-18 Thread patcharee

This query, select * from table limit 5;, works, but others do not. So what could be wrong?

Patcharee

On 18. juli 2015 12:08, Nitin Pawar wrote:

can you do select * from table limit 5;

On Sat, Jul 18, 2015 at 3:35 PM, patcharee patcharee.thong...@uni.no 
mailto:patcharee.thong...@uni.no wrote:


Hi,

I am using hive 0.14 with Tez engine. Found a weird problem. Any
suggestions?

hive select count(*) from 4D;
line 1:1 character '' not supported here
line 1:2 character '' not supported here
line 1:3 character '' not supported here
line 1:4 character '' not supported here
line 1:5 character '' not supported here
line 1:6 character '' not supported here
line 1:7 character '' not supported here
line 1:8 character '' not supported here
line 1:9 character '' not supported here
...
...
line 1:131 character '' not supported here
line 1:132 character '' not supported here
line 1:133 character '' not supported here
line 1:134 character '' not supported here
line 1:135 character '' not supported here
line 1:136 character '' not supported here
line 1:137 character '' not supported here
line 1:138 character '' not supported here
line 1:139 character '' not supported here
line 1:140 character '' not supported here
line 1:141 character '' not supported here
line 1:142 character '' not supported here
line 1:143 character '' not supported here
line 1:144 character '' not supported here
line 1:145 character '' not supported here
line 1:146 character '' not supported here

BR,
Patcharee





--
Nitin Pawar




Re: fails to alter table concatenate

2015-06-30 Thread patcharee

Actually it works on MR, so the problem comes from Tez. Thanks!

BR,
Patcharee

On 30. juni 2015 10:23, Nitin Pawar wrote:
can you try doing same by changing the query engine from tez to mr1? 
not sure if its hive bug or tez bug


On Tue, Jun 30, 2015 at 1:46 PM, patcharee patcharee.thong...@uni.no 
mailto:patcharee.thong...@uni.no wrote:


Hi,

I am using hive 0.14. It fails to alter table concatenate
occasionally (see the exception below). It is strange that it
fails from time to time not predictable. Is there any suggestion/clue?

hive alter table 4dim partition(zone=2,z=15,year=2005,month=4)
CONCATENATE;


VERTICES  STATUS  TOTAL  COMPLETED  RUNNING PENDING
FAILED  KILLED


File MergeFAILED -1  00 -1  0 
 0



VERTICES: 00/01  [--] 0% ELAPSED TIME:
1435651968.00 s


Status: Failed
Vertex failed, vertexName=File Merge,
vertexId=vertex_1435307579867_0041_1_00, diagnostics=[Vertex
vertex_1435307579867_0041_1_00 [File Merge] killed/failed due
to:ROOT_INPUT_INIT_FAILURE, Vertex Input:

[hdfs://service-10-0.local:8020/apps/hive/warehouse/wrf_tables/4dim/zone=2/z=15/year=2005/month=4]
initializer failed, vertex=vertex_1435307579867_0041_1_00 [File
Merge], java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io

http://org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:265)
at org.apache.hadoop.hive.ql.io

http://org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:452)
at

org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:441)
at

org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:295)
at

org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:124)
at

org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245)
at

org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at

org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at

org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239)
at

org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.DDLTask

BR,
Patcharee




--
Nitin Pawar




fails to alter table concatenate

2015-06-30 Thread patcharee

Hi,

I am using hive 0.14. alter table ... concatenate fails occasionally 
(see the exception below). Strangely, it fails from time to time, 
unpredictably. Is there any suggestion or clue?


hive alter table 4dim partition(zone=2,z=15,year=2005,month=4) CONCATENATE;

VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING 
FAILED  KILLED


File MergeFAILED -1  00 -1   0   0

VERTICES: 00/01  [--] 0%ELAPSED TIME: 
1435651968.00 s


Status: Failed
Vertex failed, vertexName=File Merge, 
vertexId=vertex_1435307579867_0041_1_00, diagnostics=[Vertex 
vertex_1435307579867_0041_1_00 [File Merge] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: 
[hdfs://service-10-0.local:8020/apps/hive/warehouse/wrf_tables/4dim/zone=2/z=15/year=2005/month=4] 
initializer failed, vertex=vertex_1435307579867_0041_1_00 [File Merge], 
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:265)
at 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:452)
at 
org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:441)
at 
org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:295)
at 
org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:124)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239)

at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226)

at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)
]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.DDLTask


BR,
Patcharee
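
For reference, and following the observation in the reply above that the same statement 
succeeds on MR: a minimal workaround sketch that switches the execution engine only for 
the concatenate (a session-level setting; switching back afterwards is optional). Whether 
this avoids the NullPointerException in every case is an assumption based on this thread, 
not a verified fix.

hive> SET hive.execution.engine=mr;
hive> alter table 4dim partition(zone=2,z=15,year=2005,month=4) CONCATENATE;
hive> SET hive.execution.engine=tez;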



Re: Kryo serialization of classes in additional jars

2015-06-26 Thread patcharee

Hi,

I am having this problem on Spark 1.4. Do you have any idea how to 
solve it? I tried to use spark.executor.extraClassPath, but it did not help.


BR,
Patcharee

On 04. mai 2015 23:47, Imran Rashid wrote:
Oh, this seems like a real pain.  You should file a jira, I didn't see 
an open issue -- if nothing else just to document the issue.


As you've noted, the problem is that the serializer is created 
immediately in the executors, right when the SparkEnv is created, but 
the other jars aren't downloaded later.  I think you could workaround 
with some combination of pushing the jars to the cluster manually, and 
then using spark.executor.extraClassPath


On Wed, Apr 29, 2015 at 6:42 PM, Akshat Aranya aara...@gmail.com 
mailto:aara...@gmail.com wrote:


Hi,

Is it possible to register kryo serialization for classes
contained in jars that are added with spark.jars?  In my
experiment it doesn't seem to work, likely because the class
registration happens before the jar is shipped to the executor and
added to the classloader.  Here's the general idea of what I want
to do:

   val sparkConf = new SparkConf(true)
  .set(spark.jars, foo.jar)
  .setAppName(foo)
  .set(spark.serializer,
org.apache.spark.serializer.KryoSerializer)

// register classes contained in foo.jar
sparkConf.registerKryoClasses(Array(
  classOf[com.foo.Foo],
  classOf[com.foo.Bar]))
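
For reference, a minimal sketch of the workaround described above: pre-distribute the jar 
out of band (for example to the same local path on every node) and point both driver and 
executors at it, so the classes are visible when the serializer is created. The path 
/opt/jars/foo.jar is a hypothetical example, and this is an assumption about the wiring 
rather than a verified fix.

// Build the conf so the pre-distributed jar is on the classpath before Kryo registration runs.
import org.apache.spark.SparkConf

val conf = new SparkConf(true)
  .setAppName("foo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.executor.extraClassPath", "/opt/jars/foo.jar")   // hypothetical location on each node
  .set("spark.driver.extraClassPath", "/opt/jars/foo.jar")

// Registration can then refer to the classes as usual
conf.registerKryoClasses(Array(
  classOf[com.foo.Foo],
  classOf[com.foo.Bar]))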






Re: HiveContext saveAsTable create wrong partition

2015-06-16 Thread patcharee
I found that if I move the partition columns in schemaString and in the Row to 
the end of the sequence, then it works correctly...

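For reference, a minimal sketch of that reordering (column list shortened here; the 
point is that the partition columns zone, z, year, month come last both in the schema 
and in each Row, matching the partitionBy order):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("date", IntegerType, true),
  StructField("hh", IntegerType, true),
  StructField("u", FloatType, true),
  StructField("v", FloatType, true),
  // partition columns last
  StructField("zone", IntegerType, true),
  StructField("z", IntegerType, true),
  StructField("year", IntegerType, true),
  StructField("month", IntegerType, true)))

// partition values (2, 42, 2009, 3) last in the Row as well
val rows = sc.parallelize(Seq(Row(1, 0, 29.627113f, 19.071793f, 2, 42, 2009, 3)))

val df = sqlContext.createDataFrame(rows, schema)
df.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .partitionBy("zone", "z", "year", "month")
  .saveAsTable("test4DimBySpark")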

On 16. juni 2015 11:14, patcharee wrote:

Hi,

I am using spark 1.4 and HiveContext to append data into a partitioned 
hive table. I found that the data inserted into the table is correct, 
but the partition (folder) created is totally wrong.

Below is my code snippet

--- 

val schemaString = "zone z year month date hh x y height u v w ph phb t p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"

val schema =
  StructType(
    schemaString.split(" ").map(fieldName =>
      if (fieldName.equals("zone") || fieldName.equals("z") ||
          fieldName.equals("year") || fieldName.equals("month") ||
          fieldName.equals("date") || fieldName.equals("hh") ||
          fieldName.equals("x") || fieldName.equals("y"))
        StructField(fieldName, IntegerType, true)
      else
        StructField(fieldName, FloatType, true)
    ))

val pairVarRDD =
  sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),
    97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(),
    0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
  ))

val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)

partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
  .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone", "z", "year", "month").saveAsTable("test4DimBySpark")



--- 



The table contains 23 columns (longer than Tuple maximum length), so I 
use Row Object to store raw data, not Tuple.

Here is some message from spark when it saved data

15/06/16 10:39:22 INFO metadata.Hive: Renaming 
src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest: 
hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true 

15/06/16 10:39:22 INFO metadata.Hive: New loading path = 
hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0 
with partSpec {zone=13195, z=0, year=0, month=0}


From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month = 
3. But spark created a partition {zone=13195, z=0, year=0, month=0}.


When I queried from hive

hive select * from test4dimBySpark;
OK
242200931.00.0218.0365.09989.497 
29.62711319.0717930.11982734-3174.681297735.2 
16.389032-96.6289125135.3652.6476808E-50.0 13195
000

hive select zone, z, year, month from test4dimBySpark;
OK
13195000
hive dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
Found 2 items
-rw-r--r--   3 patcharee hdfs   1411 2015-06-16 10:39 
/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1


The data stored in the table is correct zone = 2, z = 42, year = 2009, 
month = 3, but the partition created was wrong 
zone=13195/z=0/year=0/month=0


Is this a bug or what could be wrong? Any suggestion is appreciated.

BR,
Patcharee







-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



HiveContext saveAsTable create wrong partition

2015-06-16 Thread patcharee

Hi,

I am using spark 1.4 and HiveContext to append data into a partitioned 
hive table. I found that the data inserted into the table is correct, but 
the partition (folder) created is totally wrong.

Below is my code snippet

---
val schemaString = "zone z year month date hh x y height u v w ph phb t p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"

val schema =
  StructType(
    schemaString.split(" ").map(fieldName =>
      if (fieldName.equals("zone") || fieldName.equals("z") ||
          fieldName.equals("year") || fieldName.equals("month") ||
          fieldName.equals("date") || fieldName.equals("hh") ||
          fieldName.equals("x") || fieldName.equals("y"))
        StructField(fieldName, IntegerType, true)
      else
        StructField(fieldName, FloatType, true)
    ))

val pairVarRDD =
  sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),
    97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(),
    0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
  ))

val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)

partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
  .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone", "z", "year", "month").saveAsTable("test4DimBySpark")

---

The table contains 23 columns (longer than Tuple maximum length), so I 
use Row Object to store raw data, not Tuple.

Here is some message from spark when it saved data

15/06/16 10:39:22 INFO metadata.Hive: Renaming 
src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest: 
hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true
15/06/16 10:39:22 INFO metadata.Hive: New loading path = 
hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0 
with partSpec {zone=13195, z=0, year=0, month=0}


From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month = 
3. But spark created a partition {zone=13195, z=0, year=0, month=0}.


When I queried from hive

hive select * from test4dimBySpark;
OK
242200931.00.0218.0365.09989.497 
29.62711319.0717930.11982734-3174.681297735.2 
16.389032-96.6289125135.3652.6476808E-50.0 131950
00

hive select zone, z, year, month from test4dimBySpark;
OK
13195000
hive dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
Found 2 items
-rw-r--r--   3 patcharee hdfs   1411 2015-06-16 10:39 
/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1


The data stored in the table is correct zone = 2, z = 42, year = 2009, 
month = 3, but the partition created was wrong 
zone=13195/z=0/year=0/month=0


Is this a bug or what could be wrong? Any suggestion is appreciated.

BR,
Patcharee







-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



sql.catalyst.ScalaReflection scala.reflect.internal.MissingRequirementError

2015-06-15 Thread patcharee

Hi,

I use Spark 1.4. I tried to create a dataframe from the RDD below, but got 
scala.reflect.internal.MissingRequirementError.


val partitionedTestDF2 = pairVarRDD.toDF("column1", "column2", "column3")
// pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a case class

How can I fix this?

Exception in thread "main" 
scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 
in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type 
class sun.misc.Launcher$AppClassLoader with classpath 
[file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] 
and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class 
sun.misc.Launcher$ExtClassLoader with classpath 
[file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] 
and parent being primordial classloader with boot classpath 
[/usr/jdk64/jdk1.7.0_67/jre/lib/resources.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/rt.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/sunrsasign.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/jsse.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/jce.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/charsets.jar:/usr/jdk64/jdk1.7.0_67/jre/lib/jfr.jar:/usr/jdk64/jdk1.7.0_67/jre/classes] 
not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
at 
no.uni.computing.etl.LoadWrfV14$$typecreator1$1.apply(LoadWrfV14.scala:91)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)

at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28)
at 
org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410)
at 
org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:335)


BR,
Patcharee


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext.sql NullPointerException

2015-06-11 Thread patcharee

Hi,

Does 
df.write.partitionBy(partitions).format(format).mode(overwrite).saveAsTable(tbl) 
support orc file?


I tried df.write.partitionBy("zone", "z", "year", "month").format("orc").mode("overwrite").saveAsTable("tbl"), 
but after the insert my table tbl schema had been changed to something I did not 
expect ..


-- FROM --
CREATE EXTERNAL TABLE `4dim`(`u` float,   `v` float)
PARTITIONED BY (`zone` int, `z` int, `year` int, `month` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'transient_lastDdlTime'='1433016475')

-- TO --
CREATE TABLE `4dim`(`col` array<string> COMMENT 'from deserializer')
PARTITIONED BY (`zone` int COMMENT '', `z` int COMMENT '', `year` int 
COMMENT '', `month` int COMMENT '')
ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe'

STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
TBLPROPERTIES (
  'EXTERNAL'='FALSE',
  'spark.sql.sources.provider'='orc',
  'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"u\",\"type\":\"float\",\"nullable\":true,\"metadata\":{}},{\"name\":\"v\",\"type\":\"float\",\"nullable\":true,\"metadata\":{}},{\"name\":\"zone\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"z\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"year\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}',


  'transient_lastDdlTime'='1434055247')


I noticed there are files stored in hdfs as *.orc, but when I tried to 
query from hive I got nothing. How can I fix this? Any suggestions please

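An alternative I would also try here (just a sketch, not something confirmed in this 
thread): keep the existing Hive DDL and append through insertInto instead of 
saveAsTable, so Spark does not rewrite the table metadata. This assumes Spark 1.4, 
Hive dynamic partitioning enabled, and a DataFrame df whose last columns are 
zone, z, year, month:

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// Appends into the table defined by the original CREATE EXTERNAL TABLE above,
// leaving its ORC serde and TBLPROPERTIES untouched.
df.write.mode("append").insertInto("4dim")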

BR,
Patcharee


On 07. juni 2015 16:40, Cheng Lian wrote:
Spark SQL supports Hive dynamic partitioning, so one possible 
workaround is to create a Hive table partitioned by zone, z, year, and 
month dynamically, and then insert the whole dataset into it directly.


In 1.4, we also provides dynamic partitioning support for non-Hive 
environment, and you can do something like this:


df.write.partitionBy("zone", "z", "year", "month").format("parquet").mode("overwrite").saveAsTable("tbl")

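My reading of the first suggestion, as a sketch (same two dynamic-partition settings 
as in the snippet further up, and a DataFrame df whose last four columns are 
zone, z, year, month):

df.registerTempTable("df_tmp")
hiveContext.sql(
  "INSERT OVERWRITE TABLE 4dim PARTITION (zone, z, year, month) " +
  "SELECT * FROM df_tmp")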

Cheng

On 6/7/15 9:48 PM, patcharee wrote:

Hi,

If HiveContext can only be used on the driver, how am I supposed to do this 
work on the executors? Does it mean I have to collect all the datasets (very 
large) to the driver and use HiveContext there? That would overload the 
driver's memory and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on 
driver side. However, the closure passed to RDD.foreach is executed 
on executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I am trying to insert data into a partitioned hive table. The groupByKey 
is there to combine the dataset into one partition of the hive table. After the 
groupByKey, I converted the Iterable[X] to a DF by X.toList.toDF(). 
But the hiveContext.sql throws a NullPointerException, see below. 
Any suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD =
  varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
  .foreach(
    x => {
      val zone = x._1._1
      val z = x._1._2
      val year = x._1._3
      val month = x._1._4
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim");
    })


java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203

Re: hiveContext.sql NullPointerException

2015-06-08 Thread patcharee

Hi,

Thanks for your guidelines. I will try it out.

Btw, how do you know that HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on the driver 
side? Where can I find that documented?


BR,
Patcharee


On 07. juni 2015 16:40, Cheng Lian wrote:
Spark SQL supports Hive dynamic partitioning, so one possible 
workaround is to create a Hive table partitioned by zone, z, year, and 
month dynamically, and then insert the whole dataset into it directly.


In 1.4, we also provides dynamic partitioning support for non-Hive 
environment, and you can do something like this:


df.write.partitionBy("zone", "z", "year", "month").format("parquet").mode("overwrite").saveAsTable("tbl")


Cheng

On 6/7/15 9:48 PM, patcharee wrote:

Hi,

If HiveContext can only be used on the driver, how am I supposed to do this 
work on the executors? Does it mean I have to collect all the datasets (very 
large) to the driver and use HiveContext there? That would overload the 
driver's memory and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on 
driver side. However, the closure passed to RDD.foreach is executed 
on executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I am trying to insert data into a partitioned hive table. The groupByKey 
is there to combine the dataset into one partition of the hive table. After the 
groupByKey, I converted the Iterable[X] to a DF by X.toList.toDF(). 
But the hiveContext.sql throws a NullPointerException, see below. 
Any suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD =
  varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
  .foreach(
    x => {
      val zone = x._1._1
      val z = x._1._2
      val year = x._1._3
      val month = x._1._4
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim");
    })


java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org












-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext.sql NullPointerException

2015-06-07 Thread patcharee

Hi,

If HiveContext can only be used on the driver, how am I supposed to do this 
work on the executors? Does it mean I have to collect all the datasets (very 
large) to the driver and use HiveContext there? That would overload the 
driver's memory and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on driver 
side. However, the closure passed to RDD.foreach is executed on 
executor side, where no viable HiveContext instance exists.

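A small illustration of the difference (my own example, assuming an existing 
SparkContext sc):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = sc.parallelize(Seq((1, 29.627113f))).toDF("date", "u")

// Fine: registerTempTable and sql are called in driver-side code.
df.registerTempTable("tmp_u")
hiveContext.sql("select count(*) from tmp_u").show()

// Not fine: this closure runs on the executors, where no usable HiveContext exists,
// hence the NullPointerException seen in this thread.
// df.rdd.foreach { row => hiveContext.sql("select count(*) from tmp_u") }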

Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I am trying to insert data into a partitioned hive table. The groupByKey is 
there to combine the dataset into one partition of the hive table. After the 
groupByKey, I converted the Iterable[X] to a DF by X.toList.toDF(). But 
the hiveContext.sql throws a NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD =
  varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
  .foreach(
    x => {
      val zone = x._1._1
      val z = x._1._2
      val year = x._1._3
      val month = x._1._4
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim");
    })


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



write multiple outputs by key

2015-06-06 Thread patcharee

Hi,

How can I write to multiple outputs, one per key? I tried creating a 
custom partitioner and setting the number of partitions, but it does not work. 
Only a few tasks/partitions (as many as the number of distinct key 
combinations) get large datasets; the data is not split across all 
tasks/partitions. The job failed because those few tasks had to handle 
datasets that were far too large. Below is my code snippet.


val varWFlatRDD =
  varWRDD.map(FlatMapUtilClass().flatKeyFromWrf).groupByKey() // keys are (zone, z, year, month)
  .foreach(
    x => {
      val z = x._1._1
      val year = x._1._2
      val month = x._1._3
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim");
    })


From the spark history UI, at groupByKey there are 1000 tasks (equal 
to the parent's number of partitions). At foreach there are 1000 tasks as 
well, but only 50 tasks (the same as the number of distinct key combinations) get any data.


How can I fix this problem? Any suggestions are appreciated.

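For completeness, a sketch of one possible alternative (an assumption on my part 
that it fits this job): skip groupByKey entirely and let a partitioned DataFrame 
write split the output by key, which needs Spark 1.4's DataFrameWriter and the 
keys present as ordinary columns. Column list shortened, table name hypothetical:

import org.apache.spark.sql.SaveMode
import hiveContext.implicits._

val df = sc.parallelize(Seq((1, 0, 29.627113f, 2, 42, 2009, 3)))
  .toDF("date", "hh", "u", "zone", "z", "year", "month")

df.write
  .format("orc")
  .mode(SaveMode.Append)
  .partitionBy("zone", "z", "year", "month")   // one output directory per key combination
  .saveAsTable("4dim_by_writer")               // hypothetical table name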
BR,
Patcharee



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



hiveContext.sql NullPointerException

2015-06-06 Thread patcharee

Hi,

I am trying to insert data into a partitioned hive table. The groupByKey is there to 
combine the dataset into one partition of the hive table. After the 
groupByKey, I converted the Iterable[X] to a DF by X.toList.toDF(). But 
the hiveContext.sql throws a NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD =
  varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
  .foreach(
    x => {
      val zone = x._1._1
      val z = x._1._2
      val year = x._1._3
      val month = x._1._4
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim");
    })


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: FetchFailed Exception

2015-06-05 Thread patcharee

Hi,

I had this problem before, and in my case it was because the 
executor/container was killed by YARN when it used more memory than 
allocated. You can check whether your case is the same by looking at the 
YARN node manager log.

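If it turns out to be the same issue, these are the settings I would look at first 
(values below are just placeholders, not recommendations): the YARN container has to 
fit the executor heap plus the off-heap overhead.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")                  // executor heap (placeholder)
  .set("spark.yarn.executor.memoryOverhead", "1024")   // off-heap headroom in MB (placeholder)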

Best,
Patcharee

On 05. juni 2015 07:25, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:

I see this

Is this a problem with my code or the cluster ? Is there any way to 
fix it ?


FetchFailed(BlockManagerId(2, phxdpehdc9dn2441.stratus.phx.ebay.com, 59574), shuffleId=1, 
mapId=80, reduceId=20, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to 
phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: java.net.ConnectException: Connection refused: 
phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized

NullPointerException SQLConf.setConf

2015-06-04 Thread patcharee

Hi,

I am using Hive 0.14 and Spark 1.3. I got a 
java.lang.NullPointerException when inserting into hive. Any suggestions 
please.


hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
  "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim where z=" + zz);


java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.setConf(SQLConf.scala:196)
at org.apache.spark.sql.SQLContext.setConf(SQLContext.scala:74)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)

at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:110)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:107)

at scala.collection.immutable.Range.foreach(Range.scala:141)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MetaException(message:java.security.AccessControlException: Permission denied

2015-06-03 Thread patcharee

Hi,

I was running a spark job to insert overwrite a hive table and got 
Permission denied. My question is: why did the spark job do the insert as 
user 'hive' rather than as me, the user who ran the job? How can I fix the problem?


val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("INSERT OVERWRITE table 4dim ... ")


Caused by: MetaException(message:java.security.AccessControlException: 
Permission denied: user=hive, access=WRITE, 
inode=/apps/hive/warehouse/wrf_tables/4dim/zone=2/z=1/year=2009/month=1:patcharee:hdfs:drwxr-xr-x
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partition_result$alter_partition_resultStandardScheme.read(ThriftHiveMetastore.java)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partition_result$alter_partition_resultStandardScheme.read(ThriftHiveMetastore.java)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partition_result.read(ThriftHiveMetastore.java)

at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_partition(ThriftHiveMetastore.java:2033)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.alter_partition(ThriftHiveMetastore.java:2018)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_partition(HiveMetaStoreClient.java:1091)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)

at com.sun.proxy.$Proxy37.alter_partition(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.alterPartition(Hive.java:469)

... 26 more


BR,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee
(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)

at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)

... 1 more

)

I am using spark 1.3.1. Is the problem the one from 
https://issues.apache.org/jira/browse/SPARK-4516?


Best,
Patcharee

On 03. juni 2015 10:11, Akhil Das wrote:
Which version of spark? Looks like you are hitting this one 
https://issues.apache.org/jira/browse/SPARK-4516


Thanks
Best Regards

On Wed, Jun 3, 2015 at 1:06 PM, patcharee patcharee.thong...@uni.no wrote:


This is log I can get

15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying
fetch (2/3) for 4 outstanding blocks after 5000 ms
15/06/02 16:37:36 INFO client.TransportClientFactory: Found
inactive connection to compute-10-3.local/10.10.255.238:33671
http://10.10.255.238:33671, creating a new one.
15/06/02 16:37:36 WARN server.TransportChannelHandler: Exception
in connection from /10.10.255.238:35430 http://10.10.255.238:35430
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at

io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at
io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at

io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
at

io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at

io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at

io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:744)
15/06/02 16:37:36 ERROR server.TransportRequestHandler: Error
sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1033433133943,
chunkIndex=1},

buffer=FileSegmentManagedBuffer{file=/hdisk3/hadoop/yarn/local/usercache/patcharee/appcache/application_1432633634512_0213/blockmgr-12d59e6b-0895-4a0e-9d06-152d2f7ee855/09/shuffle_0_56_0.data,
offset=896, length=1132499356}} to /10.10.255.238:35430
http://10.10.255.238:35430; closing connection
java.nio.channels.ClosedChannelException
15/06/02 16:37:38 ERROR shuffle.RetryingBlockFetcher: Exception
while beginning fetch of 4 outstanding blocks (after 2 retries)
java.io.IOException: Failed to connect to
compute-10-3.local/10.10.255.238:33671 http://10.10.255.238:33671
at

org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at

org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at

org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at

org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at

org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at

org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused:
compute-10-3.local/10.10.255.238:33671 http://10.10.255.238:33671
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
at

io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at

io.netty.channel.nio.AbstractNioChannel

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee

This is log I can get

15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch 
(2/3) for 4 outstanding blocks after 5000 ms
15/06/02 16:37:36 INFO client.TransportClientFactory: Found inactive 
connection to compute-10-3.local/10.10.255.238:33671, creating a new one.
15/06/02 16:37:36 WARN server.TransportChannelHandler: Exception in 
connection from /10.10.255.238:35430

java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at 
io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)

at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)

at java.lang.Thread.run(Thread.java:744)
15/06/02 16:37:36 ERROR server.TransportRequestHandler: Error sending 
result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1033433133943, 
chunkIndex=1}, 
buffer=FileSegmentManagedBuffer{file=/hdisk3/hadoop/yarn/local/usercache/patcharee/appcache/application_1432633634512_0213/blockmgr-12d59e6b-0895-4a0e-9d06-152d2f7ee855/09/shuffle_0_56_0.data, 
offset=896, length=1132499356}} to /10.10.255.238:35430; closing connection

java.nio.channels.ClosedChannelException
15/06/02 16:37:38 ERROR shuffle.RetryingBlockFetcher: Exception while 
beginning fetch of 4 outstanding blocks (after 2 retries)
java.io.IOException: Failed to connect to 
compute-10-3.local/10.10.255.238:33671
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused: 
compute-10-3.local/10.10.255.238:33671

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)

at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)

... 1 more



Best,
Patcharee

On 03. juni 2015 09:21, Akhil Das wrote:

You need to look into your executor/worker logs to see what's going on.

Thanks
Best Regards

On Wed, Jun 3, 2015 at 12:01 PM, patcharee patcharee.thong...@uni.no wrote:


Hi,

What can be the cause of this ERROR cluster.YarnScheduler: Lost
executor? How can I fix it?

Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread patcharee

Hi,

What can be the cause of this ERROR cluster.YarnScheduler: Lost 
executor? How can I fix it?


Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Insert overwrite to hive - ArrayIndexOutOfBoundsException

2015-06-02 Thread patcharee

Hi,

I am using spark 1.3.1. I tried to insert (a new partition) into an 
existing partitioned hive table, but got ArrayIndexOutOfBoundsException. 
Below is a code snippet and the debug log. Any suggestions please.


+
case class Record4Dim(key: String, date: Int, hh: Int, x: Int, y: Int, 
z: Int, height: Float,

  u: Float , v: Float, w: Float, ph: Float, phb: Float,
  t: Float, p: Float, pb: Float, qvapor: Float, qgraup: Float,
  qnice: Float, qnrain: Float, tke_pbl: Float, el_pbl: Float)

def flatKeyFromWrf(x: (String, (Map[String,Float], Float))): Record4Dim 
= {  }




val varWHeightFlatRDD =
  varWHeightRDD.map(FlatMapUtilClass().flatKeyFromWrf).toDF()

varWHeightFlatRDD.registerTempTable("table_4Dim")
for (zz <- 1 to 51)
  hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
    "select date, hh, x, y, height, u, v, w, ph, phb, t, pb, pb, qvapor, qgraup, qnice, tke_pbl, el_pbl from table_4Dim where z=" + zz);


+

15/06/01 21:07:20 DEBUG YarnHistoryService: Enqueue [1433192840040]: 
SparkListenerTaskEnd(4,0,ResultTask,ExceptionFailure(java.lang.ArrayIndexOutOfBoundsException,18,[Ljava.lang.StackTraceElement;@5783ce22,java.lang.ArrayIndexOutOfBoundsException: 
18
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:79)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:103)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:100)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:100)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:82)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:82)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)


Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



pig performance on reading/filtering orc file

2015-05-29 Thread patcharee

Hi,

I am using pig 0.14 to work on a partitioned orc file. I am trying to improve 
my pig performance. However, I am curious why using a filter at the 
beginning (approach 1) does not help and even takes longer than a 
replicated join (approach 2). This filter is supposed to cut down a lot 
of the data read from the orc file. Is this related to how I partition 
the orc file? Any guidelines/suggestions are appreciated.


---
Approach 1
---
coordinate = LOAD 'coordinate' USING 
org.apache.hive.hcatalog.pig.HCatLoader();

coordinate_zone = FILTER coordinate BY zone == 2;

coordinate_xy = LIMIT coordinate_zone 1;

rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
rawdata_u_1 = foreach rawdata_u generate 
date,hh,(double)xlong_u,(double)xlat_u,height,u,zone,year,month;

u_filter = FILTER rawdata_u_1 by zone == 2;

/* HERE I try to filter and expect to get better performance, but it is not */
u_filter = FILTER u_filter by xlong_u == coordinate_xy.xlong_u and 
xlat_u == coordinate_xy.xlat_u;


---
Approach 2
---
coordinate = LOAD 'coordinate' USING 
org.apache.hive.hcatalog.pig.HCatLoader();

coordinate_zone = FILTER coordinate BY zone == 2;

coordinate_xy = LIMIT coordinate_zone 1;

rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
u_filter = FILTER rawdata_u by zone == 2
join_u_coordinate_cossin = join u_filter by (xlong_u, xlat_u), 
coordinate_xy by (xlong_u, xlat_u) USING 'replicated';



Best,
Patcharee


cast column float

2015-05-27 Thread patcharee

Hi,

I queried a table based on the values of two float columns:

select count(*) from u where xlong_u = 7.1578474 and xlat_u = 55.192524;
select count(*) from u where xlong_u = cast(7.1578474 as float) and 
xlat_u = cast(55.192524 as float);


Both queries returned 0 records, even though there are records that 
match the condition. What can be wrong? I am using Hive 0.14.

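For what it is worth, the underlying float/double mismatch can be reproduced outside 
Hive; a small Scala illustration (my own example, not a statement about Hive internals):

// 7.1578474 is not exactly representable as a 32-bit float, so widening the stored
// float back to double does not reproduce the double literal.
val stored: Float = 7.1578474f   // roughly what a FLOAT column keeps
println(stored == 7.1578474)     // false: float widened to double != the double literal
println(stored == 7.1578474f)    // true: comparing against a float literal instead
println(stored.toDouble)         // prints something close to, but not exactly, 7.1578474

So an exact-equality predicate against a float column is fragile regardless of how the 
literal is cast.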

BR,
Patcharee


EOFException - TezJob - Cannot submit DAG

2015-05-22 Thread patcharee

Hi,

I ran a pig script on tez and got the EOFException below. I checked 
http://wiki.apache.org/hadoop/EOFException but I have no idea at all how 
to fix it. However, I did not get the exception when I executed this pig 
script on MR.


I am using HadoopVersion: 2.6.0.2.2.4.2-2, PigVersion: 0.14.0.2.2.4.2-2, 
TezVersion: 0.5.2.2.2.4.2-2


I will appreciate any suggestions. Thanks.

2015-05-22 14:44:13,638 [PigTezLauncher-0] ERROR 
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - Cannot submit 
DAG - Application id: application_143223768_0133
org.apache.tez.dag.api.TezException: 
com.google.protobuf.ServiceException: java.io.EOFException: End of File 
Exception between local host is: compute-10-0.local/10.10.255.241; 
destination host is: compute-10-3.local:47111; : java.io.EOFException; 
For more details see:  http://wiki.apache.org/hadoop/EOFException

at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:415)
at org.apache.tez.client.TezClient.submitDAG(TezClient.java:351)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezJob.run(TezJob.java:162)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher$1.run(TezLauncher.java:167)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)
Caused by: com.google.protobuf.ServiceException: java.io.EOFException: 
End of File Exception between local host is: 
compute-10-0.local/10.10.255.241; destination host is: 
compute-10-3.local:47111; : java.io.EOFException; For more details 
see:  http://wiki.apache.org/hadoop/EOFException
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:246)

at com.sun.proxy.$Proxy31.submitDAG(Unknown Source)
at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:408)
... 8 more
Caused by: java.io.EOFException: End of File Exception between local 
host is: compute-10-0.local/10.10.255.241; destination host is: 
compute-10-3.local:47111; : java.io.EOFException; For more details 
see:  http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1473)
at org.apache.hadoop.ipc.Client.call(Client.java:1400)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)

... 10 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)

at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)


saveasorcfile on partitioned orc

2015-05-20 Thread patcharee

Hi,

I followed the information on 
https://www.mail-archive.com/reviews@spark.apache.org/msg141113.html to 
save an ORC file with Spark 1.2.1.


I can save data to a new ORC file. I wonder how to save data to an 
existing, partitioned ORC file. Any suggestions?
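
One approach I am considering, assuming the target is an existing Hive table 
stored as ORC and partitioned (the table and column names below are only 
placeholders): register the new data as a temporary table and append it with 
a static-partition insert issued through HiveContext.sql, for example:

-- new_u is a hypothetical temp table holding the new rows;
-- u_orc is a hypothetical existing ORC table partitioned by zone
insert into table u_orc partition (zone = 2)
select xlong_u, xlat_u, u from new_u;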


BR,
Patcharee




Re: conflict from apache commons codec

2015-05-20 Thread patcharee

Thanks for your input. There are two weird points I found:

- In my cluster (and in the resources being localized for the tasks) there 
are two versions: commons-codec-1.4.jar and commons-codec-1.7.jar. I 
think commons-codec since version 1.4 should be fine, but I still got the 
exception (NoSuchMethodError 
org.apache.commons.codec.binary.Base64.decodeBase64(Ljava/lang/String;)[B). 
Any ideas?


- I tried to run the same Pig script + the same environment on a small number 
of files and on a large number of files. The former (small number of files) 
did not throw the exception, but the latter did. What can be wrong in the 
latter case?


BR,
Patcharee


On 20. mai 2015 09:37, Siddharth Seth wrote:
My best guess would be that an older version of commons-codec is also 
on the classpath for the running task. If you have access to the 
local-dirs configured under YARN - you could find the application dir 
in the local-dirs and see what exists in the classpath for a container.


Alternately, set tez.generate.debug.artifacts to true. This should 
give you access to the dag plan in text form via the YARN UI - which 
will list out the resources being localized for tasks.


On Tue, May 19, 2015 at 2:19 AM, patcharee <patcharee.thong...@uni.no> wrote:


Hi,

I am using Pig version 0.14 with Tez version 0.5.2. I have apache
commons-codec-1.4.jar on the machine. However, I got the commons-codec
exception when I executed a Pig job with Tez.

2015-05-19 11:01:04,784 INFO [AsyncDispatcher event handler]
history.HistoryEventHandler:
[HISTORY][DAG:dag_1431972385685_0021_1][Event:TASK_ATTEMPT_FINISHED]:
vertexName=scope-384,
taskAttemptId=attempt_1431972385685_0021_1_00_04_0,
startTime=1432025450752, finishTime=1432026064784,
timeTaken=614032, status=FAILED, diagnostics=Error: Fatal Error
cause TezChild exit.:java.lang.NoSuchMethodError:
org.apache.commons.codec.binary.Base64.decodeBase64(Ljava/lang/String;)[B
at

org.apache.hadoop.yarn.util.AuxiliaryServiceHelper.getServiceDataFromEnv(AuxiliaryServiceHelper.java:37)
at

org.apache.tez.runtime.api.impl.TezTaskContextImpl.getServiceProviderMetaData(TezTaskContextImpl.java:175)
at

org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.generateEventsOnClose(OrderedPartitionedKVOutput.java:187)
at

org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:148)
at

org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:348)
at

org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:178)
at

org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at

org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at

org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
at

org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

What could be wrong?

BR,
Patcharee








hive on Tez - merging orc files

2015-04-24 Thread patcharee

Hi,

Is there anyone using the Hortonworks Sandbox 2.2? I am trying to use Hive 
on Tez on the sandbox. I set the execution engine in hive-site.xml to Tez.


<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>

Then I ran the script that alters a table to merge small ORC files 
(alter table orc_merge5a partition(st=0.8) concatenate;). The merging 
feature worked, but Hive did not use Tez; it used MapReduce, which is weird!
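
To rule out the session simply not picking up the hive-site.xml value, I could 
also set the engine explicitly right before re-running the merge, for example:

-- session-level override, same table as above
SET hive.execution.engine=tez;
alter table orc_merge5a partition(st=0.8) concatenate;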


Another point: I tried to run the same script on the production cluster, 
which always runs on Tez; the merging feature sometimes worked and 
sometimes did not.


I would appreciate any suggestions.

BR,
Patcharee


Re: merge small orc files

2015-04-21 Thread patcharee

Hi Gopal,

Thanks for your explanation.

What could be the case that SET hive.merge.orcfile.stripe.level=true and 
"alter table <table> concatenate" do not work? I have a dynamically 
partitioned table (stored as ORC). I tried to alter concatenate, but it 
did not work. See my test result.


hive> SET hive.merge.orcfile.stripe.level=true;
hive> alter table orc_merge5a partition(st=0.8) concatenate;
Starting Job = job_1424363133313_0053, Tracking URL = 
http://service-test-1-2.testlocal:8088/proxy/application_1424363133313_0053/
Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job  -kill 
job_1424363133313_0053

Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-04-21 12:32:56,165 null map = 0%,  reduce = 0%
2015-04-21 12:33:05,964 null map = 100%,  reduce = 0%
Ended Job = job_1424363133313_0053
Loading data to table default.orc_merge5a partition (st=0.8)
Moved: 
'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/00_0' 
to trash at: 
hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current
Moved: 
'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/02_0' 
to trash at: 
hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current
Partition default.orc_merge5a{st=0.8} stats: [numFiles=2, numRows=0, 
totalSize=1067, rawDataSize=0]

MapReduce Jobs Launched:
Stage-null:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 22.839 seconds
hive> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
Found 2 items
-rw-r--r--   3 patcharee hdfs534 2015-04-21 12:33 
/apps/hive/warehouse/orc_merge5a/st=0.8/00_0
-rw-r--r--   3 patcharee hdfs533 2015-04-21 12:33 
/apps/hive/warehouse/orc_merge5a/st=0.8/01_0


It seems nothing happened when I ran alter table ... concatenate. Any ideas?

BR,
Patcharee

On 21. april 2015 04:41, Gopal Vijayaraghavan wrote:

Hi,


How do I set the configuration in hive-site.xml to automatically merge small
ORC files (output from a MapReduce job) in Hive 0.14?

Hive cannot add work-stages to a map-reduce job.

Hive follows merge.mapfiles=true when Hive generates a plan, by adding
more work to the plan as a conditional task.


-rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23
/apps/hive/warehouse/coordinate/zone=2/part-r-0

This looks like it was written by an MRv2 Reducer and not by the Hive
FileSinkOperator, and was handled by the MR OutputCommitter instead of the
Hive MoveTask.

But 0.14 has an option which helps: "hive.merge.orcfile.stripe.level". If
that is true (like your setting), then do

"alter table <table> concatenate"

which effectively concatenates ORC blocks (without decompressing them),
while maintaining metadata linkage of start/end offsets in the footer.

Cheers,
Gopal






Re: merge small orc files

2015-04-21 Thread patcharee

Hi Gopal,

The table created is not a bucketed table, but a dynamically partitioned 
table. I took the test script from 
https://svn.apache.org/repos/asf/hive/trunk/ql/src/test/queries/clientpositive/orc_merge7.q


- create table orc_merge5 (userid bigint, string1 string, subtype 
double, decimal1 decimal, ts timestamp) stored as orc;
- create table orc_merge5a (userid bigint, string1 string, subtype 
double, decimal1 decimal, ts timestamp) partitioned by (st double) 
stored as orc;


I sent you the desc formatted output of the table and the application log. 
I just found out that there are some TezExceptions which could be the cause 
of the problem. Please let me know how to fix it.


BR,

Patcharee


On 21. april 2015 13:10, Gopal Vijayaraghavan wrote:



"alter table <table> concatenate" do not work? I have a dynamically
partitioned table (stored as ORC). I tried to alter concatenate, but it
did not work. See my test result.

ORC fast concatenate does work on partitioned tables, but it doesn't work
on bucketed tables.

Bucketed tables cannot merge files, since the file count is capped by the
numBuckets parameter.


hive> dfs -ls
${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
Found 2 items
-rw-r--r--   3 patcharee hdfs534 2015-04-21 12:33
/apps/hive/warehouse/orc_merge5a/st=0.8/00_0
-rw-r--r--   3 patcharee hdfs533 2015-04-21 12:33
/apps/hive/warehouse/orc_merge5a/st=0.8/01_0

Is this a bucketed table?

When you look at it from the point of view of split generation and cluster
parallelism, bucketing is an anti-pattern, since in most query schemas it
significantly slows down the slowest task.

Making the fastest task faster isn't often worth it, if the overall query
time goes up.

Also if you want to, you can send me the "yarn logs -applicationId <app-id>"
output and the desc formatted of the table, which will help me understand
what's happening better.

Cheers,
Gopal






Container: container_1424363133313_0082_01_03 on compute-test-1-2.testlocal_45454
===
LogType:stderr
Log Upload Time:21-Apr-2015 14:17:54
LogLength:0
Log Contents:

LogType:stdout
Log Upload Time:21-Apr-2015 14:17:54
LogLength:2124
Log Contents:
0.294: [GC [PSYoungGen: 3642K-490K(6656K)] 3642K-1308K(62976K), 0.0071100 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
0.600: [GC [PSYoungGen: 6110K-496K(12800K)] 6929K-1992K(69120K), 0.0058540 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
1.061: [GC [PSYoungGen: 10217K-496K(12800K)] 11714K-3626K(69120K), 0.0077230 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
1.477: [GC [PSYoungGen: 8914K-512K(25088K)] 12045K-5154K(81408K), 0.0095740 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
2.361: [GC [PSYoungGen: 14670K-512K(25088K)] 19313K-6827K(81408K), 0.0106680 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
3.476: [GC [PSYoungGen: 22967K-3059K(51712K)] 29282K-9958K(108032K), 0.0201770 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
5.538: [GC [PSYoungGen: 50438K-3568K(52224K)] 57336K-15383K(108544K), 0.0374340 secs] [Times: user=0.04 sys=0.01, real=0.04 secs] 
6.811: [GC [PSYoungGen: 29358K-6331K(61440K)] 41173K-18282K(117760K), 0.0421300 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 
7.689: [GC [PSYoungGen: 28530K-6401K(61440K)] 40482K-19476K(117760K), 0.0443730 secs] [Times: user=0.03 sys=0.00, real=0.05 secs] 
Heap
 PSYoungGen  total 61440K, used 28333K [0xfbb8, 0x0001, 0x0001)
  eden space 54784K, 40% used [0xfbb8,0xfdf463f8,0xff10)
lgrp 0 space 2K, 49% used [0xfbb8,0xfc95add8,0xfd7b6000)
lgrp 1 space 25896K, 29% used [0xfd7b6000,0xfdf463f8,0xff10)
  from space 6656K, 96% used [0xff10,0xff740400,0xff78)
  to   space 8704K, 0% used [0xff78,0xff78,0x0001)
 ParOldGen   total 56320K, used 13075K [0xd9a0, 0xdd10, 0xfbb8)
  object space 56320K, 23% used [0xd9a0,0xda6c4e68,0xdd10)
 PSPermGen   total 28672K, used 28383K [0xd480, 0xd640, 0xd9a0)
  object space 28672K, 98% used [0xd480,0xd63b7f78,0xd640)

LogType:syslog
Log Upload Time:21-Apr-2015 14:17:54
LogLength:1355
Log Contents:
2015-04-21 14:17:40,208 INFO [main] task.TezChild: TezChild starting
2015-04-21 14:17:41,856 INFO [main] task.TezChild: PID, containerIdentifier:  15169, container_1424363133313_0082_01_03
2015-04-21 14:17:41,985 INFO [main] impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: TezTask metrics system started
2015-04-21 14:17:42,355 INFO [TezChild] task.ContainerReporter: Attempting to fetch new

merge small orc files

2015-04-20 Thread patcharee

Hi,

How do I set the configuration in hive-site.xml to automatically merge small 
ORC files (output from a MapReduce job) in Hive 0.14?


This is my current configuration

<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>

<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>

<property>
  <name>hive.merge.orcfile.stripe.level</name>
  <value>true</value>
</property>

However, the output from a MapReduce job, which is stored as ORC, was not 
merged. This is the output:


-rwxr-xr-x   1 root hdfs  0 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/_SUCCESS
-rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-0
-rwxr-xr-x   1 root hdfs  29049 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-1
-rwxr-xr-x   1 root hdfs  29075 2015-04-20 15:23 
/apps/hive/warehouse/coordinate/zone=2/part-r-2


Any ideas?
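
If there is no way to trigger the merge automatically for MapReduce output, a 
fallback I could try is concatenating the partition afterwards (using the 
coordinate table and zone=2 partition from the listing above), for example:

-- merge the small ORC files of this partition in place
SET hive.merge.orcfile.stripe.level=true;
alter table coordinate partition (zone=2) concatenate;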

BR,
Patcharee


override log4j.properties

2015-04-09 Thread patcharee

Hello,

How do I override log4j.properties for a specific Spark job?

BR,
Patcharee





Re: Spark Job History Server

2015-03-18 Thread patcharee

I turned it on, but it failed to start. In the log:

Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp 
:/root/spark-1.3.0-bin-hadoop2.4/sbin/../conf:/root/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/etc/hadoop/conf 
-XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m 
-Xmx512m org.apache.spark.deploy.history.HistoryServer



15/03/18 10:23:46 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Exception in thread main java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.history.YarnHistoryProvider

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:183)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)


Patcharee

On 18. mars 2015 11:35, Akhil Das wrote:

You can simply turn it on using:
./sbin/start-history-server.sh

Read more here: http://spark.apache.org/docs/1.3.0/monitoring.html


Thanks
Best Regards

On Wed, Mar 18, 2015 at 4:00 PM, patcharee <patcharee.thong...@uni.no> wrote:


Hi,

I am using spark 1.3. I would like to use Spark Job History
Server. I added the following line into conf/spark-defaults.conf

spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.yarn.historyServer.address sandbox.hortonworks.com:19888

But got Exception in thread main
java.lang.ClassNotFoundException:
org.apache.spark.deploy.yarn.history.YarnHistoryProvider

What class is really needed? How to fix it?

Br,
Patcharee







Spark Job History Server

2015-03-18 Thread patcharee

Hi,

I am using Spark 1.3. I would like to use the Spark Job History Server. I 
added the following lines into conf/spark-defaults.conf:


spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.history.provider 
org.apache.spark.deploy.yarn.history.YarnHistoryProvider

spark.yarn.historyServer.address  sandbox.hortonworks.com:19888

But got Exception in thread main java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.history.YarnHistoryProvider


What class is really needed? How to fix it?

Br,
Patcharee




Re: Spark Job History Server

2015-03-18 Thread patcharee

Hi,

My Spark was compiled with the YARN profile; I can run Spark on YARN without 
problems.


For the Spark Job History Server problem, I checked 
spark-assembly-1.3.0-hadoop2.4.0.jar and found that the package 
org.apache.spark.deploy.yarn.history is missing. I don't know why.


BR,
Patcharee


On 18. mars 2015 11:43, Akhil Das wrote:
You don't have the YARN package on the classpath. You need to build 
your Spark with YARN. You can read these docs: 
http://spark.apache.org/docs/1.3.0/running-on-yarn.html


Thanks
Best Regards

On Wed, Mar 18, 2015 at 4:07 PM, patcharee <patcharee.thong...@uni.no> wrote:


I turned it on. But it failed to start. In the log,

Spark assembly has been built with Hive, including Datanucleus
jars on classpath
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -cp

:/root/spark-1.3.0-bin-hadoop2.4/sbin/../conf:/root/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/root/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/etc/hadoop/conf
-XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m
-Xmx512m org.apache.spark.deploy.history.HistoryServer


15/03/18 10:23:46 WARN NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
Exception in thread main java.lang.ClassNotFoundException:
org.apache.spark.deploy.yarn.history.YarnHistoryProvider
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:183)
at
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)

Patcharee


On 18. mars 2015 11:35, Akhil Das wrote:

You can simply turn it on using:
./sbin/start-history-server.sh

Read more here:
http://spark.apache.org/docs/1.3.0/monitoring.html


Thanks
Best Regards

On Wed, Mar 18, 2015 at 4:00 PM, patcharee <patcharee.thong...@uni.no> wrote:

Hi,

I am using spark 1.3. I would like to use Spark Job History
Server. I added the following line into conf/spark-defaults.conf

spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.yarn.historyServer.address sandbox.hortonworks.com:19888

But got Exception in thread main
java.lang.ClassNotFoundException:
org.apache.spark.deploy.yarn.history.YarnHistoryProvider

What class is really needed? How to fix it?

Br,
Patcharee









