Re: reading ORC format on Spark-SQL

2016-02-11 Thread Philip Lee
flat scaling. Is that because it is not over capacity yet? But loading a CSV file is not that big a job, I guess. Could you correct me? Thanks in advance. Best, Phil On Wed, Feb 10, 2016 at 11:17 PM, Philip Lee <philjj...@gmail.com> wrote: > Thanks for your reply! > > according

reading ORC format on Spark-SQL

2016-02-10 Thread Philip Lee
What steps are involved when reading the ORC format in Spark SQL? I mean, reading a CSV file usually just loads the dataset directly into memory, but I feel like Spark SQL goes through some extra steps when reading ORC. For example, does it have to create a table to insert the dataset into? And then does it insert the

Re: reading ORC format on Spark-SQL

2016-02-10 Thread Philip Lee
Thanks for your reply! According to you, because of the nature of ORC, it cannot be split by the default chunk, since it is not composed of lines like CSV. Until you run out of capacity, a distributed system *has* to show sub-linear scaling - and will show flat scaling up to a

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Philip Lee
From my experience, Spark SQL has its own optimizer to support Hive queries and the metastore. After Spark 1.5.2, its optimizer is named Catalyst. On Feb 3, 2016, at 12:12 AM, "Xuefu Zhang" wrote: > I think the diff is not only about which does optimization but more on > feature parity.

Re: ORC format

2016-02-02 Thread Philip Lee
hive.apache.org > *Subject:* Re: ORC format > > > > ORC does not currently expose a primary key to the user, though we have > talked of having it do that. As Mich says the indexing on ORC is oriented > towards statistics that help the optimizer plan the query. T

ORC format

2016-02-01 Thread Philip Lee
Hello, I am comparing the performance of some systems on ORC and CSV files. I read the ORC documentation on the Hive website, but I am still curious about a few things. I know the ORC format is faster for filtering or reading because it has indexing. Does it have an advantage when joining two tables of ORC datasets as

Re: ORC format

2016-02-01 Thread Philip Lee
Also, when creating ORC from CSV, is an index key made for every column, or is a primary key made for the table? If keys are made on each column in a table, accessing any column in operations like filtering should be faster. On Mon, Feb 1, 2016 at 4:21 PM, Philip Lee <philjj...@gmail.

Re: ORC format

2016-02-01 Thread Philip Lee
be understood as given or endorsed by Peridale Technology > Ltd, its subsidiaries or their employees, unless expressly so stated. It is > the responsibility of the recipient to ensure that this email is virus > free, therefore neither Peridale Technology Ltd, its subsidiaries n

Hive bug? about no such table

2015-12-18 Thread Philip Lee
I think this is a Hive bug related to the metastore. Here is the situation. After I generated scale factor 300, named bigbench300, alongside bigbench100, which already existed, I ran "hive job with bigbench300". At first it worked fine. Then I ran the Hive job with bigbench100 again. It

Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-27 Thread Philip Lee
, you defined the partition function for DBY. On Sun, Oct 25, 2015 at 12:59 AM, Philip Lee <philjj...@gmail.com> wrote: > Hello, the same question about DISTRIBUTE BY on Hive. > > According to you, you do not use hashCode of Object class on DBY, > Distribute By. > > I

Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-24 Thread Philip Lee
, you defined the partition function for DBY. Regards, Philip Lee On Thu, Oct 22, 2015 at 7:13 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote: > > > so do you think if we want the same result from Hive and Spark or the > >other framework, how could we try this one?

Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Philip Lee
Hello, I am a Computer Science student in Berlin working on Flink and Spark. I have an important question. It comes from what I am doing these days, which is translating Hive queries to Flink. When applying [Distribute By] from Hive to the framework, the function should be

Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Philip Lee
Thanks for your help. So, if we want the same result from Hive and Spark or another framework, how could we achieve this? Could you tell me in detail? Regards, Philip On Thu, Oct 22, 2015 at 6:25 PM, Gopal Vijayaraghavan wrote: > > > When applying