Hi,
I’m not sure I understand your initial question…
Depending on the compression algorithm, you may or may not be able to split the
file.
So if it's not splittable, you have a single long-running thread.
My guess is that you end up with a single very large partition.
If so, if you repartition,
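A minimal Spark sketch of that repartitioning idea, assuming a gzip-compressed (non-splittable) input; the path and partition count are illustrative, not from the thread:

```scala
// gzip is not splittable, so this RDD starts with a single partition
val raw = sc.textFile("/data/sales_staging.txt.gz")

// repartition() shuffles the data into more partitions so later
// stages can run in parallel; 200 is just an illustrative count
val redistributed = raw.repartition(200)
```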
Does the same happen if all the tables are in ORC format? It might just be
simpler to convert the text table to ORC, since it is rather small.
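A hedged sketch of that conversion as a CTAS in HiveQL; the target table name and the Snappy compression choice are assumptions mirroring the rest of the thread:

```sql
-- Create an ORC copy of the small text table in one statement
CREATE TABLE sales_staging_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM sales_staging;
```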
> On 29 Jun 2016, at 15:14, Mich Talebzadeh wrote:
>
> Hi all,
>
> It finished in 2 hours 18 minutes!
>
> Started at
>
I think the Tez engine is much better maintained with respect to optimizations
for ORC, vectorization, and query execution than the MR engine. It will
definitely be better to use it.
MR is also deprecated in Hive 2.0.
For me it does not make sense to use MR with Hive later than 1.1.
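Switching engines is a one-line session setting in Hive (`hive.execution.engine` is a standard Hive property; running with `tez` assumes Tez is installed on the cluster):

```sql
-- In a Hive session: switch from MapReduce to Tez
SET hive.execution.engine=tez;
```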
As I
Hi, guys!
As far as I remember, Spark does not take advantage of all of ORC's features
and optimizations. Moreover, the ability to read ORC files was added to Spark
only relatively recently.
So, despite the "victorious" results announced in
http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ ,
This is what I am getting in the container log for MR:
2016-06-28 23:25:53,808 INFO [main]
org.apache.hadoop.hive.ql.exec.FileSinkOperator: Writing to temp file: FS
That is a good point.
The ORC table properties are as follows:
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1")
which sets each stripe to 256 MB (268435456 bytes).
Just to clarify, this is Spark running on Hive tables. I don't think the use
of Tez, MR or Spark as
Bzip2 is splittable for text files.
By the way, in ORC the question of splittability does not matter, because each
stripe is compressed individually.
Have you tried Tez? As far as I recall (at least it was the case in the first
versions of Hive), MR uses a single reducer for ORDER BY, which is a bottleneck.
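The single-reducer point can be seen in HiveQL: ORDER BY imposes a total order through one reducer, while DISTRIBUTE BY plus SORT BY sorts within each reducer in parallel (at the cost of no global ordering). The `amount` column below is a hypothetical placeholder; `prod_id` comes from the thread:

```sql
-- Total order: everything funnels through a single reducer (the bottleneck)
SELECT prod_id, amount FROM sales2 ORDER BY amount;

-- Parallel alternative: rows are hashed to reducers by prod_id and
-- sorted within each reducer; no global order guarantee
SELECT prod_id, amount FROM sales2 DISTRIBUTE BY prod_id SORT BY amount;
```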
Do you
Hi,
I have a simple join between table sales2, a Snappy-compressed ORC table with
22 million rows, and another small table, sales_staging, with under a million
rows, stored as an uncompressed text file.
The join is very simple:
val s2 = HiveContext.table("sales2").select("PROD_ID")
val s =