>>> sqlContext.sql("set hive.enforce.bucketing = true; ")
>>> sqlContext.sql("set hive.enforce.sorting = true; ")
>>> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) sto
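For reference, a minimal sketch of what the full statement might look like in the PySpark shell; the storage format and location below are hypothetical, since the original snippet is truncated:
>>> sqlContext.sql("set hive.enforce.bucketing = true")
>>> sqlContext.sql("set hive.enforce.sorting = true")
>>> sqlContext.sql("""
...     CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING)
...     PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING)
...     STORED AS ORC
...     LOCATION 's3://my-bucket/users/'
... """)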
Hello,
Looks like you are hitting this:
https://issues.apache.org/jira/browse/HIVE-11940.
Thanks,
Bijay
On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh
wrote:
> can you provide a code snippet of how you are populating the target table
> from the temp table.
>
>
> HTH
Sorry for the confusion, this was supposed to be the answer for another thread.
Bijay
On Sat, Apr 30, 2016 at 2:37 PM, Bijay Kumar Pathak
wrote:
> Hi,
>
> I was facing the same issue on Spark 1.6. My data size was around 100 GB
> and I was writing into a partitioned Hive table.
>
> I
Hi,
I am running Spark 1.6 on EMR. I have a workflow which does the following
things:
1. Read the two flat files, create data frames, and join them.
2. Read the particular partition from the Hive table and join it with the
dataframe from step 1.
3. Finally, insert overwrite into the Hive table (a sketch of this flow is below).
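A minimal PySpark sketch of that three-step flow; the file paths, table, partition, and column names are hypothetical, not taken from the original message:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="emr-workflow-sketch")
sqlContext = HiveContext(sc)

# 1. Read the two flat files and join them (paths and join key are hypothetical).
df_a = sqlContext.read.json("s3://my-bucket/input_a/")
df_b = sqlContext.read.json("s3://my-bucket/input_b/")
joined = df_a.join(df_b, "user_id")

# 2. Read one partition of the Hive table and join it with the result of step 1.
hive_part = sqlContext.sql(
    "SELECT * FROM my_db.my_table WHERE dt = '2016-04-30'")
final_df = joined.join(hive_part, "user_id", "left_outer")

# 3. Insert overwrite back into the partitioned Hive table.
final_df.registerTempTable("final_tmp")
sqlContext.sql(
    "INSERT OVERWRITE TABLE my_db.my_table PARTITION (dt='2016-04-30') "
    "SELECT user_id, user_record FROM final_tmp")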
Hi,
I have written a UDF to do the same on a PySpark DataFrame, since some of my
dates are before the Unix epoch of 1/1/1970. I have more than 250 columns and
am applying the custom date_format UDF to more than 50 columns. I am getting
an OOM error and poor performance because of the UDF.
What's your
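For context, a minimal sketch of the kind of custom date-format UDF being described, applied across several columns; the column names, data, and target format are hypothetical:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

sc = SparkContext(appName="date-udf-sketch")
sqlContext = SQLContext(sc)

def reformat_date(value):
    # Reformat 'yyyy-MM-dd' strings to 'dd/MM/yyyy'; plain string handling
    # also works for dates before the 1970 epoch.
    if value is None:
        return None
    year, month, day = value.split("-")
    return "{0}/{1}/{2}".format(day, month, year)

date_format_udf = udf(reformat_date, StringType())

# Hypothetical frame with a couple of date columns, one of them pre-1970.
df = sqlContext.createDataFrame(
    [("1969-12-25", "2016-01-05"), ("1955-03-14", "2016-02-29")],
    ["birth_date", "join_date"])

# Every Python UDF call ships rows between the JVM and the Python workers,
# which adds up quickly when it is applied to 50+ columns.
for c in ["birth_date", "join_date"]:
    df = df.withColumn(c, date_format_udf(df[c]))
df.show()
Where the built-in functions in pyspark.sql.functions cover the format (for example date_format or unix_timestamp), they run inside the JVM and avoid the per-row Python overhead.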
Hi,
Can you elaborate on the issues you are facing? I am doing a similar kind of
join, so I may be able to provide you with some suggestions and pointers.
Thanks,
Bijay
On Thu, Mar 24, 2016 at 5:12 AM, pseudo oduesp
wrote:
> hi, I spent two months of my time to
Here is another way you can achieve that (in Python):
# To add a new column, pass a Column expression as the second argument.
base_df.withColumn("column_name", column_expression_for_new_column)
# To add a new row, create a data frame containing the new row and do a
# unionAll().
base_df.unionAll(new_df)
# Another approach: convert to rdd, add required fields, and convert
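A self-contained sketch of both approaches; the data and column values are made up for illustration:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit

sc = SparkContext(appName="add-column-and-row-sketch")
sqlContext = SQLContext(sc)

base_df = sqlContext.createDataFrame(
    [(1, "alice"), (2, "bob")], ["id", "name"])

# Add a new column; lit() wraps a constant into a Column expression.
with_col = base_df.withColumn("country", lit("US"))

# Add a new row by building a one-row DataFrame with the same schema
# and unioning it in (unionAll matches columns by position in Spark 1.x).
new_df = sqlContext.createDataFrame([(3, "carol", "UK")], with_col.columns)
combined = with_col.unionAll(new_df)
combined.show()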
Hello,
I am getting all the field values as NULL while reading a partitioned Hive
table in Spark 1.5.0 running on CDH 5.5.1 with YARN (dynamic allocation).
Below are the commands I used in spark-shell:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
val
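For reference, the same read expressed as a self-contained PySpark sketch; the database, table, and partition names are hypothetical, not taken from the original message:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="read-partitioned-hive-table")
hiveContext = HiveContext(sc)

# Read one partition of a (hypothetical) partitioned Hive table and
# inspect the values that come back as NULL.
df = hiveContext.sql("SELECT * FROM my_db.events WHERE dt = '2016-01-01'")
df.show(10)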
Hello,
We wanted to tune Spark running on a YARN cluster. The Spark History
Server UI shows lots of metrics, like:
- GC time
- Task Duration
- Shuffle R/W
- Shuffle Spill (Memory/Disk)
- Serialization Time (Task/Result)
- Scheduler Delay
Among the above metrics, which are
The Spark Sort-Based Shuffle (default from 1.1) keeps the data from
each map task in memory until it can't fit, after which it is
sorted and spilled to disk. You can reduce the shuffle write to
disk by increasing spark.shuffle.memoryFraction (default 0.2).
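A small sketch of how that setting might be raised when building the application's configuration; the value shown is illustrative, not a recommendation:
from pyspark import SparkConf, SparkContext

# Give the shuffle more of the executor heap before it spills to disk
# (spark.shuffle.memoryFraction defaults to 0.2 in these Spark 1.x releases).
conf = (SparkConf()
        .setAppName("shuffle-tuning-sketch")
        .set("spark.shuffle.memoryFraction", "0.4"))
sc = SparkContext(conf=conf)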
By writing the shuffle output from older map tasks to memory?
On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak bijay.pat...@cloudwick.com
wrote:
The Spark Sort-Based Shuffle (default from 1.1) keeps the data from
each map task in memory until it can't fit, after which it is
sorted and spilled to disk. You can reduce
Hello,
I am running TeraSort https://github.com/ehiggs/spark-terasort on 100GB
of data. The final metrics I am getting on Shuffle Spill are:
Shuffle Spill(Memory): 122.5 GB
Shuffle Spill(Disk): 3.4 GB
What's the difference and relation between these two metrics? Does this
mean 122.5 GB was