Hi Ankur,
I also tried setting a property to write parquet files of 256 MB. I am
using PySpark; below is how I set the property, but it is not working for me.
How did you set the property?
spark_context._jsc.hadoopConfiguration().setInt("dfs.blocksize", 268435456)
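For reference, a minimal PySpark sketch of one way this is sometimes attempted; dfs.blocksize is the HDFS block size and parquet.block.size is the parquet-mr row-group size, so whether they actually give 256 MB output files is an assumption to verify, not a confirmed fix:

# Hypothetical sketch: aim for ~256 MB blocks / row groups before writing.
block_size = 256 * 1024 * 1024
hadoop_conf = spark_context._jsc.hadoopConfiguration()
hadoop_conf.setInt("dfs.blocksize", block_size)
hadoop_conf.setInt("parquet.block.size", block_size)
df.write.parquet("s3://bucket/path/out")  # df and the output path are placeholders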
Hi,
Try this one:
df_join = df1.join(df2, 'Id', "fullouter")
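A tiny self-contained example of that call with made-up data, in case it helps (sqlContext here is the SQLContext predefined in the Spark 1.6 PySpark shell):

# Toy data only, to show the single-column full outer join form.
df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["Id", "val1"])
df2 = sqlContext.createDataFrame([(2, "x"), (3, "y")], ["Id", "val2"])
df_join = df1.join(df2, "Id", "fullouter")
df_join.show()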
Thanks,
Bijay
On Tue, May 17, 2016 at 9:39 AM, ram kumar wrote:
> Hi,
>
> I tried to join two dataframe
>
> df_join = df1.join(df2, ((df1("Id") === df2("Id")), "fullouter")
>
>
Hi,
How can we disable writing _common_metadata while saving a DataFrame in
parquet format in PySpark? I tried to set the property using the command below,
but it didn't help.
sparkContext._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata",
"false")
Thanks,
Bijay
64-bit JVM with less than 32G heap, you might want
> to enable -XX:+UseCompressedOops[1]. And if your dataframe is somehow
> generating more than 2^31-1 number of arrays, you might have to rethink
> your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
> On Wed,
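For reference, a hedged sketch of passing that JVM flag to the executors from PySpark; the driver-side flag would normally go on spark-submit or spark-defaults instead, since the driver JVM is already running by the time SparkConf is read:

# Hypothetical sketch: enable compressed oops on executor JVMs.
from pyspark import SparkConf, SparkContext
conf = SparkConf().set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
sc = SparkContext(conf=conf)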
Hi,
I am reading a parquet file of around 50+ GB which has 4013 partitions and
240 columns. Below is my configuration:
driver: 20 GB memory with 4 cores
executors: 45 executors with 15 GB memory and 4 cores each.
I tried to read the data both with the DataFrame reader and with the hive
context.
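For context, a minimal sketch of the two read paths mentioned above (the path and table name are placeholders; sc is the context from the PySpark shell, and Spark 1.6 is assumed):

from pyspark.sql import SQLContext, HiveContext

sql_context = SQLContext(sc)
hive_context = HiveContext(sc)

# 1) DataFrame reader directly against the parquet files (path is hypothetical)
df_direct = sql_context.read.parquet("s3://bucket/warehouse/big_table/")

# 2) HiveContext against the metastore table (table name is hypothetical)
df_hive = hive_context.table("db.big_table")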
Thanks, Ted. This looks like the issue, since I am running it on EMR and the
Hive version is 1.0.0.
Thanks,
Bijay
On Wed, May 4, 2016 at 10:29 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Looks like you were hitting HIVE-11940
>
> On Wed, May 4, 2016 at 10:02 AM, Bijay Kuma
Hello,
I am writing a DataFrame of around 60+ GB into a partitioned Hive table in
parquet format using hiveContext. The Spark insert overwrite job completes in
a reasonable amount of time, around 20 minutes.
But the job is then taking more than 2 hours to copy data
from .hivestaging
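For reference, a hedged sketch of the kind of write path described above; the table, column, and partition names are hypothetical, and this is only one Spark 1.6 style way to express the insert overwrite, not the original job:

# Hypothetical sketch: dynamic-partition insert overwrite through hiveContext.
hive_context.setConf("hive.exec.dynamic.partition", "true")
hive_context.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

df.registerTempTable("staging_df")  # df is a placeholder DataFrame
hive_context.sql("""
    INSERT OVERWRITE TABLE db.target_table PARTITION (ds)
    SELECT col_a, col_b, ds FROM staging_df
""")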
Hi,
I was facing the same issue on Spark 1.6. My data size was around 100 GB
and I was writing into a partitioned Hive table.
I was able to solve this issue by starting from 6 GB of memory and going
up to 15 GB of memory per executor, with an overhead of 2 GB, and by partitioning
the DataFrame before doing
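A minimal sketch of those settings as I read them (the property names are the standard Spark 1.6 YARN ones; the partition count, column, and table names are placeholders):

# Hypothetical sketch: 15 GB executors with 2 GB overhead, plus a pre-write
# repartition on the Hive partition column.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .set("spark.executor.memory", "15g")
        .set("spark.yarn.executor.memoryOverhead", "2048"))  # value in MB
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)

df = hive_context.table("db.source_table")       # placeholder source
df = df.repartition(400, df["partition_col"])    # repartition before the partitioned write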
existing data in the table or partition
>
>- unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0
><https://issues.apache.org/jira/browse/HIVE-2612>).
>
>
>
> Thanks.
>
> Zhan Zhang
>
> On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak &l
Hi,
I have a job which writes to a Hive table with dynamic partitions. Inside
the job, I am writing into the table twice, but I am only seeing the
partition from the last write, although I can see in the Spark UI that it is
processing data for both partitions.
Below is the query I am using to write
Hello,
I have Spark jobs packaged in a zip and deployed in cluster mode on AWS
EMR. The job has to read a conf file packaged with the zip under the
resources directory. I can read the conf file in client mode but not in
cluster mode.
How do I read the conf file packaged in the zip while
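One hedged workaround sketch, separate from the zip layout question itself: ship the conf file with --files and resolve it through SparkFiles, which behaves the same in client and cluster mode; the file name is a placeholder:

# Hypothetical sketch: submit with
#   spark-submit --deploy-mode cluster --files resources/app.conf ... job.py
# and then locate the shipped copy at runtime.
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="conf-demo")
conf_path = SparkFiles.get("app.conf")  # local path to the file shipped via --files
with open(conf_path) as f:
    settings = f.read()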
wrote:
> Try increasing the memory allocated for this job.
>
> Sent from Outlook for iPhone <https://aka.ms/wp8k5y>
>
> On Sun, Apr 10, 2016 at 9:12 PM -0700, "Bijay Kumar Pathak" <
> bkpat...@mtu.edu> wrote:
>
> Hi,
>>
>>
Hi,
I am running Spark 1.6 on EMR. I have a workflow which does the following
things (a minimal sketch follows the list):
1. Read the two flat files, create DataFrames, and join them.
2. Read a particular partition from the Hive table and join it with the
DataFrame from step 1.
3. Finally, insert overwrite into the Hive table.
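A hedged PySpark sketch of that three-step workflow; every path, table, column, and partition value is a placeholder, sc is the shell's SparkContext, and the SQL form in step 3 is just one way to express the insert overwrite:

from pyspark.sql import HiveContext

hive_context = HiveContext(sc)

# 1. Read the two flat files (assumed comma-separated, two fields), build
#    DataFrames, and join them.
rdd_a = sc.textFile("s3://bucket/in/file_a.txt").map(lambda l: l.split(","))
rdd_b = sc.textFile("s3://bucket/in/file_b.txt").map(lambda l: l.split(","))
df_a = hive_context.createDataFrame(rdd_a, ["id", "val_a"])
df_b = hive_context.createDataFrame(rdd_b, ["id", "val_b"])
df_flat = df_a.join(df_b, "id")

# 2. Read one partition from the Hive table and join it with step 1's result.
df_part = hive_context.sql("SELECT * FROM db.src_table WHERE ds = '2016-05-01'")
df_joined = df_flat.join(df_part, "id")

# 3. Insert overwrite into the target Hive table partition.
df_joined.registerTempTable("joined_df")
hive_context.sql("""
    INSERT OVERWRITE TABLE db.target_table PARTITION (ds = '2016-05-01')
    SELECT id, val_a, val_b FROM joined_df
""")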