> Perhaps the real "fix" is to figure out why logical plan creation is so
> slow for 700 columns.
>
>
> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <darshan.m...@gmail.com>
> wrote:
>
Is there a way I can use the same logical plan for a query? Everything will
be the same except the underlying file will be different.
The issue is that my query has around 700 columns, and generating the logical
plan takes 20 seconds. It happens every 2 minutes, but every time the
underlying file is different.
I do
fle.partitions',10)
> sc = SparkContext(conf=conf)
>
>
>
> On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh <darshan.m...@gmail.com>
> wrote:
>
Hi,
My default parallelism is 100. Now I join 2 dataframes with 20 partitions
each; the joined dataframe has 100 partitions. I want to know what is the way
to keep it at 20 (other than repartition and coalesce).
Also, when I join these 2 dataframes I am using 4 columns as the join
columns. The dataframes
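For what it's worth, in Spark SQL the partition count coming out of a shuffle join is typically driven by `spark.sql.shuffle.partitions` rather than by the inputs' 20 partitions, so lowering that setting before the join is one way to get a 20-partition result. A minimal sketch, assuming an existing `sqlContext` (the helper name is mine):

```python
def set_shuffle_partitions(sqlContext, n):
    # Hypothetical helper: make subsequent shuffle joins/aggregations
    # produce n post-shuffle partitions. `sqlContext` is assumed to be an
    # existing SQLContext; the config key exists in Spark SQL >= 1.1.
    sqlContext.setConf("spark.sql.shuffle.partitions", str(n))
```

Note that this applies to every shuffle on that context afterwards, not just the one join.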
These are 2 parameters, and the default values for these are 0.6 and 0.2,
which is around 80%. I am wondering where the remaining 0.2 (20%) goes. Is it
for the JVM's other memory requirements?
If yes, then what is spark.memory.fraction used for?
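As a back-of-the-envelope check of the legacy (pre-1.6) split the question describes, where `spark.storage.memoryFraction` defaults to 0.6 and `spark.shuffle.memoryFraction` to 0.2 (a rough sketch only; the real model also applies safety fractions not shown here):

```python
executor_heap_gb = 10.0  # example figure from the question

storage_fraction = 0.6   # spark.storage.memoryFraction (legacy default)
shuffle_fraction = 0.2   # spark.shuffle.memoryFraction (legacy default)

storage_gb = executor_heap_gb * storage_fraction       # cached blocks
shuffle_gb = executor_heap_gb * shuffle_fraction       # shuffle buffers
other_gb = executor_heap_gb - storage_gb - shuffle_gb  # JVM objects, user code

print(storage_gb, shuffle_gb, other_gb)  # 6.0 2.0 2.0
```

So the leftover ~20% is simply unmanaged heap left for the JVM and user data structures; in Spark 1.6+ these two knobs were replaced by the unified `spark.memory.fraction`.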
My understanding is that if we have 10GB of memory per executor
Hi,
I am using a standalone Spark cluster and a ZooKeeper cluster for high
availability. I sometimes get an error when I start the master. The error is
related to leader election in Curator and says no method found (getProcess),
and the master doesn't get started.
Just wondering what could
Hi,
I have a dataframe df1 and I partitioned it by col1,col2 and persisted it.
Then I created new dataframe df2.
val df2 = df1.sortWithinPartitions("col1","col2","col3")
df1.persist()
df2.persist()
df1.count()
df2.count()
Now I expect that any group-by statement using "col1","col2","col3"
Hi,
I have an application which uses 3 parquet files, 2 of which are large and
one of which is small. These files are in HDFS and are partitioned by the
column "col1". Now I create 3 data-frames, one for each parquet file, but I
pass the col1 value so that each reads only the relevant data. I always read from
Hi,
I would like to know if there is any max limit on the union of data-frames.
How will the performance of, say, 1 data-frame union be in Spark when all
the data is in cache?
The other option is 1 partitions of a single dataframe.
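There is no documented hard cap on chained unions, but each union adds a node to the logical plan, so a long linear chain makes the plan (and its analysis time) grow with the number of inputs, while a pairwise, tree-shaped reduction keeps the plan depth logarithmic. A plain-Python sketch using lists as stand-ins for data-frames (in Spark 1.x the real call would be `df1.unionAll(df2)`):

```python
from functools import reduce

# One single-element "frame" per input, standing in for 100 data-frames.
frames = [[i] for i in range(100)]

# Linear fold: plan depth grows linearly with the number of frames.
linear = reduce(lambda a, b: a + b, frames)

def tree_union(fs):
    # Pairwise reduction: depth is ~log2(len(fs)) instead of len(fs).
    while len(fs) > 1:
        fs = [fs[i] + fs[i + 1] if i + 1 < len(fs) else fs[i]
              for i in range(0, len(fs), 2)]
    return fs[0]

# Both shapes of combining yield the same rows, in the same order.
assert tree_union([[i] for i in range(100)]) == linear
```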
Thanks
Thanks for the information. When I mentioned map-side join, I meant that each
partition from DF 1 joins with the partition with the same key of DF 2 on the
worker node, without shuffling the data. In other words, do as much work as
possible within the worker node before shuffling the data.
Thanks
Darshan Singh
On Wed
Thanks a lot for this. I was thinking of using cogrouped RDDs. We will try
to move to 1.6, as there are other issues as well in 1.5.2.
The same code is much faster in 1.6.1, but plan-wise I do not see much
difference. Why is it still partitioning, then sorting, and then joining?
I expect it to sort
I used 1.5.2. I have used the movies data to reproduce the issue. Below is
the physical plan. I am not sure why it is hash-partitioning the data, then
sorting, and then joining. I expect the data to be joined first and then sent
for further processing.
I sort of expect a common partitioner which will work
:41 PM, Darshan Singh <darshan.m...@gmail.com>
wrote:
> Thanks a lot. I will try this one as well.
>
> On Tue, Apr 5, 2016 at 9:28 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> The following should ensure partition pruning happens:
>>
>>
d.load("/path/to/data").where("country = 'UK'")
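Spelled out (the snippet above is truncated), the pruned read might look like the sketch below; the path and the `country` partition column are illustrative, and the point is that a filter on a partition column lets Spark skip non-matching directories entirely:

```python
def read_uk_rows(sqlContext, path="/path/to/data"):
    # Hypothetical sketch: `country` is assumed to be a partition column
    # of the dataset at `path`, so the where() clause prunes whole
    # directories instead of scanning them.
    return (sqlContext.read
            .format("parquet")
            .load(path)
            .where("country = 'UK'"))
```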
>
> On Tue, Apr 5, 2016 at 1:13 PM, Darshan Singh <darshan.m...@gmail.com>
> wrote:
>
>> Thanks for the reply.
>>
>> Now I saved the part_movies as parquet file.
>>
>> Then created new
.
>
> On Tue, Apr 5, 2016 at 12:14 PM, Darshan Singh <darshan.m...@gmail.com>
> wrote:
>
Thanks. It is not my exact scenario, but I have tried to reproduce it. I
have used 1.5.2.
I have a part-movies data-frame which has 20 partitions, 1 for each movie.
I created the following query:
val part_sql = sqlContext.sql("select * from part_movies where movie = 10")
part_sql.count()
I expect