Hi
I have a question regarding DataFrame partitioning. I read a Hive table from
Spark, and the following Spark API call converts it to a DataFrame:
test_df = sqlContext.sql("select * from hivetable1")
How does Spark decide the partitioning of test_df? Is there a way to partition
test_df based on some column while reading it from Hive?
See the Hive documentation on partitioned tables:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
DataFrameWriter also has a partitionBy method.
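For illustration, a minimal PySpark sketch of both options, assuming Spark 1.4+
(the "date_col" column and output table name are hypothetical):

test_df = sqlContext.sql("select * from hivetable1")

# Spark initially derives the number of partitions from the table's
# underlying input splits; repartition() forces a shuffle into n partitions.
test_df = test_df.repartition(200)

# To lay the data out on disk partitioned by a column, use
# DataFrameWriter.partitionBy. (Repartitioning the in-memory DataFrame
# by a column expression only arrives in Spark 1.6+.)
test_df.write.partitionBy("date_col").saveAsTable("hivetable1_by_date")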
On Thu, Aug 20, 2015 at 7:29 AM, VIJAYAKUMAR JAWAHARLAL <sparkh...@data2o.io> wrote:
Hi
I have a question regarding DataFrame partitioning. ...
... of items into memory could be causing it.
Either way, the logs for the executors should be able to give you some
insight. Have you looked at those yet?
On Tue, Aug 18, 2015 at 6:26 PM, VIJAYAKUMAR JAWAHARLAL <sparkh...@data2o.io> wrote:
Hi All
Why am I getting ExecutorLostFailure ...
On 8/17/15, 12:39 PM, VIJAYAKUMAR JAWAHARLAL <sparkh...@data2o.io> wrote:
Thanks for your help.
I tried caching the lookup tables and doing a left outer join with the big
table (DF). The join does not seem to be using a broadcast join; it still goes
with a hash-partitioned join and shuffles the big table.
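One thing worth checking (a sketch, not a definitive fix; table and key names
are hypothetical): Spark only broadcasts automatically when its size estimate
for the lookup side is below spark.sql.autoBroadcastJoinThreshold (10 MB by
default), and Hive tables without computed statistics get a huge default
estimate. You can raise the threshold and inspect the plan:

# Raise the auto-broadcast threshold (in bytes; default is 10485760 = 10 MB).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "104857600")

big_df = sqlContext.sql("select * from big_table")      # hypothetical names
lkup_df = sqlContext.sql("select * from lookup_table")

joined = big_df.join(lkup_df, big_df.key == lkup_df.key, "left_outer")

# explain() prints the physical plan; a BroadcastHashJoin here means the
# shuffle of the big table is avoided.
joined.explain()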
Hi
I am trying to compute stats on a lookup table that resides in Hive. I am
invoking the Spark API as follows, and it gives me a NoSuchTableException. The
table is double-verified, and the subsequent statement sqlContext.sql("select *
from cpatext.lkup") picks up the table correctly. I am ...
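The call that failed didn't survive in this digest, so this is only a guess at
the surrounding steps: a sketch of computing stats via SQL (which Spark SQL
supports natively with the noscan option) plus a couple of sanity checks,
assuming a HiveContext:

# Native Spark SQL statistics command; gives the optimizer a size
# estimate it can use for broadcast-join decisions.
sqlContext.sql("ANALYZE TABLE cpatext.lkup COMPUTE STATISTICS noscan")

# Sanity checks: list the tables Spark sees in that database, and fetch
# the table directly by its qualified name.
print(sqlContext.tableNames("cpatext"))
lkup_df = sqlContext.table("cpatext.lkup")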
Hi All
Why am I getting ExecutorLostFailure, and why are the executors completely
lost for the rest of the processing? Eventually it makes the job fail. One
thing is for sure: a lot of shuffling happens across executors in my program.
Is there a way to understand and debug ExecutorLostFailure? Any pointers?
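On YARN, a common cause is the container exceeding its memory limit during
heavy shuffles, in which case YARN kills the executor (the NodeManager logs
would show the kill). A sketch of the knobs to try, with illustrative values,
assuming Spark 1.x on YARN:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "6g")                  # JVM heap per executor
        .set("spark.yarn.executor.memoryOverhead", "1024"))  # off-heap headroom in MB
sc = SparkContext(conf=conf)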
...@granturing.com wrote:
You could cache the lookup DataFrames; it'll then do a broadcast join.
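A minimal sketch of that suggestion (the lookup table name is hypothetical):
caching materializes the table in memory with an accurate size estimate, which
lets the automatic broadcast kick in:

lkup_df = sqlContext.sql("select * from lookup_table")
lkup_df.cache()
lkup_df.count()  # force materialization so Spark knows the (small) cached size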
On 8/14/15, 9:39 AM, VIJAYAKUMAR JAWAHARLAL <sparkh...@data2o.io> wrote:
Hi
I am facing a huge performance problem when I try to left-outer-join a very
big data set (~140GB) with a bunch of small lookups [star schema type]. I am
using DataFrames in Spark SQL. It looks like the data is shuffled and skewed
when that join happens. Is there any way to improve performance?
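For concreteness, a sketch of the star-schema join pattern in question, with
hypothetical table and key names; if the dimension side is small enough to
broadcast, the ~140GB fact table never needs to be shuffled:

fact = sqlContext.sql("select * from fact_table")   # the ~140GB side
dim = sqlContext.sql("select * from dim_table")     # a small lookup

dim.cache()
dim.count()  # materialize so the optimizer sees its true size

# Left outer join; check the plan for a BroadcastHashJoin.
result = fact.join(dim, fact.dim_id == dim.id, "left_outer")
result.explain()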