Data frame created from hive table and its partition

2015-08-20 Thread VIJAYAKUMAR JAWAHARLAL
Hi, I have a question regarding data frame partitioning. I read a Hive table from Spark, and the following Spark API call converts it to a DF: test_df = sqlContext.sql("select * from hivetable1"). How does Spark decide the partitioning of test_df? Is there a way to partition test_df based on some column while …
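A minimal sketch of both points (PySpark; sqlContext and hivetable1 come from the message, while "some_col" and the partition count are hypothetical). Spark derives the initial partitioning from the table's input splits; from Spark 1.6 on, repartition() also accepts columns:

    test_df = sqlContext.sql("select * from hivetable1")

    # The initial partition count comes from the underlying input splits
    # of the Hive table; inspect it via the DataFrame's RDD:
    print(test_df.rdd.getNumPartitions())

    # Spark 1.6+: repartition() accepts columns, so the DataFrame can be
    # hash-partitioned by a column of your choice:
    test_df_by_col = test_df.repartition(200, "some_col")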

Re: Data frame created from hive table and its partition

2015-08-20 Thread VIJAYAKUMAR JAWAHARLAL
… https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables DataFrameWriter also has a partitionBy method. On Thu, Aug 20, 2015 at 7:29 AM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote: Hi I have …
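A sketch of the partitionBy pointer (PySpark; the output path and the "year" column are hypothetical):

    # Write the data laid out on disk by the values of a column:
    test_df.write.partitionBy("year").parquet("/tmp/hivetable1_by_year")

    # Or persist it as a partitioned table in the metastore:
    test_df.write.partitionBy("year").saveAsTable("hivetable1_by_year")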

Re: What is the reason for ExecutorLostFailure?

2015-08-19 Thread VIJAYAKUMAR JAWAHARLAL
… of items into memory could be causing it. Either way, the logs for the executors should be able to give you some insight; have you looked at those yet? On Tue, Aug 18, 2015 at 6:26 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote: Hi All, why am I …

Re: Left outer joining big data set with small lookups

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
… On 8/17/15, 12:39 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote: Thanks for your help. I tried to cache the lookup tables and left outer join them with the big table (DF). The join does not seem to be using a broadcast join; it still goes with a hash partition join and shuffles the big table …
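Where the planner will not pick a broadcast join on its own, later Spark releases expose an explicit hint. A hedged sketch (big_df, lkup_df, and "key" are hypothetical stand-ins for the ~140GB table and a lookup; pyspark.sql.functions.broadcast arrived after the release discussed in this thread):

    from pyspark.sql.functions import broadcast

    # Force a broadcast (map-side) join instead of a shuffled hash join:
    joined = big_df.join(broadcast(lkup_df),
                         big_df["key"] == lkup_df["key"],
                         "left_outer")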

COMPUTE STATS on hive table - NoSuchTableException

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Hi, I am trying to compute stats from Spark on a lookup table which resides in Hive. I am invoking the Spark API as follows, and it gives me NoSuchTableException. The table is double-verified, and the subsequent statement sqlContext.sql("select * from cpatext.lkup") picks up the table correctly. I am …
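One likely wrinkle: "COMPUTE STATS" is Impala syntax; through a HiveContext the HiveQL spelling is ANALYZE TABLE, and Spark 1.x natively supported the noscan variant, which updates the table-size estimate the broadcast-join planner reads. A sketch using the fully qualified name from the message:

    # HiveQL equivalent of Impala's COMPUTE STATS; "noscan" refreshes the
    # table-size statistic without scanning the data:
    sqlContext.sql("ANALYZE TABLE cpatext.lkup COMPUTE STATISTICS noscan")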

What is the reason for ExecutorLostFailure?

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Hi All, why am I getting ExecutorLostFailure, with executors completely lost for the rest of the processing? Eventually it makes the job fail. One thing is for sure: a lot of shuffling happens across executors in my program. Is there a way to understand and debug ExecutorLostFailure? Any pointers …
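A frequent cause in shuffle-heavy jobs is the resource manager killing executors that exceed their memory allowance. A hedged sketch of the knobs typically raised (values are illustrative, not prescriptive; property names are the Spark 1.x ones):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("shuffle-heavy-job")
            .set("spark.executor.memory", "8g")
            # Off-heap headroom YARN accounts against the container (MB):
            .set("spark.yarn.executor.memoryOverhead", "2048")
            .set("spark.executor.cores", "4"))
    sc = SparkContext(conf=conf)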

Re: Left outer joining big data set with small lookups

2015-08-17 Thread VIJAYAKUMAR JAWAHARLAL
…@granturing.com wrote: You could cache the lookup DataFrames; it'll then do a broadcast join. On 8/14/15, 9:39 AM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote: Hi, I am facing a huge performance problem when I am trying to left outer join a very big data set (~140GB …
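A sketch of that suggestion (the table name is borrowed from a sibling thread; the threshold value is illustrative). Caching matters because an in-memory relation carries accurate size statistics, which the planner compares against the broadcast threshold:

    lkup_df = sqlContext.table("cpatext.lkup")
    lkup_df.cache().count()  # materialize the cache so its size is known

    # Size in bytes below which Spark broadcasts a relation in joins
    # (the default is around 10MB):
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
                       str(50 * 1024 * 1024))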

Left outer joining big data set with small lookups

2015-08-14 Thread VIJAYAKUMAR JAWAHARLAL
Hi, I am facing a huge performance problem when I am trying to left outer join a very big data set (~140GB) with a bunch of small lookups [star schema type]. I am using data frames in Spark SQL. It looks like the data is shuffled and skewed when that join happens. Is there any way to improve performance …
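When a lookup genuinely cannot be broadcast, one common mitigation for a skewed join key (a sketch under assumptions, not something proposed in this thread) is salting: spread each hot key over N partitions by salting the big side and replicating the small side N times. big_df, lkup_df, "key", and N are all hypothetical:

    from pyspark.sql import functions as F

    N = 10  # salt fan-out

    # A random salt on the big side spreads each hot key over N partitions:
    big_salted = big_df.withColumn("salt", (F.rand() * N).cast("int"))

    # Replicate every lookup row once per salt value:
    lkup_salted = lkup_df.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(N)])))

    joined = big_salted.join(
        lkup_salted,
        (big_salted["key"] == lkup_salted["key"]) &
        (big_salted["salt"] == lkup_salted["salt"]),
        "left_outer")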