Re: spark sql hive-shims

2015-05-14 Thread Lior Chaga
I see that the pre-built distributions include hive-shims-0.23 shaded into the
spark-assembly jar (unlike when I make the distribution myself).
Does anyone know what I should do to include the shims in my distribution?
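
A guess, untested: the pre-built binaries are compiled against Hadoop 2.x, so
adding a Hadoop 2 profile to the build may be what pulls hive-shims-0.23 into
the assembly. A sketch, assuming the hadoop-2.4 profile:

  make-distribution.sh --tgz -Phadoop-2.4 -Phive -Phive-thriftserver -DskipTests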


On Thu, May 14, 2015 at 9:52 AM, Lior Chaga lio...@taboola.com wrote:

 Ultimately it was a PermGen out-of-memory error. I somehow missed it in the log.
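
 For anyone hitting the same thing, a minimal sketch of the usual Spark 1.x
 workaround: raise the permanent generation limit via the extra Java options
 (the 512m figure below is an assumption, not a measured value):

   # spark-defaults.conf (size is an assumption; tune to your workload)
   spark.driver.extraJavaOptions    -XX:MaxPermSize=512m
   spark.executor.extraJavaOptions  -XX:MaxPermSize=512m

 or per job on the command line:

   spark-submit --driver-java-options "-XX:MaxPermSize=512m" \
     --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=512m ...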

 On Thu, May 14, 2015 at 9:24 AM, Lior Chaga lio...@taboola.com wrote:

  After profiling with YourKit, I see there's an OutOfMemoryError in
  SQLContext.applySchema. Again, it's a very small RDD. Each executor
  has 180GB of RAM.
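
  For context, roughly the pattern in question; the names and schema here are
  invented for illustration, sc is an existing SparkContext, and in 1.3
  applySchema is a deprecated alias for createDataFrame:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val sqlContext = new HiveContext(sc)
    // Tiny RDD of rows, standing in for the real ~60K-row input
    val rows = sc.parallelize(Seq(Row("a"), Row("b")))
    val schema = StructType(Seq(StructField("value", StringType, nullable = true)))
    // applySchema is where the OutOfMemoryError surfaced
    val df = sqlContext.applySchema(rows, schema)
    df.registerTempTable("small_table")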

 On Thu, May 14, 2015 at 8:53 AM, Lior Chaga lio...@taboola.com wrote:

 Hi,

  Using Spark SQL with HiveContext. Spark version is 1.3.1.
  When running Spark locally everything works fine. When running on a Spark
  cluster I get a ClassNotFoundException for
  org.apache.hadoop.hive.shims.Hadoop23Shims.
  This class belongs to hive-shims-0.23, which is a runtime dependency of
  spark-hive:

  [INFO] org.apache.spark:spark-hive_2.10:jar:1.3.1
  [INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile
  [INFO] |  +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile
  [INFO] |  |  +- org.spark-project.hive.shims:hive-shims-common:jar:0.13.1a:compile
  [INFO] |  |  +- org.spark-project.hive.shims:hive-shims-0.20:jar:0.13.1a:runtime
  [INFO] |  |  +- org.spark-project.hive.shims:hive-shims-common-secure:jar:0.13.1a:compile
  [INFO] |  |  +- org.spark-project.hive.shims:hive-shims-0.20S:jar:0.13.1a:runtime
  [INFO] |  |  \- org.spark-project.hive.shims:hive-shims-0.23:jar:0.13.1a:runtime
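
  (If the jar simply isn't reaching the executors, one possible workaround,
  assuming you have the artifact locally, would be to ship it explicitly; the
  path below is hypothetical:

    spark-submit --jars /path/to/hive-shims-0.23-0.13.1a.jar ...
  )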



  My Spark distribution is built with:
  make-distribution.sh --tgz -Phive -Phive-thriftserver -DskipTests


  If I try to add this dependency to my driver project, the exception
  disappears, but then the task gets stuck when registering an RDD as a table
  (I get a timeout after 30 seconds). I should emphasize that the first RDD I
  register as a table is a very small one (about 60K rows), and as I said, it
  runs swiftly in local mode.
  I suspect other dependencies may also be missing, but failing silently.
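
  For reference, the dependency in question, with coordinates taken from the
  tree above (the runtime scope is an assumption):

    <dependency>
      <groupId>org.spark-project.hive.shims</groupId>
      <artifactId>hive-shims-0.23</artifactId>
      <version>0.13.1a</version>
      <scope>runtime</scope>
    </dependency>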

 Would be grateful if anyone knows how to solve it.

 Lior





