Re: Does Spark support the Hive posexplode function?

2015-07-12 Thread David Sabater Dinter
It seems this feature was added in Hive 0.13.
https://issues.apache.org/jira/browse/HIVE-4943

I would assume this is supported, as Spark is by default compiled against Hive
0.13.1.
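
A quick way to double-check on your build (a rough, untested sketch, assuming a
SparkContext named sc, Spark built with the Hive profile, and Jeff's
hypothetical contact table):

  import org.apache.spark.sql.hive.HiveContext

  val sqlContext = new HiveContext(sc)

  // List the functions known to the Hive support and look for posexplode
  sqlContext.sql("SHOW FUNCTIONS").collect().foreach(println)

  // Or try the function directly against the hypothetical contact table
  sqlContext.sql(
    """SELECT name, pos, phone FROM contact
      |LATERAL VIEW posexplode(phoneList.phoneNumber) phoneTable AS pos, phone""".stripMargin
  ).show()

If it turns out not to be registered, one thing that might be worth trying
(again untested, and assuming I have the class name right) is registering the
Hive class by hand with CREATE TEMPORARY FUNCTION posexplode AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFPosExplode'.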

On Sun, Jul 12, 2015 at 7:42 PM, Ruslan Dautkhanov dautkha...@gmail.com
wrote:

 You can see what Spark SQL functions are supported in Spark by doing the
 following in a notebook:
 %sql show functions


 https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html

 I think Spark SQL support is currently around Hive ~0.11?



 --
 Ruslan Dautkhanov

 On Tue, Jul 7, 2015 at 3:10 PM, Jeff J Li l...@us.ibm.com wrote:

 I am trying to use the posexplode function in the HiveContext to
 auto-generate a sequence number. This feature is supposed to be available in
 Hive 0.13.0.

 SELECT name, phone FROM contact LATERAL VIEW
 posexplode(phoneList.phoneNumber) phoneTable AS pos, phone

 My test program failed with the following exception:

 java.lang.ClassNotFoundException: posexplode
 at java.net.URLClassLoader.findClass(URLClassLoader.java:665)
 at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:942)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:851)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:827)
 at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:147)
 at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274)
 at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274)

 Does Spark support the Hive posexplode function? If not, how can I patch it
 to add support? I am on Spark 1.3.1.

 Thanks,
 Jeff Li







Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread David Sabater Dinter
As Sean suggested, you can actually build Spark 1.4 for CDH 5.4.x and also
include the Hive 0.13.1 libraries, but *this will be completely unsupported
by Cloudera*.
I would suggest doing that only if you just want to experiment with new
features from Spark 1.4, e.g. running Spark SQL with sort-merge join for
expensive BI types of queries on a Hadoop YARN cluster.
You just need to keep the build in your user folder across the cluster, then
include the path to the binaries (spark-shell, spark-submit, pyspark, etc.) in
your user profile and make sure you include the CDH classpath to load all
dependencies.
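
For reference, the rough steps I have in mind (a sketch only, assuming Spark
1.4.0 sources and CDH 5.4.0; adjust versions, names and paths to your cluster,
and note the profiles/flags are from memory):

  # Build a Spark 1.4 distribution against the CDH Hadoop version, with Hive support
  ./make-distribution.sh --name hadoop2.6-cdh5.4 --tgz \
    -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.0 \
    -Phive -Phive-thriftserver

  # Unpack the tarball into your home directory, then in your user profile
  # (e.g. ~/.bash_profile) point at it and at the CDH client configs:
  export SPARK_HOME=$HOME/spark-1.4.0-bin-hadoop2.6-cdh5.4
  export PATH=$SPARK_HOME/bin:$PATH
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  export YARN_CONF_DIR=/etc/hadoop/conf

That keeps the stock CDH Spark installation untouched, so only jobs launched
through your private build are unsupported.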

On Sun, Jul 12, 2015 at 8:19 PM, Sean Owen so...@cloudera.com wrote:

 Yeah, it won't technically be supported, and you shouldn't go
 modifying the actual installation, but if you just make your own build
 of 1.4 for CDH 5.4 and use that build to launch YARN-based apps, I
 imagine it will Just Work for most any use case.

 On Sun, Jul 12, 2015 at 7:34 PM, Ruslan Dautkhanov dautkha...@gmail.com
 wrote:
  Good question. I'd like to know the same, although I think you'll lose
  supportability.
 
 
 
  --
  Ruslan Dautkhanov
 
  On Wed, Jul 8, 2015 at 2:03 AM, Ashish Dutt ashish.du...@gmail.com
 wrote:
 
 
  Hi,
  I need to upgrade Spark from version 1.3 to version 1.4 on CDH 5.4.
  I checked the documentation here but I do not see anything relevant.
 
  Any suggestions directing to a solution are welcome.
 
  Thanks,
  Ashish
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




SparkSQL cache table with multiple replicas

2015-07-03 Thread David Sabater Dinter
Hi all,
Do you know if there is an option to specify how many replicas we want when
caching a table in memory in the SparkSQL Thrift server? I have not seen any
such option so far, but I assumed one exists, as the Storage section of the UI
shows there is 1 x replica of your DataFrame/table...

I believe there is a good use case where you want to replicate a dimension
table across your nodes to improve response times when running typical BI/DWH
types of queries (just to avoid having to broadcast the data time and again).
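
In the meantime, the closest programmatic equivalent I can think of is a
replicated storage level on the DataFrame itself (an untested sketch, assuming
a HiveContext named sqlContext and a hypothetical dim_customer table; I have
not found a way to pass a storage level through the Thrift server's CACHE
TABLE):

  import org.apache.spark.storage.StorageLevel

  // Cache the dimension table with two in-memory copies of each partition
  val dim = sqlContext.table("dim_customer")
  dim.persist(StorageLevel.MEMORY_ONLY_2)
  dim.count()  // force materialization so the replicas show up in the Storage tab
  dim.registerTempTable("dim_customer_cached")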

Do you think that would be a good addition to SparkSQL?



Regards.


Issues when saving dataframe in Spark 1.4 with parquet format

2015-07-01 Thread David Sabater Dinter
Hi chaps,
It seems there is an issue while saving dataframes in Spark 1.4.

The default file extension inside the Hive warehouse folder is now
part-r-X.gz.parquet, but queries run from the SparkSQL Thrift server are still
looking for part-r-X.parquet.

Is there any config parameter we can use as a workaround? Is there a JIRA open
for this? Am I missing anything when upgrading from Spark 1.3 to 1.4?

The only similar reference I have seen is this:
http://mail-archives.apache.org/mod_mbox/spark-user/201506.mbox/%3ccahp0wa+japfvj+pc2mzwomzb+mmdozfbr-xaxdbkoppe68t...@mail.gmail.com%3E
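
For what it's worth, the only related knob I know of is the Parquet compression
codec (a sketch only, assuming the mismatch really comes down to the .gz
suffix; I have not verified that this fixes the Thrift server lookup, and df /
my_table below are placeholders):

  // Write with a different (or no) compression codec so the files come out as
  // part-r-X.parquet / part-r-X.snappy.parquet instead of part-r-X.gz.parquet
  sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
  df.write.format("parquet").saveAsTable("my_table")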



Thanks.