[Spark SQL] dependencies to use test helpers

2019-07-24 Thread James Pirz
I have a Scala application in which I have added some extra rules to
Catalyst.
While adding some unit tests, I am trying to use some existing functions
from Catalyst's test code, specifically comparePlans() and normalizePlan()
under PlanTestBase [1].

I am just wondering which additional dependencies I need to add to my
project to access them. Currently, I have the dependencies below, but they
do not cover the above APIs.

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.4.3"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.3"
libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.4.3"


[1] 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala
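
A possible route (an untested sketch - it assumes the "tests"-classifier
artifacts that Spark publishes alongside its regular jars are what carry
PlanTestBase, and that the spark-core test classes it builds on are needed
as well; a matching scalatest dependency may also be required):

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.4.3" % Test classifier "tests"
libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.4.3" % Test classifier "tests"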

Thanks,
James


Re: Setting executors per worker - Standalone

2015-09-29 Thread James Pirz
Thanks for your help.
You were correct about the memory settings. Previously I had following
config:

--executor-memory 8g --conf spark.executor.cores=1

Which was really conflicting, as in spark-env.sh I had:

export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8192m

So the memory budget per worker was not enough to launch several executors.
By switching to:

--executor-memory 2g --conf spark.executor.cores=1

Now I can see that on each machine I have one worker, with 4 executors.
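
The same sizing expressed programmatically, for reference - a rough sketch
only, assuming a standalone cluster with the spark-env.sh values above (4
cores and 8192m per worker) and a made-up master URL; the point is simply
that executor memory times the desired executor count has to fit inside
SPARK_WORKER_MEMORY:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("four-executors-per-worker")
  .setMaster("spark://my-spark-master:7077")  // hypothetical master URL
  .set("spark.executor.cores", "1")           // 1 of the worker's 4 cores per executor
  .set("spark.executor.memory", "2g")         // 4 x 2g fits within the 8192m worker budget
val sc = new SparkContext(conf)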

Thanks again for your help.


On Tue, Sep 29, 2015 at 1:30 AM, Robin East <robin.e...@xense.co.uk> wrote:

> I’m currently testing this exact setup - it works for me using both --conf
> spark.executor.cores=1 and --executor-cores 1. Do you have some memory
> settings that need to be adjusted as well? Or do you accidentally have
> --total-executor-cores set as well? You should be able to tell from looking
> at the environment tab on the Application UI
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 29 Sep 2015, at 04:47, James Pirz <james.p...@gmail.com> wrote:
>
> Thanks for your reply.
>
> Setting it as
>
> --conf spark.executor.cores=1
>
> when I start spark-shell (as an example application) indeed sets the
> number of cores per executor to 1 (which was 4 before), but I still have 1
> executor per worker. What I am really looking for is having 1 worker with 4
> executors (each with one core) per machine when I run my application. Based
> on the documentation it seems feasible, but it is not clear how.
>
> Thanks.
>
> On Mon, Sep 28, 2015 at 8:46 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> use "--executor-cores 1" and you will get 4 executors per worker, since you
>> have 4 cores per worker
>>
>>
>>
>> On Tue, Sep 29, 2015 at 8:24 AM, James Pirz <james.p...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes, where
>>> each machine has 12GB of RAM and 4 cores. On each machine I have one worker
>>> which is running one executor that grabs all 4 cores. I am interested in
>>> checking the performance with "one worker but 4 executors per machine - each
>>> with one core".
>>>
>>> I can see that "running multiple executors per worker in Standalone
>>> mode" is possible based on the closed issue:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-1706
>>>
>>> But I cannot find a way to do that. "SPARK_EXECUTOR_INSTANCES" is only
>>> available for YARN mode, and in standalone mode I can only set
>>> "SPARK_WORKER_INSTANCES", "SPARK_WORKER_CORES", and "SPARK_WORKER_MEMORY".
>>>
>>> Any hint or suggestion would be great.
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>


Re: Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
Thanks for your reply.

Setting it as

--conf spark.executor.cores=1

when I start spark-shell (as an example application) indeed sets the number
of cores per executor to 1 (which was 4 before), but I still have 1 executor
per worker. What I am really looking for is having 1 worker with 4 executors
(each with one core) per machine when I run my application. Based on the
documentation it seems feasible, but it is not clear how.

Thanks.

On Mon, Sep 28, 2015 at 8:46 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> use "--executor-cores 1" and you will get 4 executors per worker, since you
> have 4 cores per worker
>
>
>
> On Tue, Sep 29, 2015 at 8:24 AM, James Pirz <james.p...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes, where
>> each machine has 12GB of RAM and 4 cores. On each machine I have one worker
>> which is running one executor that grabs all 4 cores. I am interested in
>> checking the performance with "one worker but 4 executors per machine - each
>> with one core".
>>
>> I can see that "running multiple executors per worker in Standalone mode"
>> is possible based on the closed issue:
>>
>> https://issues.apache.org/jira/browse/SPARK-1706
>>
>> But I cannot find a way to do that. "SPARK_EXECUTOR_INSTANCES" is only
>> available for YARN mode, and in standalone mode I can only set
>> "SPARK_WORKER_INSTANCES", "SPARK_WORKER_CORES", and "SPARK_WORKER_MEMORY".
>>
>> Any hint or suggestion would be great.
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
Hi,

I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes, where
each machine has 12GB of RAM and 4 cores. On each machine I have one worker
which is running one executor that grabs all 4 cores. I am interested in
checking the performance with "one worker but 4 executors per machine - each
with one core".

I can see that "running multiple executors per worker in Standalone mode"
is possible based on the closed issue:

https://issues.apache.org/jira/browse/SPARK-1706

But I cannot find a way to do that. "SPARK_EXECUTOR_INSTANCES" is only
available for YARN mode, and in standalone mode I can only set
"SPARK_WORKER_INSTANCES", "SPARK_WORKER_CORES", and "SPARK_WORKER_MEMORY".

Any hint or suggestion would be great.


Repartitioning external table in Spark sql

2015-08-18 Thread James Pirz
I am using Spark 1.4.1, in standalone mode, on a cluster of 3 nodes.

Using Spark SQL and HiveContext, I am trying to run a simple scan query on
an existing Hive table (which is an external table consisting of rows in
text files stored in HDFS - it is NOT Parquet, ORC, or any other richer
format).

DataFrame res = hiveCtx.sql("SELECT * FROM lineitem WHERE L_LINENUMBER > 0");

What I observe is that the performance of this full scan in Spark is not
comparable with Hive's (it is almost 4 times slower). Checking the resource
usage, I see that the workers/executors do not scan in parallel but on a
per-node basis: first the executors from the worker(s) on node 1 read from
disk, while the other two nodes do no I/O and just receive data from the
first node over the network; then the 2nd node does its scan, and then the
third one.
I also realized that if I load this data file directly from my Spark
context (using textFile()) and run count() on it (not using Spark SQL),
then I can get better performance by increasing the number of partitions. I
am just trying to do the same thing (increasing the number of partitions at
the beginning) in Spark SQL as:

var tab = sqlContext.read.table("lineitem")
tab.repartition(1000)
OR
tab.coalesce(1000)

but neither repartition() nor coalesce() actually seems to work - they do not
return an error, but if I check

var p = tab.rdd.partitions.size;

before and after calling either of them, it returns the same number of
partitions.

I am just wondering how I can change the number of partitions for a Hive
external table in Spark SQL.

Any help/suggestion would be appreciated.
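
For what it's worth, DataFrames are immutable, so repartition() and
coalesce() do not modify tab in place - they return a new DataFrame that has
to be assigned and used. A minimal sketch (assuming the same lineitem table):

val tab = sqlContext.read.table("lineitem")
val repartitioned = tab.repartition(1000)   // returns a new DataFrame

println(tab.rdd.partitions.size)            // unchanged
println(repartitioned.rdd.partitions.size)  // 1000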


Re: worker and executor memory

2015-08-14 Thread James Pirz
Additional Comment:
I checked the disk usage on the 3 nodes (using iostat) and it seems that
reading from the HDFS partitions happens on a node-by-node basis. Only one of
the nodes shows active IO (as reads) at any given time, while the other two
nodes are idle IO-wise. I am not sure why the tasks are scheduled that way,
as it is a map-only job and reading can happen in parallel.

On Thu, Aug 13, 2015 at 9:10 PM, James Pirz james.p...@gmail.com wrote:

 Hi,

 I am using Spark 1.4 on a cluster (standalone mode), across 3 machines,
 for a workload similar to TPC-H (analytical queries with multiple/multi-way
 large joins and aggregations). Each machine has 12GB of memory and 4 cores.
 My total data size is 150GB, stored in HDFS (as Hive tables), and I
 am running my queries through Spark SQL using HiveContext.
 After checking the performance tuning documents on the Spark page and some
 clips from the latest Spark Summit, I decided to set the following configs
 in my spark-env:

 SPARK_WORKER_INSTANCES=4
 SPARK_WORKER_CORES=1
 SPARK_WORKER_MEMORY=2500M

 (As my tasks tend to be long, the overhead of starting multiple JVMs, one
 per worker, is small relative to the total query times.) As I monitored the
 job progress, I realized that while the worker memory is 2.5GB, the
 executors (one per worker) have a max memory of 512MB (the default). I
 enlarged this value in my application as:

 conf.set("spark.executor.memory", "2.5g");

 I was trying to give the maximum available memory on each worker to its
 only executor, but I observed that my queries run slower than in the
 previous case (the default 512MB). Changing 2.5g to 1g improved the running
 time; it is close to, but still worse than, the 512MB case. I guess what I
 am missing here is the relationship between WORKER_MEMORY and
 'executor.memory'.

 - Isn't it the case that the WORKER tries to split this memory among its
 executors (in my case, its only executor)? Or is there other work done by
 the worker that needs memory?

 - What other important parameters do I need to look into and tune at this
 point to get the best response time out of my hardware? (I have read about
 the Kryo serializer and I am about to try it - I am mainly concerned about
 memory-related settings and knobs related to the parallelism of my jobs.)
 As an example, for a simple scan-only query, Spark is worse than Hive
 (almost 3 times slower) while both scan the exact same table and file
 format. That is why I believe I am missing some params by leaving them at
 their defaults.

 Any hint/suggestion would be highly appreciated.





worker and executor memory

2015-08-13 Thread James Pirz
Hi,

I am using Spark 1.4 on a cluster (standalone mode), across 3 machines,
for a workload similar to TPC-H (analytical queries with multiple/multi-way
large joins and aggregations). Each machine has 12GB of memory and 4 cores.
My total data size is 150GB, stored in HDFS (as Hive tables), and I
am running my queries through Spark SQL using HiveContext.
After checking the performance tuning documents on the Spark page and some
clips from the latest Spark Summit, I decided to set the following configs in
my spark-env:

SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M

(As my tasks tend to be long, the overhead of starting multiple JVMs, one
per worker, is small relative to the total query times.) As I monitored the
job progress, I realized that while the worker memory is 2.5GB, the executors
(one per worker) have a max memory of 512MB (the default). I enlarged
this value in my application as:

conf.set("spark.executor.memory", "2.5g");

I was trying to give the maximum available memory on each worker to its only
executor, but I observed that my queries run slower than in the previous
case (the default 512MB). Changing 2.5g to 1g improved the running time; it
is close to, but still worse than, the 512MB case. I guess what I am missing
here is the relationship between WORKER_MEMORY and 'executor.memory'.

- Isn't it the case that the WORKER tries to split this memory among its
executors (in my case, its only executor)? Or is there other work done by
the worker that needs memory?

- What other important parameters do I need to look into and tune at this
point to get the best response time out of my hardware? (I have read about
the Kryo serializer and I am about to try it - I am mainly concerned about
memory-related settings and knobs related to the parallelism of my jobs.)
As an example, for a simple scan-only query, Spark is worse than Hive
(almost 3 times slower) while both scan the exact same table and file
format. That is why I believe I am missing some params by leaving them at
their defaults.

Any hint/suggestion would be highly appreciated.
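
For reference, a sketch of how the two settings relate, under the spark-env
values above (4 workers per node, SPARK_WORKER_MEMORY=2500M each, 12GB of
physical RAM per node): SPARK_WORKER_MEMORY is the total a worker may hand
out to its executors, while spark.executor.memory is the heap of each
executor JVM, which has to fit inside that budget with some headroom left
for JVM overhead and for the OS/HDFS daemons on the node.

import org.apache.spark.SparkConf

// Hypothetical sizing: keep the executor heap below the 2500M worker budget
// so that 4 executor JVMs (heap + overhead) do not over-commit the node.
val conf = new SparkConf()
  .setAppName("tpch-like-queries")
  .set("spark.executor.memory", "2g")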


Re: spark-submit does not use hive-site.xml

2015-06-10 Thread James Pirz
Thanks for your help!
Switching to HiveContext fixed the issue.

Just one side comment:
In the documentation regarding Hive Tables and HiveContext
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables,
we see:

// sc is an existing JavaSparkContext.
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);


But this is incorrect, as the constructor of HiveContext does not accept a
JavaSparkContext but a SparkContext (the comment is basically misleading).
The correct code snippet should be:

HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());


Thanks again for your help.




On Wed, Jun 10, 2015 at 1:17 AM, Cheng Lian lian.cs@gmail.com wrote:

  Hm, this is a common confusion... Although the variable name is
 `sqlContext` in the Spark shell, it's actually a `HiveContext`, which extends
 `SQLContext` and has the ability to communicate with the Hive metastore.

 So your program needs to instantiate an
 `org.apache.spark.sql.hive.HiveContext` instead.

 Cheng


 On 6/10/15 10:19 AM, James Pirz wrote:

 I am using Spark (standalone) to run queries (from a remote client)
 against data in tables that are already defined/loaded in Hive.

 I have started the metastore service in Hive successfully, and by putting
 hive-site.xml, with the proper metastore.uri, in the $SPARK_HOME/conf
 directory, I tried to share its config with Spark.

  When I start spark-shell, it gives me a default sqlContext, and I can
 use that to access my Hive tables with no problem.

  But once I submit a similar query via a Spark application through
 'spark-submit', it does not see the tables, and it seems it does not pick up
 hive-site.xml, which is under the conf directory in Spark's home. I tried to
 use the '--files' argument with spark-submit to pass hive-site.xml to the
 workers, but it did not change anything.

  Here is how I try to run the application:

  $SPARK_HOME/bin/spark-submit --class SimpleClient --master
 spark://my-spark-master:7077 --files=$SPARK_HOME/conf/hive-site.xml
  simple-sql-client-1.0.jar

  Here is the simple example code that I try to run (in Java):

 SparkConf conf = new SparkConf().setAppName("Simple SQL Client");

 JavaSparkContext sc = new JavaSparkContext(conf);

 SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

 DataFrame res = sqlContext.sql("show tables");

 res.show();


  Here are the SW versions:
 Spark: 1.3
 Hive: 1.2
 Hadoop: 2.6

  Thanks in advance for any suggestion.





Re: Running SparkSql against Hive tables

2015-06-09 Thread James Pirz
Thanks Ayan, I used beeline in Spark to connect to the Hiveserver2 that I
started from Hive. So, as you said, it is really interacting with Hive as
a typical 3rd-party application, and it is NOT using the Spark execution
engine. I was thinking that it gets the metastore info from Hive, but uses
Spark to execute the query.

I have already created and loaded tables in Hive, and now I want to use Spark
to run SQL queries against those tables. I just want to submit SQL queries
in Spark, against the data in Hive, without writing an application (just
similar to the way that one would pass SQL scripts to Hive or Shark). Going
through the Spark documentation, I realized Spark SQL is the component that
I need to use. But do you mean I have to write a client Spark application
to do that? Is there any way that one could pass SQL scripts directly
through the command line, with Spark running them in distributed mode on the
cluster, against the already existing data in Hive?
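
For reference, a minimal sketch of the HiveContext route (assumptions: the
Spark 1.3-era API, hive-site.xml visible on the classpath, and a hypothetical
table name). The bundled bin/spark-sql CLI, which takes Hive-style -e/-f
options, is the usual way to run a query string or script file without
writing an application:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-tables-via-spark-sql"))
val hiveCtx = new HiveContext(sc)

// The table metadata comes from the Hive metastore, but the query itself is
// planned and executed by Spark's own engine and workers, not by Hadoop MR.
val res = hiveCtx.sql("SELECT count(*) FROM lineitem")
res.show()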

On Mon, Jun 8, 2015 at 5:53 PM, ayan guha guha.a...@gmail.com wrote:

 I am afraid you are going the other way around :) If you want to use Hive in
 Spark, you'd need a HiveContext with the Hive config files on the Spark
 cluster (every node). This way Spark can talk to the Hive metastore. Then
 you can write queries on a Hive table using hiveContext's sql method, and
 Spark will run it (either by reading from Hive and creating an RDD, or
 letting Hive run the query using MR). The final result will be a Spark
 DataFrame.

 What you are currently doing is using beeline to connect to Hive, which
 should work even without Spark.

 Best
 Ayan

 On Tue, Jun 9, 2015 at 10:42 AM, James Pirz james.p...@gmail.com wrote:

 Thanks for the help!
 I am actually trying to use Spark SQL to run queries against tables that
 I've defined in Hive.

 I follow these steps:
 - I start hiveserver2 and in Spark, I start Spark's Thrift server by:
 $SPARK_HOME/sbin/start-thriftserver.sh --master
 spark://spark-master-node-ip:7077

 - and I start beeline:
 $SPARK_HOME/bin/beeline

 - In my beeline session, I connect to my running hiveserver2
 !connect jdbc:hive2://hive-node-ip:1

 and I can run queries successfully. But based on the hiveserver2 logs, it
 seems it actually uses Hadoop's MR to run the queries, *not* Spark's
 workers. My goal is to access Hive's tables' data, but run queries through
 Spark SQL using Spark workers (not Hadoop).

 Is it possible to do that via Spark SQL (its CLI) or through its thrift
 server? (I tried to find some basic examples in the documentation, but I
 was not able to.) Any suggestion or hint on how I can do that would be
 highly appreciated.

 Thnx

 On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian lian.cs@gmail.com wrote:



 On 6/6/15 9:06 AM, James Pirz wrote:

 I am pretty new to Spark, and using Spark 1.3.1, I am trying to use
 'Spark SQL' to run some SQL scripts on the cluster. I realized that for
 better performance, it is a good idea to use Parquet files. I have 2
 questions regarding that:

  1) If I want to use Spark SQL against *partitioned & bucketed* tables
 with Parquet format in Hive, does the provided Spark binary on the Apache
 website support that, or do I need to build a new Spark binary with some
 additional flags? (I found a note
 https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
 in the documentation about enabling Hive support, but I could not fully
 understand what the correct way of building is, if I need to build.)

 Yes, Hive support is enabled by default now for the binaries on the
 website. However, currently Spark SQL doesn't support buckets yet.


  2) Does running Spark SQL against tables in Hive degrade performance, and
 is it better to load Parquet files directly into HDFS, or is having Hive in
 the picture harmless?

 If you're using Parquet, then it should be fine since by default Spark
 SQL uses its own native Parquet support to read Parquet Hive tables.


  Thnx






 --
 Best Regards,
 Ayan Guha



Re: Running SparkSql against Hive tables

2015-06-09 Thread James Pirz
I am trying to use Spark 1.3 (Standalone) against Hive 1.2 running on
Hadoop 2.6.
I looked at the ThriftServer2 logs and realized that the server was not
starting properly because of a failure to create a server socket. In fact,
I had passed the URI to my Hiveserver2 service, launched from Hive, and the
beeline in Spark was directly talking to Hive's hiveserver2 and it was just
using it as a Hive service.

I could fix starting the Thriftserver2 in Spark (by changing the port), but I
guess the missing puzzle piece for me is: how does Spark SQL re-use the
tables already created in Hive? I mean, do I have to write an application
that uses HiveContext to do that and submit it to Spark for execution, or
is there a way to run SQL scripts directly via the command line (in
distributed mode and on the cluster) - just similar to the way that one would
use the Hive (or Shark) command line by passing a query file with the -f
flag? Looking at the Spark SQL documentation, it seems that it is possible.
Please correct me if I am wrong.

On Mon, Jun 8, 2015 at 6:56 PM, Cheng Lian lian.cs@gmail.com wrote:


 On 6/9/15 8:42 AM, James Pirz wrote:

 Thanks for the help!
 I am actually trying to use Spark SQL to run queries against tables that
 I've defined in Hive.

  I follow these steps:
 - I start hiveserver2 and in Spark, I start Spark's Thrift server by:
 $SPARK_HOME/sbin/start-thriftserver.sh --master
 spark://spark-master-node-ip:7077

  - and I start beeline:
 $SPARK_HOME/bin/beeline

  - In my beeline session, I connect to my running hiveserver2
 !connect jdbc:hive2://hive-node-ip:1

  and I can run queries successfully. But based on the hiveserver2 logs, it
 seems it actually uses Hadoop's MR to run the queries, *not* Spark's
 workers. My goal is to access Hive's tables' data, but run queries through
 Spark SQL using Spark workers (not Hadoop).

 Hm, interesting. HiveThriftServer2 should never issue MR jobs to perform
 queries. I did receive two reports in the past which also say MR jobs
 instead of Spark jobs were issued to perform the SQL query. However, I only
 reproduced this issue in a rare corner case, which uses HTTP mode to
 connect to Hive 0.12.0. Apparently this isn't your case. Would you mind to
 provide more details so that I can dig in?  The following information would
 be very helpful:

 1. Hive version
 2. A copy of your hive-site.xml
 3. Hadoop version
 4. Full HiveThriftServer2 log (which can be found in $SPARK_HOME/logs)

 Thanks in advance!


  Is it possible to do that via Spark SQL (its CLI) or through its thrift
 server? (I tried to find some basic examples in the documentation, but I
 was not able to.) Any suggestion or hint on how I can do that would be
 highly appreciated.

  Thnx

 On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian lian.cs@gmail.com wrote:



 On 6/6/15 9:06 AM, James Pirz wrote:

 I am pretty new to Spark, and using Spark 1.3.1, I am trying to use
 'Spark SQL' to run some SQL scripts on the cluster. I realized that for
 better performance, it is a good idea to use Parquet files. I have 2
 questions regarding that:

  1) If I want to use Spark SQL against *partitioned & bucketed* tables
 with Parquet format in Hive, does the provided Spark binary on the Apache
 website support that, or do I need to build a new Spark binary with some
 additional flags? (I found a note
 https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
 in the documentation about enabling Hive support, but I could not fully
 understand what the correct way of building is, if I need to build.)

  Yes, Hive support is enabled by default now for the binaries on the
 website. However, currently Spark SQL doesn't support buckets yet.


  2) Does running Spark SQL against tables in Hive degrade performance, and
 is it better to load Parquet files directly into HDFS, or is having Hive in
 the picture harmless?

  If you're using Parquet, then it should be fine since by default Spark
 SQL uses its own native Parquet support to read Parquet Hive tables.


  Thnx







spark-submit does not use hive-site.xml

2015-06-09 Thread James Pirz
I am using Spark (standalone) to run queries (from a remote client) against
data in tables that are already defined/loaded in Hive.

I have started the metastore service in Hive successfully, and by putting
hive-site.xml, with the proper metastore.uri, in the $SPARK_HOME/conf
directory, I tried to share its config with Spark.

When I start spark-shell, it gives me a default sqlContext, and I can use
that to access my Hive's tables with no problem.

But once I submit a similar query via a Spark application through
'spark-submit', it does not see the tables, and it seems it does not pick up
hive-site.xml, which is under the conf directory in Spark's home. I tried to
use the '--files' argument with spark-submit to pass hive-site.xml to the
workers, but it did not change anything.

Here is how I try to run the application:

$SPARK_HOME/bin/spark-submit --class SimpleClient --master
spark://my-spark-master:7077 --files=$SPARK_HOME/conf/hive-site.xml
 simple-sql-client-1.0.jar

Here is the simple example code that I try to run (in Java):

SparkConf conf = new SparkConf().setAppName("Simple SQL Client");

JavaSparkContext sc = new JavaSparkContext(conf);

SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

DataFrame res = sqlContext.sql("show tables");

res.show();


Here are the SW versions:
Spark: 1.3
Hive: 1.2
Hadoop: 2.6

Thanks in advance for any suggestion.


Re: Running SparkSql against Hive tables

2015-06-08 Thread James Pirz
Thanks for the help!
I am actually trying to use Spark SQL to run queries against tables that I've
defined in Hive.

I follow these steps:
- I start hiveserver2 and in Spark, I start Spark's Thrift server by:
$SPARK_HOME/sbin/start-thriftserver.sh --master
spark://spark-master-node-ip:7077

- and I start beeline:
$SPARK_HOME/bin/beeline

- In my beeline session, I connect to my running hiveserver2
!connect jdbc:hive2://hive-node-ip:1

and I can run queries successfully. But based on the hiveserver2 logs, it
seems it actually uses Hadoop's MR to run the queries, *not* Spark's workers.
My goal is to access Hive's tables' data, but run queries through Spark SQL
using Spark workers (not Hadoop).

Is it possible to do that via Spark SQL (its CLI) or through its thrift
server? (I tried to find some basic examples in the documentation, but I
was not able to.) Any suggestion or hint on how I can do that would be
highly appreciated.

Thnx

On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian lian.cs@gmail.com wrote:



 On 6/6/15 9:06 AM, James Pirz wrote:

 I am pretty new to Spark, and using Spark 1.3.1, I am trying to use 'Spark
 SQL' to run some SQL scripts on the cluster. I realized that for better
 performance, it is a good idea to use Parquet files. I have 2 questions
 regarding that:

  1) If I want to use Spark SQL against *partitioned & bucketed* tables
 with Parquet format in Hive, does the provided Spark binary on the Apache
 website support that, or do I need to build a new Spark binary with some
 additional flags? (I found a note
 https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
 in the documentation about enabling Hive support, but I could not fully
 understand what the correct way of building is, if I need to build.)

 Yes, Hive support is enabled by default now for the binaries on the
 website. However, currently Spark SQL doesn't support buckets yet.


  2) Does running Spark SQL against tables in Hive degrade performance, and
 is it better to load Parquet files directly into HDFS, or is having Hive in
 the picture harmless?

 If you're using Parquet, then it should be fine since by default Spark SQL
 uses its own native Parquet support to read Parquet Hive tables.


  Thnx
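
Regarding the native Parquet path mentioned above, a small sketch
(assumptions: the Spark 1.3-era API, hive-site.xml on the classpath, and a
hypothetical Parquet-backed Hive table name). The setting shown is the switch
for that native reader and is already true by default:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-hive-read"))
val hiveCtx = new HiveContext(sc)

// Keep the default: read Parquet-backed Hive tables with Spark SQL's native
// Parquet support instead of going through the Hive SerDe.
hiveCtx.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hiveCtx.sql("SELECT count(*) FROM lineitem_parquet").show()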