Hive 1.0 support in Spark

2015-05-19 Thread Kannan Rajah
Does Spark 1.3.1 support Hive 1.0? If not, which version of Spark will
start supporting Hive 1.0?

--
Kannan


Re: Spark permission denied error when invoking saveAsTextFile

2015-04-01 Thread Kannan Rajah
Ignore the question. There was a Hadoop setting that needed to be set to
get it working.


--
Kannan


Spark permission denied error when invoking saveAsTextFile

2015-04-01 Thread Kannan Rajah
I am running a simple word count job in standalone mode as a non-root user
(krajah) from spark-shell. The Spark master and worker services are running as
the root user.

The problem is that the _temporary directory under
/user/krajah/output2/_temporary/0 is being created with root ownership even
though the job runs as the non-root user. The higher-level directories are
created with the right ownership, though. A similar question was posted a long
time ago, but there is no answer:
http://mail-archives.apache.org/mod_mbox/mesos-user/201408.mbox/%3CCAAeYHL2M9J9xEotf_0zXmZXy2_x-oBHa=xxl2naft203o6u...@mail.gmail.com%3E


*Wrong ownership for child directory*
drwxr-xr-x   - root   root0 2015-04-01 11:20
/user/krajah/output2/_temporary/0/_temporary


*Right ownership for parent directories*
hadoop fs -ls -R /user/krajah/my_output
drwxr-xr-x   - krajah krajah  1 2015-04-01 11:46
/user/krajah/my_output/_temporary
drwxr-xr-x   - krajah krajah  3 2015-04-01 11:46
/user/krajah/my_output/_temporary/0
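
A quick way to confirm which user the tasks actually run as (a diagnostic
sketch, not part of the original report; it only relies on the standard Hadoop
UserGroupInformation API):

scala> import org.apache.hadoop.security.UserGroupInformation
scala> sc.parallelize(1 to 1, 1).map(_ => UserGroupInformation.getCurrentUser.getUserName).collect().foreach(println)

In standalone mode the executors are forked by the worker process and inherit
its OS user, so if the workers run as root this is expected to print root: the
task side then writes its _temporary directories as root, while the job commit
in the driver runs as the submitting user.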

*Job and Stacktrace*

scala> val file = sc.textFile("/user/krajah/junk.txt")
scala> val counts = file.flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)

scala> counts.saveAsTextFile("/user/krajah/count2")
java.io.IOException: Error: Permission denied
at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:926)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:345)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1079)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:944)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:853)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1199)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
at $iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC.<init>(<console>:24)
at $iwC.<init>(<console>:26)
at <init>(<console>:28)
at .<init>(<console>:32)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)


--
Kannan


Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
SparkConf.scala logs a warning saying SPARK_CLASSPATH is deprecated and we
should use spark.executor.extraClassPath instead. But the online
documentation states that spark.executor.extraClassPath is only meant for
backward compatibility.

https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior

Which one is right? I have a use case where I submit an HBase job from
spark-shell and run it on YARN. In this case, I need to somehow add the HBase
jars to the executor's classpath. If I add them to SPARK_CLASSPATH and export
it, it works fine. Alternatively, if I set spark.executor.extraClassPath in
spark-defaults.conf, it also works. But what I don't like about
spark-defaults.conf is that I have to hard-code the classpath instead of
relying on a script to generate it, whereas with SPARK_CLASSPATH I can set it
from a script in spark-env.sh.

Given that compute-classpath uses the SPARK_CLASSPATH variable, why is it
marked as deprecated?

--
Kannan


Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
Thanks Marcelo. Do you think it would be useful to have
spark.executor.extraClassPath pick up an environment variable that can be set
from spark-env.sh? Here is an example.

spark-env.sh
--
executor_extra_cp=$(get_hbase_jars_for_cp)
export executor_extra_cp

spark-defaults.conf
-
spark.executor.extraClassPath = ${executor_extra_cp}

This would let us add logic inside the get_hbase_jars_for_cp function to pick
the right version of the HBase jars; there could be multiple versions
installed on the node.
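
For an application that builds its own SparkContext (rather than spark-shell),
a rough sketch of the same idea is below. This is only an illustration, not an
existing Spark feature: the HBASE_EXTRA_CP variable and the fallback path are
hypothetical, and the jars must already exist at the same location on every
node, since spark.executor.extraClassPath does not copy anything.

import org.apache.spark.{SparkConf, SparkContext}

object HBaseJob {
  def main(args: Array[String]): Unit = {
    // Compute the classpath at runtime instead of hard-coding it in
    // spark-defaults.conf. HBASE_EXTRA_CP is a hypothetical variable that a
    // wrapper script (like get_hbase_jars_for_cp) could export.
    val hbaseCp = sys.env.getOrElse("HBASE_EXTRA_CP", "/opt/hbase/lib/*")
    val conf = new SparkConf()
      .setAppName("hbase-job")
      // Prepended to the executor classpath when executors are launched; the
      // driver classpath still has to be set at launch time (e.g. via
      // spark-submit --driver-class-path).
      .set("spark.executor.extraClassPath", hbaseCp)
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}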



--
Kannan

On Thu, Feb 26, 2015 at 6:08 PM, Marcelo Vanzin van...@cloudera.com wrote:

 spark.executor.extraClassPath doesn't localize anything. It just
 prepends those classpath entries to the usual classpath used to launch
 the executor. There's no copying of files or anything, so they're
 expected to exist on the nodes.

 It's basically exactly the same as SPARK_CLASSPATH, but broken down into
 two options (one for the executors and one for the driver).

 --
 Marcelo



Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
I have a usability concern with the current way of specifying --jars. Imagine
a use case like HBase, where a lot of jobs need it on their classpath; with
--jars this has to be specified every time. If we use
spark.executor.extraClassPath, we only need to set it once, but there is no
programmatic way to set the value, such as picking it up from an environment
variable or running a script that generates the classpath. You have to
hard-code the jars in spark-defaults.conf.

Also, I would like to know whether there is a localization overhead when we
use spark.executor.extraClassPath. Again, in the case of HBase, these jars
would typically be available on all nodes, so there is no need to localize
them from the node where the job was submitted. I am wondering whether the
SPARK_CLASSPATH approach skips localization; that would be an added benefit.
Please clarify.




--
Kannan

On Thu, Feb 26, 2015 at 4:15 PM, Marcelo Vanzin van...@cloudera.com wrote:

 SPARK_CLASSPATH is definitely deprecated, but my understanding is that
 spark.executor.extraClassPath is not, so maybe the documentation needs
 fixing.

 I'll let someone who might know otherwise comment, though.



 --
 Marcelo



Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-25 Thread Kannan Rajah
Cheng, we tried this setting and it still did not help. This was on Spark
1.2.0.


--
Kannan

On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian lian.cs@gmail.com wrote:

  (Move to user list.)

 Hi Kannan,

 You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this
 line of code
 https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68,
 which overrides spark.default.parallelism. Also, spark.sql.shuffle.partitions
 isn't used here since there's no shuffle involved (we only need to sort
 within a partition).

 The default value of mapred.map.tasks is 2
 https://hadoop.apache.org/docs/r1.0.4/mapred-default.html, so you may see
 that the Spark SQL result can be divided into two sorted parts from the
 middle.
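
 A rough sketch of trying this from spark-shell instead of editing
 hive-site.xml (assuming HiveContext.setConf also pushes the value into the
 Hive session, which is not verified here). Note also that order by, unlike
 sort by, guarantees a total ordering regardless of the number of map tasks,
 if that is the actual requirement.

 val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 hc.setConf("mapred.map.tasks", "1")   // the setting suggested above
 // sort by only orders rows within each partition
 hc.sql("select name, age from testSortBy sort by age").collect().foreach(println)
 // order by produces a single, globally ordered result
 hc.sql("select name, age from testSortBy order by age").collect().foreach(println)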

 Cheng

 On 2/19/15 10:33 AM, Kannan Rajah wrote:

 According to the Hive documentation, sort by is supposed to order the results
 within each reducer. So if we set a single reducer, the results should be
 fully sorted, right? But this is not happening. Any idea why? It looks like
 the settings I am using to restrict the number of reducers are not having an
 effect.

 *Tried the following:*

 Set spark.default.parallelism to 1

 Set spark.sql.shuffle.partitions to 1

 These were set in hive-site.xml and also inside spark shell.


 *Spark-SQL*

 create table if not exists testSortBy (key int, name string, age int);
 LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE
 testSortBy;
 select * from testSortBY;

 1    Aditya       28
 2    aash         25
 3    prashanth    27
 4    bharath      26
 5    terry        27
 6    nanda        26
 7    pradeep      27
 8    pratyay      26


 set spark.default.parallelism=1;

 set spark.sql.shuffle.partitions=1;

 select name,age from testSortBy sort by age;

 aash         25
 bharath      26
 prashanth    27
 Aditya       28
 nanda        26
 pratyay      26
 terry        27
 pradeep      27

 *HIVE*

 select name,age from testSortBy sort by age;

 aash         25
 bharath      26
 nanda        26
 pratyay      26
 prashanth    27
 terry        27
 pradeep      27
 Aditya       28


 --
 Kannan