Hive 1.0 support in Spark
Does Spark 1.3.1 support Hive 1.0? If not, which version of Spark will start supporting Hive 1.0? -- Kannan
Re: Spark permission denied error when invoking saveAsTextFile
Ignore the question. There was a Hadoop setting that needed to be set to get it working. -- Kannan

On Wed, Apr 1, 2015 at 1:37 PM, Kannan Rajah <kra...@maprtech.com> wrote:
> Running a simple word count job in standalone mode as a non-root user from
> spark-shell. The Spark master and worker services are running as the root
> user. The problem is that the _temporary directory under
> /user/krajah/output2/_temporary/0 is being created with root ownership even
> when the job is run as a non-root user. [...]
Spark permission denied error when invoking saveAsTextFile
Running a simple word count job in standalone mode as a non-root user from spark-shell. The Spark master and worker services are running as the root user. The problem is that the _temporary directory under /user/krajah/output2/_temporary/0 is being created with root ownership even when the job is run as a non-root user, krajah in this case. The higher-level directories are created with the right permissions, though. A similar question was posted a long time ago, but there is no answer: http://mail-archives.apache.org/mod_mbox/mesos-user/201408.mbox/%3CCAAeYHL2M9J9xEotf_0zXmZXy2_x-oBHa=xxl2naft203o6u...@mail.gmail.com%3E

*Wrong permission for child directory*

    drwxr-xr-x   - root   root     0 2015-04-01 11:20 /user/krajah/output2/_temporary/0/_temporary

*Right permission for parent directories*

    hadoop fs -ls -R /user/krajah/my_output
    drwxr-xr-x   - krajah krajah   1 2015-04-01 11:46 /user/krajah/my_output/_temporary
    drwxr-xr-x   - krajah krajah   3 2015-04-01 11:46 /user/krajah/my_output/_temporary/0

*Job and Stacktrace*

    scala> val file = sc.textFile("/user/krajah/junk.txt")
    scala> val counts = file.flatMap(line => line.split(" "))
    scala>                  .map(word => (word, 1))
    scala>                  .reduceByKey(_ + _)
    scala> counts.saveAsTextFile("/user/krajah/count2")

    java.io.IOException: Error: Permission denied
        at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:926)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:345)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
        at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
        at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1079)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:944)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:853)
        at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1199)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
        at $iwC$$iwC$$iwC.<init>(<console>:22)
        at $iwC$$iwC.<init>(<console>:24)
        at $iwC.<init>(<console>:26)
        at <init>(<console>:28)
        at .<init>(<console>:32)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

-- Kannan
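A quick way to confirm which OS user the tasks actually run as is a small check from spark-shell; in standalone mode the executors inherit the worker process's user, which is consistent with the _temporary directory being owned by root here. This is only an illustrative sketch, not a fix:

    // Diagnostic sketch (illustrative only): print the OS user each executor runs as.
    val taskUsers = sc.parallelize(1 to 4, 4)
      .map(_ => (java.net.InetAddress.getLocalHost.getHostName, System.getProperty("user.name")))
      .distinct()
      .collect()
    // If the standalone worker runs as root, this prints "root" even though
    // spark-shell itself is running as krajah.
    taskUsers.foreach { case (host, user) => println(s"$host -> tasks run as $user") }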
Is SPARK_CLASSPATH really deprecated?
SparkConf.scala logs a warning saying SPARK_CLASSPATH is deprecated and that we should use spark.executor.extraClassPath instead. But the online documentation states that spark.executor.extraClassPath is only meant for backward compatibility: https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior

Which one is right? I have a use case of submitting an HBase job from spark-shell and running it on YARN. In this case, I need to somehow add the HBase jars to the classpath of the executor. If I add them to SPARK_CLASSPATH and export it, it works fine. Alternatively, if I set spark.executor.extraClassPath in spark-defaults.conf, it also works fine. But the reason I don't like spark-defaults.conf is that I need to hard-code the path instead of relying on a script to generate the classpath; with SPARK_CLASSPATH, I can set it from a script in spark-env.sh. Given that compute-classpath.sh uses the SPARK_CLASSPATH variable, why is it marked as deprecated? -- Kannan
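For a job submitted as an application (rather than from spark-shell, where the SparkContext already exists), one way to avoid both the deprecated variable and hard-coding the path in spark-defaults.conf is to set the property on the SparkConf before creating the SparkContext; for spark-shell, the same property can be passed with --conf at launch. A minimal sketch, assuming the HBase jars live at an illustrative path that already exists on every node:

    // Sketch: set the executor classpath programmatically instead of using the
    // deprecated SPARK_CLASSPATH. The jar directory below is an assumption; the
    // entries are not copied anywhere, so they must already exist on every node.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("HBaseFromSpark")
      .set("spark.executor.extraClassPath", "/opt/hbase/lib/*")

    val sc = new SparkContext(conf)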
Re: Is SPARK_CLASSPATH really deprecated?
Thanks Marcelo. Do you think it would be useful to have spark.executor.extraClassPath pick up an environment variable that can be set from spark-env.sh? Here is an example:

spark-env.sh:
    executor_extra_cp=$(get_hbase_jars_for_cp)
    export executor_extra_cp

spark-defaults.conf:
    spark.executor.extraClassPath ${executor_extra_cp}

This would let us put the logic for picking the right version of the HBase jars inside the get_hbase_jars_for_cp function; there could be multiple versions installed on a node.

-- Kannan

On Thu, Feb 26, 2015 at 6:08 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> On Thu, Feb 26, 2015 at 5:12 PM, Kannan Rajah <kra...@maprtech.com> wrote:
>> Also, I would like to know if there is a localization overhead when we use
>> spark.executor.extraClassPath. [...]
>
> spark.executor.extraClassPath doesn't localize anything. It just prepends
> those classpath entries to the usual classpath used to launch the executor.
> There's no copying of files or anything, so they're expected to exist on the
> nodes. It's basically exactly the same as SPARK_CLASSPATH, but broken down
> into two options (one for the executors, and one for the driver).
>
> -- Marcelo
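As far as I know, spark-defaults.conf is not expanded against environment variables, so the ${executor_extra_cp} reference above would not be substituted as written. A workaround in the same spirit is to read the variable exported by spark-env.sh from the driver program itself before the SparkContext is created; the variable and helper names below are hypothetical:

    // Sketch: pick up a classpath computed by an external script and exported from
    // spark-env.sh (EXECUTOR_EXTRA_CP is a made-up name), then apply it to the conf.
    import org.apache.spark.{SparkConf, SparkContext}

    val extraCp = sys.env.getOrElse("EXECUTOR_EXTRA_CP", "")

    val conf = new SparkConf().setAppName("HBaseFromSpark")
    if (extraCp.nonEmpty) {
      conf.set("spark.executor.extraClassPath", extraCp)
    }

    val sc = new SparkContext(conf)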
Re: Is SPARK_CLASSPATH really deprecated?
There is a usability concern I have with the current way of specifying --jars. Imagine a use case like HBase, where a lot of jobs need it on their classpath; the option needs to be set every time. If we use spark.executor.extraClassPath, we only need to set it once. But there is no programmatic way to set that value, such as picking it up from an environment variable or running a script that generates the classpath. You need to hard-code the jars in spark-defaults.conf.

Also, I would like to know if there is a localization overhead when we use spark.executor.extraClassPath. Again, in the case of HBase, these jars would typically be available on all nodes, so there is no need to localize them from the node where the job was submitted. I am wondering whether the SPARK_CLASSPATH approach skips localization; that would be an added benefit. Please clarify.

-- Kannan

On Thu, Feb 26, 2015 at 4:15 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> SPARK_CLASSPATH is definitely deprecated, but my understanding is that
> spark.executor.extraClassPath is not, so maybe the documentation needs
> fixing. I'll let someone who might know otherwise comment, though.
>
> On Thu, Feb 26, 2015 at 2:43 PM, Kannan Rajah <kra...@maprtech.com> wrote:
>> SparkConf.scala logs a warning saying SPARK_CLASSPATH is deprecated and we
>> should use spark.executor.extraClassPath instead. But the online
>> documentation states that spark.executor.extraClassPath is only meant for
>> backward compatibility. [...]
>
> -- Marcelo
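On the localization question: as far as I understand, jars passed with --jars (or added via SparkContext.addJar) are shipped to the executors, which is where the localization cost comes from, while spark.executor.extraClassPath never copies anything and simply assumes the path exists on each node. A small illustrative sketch (the jar path is an assumption):

    // Jars added this way are served by the driver and fetched by each executor
    // before tasks run -- that is the localization overhead being asked about.
    sc.addJar("/opt/hbase/lib/hbase-client.jar")

    // By contrast, spark.executor.extraClassPath entries are never copied; they
    // are simply prepended to the executor's classpath and must already be
    // present on every node.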
Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive
Cheng, we tried this setting and it still did not help. This was on Spark 1.2.0. -- Kannan

On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian <lian.cs@gmail.com> wrote:
> (Moving to the user list.)
>
> Hi Kannan,
>
> You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this
> line of code, which overrides spark.default.parallelism:
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68
>
> Also, spark.sql.shuffle.partitions isn't used here since there's no shuffle
> involved (we only need to sort within a partition).
>
> The default value of mapred.map.tasks is 2
> (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), which is why
> the Spark SQL result can be divided into two sorted parts from the middle.
>
> Cheng
>
> On 2/19/15 10:33 AM, Kannan Rajah wrote:
>> According to the Hive documentation, SORT BY is supposed to order the
>> results for each reducer. So if we set a single reducer, the results
>> should be sorted, right? But this is not happening. Any idea why? It looks
>> like the settings I am using to restrict the number of reducers are not
>> taking effect.
>>
>> *Tried the following:*
>> Set spark.default.parallelism to 1
>> Set spark.sql.shuffle.partitions to 1
>> These were set in hive-site.xml and also inside the Spark shell.
>>
>> *Spark-SQL*
>> create table if not exists testSortBy (key int, name string, age int);
>> LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE testSortBy;
>> select * from testSortBy;
>> 1    Aditya      28
>> 2    aash        25
>> 3    prashanth   27
>> 4    bharath     26
>> 5    terry       27
>> 6    nanda       26
>> 7    pradeep     27
>> 8    pratyay     26
>>
>> set spark.default.parallelism=1;
>> set spark.sql.shuffle.partitions=1;
>> select name,age from testSortBy sort by age;
>> aash        25
>> bharath     26
>> prashanth   27
>> Aditya      28
>> nanda       26
>> pratyay     26
>> terry       27
>> pradeep     27
>>
>> *HIVE*
>> select name,age from testSortBy sort by age;
>> aash        25
>> bharath     26
>> nanda       26
>> pratyay     26
>> prashanth   27
>> terry       27
>> pradeep     27
>> Aditya      28
>>
>> -- Kannan
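For completeness, here is a hedged sketch of the workaround discussed above, using the table and column names from the example (as noted, it did not resolve the issue on Spark 1.2.0). The underlying semantics: SORT BY orders rows only within each partition, so forcing a single map task should yield one fully sorted partition, whereas ORDER BY guarantees a total order regardless of the partition count:

    // Sketch of the suggested workaround, Spark 1.2.x-style API.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("SortByCheck"))
    val hc = new HiveContext(sc)

    // Force a single map task so that SORT BY produces a single sorted partition.
    hc.sql("SET mapred.map.tasks=1")

    // SORT BY only sorts within each partition ...
    hc.sql("SELECT name, age FROM testSortBy SORT BY age").collect().foreach(println)

    // ... ORDER BY is the query-level way to get a totally ordered result.
    hc.sql("SELECT name, age FROM testSortBy ORDER BY age").collect().foreach(println)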