Read HDFS file from an executor (closure)
Hi, Is there a way to read a text file from inside a Spark executor? I need to do this for a streaming application where we need to read a file (whose contents change over time) from inside a closure. I cannot use the "sc.textFile" method since the SparkContext is not serializable. I also cannot read the file using the Hadoop API directly, since the "FileSystem" class is not serializable either. Does anyone have any idea on how I can go about this? Thanks, Udit
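A sketch of one commonly suggested workaround (the path, stream name and filter logic are made up for illustration): construct the Hadoop FileSystem inside the closure, on the executor, so that nothing non-serializable is captured from the driver.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.io.Source

// assuming a DStream[String] of records
dstream.transform { rdd =>
  rdd.mapPartitions { iter =>
    // Built per partition on the worker; no driver-side objects in the closure.
    val path = new Path("hdfs:///path/to/lookup.txt")
    val fs = path.getFileSystem(new Configuration())
    val in = fs.open(path)
    val lookup = Source.fromInputStream(in).getLines().toSet
    in.close()
    iter.filter(lookup.contains)
  }
}

Since this re-reads the file once per partition per batch, caching the contents in a lazily initialized singleton on the executor (or periodically re-broadcasting from the driver) may be worth considering if the file is large.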
Kafka Direct Stream
Hi, I am using the Spark direct stream to consume from multiple topics in Kafka. I am able to consume fine, but I am stuck on how to separate the data for each topic, since I need to process the data differently depending on the topic. I basically want to split the RDD consisting of N topics into N RDDs, each having 1 topic. Any help would be appreciated. Thanks in advance, Udit
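A rough sketch of one way to do this (topic names are placeholders, untested here): with the direct stream, each RDD partition maps one-to-one to a Kafka (topic, partition), so the topic can be recovered from the offset ranges and used to tag and filter records.

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // offsetRanges(i) describes the Kafka partition backing RDD partition i.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val tagged = rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = offsetRanges(i).topic
    iter.map { case (_, value) => (topic, value) }
  }
  // One slice per topic, processed differently.
  val topicA = tagged.filter(_._1 == "topicA").map(_._2)
  val topicB = tagged.filter(_._1 == "topicB").map(_._2)
  // ... process topicA and topicB with their own logic ...
}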
Provide sampling ratio while loading json in spark version > 1.4.0
Hi, In earlier versions of Spark (< 1.4.0) we were able to specify a sampling ratio when using sqlContext.jsonFile or sqlContext.jsonRDD, so that we don't inspect each and every element while inferring the schema. I see that the use of these methods is deprecated in the newer Spark version and the suggested way is to use read().json() to load a json file and return a DataFrame. Is there a way to specify the sampling ratio using these methods? Or am I doing something incorrect? Thanks, Udit
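As far as I know, the JSON data source still honours a samplingRatio option through the reader API, so something like the sketch below should be equivalent (the path and mySchema are placeholders; passing an explicit schema avoids inference altogether):

// Sample roughly 10% of the input when inferring the schema.
val df = sqlContext.read
  .option("samplingRatio", "0.1")
  .json("hdfs:///path/to/data.json")

// Or skip inference entirely by supplying a schema up front:
// val df2 = sqlContext.read.schema(mySchema).json("hdfs:///path/to/data.json")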
Spark thrift server on yarn
Hi, I am trying to start a Spark thrift server using the following command on Spark 1.3.1 running on yarn:

./sbin/start-thriftserver.sh --master yarn://resourcemanager.snc1:8032 --executor-memory 512m --hiveconf hive.server2.thrift.bind.host=test-host.sn1 --hiveconf hive.server2.thrift.port=10001 --queue public

It starts up fine and is able to connect to the hive metastore. I now need to view some temporary tables using this thrift server, so I start up Spark SQL and register a temp table. But the problem is that I am unable to view the temp table using the beeline client. I am pretty sure I am going wrong somewhere; the Spark documentation does not clearly say how to run the thrift server in yarn mode, or maybe I missed something. Could someone tell me how this is to be done or point me to some documentation? Thanks in advance, Udit
Re: Spark thrift server on yarn
I registered it in a new Spark SQL CLI. Yeah, I thought so too about how the temp tables were accessible across different applications without using a job-server. I see that running HiveThriftServer2.startWithContext(hiveContext) within the Spark app starts up a thrift server. On Tue, Aug 25, 2015 at 5:32 PM, Cheng, Hao hao.ch...@intel.com wrote: Did you register the temp table via beeline or in a new Spark SQL CLI? As I know, the temp table cannot cross the HiveContext. Hao
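For anyone searching later, a minimal sketch of that startWithContext approach (the app, table and file names are placeholders, and this is written against the 1.3-era API, so treat it as untested): the thrift server shares the application's HiveContext, which is why beeline can then see its temp tables.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object TempTableServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("temp-table-server"))
    val hiveContext = new HiveContext(sc)
    val df = hiveContext.jsonFile("hdfs:///path/to/events.json") // placeholder input
    df.registerTempTable("events_tmp")
    // Starts a thrift server bound to this HiveContext; beeline connections to
    // its port will see events_tmp. The application has to stay alive afterwards.
    HiveThriftServer2.startWithContext(hiveContext)
    Thread.sleep(Long.MaxValue)
  }
}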
Json Serde used by Spark Sql
Hi, I was wondering which JSON serde Spark SQL uses. I created a JsonRDD out of a json file and then registered it as a temp table to query. I can then query the table using dot notation for nested structs/arrays. I was wondering how Spark SQL deserializes the json data based on the query. Thanks in advance, Udit
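As far as I understand, jsonFile/jsonRDD do not go through a Hive SerDe at all: the JSON is parsed by Spark SQL itself with a schema-inference pass, and at query time the records are converted into rows against that inferred schema. A small illustration of the dot-notation querying described above (field names and path are invented):

val people = sqlContext.jsonFile("hdfs:///path/to/people.json")
people.registerTempTable("people")
// Nested struct fields are addressed with dot notation.
val result = sqlContext.sql(
  "SELECT name, address.city FROM people WHERE address.zip = '94110'")
result.show()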
Spark metrics source
Hi, I am running Spark 1.3 on yarn and am trying to publish some metrics from my app. I see that we need to use the Codahale library to create a source and then specify the source in metrics.properties. Does somebody have a sample metrics source which I can use in my app to forward the metrics to a JMX sink? Thanks, Udit
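A rough, untested sketch of such a source (class and metric names are made up). One caveat: in some 1.x releases the Source trait is private[spark], in which case the class may have to live under an org.apache.spark.* package to compile.

import com.codahale.metrics.{Counter, MetricRegistry}
import org.apache.spark.metrics.source.Source

// Needs a no-arg constructor so the metrics system can instantiate it by
// reflection when it is listed in metrics.properties.
class MyAppSource extends Source {
  override val sourceName: String = "myapp"   // becomes the metric namespace
  override val metricRegistry: MetricRegistry = new MetricRegistry()
  val recordsProcessed: Counter = metricRegistry.counter(MetricRegistry.name("recordsProcessed"))
}

// metrics.properties entries (class names here match the hypothetical ones above):
//   executor.source.myapp.class=com.mycompany.metrics.MyAppSource
//   *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink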
Metrics Servlet on spark 1.2
Hi, I am unable to access the metrics servlet on Spark 1.2. I tried to access it from the app master UI on port 4040 but I don't see any metrics there. Is it a known issue with Spark 1.2 or am I doing something wrong? Also, how do I publish my own metrics and view them on this servlet? Thanks, Udit
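For reference, the servlet sink is normally enabled by default and serves JSON from the driver UI at http://<driver>:4040/metrics/json (the master and worker have their own paths in standalone mode). A metrics.properties sketch that keeps the servlet and additionally exposes everything over JMX:

# conf/metrics.properties
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
# The servlet sink is built in; its path can be adjusted if needed:
*.sink.servlet.path=/metrics/json

Custom metrics are published by registering a Codahale source, as in the sketch under the previous message.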
Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class
Thanks. Would that distribution work for HDP 2.2? On Fri, Apr 17, 2015 at 2:19 PM, Zhan Zhang zzh...@hortonworks.com wrote: You don't need to put any yarn assembly in hdfs. The spark assembly jar will include everything. It looks like your package does not include the yarn module, although I didn't find anything wrong in your mvn command. Can you check whether the ExecutorLauncher class is in your jar file or not? BTW: For spark-1.3, you can use the binary distribution from Apache. Thanks. Zhan Zhang
Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class
Hi, This is the log trace: https://gist.github.com/uditmehta27/511eac0b76e6d61f8b47 On the yarn RM UI, I see: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher The command I run is: bin/spark-shell --master yarn-client The spark defaults I use are:

spark.yarn.jar hdfs://namenode1-dev.snc1:8020/spark/spark-assembly-1.3.0-hadoop2.4.0.jar
spark.yarn.access.namenodes hdfs://namenode1-dev.snc1:8032
spark.dynamicAllocation.enabled false
spark.scheduler.mode FAIR
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

Is there anything wrong in what I am trying to do? Thanks again! On Fri, Apr 17, 2015 at 2:56 PM, Zhan Zhang zzh...@hortonworks.com wrote: Hi Udit, By the way, do you mind sharing the whole log trace? Thanks. Zhan Zhang On Apr 17, 2015, at 2:26 PM, Udit Mehta ume...@groupon.com wrote: I am just trying to launch a spark shell and not do anything fancy. I got the binary distribution from Apache and put the spark assembly on hdfs. I then specified the spark.yarn.jar option in the spark defaults to point to the assembly in hdfs. I still got the same error, so I thought I had to build it for HDP. I am using HDP 2.2 with Hadoop 2.6.
Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class
I followed the steps described above and I still get this error: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher I am trying to build Spark 1.3 on HDP 2.2. I built Spark from source using: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package Maybe I am not putting the correct yarn assembly on hdfs, or is it some other issue? Thanks, Udit On Mon, Mar 30, 2015 at 10:18 AM, Zhan Zhang zzh...@hortonworks.com wrote: Hi Folks, Just to summarize how to run Spark on the HDP distribution:
1. The Spark version has to be 1.3.0 or above if you are using the upstream distribution. This configuration is mainly for HDP rolling-upgrade purposes, and the patch only went into Spark upstream from 1.3.0.
2. In $SPARK_HOME/conf/spark-defaults.conf, add the following settings: spark.driver.extraJavaOptions -Dhdp.version=x spark.yarn.am.extraJavaOptions -Dhdp.version=x
3. In $SPARK_HOME/conf/java-opts, add the following option: -Dhdp.version=x
Thanks. Zhan Zhang On Mar 30, 2015, at 6:56 AM, Doug Balog doug.sparku...@dugos.com wrote: The "best" solution to spark-shell's problem is creating a file $SPARK_HOME/conf/java-opts with "-Dhdp.version=2.2.0.0-2014" Cheers, Doug On Mar 28, 2015, at 1:25 PM, Michael Stone mst...@mathom.us wrote: I've also been having trouble running 1.3.0 on HDP. The spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 configuration directive seems to work with pyspark, but does not propagate when using spark-shell. (That is, everything works fine with pyspark, and spark-shell fails with the bad substitution message.) Mike Stone
Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class
Thanks Zhang, that solved the error. This is probably not documented anywhere, so I missed it. Thanks again, Udit On Fri, Apr 17, 2015 at 3:24 PM, Zhan Zhang zzh...@hortonworks.com wrote: Besides the hdp.version settings in spark-defaults.conf, I think you probably forgot to put the file java-opts under $SPARK_HOME/conf with the following contents:

[root@c6402 conf]# pwd
/usr/hdp/current/spark-client/conf
[root@c6402 conf]# ls
fairscheduler.xml.template java-opts log4j.properties.template metrics.properties.template spark-defaults.conf spark-env.sh hive-site.xml log4j.properties metrics.properties slaves.template spark-defaults.conf.template spark-env.sh.template
[root@c6402 conf]# more java-opts
-Dhdp.version=2.2.0.0-2041
[root@c6402 conf]#

Thanks. Zhan Zhang
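Pulling the pieces of this thread together, the working setup ends up being the two files below (the hdp.version value has to match your own HDP build).

In $SPARK_HOME/conf/spark-defaults.conf:

spark.driver.extraJavaOptions  -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

And in $SPARK_HOME/conf/java-opts:

-Dhdp.version=2.2.0.0-2041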
How to use the --files arg
Hi, Suppose I have a command and I pass the --files arg as below:

bin/spark-submit --class com.test.HelloWorld --master yarn-cluster --num-executors 8 --driver-memory 512m --executor-memory 2048m --executor-cores 4 --queue public --files $HOME/myfile.txt --name test_1 ~/test_code-1.0-SNAPSHOT.jar

Can anyone tell me how I can access this file in my executors? Basically, I want to read this file to get some configs. I tried to read it from my HDFS home dir but that doesn't work. Thanks, Udit
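A sketch of the two approaches usually suggested for this (untested against this exact job; the file name must match what was passed to --files): on YARN the --files entries are localized into each container's working directory, so a relative path normally resolves, and SparkFiles.get is the lookup used for files distributed through the spark.files/addFile machinery.

import org.apache.spark.SparkFiles
import scala.io.Source

rdd.mapPartitions { iter =>
  // Option 1: the localized copy sits in the container's working directory.
  val config = Source.fromFile("myfile.txt").getLines().toList
  // Option 2 (alternative): resolve through SparkFiles.
  // val config = Source.fromFile(SparkFiles.get("myfile.txt")).getLines().toList
  iter.map(record => (record, config.size))
}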
Re: why Shuffle Write is not zero when everything is cached and there is enough memory?
I have noticed a similar issue when using Spark Streaming. The spark shuffle write size increases to a large size (in GB) and then the app crashes saying: java.io.FileNotFoundException: /data/vol0/nodemanager/usercache/$user/appcache/application_1427480955913_0339/spark-local-20150330231234-db1a/0b/temp_shuffle_1b23808f-f285-40b2-bec7-1c6790050d7f (No such file or directory) I don't understand why the shuffle size increases to such a large value for long-running jobs. Thanks, Udit On Mon, Mar 30, 2015 at 4:01 AM, shahab shahab.mok...@gmail.com wrote: Thanks Saisai. I will try your solution, but I still don't understand why the filesystem should be used when there is plenty of memory available! On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao sai.sai.s...@gmail.com wrote: Shuffle write will finally spill the data into the file system as a bunch of files. If you want to avoid disk writes, you can mount a ramdisk and configure spark.local.dir to this ram disk. So the shuffle output will be written to a memory-based FS and will not introduce disk IO. Thanks Jerry 2015-03-30 17:15 GMT+08:00 shahab shahab.mok...@gmail.com: Hi, I was looking at the SparkUI, Executors, and I noticed that I have 597 MB for Shuffle while I am using a cached temp-table and Spark had 2 GB free memory (the number under Memory Used is 597 MB / 2.6 GB)?! Shouldn't Shuffle Write be zero and all map/reduce tasks be done in memory? best, /Shahab
Re: why Shuffle Write is not zero when everything is cached and there is enough memory?
Thanks for the reply. This will reduce the shuffle write to disk to an extent, but for a long-running job (multiple days) the shuffle write would still occupy a lot of space on disk. Why do we need to store the data from the older map tasks? On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak bijay.pat...@cloudwick.com wrote: The Spark sort-based shuffle (default from 1.1) keeps the data from each map task in memory until it can't fit, after which it is sorted and spilled to disk. You can reduce the shuffle write to disk by increasing spark.shuffle.memoryFraction (default 0.2). By writing the shuffle output to disk, the Spark lineage can be truncated when the RDDs are already materialized as the side-effect of an earlier shuffle. This is an under-the-hood optimization in Spark which is only possible because the shuffle output is written to disk. You can set spark.shuffle.spill to false if you don't want to spill to disk, assuming you have enough heap memory.
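For reference, a small spark-defaults sketch pulling together the knobs mentioned in this thread (the values and ramdisk mount point are illustrative only, for the 1.x sort-based shuffle):

spark.shuffle.memoryFraction   0.4
spark.shuffle.spill            false
spark.local.dir                /mnt/ramdisk/spark

Note that spark.shuffle.spill false only helps if the map-side data really fits in the heap, and pointing spark.local.dir at a ramdisk trades disk IO for memory that is then unavailable to the executors.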
log4j.properties in jar
Hi, Is it possible to put the log4j.properties in the application jar such that the driver and the executors use this log4j file? Do I need to specify anything while submitting my app so that this file is used? Thanks, Udit
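In case the copy inside the jar does not take precedence over Spark's own conf, a commonly used alternative is to ship the file explicitly and point both JVMs at it. A sketch (the class, jar and path names reuse the placeholders from the earlier --files question and are not from a tested setup):

bin/spark-submit \
  --class com.test.HelloWorld \
  --master yarn-cluster \
  --files /path/to/log4j.properties \
  --driver-java-options "-Dlog4j.configuration=log4j.properties" \
  --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j.properties" \
  ~/test_code-1.0-SNAPSHOT.jar

In yarn-cluster mode the file is localized into both the driver and executor containers, so the bare file name resolves on their classpaths.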
Re: Hive context datanucleus error
Has this issue been fixed in Spark 1.2: https://issues.apache.org/jira/browse/SPARK-2624 On Mon, Mar 23, 2015 at 9:19 PM, Udit Mehta ume...@groupon.com wrote: I am trying to run a simple query to view tables in my hive metastore using a hive context. I am getting this error: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification. I am able to access the metastore using spark-sql. Can someone point out what the issue could be? Thanks
Re: Does HiveContext connect to HiveServer2?
Another question related to this: how can we propagate the hive-site.xml to all workers when running in yarn-cluster mode? On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com wrote: It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to your metastore server; otherwise it will create its own metastore in the working directory (IIRC). On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com wrote: I am wondering if HiveContext connects to HiveServer2 or does it work through the Hive CLI. The reason I am asking is because Cloudera has deprecated the Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials? -- Marcelo
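A minimal sketch of what is usually done for that (the paths and class name are placeholders): pass hive-site.xml with --files so it is localized next to the driver and executors on the cluster, and the HiveContext picks up the real metastore configuration.

bin/spark-submit \
  --master yarn-cluster \
  --files /etc/hive/conf/hive-site.xml \
  --class com.test.MyHiveApp \
  ~/my-hive-app.jar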
Re: Spark per app logging
Yes, each application can use its own log4j.properties, but I am not sure how to configure log4j so that the driver and executors write to a file. This is because if we set spark.executor.extraJavaOptions it will read from a file, and that is not what I need. How do I configure log4j from the app so that the driver and the executors use these configs? Thanks, Udit On Sat, Mar 21, 2015 at 3:13 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi, I'm not completely sure about this either, but this is what we are doing currently: configure your logging to write to STDOUT, not to a file explicitly. Spark will capture stdout and stderr and separate the messages into an app/driver folder structure in the configured worker directory. We then use logstash to collect the logs and index them into an Elasticsearch cluster (Spark seems to produce a lot of logging data). With some simple regex processing, you also get the application id as a searchable field. Regards, Jeff 2015-03-20 22:37 GMT+01:00 Ted Yu yuzhih...@gmail.com: Are these jobs the same jobs, just run by different users, or different jobs? If the latter, can each application use its own log4j.properties? Cheers On Fri, Mar 20, 2015 at 1:43 PM, Udit Mehta ume...@groupon.com wrote: Hi, We have a spark setup such that there are various users running multiple jobs at the same time. Currently all the logs go to one file specified in the log4j.properties. Is it possible to configure log4j in Spark for per-app/user logging instead of sending all logs to the one file mentioned in the log4j.properties? Thanks, Udit
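If configuring it from application code is really the goal, a rough log4j 1.x sketch (the log file path, appender name and pattern are all made up): run this on the driver at startup; for the executors the same call would have to execute inside their JVMs (for example from a lazily initialized object referenced in tasks), which is more fragile than shipping a per-app properties file.

import java.util.Properties
import org.apache.log4j.PropertyConfigurator

object AppLogging {
  def configure(appName: String): Unit = {
    val props = new Properties()
    props.setProperty("log4j.rootLogger", "INFO, file")
    props.setProperty("log4j.appender.file", "org.apache.log4j.FileAppender")
    props.setProperty("log4j.appender.file.File", s"/var/log/spark-apps/$appName.log")
    props.setProperty("log4j.appender.file.layout", "org.apache.log4j.PatternLayout")
    props.setProperty("log4j.appender.file.layout.ConversionPattern", "%d %p %c - %m%n")
    PropertyConfigurator.configure(props)   // replaces whatever log4j config was loaded
  }
}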
Hive context datanucleus error
I am trying to run a simple query to view tables in my hive metastore using a hive context. I am getting this error: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification. I am able to access the metastore using spark-sql. Can someone point out what the issue could be? Thanks
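One commonly suggested fix, sketched below (the jar versions are whatever ships in your Spark build's lib/ directory, and the class and jar names are placeholders): hand the DataNucleus jars and hive-site.xml to the application explicitly, since they are not part of the Spark assembly.

bin/spark-submit \
  --class com.test.MyHiveApp \
  --master yarn-cluster \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,lib/datanucleus-rdbms-3.2.9.jar \
  --files conf/hive-site.xml \
  ~/my-hive-app.jar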
Spark per app logging
Hi, We have a spark setup such that there are various users running multiple jobs at the same time. Currently all the logs go to one file specified in the log4j.properties. Is it possible to configure log4j in Spark for per-app/user logging instead of sending all logs to the one file mentioned in the log4j.properties? Thanks, Udit