Read HDFS file from an executor (closure)

2016-01-12 Thread Udit Mehta
Hi,

Is there a way to read a text file from inside a Spark executor? I need to
do this for a streaming application where we need to read a file (whose
contents would change) from a closure.

I cannot use the "sc.textFile" method since the Spark context is not
serializable. I also cannot read the file using the Hadoop API, since the
"FileSystem" class is not serializable either.

Does anyone have any idea on how I can go about this?
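
One workaround is to open the file with the Hadoop API from inside the
closure, so the non-serializable Configuration/FileSystem objects are created
on the executor rather than captured from the driver. A minimal sketch of this
idea, with a hypothetical HDFS path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object ReadFileInClosure {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-file-in-closure"))
    val result = sc.parallelize(1 to 100).mapPartitions { iter =>
      // Built here, on the executor, so nothing non-serializable is shipped.
      val fs = FileSystem.get(new Configuration())
      val in = fs.open(new Path("hdfs:///configs/lookup.txt")) // hypothetical path
      val lines = try Source.fromInputStream(in).getLines().toList finally in.close()
      iter.map(n => (n, lines.size))
    }
    result.count()
    sc.stop()
  }
}

In a streaming job, the same pattern inside foreachRDD or transform re-reads
the file on every batch, which also picks up changed contents.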

Thanks,
Udit


Kafka Direct Stream

2015-09-30 Thread Udit Mehta
Hi,

I am using the Spark direct stream to consume from multiple topics in Kafka. I
am able to consume fine, but I am stuck on how to separate the data for each
topic, since I need to process data differently depending on the topic.
I basically want to split the RDD consisting of N topics into N RDDs, each
having 1 topic.

Any help would be appreciated.

Thanks in advance,
Udit
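
One common pattern (a sketch, assuming the Kafka 0.8 direct API in Spark 1.3+;
broker and topic names are hypothetical) relies on each partition of a
direct-stream RDD mapping to exactly one Kafka topic-partition, so the topic
can be recovered from the offset ranges and used to route records:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setAppName("multi-topic"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("topicA", "topicB")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Partition i of the RDD holds exactly the records described by offsetRanges(i).
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val keyedByTopic = rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = ranges(i).topic
    iter.map { case (_, value) => (topic, value) }
  }
  // One RDD per topic, each with topic-specific processing.
  topics.foreach { t =>
    val perTopic = keyedByTopic.filter(_._1 == t).map(_._2)
    // process perTopic ...
  }
}

ssc.start()
ssc.awaitTermination()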


Provide sampling ratio while loading json in spark version > 1.4.0

2015-09-23 Thread Udit Mehta
Hi,

In earlier versions of Spark (< 1.4.0), we were able to specify the sampling
ratio while using *sqlContext.jsonFile* or *sqlContext.jsonRDD* so that we
don't inspect each and every element while inferring the schema.
I see that the use of these methods is deprecated in the newer Spark
versions and the suggested way is to use *read().json()* to load a json file
and return a DataFrame. Is there a way to specify the sampling ratio using
these methods? Or am I doing something incorrect?

Thanks,
Udit
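
For what it's worth, the sampling knob appears to be exposed as a data source
option on the new reader API; a sketch, assuming your Spark version's json
source honors the samplingRatio option and that sqlContext is already in
scope (the path is hypothetical):

val df = sqlContext.read
  .option("samplingRatio", "0.1") // infer the schema from roughly 10% of the records
  .json("hdfs:///data/events.json")

// Alternatively, supplying an explicit schema skips inference entirely:
// val df2 = sqlContext.read.schema(mySchema).json("hdfs:///data/events.json")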


Spark thrift server on yarn

2015-08-25 Thread Udit Mehta
Hi,

I am trying to start a spark thrift server using the following command on
Spark 1.3.1 running on yarn:



* ./sbin/start-thriftserver.sh --master yarn://resourcemanager.snc1:8032
--executor-memory 512m --hiveconf
hive.server2.thrift.bind.host=test-host.sn1 --hiveconf
hive.server2.thrift.port=10001 --queue public*
It starts up fine and is able to connect to the Hive metastore.
I now need to view some temporary tables through this thrift server, so I
start up Spark SQL and register a temp table.
The problem is that I am unable to view the temp table using the beeline
client. I am pretty sure I am going wrong somewhere, and the Spark
documentation does not clearly say how to run the thrift server in yarn
mode, or maybe I missed something.
Could someone tell me how this is to be done or point me to some
documentation?

Thanks in advance,
Udit


Re: Spark thrift server on yarn

2015-08-25 Thread Udit Mehta
I registered it in a new Spark SQL CLI. Yeah, I thought so too, about how the
temp tables would be accessible across different applications without using a
job-server. I see that running *HiveThriftServer2.startWithContext(hiveContext)*
within the Spark app starts up a thrift server.
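
A minimal sketch of that approach (it assumes the spark-hive-thriftserver
module is on the classpath; the input path and table name are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("thrift-with-temp-tables"))
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.server2.thrift.port", "10001")

val df = hiveContext.jsonFile("hdfs:///data/sample.json") // hypothetical input
df.registerTempTable("my_temp_table")

// Temp tables registered on this HiveContext become visible to beeline
// clients that connect to the server started below.
HiveThriftServer2.startWithContext(hiveContext)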

On Tue, Aug 25, 2015 at 5:32 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Did you register the temp table via beeline or in a new Spark SQL CLI?



 As far as I know, a temp table cannot cross HiveContext instances.



 Hao



 *From:* Udit Mehta [mailto:ume...@groupon.com]
 *Sent:* Wednesday, August 26, 2015 8:19 AM
 *To:* user
 *Subject:* Spark thrift server on yarn



 Hi,

 I am trying to start a spark thrift server using the following command on
 Spark 1.3.1 running on yarn:

 * ./sbin/start-thriftserver.sh --master yarn://resourcemanager.snc1:8032
 --executor-memory 512m --hiveconf
 hive.server2.thrift.bind.host=test-host.sn1 --hiveconf
 hive.server2.thrift.port=10001 --queue public*

 It starts up fine and is able to connect to the hive metastore.

 I now need to view some temporary tables using this thrift server so I
 start up SparkSql and register a temp table.

 But the problem is that I am unable to view the temp table using the
 beeline client. I am pretty sure I am going wrong somewhere and the spark
 documentation does not clearly say how to run the thrift server in yarn
 mode or maybe I missed something.
 Could someone tell me how this is to be done or point me to some
 documentation?

 Thanks in advance,

 Udit



Json Serde used by Spark Sql

2015-08-18 Thread Udit Mehta
Hi,

I was wondering which JSON serde Spark SQL uses. I created a JsonRDD out
of a json file and then registered it as a temp table to query. I can then
query the table using dot notation for nested structs/arrays. I was
wondering how Spark SQL deserializes the json data based on the query.

Thanks in advance,
Udit
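
As a small illustration of the dot-notation querying mentioned above (a
sketch that assumes a spark-shell style session where sc and sqlContext
already exist; the sample record is made up):

val jsonRDD = sc.parallelize(Seq(
  """{"user": {"name": "alice", "tags": ["a", "b"]}, "clicks": 3}"""))
val df = sqlContext.jsonRDD(jsonRDD) // schema is inferred by scanning the data
df.registerTempTable("events")
df.printSchema()
// Nested fields and array elements can be addressed directly in the query:
sqlContext.sql("SELECT user.name, user.tags[0], clicks FROM events").show()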


Spark metrics source

2015-04-20 Thread Udit Mehta
Hi,

I am running Spark 1.3 on yarn and am trying to publish some metrics from
my app. I see that we need to use the Codahale library to create a source
and then specify the source in metrics.properties.
Does somebody have a sample metrics source which I can use in my app to
forward the metrics to a JMX sink?

Thanks,
Udit
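
A rough sketch of a Codahale-based source; note the hedges: the Source trait
has been private[spark] in some releases, so the class may need to live under
an org.apache.spark.* package, and the metric name, counter, and sink line in
the comments are only examples:

// Workaround for Source being private[spark] in some Spark versions.
package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, MetricRegistry}

class MyAppSource extends Source {
  override val sourceName: String = "myApp"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  // Hypothetical application metric, updated from job code.
  val recordsProcessed: Counter =
    metricRegistry.counter(MetricRegistry.name("recordsProcessed"))
}

// Register it once the SparkContext is up, e.g.:
//   SparkEnv.get.metricsSystem.registerSource(new MyAppSource)
// and enable a JMX sink in metrics.properties:
//   *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink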


Metrics Servlet on spark 1.2

2015-04-17 Thread Udit Mehta
Hi,

I am unable to access the metrics servlet on Spark 1.2. I tried to access
it from the app master UI on port 4040 but I don't see any metrics there. Is
it a known issue with Spark 1.2, or am I doing something wrong?
Also, how do I publish my own metrics and view them on this servlet?

Thanks,
Udit


Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Thanks. Would that distribution work for hdp 2.2?

On Fri, Apr 17, 2015 at 2:19 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  You don’t need to put any yarn assembly in hdfs. The spark assembly jar
 will include everything. It looks like your package does not include yarn
 module, although I didn’t find anything wrong in your mvn command. Can you
 check whether the ExecutorLauncher class is in your jar file or not?

  BTW: For spark-1.3, you can use the binary distribution from apache.

  Thanks.

  Zhan Zhang



  On Apr 17, 2015, at 2:01 PM, Udit Mehta ume...@groupon.com wrote:

I followed the steps described above and I still get this error:


 Error: Could not find or load main class 
 org.apache.spark.deploy.yarn.ExecutorLauncher


  I am trying to build spark 1.3 on hdp 2.2.
  I built spark from source using:
 build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
 -Phive-thriftserver -DskipTests package

  Maybe I am not putting the correct yarn assembly on hdfs or some other
 issue?

  Thanks,
  Udit

 On Mon, Mar 30, 2015 at 10:18 AM, Zhan Zhang zzh...@hortonworks.com
 wrote:

 Hi Folks,

  Just to summarize how to run Spark on an HDP distribution:

  1. The Spark version has to be 1.3.0 or above if you are using the
 upstream distribution. This configuration is mainly for HDP rolling
 upgrade purposes, and the patch only went into upstream Spark from 1.3.0.

  2. In $SPARK_HOME/conf/spark-defaults.conf, add the following settings:
 spark.driver.extraJavaOptions -Dhdp.version=x
 spark.yarn.am.extraJavaOptions -Dhdp.version=x

  3. In $SPARK_HOME/java-opts, add the following option:
 -Dhdp.version=x

  Thanks.

  Zhan Zhang



  On Mar 30, 2015, at 6:56 AM, Doug Balog doug.sparku...@dugos.com
 wrote:

 The “best” solution to spark-shell’s  problem is creating a file
 $SPARK_HOME/conf/java-opts
 with “-Dhdp.version=2.2.0.0-2014”

 Cheers,

 Doug

 On Mar 28, 2015, at 1:25 PM, Michael Stone mst...@mathom.us wrote:

 I've also been having trouble running 1.3.0 on HDP. The
 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
 configuration directive seems to work with pyspark, but not to propagate
 when using spark-shell. (That is, everything works fine with pyspark, and
 spark-shell fails with the bad substitution message.)

 Mike Stone











Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Hi,

This is the log trace:
https://gist.github.com/uditmehta27/511eac0b76e6d61f8b47

On the yarn RM UI, I see :

Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher


The command I run is: bin/spark-shell --master yarn-client

The spark defaults I use are:
spark.yarn.jar
hdfs://namenode1-dev.snc1:8020/spark/spark-assembly-1.3.0-hadoop2.4.0.jar
spark.yarn.access.namenodes hdfs://namenode1-dev.snc1:8032
spark.dynamicAllocation.enabled false
spark.scheduler.mode FAIR
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

Is there anything wrong in what I am trying to do?

thanks again!


On Fri, Apr 17, 2015 at 2:56 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  Hi Udit,

  By the way, do you mind to share the whole log trace?

  Thanks.

  Zhan Zhang

  On Apr 17, 2015, at 2:26 PM, Udit Mehta ume...@groupon.com wrote:

  I am just trying to launch a spark shell and not do anything fancy. I
 got the binary distribution from apache and put the spark assembly on hdfs.
 I then specified the yarn.jars option in spark defaults to point to the
 assembly in hdfs. I still got the same error, so I thought I had to build it
 for HDP. I am using HDP 2.2 with Hadoop 2.6.

 On Fri, Apr 17, 2015 at 2:21 PM, Udit Mehta ume...@groupon.com wrote:

 Thanks. Would that distribution work for hdp 2.2?

 On Fri, Apr 17, 2015 at 2:19 PM, Zhan Zhang zzh...@hortonworks.com
 wrote:

  You don’t need to put any yarn assembly in hdfs. The spark assembly
 jar will include everything. It looks like your package does not include
 yarn module, although I didn’t find anything wrong in your mvn command. Can
 you check whether the ExecutorLauncher class is in your jar file or not?

  BTW: For spark-1.3, you can use the binary distribution from apache.

  Thanks.

  Zhan Zhang



  On Apr 17, 2015, at 2:01 PM, Udit Mehta ume...@groupon.com wrote:

I followed the steps described above and I still get this error:


 Error: Could not find or load main class 
 org.apache.spark.deploy.yarn.ExecutorLauncher


  I am trying to build spark 1.3 on hdp 2.2.
  I built spark from source using:
 build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
 -Phive-thriftserver -DskipTests package

  Maybe I am not putting the correct yarn assembly on hdfs or some other
 issue?

  Thanks,
  Udit

 On Mon, Mar 30, 2015 at 10:18 AM, Zhan Zhang zzh...@hortonworks.com
 wrote:

 Hi Folks,

  Just to summarize how to run Spark on an HDP distribution:

  1. The Spark version has to be 1.3.0 or above if you are using the
 upstream distribution. This configuration is mainly for HDP rolling
 upgrade purposes, and the patch only went into upstream Spark from 1.3.0.

  2. In $SPARK_HOME/conf/spark-defaults.conf, add the following settings:
 spark.driver.extraJavaOptions -Dhdp.version=x
 spark.yarn.am.extraJavaOptions -Dhdp.version=x

  3. In $SPARK_HOME/java-opts, add the following option:
 -Dhdp.version=x

  Thanks.

  Zhan Zhang



  On Mar 30, 2015, at 6:56 AM, Doug Balog doug.sparku...@dugos.com
 wrote:

 The “best” solution to spark-shell’s  problem is creating a file
 $SPARK_HOME/conf/java-opts
 with “-Dhdp.version=2.2.0.0-2014”

 Cheers,

 Doug

 On Mar 28, 2015, at 1:25 PM, Michael Stone mst...@mathom.us wrote:

 I've also been having trouble running 1.3.0 on HDP. The
 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
 configuration directive seems to work with pyspark, but not to propagate
 when using spark-shell. (That is, everything works fine with pyspark, and
 spark-shell fails with the bad substitution message.)

 Mike Stone














Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
I followed the steps described above and I still get this error:


Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher


I am trying to build spark 1.3 on hdp 2.2.
I built spark from source using:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
-Phive-thriftserver -DskipTests package

Maybe I am not putting the correct yarn assembly on HDFS, or is it some other
issue?

Thanks,
Udit

On Mon, Mar 30, 2015 at 10:18 AM, Zhan Zhang zzh...@hortonworks.com wrote:

  Hi Folks,

  Just to summarize how to run Spark on an HDP distribution:

  1. The Spark version has to be 1.3.0 or above if you are using the
 upstream distribution. This configuration is mainly for HDP rolling
 upgrade purposes, and the patch only went into upstream Spark from 1.3.0.

  2. In $SPARK_HOME/conf/spark-defaults.conf, add the following settings:
 spark.driver.extraJavaOptions -Dhdp.version=x
 spark.yarn.am.extraJavaOptions -Dhdp.version=x

  3. In $SPARK_HOME/java-opts, add the following option:
 -Dhdp.version=x

  Thanks.

  Zhan Zhang



  On Mar 30, 2015, at 6:56 AM, Doug Balog doug.sparku...@dugos.com wrote:

 The “best” solution to spark-shell’s  problem is creating a file
 $SPARK_HOME/conf/java-opts
 with “-Dhdp.version=2.2.0.0-2014”

 Cheers,

 Doug

 On Mar 28, 2015, at 1:25 PM, Michael Stone mst...@mathom.us wrote:

 I've also been having trouble running 1.3.0 on HDP. The
 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
 configuration directive seems to work with pyspark, but not to propagate
 when using spark-shell. (That is, everything works fine with pyspark, and
 spark-shell fails with the bad substitution message.)

 Mike Stone









Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Thanks Zhang, that solved the error. This is probably not documented
anywhere so I missed it.

Thanks again,
Udit

On Fri, Apr 17, 2015 at 3:24 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  Besides the hdp.version in spark-defaults.conf, I think you probably
 forgot to put the file *java-opts* under $SPARK_HOME/conf with the following
 contents:

   [root@c6402 conf]# pwd
 /usr/hdp/current/spark-client/conf
 [root@c6402 conf]# ls
 fairscheduler.xml.template  *java-opts*  log4j.properties.template
 metrics.properties.template  spark-defaults.conf  spark-env.sh
 hive-site.xml  log4j.properties  metrics.properties
 slaves.template  spark-defaults.conf.template  spark-env.sh.template
 *[root@c6402 conf]# more java-opts*
 *-Dhdp.version=2.2.0.0-2041*
 [root@c6402 conf]#


  Thanks.

  Zhan Zhang


  On Apr 17, 2015, at 3:09 PM, Udit Mehta ume...@groupon.com wrote:

 Hi,

  This is the log trace:
 https://gist.github.com/uditmehta27/511eac0b76e6d61f8b47

  On the yarn RM UI, I see :

 Error: Could not find or load main class 
 org.apache.spark.deploy.yarn.ExecutorLauncher


  The command I run is: bin/spark-shell --master yarn-client

  The spark defaults I use are:
 spark.yarn.jar
 hdfs://namenode1-dev.snc1:8020/spark/spark-assembly-1.3.0-hadoop2.4.0.jar
 spark.yarn.access.namenodes hdfs://namenode1-dev.snc1:8032
 spark.dynamicAllocation.enabled false
 spark.scheduler.mode FAIR
 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

  Is there anything wrong in what I am trying to do?

  thanks again!


 On Fri, Apr 17, 2015 at 2:56 PM, Zhan Zhang zzh...@hortonworks.com
 wrote:

 Hi Udit,

  By the way, do you mind to share the whole log trace?

  Thanks.

  Zhan Zhang

  On Apr 17, 2015, at 2:26 PM, Udit Mehta ume...@groupon.com wrote:

  I am just trying to launch a spark shell and not do anything fancy. I
 got the binary distribution from apache and put the spark assembly on hdfs.
 I then specified the yarn.jars option in spark defaults to point to the
 assembly in hdfs. I still got the same error, so I thought I had to build it
 for HDP. I am using HDP 2.2 with Hadoop 2.6.

 On Fri, Apr 17, 2015 at 2:21 PM, Udit Mehta ume...@groupon.com wrote:

 Thanks. Would that distribution work for hdp 2.2?

 On Fri, Apr 17, 2015 at 2:19 PM, Zhan Zhang zzh...@hortonworks.com
 wrote:

  You don’t need to put any yarn assembly in hdfs. The spark assembly
 jar will include everything. It looks like your package does not include
 yarn module, although I didn’t find anything wrong in your mvn command. Can
 you check whether the ExecutorLauncher class is in your jar file or
 not?

  BTW: For spark-1.3, you can use the binary distribution from apache.

  Thanks.

  Zhan Zhang



  On Apr 17, 2015, at 2:01 PM, Udit Mehta ume...@groupon.com wrote:

I followed the steps described above and I still get this error:


 Error: Could not find or load main class 
 org.apache.spark.deploy.yarn.ExecutorLauncher


  I am trying to build spark 1.3 on hdp 2.2.
  I built spark from source using:
 build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
 -Phive-thriftserver -DskipTests package

  Maybe I am not putting the correct yarn assembly on hdfs or some other
 issue?

  Thanks,
  Udit

 On Mon, Mar 30, 2015 at 10:18 AM, Zhan Zhang zzh...@hortonworks.com
 wrote:

 Hi Folks,

  Just to summarize how to run Spark on an HDP distribution:

  1. The Spark version has to be 1.3.0 or above if you are using the
 upstream distribution. This configuration is mainly for HDP rolling
 upgrade purposes, and the patch only went into upstream Spark from 1.3.0.

  2. In $SPARK_HOME/conf/spark-defaults.conf, add the following settings:
 spark.driver.extraJavaOptions -Dhdp.version=x
 spark.yarn.am.extraJavaOptions -Dhdp.version=x

  3. In $SPARK_HOME/java-opts, add the following option:
 -Dhdp.version=x

  Thanks.

  Zhan Zhang



  On Mar 30, 2015, at 6:56 AM, Doug Balog doug.sparku...@dugos.com
 wrote:

 The “best” solution to spark-shell’s  problem is creating a file
 $SPARK_HOME/conf/java-opts
 with “-Dhdp.version=2.2.0.0-2014”

 Cheers,

 Doug

 On Mar 28, 2015, at 1:25 PM, Michael Stone mst...@mathom.us wrote:

 I've also been having trouble running 1.3.0 on HDP. The
 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
 configuration directive seems to work with pyspark, but not to propagate
 when using spark-shell. (That is, everything works fine with pyspark, and
 spark-shell fails with the bad substitution message.)

 Mike Stone
















How to use the --files arg

2015-04-10 Thread Udit Mehta
Hi,

Suppose I have a command and I pass the --files arg as below:

bin/spark-submit --class com.test.HelloWorld --master yarn-cluster
--num-executors 8 --driver-memory 512m --executor-memory 2048m
--executor-cores 4 --queue public *--files $HOME/myfile.txt* --name
test_1 ~/test_code-1.0-SNAPSHOT.jar

Can anyone tell me how I can access this file in my executors?
Basically I want to read this file to get some configs. I tried to read it
from my HDFS home dir but that doesn't work.

Thanks,
Udit
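
One approach that is commonly suggested for YARN: a file shipped with --files
is localized into each container's working directory, so it can usually be
read by its bare name from executor code. A sketch using the file name from
the command above:

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object FilesArgExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("files-arg-example"))
    val result = sc.parallelize(1 to 10).mapPartitions { iter =>
      // myfile.txt was shipped with --files and sits in the container's
      // working directory; SparkFiles.get("myfile.txt") may also resolve it,
      // depending on the deploy mode.
      val config = Source.fromFile("myfile.txt").getLines().toList
      iter.map(_ => config.size)
    }
    result.count()
    sc.stop()
  }
}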


Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
I have noticed a similar issue when using Spark Streaming. The Spark
shuffle write size increases to a large size (in GB) and then the app
crashes saying:
java.io.FileNotFoundException:
/data/vol0/nodemanager/usercache/$user/appcache/application_1427480955913_0339/spark-local-20150330231234-db1a/0b/temp_shuffle_1b23808f-f285-40b2-bec7-1c6790050d7f
(No such file or directory)

I don't understand why the shuffle size increases to such a large value for
long-running jobs.

Thanks,
Udiy

On Mon, Mar 30, 2015 at 4:01 AM, shahab shahab.mok...@gmail.com wrote:

 Thanks Saisai. I will try your solution, but I still don't understand why
 the filesystem should be used when there is plenty of memory available!



 On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Shuffle write will finally spill the data into file system as a bunch of
 files. If you want to avoid disk write, you can mount a ramdisk and
 configure spark.local.dir to this ram disk. So shuffle output will write
 to memory based FS, and will not introduce disk IO.

 Thanks
 Jerry

 2015-03-30 17:15 GMT+08:00 shahab shahab.mok...@gmail.com:

 Hi,

 I was looking at the Spark UI, Executors tab, and I noticed that I have 597 MB
 of Shuffle Write while I am using a cached temp table and Spark had 2 GB of
 free memory (the number under Memory Used is 597 MB / 2.6 GB)?!

 Shouldn't Shuffle Write be zero, with all (map/reduce) tasks done in memory?

 best,

 /Shahab






Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
Thanks for the reply.
This will reduce the shuffle write to disk to an extent, but for a
long-running job (multiple days), the shuffle write would still occupy a lot
of space on disk. Why do we need to keep the data from older map tasks in
memory?
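
For reference, the knobs mentioned in the quoted reply below can be set on the
SparkConf; a sketch with purely illustrative values:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-tuning")
  .set("spark.shuffle.memoryFraction", "0.4") // default 0.2: more room before spilling
  .set("spark.shuffle.spill", "false")        // only if the heap is comfortably large
val sc = new SparkContext(conf)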

On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak bijay.pat...@cloudwick.com
wrote:

 The Spark sort-based shuffle (the default from 1.1) keeps the data from
 each map task in memory until it no longer fits, after which it is
 sorted and spilled to disk. You can reduce the shuffle write to
 disk by increasing spark.shuffle.memoryFraction (default 0.2).

 By writing the shuffle output to disk, the Spark lineage can be
 truncated when the RDDs are already materialized as the side effects
 of an earlier shuffle. This is an under-the-hood optimization in Spark
 which is only possible because the shuffle output is written
 to disk.

 You can set spark.shuffle.spill to false if you don't want to spill to
 disk, assuming you have enough heap memory.

 On Tue, Mar 31, 2015 at 12:35 PM, Udit Mehta ume...@groupon.com wrote:
  I have noticed a similar issue when using spark streaming. The spark
 shuffle
  write size increases to a large size(in GB) and then the app crashes
 saying:
  java.io.FileNotFoundException:
 
 /data/vol0/nodemanager/usercache/$user/appcache/application_1427480955913_0339/spark-local-20150330231234-db1a/0b/temp_shuffle_1b23808f-f285-40b2-bec7-1c6790050d7f
  (No such file or directory)
 
  I dont understand why the shuffle size increases to such a large value
 for
  long running jobs.
 
  Thanks,
  Udiy
 
  On Mon, Mar 30, 2015 at 4:01 AM, shahab shahab.mok...@gmail.com wrote:
 
  Thanks Saisai. I will try your solution, but still i don't understand
 why
  filesystem should be used where there is a plenty of memory available!
 
 
 
  On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao sai.sai.s...@gmail.com
  wrote:
 
  Shuffle write will finally spill the data into file system as a bunch
 of
  files. If you want to avoid disk write, you can mount a ramdisk and
  configure spark.local.dir to this ram disk. So shuffle output will
 write
  to memory based FS, and will not introduce disk IO.
 
  Thanks
  Jerry
 
  2015-03-30 17:15 GMT+08:00 shahab shahab.mok...@gmail.com:
 
  Hi,
 
  I was looking at SparkUI, Executors, and I noticed that I have 597 MB
  for  Shuffle while I am using cached temp-table and the Spark had 2
 GB free
  memory (the number under Memory Used is 597 MB /2.6 GB) ?!!!
 
  Shouldn't be Shuffle Write be zero and everything (map/reduce) tasks
 be
  done in memory?
 
  best,
 
  /Shahab
 
 
 
 



log4j.properties in jar

2015-03-30 Thread Udit Mehta
Hi,


Is it possible to put log4j.properties in the application jar so that
the driver and the executors use this log4j file? Do I need to specify
anything while submitting my app so that this file is used?

Thanks,
Udit
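
One way that is sometimes used is to bundle log4j.properties at the root of
the application jar and load it from the classpath at startup. A sketch,
hedged: depending on how early Spark initializes its own logging, you may
still need -Dlog4j.configuration or --files to override it:

import org.apache.log4j.PropertyConfigurator

object LoggingSetup {
  def init(): Unit = {
    // Looks up log4j.properties bundled at the root of the application jar.
    val url = getClass.getResource("/log4j.properties")
    if (url != null) PropertyConfigurator.configure(url)
  }
}

// Call LoggingSetup.init() first thing in the driver's main(); for the
// executor side, see the initialization sketch in the "Spark per app logging"
// thread further below.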


Re: Hive context datanucleus error

2015-03-24 Thread Udit Mehta
Has this issue been fixed in Spark 1.2?
https://issues.apache.org/jira/browse/SPARK-2624

On Mon, Mar 23, 2015 at 9:19 PM, Udit Mehta ume...@groupon.com wrote:

 I am trying to run a simple query to view tables in my Hive metastore
 using a HiveContext.
 I am getting this error:
 Persistence process has been specified to use a *ClassLoaderResolver* of
 name "datanucleus" yet this has not been found by the DataNucleus plugin
 mechanism. Please check your CLASSPATH and plugin specification.

 I am able to access the metastore using the spark-sql.
 Can someone point out what the issue could be?

 thanks



Re: Does HiveContext connect to HiveServer2?

2015-03-24 Thread Udit Mehta
Another question related to this: how can we propagate hive-site.xml to
all workers when running in yarn-cluster mode?

On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com
wrote:

 It does neither. If you provide a Hive configuration to Spark,
 HiveContext will connect to your metastore server, otherwise it will
 create its own metastore in the working directory (IIRC).

 On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com
 wrote:
  I am wondering if HiveContext connects to HiveServer2 or whether it works
  through the Hive CLI. The reason I am asking is that Cloudera has deprecated
  the Hive CLI.
 
  If the connection is through HiverServer2, is there a way to specify user
  credentials?
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 



 --
 Marcelo





Re: Spark per app logging

2015-03-23 Thread Udit Mehta
Yes, each application can use its own log4j.properties, but I am not sure how
to configure log4j so that the driver and executors write to a file. This is
because if we set spark.executor.extraJavaOptions, it will read from a file,
and that is not what I need.
How do I configure log4j from the app so that the driver and the executors
use these configs?

Thanks,
Udit
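
One pattern for the executor side is to reference a lazily-initialized object
from a closure so the configuration runs once per executor JVM; a sketch in
which the appender, layout, and file name are purely illustrative:

import org.apache.log4j.{Level, Logger, PatternLayout, RollingFileAppender}

object ExecutorLogging {
  // Evaluated at most once per JVM, the first time it is referenced.
  lazy val init: Unit = {
    val appender = new RollingFileAppender(
      new PatternLayout("%d %p %c - %m%n"), "myapp-executor.log") // hypothetical file
    Logger.getRootLogger.addAppender(appender)
    Logger.getRootLogger.setLevel(Level.INFO)
  }
}

// Touch it from every executor, for example:
//   rdd.foreachPartition { _ => ExecutorLogging.init }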

On Sat, Mar 21, 2015 at 3:13 AM, Jeffrey Jedele jeffrey.jed...@gmail.com
wrote:

 Hi,
 I'm not completely sure about this either, but this is what we are doing
 currently:
 Configure your logging to write to STDOUT, not to a file explicitly.
 Spark will capture stdout and stderr and separate the messages into an
 app/driver folder structure in the configured worker directory.

 We then use logstash to collect the logs and index them into an Elasticsearch
 cluster (Spark seems to produce a lot of logging data). With some simple
 regex processing, you also get the application id as a searchable field.

 Regards,
 Jeff

 2015-03-20 22:37 GMT+01:00 Ted Yu yuzhih...@gmail.com:

 Are these jobs the same jobs, just run by different users, or different
 jobs?
 If the latter, can each application use its own log4j.properties?

 Cheers

 On Fri, Mar 20, 2015 at 1:43 PM, Udit Mehta ume...@groupon.com wrote:

 Hi,

 We have spark setup such that there are various users running multiple
 jobs at the same time. Currently all the logs go to 1 file specified in the
 log4j.properties.
 Is it possible to configure log4j in spark for per app/user logging
 instead of sending all logs to 1 file mentioned in the log4j.properties?

 Thanks
 Udit






Hive context datanucleus error

2015-03-23 Thread Udit Mehta
I am trying to run a simple query to view tables in my Hive metastore using
a HiveContext.
I am getting this error:
Persistence process has been specified to use a *ClassLoaderResolver* of
name "datanucleus" yet this has not been found by the DataNucleus plugin
mechanism. Please check your CLASSPATH and plugin specification.

I am able to access the metastore using the spark-sql.
Can someone point out what the issue could be?

thanks


Spark per app logging

2015-03-20 Thread Udit Mehta
Hi,

We have Spark set up such that various users run multiple jobs
at the same time. Currently all the logs go to one file specified in
log4j.properties.
Is it possible to configure log4j in Spark for per-app/user logging instead
of sending all logs to the one file mentioned in log4j.properties?

Thanks
Udit