Read HDFS file from an executor (closure)

2016-01-12 Thread Udit Mehta
Hi, Is there a way to read a text file from inside a Spark executor? I need to do this for a streaming application where we need to read a file (whose contents would change) from a closure. I cannot use the "sc.textFile" method since the Spark context is not serializable. I also cannot read a file
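
[Editor's note] A common workaround, sketched below, is to bypass sc.textFile and open the file directly with the Hadoop FileSystem API inside the closure, so only a plain path string is captured. The RDD type and the HDFS path are illustrative, not from the thread:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // rdd is assumed to be an RDD[String]; the body runs on the executors.
    rdd.mapPartitions { iter =>
      val fs = FileSystem.get(new Configuration())
      val in = fs.open(new Path("/data/lookup.txt"))  // illustrative HDFS path
      val lookup = Source.fromInputStream(in).getLines().toSet  // materialized before close
      in.close()
      iter.filter(lookup.contains)
    }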

Kafka Direct Stream

2015-09-30 Thread Udit Mehta
Hi, I am using the Spark direct stream to consume from multiple topics in Kafka. I am able to consume fine, but I am stuck on how to separate the data for each topic, since I need to process the data differently depending on the topic. I basically want to split the RDD consisting of N topics into N RDDs
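
[Editor's note] With the direct stream, each RDD partition maps to exactly one Kafka topic-partition, so the topic can be recovered per partition via HasOffsetRanges. A sketch under that assumption (ssc, kafkaParams, and topics are presumed already defined; Spark 1.x Kafka API):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val tagged = rdd.mapPartitionsWithIndex { (i, iter) =>
        val topic = ranges(i).topic  // partition i came entirely from this topic
        iter.map { case (_, value) => (topic, value) }
      }
      // Route per topic from here, e.g. tagged.filter(_._1 == "topicA").
    }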

Provide sampling ratio while loading JSON in Spark version > 1.4.0

2015-09-23 Thread Udit Mehta
Hi, In earlier versions of Spark (< 1.4.0), we were able to specify the sampling ratio while using *sqlContext.jsonFile* or *sqlContext.jsonRDD* so that we don't inspect each and every element while inferring the schema. I see that the use of these methods is deprecated in the newer Spark version
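
[Editor's note] For reference, the DataFrameReader that replaces those methods in 1.4+ still accepts a sampling ratio as a read option; a minimal sketch (the path is illustrative):

    // Infer the schema from roughly 10% of the input instead of a full scan.
    val df = sqlContext.read
      .option("samplingRatio", "0.1")
      .json("hdfs:///data/events.json")  // illustrative path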

Spark thrift server on YARN

2015-08-25 Thread Udit Mehta
Hi, I am trying to start a Spark thrift server using the following command on Spark 1.3.1 running on YARN: * ./sbin/start-thriftserver.sh --master yarn://resourcemanager.snc1:8032 --executor-memory 512m --hiveconf hive.server2.thrift.bind.host=test-host.sn1 --hiveconf
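
[Editor's note] Two things are worth flagging here: in Spark 1.x the YARN master URL is the literal string yarn-client or yarn-cluster rather than a yarn:// address, and the thrift server needs client mode since the driver hosts the HiveServer2 endpoint. A sketch with those corrections (the port value is illustrative):

    ./sbin/start-thriftserver.sh --master yarn-client \
      --executor-memory 512m \
      --hiveconf hive.server2.thrift.bind.host=test-host.sn1 \
      --hiveconf hive.server2.thrift.port=10001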

Re: Spark thrift server on YARN

2015-08-25 Thread Udit Mehta
, 2015 at 5:32 PM, Cheng, Hao hao.ch...@intel.com wrote: Did you register temp table via the beeline or in a new Spark SQL CLI? As I know, the temp table cannot cross the HiveContext. Hao *From:* Udit Mehta [mailto:ume...@groupon.com] *Sent:* Wednesday, August 26, 2015 8:19 AM *To:* user

JSON SerDe used by Spark SQL

2015-08-18 Thread Udit Mehta
Hi, I was wondering which JSON SerDe Spark SQL uses. I created a JsonRDD out of a JSON file and then registered it as a temp table to query. I can then query the table using dot notation for nested structs/arrays. I was wondering how Spark SQL deserializes the JSON data based on the query.
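
[Editor's note] As far as I can tell, no Hive SerDe is involved here: Spark SQL parses the JSON itself (Jackson in the 1.x line) and infers a schema of nested StructTypes, which is what makes the dot notation work. A sketch of the pattern described (table, field names, and path are illustrative):

    val events = sqlContext.jsonFile("hdfs:///data/events.json")  // deprecated in 1.4+
    events.registerTempTable("events")
    // Nested fields are addressed with dot notation against the inferred schema:
    sqlContext.sql("SELECT user.name, user.address.city FROM events").show()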

Spark metrics source

2015-04-20 Thread Udit Mehta
Hi, I am running Spark 1.3 on YARN and am trying to publish some metrics from my app. I see that we need to use the Codahale library to create a source and then specify the source in metrics.properties. Does somebody have a sample metrics source which I can use in my app to forward the
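
[Editor's note] A minimal sketch of such a source (class and metric names are illustrative; note that in some 1.x releases the Source trait is private[spark], so the class may need to live under an org.apache.spark package as below):

    package org.apache.spark.metrics.source

    import com.codahale.metrics.{Counter, MetricRegistry}

    class MyAppSource extends Source {
      override val sourceName: String = "myApp"
      override val metricRegistry: MetricRegistry = new MetricRegistry()
      // Application code increments this as it processes records.
      val recordsProcessed: Counter =
        metricRegistry.counter(MetricRegistry.name("records", "processed"))
    }

Once registered with the running metrics system (e.g. via SparkEnv.get.metricsSystem.registerSource, subject to the same visibility caveat), the sinks configured in metrics.properties pick it up.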

Metrics Servlet on Spark 1.2

2015-04-17 Thread Udit Mehta
Hi, I am unable to access the metrics servlet on Spark 1.2. I tried to access it from the app master UI on port 4040 but I don't see any metrics there. Is it a known issue with Spark 1.2 or am I doing something wrong? Also, how do I publish my own metrics and view them on this servlet? Thanks,
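
[Editor's note] For reference, the servlet sink is enabled by default and serves JSON under /metrics/json on the driver UI port; the defaults correspond to roughly this metrics.properties configuration (a sketch based on the shipped template, not this cluster's config):

    *.sink.servlet.class=org.apache.spark.metrics.sink.MetricsServlet
    *.sink.servlet.path=/metrics/json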

Re: HDP 2.2 AM abort: Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
didn’t find anything wrong in your mvn command. Can you check whether the ExecutorLauncher class is in your jar file or not? BTW: For spark-1.3, you can use the binary distribution from apache. Thanks. Zhan Zhang On Apr 17, 2015, at 2:01 PM, Udit Mehta ume...@groupon.com wrote: I

Re: HDP 2.2 AM abort: Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
-Dhdp.version=2.2.0.0-2041 Is there anything wrong in what I am trying to do? thanks again! On Fri, Apr 17, 2015 at 2:56 PM, Zhan Zhang zzh...@hortonworks.com wrote: Hi Udit, By the way, do you mind to share the whole log trace? Thanks. Zhan Zhang On Apr 17, 2015, at 2:26 PM, Udit Mehta

Re: HDP 2.2 AM abort: Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
I followed the steps described above and I still get this error: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher. I am trying to build Spark 1.3 on HDP 2.2. I built Spark from source using: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive

Re: HDP 2.2 AM abort: Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
log4j.properties metrics.properties slaves.template spark-defaults.conf.template spark-env.sh.template *[root@c6402 conf]# more java-opts* * -Dhdp.version=2.2.0.0-2041* [root@c6402 conf]# Thanks. Zhan Zhang On Apr 17, 2015, at 3:09 PM, Udit Mehta ume
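
[Editor's note] The resolution visible in this thread is pinning hdp.version through conf/java-opts; an equivalent sketch using spark-defaults.conf (same version string as above; spark.yarn.am.extraJavaOptions assumes Spark 1.3+):

    spark.driver.extraJavaOptions   -Dhdp.version=2.2.0.0-2041
    spark.yarn.am.extraJavaOptions  -Dhdp.version=2.2.0.0-2041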

How to use the --files arg

2015-04-10 Thread Udit Mehta
Hi, Suppose I have a command and I pass the --files arg as below: bin/spark-submit --class com.test.HelloWorld --master yarn-cluster --num-executors 8 --driver-memory 512m --executor-memory 2048m --executor-cores 4 --queue public * --files $HOME/myfile.txt* --name test_1
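
[Editor's note] Files shipped with --files are localized into each container's working directory and can be resolved from task code with SparkFiles; a minimal sketch reusing the file name from the command above:

    import org.apache.spark.SparkFiles
    import scala.io.Source

    // Inside a task closure (e.g. within mapPartitions):
    val path = SparkFiles.get("myfile.txt")  // resolves the localized copy
    val lines = Source.fromFile(path).getLines().toList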

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
I have noticed a similar issue when using Spark Streaming. The Spark shuffle write size increases to a large size (in GB) and then the app crashes saying: java.io.FileNotFoundException:

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
spark.shuffle.spill to false if you don't want to spill to the disk and assuming you have enough heap memory. On Tue, Mar 31, 2015 at 12:35 PM, Udit Mehta ume...@groupon.com wrote: I have noticed a similar issue when using spark streaming. The spark shuffle write size increases to a large
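
[Editor's note] For completeness, that flag (present in the 1.x line; later releases removed it) is an ordinary conf setting; a sketch (the app name is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-test")           // illustrative
      .set("spark.shuffle.spill", "false")  // keep shuffle data in memory (1.x only)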

log4j.properties in jar

2015-03-30 Thread Udit Mehta
Hi, Is it possible to put the log4j.properties in the application jar such that the driver and the executors use this log4j file? Do I need to specify anything while submitting my app so that this file is used? Thanks, Udit
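
[Editor's note] The usual alternative to baking it into the jar is to ship the file at submit time and point both JVMs at it; a sketch (paths, class, and jar names are illustrative):

    spark-submit --class com.test.HelloWorld \
      --files /path/to/log4j.properties \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      myapp.jar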

Re: Hive context datanucleus error

2015-03-24 Thread Udit Mehta
Has this issue been fixed in Spark 1.2: https://issues.apache.org/jira/browse/SPARK-2624 On Mon, Mar 23, 2015 at 9:19 PM, Udit Mehta ume...@groupon.com wrote: I am trying to run a simple query to view tables in my hive metastore using hive context. I am getting this error: spark Persistence

Re: Does HiveContext connect to HiveServer2?

2015-03-24 Thread Udit Mehta
Another question related to this: how can we propagate the hive-site.xml to all workers when running in yarn-cluster mode? On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com wrote: It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to
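
[Editor's note] In yarn-cluster mode the usual answer is the same --files mechanism; a sketch (the hive-site.xml location is a common default and may differ; class and jar are illustrative):

    spark-submit --master yarn-cluster \
      --files /etc/hive/conf/hive-site.xml \
      --class com.test.HelloWorld myapp.jar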

Re: Spark per app logging

2015-03-23 Thread Udit Mehta
? Cheers On Fri, Mar 20, 2015 at 1:43 PM, Udit Mehta ume...@groupon.com wrote: Hi, We have spark setup such that there are various users running multiple jobs at the same time. Currently all the logs go to 1 file specified in the log4j.properties. Is it possible to configure log4j in spark

Hive context datanucleus error

2015-03-23 Thread Udit Mehta
I am trying to run a simple query to view tables in my Hive metastore using HiveContext. I am getting this error: spark Persistence process has been specified to use a *ClassLoader Resolve* of name datanucleus yet this has not been found by the DataNucleus plugin mechanism. Please check your
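
[Editor's note] This error usually means the DataNucleus jars never reached the classpath of the YARN containers; a common fix is to pass them explicitly at submit time. A sketch (the jar versions match those bundled in Spark 1.x lib/ and may differ; class and jar are illustrative):

    spark-submit --master yarn-cluster \
      --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,lib/datanucleus-rdbms-3.2.9.jar \
      --class com.test.HelloWorld myapp.jar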

Spark per app logging

2015-03-20 Thread Udit Mehta
Hi, We have Spark set up such that there are various users running multiple jobs at the same time. Currently all the logs go to one file specified in the log4j.properties. Is it possible to configure log4j in Spark for per-app/user logging instead of sending all logs to one file mentioned in the
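
[Editor's note] One way to get per-app files is to keep a single log4j.properties but parameterize the output path with a system property that each submission sets differently. A sketch (the property name app.logfile and the appender settings are assumptions, not from the thread):

    log4j.rootLogger=INFO, file
    log4j.appender.file=org.apache.log4j.RollingFileAppender
    # Each job supplies its own value, e.g. -Dapp.logfile=/var/log/spark/myjob.log,
    # via the extraJavaOptions flags shown in the log4j thread above.
    log4j.appender.file.File=${app.logfile}
    log4j.appender.file.MaxFileSize=50MB
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n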