Re: High virtual memory consumption on spark-submit client.

2016-05-12 Thread Harsh J
How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq. You can also confirm the above by running the pmap utility on your process; most of the virtual memory will appear under 'anon'. On Fri, 13 May 2016 09:11 jone, wrote: > The virtual memory is 9G When i run
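A rough way to see the same figures pmap summarizes, sketched here in Scala by reading procfs from inside the JVM itself (a Linux-only assumption; for a spark-submit driver you would instead point pmap, or this path, at the driver's PID):

```scala
import scala.io.Source

// Linux-only sketch: /proc/<pid>/status carries the totals that pmap breaks
// down per mapping. Here we inspect this JVM's own process as a stand-in.
// VmSize is the virtual size (what the original question measured at ~9G);
// VmRSS is the resident (physical) portion, usually far smaller.
val vmLines = Source.fromFile("/proc/self/status").getLines()
  .filter(l => l.startsWith("VmSize:") || l.startsWith("VmRSS:"))
  .toList

vmLines.foreach(println)
```

Running `pmap <pid>` on the same process then shows which mappings make up that VmSize; for a JVM the bulk is typically in the anonymous ("anon") segments reserved for heap, thread stacks, and code cache.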

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Harsh J
You should be able to cast the object type to the real underlying type (GenericRecord (if generic, which is so by default), or the actual type class (if specific)). The underlying implementation of KafkaAvroDecoder seems to use either one of those depending on a config switch:
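The cast pattern can be illustrated with stand-in types (the real types would be `org.apache.avro.generic.GenericRecord` for generic decoding, or your generated specific-record class; the `StandInRecord` types below are hypothetical, used only so the sketch is self-contained):

```scala
// Stand-ins for the Avro types: in the real case the decoder returns a plain
// Object that is actually a GenericRecord (generic decoding, the default)
// or an instance of your generated specific class.
trait StandInRecord { def get(field: String): Any }

case class StandInGeneric(fields: Map[String, Any]) extends StandInRecord {
  def get(field: String): Any = fields(field)
}

// What the decoder hands back is untyped; cast to the concrete type
// before accessing fields by name:
val decoded: AnyRef = StandInGeneric(Map("id" -> 42, "name" -> "alice"))
val record = decoded.asInstanceOf[StandInRecord]
val id = record.get("id")
```

With the real Avro classes the shape is the same: cast the decoded object, then call `get("fieldName")` (generic) or the generated accessor (specific).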

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Harsh J
General note: The /root is a protected local directory, meaning that if your program spawns as a non-root user, it will never be able to access the file. On Sat, Dec 12, 2015 at 12:21 AM Zhan Zhang wrote: > As Sean mentioned, you cannot referring to the local file in

Re: Classpath problem trying to use DataFrames

2015-12-11 Thread Harsh J
Do you have all your hive jars listed in the classpath.txt / SPARK_DIST_CLASSPATH env., specifically the hive-exec jar? Is the location of that jar also the same on all the distributed hosts? Passing an explicit executor classpath string may also help overcome this (replace HIVE_BASE_DIR to the
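A sketch of the explicit-classpath approach via spark-defaults.conf (the install path and jar version below are hypothetical; substitute your own, and note the jar must exist at the same path on every executor host):

```properties
# Hypothetical path/version -- replace with your Hive install's hive-exec jar.
spark.executor.extraClassPath  /opt/hive/lib/hive-exec-1.1.0.jar
spark.driver.extraClassPath    /opt/hive/lib/hive-exec-1.1.0.jar
```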

Re: Unable to access hive table (created through hive context) in hive console

2015-12-10 Thread Harsh J
Are you certain you are providing Spark with the right Hive configuration? Is there a valid HIVE_CONF_DIR defined in your spark-env.sh, with a hive-site.xml detailing the location/etc. of the metastore service and/or DB? Without a valid metastore config, Hive may switch to using a local
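A minimal hive-site.xml fragment for the scenario described, pointing Spark at a shared metastore service so it does not fall back to a local Derby DB (the host and port are placeholders):

```xml
<!-- hive-site.xml in the directory HIVE_CONF_DIR points to.
     Host/port below are placeholders for your metastore service. -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>
```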

Re: INotifyDStream - where to find it?

2015-12-10 Thread Harsh J
I couldn't spot it anywhere on the web, so it doesn't appear to have been contributed yet, but note that the HDFS APIs are already available per https://issues.apache.org/jira/browse/HDFS-6634 (you can see the test case for an implementation guideline in Java:

Re: Can't filter

2015-12-10 Thread Harsh J
Are you sure you do not have any messages preceding the trace, such as one quoting which class is found to be missing? That'd be helpful to see and suggest what may (exactly) be going wrong. It appears similar to https://issues.apache.org/jira/browse/SPARK-8368, but I cannot tell for certain cause

Re: Spark job submission REST API

2015-12-10 Thread Harsh J
You could take a look at Livy also: https://github.com/cloudera/livy#welcome-to-livy-the-rest-spark-server On Fri, Dec 11, 2015 at 8:17 AM Andrew Or wrote: > Hello, > > The hidden API was implemented for use internally and there are no plans > to make it public at this
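For reference, a Livy batch submission is a POST of a JSON body to the Livy server's /batches endpoint; the jar path, class name, and config values below are hypothetical:

```json
{
  "file": "hdfs:///user/alice/jars/my-app.jar",
  "className": "com.example.MyApp",
  "args": ["--input", "/data/in"],
  "conf": {"spark.executor.memory": "2g"}
}
```

Livy then tracks the submitted job, and its status can be polled back over the same REST interface.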

Re: how to reference aggregate columns

2015-12-09 Thread Harsh J
While the DataFrame lookups can identify that anonymous column name, Spark SQL does not appear to do so. You should use an alias instead: val rows = Seq(("X", 1), ("Y", 5), ("Z", 4)) val rdd = sc.parallelize(rows) val dataFrame = rdd.toDF("user","clicks") val sumDf =
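The alias approach might look like the following sketch, assuming a Spark 1.x spark-shell session (so `sc` and `sqlContext` are in scope); the table and column names are illustrative:

```scala
// Naming the aggregate with .as(...) gives SQL a stable column name to
// reference, instead of the auto-generated one (e.g. "sum(clicks)").
import sqlContext.implicits._
import org.apache.spark.sql.functions.sum

val rows = Seq(("X", 1), ("Y", 5), ("Z", 4))
val dataFrame = sc.parallelize(rows).toDF("user", "clicks")
val sumDf = dataFrame.groupBy("user").agg(sum("clicks").as("sumClicks"))

sumDf.registerTempTable("summed")
sqlContext.sql("SELECT user, sumClicks FROM summed WHERE sumClicks > 3")
```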

Re: SparkStreaming variable scope

2015-12-09 Thread Harsh J
> and then calling getRowID() in the lambda, because the function gets sent to the executor right? Yes, that is correct (vs. a one time evaluation, as was with your assignment earlier). On Thu, Dec 10, 2015 at 3:34 AM Pinela wrote: > Hey Bryan, > > Thank for the answer ;) I
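The difference can be shown in plain Scala: an assignment is evaluated once at definition time, while a function call inside a lambda is re-evaluated each time the lambda runs (which, in Spark, happens on the executor for every invocation):

```scala
// A mutable counter so we can observe when getRowID() actually runs.
var counter = 0
def getRowID(): Int = { counter += 1; counter }

val evaluatedOnce = getRowID()    // evaluated now, once, at definition time
val perCall = () => getRowID()    // re-runs getRowID() on every call

val first  = perCall()
val second = perCall()
```

Here `evaluatedOnce` stays fixed at its one-time value, while each call through `perCall` produces a fresh result — the same reason the lambda form behaves differently once shipped to executors.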