Re: High virtual memory consumption on spark-submit client.

2016-05-12 Thread Harsh J
How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq You can also confirm the above by running the pmap utility on your process; most of the virtual memory should show up under 'anon'. On Fri, 13 May 2016 09:11 jone, wrote: > The virtual memory is 9G when I run org.apache.spark.example

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Harsh J
You should be able to cast the object to its real underlying type: GenericRecord (if generic, which is the default) or the actual generated class (if specific). The underlying implementation of KafkaAvroDecoder appears to use one or the other depending on a config switch: https://github.com/co
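A minimal sketch of the cast described above, assuming a DStream produced via KafkaUtils with KafkaAvroDecoder in its generic (default) configuration; the stream name and the field name are hypothetical:

    import org.apache.avro.generic.GenericRecord

    // 'stream' is assumed to be a DStream whose values were decoded by KafkaAvroDecoder.
    // With the generic (default) configuration, each value is really a GenericRecord.
    stream.map { case (_, value) =>
      val record = value.asInstanceOf[GenericRecord]
      // access fields by name; "userId" is a hypothetical field in the Avro schema
      record.get("userId").toString
    }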

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Harsh J
General note: /root is a protected local directory, meaning that if your program runs as a non-root user, it will never be able to access a file under it. On Sat, Dec 12, 2015 at 12:21 AM Zhan Zhang wrote: > As Sean mentioned, you cannot refer to the local file on your remote > machine (execu
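For illustration, a small sketch of reading a local file, assuming the file exists at the same path on every node and is readable by the user the executors run as (the path is hypothetical):

    // the "file://" scheme points Spark at the local filesystem of each node,
    // so the file must be present and readable there, not only on the driver
    val lines = sc.textFile("file:///tmp/myfile.txt")
    println(lines.count())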

Re: Classpath problem trying to use DataFrames

2015-12-11 Thread Harsh J
Do you have all your Hive jars listed in classpath.txt / the SPARK_DIST_CLASSPATH env. variable, specifically the hive-exec jar? Is the location of that jar also the same on all the distributed hosts? Passing an explicit executor classpath string may also help overcome this (replace HIVE_BASE_DIR to the ro
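As an illustration of passing an explicit executor classpath, a hedged sketch using the spark.executor.extraClassPath setting; the jar location is hypothetical and must exist at that path on every host:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("DataFramesWithHive")
      // hypothetical location; the hive-exec jar must be at this path on all executors
      .set("spark.executor.extraClassPath", "/opt/hive/lib/hive-exec.jar")
    val sc = new SparkContext(conf)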

Re: DataFrame creation delay?

2015-12-10 Thread Harsh J
The option "spark.sql.hive.metastorePartitionPruning=true" will not work unless you have a partition-column predicate in your query. Your query of "select * from temp.log" does not have one. The slowdown appears to be due to the need to load all partition metadata. Have you also tried to see
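A sketch of a query that would let the pruning option kick in, assuming (hypothetically) that temp.log is partitioned by a column named dt:

    sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
    // the predicate on the (hypothetical) partition column lets Spark ask the
    // metastore only for the matching partitions instead of loading all of them
    val df = sqlContext.sql("SELECT * FROM temp.log WHERE dt = '2015-12-01'")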

Re: Spark job submission REST API

2015-12-10 Thread Harsh J
You could take a look at Livy also: https://github.com/cloudera/livy#welcome-to-livy-the-rest-spark-server On Fri, Dec 11, 2015 at 8:17 AM Andrew Or wrote: > Hello, > > The hidden API was implemented for use internally and there are no plans > to make it public at this point. It was originally i
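For illustration, a rough Scala sketch of submitting a batch job through Livy's REST endpoint, assuming a Livy server at a hypothetical host/port and the documented /batches resource; the jar path and class name are placeholders:

    import java.net.{HttpURLConnection, URL}

    val conn = new URL("http://livy-host:8998/batches")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    // minimal batch spec: the jar to run and its main class (placeholders)
    val body = """{"file": "/path/to/app.jar", "className": "com.example.MyApp"}"""
    conn.getOutputStream.write(body.getBytes("UTF-8"))
    println(s"Response code: ${conn.getResponseCode}")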

Re: Can't filter

2015-12-10 Thread Harsh J
Are you sure you do not have any messages preceding the trace, such as one naming which class was found to be missing? That'd be helpful to see, and would suggest what may (exactly) be going wrong. It appears similar to https://issues.apache.org/jira/browse/SPARK-8368, but I cannot tell for certain cause I

Re: Unable to access hive table (created through hive context) in hive console

2015-12-10 Thread Harsh J
Are you certain you are providing Spark with the right Hive configuration? Is there a valid HIVE_CONF_DIR defined in your spark-env.sh, with a hive-site.xml detailing the location/etc. of the metastore service and/or DB? Without a valid metastore config, Hive may switch to using a local (embedded)

Re: INotifyDStream - where to find it?

2015-12-10 Thread Harsh J
I couldn't spot it anywhere on the web, so it does not appear to have been contributed yet, but note that the HDFS APIs are already available per https://issues.apache.org/jira/browse/HDFS-6634 (you can see the test case for an implementation guideline in Java: https://github.com/apache/hadoop/blob/trunk/hado
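A rough Scala sketch of the HDFS-6634 inotify API mentioned above, assuming a Hadoop version where DFSInotifyEventInputStream.take() returns an EventBatch and that the caller has HDFS superuser privileges; the NameNode URI is hypothetical:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hdfs.client.HdfsAdmin

    val admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration())
    val stream = admin.getInotifyEventStream()
    while (true) {
      val batch = stream.take()   // blocks until the NameNode reports new events
      batch.getEvents.foreach(event => println(event.getEventType))
    }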

Re: SparkStreaming variable scope

2015-12-09 Thread Harsh J
> and then calling getRowID() in the lambda, because the function gets sent to the executor, right? Yes, that is correct (vs. a one-time evaluation, as was the case with your earlier assignment). On Thu, Dec 10, 2015 at 3:34 AM Pinela wrote: > Hey Bryan, > > Thanks for the answer ;) I knew it was a basic
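To make the difference concrete, a small sketch contrasting a one-time assignment on the driver with a function call inside the lambda; getRowID and the stream are hypothetical stand-ins for the original code:

    def getRowID(): String = java.util.UUID.randomUUID().toString

    // evaluated exactly once, on the driver, when the DStream is defined
    val fixedID = getRowID()
    val withFixedID = stream.map(record => (fixedID, record))

    // evaluated per record, on the executors, because the whole function is shipped
    val withFreshID = stream.map(record => (getRowID(), record))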

Re: how to reference aggregate columns

2015-12-09 Thread Harsh J
While the DataFrame lookups can identify that anonymous column name, Spark SQL does not appear to do so. You should use an alias instead: val rows = Seq(("X", 1), ("Y", 5), ("Z", 4)) val rdd = sc.parallelize(rows) val dataFrame = rdd.toDF("user", "clicks") val sumDf = dataFrame.groupBy("user").agg(
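For completeness, a hedged sketch of how the alias can be applied and then referenced from SQL; the alias name and temp-table name are arbitrary:

    import org.apache.spark.sql.functions.sum

    val sumDf = dataFrame.groupBy("user").agg(sum("clicks").alias("totalClicks"))
    sumDf.registerTempTable("user_clicks")
    // the aggregate column can now be referenced by its alias in Spark SQL
    sqlContext.sql("SELECT user, totalClicks FROM user_clicks ORDER BY totalClicks DESC").show()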