How to load partial data from HDFS using Spark SQL

2016-01-01 Thread SRK
Hi, how can I load partial data from HDFS using Spark SQL? Suppose I want to load data based on a filter like "Select * from table where id = " using Spark SQL with DataFrames; how can that be done? The idea here is that I do not want to load the whole data into memory when I use the SQL, and I

Re: How to load partial data from HDFS using Spark SQL

2016-01-01 Thread UMESH CHAUDHARY
Ok, so what's wrong with using: var df = HiveContext.sql("Select * from table where id = ") // filtered data frame; df.count On Sat, Jan 2, 2016 at 11:56 AM, SRK wrote: > Hi, > > How to load partial data from hdfs using Spark SQL? Suppose I want to load > data based on a
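[A minimal sketch of the suggestion above, assuming an existing SparkContext `sc`, a Hive table named "table", and a placeholder id value. The filter is part of the (lazy) query plan, so the full table does not have to be materialized in memory:]

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  // Filter expressed in the SQL itself (123 is a placeholder id).
  val df = hiveContext.sql("SELECT * FROM table WHERE id = 123")
  // Equivalent DataFrame form:
  val df2 = hiveContext.table("table").filter("id = 123")
  println(df.count())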

Re: [SparkSQL][Parquet] Read from nested parquet data

2016-01-01 Thread lin
Hi Cheng, Thank you for your informative explanation; it is quite helpful. We'd like to try both approaches; should we have some progress, we would update this thread so that anybody interested can follow. Thanks again @yanboliang, @chenglian!

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread Umesh Kacha
Hi, thanks, I did that and I have attached the thread dump images. That was the intention of my question: asking for help to identify which waiting thread is the culprit. Regards, Umesh On Sat, Jan 2, 2016 at 8:38 AM, Prabhu Joseph wrote: > Take thread dump of Executor process

Re: Cannot get repartitioning to work

2016-01-01 Thread Jeff Zhang
You are using the wrong RDD; use the returned RDD, as follows: val repartitionedRDD = results.repartition(20) println(repartitionedRDD.partitions.size) On Sat, Jan 2, 2016 at 10:38 AM, jimitkr wrote: > Hi, > > I'm trying to test some custom parallelism and
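[A minimal runnable sketch of the point above, assuming an existing SparkContext `sc`: repartition() returns a new RDD and leaves the original untouched, so the partition count must be read from the returned RDD:]

  val results = sc.parallelize(1 to 1000, 10)     // RDD with 10 partitions
  println(results.partitions.size)                // 10

  val repartitionedRDD = results.repartition(20)  // returns a NEW RDD
  println(results.partitions.size)                // still 10 -- original unchanged
  println(repartitionedRDD.partitions.size)       // 20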

Unable to read JSON input in Spark (YARN Cluster)

2016-01-01 Thread ๏̯͡๏
Version: Spark 1.5.2 *Spark built with Hive* git clone git://github.com/apache/spark.git ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver *Input:* -sh-4.1$ hadoop fs -du -h /user/dvasthimal/poc_success_spark/data/input 2.5 G
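[For reference, a minimal sketch (Spark 1.5 API, assuming an existing SparkContext `sc`; the HDFS path is the one quoted in the message) of reading that JSON input into a DataFrame:]

  import org.apache.spark.sql.hive.HiveContext

  val sqlContext = new HiveContext(sc)
  val df = sqlContext.read.json("/user/dvasthimal/poc_success_spark/data/input")
  df.printSchema()
  println(df.count())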

Re: How to save only values via saveAsHadoopFile or saveAsNewAPIHadoopFile

2016-01-01 Thread jimitkr
Doesn't this work? pair.values.saveAsHadoopFile()
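[A minimal sketch of that suggestion, with illustrative names and output paths: values() drops the keys, and pairing each value with NullWritable keeps the Hadoop file API while writing only the value:]

  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapred.TextOutputFormat

  val pair = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))

  // Simplest route: save only the values as plain text.
  pair.values.saveAsTextFile("/tmp/values-only-text")

  // Or keep saveAsHadoopFile, but write NullWritable as the key so only the
  // value appears in the output.
  pair.map { case (_, v) => (NullWritable.get(), new Text(v)) }
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]]("/tmp/values-only")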

How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread unk1102
Hi, I have a Spark job which hangs for around 7 hours or more until the job is killed by Autosys because of a timeout. The data is not huge; I am sure it gets stuck because of GC, but I can't find the source code which causes the GC. I am reusing almost all variables, trying to minimize creating local objects

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread Prabhu Joseph
Take a thread dump of the Executor process several times in a short time period and check what each thread is doing at different times, which will help to identify the expensive sections in user code. Thanks, Prabhu Joseph On Sat, Jan 2, 2016 at 3:28 AM, unk1102 wrote: >

Re: SparkSQL integration issue with AWS S3a

2016-01-01 Thread Jerry Lam
Hi Kostiantyn, you should be able to use spark.conf to specify the s3a keys. I don't remember exactly, but you can add Hadoop properties by prefixing them with spark.hadoop.*, where * is the s3a property. For instance, spark.hadoop.s3a.access.key wudjgdueyhsj. Of course, you need to make sure the property key is
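[A minimal sketch of that approach, assuming the fully qualified Hadoop property names fs.s3a.access.key and fs.s3a.secret.key; the credential values and the bucket path are placeholders. Spark copies spark.hadoop.* entries into the Hadoop configuration used by s3a:]

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val conf = new SparkConf()
    .setAppName("s3a-example")
    .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
  val sc = new SparkContext(conf)

  val sqlContext = new SQLContext(sc)
  val df = sqlContext.read.parquet("s3a://your-bucket/path/")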

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread unk1102
Sorry, please see the attached waiting thread log.

frequent itemsets

2016-01-01 Thread Roberto Pagliari
When using the frequent itemsets APIs, I'm running into a StackOverflowError whenever there are too many combinations to deal with and/or too many transactions and/or too many items. Does anyone know how many transactions/items these APIs can deal with? Thank you,
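[For context, a minimal sketch of the MLlib frequent-itemsets API in question (FPGrowth, Spark 1.5 era; the input path and parameter values are illustrative). As the message notes, a very low minSupport, long transactions, or many distinct items blow up the number of candidate itemsets:]

  import org.apache.spark.mllib.fpm.FPGrowth

  // Each line of the input file is one transaction: space-separated items.
  val transactions = sc.textFile("/tmp/transactions.txt").map(_.trim.split(' '))

  val model = new FPGrowth()
    .setMinSupport(0.2)
    .setNumPartitions(10)
    .run(transactions)

  model.freqItemsets.take(10).foreach { fi =>
    println(fi.items.mkString("[", ",", "]") + " : " + fi.freq)
  }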

Cannot get repartitioning to work

2016-01-01 Thread jimitkr
Hi, I'm trying to test some custom parallelism and repartitioning in Spark. First, I reduce my RDD (forcing creation of 10 partitions for the same). I then repartition the data to 20 partitions and print out the number of partitions, but I always get 10. It looks like the repartition command is

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2016-01-01 Thread Yanbo Liang
Hi Jia, I think the examples you provided are not very suitable to illustrate what the driver and executors do, because they do not show the internal implementation of the KMeans algorithm. You can refer to the source code of MLlib KMeans (

Re: does HashingTF maintain a inverse index?

2016-01-01 Thread Yanbo Liang
Hi Andy, Spark ML/MLlib currently does not provide a transformer to map HashingTF-generated features back to words. 2016-01-01 8:37 GMT+08:00 Hayri Volkan Agun : > Hi, > > If you are using the pipeline API, you do not need to map features back to > documents. > Your input

Re: NotSerializableException exception while using TypeTag in Scala 2.10

2016-01-01 Thread Yanbo Liang
I also hit this bug; have you resolved this issue? Or could you give some suggestions? 2014-07-28 18:33 GMT+08:00 Aniket Bhatnagar : > I am trying to serialize objects contained in RDDs using runtime > reflection via TypeTag. However, the Spark job keeps > failing

Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer to the following code snippet to set numFeatures for HashingTF: val hashingTF = new HashingTF() .setInputCol("words") .setOutputCol("features") .setNumFeatures(n) 2015-10-16 0:17 GMT+08:00 Nick Pentreath : > Setting the numfeatures higher
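[A slightly fuller sketch showing where that snippet fits (spark.ml 1.5 API; the column names, sample data, and the existing `sqlContext` are illustrative assumptions):]

  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  val sentenceData = sqlContext.createDataFrame(Seq(
    (0, "spark is fast"),
    (1, "hashing tf maps words to feature indices")
  )).toDF("label", "text")

  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
    .setNumFeatures(1 << 18)   // n: the number of hash buckets

  val featurized = hashingTF.transform(tokenizer.transform(sentenceData))
  featurized.select("features").show()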

Re: ERROR server.TThreadPoolServer: Error occurred during processing of message

2016-01-01 Thread Dasun Hegoda
? On Tue, Dec 29, 2015 at 12:08 AM, Dasun Hegoda wrote: > Anyone? > > On Sun, Dec 27, 2015 at 11:30 AM, Dasun Hegoda > wrote: > >> I was able to figure out where the problem is exactly. It's spark. >> because when I start the hiveserver2

sqlContext Client cannot authenticate via:[TOKEN, KERBEROS]

2016-01-01 Thread philippe L
Hi everyone, I'm facing a weird situation with the HiveContext and Kerberos in yarn-client mode. Current configuration: HDP 2.2 (Hive 0.14, HDFS 2.6, YARN 2.6), Spark 1.5.2, with an HA NameNode and Kerberos enabled. Situation: in the same Spark context, I receive "random"

how to extend java transformer from Scala UnaryTransformer ?

2016-01-01 Thread Andy Davidson
I am trying to write a trivial transformer to use in my pipeline. I am using Java and Spark 1.5.2. It was suggested that I use the Tokenize.scala class as an example. This should be very easy; however, I do not understand Scala and am having trouble debugging the following exception. Any help
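[Not a fix for the exception itself, but a minimal Scala sketch of a UnaryTransformer subclass (Spark 1.5 DeveloperApi; the class name and the lower-casing logic are placeholders) showing which members a subclass has to provide:]

  import org.apache.spark.ml.UnaryTransformer
  import org.apache.spark.ml.util.Identifiable
  import org.apache.spark.sql.types.{DataType, StringType}

  // Hypothetical transformer that lower-cases a String column.
  class LowerCaser(override val uid: String)
    extends UnaryTransformer[String, String, LowerCaser] {

    def this() = this(Identifiable.randomUID("lowerCaser"))

    // The per-row transformation applied to the input column.
    override protected def createTransformFunc: String => String = _.toLowerCase

    // The Spark SQL type of the output column.
    override protected def outputDataType: DataType = StringType
  }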

Deploying on TOMCAT

2016-01-01 Thread rahulganesh
I am having trouble deploying Spark on a Tomcat server. I have created a Spark Java program and a servlet to access it in the web application. But whenever I run it, I am not able to get the output; it says java.lang.OutOfMemoryError or some other errors. Is it possible to deploy Spark on