Re: Get filename in Spark Streaming
Thank you Emre, this helps, I am able to get the filename. But I am not sure how to fit this into the DStream RDD.

    val inputStream = ssc.textFileStream("/hdfs Path/")

inputStream is a DStream of RDDs, and in foreachRDD I am doing my processing:

    inputStream.foreachRDD(rdd => {
      // how to get filename here??
    })

Can you please help.

On Thu, Feb 5, 2015 at 11:15 PM, Emre Sevinc wrote:

> Hello,
>
> Did you check the following?
>
> http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/
> http://apache-spark-user-list.1001560.n3.nabble.com/access-hdfs-file-name-in-map-td6551.html
>
> --
> Emre Sevinç
>
> On Fri, Feb 6, 2015 at 2:16 AM, Subacini B wrote:
>
>> Hi All,
>>
>> We have filenames with a timestamp, say ABC_1421893256000.txt, and the
>> timestamp needs to be extracted from the file name for further processing.
>> Is there a way to get the input file name picked up by the Spark Streaming job?
>>
>> Thanks in advance
>>
>> Subacini
>
> --
> Emre Sevinc
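In case it helps, here is a minimal sketch of one way to surface the file name per batch. It applies the approach from the first link inside the streaming job: it swaps textFileStream for the lower-level fileStream so the per-file NewHadoopRDDs stay visible (textFileStream maps them away), and it relies on FileInputDStream building each batch as a UnionRDD, which is an internal detail that may differ across Spark versions. The path, the ".txt" suffix handling and the take(3) are only illustrative.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.{NewHadoopRDD, UnionRDD}

    // fileStream keeps the (key, value) pairs and the underlying Hadoop RDDs
    val inputStream =
      ssc.fileStream[LongWritable, Text, TextInputFormat]("/hdfs Path/")

    inputStream.foreachRDD { rdd =>
      rdd match {
        // FileInputDStream unions one NewHadoopRDD per new file into the batch RDD
        case union: UnionRDD[(LongWritable, Text)] =>
          union.rdds.foreach {
            case hadoopRdd: NewHadoopRDD[LongWritable, Text] =>
              val withName = hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
                val fileName = split.asInstanceOf[FileSplit].getPath.getName
                iter.map { case (_, line) => (fileName, line.toString) }
              }
              withName.take(3).foreach { case (name, line) =>
                // e.g. ABC_1421893256000.txt -> 1421893256000
                val timestamp = name.stripSuffix(".txt").split("_").last
                println(s"file=$name timestamp=$timestamp line=$line")
              }
            case _ => // not a file-based RDD; nothing to extract
          }
        case _ => // batch RDD shape differs in this Spark version
      }
    }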
Get filename in Spark Streaming
Hi All,

We have filenames with a timestamp, say ABC_1421893256000.txt, and the timestamp needs to be extracted from the file name for further processing. Is there a way to get the input file name picked up by the Spark Streaming job?

Thanks in advance

Subacini
Improve performance using Spark Streaming + Spark SQL
Hi All,

I have a cluster of 3 nodes [each 8 cores / 32 GB memory]. My program uses Spark Streaming with Spark SQL [Spark 1.1] and writes incoming JSON to Elasticsearch and HBase. Below is my code; I receive JSON files [input data varies from 30 MB to 300 MB] every 10 seconds.

Whether I use 3 nodes or 1 node, the processing time is pretty close; say, 15 files take 5 minutes to process end to end. I have set spark.streaming.unpersist to true, stream repartition to 3, Kryo serialization, UseCompressedOops and some more GC mechanisms and perf tuning mentioned in the Spark docs. Still there is not much improvement. Is there any other way that I can tune my code to improve performance?

Appreciate your help. Thanks in advance.

    inputStream.foreachRDD(rdd => {
      // Get SchemaRDD
      val inputJson = sqlContext.jsonRDD(rdd)

      // Store it in a Spark SQL table
      inputJson.registerTempTable("tableA")

      // Parse JSON
      val outputJson = sqlContext.sql("select colA, colB etc. from tableA")

      // Write to ES, HBase [uses foreachPartition]
      ESUtil.writeToES(outputJson, coreName)
      HBaseUtils.saveAsHBaseTable(outputJson, "hbasetable")
    })
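Two things in that loop may be worth checking, sketched below under assumptions about your data: jsonRDD without a schema makes an extra pass over each batch just to infer the schema, so supplying an explicit StructType (the field names and types below are made up) removes that pass; and outputJson is recomputed for each sink unless it is cached, so caching it before the ES and HBase writes avoids running the query twice. ESUtil, HBaseUtils and coreName are your own helpers from the snippet above.

    import org.apache.spark.sql._

    // Hypothetical schema for the incoming JSON; replace with your real fields
    val jsonSchema = StructType(Seq(
      StructField("colA", StringType, nullable = true),
      StructField("colB", StringType, nullable = true)))

    inputStream.foreachRDD { rdd =>
      if (rdd.partitions.nonEmpty) {                // skip empty batches
        // Explicit schema: no inference pass over the batch
        val inputJson = sqlContext.jsonRDD(rdd, jsonSchema)
        inputJson.registerTempTable("tableA")

        val outputJson = sqlContext.sql("select colA, colB from tableA")
        outputJson.cache()                          // computed once, written twice

        ESUtil.writeToES(outputJson, coreName)
        HBaseUtils.saveAsHBaseTable(outputJson, "hbasetable")

        outputJson.unpersist()
      }
    }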
Re: SchemaRDD to HBase
Hi,

Can someone help me? Any pointers would help.

Thanks
Subacini

On Fri, Dec 19, 2014 at 10:47 PM, Subacini B wrote:

> Hi All,
>
> Is there any API that can be used directly to write a SchemaRDD to HBase?
> If not, what is the best way to write a SchemaRDD to HBase.
>
> Thanks
> Subacini
SchemaRDD to HBase
Hi All,

Is there any API that can be used directly to write a SchemaRDD to HBase? If not, what is the best way to write a SchemaRDD to HBase.

Thanks
Subacini
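I am not aware of a built-in SchemaRDD-to-HBase writer in this Spark version; the usual route is to write the rows yourself from foreachPartition with the HBase client, or to use saveAsNewAPIHadoopDataset with TableOutputFormat. Below is a minimal foreachPartition sketch; the table name, column family, column layout and row-key choice are all assumptions to adapt.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // Assumes each Row is (rowKey: String, colA: String, colB: String)
    schemaRdd.foreachPartition { rows =>
      val conf = HBaseConfiguration.create()       // picks up hbase-site.xml on the executor
      val table = new HTable(conf, "hbasetable")   // hypothetical table name
      rows.foreach { row =>
        val put = new Put(Bytes.toBytes(row.getString(0)))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("colA"), Bytes.toBytes(row.getString(1)))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("colB"), Bytes.toBytes(row.getString(2)))
        table.put(put)
      }
      table.close()
    }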
Processing multiple requests in cluster
Hi All,

How do I run multiple requests concurrently on the same cluster? I have a program using Spark Streaming context which reads streaming data and writes it to HBase. It works fine; the problem is that when multiple requests are submitted to the cluster, only the first request is processed, as the entire cluster is used for that request. The rest of the requests are in waiting mode.

I have set spark.cores.max to 2 or less so that it can process another request, but if there is only one request the cluster is not utilized properly.

Is there any way that the Spark cluster can process streaming requests concurrently, effectively utilizing the cluster, something like the Shark server?

Thanks
Subacini
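One direction, assuming the requests could all be served from a single long-running application (the Shark-server model mentioned above) instead of separate submissions: keep one SparkContext, run each request's job on its own thread, and turn on the FAIR scheduler so concurrent jobs share the executors instead of queuing behind each other. A rough sketch, with a dummy job standing in for the real per-request work:

    import org.apache.spark.{SparkConf, SparkContext}

    // One shared, long-running context for all requests
    val conf = new SparkConf()
      .setAppName("shared-streaming-server")
      .set("spark.scheduler.mode", "FAIR")   // interleave concurrent jobs
    val sc = new SparkContext(conf)

    // Each incoming request gets its own thread and scheduler pool, so its
    // tasks are interleaved with other requests rather than queued behind them
    def handleRequest(id: Int): Unit = {
      new Thread(new Runnable {
        def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", s"pool-$id")
          val n = sc.parallelize(1 to 1000000).filter(_ % (id + 2) == 0).count()
          println(s"request $id finished with count $n")
        }
      }).start()
    }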
Re: Spark SQL - groupby
Hi,

Can someone provide me pointers for this issue?

Thanks
Subacini

On Wed, Jul 2, 2014 at 3:34 PM, Subacini B wrote:

> Hi,
>
> The code below throws a compilation error, "not found: value Sum". Can
> someone help me with this? Do I need to add any jars or imports? Even for
> Count, the same error is thrown.
>
>     val queryResult = sql("select * from Table")
>     queryResult.groupBy('colA)('colA, Sum('colB) as 'totB)
>       .aggregate(Sum('totB)).collect().foreach(println)
>
> Thanks
> subacini
Shark Vs Spark SQL
Hi,

http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3cb75376b8-7a57-4161-b604-f919886cf...@gmail.com%3E

This says the Shark backend will be replaced with the Spark SQL engine in the future. Does that mean Spark will continue to support Shark + Spark SQL for the long term, or will Shark be decommissioned after some period?

Thanks
Subacini
Spark SQL - groupby
Hi,

The code below throws a compilation error, "not found: value Sum". Can someone help me with this? Do I need to add any jars or imports? Even for Count, the same error is thrown.

    val queryResult = sql("select * from Table")
    queryResult.groupBy('colA)('colA, Sum('colB) as 'totB)
      .aggregate(Sum('totB)).collect().foreach(println)

Thanks
subacini
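For what it's worth, in the 1.0.x line Sum and Count are case classes under org.apache.spark.sql.catalyst.expressions rather than something the SQLContext import brings in, so a plain import may be all that is missing. A sketch of that fix, assuming import sqlContext._ is already in scope for the Symbol-to-attribute implicits used by 'colA:

    // Sum and Count live in catalyst.expressions; importing them lets the
    // experimental SchemaRDD DSL resolve Sum('colB) and Count(...)
    import org.apache.spark.sql.catalyst.expressions.{Count, Sum}

    val queryResult = sql("select * from Table")

    queryResult
      .groupBy('colA)('colA, Sum('colB) as 'totB)   // per-group totals
      .aggregate(Sum('totB))                        // grand total over the groups
      .collect()
      .foreach(println)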
Spark SQL : Join throws exception
Hi All,

Running this join query

    sql("SELECT * FROM A_TABLE A JOIN B_TABLE B WHERE A.status=1").collect().foreach(println)

throws

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 1.0:3 failed 4 times, most recent failure: Exception failure in TID 12 on host X.X.X.X:
    org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: UnresolvedAttribute, tree: 'A.status
        org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.eval(unresolved.scala:59)
        org.apache.spark.sql.catalyst.expressions.Equals.eval(predicates.scala:147)
        org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:100)
        org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
        org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
        scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
        scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:137)
        org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
        org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
        org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
        org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        org.apache.spark.scheduler.Task.run(Task.scala:51)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        java.lang.Thread.run(Thread.java:695)
    Driver stacktrace:

Can someone help me? Thanks in advance.
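A guess based on 'A.status being left as an UnresolvedAttribute: some early Spark SQL releases had trouble resolving alias-qualified columns in the WHERE clause of a join. If that is what you are hitting, one workaround is to filter first and join the filtered table, so the predicate never needs the alias. A sketch using the table and column names from your query (registerTempTable is registerAsTable on older releases):

    // Apply the status filter before the join so the predicate only sees
    // an unqualified column name
    val filteredA = sql("SELECT * FROM A_TABLE WHERE status = 1")
    filteredA.registerTempTable("A_FILTERED")

    sql("SELECT * FROM A_FILTERED A JOIN B_TABLE B")
      .collect()
      .foreach(println)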
Re: Spark Worker Core Allocation
Thanks Sean, let me try setting spark.deploy.spreadOut to false.

On Sun, Jun 8, 2014 at 12:44 PM, Sean Owen wrote:

> Have a look at:
>
> https://spark.apache.org/docs/1.0.0/job-scheduling.html
> https://spark.apache.org/docs/1.0.0/spark-standalone.html
>
> The default is to grab resources on all nodes. In your case you could set
> spark.cores.max to 2 or less to enable running two apps on a cluster of
> 4-core machines simultaneously.
>
> See also spark.deploy.defaultCores.
>
> But you may really be after spark.deploy.spreadOut. If you make it false
> it will instead try to take all resources from a few nodes.
>
> On Jun 8, 2014 1:55 AM, "Subacini B" wrote:
>
>> Hi All,
>>
>> My cluster has 5 workers, each having 4 cores (so 20 cores in total). It is
>> in standalone mode (not using Mesos or YARN). I want two programs to run at
>> the same time, so I have configured "spark.cores.max=3", but when I run the
>> program it allocates three cores by taking one core from each worker, making 3
>> workers run the program.
>>
>> How do I configure it so that it takes 3 cores from 1 worker, so that I can
>> use the other workers for the second program?
>>
>> Thanks in advance
>> Subacini
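To make the two knobs from Sean's reply concrete, here is a rough sketch; note that spark.cores.max is a per-application setting, while spark.deploy.spreadOut is read by the standalone master, so it belongs in the master's configuration rather than the application's.

    import org.apache.spark.{SparkConf, SparkContext}

    // Per-application: ask the standalone master for at most 3 cores in total
    val conf = new SparkConf()
      .setAppName("app-one")
      .set("spark.cores.max", "3")
    val sc = new SparkContext(conf)

    // Master-side: spark.deploy.spreadOut=false (e.g. in the master's
    // spark-defaults.conf or via SPARK_MASTER_OPTS) makes the master pack
    // those 3 cores onto as few workers as possible, instead of taking one
    // core from each of three workers.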
Re: Spark Worker Core Allocation
Hi,

I am stuck here; my cluster is not efficiently utilized. Appreciate any input on this.

Thanks
Subacini

On Sat, Jun 7, 2014 at 10:54 PM, Subacini B wrote:

> Hi All,
>
> My cluster has 5 workers, each having 4 cores (so 20 cores in total). It is
> in standalone mode (not using Mesos or YARN). I want two programs to run at
> the same time, so I have configured "spark.cores.max=3", but when I run the
> program it allocates three cores by taking one core from each worker, making 3
> workers run the program.
>
> How do I configure it so that it takes 3 cores from 1 worker, so that I can
> use the other workers for the second program?
>
> Thanks in advance
> Subacini
Spark Worker Core Allocation
Hi All,

My cluster has 5 workers, each having 4 cores (so 20 cores in total). It is in standalone mode (not using Mesos or YARN). I want two programs to run at the same time, so I have configured "spark.cores.max=3", but when I run the program it allocates three cores by taking one core from each worker, making 3 workers run the program.

How do I configure it so that it takes 3 cores from 1 worker, so that I can use the other workers for the second program?

Thanks in advance
Subacini