Re: Get filename in Spark Streaming
Thank you Emre, this helps, I am able to get the filename. But I am not sure how to fit this into the DStream RDD.

    val inputStream = ssc.textFileStream("/hdfs Path/")

inputStream is a DStream of RDDs, and in foreachRDD I am doing my processing:

    inputStream.foreachRDD(rdd => {
      // how to get filename here??
    })

Can you please help.

On Thu, Feb 5, 2015 at 11:15 PM, Emre Sevinc wrote:

> Hello,
>
> Did you check the following?
>
> http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/
> http://apache-spark-user-list.1001560.n3.nabble.com/access-hdfs-file-name-in-map-td6551.html
>
> --
> Emre Sevinç
>
> On Fri, Feb 6, 2015 at 2:16 AM, Subacini B wrote:
>
>> Hi All,
>>
>> We have filenames with a timestamp, say ABC_1421893256000.txt, and the
>> timestamp needs to be extracted from the file name for further processing.
>> Is there a way to get the input file name picked up by the Spark Streaming job?
>>
>> Thanks in advance
>>
>> Subacini
>
> --
> Emre Sevinc
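In case it helps, here is a minimal sketch of one way to surface the file name per batch. It applies the approach from the first link inside the streaming job: it swaps textFileStream for the lower-level fileStream so the per-file NewHadoopRDDs stay visible (textFileStream maps them away), and it relies on FileInputDStream building each batch as a UnionRDD, which is an internal detail that may differ across Spark versions. The path, the ".txt" suffix handling and the take(3) are only illustrative.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.{NewHadoopRDD, UnionRDD}

    // fileStream keeps the (key, value) pairs and the underlying Hadoop RDDs
    val inputStream =
      ssc.fileStream[LongWritable, Text, TextInputFormat]("/hdfs Path/")

    inputStream.foreachRDD { rdd =>
      rdd match {
        // FileInputDStream unions one NewHadoopRDD per new file into the batch RDD
        case union: UnionRDD[(LongWritable, Text)] =>
          union.rdds.foreach {
            case hadoopRdd: NewHadoopRDD[LongWritable, Text] =>
              val withName = hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
                val fileName = split.asInstanceOf[FileSplit].getPath.getName
                iter.map { case (_, line) => (fileName, line.toString) }
              }
              withName.take(3).foreach { case (name, line) =>
                // e.g. ABC_1421893256000.txt -> 1421893256000
                val timestamp = name.stripSuffix(".txt").split("_").last
                println(s"file=$name timestamp=$timestamp line=$line")
              }
            case _ => // not a file-based RDD; nothing to extract
          }
        case _ => // batch RDD shape differs in this Spark version
      }
    }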
Get filename in Spark Streaming
Hi All,

We have filenames with a timestamp, say ABC_1421893256000.txt, and the timestamp needs to be extracted from the file name for further processing. Is there a way to get the input file name picked up by the Spark Streaming job?

Thanks in advance

Subacini
Improve performance using Spark Streaming + Spark SQL
Hi All,

I have a cluster of 3 nodes [each 8 cores / 32 GB memory]. My program uses Spark Streaming with Spark SQL [Spark 1.1] and writes incoming JSON to Elasticsearch and HBase. Below is my code; I receive JSON files [input data varies from 30 MB to 300 MB] every 10 seconds.

Whether I use 3 nodes or 1 node, the processing time is pretty close; say, 15 files take 5 minutes to process end to end. I have set spark.streaming.unpersist to true, stream repartition to 3, Kryo serialization, UseCompressedOops and some more GC mechanisms and perf tuning mentioned in the Spark docs. Still there is not much improvement. Is there any other way that I can tune my code to improve performance?

Appreciate your help. Thanks in advance.

    inputStream.foreachRDD(rdd => {
      // Get SchemaRDD
      val inputJson = sqlContext.jsonRDD(rdd)

      // Store it in a Spark SQL table
      inputJson.registerTempTable("tableA")

      // Parse JSON
      val outputJson = sqlContext.sql("select colA, colB etc. from tableA")

      // Write to ES, HBase [uses foreachPartition]
      ESUtil.writeToES(outputJson, coreName)
      HBaseUtils.saveAsHBaseTable(outputJson, "hbasetable")
    })
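Two things in that loop may be worth checking, sketched below under assumptions about your data: jsonRDD without a schema makes an extra pass over each batch just to infer the schema, so supplying an explicit StructType (the field names and types below are made up) removes that pass; and outputJson is recomputed for each sink unless it is cached, so caching it before the ES and HBase writes avoids running the query twice. ESUtil, HBaseUtils and coreName are your own helpers from the snippet above.

    import org.apache.spark.sql._

    // Hypothetical schema for the incoming JSON; replace with your real fields
    val jsonSchema = StructType(Seq(
      StructField("colA", StringType, nullable = true),
      StructField("colB", StringType, nullable = true)))

    inputStream.foreachRDD { rdd =>
      if (rdd.partitions.nonEmpty) {                // skip empty batches
        // Explicit schema: no inference pass over the batch
        val inputJson = sqlContext.jsonRDD(rdd, jsonSchema)
        inputJson.registerTempTable("tableA")

        val outputJson = sqlContext.sql("select colA, colB from tableA")
        outputJson.cache()                          // computed once, written twice

        ESUtil.writeToES(outputJson, coreName)
        HBaseUtils.saveAsHBaseTable(outputJson, "hbasetable")

        outputJson.unpersist()
      }
    }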
Re: SchemaRDD to HBase
Hi,

Can someone help me? Any pointers would help.

Thanks
Subacini

On Fri, Dec 19, 2014 at 10:47 PM, Subacini B wrote:

> Hi All,
>
> Is there any API that can be used directly to write a SchemaRDD to HBase?
> If not, what is the best way to write a SchemaRDD to HBase.
>
> Thanks
> Subacini
SchemaRDD to HBase
Hi All,

Is there any API that can be used directly to write a SchemaRDD to HBase? If not, what is the best way to write a SchemaRDD to HBase.

Thanks
Subacini
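I am not aware of a built-in SchemaRDD-to-HBase writer in this Spark version; the usual route is to write the rows yourself from foreachPartition with the HBase client, or to use saveAsNewAPIHadoopDataset with TableOutputFormat. Below is a minimal foreachPartition sketch; the table name, column family, column layout and row-key choice are all assumptions to adapt.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // Assumes each Row is (rowKey: String, colA: String, colB: String)
    schemaRdd.foreachPartition { rows =>
      val conf = HBaseConfiguration.create()       // picks up hbase-site.xml on the executor
      val table = new HTable(conf, "hbasetable")   // hypothetical table name
      rows.foreach { row =>
        val put = new Put(Bytes.toBytes(row.getString(0)))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("colA"), Bytes.toBytes(row.getString(1)))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("colB"), Bytes.toBytes(row.getString(2)))
        table.put(put)
      }
      table.close()
    }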
Processing multiple requests in cluster
Hi All,

How do I run multiple requests concurrently on the same cluster? I have a program using Spark Streaming context which reads streaming data and writes it to HBase. It works fine; the problem is that when multiple requests are submitted to the cluster, only the first request is processed, as the entire cluster is used for that request. The rest of the requests are in waiting mode.

I have set spark.cores.max to 2 or less so that it can process another request, but if there is only one request the cluster is not utilized properly.

Is there any way that the Spark cluster can process streaming requests concurrently, effectively utilizing the cluster, something like the Shark server?

Thanks
Subacini
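One direction, assuming the requests could all be served from a single long-running application (the Shark-server model mentioned above) instead of separate submissions: keep one SparkContext, run each request's job on its own thread, and turn on the FAIR scheduler so concurrent jobs share the executors instead of queuing behind each other. A rough sketch, with a dummy job standing in for the real per-request work:

    import org.apache.spark.{SparkConf, SparkContext}

    // One shared, long-running context for all requests
    val conf = new SparkConf()
      .setAppName("shared-streaming-server")
      .set("spark.scheduler.mode", "FAIR")   // interleave concurrent jobs
    val sc = new SparkContext(conf)

    // Each incoming request gets its own thread and scheduler pool, so its
    // tasks are interleaved with other requests rather than queued behind them
    def handleRequest(id: Int): Unit = {
      new Thread(new Runnable {
        def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", s"pool-$id")
          val n = sc.parallelize(1 to 1000000).filter(_ % (id + 2) == 0).count()
          println(s"request $id finished with count $n")
        }
      }).start()
    }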
Re: Spark SQL - groupby
Hi,

Can someone provide me pointers for this issue?

Thanks
Subacini

On Wed, Jul 2, 2014 at 3:34 PM, Subacini B wrote:

> Hi,
>
> The code below throws a compilation error, "not found: value Sum". Can
> someone help me with this? Do I need to add any jars or imports? Even for
> Count, the same error is thrown.
>
>     val queryResult = sql("select * from Table")
>     queryResult.groupBy('colA)('colA, Sum('colB) as 'totB)
>       .aggregate(Sum('totB)).collect().foreach(println)
>
> Thanks
> subacini
Shark Vs Spark SQL
Hi,

http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3cb75376b8-7a57-4161-b604-f919886cf...@gmail.com%3E

This says the Shark backend will be replaced with the Spark SQL engine in the future. Does that mean Spark will continue to support Shark + Spark SQL for the long term, or will Shark be decommissioned after some period?

Thanks
Subacini
Spark SQL - groupby
Hi,

The code below throws a compilation error, "not found: value Sum". Can someone help me with this? Do I need to add any jars or imports? Even for Count, the same error is thrown.

    val queryResult = sql("select * from Table")
    queryResult.groupBy('colA)('colA, Sum('colB) as 'totB)
      .aggregate(Sum('totB)).collect().foreach(println)

Thanks
subacini
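For what it's worth, in the 1.0.x line Sum and Count are case classes under org.apache.spark.sql.catalyst.expressions rather than something the SQLContext import brings in, so a plain import may be all that is missing. A sketch of that fix, assuming import sqlContext._ is already in scope for the Symbol-to-attribute implicits used by 'colA:

    // Sum and Count live in catalyst.expressions; importing them lets the
    // experimental SchemaRDD DSL resolve Sum('colB) and Count(...)
    import org.apache.spark.sql.catalyst.expressions.{Count, Sum}

    val queryResult = sql("select * from Table")

    queryResult
      .groupBy('colA)('colA, Sum('colB) as 'totB)   // per-group totals
      .aggregate(Sum('totB))                        // grand total over the groups
      .collect()
      .foreach(println)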
Spark SQL : Join throws exception
Hi All,

Running this join query

    sql("SELECT * FROM A_TABLE A JOIN B_TABLE B WHERE A.status=1").collect().foreach(println)

throws

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 1.0:3 failed 4 times, most recent failure: Exception failure in TID 12 on host X.X.X.X:
    org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: UnresolvedAttribute, tree: 'A.status
        org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.eval(unresolved.scala:59)
        org.apache.spark.sql.catalyst.expressions.Equals.eval(predicates.scala:147)
        org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:100)
        org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
        org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
        scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
        scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:137)
        org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
        org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
        org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
        org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        org.apache.spark.scheduler.Task.run(Task.scala:51)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        java.lang.Thread.run(Thread.java:695)
    Driver stacktrace:

Can someone help me? Thanks in advance.
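A guess based on 'A.status being left as an UnresolvedAttribute: some early Spark SQL releases had trouble resolving alias-qualified columns in the WHERE clause of a join. If that is what you are hitting, one workaround is to filter first and join the filtered table, so the predicate never needs the alias. A sketch using the table and column names from your query (registerTempTable is registerAsTable on older releases):

    // Apply the status filter before the join so the predicate only sees
    // an unqualified column name
    val filteredA = sql("SELECT * FROM A_TABLE WHERE status = 1")
    filteredA.registerTempTable("A_FILTERED")

    sql("SELECT * FROM A_FILTERED A JOIN B_TABLE B")
      .collect()
      .foreach(println)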
Re: Spark Worker Core Allocation
Thanks Sean, let me try setting spark.deploy.spreadOut to false.

On Sun, Jun 8, 2014 at 12:44 PM, Sean Owen wrote:

> Have a look at:
>
> https://spark.apache.org/docs/1.0.0/job-scheduling.html
> https://spark.apache.org/docs/1.0.0/spark-standalone.html
>
> The default is to grab resources on all nodes. In your case you could set
> spark.cores.max to 2 or less to enable running two apps on a cluster of
> 4-core machines simultaneously.
>
> See also spark.deploy.defaultCores.
>
> But you may really be after spark.deploy.spreadOut. If you make it false
> it will instead try to take all resources from a few nodes.
>
> On Jun 8, 2014 1:55 AM, "Subacini B" wrote:
>
>> Hi All,
>>
>> My cluster has 5 workers, each having 4 cores (so 20 cores in total). It is
>> in standalone mode (not using Mesos or YARN). I want two programs to run at
>> the same time, so I have configured "spark.cores.max=3", but when I run the
>> program it allocates three cores by taking one core from each worker, making 3
>> workers run the program.
>>
>> How do I configure it so that it takes 3 cores from 1 worker, so that I can
>> use the other workers for the second program?
>>
>> Thanks in advance
>> Subacini
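To make the two knobs from Sean's reply concrete, here is a rough sketch; note that spark.cores.max is a per-application setting, while spark.deploy.spreadOut is read by the standalone master, so it belongs in the master's configuration rather than the application's.

    import org.apache.spark.{SparkConf, SparkContext}

    // Per-application: ask the standalone master for at most 3 cores in total
    val conf = new SparkConf()
      .setAppName("app-one")
      .set("spark.cores.max", "3")
    val sc = new SparkContext(conf)

    // Master-side: spark.deploy.spreadOut=false (e.g. in the master's
    // spark-defaults.conf or via SPARK_MASTER_OPTS) makes the master pack
    // those 3 cores onto as few workers as possible, instead of taking one
    // core from each of three workers.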
Re: Spark Worker Core Allocation
Hi,

I am stuck here; my cluster is not efficiently utilized. Appreciate any input on this.

Thanks
Subacini

On Sat, Jun 7, 2014 at 10:54 PM, Subacini B wrote:

> Hi All,
>
> My cluster has 5 workers, each having 4 cores (so 20 cores in total). It is
> in standalone mode (not using Mesos or YARN). I want two programs to run at
> the same time, so I have configured "spark.cores.max=3", but when I run the
> program it allocates three cores by taking one core from each worker, making 3
> workers run the program.
>
> How do I configure it so that it takes 3 cores from 1 worker, so that I can
> use the other workers for the second program?
>
> Thanks in advance
> Subacini
Spark Worker Core Allocation
Hi All,

My cluster has 5 workers, each having 4 cores (so 20 cores in total). It is in standalone mode (not using Mesos or YARN). I want two programs to run at the same time, so I have configured "spark.cores.max=3", but when I run the program it allocates three cores by taking one core from each worker, making 3 workers run the program.

How do I configure it so that it takes 3 cores from 1 worker, so that I can use the other workers for the second program?

Thanks in advance
Subacini