WrappedArray to row of relational Db

2017-04-26 Thread vaibhavrtk
I have a nested structure which I read from an XML file using spark-xml. I want
to use Spark SQL to convert this nested structure into separate relational
tables. A sample row looks like this:

(WrappedArray([WrappedArray([[null,592006340,null],null,BA,M,1724]),N,2017-04-05T16:31:03,586257528),659925562)

which has a schema:
StructType(
  StructField(AirSegment, ArrayType(StructType(
    StructField(CodeshareDetails, ArrayType(StructType(
      StructField(Links, StructType(
        StructField(_VALUE, StringType, true),
        StructField(_mktSegmentID, LongType, true),
        StructField(_oprSegmentID, LongType, true)), true),
      StructField(_alphaSuffix, StringType, true),
      StructField(_carrierCode, StringType, true),
      StructField(_codeshareType, StringType, true),
      StructField(_flightNumber, StringType, true)), true), true),
    StructField(_adsIsDeleted, StringType, true),
    StructField(_adsLastUpdateTimestamp, StringType, true),
    StructField(_AirID, LongType, true)), true), true),
  StructField(flightId, LongType, true))


Question: As you can see, CodeshareDetails is a WrappedArray nested inside
another WrappedArray. How can I extract the rows of these nested arrays along
with the _AirID column, so that I can insert them into the CodeshareDetails
table in SQLite (which has only the codeshare-related columns, plus _AirID as
a foreign key used for joining back)?

PS: I tried exploding, but when there are multiple rows in the AirSegment
array it doesn't work properly.

My table structure is as follows:

Flight: contains flightId and other flight details
AirSegment: contains _AirID (PK), flightID (FK), and the AirSegment details
CodeshareDetails: contains the codeshare details as well as _AirID (FK)
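
For reference, here is a minimal sketch of the two-step explode I have been
attempting (assuming spark is the SparkSession and df is the DataFrame read by
spark-xml; column names follow the schema above):

import org.apache.spark.sql.functions.explode
import spark.implicits._   // spark is the SparkSession

// Step 1: one row per AirSegment, keeping the parent flightId.
val airSegments = df
  .select($"flightId", explode($"AirSegment").as("seg"))
  .select(
    $"flightId",
    $"seg._AirID".as("_AirID"),
    $"seg._adsIsDeleted",
    $"seg._adsLastUpdateTimestamp",
    $"seg.CodeshareDetails".as("CodeshareDetails"))

// Step 2: one row per codeshare entry, carrying _AirID along as the foreign key.
val codeshareRows = airSegments
  .select($"_AirID", explode($"CodeshareDetails").as("cs"))
  .select(
    $"_AirID",
    $"cs._alphaSuffix",
    $"cs._carrierCode",
    $"cs._codeshareType",
    $"cs._flightNumber")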

Let me know if you need any more information






Running multiple Spark Jobs on Yarn( Client mode)

2016-07-20 Thread vaibhavrtk
I have a silly question:

Do multiple Spark jobs running on YARN have any impact on each other?
For example, if the traffic for one streaming job increases too much, does it
affect the second job? Will it slow it down, or have other consequences?

I have enough resources (memory, cores) for both jobs in the same cluster.
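
For context, each job is submitted as its own YARN application with a fixed
resource allocation, roughly like this (the values below are placeholders, not
our real settings):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each streaming job is a separate YARN application with its own executors.
val conf = new SparkConf()
  .setAppName("streaming-job-1")
  .set("spark.executor.instances", "4")            // fixed executor count for this app
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
  .set("spark.dynamicAllocation.enabled", "false")

val ssc = new StreamingContext(conf, Seconds(10))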

Thanks
Vaibhav






JoinWithCassandraTable over individual queries

2016-04-25 Thread vaibhavrtk
Hi,
I have an RDD whose elements are tuples ((key1, key2), value), where
(key1, key2) is the partition key of my Cassandra table.
For each such element I have to do a read from the Cassandra table. My
Cassandra cluster and Spark cluster are on different nodes and cannot be
co-located.
Right now I am issuing individual queries with session.execute("..."). Should
I prefer joinWithCassandraTable over individual queries? Would I get a
performance benefit?

As I understand it, joinWithCassandraTable ultimately performs a query for
each partition key (or primary key, I'm not sure).
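
To be concrete, this is roughly the join-based variant I am considering,
assuming the table is my_ks.my_table with partition key columns key1 and key2
(all of these names are placeholders):

import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// rdd has elements ((key1, key2), value) as described above.
def readViaJoin(rdd: RDD[((String, String), Long)]) = {
  rdd
    .map { case ((k1, k2), _) => (k1, k2) }        // keep only the partition key
    .joinWithCassandraTable("my_ks", "my_table")   // reads are issued from the executors
    .on(SomeColumns("key1", "key2"))
  // result: an RDD of ((key1, key2), CassandraRow) pairs, one per matching row
}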

Regards 
Vaibhav






Relation between number of partitions and cores.

2016-04-01 Thread vaibhavrtk
The Spark programming guide says we should have 2-4 partitions for each CPU
core in the cluster. In that case, how does one CPU core process 2-4
partitions at the same time?

Does it context-switch between tasks, or run them in parallel? If it
context-switches, how is that more efficient than a 1:1 ratio of partitions to
cores?

PS: We are using the Kafka direct API, in which Kafka partitions = RDD
partitions. Does that mean we should create 40 Kafka partitions for 10 CPU
cores?
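
Spelled out with placeholder numbers, I am asking whether this is the intended
reading of the guidance:

// 10 CPU cores in the cluster, and the guide's "2-4 partitions per core".
val coresInCluster    = 10
val partitionsPerCore = 4                                    // upper end of 2-4
val numPartitions     = coresInCluster * partitionsPerCore   // 40

// For a plain RDD this would mean e.g. rdd.repartition(numPartitions);
// with the Kafka direct API it would mean creating the topic with 40 partitions.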







Is one batch created by Streaming Context always equal to one RDD?

2015-10-19 Thread vaibhavrtk







log4j Spark-worker performance problem

2015-09-28 Thread vaibhavrtk
Hello

We need a lot of logging for our application: about 1000 lines have to be
logged per message we process, and we process 1000 messages/sec, so roughly
1000 * 1000 = 1,000,000 lines have to be logged per second, all written to a
file. Will writing this much logging significantly impact Spark's processing
power? If yes, what could be the alternative?

Note: this much logging is required for proper monitoring of the application.
Let me know if more information is needed.

Thanks
Vaibhav



