Re: Num of executors and cores

2016-07-26 Thread Mail.com
Hi,

In spark submit, I specify --master yarn-client.
When I go to the Executors tab in the UI, I do see all 12 executors assigned. 
But when I drill down to the tasks for the stage, I saw only 8 tasks, with indices 0-7.

I ran the job again with the number of executors increased to 15, and I now see 12 tasks 
for the stage.

I would still like to understand why there were only 8 tasks for the stage even though 
12 executors were available. 

Thanks,
Pradeep
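
For reference, a minimal sketch (Spark 1.x Java API; the path and partition counts are placeholders) that checks how many partitions wholeTextFiles actually produced and forces one task per file by repartitioning. With wholeTextFiles the numPartitions argument is only a suggested minimum, so several small files can end up packed into one partition, which is why fewer tasks show up in the UI:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WholeTextFilesPartitions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wholeTextFiles-partitions");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // numPartitions is only a hint here; fewer partitions may be created for small files.
        JavaPairRDD<String, String> files = sc.wholeTextFiles("hdfs:///path/to/dir", 12);
        System.out.println("partitions after read: " + files.partitions().size());

        // Force one partition (and hence one task) per file before the heavy parsing step.
        JavaPairRDD<String, String> spread = files.repartition(12);
        System.out.println("partitions after repartition: " + spread.partitions().size());

        sc.stop();
    }
}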



> On Jul 26, 2016, at 8:46 AM, Jacek Laskowski <ja...@japila.pl> wrote:
> 
> Hi,
> 
> Where's this yarn-client mode specified? When you said "However, when
> I run the job I see that the stage which reads the directory has only
> 8 tasks." -- how do you see 8 tasks for a stage? It appears you're in
> local[*] mode on a 8-core machine (like me) and that's why I'm asking
> such basic questions.
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
>> On Tue, Jul 26, 2016 at 2:39 PM, Mail.com <pradeep.mi...@mail.com> wrote:
>> More of jars and files and app name. It runs on yarn-client mode.
>> 
>> Thanks,
>> Pradeep
>> 
>>> On Jul 26, 2016, at 7:10 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>>> 
>>> Hi,
>>> 
>>> What's ""? What master URL do you use?
>>> 
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>> 
>>> 
>>>> On Tue, Jul 26, 2016 at 2:18 AM, Mail.com <pradeep.mi...@mail.com> wrote:
>>>> Hi All,
>>>> 
>>>> I have a directory which has 12 files. I want to read the entire file so I 
>>>> am reading it as wholeTextFiles(dirpath, numPartitions).
>>>> 
>>>> I run spark-submit as  --num-executors 12 
>>>> --executor-cores 1 and numPartitions 12.
>>>> 
>>>> However, when I run the job I see that the stage which reads the directory 
>>>> has only 8 tasks. So some task reads more than one file and takes twice 
>>>> the time.
>>>> 
>>>> What can I do that the files are read by 12 tasks  I.e one file per task.
>>>> 
>>>> Thanks,
>>>> Pradeep
>>>> 
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Num of executors and cores

2016-07-26 Thread Mail.com
The rest is mostly jars, files, and the app name. It runs in yarn-client mode.

Thanks,
Pradeep

> On Jul 26, 2016, at 7:10 AM, Jacek Laskowski <ja...@japila.pl> wrote:
> 
> Hi,
> 
> What's ""? What master URL do you use?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
>> On Tue, Jul 26, 2016 at 2:18 AM, Mail.com <pradeep.mi...@mail.com> wrote:
>> Hi All,
>> 
>> I have a directory which has 12 files. I want to read the entire file so I 
>> am reading it as wholeTextFiles(dirpath, numPartitions).
>> 
>> I run spark-submit as  --num-executors 12 --executor-cores 
>> 1 and numPartitions 12.
>> 
>> However, when I run the job I see that the stage which reads the directory 
>> has only 8 tasks. So some task reads more than one file and takes twice the 
>> time.
>> 
>> What can I do that the files are read by 12 tasks  I.e one file per task.
>> 
>> Thanks,
>> Pradeep
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark context stop vs close

2016-07-25 Thread Mail.com
Okay. Yes, it is JavaSparkContext. Thanks.

> On Jul 24, 2016, at 1:31 PM, Sean Owen <so...@cloudera.com> wrote:
> 
> I think this is about JavaSparkContext which implements the standard
> Closeable interface for convenience. Both do exactly the same thing.
> 
>> On Sun, Jul 24, 2016 at 6:27 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>> Hi,
>> 
>> I can only find stop. Where did you find close?
>> 
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>> 
>> 
>>> On Sat, Jul 23, 2016 at 3:11 PM, Mail.com <pradeep.mi...@mail.com> wrote:
>>> Hi All,
>>> 
>>> Where should we us spark context stop vs close. Should we stop the context 
>>> first and then close.
>>> 
>>> Are general guidelines around this. When I stop and later try to close I 
>>> get RPC already closed error.
>>> 
>>> Thanks,
>>> Pradeep
>>> 
>>> 
>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Num of executors and cores

2016-07-25 Thread Mail.com
Hi All,

I have a directory which has 12 files. I want to read each file in its entirety, so I am 
reading them with wholeTextFiles(dirpath, numPartitions).

I run spark-submit with --num-executors 12 --executor-cores 1, 
and numPartitions set to 12.

However, when I run the job I see that the stage which reads the directory has 
only 8 tasks, so some tasks read more than one file and take twice the time.

What can I do so that the files are read by 12 tasks, i.e., one file per task?

Thanks,
Pradeep

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



spark context stop vs close

2016-07-23 Thread Mail.com
Hi All,

When should we use SparkContext stop() vs. close()? Should we stop the context 
first and then close it?

Are there general guidelines around this? When I stop and later try to close, I get an 
"RPC already closed" error.

Thanks,
Pradeep
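
As noted earlier in the thread, JavaSparkContext implements the standard java.io.Closeable interface and close() does the same thing as stop(), so calling only one of them avoids the "RPC already closed" error. A minimal sketch (Java API, trivial job just to make it runnable) using try-with-resources:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StopVsClose {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("stop-vs-close");
        // close() is invoked automatically at the end of the block; do not also call stop().
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long count = sc.parallelize(Arrays.asList(1, 2, 3)).count();
            System.out.println("count = " + count);
        }
    }
}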




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Mail.com
The hbase-spark module will be available with HBase 2.0. Is that out yet?

> On Jul 22, 2016, at 8:50 PM, Def_Os  wrote:
> 
> So it appears it should be possible to use HBase's new hbase-spark module, if
> you follow this pattern:
> https://hbase.apache.org/book.html#_sparksql_dataframes
> 
> Unfortunately, when I run my example from PySpark, I get the following
> exception:
> 
> 
>> py4j.protocol.Py4JJavaError: An error occurred while calling o120.save.
>> : java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource
>> does not allow create table as select.
>>at scala.sys.package$.error(package.scala:27)
>>at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:259)
>>at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>at java.lang.reflect.Method.invoke(Method.java:606)
>>at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>>at py4j.Gateway.invoke(Gateway.java:259)
>>at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>at py4j.GatewayConnection.run(GatewayConnection.java:209)
>>at java.lang.Thread.run(Thread.java:745)
> 
> Even when I created the table in HBase first, it still failed.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-connect-HBase-and-Spark-using-Python-tp27372p27397.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark Streaming - Direct Approach

2016-07-11 Thread Mail.com
Hi All,

Can someone please confirm whether the Spark Streaming direct approach for reading from 
Kafka is still experimental, or whether it can be used in production.

I have seen the documentation and a talk from TD describing the advantages of the 
approach, but the docs state it is an "experimental" feature. 

Please suggest

Thanks,
Pradeep
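
For context, a minimal sketch of the direct (receiver-less) approach with the Spark 1.6 / Kafka 0.8 integration; the broker list and topic name are placeholders:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class DirectStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("direct-stream-sketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092");
        Set<String> topics = new HashSet<>(Arrays.asList("my-topic"));

        // Direct stream: one RDD partition per Kafka partition, offsets tracked by Spark.
        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        stream.print();   // print the first few records of each batch

        jssc.start();
        jssc.awaitTermination();
    }
}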

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Running streaming applications in Production environment

2016-06-14 Thread Mail.com
Hi All,

Can you please advise on best practices for running streaming jobs in production 
that read from Kafka.

How do we trigger them (through a start script?), and what are the best ways to monitor 
that the application is running and to send an alert when it is down?

Thanks,
Pradeep
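
One common building block on the application side (a sketch only, not a full operations setup; the checkpoint directory, batch interval, and the socket source are placeholders) is checkpoint-based driver recovery plus graceful shutdown, so a restart script or a scheduler can simply relaunch the job after a failure:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

public class ProductionStreamingSketch {
    private static final String CHECKPOINT_DIR = "hdfs:///checkpoints/my-streaming-app";

    public static void main(String[] args) throws Exception {
        JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
            @Override
            public JavaStreamingContext create() {
                SparkConf conf = new SparkConf()
                        .setAppName("my-streaming-app")
                        // Finish in-flight batches before stopping on shutdown.
                        .set("spark.streaming.stopGracefullyOnShutdown", "true");
                JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
                jssc.checkpoint(CHECKPOINT_DIR);
                // Placeholder source; in production this would be the Kafka input stream.
                jssc.socketTextStream("localhost", 9999).print();
                return jssc;
            }
        };

        // Recover from the checkpoint if one exists, otherwise build a fresh context.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, factory);
        jssc.start();
        jssc.awaitTermination();
    }
}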






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kafka connection logs in Spark

2016-05-26 Thread Mail.com
Hi Cody,

I used Hortonworks jars for Spark Streaming that enable getting messages 
from Kafka with Kerberos.

Thanks,
Pradeep


> On May 26, 2016, at 11:04 AM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> I wouldn't expect kerberos to work with anything earlier than the beta
> consumer for kafka 0.10
> 
>> On Wed, May 25, 2016 at 9:41 PM, Mail.com <pradeep.mi...@mail.com> wrote:
>> Hi All,
>> 
>> I am connecting Spark 1.6 streaming  to Kafka 0.8.2 with Kerberos. I ran 
>> spark streaming in debug mode, but do not see any log saying it connected to 
>> Kafka or  topic etc. How could I enable that.
>> 
>> My spark streaming job runs but no messages are fetched from the RDD. Please 
>> suggest.
>> 
>> Thanks,
>> Pradeep
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Kafka connection logs in Spark

2016-05-25 Thread Mail.com
Hi All,

I am connecting Spark 1.6 Streaming to Kafka 0.8.2 with Kerberos. I ran Spark 
Streaming in debug mode, but I do not see any log saying it connected to Kafka, the 
topic, etc. How can I enable that? 

My Spark Streaming job runs, but no messages are fetched into the RDD. Please 
suggest.

Thanks,
Pradeep
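
One way to surface the connection details (a sketch; the logger names below are the usual package names for the Kafka 0.8 client and Spark's Kafka integration and may need adjusting for a particular distribution) is to raise the log level for those packages, either in log4j.properties or programmatically with the log4j 1.x API that Spark bundles:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class KafkaLogLevels {
    public static void enableKafkaDebugLogging() {
        // Kafka 0.8 client classes live under the "kafka" package.
        Logger.getLogger("kafka").setLevel(Level.DEBUG);
        // Spark's Kafka receiver / direct-stream integration.
        Logger.getLogger("org.apache.spark.streaming.kafka").setLevel(Level.DEBUG);
    }
}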

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: rpc.RpcTimeoutException: Futures timed out after [120 seconds]

2016-05-20 Thread Mail.com
Yes.

Sent from my iPhone

> On May 20, 2016, at 10:11 AM, Sahil Sareen <sareen...@gmail.com> wrote:
> 
> I'm not sure if this happens on small files or big ones as I have a mix of 
> them always.
> Did you see this only for big files?
> 
>> On Fri, May 20, 2016 at 7:36 PM, Mail.com <pradeep.mi...@mail.com> wrote:
>> Hi Sahil,
>> 
>> I have seen this with high GC time. Do you ever get this error with small 
>> volume files
>> 
>> Pradeep
>> 
>>> On May 20, 2016, at 9:32 AM, Sahil Sareen <sareen...@gmail.com> wrote:
>>> 
>>> Hey all
>>> 
>>> I'm using Spark-1.6.1 and occasionally seeing executors lost and hurting my 
>>> application performance due to these errors.
>>> Can someone please let out all the possible problems that could cause this?
>>> 
>>> 
>>> Full log:
>>> 
>>> 16/05/19 02:17:54 ERROR ContextCleaner: Error cleaning broadcast 266685
>>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 
>>> seconds]. This timeout is controlled by spark.rpc.askTimeout
>>> at 
>>> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
>>> at 
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
>>> at 
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
>>> at 
>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
>>> at 
>>> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
>>> at 
>>> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
>>> at 
>>> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
>>> at 
>>> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
>>> at 
>>> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:214)
>>> at 
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:170)
>>> at 
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:161)
>>> at scala.Option.foreach(Option.scala:257)
>>> at 
>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:161)
>>> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
>>> at 
>>> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
>>> at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
>>> [120 seconds]
>>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>>> at 
>>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>> at scala.concurrent.Await$.result(package.scala:190)
>>> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
>>> ... 12 more
>>> 16/05/19 02:18:26 ERROR TaskSchedulerImpl: Lost executor 
>>> 20160421-192532-1677787146-5050-40596-S23 on ip-10-0-1-70.ec2.internal: 
>>> Executor heartbeat timed out after 161447 ms
>>> 16/05/19 02:18:53 ERROR TaskSchedulerImpl: Lost executor 
>>> 20160421-192532-1677787146-5050-40596-S23 on ip-10-0-1-70.ec2.internal: 
>>> remote Rpc client disassociated
>>> 
>>> Thanks
>>> Sahil
> 


Re: rpc.RpcTimeoutException: Futures timed out after [120 seconds]

2016-05-20 Thread Mail.com
Hi Sahil,

I have seen this with high GC times. Do you ever get this error with small-volume 
files?

Pradeep
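
If long GC pauses are indeed the cause, one common mitigation alongside memory tuning (an assumption here, not something verified against this particular job) is to relax the relevant timeouts so cleaner and heartbeat calls survive the pauses, e.g.:

import org.apache.spark.SparkConf;

public class TimeoutTuning {
    public static SparkConf withRelaxedTimeouts(SparkConf conf) {
        return conf
                // Default is 120s; also the fallback value for spark.rpc.askTimeout.
                .set("spark.network.timeout", "300s")
                // How often executors send heartbeats; keep it well below spark.network.timeout.
                .set("spark.executor.heartbeatInterval", "30s");
    }
}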

> On May 20, 2016, at 9:32 AM, Sahil Sareen  wrote:
> 
> Hey all
> 
> I'm using Spark-1.6.1 and occasionally seeing executors lost and hurting my 
> application performance due to these errors.
> Can someone please let out all the possible problems that could cause this?
> 
> 
> Full log:
> 
> 16/05/19 02:17:54 ERROR ContextCleaner: Error cleaning broadcast 266685
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
> at 
> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
> at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
> at 
> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
> at 
> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:214)
> at 
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:170)
> at 
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:161)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:161)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
> at 
> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
> at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [120 seconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:190)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
> ... 12 more
> 16/05/19 02:18:26 ERROR TaskSchedulerImpl: Lost executor 
> 20160421-192532-1677787146-5050-40596-S23 on ip-10-0-1-70.ec2.internal: 
> Executor heartbeat timed out after 161447 ms
> 16/05/19 02:18:53 ERROR TaskSchedulerImpl: Lost executor 
> 20160421-192532-1677787146-5050-40596-S23 on ip-10-0-1-70.ec2.internal: 
> remote Rpc client disassociated
> 
> Thanks
> Sahil


Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-19 Thread Mail.com
Hi Muthu,

Do you have Kerberos enabled?

Thanks,
Pradeep

> On May 19, 2016, at 12:17 AM, Ramaswamy, Muthuraman 
> <muthuraman.ramasw...@viasat.com> wrote:
> 
> I am using Spark 1.6.1 and Kafka 0.9+ It works for both receiver and 
> receiver-less mode.
> 
> One thing I noticed when you specify invalid topic name, KafkaUtils doesn't 
> fetch any messages. So, check you have specified the topic name correctly.
> 
> ~Muthu
> ____
> From: Mail.com [pradeep.mi...@mail.com]
> Sent: Monday, May 16, 2016 9:33 PM
> To: Ramaswamy, Muthuraman
> Cc: Cody Koeninger; spark users
> Subject: Re: KafkaUtils.createDirectStream Not Fetching Messages with 
> Confluent Serializers as Value Decoder.
> 
> Hi Muthu,
> 
> Are you on spark 1.4.1 and Kafka 0.8.2? I have a similar issue even for 
> simple string messages.
> 
> Console producer and consumer work fine. But spark always reruns empty RDD. I 
> am using Receiver based Approach.
> 
> Thanks,
> Pradeep
> 
>> On May 16, 2016, at 8:19 PM, Ramaswamy, Muthuraman 
>> <muthuraman.ramasw...@viasat.com> wrote:
>> 
>> Yes, I can see the messages. Also, I wrote a quick custom decoder for avro 
>> and it works fine for the following:
>> 
>>>> kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": 
>>>> brokers}, valueDecoder=decoder)
>> 
>> But, when I use the Confluent Serializers to leverage the Schema Registry 
>> (based on the link shown below), it doesn’t work for me. I am not sure 
>> whether I need to configure any more details to consume the Schema Registry. 
>> I can fetch the schema from the schema registry based on is Ids. The decoder 
>> method is not returning any values for me.
>> 
>> ~Muthu
>> 
>> 
>> 
>>> On 5/16/16, 10:49 AM, "Cody Koeninger" <c...@koeninger.org> wrote:
>>> 
>>> Have you checked to make sure you can receive messages just using a
>>> byte array for value?
>>> 
>>> On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
>>> <muthuraman.ramasw...@viasat.com> wrote:
>>>> I am trying to consume AVRO formatted message through
>>>> KafkaUtils.createDirectStream. I followed the listed below example (refer
>>>> link) but the messages are not being fetched by the Stream.
>>>> 
>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_questions_30339636_spark-2Dpython-2Davro-2Dkafka-2Ddeserialiser=CwIBaQ=jcv3orpCsv7C4ly8-ubDob57ycZ4jvhoYZNDBA06fPk=NQ-dw5X8CJcqaXIvIdMUUdkL0fHDonD07FZzTY3CgiU=Nc-rPMFydyCrwOZuNWs2GmSL4NkN8eGoR-mkJUlkCx0=hwqxCKl3P4_9pKWeo1OGR134QegMRe3Xh22_WMy-5q8=
>>>> 
>>>> Is there any code missing that I must add to make the above sample work.
>>>> Say, I am not sure how the confluent serializers would know the avro schema
>>>> info as it knows only the Schema Registry URL info.
>>>> 
>>>> Appreciate your help.
>>>> 
>>>> ~Muthu
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Filter out the elements from xml file in Spark

2016-05-19 Thread Mail.com
Hi Yogesh,

Can you try a map operation with whatever parser you are using and extract what you need? 
You could also look at the spark-xml package. 

Thanks,
Pradeep
> On May 19, 2016, at 4:39 AM, Yogesh Vyas  wrote:
> 
> Hi,
> I had xml files which I am reading through textFileStream, and then
> filtering out the required elements using traditional conditions and
> loops. I would like to know if  there is any specific packages or
> functions provided in spark to perform operations on RDD of xml?
> 
> Regards,
> Yogesh
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-18 Thread Mail.com
Adding back users.



> On May 18, 2016, at 11:49 AM, Mail.com <pradeep.mi...@mail.com> wrote:
> 
> Hi Uladzimir,
> 
> I run it as below.
> 
> Spark-submit --class com.test --num-executors 4 --executor-cores 5 --queue 
> Dev --master yarn-client --driver-memory 512M --executor-memory 512M test.jar
> 
> Thanks,
> Pradeep
> 
> 
>> On May 18, 2016, at 5:45 AM, Vova Shelgunov <vvs...@gmail.com> wrote:
>> 
>> Hi Pradeep,
>> 
>> How do you run your spark application? What is spark master? How many cores 
>> do you allocate?
>> 
>> Regards,
>> Uladzimir
>> 
>>> On May 17, 2016 7:33 AM, "Mail.com" <pradeep.mi...@mail.com> wrote:
>>> Hi Muthu,
>>> 
>>> Are you on spark 1.4.1 and Kafka 0.8.2? I have a similar issue even for 
>>> simple string messages.
>>> 
>>> Console producer and consumer work fine. But spark always reruns empty RDD. 
>>> I am using Receiver based Approach.
>>> 
>>> Thanks,
>>> Pradeep
>>> 
>>> > On May 16, 2016, at 8:19 PM, Ramaswamy, Muthuraman 
>>> > <muthuraman.ramasw...@viasat.com> wrote:
>>> >
>>> > Yes, I can see the messages. Also, I wrote a quick custom decoder for 
>>> > avro and it works fine for the following:
>>> >
>>> >>> kvs = KafkaUtils.createDirectStream(ssc, [topic], 
>>> >>> {"metadata.broker.list": brokers}, valueDecoder=decoder)
>>> >
>>> > But, when I use the Confluent Serializers to leverage the Schema Registry 
>>> > (based on the link shown below), it doesn’t work for me. I am not sure 
>>> > whether I need to configure any more details to consume the Schema 
>>> > Registry. I can fetch the schema from the schema registry based on is 
>>> > Ids. The decoder method is not returning any values for me.
>>> >
>>> > ~Muthu
>>> >
>>> >
>>> >
>>> >> On 5/16/16, 10:49 AM, "Cody Koeninger" <c...@koeninger.org> wrote:
>>> >>
>>> >> Have you checked to make sure you can receive messages just using a
>>> >> byte array for value?
>>> >>
>>> >> On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
>>> >> <muthuraman.ramasw...@viasat.com> wrote:
>>> >>> I am trying to consume AVRO formatted message through
>>> >>> KafkaUtils.createDirectStream. I followed the listed below example 
>>> >>> (refer
>>> >>> link) but the messages are not being fetched by the Stream.
>>> >>>
>>> >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_questions_30339636_spark-2Dpython-2Davro-2Dkafka-2Ddeserialiser=CwIBaQ=jcv3orpCsv7C4ly8-ubDob57ycZ4jvhoYZNDBA06fPk=NQ-dw5X8CJcqaXIvIdMUUdkL0fHDonD07FZzTY3CgiU=Nc-rPMFydyCrwOZuNWs2GmSL4NkN8eGoR-mkJUlkCx0=hwqxCKl3P4_9pKWeo1OGR134QegMRe3Xh22_WMy-5q8=
>>> >>>
>>> >>> Is there any code missing that I must add to make the above sample work.
>>> >>> Say, I am not sure how the confluent serializers would know the avro 
>>> >>> schema
>>> >>> info as it knows only the Schema Registry URL info.
>>> >>>
>>> >>> Appreciate your help.
>>> >>>
>>> >>> ~Muthu
>>> >  
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org


Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-16 Thread Mail.com
Hi Muthu,

Are you on Spark 1.4.1 and Kafka 0.8.2? I have a similar issue even for simple 
string messages.

The console producer and consumer work fine, but Spark always returns an empty RDD. I 
am using the receiver-based approach. 

Thanks,
Pradeep

> On May 16, 2016, at 8:19 PM, Ramaswamy, Muthuraman 
>  wrote:
> 
> Yes, I can see the messages. Also, I wrote a quick custom decoder for avro 
> and it works fine for the following:
> 
>>> kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": 
>>> brokers}, valueDecoder=decoder)
> 
> But, when I use the Confluent Serializers to leverage the Schema Registry 
> (based on the link shown below), it doesn’t work for me. I am not sure 
> whether I need to configure any more details to consume the Schema Registry. 
> I can fetch the schema from the schema registry based on is Ids. The decoder 
> method is not returning any values for me.
> 
> ~Muthu
> 
> 
> 
>> On 5/16/16, 10:49 AM, "Cody Koeninger"  wrote:
>> 
>> Have you checked to make sure you can receive messages just using a
>> byte array for value?
>> 
>> On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
>>  wrote:
>>> I am trying to consume AVRO formatted message through
>>> KafkaUtils.createDirectStream. I followed the listed below example (refer
>>> link) but the messages are not being fetched by the Stream.
>>> 
>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_questions_30339636_spark-2Dpython-2Davro-2Dkafka-2Ddeserialiser=CwIBaQ=jcv3orpCsv7C4ly8-ubDob57ycZ4jvhoYZNDBA06fPk=NQ-dw5X8CJcqaXIvIdMUUdkL0fHDonD07FZzTY3CgiU=Nc-rPMFydyCrwOZuNWs2GmSL4NkN8eGoR-mkJUlkCx0=hwqxCKl3P4_9pKWeo1OGR134QegMRe3Xh22_WMy-5q8=
>>>  
>>> 
>>> Is there any code missing that I must add to make the above sample work.
>>> Say, I am not sure how the confluent serializers would know the avro schema
>>> info as it knows only the Schema Registry URL info.
>>> 
>>> Appreciate your help.
>>> 
>>> ~Muthu

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Executors and Cores

2016-05-15 Thread Mail.com
Hi Mich,

We have HDP 2.3.2, where Spark runs on 21 nodes, each with 250 GB of memory. 
Jobs run in yarn-client and yarn-cluster mode.

We have other teams using the same cluster to build their applications.

Regards,
Pradeep


> On May 15, 2016, at 1:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Hi Pradeep,
> 
> In your case what type of cluster we are taking about? A standalone cluster?
> 
> HTh
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 15 May 2016 at 13:19, Mail.com <pradeep.mi...@mail.com> wrote:
>> Hi ,
>> 
>> I have seen multiple videos on spark tuning which shows how to determine # 
>> cores, #executors and memory size of the job.
>> 
>> In all that I have seen, it seems each job has to be given the max resources 
>> allowed in the cluster.
>> 
>> How do we factor in input size as well? I am processing a 1gb compressed 
>> file then I can live with say 10 executors and not 21 etc..
>> 
>> Also do we consider other jobs in the cluster that could be running? I will 
>> use only 20 GB out of available 300 gb etc..
>> 
>> Thanks,
>> Pradeep
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 


Executors and Cores

2016-05-15 Thread Mail.com
Hi ,

I have seen multiple videos on Spark tuning which show how to determine the number of 
cores, number of executors, and memory size for a job.

In everything I have seen, it seems each job is given the maximum resources 
allowed in the cluster.

How do we factor in input size as well? If I am processing a 1 GB compressed 
file, then I can live with, say, 10 executors instead of 21.

Also, do we consider other jobs that could be running in the cluster? For example, I would 
use only 20 GB out of the available 300 GB.
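
As a rough illustration only (the numbers, class, and jar names below are assumptions for a roughly 1 GB compressed input on a shared cluster, not a recommendation), a right-sized submission might look like:

spark-submit --master yarn-client --num-executors 10 --executor-cores 2 --executor-memory 4g --class com.example.MyJob my-job.jar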

Thanks,
Pradeep
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.4.1 + Kafka 0.8.2 with Kerberos

2016-05-13 Thread Mail.com
Hi All,

I am trying to get Spark 1.4.1 (Java) to work with Kafka 0.8.2 in a Kerberos-enabled 
cluster (HDP 2.3.2).

Is there any document I can refer to?

Thanks,
Pradeep 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: XML Processing using Spark SQL

2016-05-12 Thread Mail.com
Hi Arun,

Could you try using StAX or JAXB?

Thanks,
Pradeep
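
For attribute-only, self-closing records like the ones quoted below, a minimal StAX sketch using the standard javax.xml.stream API (assuming the record element is named Record, as in Health Kit exports; the opening tags were stripped from the quoted snippet, and the file name is a placeholder):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class HealthKitStaxSketch {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream("export.xml"));

        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "Record".equals(reader.getLocalName())) {
                // Everything lives in attributes, so read them directly.
                String type = reader.getAttributeValue(null, "type");
                String sourceName = reader.getAttributeValue(null, "sourceName");
                String value = reader.getAttributeValue(null, "value");
                System.out.println(type + "\t" + sourceName + "\t" + value);
            }
        }
        reader.close();
    }
}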

> On May 12, 2016, at 8:35 PM, Hyukjin Kwon  wrote:
> 
> Hi Arunkumar,
> 
> 
> I guess your records are self-closing ones.
> 
> There is an issue open here, https://github.com/databricks/spark-xml/issues/92
> 
> This is about XmlInputFormat.scala and it seems a bit tricky to handle the 
> case so I left open until now.
> 
> 
> Thanks!
> 
> 
> 2016-05-13 5:03 GMT+09:00 Arunkumar Chandrasekar :
>> Hello,
>> 
>> Greetings.
>> 
>> I'm trying to process a xml file exported from Health Kit application using 
>> Spark SQL for learning purpose. The sample record data is like the below:
>> 
>>  > sourceVersion="9.3" device="HKDevice: 0x7896, name:iPhone, 
>> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3" 
>> unit="count" creationDate="2016-04-23 19:31:33 +0530" startDate="2016-04-23 
>> 19:00:20 +0530" endDate="2016-04-23 19:01:41 +0530" value="31"/>
>> 
>>  > sourceVersion="9.3.1" device="HKDevice: 0x85746, name:iPhone, 
>> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3.1" 
>> unit="count" creationDate="2016-04-24 05:45:00 +0530" startDate="2016-04-24 
>> 05:25:04 +0530" endDate="2016-04-24 05:25:24 +0530" value="10"/>.
>> 
>> I want to have the column name of my table as the field value like type, 
>> sourceName, sourceVersion and the row entries as their respective values 
>> like HKQuantityTypeIdentifierStepCount, Vizhi, 9.3.1,..
>> 
>> I took a look at the Spark-XML, but didn't get any information in my case 
>> (my xml is not well formed with the tags). Is there any other option to 
>> convert the record that I have mentioned above into a schema format for 
>> playing with Spark SQL?
>> 
>> Thanks in Advance.
>> 
>> Thank You,
>> Arun Chandrasekar
>> chan.arunku...@gmail.com
> 


Re: Spark-csv- partitionBy

2016-05-10 Thread Mail.com
Hi,

I don't want to reduce the number of partitions; I need to write files based on the column 
value.

I am trying to understand how reducing the number of partitions will make it work.

Regards,
Pradeep
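
Since spark-csv 1.x does not support partitionBy, one possible workaround (a sketch, not something from this thread; the column name, source, and paths are placeholders, and it assumes the partition column is a string with modest cardinality) is to loop over the distinct values and write each filtered subset to its own directory:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class WritePerColumnValue {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("per-value-writer"));
        SQLContext sqlContext = new SQLContext(sc);
        DataFrame df = sqlContext.read().parquet("hdfs:///input/data"); // placeholder source

        // One output directory per distinct value of "someColumn".
        Row[] keys = df.select("someColumn").distinct().collect();
        for (Row key : keys) {
            String value = key.getString(0);
            df.filter(df.col("someColumn").equalTo(value))
              .write()
              .format("com.databricks.spark.csv")
              .option("delimiter", "\t")
              .save("hdfs:///output/someColumn=" + value);
        }
        sc.stop();
    }
}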

> On May 9, 2016, at 6:42 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> 
> Hi,
> 
> its supported, try to use coalesce(1) (the spelling is wrong) and after that 
> do the partitions.
> 
> Regards,
> Gourav
> 
>> On Mon, May 9, 2016 at 7:12 PM, Mail.com <pradeep.mi...@mail.com> wrote:
>> Hi,
>> 
>> I have to write tab delimited file and need to have one directory for each 
>> unique value of a column.
>> 
>> I tried using spark-csv with partitionBy and seems it is not supported. Is 
>> there any other option available for doing this?
>> 
>> Regards,
>> Pradeep
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 


Spark-csv- partitionBy

2016-05-09 Thread Mail.com
Hi,

I have to write a tab-delimited file and need one directory for each 
unique value of a column.

I tried using spark-csv with partitionBy, and it seems that is not supported. Is 
there any other option available for doing this?

Regards,
Pradeep
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error in spark-xml

2016-05-02 Thread Mail.com
Can you try creating your own schema and using it to read the XML?

I had a similar issue and resolved it with a custom schema that specifies 
each attribute explicitly.

Pradeep
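
A sketch of that approach with the Java API (the row tag and field names below are illustrative, not taken from the XML in this thread; declaring the schema up front skips spark-xml's schema inference):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class XmlCustomSchemaSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("xml-custom-schema"));
        SQLContext sqlContext = new SQLContext(sc);

        // Declare every expected field explicitly instead of relying on inference.
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("title", DataTypes.StringType, true),
                DataTypes.createStructField("author", DataTypes.StringType, true)));

        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", "book")          // element chosen as the row boundary
                .schema(schema)
                .load("hdfs:///path/to/file.xml");

        df.show();
        sc.stop();
    }
}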


> On May 1, 2016, at 9:45 AM, Hyukjin Kwon  wrote:
> 
> To be more clear,
> 
> If you set the rowTag as "book", then it will produces an exception which is 
> an issue opened here, https://github.com/databricks/spark-xml/issues/92
> 
> Currently it does not support to parse a single element with only a value as 
> a row.
> 
> 
> If you set the rowTag as "bkval", then it should work. I tested the case 
> below to double check.
> 
> If it does not work as below, please open an issue with some information so 
> that I can reproduce.
> 
> 
> I tested the case above with the data below
> 
>   
> bk_113
> bk_114
>   
>   
> bk_114
> bk_116
>   
>   
> bk_115
> bk_116
>   
> 
> 
> 
> I tested this with the codes below
> 
> val path = "path-to-file"
> sqlContext.read
>   .format("xml")
>   .option("rowTag", "bkval")
>   .load(path)
>   .show()
> 
> Thanks!
> 
> 
> 2016-05-01 15:11 GMT+09:00 Hyukjin Kwon :
>> Hi Sourav,
>> 
>> I think it is an issue. XML will assume the element by the rowTag as object.
>> 
>>  Could you please open an issue in 
>> https://github.com/databricks/spark-xml/issues please?
>> 
>> Thanks!
>> 
>> 
>> 2016-05-01 5:08 GMT+09:00 Sourav Mazumder :
>>> Hi,
>>> 
>>> Looks like there is a problem in spark-xml if the xml has multiple 
>>> attributes with no child element.
>>> 
>>> For example say the xml has a nested object as below 
>>> 
>>> bk_113
>>> bk_114
>>>  
>>> 
>>> Now if I create a dataframe starting with rowtag bkval and then I do a 
>>> select on that data frame it gives following error.
>>> 
>>> 
>>> scala.MatchError: ENDDOCUMENT (of class 
>>> com.sun.xml.internal.stream.events.EndDocumentEvent) at 
>>> com.databricks.spark.xml.parsers.StaxXmlParser$.checkEndElement(StaxXmlParser.scala:94)
>>>  at  
>>> com.databricks.spark.xml.parsers.StaxXmlParser$.com$databricks$spark$xml$parsers$StaxXmlParser$$convertObject(StaxXmlParser.scala:295)
>>>  at 
>>> com.databricks.spark.xml.parsers.StaxXmlParser$$anonfun$parse$1$$anonfun$apply$4.apply(StaxXmlParser.scala:58)
>>>  at 
>>> com.databricks.spark.xml.parsers.StaxXmlParser$$anonfun$parse$1$$anonfun$apply$4.apply(StaxXmlParser.scala:46)
>>>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at 
>>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at 
>>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at 
>>> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at 
>>> scala.collection.Iterator$class.foreach(Iterator.scala:727) at 
>>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at 
>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at 
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
>>> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
>>> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at 
>>> scala.collection.AbstractIterator.to(Iterator.scala:1157) at 
>>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
>>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at 
>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
>>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>>>  at 
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>>>  at 
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>>>  at 
>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>>>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at 
>>> org.apache.spark.scheduler.Task.run(Task.scala:88) at 
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>  at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>  at java.lang.Thread.run(Thread.java:745)
>>> 
>>> However if there is only one row like below, it works fine.
>>> 
>>> 
>>> bk_113
>>> 
>>> 
>>> Any workaround ?
>>> 
>>> Regards,
>>> Sourav
> 


Re: JavaSparkContext.wholeTextFiles read directory

2016-04-26 Thread Mail.com
wholeTextFiles() works; it just does not provide the parallelism.

This is on Spark 1.4, HDP 2.3.2, batch jobs.

Thanks

> On Apr 26, 2016, at 9:16 PM, Harjit Singh <harjit.si...@deciphernow.com> 
> wrote:
> 
> You will have to write your customReceiver to do that. I don’t think 
> wholeTextFile is designed for that.
> 
> - Harjit
>> On Apr 26, 2016, at 7:19 PM, Mail.com <pradeep.mi...@mail.com> wrote:
>> 
>> 
>> Hi All,
>> I am reading entire directory of gz XML files with wholeTextFiles. 
>> 
>> I understand as it is gz and with wholeTextFiles the individual files are 
>> not splittable but why the entire directory is read by one executor, single 
>> task. I have provided number of executors as number of files in that 
>> directory.
>> 
>> Is the only option here is to repartition after the xmls are read and parsed 
>> with JaxB.
>> 
>> Regards,
>> Pradeep
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 
> v/r,
> Harjit Singh
> Decipher Technology Studios
> email:harjit.sin...@deciphernow.com
> mobile: 303-870-0883
> website: deciphertechstudios.com <http://deciphertechstudios.com/>
> 
> GPG:
> keyserver: hkps://hkps.pool.sks-keyservers.net
> keyid: D814A2EF
> 
> 
> 
> 
> 


JavaSparkContext.wholeTextFiles read directory

2016-04-26 Thread Mail.com

Hi All,
I am reading an entire directory of gzipped XML files with wholeTextFiles. 

I understand that because the files are gzipped and read with wholeTextFiles, the individual 
files are not splittable, but why is the entire directory read by one executor in a single 
task? I have set the number of executors to the number of files in that directory.

Is the only option here to repartition after the XMLs are read and parsed 
with JAXB?

Regards,
Pradeep
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Create tab separated file from a dataframe spark 1.4 with Java

2016-04-21 Thread Mail.com
Hi,

I have a DataFrame and need to write it to a tab-separated file using Spark 1.4 
and Java.

Can someone please suggest an approach?

Thanks,
Pradeep 
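
The spark-csv writer with option("delimiter", "\t") is one route (as in the partitionBy workaround shown earlier). A dependency-free sketch using only the core Spark 1.4 Java API is to map each Row to a tab-joined line; note it does not escape tab characters that appear inside values, and the method and path names are illustrative:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

public class TabSeparatedWriter {
    // Converts each Row to a tab-joined line and writes plain text part files.
    public static void writeTsv(DataFrame df, String outputPath) {
        JavaRDD<String> lines = df.toJavaRDD().map(new Function<Row, String>() {
            @Override
            public String call(Row row) {
                return row.mkString("\t");
            }
        });
        lines.saveAsTextFile(outputPath);
    }
}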

Re: spark on yarn

2016-04-20 Thread Mail.com
I get an error with a message that states the maximum number of cores allowed.


> On Apr 20, 2016, at 11:21 AM, Shushant Arora  
> wrote:
> 
> I am running a spark application on yarn cluster.
> 
> say I have available vcors in cluster as 100.And I start spark application 
> with --num-executors 200 --num-cores 2 (so I need total 200*2=400 vcores) but 
> in my cluster only 100 are available.
> 
> What will happen ? Will the job abort or it will be submitted successfully 
> and 100 vcores will be aallocated to 50 executors and rest executors will be 
> started as soon as vcores are available ?
> 
> Please note dynamic allocation is not enabled in cluster. I have old version 
> 1.2.
> 
> Thanks
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Parse XML using java spark

2016-04-18 Thread Mail.com

You might look at using JAXB or StAX. If the XML is simple enough, use the DataFrame 
approach with an auto-generated schema.

Pradeep
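
For completeness, a short sketch of the spark-xml route from Java (assuming the spark-xml package is on the classpath; rowTag "page" is the per-article record element in Wikipedia dumps, and the input path is a placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class WikipediaDumpReader {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("wiki-xml"));
        SQLContext sqlContext = new SQLContext(sc);

        // Each <page> element becomes one row; the schema is inferred from its child elements.
        DataFrame pages = sqlContext.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", "page")
                .load("hdfs:///path/to/enwiki-latest-pages-articles.xml");

        pages.select("title").show(10);
        sc.stop();
    }
}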

> On Apr 18, 2016, at 6:37 PM, Jinan Alhajjaj  wrote:
> 
> Thank you for your help.
> I would like to parse the XML file using Java, not Scala. Can you please 
> provide me with an example of how to parse XML via Java using Spark? My XML 
> file is a Wikipedia dump file.
> Thank you