spark session jdbc performance

2017-10-24 Thread Naveen Madhire
Hi,



I am trying to fetch data from an Oracle DB using a subquery and am experiencing
a lot of performance issues.



Below is the query I am using,



Using Spark 2.0.2

val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("user", user)
  .option("password", pwd)
  .option("dbtable", "subquery")
  .option("partitionColumn", "id")  // primary key column, uniformly distributed
  .option("lowerBound", "1")
  .option("upperBound", "50")
  .option("numPartitions", 30)
  .load()



The above query is configured to run with 30 partitions, but when I look at the UI
it is only using 1 partition to run the query.



Can anyone tell me if I am missing anything, or do I need to do anything else to
tune the performance of the query?

Thanks
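
For reference, a minimal sketch of the same read (the table, column names and bound
values below are placeholders): the subquery goes into dbtable as a parenthesized,
aliased derived table, and lowerBound/upperBound should roughly span the real range
of the partition column. The bounds only control how the per-partition WHERE clauses
are generated, so if the real id range is much wider than 1-50, almost every row falls
into the last partition and a single task ends up doing all the work, which can look
like one partition in the UI.

val query = "(SELECT id, col_a, col_b FROM my_table WHERE load_date = '2017-10-24') t"

val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("user", user)
  .option("password", pwd)
  .option("dbtable", query)
  .option("partitionColumn", "id")
  .option("lowerBound", "1")          // roughly min(id) of the subquery result
  .option("upperBound", "50000000")   // roughly max(id) of the subquery result
  .option("numPartitions", 30)
  .load()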


PySpark pickling behavior

2017-10-11 Thread Naveen Swamy
Hello fellow users,

1) I am wondering if there is documentation or guidelines to understand in
what situations does Pyspark decide to pickle the functions I use in the
map method.
2) Are there best practices to avoid pickling and sharing variables, etc,

I have a situation where I want to pass methods to map(); however, those
methods use C++ libraries underneath, and PySpark decides to pickle the
entire object and fails when trying to do that.

I tried to use broadcast variables, but the moment I change my function to
take additional parameters that must be passed through map(), Spark decides
to create an object and tries to serialize that too.

Now, I can probably create a dummy function that just does the sharing of the
variables and initializes locally, and chain that to the map method, but I
think it would be pretty awkward if I have to resort to that.

Here is my situation in code:

class Model(object):
    __metaclass__ = Singleton
    model_loaded = False
    mod = None

    @staticmethod
    def load(args):
        # load the model into Model.mod and set Model.model_loaded = True
        ...

    @staticmethod
    def predict(input, args):
        if not Model.model_loaded:
            Model.load(args)
        return Model.mod.predict(input)


def spark_main():
    args = parse_args()
    lines = read()
    rdd = sc.parallelize(lines)
    rdd = rdd.map(lambda x: Model.predict(x, args))
    # fails here with:
    # pickle.PicklingError: Could not serialize object: TypeError: can't
    # pickle thread.lock objects

Thanks, Naveen
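
For what it's worth, a minimal sketch of one way around this, assuming only plain,
picklable values (like args) need to reach the workers: keep the heavy, C++-backed
handle in a module-level variable and create it lazily inside the worker process, so
it is never captured in a closure or pickled. load_model is a placeholder for the
real loader.

_model = None  # lives inside each worker process, never serialized

def get_model(args):
    global _model
    if _model is None:
        _model = load_model(args)   # placeholder for the real C++-backed loader
    return _model

def predict_partition(args):
    def run(rows):
        model = get_model(args)     # loaded at most once per worker process
        for row in rows:
            yield model.predict(row)
    return run

rdd = sc.parallelize(lines)
predictions = rdd.mapPartitions(predict_partition(args))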


Loading objects only once

2017-09-27 Thread Naveen Swamy
Hello all,

I am a new user to Spark, please bear with me if this has been discussed
earlier.

I am trying to run batch inference using DL frameworks' pre-trained models
and Spark. Basically, I want to download a model (which is usually ~500 MB)
onto the workers, load the model, and run inference on images fetched
from a source like S3, something like this:
rdd = sc.parallelize(load_from_s3)
rdd.map(fetch_from_s3).map(read_file).map(predict)

I was able to get it running in local mode on Jupyter. However, I would
like to load the model only once and not on every map operation. A setup hook
would have been nice which loads the model once into the JVM. I came across this
JIRA https://issues.apache.org/jira/browse/SPARK-650  which suggests that I
can use Singleton and static initialization. I tried to do this using
a Singleton metaclass following the thread here
https://stackoverflow.com/questions/6760685/creating-a-singleton-in-python.
Following this approach failed miserably, complaining that Spark cannot serialize
ctypes objects with pointer references.

After a lot of trial and error, I moved the code to a separate file by
creating a static method for predict that checks if a class variable is set
or not and loads the model if not set. This approach does not sound thread
safe to me, So I wanted to reach out and see if there are established
patterns on how to achieve something like this.


Also, I would like to understand the executor -> tasks -> python process
mapping. Does each task get mapped to a separate python process? The
reason I ask is that I want to be able to use the mapPartitions method to load a
batch of files and run inference on them separately, for which I need to load the
object once per task. Any pointers would be appreciated.


Thanks for your time in answering my question.

Cheers, Naveen


Re: Spark streaming persist to hdfs question

2017-06-25 Thread Naveen Madhire
We are also doing transformations; that's the reason we are using Spark Streaming.
Does Spark Streaming support tumbling windows? I was thinking I could use a
window operation for writing into HDFS.

Thanks

On Sun, Jun 25, 2017 at 10:23 PM, ayan guha <guha.a...@gmail.com> wrote:

> I would suggest to use Flume, if possible, as it has in built HDFS log
> rolling capabilities
>
> On Mon, Jun 26, 2017 at 1:09 PM, Naveen Madhire <vmadh...@umail.iu.edu>
> wrote:
>
>> Hi,
>>
>> I am using spark streaming with 1 minute duration to read data from kafka
>> topic, apply transformations and persist into HDFS.
>>
>> The application is creating a new directory every 1 minute with many
>> partition files(= nbr of partitions). What parameter should I need to
>> change/configure to persist and create a HDFS directory say *every 30
>> minutes* instead of duration of the spark streaming application?
>>
>>
>> Any help would be appreciated.
>>
>> Thanks,
>> Naveen
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Spark streaming persist to hdfs question

2017-06-25 Thread Naveen Madhire
Hi,

I am using spark streaming with 1 minute duration to read data from kafka
topic, apply transformations and persist into HDFS.

The application is creating a new directory every 1 minute with many
partition files (= number of partitions). What parameter do I need to
change/configure to persist and create an HDFS directory, say, every 30
minutes instead of at every batch duration of the spark streaming application?


Any help would be appreciated.

Thanks,
Naveen
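
For reference, a minimal sketch of the tumbling-window idea (stream and path names are
assumptions): a window whose duration equals its slide interval does not overlap, so
with a 1-minute batch interval the save below fires once every 30 minutes instead of
once per batch. Note that 30 minutes of data are buffered until the window closes.

import org.apache.spark.streaming.Minutes

// transformedStream is the DStream produced by the existing transformations
val windowed = transformedStream.window(Minutes(30), Minutes(30))
windowed.saveAsTextFiles("hdfs://namenode:8020/data/output/every30min")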


Re:

2017-01-23 Thread Naveen
Hi Keith,

Can you try including a clean-up step at the end of the job, before the driver
tears down the SparkContext, to clean the necessary files through some regex
patterns or so on all nodes in your cluster by default? If files are not
available on a few nodes, that should not be a problem, should it?



On Sun, Jan 22, 2017 at 1:26 AM, Mark Hamstra 
wrote:

> I wouldn't say that Executors are dumb, but there are some pretty clear
> divisions of concepts and responsibilities across the different pieces of
> the Spark architecture. A Job is a concept that is completely unknown to an
> Executor, which deals instead with just the Tasks that it is given.  So you
> are correct, Jacek, that any notification of a Job end has to come from the
> Driver.
>
> On Sat, Jan 21, 2017 at 2:10 AM, Jacek Laskowski  wrote:
>
>> Executors are "dumb", i.e. they execute TaskRunners for tasks
>> and...that's it.
>>
>> Your logic should be on the driver that can intercept events
>> and...trigger cleanup.
>>
>> I don't think there's another way to do it.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jan 20, 2017 at 10:47 PM, Keith Chapman 
>> wrote:
>> > Hi Jacek,
>> >
>> > I've looked at SparkListener and tried it, I see it getting fired on the
>> > master but I don't see it getting fired on the workers in a cluster.
>> >
>> > Regards,
>> > Keith.
>> >
>> > http://keith-chapman.com
>> >
>> > On Fri, Jan 20, 2017 at 11:09 AM, Jacek Laskowski 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> (redirecting to users as it has nothing to do with Spark project
>> >> development)
>> >>
>> >> Monitor jobs and stages using SparkListener and submit cleanup jobs
>> where
>> >> a condition holds.
>> >>
>> >> Jacek
>> >>
>> >> On 20 Jan 2017 3:57 a.m., "Keith Chapman" 
>> wrote:
>> >>>
>> >>> Hi ,
>> >>>
>> >>> Is it possible for an executor (or slave) to know when an actual job
>> >>> ends? I'm running spark on a cluster (with yarn) and my workers
>> create some
>> >>> temporary files that I would like to clean up once the job ends. Is
>> there a
>> >>> way for the worker to detect that a job has finished? I tried doing
>> it in
>> >>> the JobProgressListener but it does not seem to work in a cluster.
>> The event
>> >>> is not triggered in the worker.
>> >>>
>> >>> Regards,
>> >>> Keith.
>> >>>
>> >>> http://keith-chapman.com
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
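
A minimal sketch of the clean-up step suggested at the top of this thread (the temp
directory and file pattern are assumptions): before the driver stops the SparkContext,
run one last best-effort job that visits the worker JVMs and deletes matching files;
nodes that happen to receive no task, or have no such files, are simply skipped.

def cleanupTempFiles(sc: org.apache.spark.SparkContext): Unit = {
  val slots = sc.defaultParallelism * 4   // oversubscribe so most executors get a task
  sc.parallelize(0 until slots, slots).foreachPartition { _ =>
    val dir = new java.io.File("/tmp/myapp-scratch")              // assumed temp dir
    Option(dir.listFiles()).getOrElse(Array.empty[java.io.File])
      .filter(_.getName.matches("part-.*\\.tmp"))                 // assumed pattern
      .foreach(_.delete())
  }
}

// in the driver, after the last action:
cleanupTempFiles(sc)
sc.stop()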


Re: RE: submit spark task on yarn asynchronously via java?

2016-12-24 Thread Naveen
Hi,

Please use the SparkLauncher API class and invoke the jobs in threads using async
calls with Futures.
Using SparkLauncher, you can specify the class name, application resource,
arguments to be passed to the driver, deploy-mode, etc.
I would suggest using Scala's Future, if Scala code is possible.

https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/launcher/SparkLauncher.html
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Future.html
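
For reference, a minimal sketch of that combination (the jar path and class name are
placeholders): SparkLauncher.launch() returns a java.lang.Process, and wrapping the
waitFor() in a Future keeps the submission from blocking the caller.

import org.apache.spark.launcher.SparkLauncher
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def submitAsync(jobName: String): Future[Int] = Future {
  val process = new SparkLauncher()
    .setAppResource("/path/to/my-app.jar")      // placeholder
    .setMainClass("com.example.MyJob")          // placeholder
    .setMaster("yarn-cluster")
    .addAppArgs(jobName)
    .launch()
  process.waitFor()                             // exit code of the spark-submit process
}

submitAsync("job-1").foreach(code => println(s"job-1 finished with exit code $code"))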



On Fri, Dec 23, 2016 at 7:10 AM, Linyuxin  wrote:

> Hi,
>
> Could Anybody help?
>
>
>
> *From:* Linyuxin
> *Sent:* 22 December 2016 14:18
> *To:* user 
> *Subject:* submit spark task on yarn asynchronously via java?
>
>
>
> Hi All,
>
>
>
> Version:
>
> Spark 1.5.1
>
> Hadoop 2.7.2
>
>
>
> Is there any way to submit and monitor spark task on yarn via java
> asynchronously?
>
>
>
>
>


Re: Launching multiple spark jobs within a main spark job.

2016-12-24 Thread Naveen
Thanks Liang, Vadim and everyone for your inputs!!

With this clarity, I've tried client modes for both main and sub-spark
jobs. Every main spark job and its corresponding threaded spark jobs are
coming up on the YARN applications list and the jobs are getting executed
properly. I need to now test with cluster modes at both levels, and need to
setup spark-submit and few configurations properly on all data nodes in the
cluster. I will share the updates as and when I execute and analyze further.

The concern I am thinking about now is how to throttle the launching of multiple
jobs based on the YARN cluster's availability. This exercise will be similar to
performing a break-point analysis of the cluster. But the problem here is that we
will not know the file sizes until we read them and get them in memory, and since
Spark's memory mechanics are more subtle and fragile, we need to be 100% sure and
avoid OOM (out-of-memory) issues. I am not sure if there is any process
available which can poll the resource manager's information and tell whether any
further jobs can be submitted to YARN.


On Thu, Dec 22, 2016 at 7:26 AM, Liang-Chi Hsieh  wrote:

>
> If you run the main driver and other Spark jobs in client mode, you can
> make
> sure they (I meant all the drivers) are running at the same node. Of course
> all drivers now consume the resources at the same node.
>
> If you run the main driver in client mode, but run other Spark jobs in
> cluster mode, the drivers of those Spark jobs will be launched at other
> nodes in the cluster. It should work too. It is as same as you run a Spark
> app in client mode and more others in cluster mode.
>
> If you run your main driver in cluster mode, and run other Spark jobs in
> cluster mode too, you may need  Spark properly installed in all nodes in
> the
> cluster, because those Spark jobs will be launched at the node which the
> main driver is running on.
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Launching-multiple-
> spark-jobs-within-a-main-spark-job-tp20311p20327.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
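
On the throttling concern above, one hedged idea: the YARN ResourceManager exposes
cluster metrics over its REST API (assumed here to be reachable at the standard web
port, endpoint /ws/v1/cluster/metrics), so the driver could check available memory
before launching the next sub-job. The crude string parsing and the 8 GB threshold
are only for illustration.

import scala.io.Source

def availableClusterMB(rmHost: String): Long = {
  val json = Source.fromURL(s"http://$rmHost:8088/ws/v1/cluster/metrics").mkString
  """"availableMB"\s*:\s*(\d+)""".r
    .findFirstMatchIn(json)
    .map(_.group(1).toLong)
    .getOrElse(0L)
}

if (availableClusterMB("rm-host") > 8 * 1024) {
  // safe to launch the next SparkLauncher-based sub-job
}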


Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Thanks Liang!
I get your point. It would mean that when launching spark jobs, the mode needs
to be specified as client for all spark jobs.
However, my concern is to know whether the memory of the driver (which is launching
the spark jobs) will be used completely by the Futures (SparkContexts), or whether
these spawned SparkContexts will get different nodes / executors from the resource
manager?

On Wed, Dec 21, 2016 at 6:43 PM, Naveen <hadoopst...@gmail.com> wrote:

> Hi Sebastian,
>
> Yes, for fetching the details from Hive and HBase, I would want to use
> Spark's HiveContext etc.
> However, based on your point, I might have to check if JDBC based driver
> connection could be used to do the same.
>
> Main reason for this is to avoid a client-server architecture design.
>
> If we go by a normal scala app without creating a sparkcontext as per your
> suggestion, then
> 1. it turns out to be a client program on cluster on a single node, and
> for any multiple invocation through xyz scheduler , it will be invoked
> always from that same node
> 2. Having client program on a single data node might create a hotspot for
> that data node which might create a bottleneck as all invocations might
> create JVMs on that node itself.
> 3. With above, we will loose the Spark on YARN's feature of dynamically
> allocating a driver on any available data node through RM and NM
> co-ordination. With YARN and Cluster mode of invoking a spark-job, it will
> help distribute multiple application(main one) in cluster uniformly.
>
> Thanks and please let me know your views.
>
>
> On Wed, Dec 21, 2016 at 5:43 PM, Sebastian Piu <sebastian@gmail.com>
> wrote:
>
>> Is there any reason you need a context on the application launching the
>> jobs?
>> You can use SparkLauncher in a normal app and just listen for state
>> transitions
>>
>> On Wed, 21 Dec 2016, 11:44 Naveen, <hadoopst...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> Thanks for your responses.
>>> Let me give more details in a picture of how I am trying to launch jobs.
>>>
>>> Main spark job will launch other spark-job similar to calling multiple
>>> spark-submit within a Spark driver program.
>>> These spawned threads for new jobs will be totally different components,
>>> so these cannot be implemented using spark actions.
>>>
>>> sample code:
>>>
>>> -
>>>
>>> Object Mainsparkjob {
>>>
>>> main(...){
>>>
>>> val sc=new SparkContext(..)
>>>
>>> Fetch from hive..using hivecontext
>>> Fetch from hbase
>>>
>>> //spawning multiple Futures..
>>> Val future1=Future{
>>> Val sparkjob= SparkLauncher(...).launch; spark.waitFor
>>> }
>>>
>>> Similarly, future2 to futureN.
>>>
>>> future1.onComplete{...}
>>> }
>>>
>>> }// end of mainsparkjob
>>> --
>>>
>>>
>>> [image: Inline image 1]
>>>
>>> On Wed, Dec 21, 2016 at 3:13 PM, David Hodeffi <
>>> david.hode...@niceactimize.com> wrote:
>>>
>>> I am not familiar of any problem with that.
>>>
>>> Anyway, If you run spark applicaction you would have multiple jobs,
>>> which makes sense that it is not a problem.
>>>
>>>
>>>
>>> Thanks David.
>>>
>>>
>>>
>>> *From:* Naveen [mailto:hadoopst...@gmail.com]
>>> *Sent:* Wednesday, December 21, 2016 9:18 AM
>>> *To:* d...@spark.apache.org; user@spark.apache.org
>>> *Subject:* Launching multiple spark jobs within a main spark job.
>>>
>>>
>>>
>>> Hi Team,
>>>
>>>
>>>
>>> Is it ok to spawn multiple spark jobs within a main spark job, my main
>>> spark job's driver which was launched on yarn cluster, will do some
>>> preprocessing and based on it, it needs to launch multilple spark jobs on
>>> yarn cluster. Not sure if this right pattern.
>>>
>>>
>>>
>>> Please share your thoughts.
>>>
>>> Sample code i ve is as below for better understanding..
>>>
>>> -
>>>
>>>
>>>
>>> Object Mainsparkjob {
>>>
>>>
>>>
>>> main(...){
>>>
>>>
>>>
>>> val sc=new SparkContext(..)
>>>
>>>
>>>
>>> Fetch from hive..using hivecontext
>>>
>>> Fetch from hbase
>>>
>>>
>>>
>>> //spawni

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Hi Sebastian,

Yes, for fetching the details from Hive and HBase, I would want to use
Spark's HiveContext etc.
However, based on your point, I might have to check if JDBC based driver
connection could be used to do the same.

Main reason for this is to avoid a client-server architecture design.

If we go by a normal scala app without creating a sparkcontext as per your
suggestion, then
1. it turns out to be a client program on cluster on a single node, and for
any multiple invocation through xyz scheduler , it will be invoked always
from that same node
2. Having client program on a single data node might create a hotspot for
that data node which might create a bottleneck as all invocations might
create JVMs on that node itself.
3. With the above, we will lose the Spark on YARN feature of dynamically
allocating a driver on any available data node through RM and NM
co-ordination. With YARN and Cluster mode of invoking a spark-job, it will
help distribute multiple application(main one) in cluster uniformly.

Thanks and please let me know your views.


On Wed, Dec 21, 2016 at 5:43 PM, Sebastian Piu <sebastian@gmail.com>
wrote:

> Is there any reason you need a context on the application launching the
> jobs?
> You can use SparkLauncher in a normal app and just listen for state
> transitions
>
> On Wed, 21 Dec 2016, 11:44 Naveen, <hadoopst...@gmail.com> wrote:
>
>> Hi Team,
>>
>> Thanks for your responses.
>> Let me give more details in a picture of how I am trying to launch jobs.
>>
>> Main spark job will launch other spark-job similar to calling multiple
>> spark-submit within a Spark driver program.
>> These spawned threads for new jobs will be totally different components,
>> so these cannot be implemented using spark actions.
>>
>> sample code:
>>
>> -
>>
>> Object Mainsparkjob {
>>
>> main(...){
>>
>> val sc=new SparkContext(..)
>>
>> Fetch from hive..using hivecontext
>> Fetch from hbase
>>
>> //spawning multiple Futures..
>> Val future1=Future{
>> Val sparkjob= SparkLauncher(...).launch; spark.waitFor
>> }
>>
>> Similarly, future2 to futureN.
>>
>> future1.onComplete{...}
>> }
>>
>> }// end of mainsparkjob
>> --
>>
>>
>> [image: Inline image 1]
>>
>> On Wed, Dec 21, 2016 at 3:13 PM, David Hodeffi <
>> david.hode...@niceactimize.com> wrote:
>>
>> I am not familiar of any problem with that.
>>
>> Anyway, If you run spark applicaction you would have multiple jobs, which
>> makes sense that it is not a problem.
>>
>>
>>
>> Thanks David.
>>
>>
>>
>> *From:* Naveen [mailto:hadoopst...@gmail.com]
>> *Sent:* Wednesday, December 21, 2016 9:18 AM
>> *To:* d...@spark.apache.org; user@spark.apache.org
>> *Subject:* Launching multiple spark jobs within a main spark job.
>>
>>
>>
>> Hi Team,
>>
>>
>>
>> Is it ok to spawn multiple spark jobs within a main spark job, my main
>> spark job's driver which was launched on yarn cluster, will do some
>> preprocessing and based on it, it needs to launch multilple spark jobs on
>> yarn cluster. Not sure if this right pattern.
>>
>>
>>
>> Please share your thoughts.
>>
>> Sample code i ve is as below for better understanding..
>>
>> -
>>
>>
>>
>> Object Mainsparkjob {
>>
>>
>>
>> main(...){
>>
>>
>>
>> val sc=new SparkContext(..)
>>
>>
>>
>> Fetch from hive..using hivecontext
>>
>> Fetch from hbase
>>
>>
>>
>> //spawning multiple Futures..
>>
>> Val future1=Future{
>>
>> Val sparkjob= SparkLauncher(...).launch; spark.waitFor
>>
>> }
>>
>>
>>
>> Similarly, future2 to futureN.
>>
>>
>>
>> future1.onComplete{...}
>>
>> }
>>
>>
>>
>> }// end of mainsparkjob
>>
>> --
>>
>>
>>
>>
>>


Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Naveen
Hi Team,

Thanks for your responses.
Let me give more details in a picture of how I am trying to launch jobs.

Main spark job will launch other spark-job similar to calling multiple
spark-submit within a Spark driver program.
These spawned threads for new jobs will be totally different components, so
these cannot be implemented using spark actions.

sample code:
-

object Mainsparkjob {

  def main(...) {

    val sc = new SparkContext(..)

    // fetch from Hive using HiveContext
    // fetch from HBase

    // spawning multiple Futures..
    val future1 = Future {
      val sparkjob = new SparkLauncher(...).launch()
      sparkjob.waitFor()
    }

    // similarly, future2 to futureN

    future1.onComplete { ... }
  }

} // end of Mainsparkjob
--


[image: Inline image 1]

On Wed, Dec 21, 2016 at 3:13 PM, David Hodeffi <
david.hode...@niceactimize.com> wrote:

> I am not familiar of any problem with that.
>
> Anyway, If you run spark applicaction you would have multiple jobs, which
> makes sense that it is not a problem.
>
>
>
> Thanks David.
>
>
>
> *From:* Naveen [mailto:hadoopst...@gmail.com]
> *Sent:* Wednesday, December 21, 2016 9:18 AM
> *To:* d...@spark.apache.org; user@spark.apache.org
> *Subject:* Launching multiple spark jobs within a main spark job.
>
>
>
> Hi Team,
>
>
>
> Is it ok to spawn multiple spark jobs within a main spark job, my main
> spark job's driver which was launched on yarn cluster, will do some
> preprocessing and based on it, it needs to launch multilple spark jobs on
> yarn cluster. Not sure if this right pattern.
>
>
>
> Please share your thoughts.
>
> Sample code i ve is as below for better understanding..
>
> -
>
>
>
> Object Mainsparkjob {
>
>
>
> main(...){
>
>
>
> val sc=new SparkContext(..)
>
>
>
> Fetch from hive..using hivecontext
>
> Fetch from hbase
>
>
>
> //spawning multiple Futures..
>
> Val future1=Future{
>
> Val sparkjob= SparkLauncher(...).launch; spark.waitFor
>
> }
>
>
>
> Similarly, future2 to futureN.
>
>
>
> future1.onComplete{...}
>
> }
>
>
>
> }// end of mainsparkjob
>
> --
>
>
>


Launching multiple spark jobs within a main spark job.

2016-12-20 Thread Naveen
Hi Team,

Is it ok to spawn multiple spark jobs within a main spark job? My main
spark job's driver, which was launched on the yarn cluster, will do some
preprocessing and, based on it, needs to launch multiple spark jobs on the
yarn cluster. I am not sure if this is the right pattern.

Please share your thoughts.
Sample code I have is as below for better understanding.
-

object Mainsparkjob {

  def main(...) {

    val sc = new SparkContext(..)

    // fetch from Hive using HiveContext
    // fetch from HBase

    // spawning multiple Futures..
    val future1 = Future {
      val sparkjob = new SparkLauncher(...).launch()
      sparkjob.waitFor()
    }

    // similarly, future2 to futureN

    future1.onComplete { ... }
  }

} // end of Mainsparkjob
--


java.lang.ClassCastException: optional binary element (UTF8) is not a group

2016-09-20 Thread Rajan, Naveen
)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  ... 3 more

Regards,
Naveen






Spark DataFrame sum of multiple columns

2016-04-21 Thread Naveen Kumar Pokala
Hi,

Do we have any way to perform row-level operations in Spark DataFrames?


For example,

I have a dataframe with columns A, B, C, ..., Z. I want to add one more column,
"New Column", with the sum of all the column values.

A    B    C    D    ...    Z     New Column
1    2    4    3    ...    26    351



Can somebody help me on this?


Thanks,
Naveen
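
For reference, a minimal sketch of one way to do this dynamically (the result column
name is an assumption): build a single sum expression by folding + over the Column
objects, then add it with withColumn.

import org.apache.spark.sql.functions.col

val cols = df.columns.map(col)                       // Column objects for A..Z
val withSum = df.withColumn("New Column", cols.reduce(_ + _))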


reading EOF exception while reading parquet file from hadoop

2016-04-20 Thread Naveen Kumar Pokala
Hi,

I am trying to read a parquet file (for example: one.parquet).

I am creating an rdd out of it.

My program in scala is like below:

val data = sqlContext.read.parquet("hdfs://machine:port/home/user/one.parquet")
  .rdd.map { x => (x.getString(0), x) }

data.count()


I am using spark 1.4 and Hadoop 2.4


java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
parquet.hadoop.ParquetInputSplit.readArray(ParquetInputSplit.java:240)
at parquet.hadoop.ParquetInputSplit.readUTF8(ParquetInputSplit.java:230)
at 
parquet.hadoop.ParquetInputSplit.readFields(ParquetInputSplit.java:197)
at 
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
at 
org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
at 
org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:45)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239)
at 
org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Thanks,
Naveen Kumar Pokala



Standard deviation on multiple columns

2016-04-18 Thread Naveen Kumar Pokala
Hi,

I am using spark 1.6.0

I want to find standard deviation of columns that will come dynamically.

  val stdDevOnAll = columnNames.map { x => stddev(x) }

causalDf.groupBy(causalDf("A"),causalDf("B"),causalDf("C"))
.agg(stdDevOnAll:_*) //error line


I am trying to do as above.

But it is giving me compilation error as below.

overloaded method value agg with alternatives: (expr: 
org.apache.spark.sql.Column,exprs: 
org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame  (exprs: 
java.util.Map[String,String])org.apache.spark.sql.DataFrame  (exprs: 
scala.collection.immutable.Map[String,String])org.apache.spark.sql.DataFrame 
 (aggExpr: (String, String),aggExprs: (String, 
String)*)org.apache.spark.sql.DataFrame cannot be applied to 
(org.apache.spark.sql.Column)


Naveen
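
For reference, a minimal sketch of how the overload in that error message can be
satisfied: it takes one Column plus varargs, so the dynamically built list can be
split into head and tail.

import org.apache.spark.sql.functions.stddev

val stdDevOnAll = columnNames.map(x => stddev(x))

causalDf.groupBy(causalDf("A"), causalDf("B"), causalDf("C"))
  .agg(stdDevOnAll.head, stdDevOnAll.tail: _*)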






Determinant of Matrix

2015-08-24 Thread Naveen

Hi,

Is there any function to find the determinant of a mllib.linalg.Matrix 
(a covariance matrix) using Spark?



Regards,
Naveen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
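
For reference, a minimal sketch (assuming a local matrix small enough to copy): MLlib
itself does not appear to expose a determinant, but both MLlib and Breeze dense
matrices store values in column-major order, so the data can be copied into a Breeze
DenseMatrix and breeze.linalg.det applied.

import breeze.linalg.{det, DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.Matrix

def determinant(m: Matrix): Double = det(new BDM(m.numRows, m.numCols, m.toArray))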



Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Naveen

Hi All,

Is there anyway to convert a mllib matrix to a Dense Matrix of Breeze? 
Any leads are appreciated.



Thanks,
Naveen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Naveen

Hi,

Thanks for the reply. I tried Matrix.toBreeze(), which returns the
following error:

method toBreeze in trait Matrix cannot be accessed in
org.apache.spark.mllib.linalg.Matrix



On Thursday 20 August 2015 07:50 PM, Burak Yavuz wrote:
Matrix.toBreeze is a private method. MLlib matrices have the same 
structure as Breeze Matrices. Just create a new Breeze matrix like 
this 
https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L270. 



Best,
Burak


On Thu, Aug 20, 2015 at 3:28 AM, Yanbo Liang yblia...@gmail.com wrote:


You can use Matrix.toBreeze()

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L56
 .

2015-08-20 18:24 GMT+08:00 Naveen nav...@formcept.com:

Hi All,

Is there anyway to convert a mllib matrix to a Dense Matrix of
Breeze? Any leads are appreciated.


Thanks,
Naveen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







Repartition question

2015-08-03 Thread Naveen Madhire
Hi All,

I am running the WikiPedia parsing example present in the Advance
Analytics with Spark book.

https://github.com/sryza/aas/blob/d3f62ef3ed43a59140f4ae8afbe2ef81fc643ef2/ch06-lsa/src/main/scala/com/cloudera/datascience/lsa/ParseWikipedia.scala#l112


The partitions of the RDD returned by the readFile function (mentioned
above) are about 32 MB in size. So if my file size is 100 MB, the RDD is getting
created with 4 partitions of approx 32 MB each.


I am running this in standalone spark cluster mode; everything is working fine,
I am only a little confused about the number of partitions and their size.

I want to increase the number of partitions for the RDD to make use of the
cluster. Is calling repartition() afterwards the only option, or can I pass
something in the above method to get more partitions for the RDD?

Please let me know.

Thanks.
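
For reference, a minimal sketch of the two generic knobs (path and counts are
placeholders); whether the first one applies depends on the input format used inside
readFile: many input methods such as textFile accept a minPartitions hint that
produces more splits up front, and repartition(n) reshuffles an already-loaded RDD.

val rdd = sc.textFile("hdfs:///data/wikidump.xml", 32)   // ask for at least 32 partitions
println(rdd.partitions.size)

val reshuffled = rdd.repartition(64)                     // explicit shuffle into 64 partitions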


pyspark issue

2015-07-27 Thread Naveen Madhire
Hi,

I am running pyspark on Windows and I am seeing an error while adding
pyFiles to the SparkContext. Below is the example:

sc = SparkContext("local", "Sample", pyFiles="C:/sample/yattag.zip")

this fails with no file found error for C


The below logic is treating the path as individual files like "C", ":", "/" etc.

https://github.com/apache/spark/blob/master/python/pyspark/context.py#l195
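
For reference, a hedged note on that logic: pyFiles is expected to be a collection of
paths, so a bare string gets iterated character by character ("C", ":", "/", ...).
Wrapping the path in a list avoids that:

from pyspark import SparkContext

sc = SparkContext("local", "Sample", pyFiles=["C:/sample/yattag.zip"])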


It works if I use SparkConf:

sparkConf.set("spark.submit.pyFiles", "C:/sample/yattag.zip")
sc = SparkContext("local", "Sample", conf=sparkConf)


Is this an existing issue, or am I not including the files in the correct
way in the SparkContext?


Thanks.







LinearRegressionWithSGD Outputs NaN

2015-07-21 Thread Naveen

Hi ,

I am trying to use LinearRegressionWithSGD on the Million Song Dataset and
my model returns NaNs as weights and 0.0 as the intercept. What might
be the issue? I am using Spark 1.4.0 in standalone mode.


Below is my model:

val numIterations = 100
val stepSize = 1.0
val regParam = 0.01
val regType = "L2"
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer.setNumIterations(numIterations).setStepSize(stepSize).setRegParam(regParam)
val model = algorithm.run(parsedTrainData)

Regards,
Naveen
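
For what it's worth, a common cause of NaN weights with SGD is features on very
different scales combined with a step size that is too large, so the updates diverge.
A minimal sketch of standardizing the features first, assuming parsedTrainData is an
RDD[LabeledPoint] with dense features; the smaller step size is only a starting point
to tune.

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(parsedTrainData.map(_.features))
val scaledData = parsedTrainData
  .map(p => LabeledPoint(p.label, scaler.transform(p.features)))
  .cache()

val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
  .setNumIterations(100)
  .setStepSize(0.01)
  .setRegParam(0.01)
val model = algorithm.run(scaledData)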

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySpark Nested Json Parsing

2015-07-20 Thread Naveen Madhire
I had a similar issue with Spark 1.3.
After migrating to Spark 1.4 and using sqlContext.read.json, it worked well.
I think you can look at the dataframe select and explode options to read the
nested json elements, arrays etc.

Thanks.


On Mon, Jul 20, 2015 at 11:07 AM, Davies Liu dav...@databricks.com wrote:

 Could you try SQLContext.read.json()?

 On Mon, Jul 20, 2015 at 9:06 AM, Davies Liu dav...@databricks.com wrote:
  Before using the json file as text file, can you make sure that each
  json string can fit in one line? Because textFile() will split the
  file by '\n'
 
  On Mon, Jul 20, 2015 at 3:26 AM, Ajay ajay0...@gmail.com wrote:
  Hi,
 
  I am new to Apache Spark. I am trying to parse nested json using
 pyspark.
  Here is the code by which I am trying to parse Json.
  I am using Apache Spark 1.2.0 version of cloudera CDH 5.3.2.
 
  lines = sc.textFile(inputFile)
 
  import json
  def func(x):
  json_str = json.loads(x)
  if json_str['label']:
  if json_str['label']['label2']:
  return (1,1)
  return (0,1)
 
  lines.map(func).reduceByKey(lambda a,b: a +
 b).saveAsTextFile(outputFile)
 
  I am getting following error,
  ERROR [Executor task launch worker-13] executor.Executor
  (Logging.scala:logError(96)) - Exception in task 1.0 in stage 14.0 (TID
 25)
  org.apache.spark.api.python.PythonException: Traceback (most recent call
  last):
File
 
 /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/spark/python/pyspark/worker.py,
  line 107, in main
  process()
File
 
 /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/spark/python/pyspark/worker.py,
  line 98, in process
  serializer.dump_stream(func(split_index, iterator), outfile)
File
 
 /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/spark/python/pyspark/rdd.py,
  line 2073, in pipeline_func
  return func(split, prev_func(split, iterator))
File
 
 /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/spark/python/pyspark/rdd.py,
  line 2073, in pipeline_func
  return func(split, prev_func(split, iterator))
File
 
 /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/spark/python/pyspark/rdd.py,
  line 247, in func
  return f(iterator)
File
 
 /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/spark/python/pyspark/rdd.py,
  line 1561, in combineLocally
  merger.mergeValues(iterator)
File
 
 /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/spark/python/pyspark/shuffle.py,
  line 252, in mergeValues
  for k, v in iterator:
File stdin, line 2, in func
File /usr/lib64/python2.6/json/__init__.py, line 307, in loads
  return _default_decoder.decode(s)
File /usr/lib64/python2.6/json/decoder.py, line 319, in decode
  obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File /usr/lib64/python2.6/json/decoder.py, line 336, in raw_decode
  obj, end = self._scanner.iterscan(s, **kw).next()
File /usr/lib64/python2.6/json/scanner.py, line 55, in iterscan
  rval, next_pos = action(m, context)
File /usr/lib64/python2.6/json/decoder.py, line 183, in JSONObject
  value, end = iterscan(s, idx=end, context=context).next()
File /usr/lib64/python2.6/json/scanner.py, line 55, in iterscan
  rval, next_pos = action(m, context)
File /usr/lib64/python2.6/json/decoder.py, line 183, in JSONObject
  value, end = iterscan(s, idx=end, context=context).next()
File /usr/lib64/python2.6/json/scanner.py, line 55, in iterscan
  rval, next_pos = action(m, context)
File /usr/lib64/python2.6/json/decoder.py, line 155, in JSONString
  return scanstring(match.string, match.end(), encoding, strict)
  ValueError: Invalid \escape: line 1 column 855 (char 855)
 
  at
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
  at
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174)
  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:305)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
  2015-07-20 09:58:24,727 ERROR [Executor task launch worker-12]
  executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0
 in
  stage 14.0 (TID 24)
  org.apache.spark.api.python.PythonException: Traceback (most recent call
  

Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-07-18 Thread Naveen Madhire
I am facing the same issue. I tried this but got a compilation error for
the $ in the explode function.

So, I had to modify it to the below to make it work:

df.select(explode(new Column("entities.user_mentions")).as("mention"))
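
A hedged note on that compilation error: the $"..." column syntax in the snippet below
comes from the SQLContext implicits, so with that import in scope the original
one-liner compiles without constructing a Column by hand.

import sqlContext.implicits._
import org.apache.spark.sql.functions.explode

val mentions = df.select(explode($"entities.user_mentions").as("mention"))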




On Wed, Jun 24, 2015 at 2:48 PM, Michael Armbrust mich...@databricks.com
wrote:

 Starting in Spark 1.4 there is also an explode that you can use directly
 from the select clause (much like in HiveQL):

 import org.apache.spark.sql.functions._
 df.select(explode($entities.user_mentions).as(mention))

 Unlike standard HiveQL, you can also include other attributes in the
 select or even $*.


 On Wed, Jun 24, 2015 at 8:34 AM, Yin Huai yh...@databricks.com wrote:

 The function accepted by explode is f: Row = TraversableOnce[A]. Seems
 user_mentions is an array of structs. So, can you change your
 pattern matching to the following?

 case Row(rows: Seq[_]) = rows.asInstanceOf[Seq[Row]].map(elem = ...)

 On Wed, Jun 24, 2015 at 5:27 AM, Gustavo Arjones 
 garjo...@socialmetrix.com wrote:

 Hi All,

 I am using the new *Apache Spark version 1.4.0 Data-frames API* to
 extract information from Twitter's Status JSON, mostly focused on the 
 Entities
 Object https://dev.twitter.com/overview/api/entities - the relevant
 part to this question is showed below:

 {
   ...
   ...
   entities: {
 hashtags: [],
 trends: [],
 urls: [],
 user_mentions: [
   {
 screen_name: linobocchini,
 name: Lino Bocchini,
 id: 187356243,
 id_str: 187356243,
 indices: [ 3, 16 ]
   },
   {
 screen_name: jeanwyllys_real,
 name: Jean Wyllys,
 id: 23176,
 id_str: 23176,
 indices: [ 79, 95 ]
   }
 ],
 symbols: []
   },
   ...
   ...
 }

 There are several examples on how extract information from primitives
 types as string, integer, etc - but I couldn't find anything on how to
 process those kind of *complex* structures.

 I tried the code below but it is still doesn't work, it throws an
 Exception

 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

 val tweets = sqlContext.read.json(tweets.json)

 // this function is just to filter empty entities.user_mentions[] nodes
 // some tweets doesn't contains any mentions
 import org.apache.spark.sql.functions.udf
 val isEmpty = udf((value: List[Any]) = value.isEmpty)

 import org.apache.spark.sql._
 import sqlContext.implicits._
 case class UserMention(id: Long, idStr: String, indices: Array[Long], name: 
 String, screenName: String)

 val mentions = tweets.select(entities.user_mentions).
   filter(!isEmpty($user_mentions)).
   explode($user_mentions) {
   case Row(arr: Array[Row]) = arr.map { elem =
 UserMention(
   elem.getAs[Long](id),
   elem.getAs[String](is_str),
   elem.getAs[Array[Long]](indices),
   elem.getAs[String](name),
   elem.getAs[String](screen_name))
   }
 }

 mentions.first

 Exception when I try to call mentions.first:

 scala mentions.first
 15/06/23 22:15:06 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 8)
 scala.MatchError: [List([187356243,187356243,List(3, 16),Lino 
 Bocchini,linobocchini], [23176,23176,List(79, 95),Jean 
 Wyllys,jeanwyllys_real])] (of class 
 org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
 at 
 $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:34)
 at 
 $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:34)
 at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55)
 at 
 org.apache.spark.sql.catalyst.expressions.UserDefinedGenerator.eval(generators.scala:81)

 What is wrong here? I understand it is related to the types but I
 couldn't figure out it yet.

 As additional context, the structure mapped automatically is:

 scala mentions.printSchema
 root
  |-- user_mentions: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- id: long (nullable = true)
  |||-- id_str: string (nullable = true)
  |||-- indices: array (nullable = true)
  ||||-- element: long (containsNull = true)
  |||-- name: string (nullable = true)
  |||-- screen_name: string (nullable = true)

 *NOTE 1:* I know it is possible to solve this using HiveQL but I would
 like to use Data-frames once there is so much momentum around it.

 SELECT explode(entities.user_mentions) as mentions
 FROM tweets

 *NOTE 2:* the *UDF* val isEmpty = udf((value: List[Any]) =
 value.isEmpty) is a ugly hack and I'm missing something here, but was
 the only way I came up to avoid a NPE

 I’ve posted the same question on SO:
 http://stackoverflow.com/questions/31016156/how-to-extract-complex-json-structures-using-apache-spark-1-4-0-data-frames

 Thanks all!
 - gustavo






Job aborted due to stage failure: Task not serializable:

2015-07-15 Thread Naveen Dabas



  

I am using the below code with the Kryo serializer. When I run this code I get the
error "Task not serializable" at the commented line.
2) How are broadcast variables treated in the executor? Are they local variables, or
can they be used in any function defined as global variables?

object StreamingLogInput {
  def main(args: Array[String]) {
    val master = args(0)
    val conf = new SparkConf().setAppName("StreamingLogInput")
    // Create a StreamingContext with a 1 second batch size
    val sc = new SparkContext(conf)
    val lines = sc.parallelize(List("eoore is test", "testing is error report"))
    // val ssc = new StreamingContext(sc, Seconds(30))
    // val lines = ssc.socketTextStream("localhost", )
    val filter = sc.textFile("/user/nadabas/filters/fltr")
    val filarr = filter.collect().toArray
    val broadcastVar = sc.broadcast(filarr)
    // val out = lines.transform { rdd => rdd.filter(x => fil(broadcastVar.value, x)) }
    val out = lines.filter(x => fil(broadcastVar.value, x))  // error is coming here
    out.collect()
  }

  def fil(x1: Array[String], y1: String) = {
    val y = y1
    // val x = broadcastVar.value
    val x = x1
    var flag: Boolean = false
    for (a <- x) {
      if (y.contains(a)) flag = true
    }
    flag
  }
}

   

Re: Spark and HDFS

2015-07-15 Thread Naveen Madhire
Yes. I did this recently. You need to copy the cloudera cluster related
conf files into the local machine
and set HADOOP_CONF_DIR or YARN_CONF_DIR.

And also local machine should be able to ssh to the cloudera cluster.

On Wed, Jul 15, 2015 at 8:51 AM, ayan guha guha.a...@gmail.com wrote:

 Assuming you run spark locally (ie either local mode or standalone cluster
 on your localm/c)
 1. You need to have hadoop binaries locally
 2. You need to have hdfs-site on Spark Classpath of your local m/c

 I would suggest you to start off with local files to play around.

 If you need to run spark on CDH cluster using Yarn, then you need to use
 spark-submit to yarn cluster. You can see a very good example here:
 https://spark.apache.org/docs/latest/running-on-yarn.html



 On Wed, Jul 15, 2015 at 10:36 PM, Jeskanen, Elina elina.jeska...@cgi.com
 wrote:

  I have Spark 1.4 on my local machine and I would like to connect to our
 local 4 nodes Cloudera cluster. But how?



 In the example it says text_file = spark.textFile(hdfs://...), but can
 you advise me in where to get this hdfs://... -address?



 Thanks!



 Elina








 --
 Best Regards,
 Ayan Guha



Spark executor memory information

2015-07-14 Thread Naveen Dabas
Hi,



I am new to spark and need some guidance on the below points:

1) I am using spark 1.2. Is it possible to see how much memory is being
allocated to an executor from the web UI? If not, how can we figure that out?
2) I am interested in the source code of MLlib. Is it possible to get access to
it and to the spark core implementations?

   

Re: Unit tests of spark application

2015-07-13 Thread Naveen Madhire
Thanks. Spark-testing-base works pretty well.

On Fri, Jul 10, 2015 at 3:23 PM, Burak Yavuz brk...@gmail.com wrote:

 I can +1 Holden's spark-testing-base package.

 Burak

 On Fri, Jul 10, 2015 at 12:23 PM, Holden Karau hol...@pigscanfly.ca
 wrote:

 Somewhat biased of course, but you can also use spark-testing-base from
 spark-packages.org as a basis for your unittests.

 On Fri, Jul 10, 2015 at 12:03 PM, Daniel Siegmann 
 daniel.siegm...@teamaol.com wrote:

 On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire vmadh...@umail.iu.edu
 wrote:

 I want to write junit test cases in scala for testing spark
 application. Is there any guide or link which I can refer.


 https://spark.apache.org/docs/latest/programming-guide.html#unit-testing

 Typically I create test data using SparkContext.parallelize and then
 call RDD.collect to get the results to assert.




 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau
 Linked In: https://www.linkedin.com/in/holdenkarau





Unit tests of spark application

2015-07-10 Thread Naveen Madhire
Hi,

I want to write junit test cases in scala for testing spark application. Is
there any guide or link which I can refer.

Thank you very much.

-Naveen
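
For reference, a minimal sketch of the parallelize/collect pattern suggested in the
reply thread above, using ScalaTest as one option (the word-count logic is just an
example): a local SparkContext is created before the suite and stopped afterwards.

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSpec extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = {
    sc.stop()
  }

  test("counts words") {
    val counts = sc.parallelize(Seq("a b", "a"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") === 2)
    assert(counts("b") === 1)
  }
}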


DataFrame question

2015-07-07 Thread Naveen Madhire
Hi All,

I am working with dataframes and have been struggling with this thing, any
pointers would be helpful.

I've a Json file with the schema like this,

links: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- desc: string (nullable = true)
 |||-- id: string (nullable = true)


I want to fetch id and desc as an RDD like this: RDD[(String, String)]

I am using df.select("links.desc", "links.id").rdd

The above dataframe select is returning an RDD like this:
RDD[(List(String), List(String))]


So, links: [{one,1},{two,2},{three,3}] in the json should return an
RDD[(one,1),(two,2),(three,3)]

Can anyone tell me how the dataframe select should be modified?
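
For reference, a minimal sketch: because links is an array of structs, selecting
links.desc returns one array per row (hence the List(String) seen above); exploding
first yields one row per array element, which maps cleanly onto RDD[(String, String)].

import org.apache.spark.sql.functions.explode

val pairs = df.select(explode(df("links")).as("link"))
  .select("link.desc", "link.id")
  .rdd
  .map(row => (row.getString(0), row.getString(1)))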


Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Naveen Madhire
Hi Marcelo, quick question.

I am using Spark 1.3 with Yarn client mode. It is working well,
provided I manually pip-install all the 3rd-party libraries like
numpy etc. on the executor nodes.



So does the SPARK-5479 fix in 1.5 which you mentioned address this as well?
Thanks.


On Thu, Jun 25, 2015 at 2:22 PM, Marcelo Vanzin van...@cloudera.com wrote:

 That sounds like SPARK-5479 which is not in 1.4...

 On Thu, Jun 25, 2015 at 12:17 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 In addition to previous emails, when i try to execute this command from
 command line:

 ./bin/spark-submit --verbose --master yarn-cluster --py-files
  mypython/libs/numpy-1.9.2.zip --deploy-mode cluster
 mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0


 - numpy-1.9.2.zip - is downloaded numpy package
 - kmeans.py is default example which comes with Spark 1.4
 - kmeans_data.txt  - is default data file which comes with Spark 1.4


 It fails saying that it could not find numpy:

 File kmeans.py, line 31, in module
 import numpy
 ImportError: No module named numpy

 Has anyone run Python Spark application on Yarn-cluster mode ? (which has
 3rd party Python modules to be shipped with)

 What are the configurations or installations to be done before running
 Python Spark job with 3rd party dependencies on Yarn-cluster ?

 Thanks in advance.


 --
 Marcelo



Re: How to set HBaseConfiguration in Spark

2015-05-20 Thread Naveen Madhire
Cloudera blog has some details.

Please check if this is helpful to you.

http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

Thanks.

On Wed, May 20, 2015 at 4:21 AM, donhoff_h 165612...@qq.com wrote:

 Hi, all

 I wrote a program to get HBaseConfiguration object in Spark. But after I
 printed the content of this hbase-conf object, I found they were wrong. For
 example, the property hbase.zookeeper.quorum should be
 bgdt01.dev.hrb,bgdt02.dev.hrb,bgdt03.hrb. But the printed value is
 localhost.

 Could anybody tell me how to set up the HBase Configuration in Spark? No
 matter it should be set in a configuration file or be set by a Spark API.
 Many Thanks!

 The code of my program is listed below:
 object TestHBaseConf {
  def main(args: Array[String]) {
val conf = new SparkConf()
val sc = new SparkContext(conf)
val hbConf = HBaseConfiguration.create()
hbConf.addResource(file:///etc/hbase/conf/hbase-site.xml)
val it = hbConf.iterator()
while(it.hasNext) {
  val e = it.next()
  println(Key=+ e.getKey + Value=+e.getValue)
}

val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9))
val result = rdd.sum()
println(result=+result)
sc.stop()
  }
 }



Failed to locate the winutils binary in the hadoop binary path

2015-01-29 Thread Naveen Kumar Pokala
 AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkdri...@ii01-hdhlg32.ciqhyd.com:60464/user/HeartbeatReceiver
15/01/29 17:21:28 INFO SparkILoop: Created spark context..
Spark context available as sc.




-Naveen


Pyspark Interactive shell

2015-01-06 Thread Naveen Kumar Pokala
Hi,

Has anybody tried to connect to a spark cluster (on UNIX machines) from a Windows
interactive shell?

-Naveen.


pyspark.daemon not found

2014-12-31 Thread Naveen Kumar Pokala
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /home/npokala/data/spark-install/spark-master/python:



Please can somebody help me on this, how to resolve the issue.

-Naveen


Re: pyspark.daemon not found

2014-12-31 Thread Naveen Kumar Pokala
Hi,

I am receiving the following error when I am trying to connect to the spark cluster
(which is on unix) from my windows machine using the pyspark interactive shell.

 pyspark -master (spark cluster url)

Then I executed the following  commands.


lines = sc.textFile("hdfs://master/data/spark/SINGLE.TXT")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

I got the following Error

14/12/31 16:20:15 INFO DAGScheduler: Job 0 failed: reduce at stdin:1, took 
6.960438 s
Traceback (most recent call last):
  File stdin, line 1, in module
  File C:\Users\npokala\Downloads\spark-master\python\pyspark\rdd.py, line 
715, in reduce
vals = self.mapPartitions(func).collect()
  File C:\Users\npokala\Downloads\spark-master\python\pyspark\rdd.py, line 
676, in collect
bytesInJava = self._jrdd.collect().iterator()
  File 
C:\Users\npokala\Downloads\spark-master\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py,
 line 538, in __call__
  File 
C:\Users\npokala\Downloads\spark-master\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py,
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
7,
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  
/home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-ins
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:265)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Please help me to resolve this issue.


-Naveen



From: Naveen Kumar Pokala [mailto:npok...@spcapitaliq.com]
Sent: Wednesday, December 31, 2014 2:28 PM
To: user@spark.apache.org
Subject: pyspark.daemon not found

Error from python worker:
  python

Fwd: Sample Spark Program Error

2014-12-31 Thread Naveen Madhire
Hi All,

I am trying to run a sample Spark program using Scala SBT,

Below is the program,

def main(args: Array[String]) {

  val logFile = "E:/ApacheSpark/usb/usb/spark/bin/README.md" // Should be some file on your system
  val sc = new SparkContext("local", "Simple App",
    "E:/ApacheSpark/usb/usb/spark/bin", List("target/scala-2.10/sbt2_2.10-1.0.jar"))
  val logData = sc.textFile(logFile, 2).cache()

  val numAs = logData.filter(line => line.contains("a")).count()
  val numBs = logData.filter(line => line.contains("b")).count()

  println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

}


Below is the error log,


14/12/30 23:20:21 INFO rdd.HadoopRDD: Input split:
file:/E:/ApacheSpark/usb/usb/spark/bin/README.md:0+673
14/12/30 23:20:21 INFO storage.MemoryStore: ensureFreeSpace(2032) called
with curMem=34047, maxMem=280248975
14/12/30 23:20:21 INFO storage.MemoryStore: Block rdd_1_0 stored as values
in memory (estimated size 2032.0 B, free 267.2 MB)
14/12/30 23:20:21 INFO storage.BlockManagerInfo: Added rdd_1_0 in memory on
zealot:61452 (size: 2032.0 B, free: 267.3 MB)
14/12/30 23:20:21 INFO storage.BlockManagerMaster: Updated info of block
rdd_1_0
14/12/30 23:20:21 INFO executor.Executor: Finished task 0.0 in stage 0.0
(TID 0). 2300 bytes result sent to driver
14/12/30 23:20:21 INFO scheduler.TaskSetManager: Starting task 1.0 in stage
0.0 (TID 1, localhost, PROCESS_LOCAL, 1264 bytes)
14/12/30 23:20:21 INFO executor.Executor: Running task 1.0 in stage 0.0
(TID 1)
14/12/30 23:20:21 INFO spark.CacheManager: Partition rdd_1_1 not found,
computing it
14/12/30 23:20:21 INFO rdd.HadoopRDD: Input split:
file:/E:/ApacheSpark/usb/usb/spark/bin/README.md:673+673
14/12/30 23:20:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage
0.0 (TID 0) in 3507 ms on localhost (1/2)
14/12/30 23:20:21 INFO storage.MemoryStore: ensureFreeSpace(1912) called
with curMem=36079, maxMem=280248975
14/12/30 23:20:21 INFO storage.MemoryStore: Block rdd_1_1 stored as values
in memory (estimated size 1912.0 B, free 267.2 MB)
14/12/30 23:20:21 INFO storage.BlockManagerInfo: Added rdd_1_1 in memory on
zealot:61452 (size: 1912.0 B, free: 267.3 MB)
14/12/30 23:20:21 INFO storage.BlockManagerMaster: Updated info of block
rdd_1_1
14/12/30 23:20:21 INFO executor.Executor: Finished task 1.0 in stage 0.0
(TID 1). 2300 bytes result sent to driver
14/12/30 23:20:21 INFO scheduler.TaskSetManager: Finished task 1.0 in stage
0.0 (TID 1) in 261 ms on localhost (2/2)
14/12/30 23:20:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,
whose tasks have all completed, from pool
14/12/30 23:20:21 INFO scheduler.DAGScheduler: Stage 0 (count at
Test1.scala:19) finished in 3.811 s
14/12/30 23:20:21 INFO spark.SparkContext: Job finished: count at
Test1.scala:19, took 3.997365232 s
14/12/30 23:20:21 INFO spark.SparkContext: Starting job: count at
Test1.scala:20
14/12/30 23:20:21 INFO scheduler.DAGScheduler: Got job 1 (count at
Test1.scala:20) with 2 output partitions (allowLocal=false)
14/12/30 23:20:21 INFO scheduler.DAGScheduler: Final stage: Stage 1(count
at Test1.scala:20)
14/12/30 23:20:21 INFO scheduler.DAGScheduler: Parents of final stage:
List()
14/12/30 23:20:21 INFO scheduler.DAGScheduler: Missing parents: List()
14/12/30 23:20:21 INFO scheduler.DAGScheduler: Submitting Stage 1
(FilteredRDD[3] at filter at Test1.scala:20), which has no missing parents
14/12/30 23:20:21 INFO storage.MemoryStore: ensureFreeSpace(2600) called
with curMem=37991, maxMem=280248975
14/12/30 23:20:21 INFO storage.MemoryStore: Block broadcast_2 stored as
values in memory (estimated size 2.5 KB, free 267.2 MB)
14/12/30 23:20:21 INFO scheduler.DAGScheduler: Submitting 2 missing tasks
from Stage 1 (FilteredRDD[3] at filter at Test1.scala:20)
14/12/30 23:20:21 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
with 2 tasks
14/12/30 23:20:21 INFO scheduler.TaskSetManager: Starting task 0.0 in stage
1.0 (TID 2, localhost, ANY, 1264 bytes)
14/12/30 23:20:21 INFO executor.Executor: Running task 0.0 in stage 1.0
(TID 2)
14/12/30 23:20:21 INFO storage.BlockManager: Found block rdd_1_0 locally
14/12/30 23:20:21 INFO executor.Executor: Finished task 0.0 in stage 1.0
(TID 2). 1731 bytes result sent to driver
14/12/30 23:20:21 INFO scheduler.TaskSetManager: Starting task 1.0 in stage
1.0 (TID 3, localhost, ANY, 1264 bytes)
14/12/30 23:20:21 INFO executor.Executor: Running task 1.0 in stage 1.0
(TID 3)
14/12/30 23:20:21 INFO storage.BlockManager: Found block rdd_1_1 locally
14/12/30 23:20:21 INFO executor.Executor: Finished task 1.0 in stage 1.0
(TID 3). 1731 bytes result sent to driver
14/12/30 23:20:21 INFO scheduler.TaskSetManager: Finished task 1.0 in stage
1.0 (TID 3) in 7 ms on localhost (1/2)
14/12/30 23:20:21 INFO scheduler.TaskSetManager: Finished task 0.0 in stage
1.0 (TID 2) in 16 ms on localhost (2/2)
14/12/30 23:20:21 INFO scheduler.DAGScheduler: Stage 1 (count at
Test1.scala:20) finished in 0.016 s
14/12/30 23:20:21 INFO 

Re: Fwd: Sample Spark Program Error

2014-12-31 Thread Naveen Madhire
Yes, the exception is gone now after adding stop() at the end.
Can you please tell me what stop() does at the end? Does it shut down
the SparkContext?
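
A minimal sketch of the lifecycle RK describes below: stop() shuts the SparkContext down cleanly (stopping the scheduler and releasing its resources) rather than leaving that to JVM shutdown, which is where the earlier exception appeared.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("Simple App").setMaster("local"))
try {
  val logData = sc.textFile("README.md", 2).cache()
  println(logData.filter(_.contains("a")).count())
} finally {
  sc.stop()   // always stop the context, even if the job above throws
}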

On Wed, Dec 31, 2014 at 10:09 AM, RK prk...@yahoo.com wrote:

 If you look at your program output closely, you can see the following
 output.
 Lines with a: 24, Lines with b: 15

 The exception seems to be happening with Spark cleanup after executing
 your code. Try adding sc.stop() at the end of your program to see if the
 exception goes away.




   On Wednesday, December 31, 2014 6:40 AM, Naveen Madhire 
 vmadh...@umail.iu.edu wrote:




 Hi All,

 I am trying to run a sample Spark program using Scala SBT,

 Below is the program,

 def main(args: Array[String]) {

   val logFile = "E:/ApacheSpark/usb/usb/spark/bin/README.md" // Should be some file on your system
   val sc = new SparkContext("local", "Simple App",
     "E:/ApacheSpark/usb/usb/spark/bin", List("target/scala-2.10/sbt2_2.10-1.0.jar"))
   val logData = sc.textFile(logFile, 2).cache()

   val numAs = logData.filter(line => line.contains("a")).count()
   val numBs = logData.filter(line => line.contains("b")).count()

   println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

 }


 Below is the error log,


 14/12/30 23:20:21 INFO rdd.HadoopRDD: Input split:
 file:/E:/ApacheSpark/usb/usb/spark/bin/README.md:0+673
 14/12/30 23:20:21 INFO storage.MemoryStore: ensureFreeSpace(2032) called
 with curMem=34047, maxMem=280248975
 14/12/30 23:20:21 INFO storage.MemoryStore: Block rdd_1_0 stored as values
 in memory (estimated size 2032.0 B, free 267.2 MB)
 14/12/30 23:20:21 INFO storage.BlockManagerInfo: Added rdd_1_0 in memory
 on zealot:61452 (size: 2032.0 B, free: 267.3 MB)
 14/12/30 23:20:21 INFO storage.BlockManagerMaster: Updated info of block
 rdd_1_0
 14/12/30 23:20:21 INFO executor.Executor: Finished task 0.0 in stage 0.0
 (TID 0). 2300 bytes result sent to driver
 14/12/30 23:20:21 INFO scheduler.TaskSetManager: Starting task 1.0 in
 stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1264 bytes)
 14/12/30 23:20:21 INFO executor.Executor: Running task 1.0 in stage 0.0
 (TID 1)
 14/12/30 23:20:21 INFO spark.CacheManager: Partition rdd_1_1 not found,
 computing it
 14/12/30 23:20:21 INFO rdd.HadoopRDD: Input split:
 file:/E:/ApacheSpark/usb/usb/spark/bin/README.md:673+673
 14/12/30 23:20:21 INFO scheduler.TaskSetManager: Finished task 0.0 in
 stage 0.0 (TID 0) in 3507 ms on localhost (1/2)
 14/12/30 23:20:21 INFO storage.MemoryStore: ensureFreeSpace(1912) called
 with curMem=36079, maxMem=280248975
 14/12/30 23:20:21 INFO storage.MemoryStore: Block rdd_1_1 stored as values
 in memory (estimated size 1912.0 B, free 267.2 MB)
 14/12/30 23:20:21 INFO storage.BlockManagerInfo: Added rdd_1_1 in memory
 on zealot:61452 (size: 1912.0 B, free: 267.3 MB)
 14/12/30 23:20:21 INFO storage.BlockManagerMaster: Updated info of block
 rdd_1_1
 14/12/30 23:20:21 INFO executor.Executor: Finished task 1.0 in stage 0.0
 (TID 1). 2300 bytes result sent to driver
 14/12/30 23:20:21 INFO scheduler.TaskSetManager: Finished task 1.0 in
 stage 0.0 (TID 1) in 261 ms on localhost (2/2)
 14/12/30 23:20:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,
 whose tasks have all completed, from pool
 14/12/30 23:20:21 INFO scheduler.DAGScheduler: Stage 0 (count at
 Test1.scala:19) finished in 3.811 s
 14/12/30 23:20:21 INFO spark.SparkContext: Job finished: count at
 Test1.scala:19, took 3.997365232 s
 14/12/30 23:20:21 INFO spark.SparkContext: Starting job: count at
 Test1.scala:20
 14/12/30 23:20:21 INFO scheduler.DAGScheduler: Got job 1 (count at
 Test1.scala:20) with 2 output partitions (allowLocal=false)
 14/12/30 23:20:21 INFO scheduler.DAGScheduler: Final stage: Stage 1(count
 at Test1.scala:20)
 14/12/30 23:20:21 INFO scheduler.DAGScheduler: Parents of final stage:
 List()
 14/12/30 23:20:21 INFO scheduler.DAGScheduler: Missing parents: List()
 14/12/30 23:20:21 INFO scheduler.DAGScheduler: Submitting Stage 1
 (FilteredRDD[3] at filter at Test1.scala:20), which has no missing parents
 14/12/30 23:20:21 INFO storage.MemoryStore: ensureFreeSpace(2600) called
 with curMem=37991, maxMem=280248975
 14/12/30 23:20:21 INFO storage.MemoryStore: Block broadcast_2 stored as
 values in memory (estimated size 2.5 KB, free 267.2 MB)
 14/12/30 23:20:21 INFO scheduler.DAGScheduler: Submitting 2 missing tasks
 from Stage 1 (FilteredRDD[3] at filter at Test1.scala:20)
 14/12/30 23:20:21 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
 with 2 tasks
 14/12/30 23:20:21 INFO scheduler.TaskSetManager: Starting task 0.0 in
 stage 1.0 (TID 2, localhost, ANY, 1264 bytes)
 14/12/30 23:20:21 INFO executor.Executor: Running task 0.0 in stage 1.0
 (TID 2)
 14/12/30 23:20:21 INFO storage.BlockManager: Found block rdd_1_0 locally
 14/12/30 23:20:21 INFO executor.Executor: Finished task 0.0 in stage 1.0
 (TID 2). 1731 bytes result sent to driver
 14/12/30 23:20:21 INFO scheduler.TaskSetManager: Starting task 1.0 in
 stage

python: module pyspark.daemon not found

2014-12-29 Thread Naveen Kumar Pokala
), partitions).map(f).reduce(add)
  File C:\Users\npokala\Downloads\spark-master\python\pyspark\rdd.py, line 
715, in reduce
vals = self.mapPartitions(func).collect()
  File C:\Users\npokala\Downloads\spark-master\python\pyspark\rdd.py, line 
676, in collect
bytesInJava = self._jrdd.collect().iterator()
  File 
C:\Users\npokala\Downloads\spark-master\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py,
 line 538, in __call__
  File 
C:\Users\npokala\Downloads\spark-master\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py,
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
6, nj09mhf0730.mhf.mhc): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  
/home/npokala/data/spark-install/spark-master/python:/home/npokala/data/spark-install/spark-master/python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar:/home/npokala/data/spark-install/spark-master/sbin/../python/lib/py4j-0.8.2.1-src.zip:/home/npokala/data/spark-install/spark-master/sbin/../python:
java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:392)
   at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
   at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
   at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:265)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:

   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)

   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)

   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)

   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)

   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)

   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)

   at scala.Option.foreach(Option.scala:236)

   at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)

   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)

   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)

   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

   at akka.actor.ActorCell.invoke(ActorCell.scala:487)

   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)

   at akka.dispatch.Mailbox.run(Mailbox.scala:220)

   at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)

   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

   at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

   at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

   at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Can anyone please suggest how I can resolve this issue?

-Naveen


Spark Job submit

2014-11-26 Thread Naveen Kumar Pokala
Hi.

Is there a way to submit a Spark job to a Hadoop YARN cluster from Java code?

-Naveen
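
One hedged sketch: the org.apache.spark.launcher.SparkLauncher API (added in Spark 1.4, i.e. after this thread) is the supported way to start a spark-submit from JVM code. The jar path, main class and memory setting below are hypothetical placeholders.

import org.apache.spark.launcher.SparkLauncher

val sparkSubmit = new SparkLauncher()
  .setAppResource("/path/to/myapp.jar")        // hypothetical application jar
  .setMainClass("com.example.MyDriver")        // hypothetical driver class
  .setMaster("yarn-cluster")
  .setConf("spark.executor.memory", "4g")
  .launch()                                    // returns a java.lang.Process
sparkSubmit.waitFor()                          // block until spark-submit exits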


RE: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread Naveen Kumar Pokala
Hi,

While submitting your Spark job, pass --executor-cores 2 --num-executors 24;
it will divide the dataset into 24*2 parquet files.

Or set spark.default.parallelism to a value like 50 on the SparkConf object. It will
divide the dataset into 50 files in your HDFS.
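
A minimal Scala sketch of the same idea, assuming Spark 1.1's createSchemaRDD implicit and hypothetical HDFS paths: the number of parquet part files equals the number of partitions in the RDD at save time, so repartitioning before the save is the direct lever.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-output-control"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD              // implicit RDD[case class] -> SchemaRDD

case class Record(id: Int, value: String)      // hypothetical schema
sc.textFile("hdfs://master/data/in")           // hypothetical input of ~1000 files
  .map(line => Record(line.hashCode, line))
  .repartition(10)                             // 10 partitions -> ~10 parquet part files
  .saveAsParquetFile("hdfs://master/data/out")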


-Naveen

-Original Message-
From: tridib [mailto:tridib.sama...@live.com] 
Sent: Tuesday, November 25, 2014 9:54 AM
To: u...@spark.incubator.apache.org
Subject: Control number of parquet generated from JavaSchemaRDD

Hello,
I am reading around 1000 input files from disk into an RDD and generating 
parquet. It always produces the same number of parquet files as input 
files. I tried to merge them using 

rdd.coalesce(n) and/or rdd.repartition(n).
I also tried using:

int MB_128 = 128*1024*1024;
sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);

No luck.
Is there a way to control the size/number of parquet files generated?

Thanks
Tridib



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Control-number-of-parquet-generated-from-JavaSchemaRDD-tp19717.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org





Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Naveen Kumar Pokala
Hi,

I want to submit my Spark program from my machine to a YARN cluster in yarn 
client mode.

How do I specify all the required details through the Spark submitter?

Please provide me some details.

-Naveen.


Re: Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Naveen Kumar Pokala
Hi Akhil,

But the driver and YARN are on different networks. How do I specify the (export 
HADOOP_CONF_DIR=XXX) path?

For example, the driver runs from my Windows machine and YARN is on a Unix machine on a 
different network.

-Naveen.

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, November 24, 2014 4:08 PM
To: Naveen Kumar Pokala
Cc: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: Submit Spark driver on Yarn Cluster in client mode

You can export the hadoop configurations dir (export HADOOP_CONF_DIR=XXX) in 
the environment and then submit it like:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \  # can also be `yarn-client` for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

More details over here 
https://spark.apache.org/docs/1.1.0/submitting-applications.html#launching-applications-with-spark-submit

Thanks
Best Regards

On Mon, Nov 24, 2014 at 4:01 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:
Hi,

I want to submit my Spark program from my machine to a YARN cluster in yarn 
client mode.

How do I specify all the required details through the Spark submitter?

Please provide me some details.

-Naveen.



RE: Null pointer exception with larger datasets

2014-11-18 Thread Naveen Kumar Pokala
Thanks Akhil.

-Naveen.


From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, November 18, 2014 1:19 PM
To: Naveen Kumar Pokala
Cc: user@spark.apache.org
Subject: Re: Null pointer exception with larger datasets

Make sure your list is not null; if it is null then it's more like doing:

JavaRDD<Student> distData = sc.parallelize(null)

distData.foreach(println)


Thanks
Best Regards

On Tue, Nov 18, 2014 at 12:07 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:
Hi,

I have a list of Students of size one lakh (100,000) and I am trying to save it to a 
file. It is throwing a null pointer exception.

JavaRDD<Student> distData = sc.parallelize(list);

distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");


14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 
(TID 5, master): java.lang.NullPointerException:
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)


How to handle this?

-Naveen
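
A hedged sketch of the defensive pattern Akhil points at, assuming an existing SparkContext sc: the NPE surfaces inside saveAsTextFile when toString is called, which usually means either the list itself or individual elements are null. Student and the sample data below are hypothetical.

import scala.collection.JavaConverters._

case class Student(id: Int, name: String)
val javaList = new java.util.ArrayList[Student]()
(1 to 100000).foreach(i => javaList.add(Student(i, "name-" + i)))

require(javaList != null, "student list must not be null")
val distData = sc.parallelize(javaList.asScala.filter(_ != null))   // drop null rows defensively
distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt")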



HDFS read text file

2014-11-17 Thread Naveen Kumar Pokala
Hi,


JavaRDD<Instrument> studentsData = sc.parallelize(list); // list is Student Info, List<Student>

studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");

The above statements saved the student information in HDFS as a text file. 
Each object is a line in the text file, as below.

[inline screenshot of the saved text file, one serialized Student per line; image not preserved]

How do I read that file back, i.e. each line as a Student object?

-Naveen
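
Two common options, sketched in the Scala API under the assumption that Student is a serializable case class with the fields shown, sc is the existing context, and the HDFS paths follow the question.

case class Student(id: Int, name: String)                       // hypothetical fields
val studentsData = sc.parallelize(Seq(Student(1, "a"), Student(2, "b")))

// 1) Save/read serialized objects instead of toString lines
studentsData.saveAsObjectFile("hdfs://master/data/spark/instruments.obj")
val readBack = sc.objectFile[Student]("hdfs://master/data/spark/instruments.obj")

// 2) If it must stay a text file, write a parseable line format and map it back
studentsData.map(s => s.id + "," + s.name)
            .saveAsTextFile("hdfs://master/data/spark/instruments.txt")
val parsed = sc.textFile("hdfs://master/data/spark/instruments.txt")
              .map(_.split(",")).map(a => Student(a(0).toInt, a(1)))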


Null pointer exception with larger datasets

2014-11-17 Thread Naveen Kumar Pokala
Hi,

I have a list of Students of size one lakh (100,000) and I am trying to save it to a 
file. It is throwing a null pointer exception.

JavaRDD<Student> distData = sc.parallelize(list);

distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");


14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 
(TID 5, master): java.lang.NullPointerException:
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)


How to handle this?

-Naveen


Spark GCLIB error

2014-11-13 Thread Naveen Kumar Pokala
Hi,

I am receiving the following error when I am trying to run a sample Spark program.


Caused by: java.lang.UnsatisfiedLinkError: 
/tmp/hadoop-npokala/nm-local-dir/usercache/npokala/appcache/application_1415881017544_0003/container_1415881017544_0003_01_01/tmp/snappy-1.0.5.3-ec0ee911-2f8c-44c0-ae41-5e0b2727d880-libsnappyjava.so:
 /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by 
/tmp/hadoop-npokala/nm-local-dir/usercache/npokala/appcache/application_1415881017544_0003/container_1415881017544_0003_01_01/tmp/snappy-1.0.5.3-ec0ee911-2f8c-44c0-ae41-5e0b2727d880-libsnappyjava.so)
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1929)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1814)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1083)
at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39)
... 29 more


-Naveen.


RE: scala.MatchError

2014-11-12 Thread Naveen Kumar Pokala
Hi,

Do you mean that with Java, I shouldn't have the Issue class as a property (attribute) 
of the Instrument class?

Ex:

class Issue {
    int a;
}

class Instrument {

    Issue issue;

}


How about Scala? Does it support such user-defined datatypes in classes, e.g.

case class Issue ?


case class Issue( a:Int = 0)

case class Instrument(issue: Issue = null)




-Naveen
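
A minimal Scala sketch of the nested-case-class route, assuming Spark 1.1's createSchemaRDD implicit: the nested case class becomes a struct column, so the inner field can be queried as issue.a.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Issue(a: Int = 0)
case class Instrument(issue: Issue = Issue())

val sc = new SparkContext(new SparkConf().setAppName("nested-case-class").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD              // implicit RDD[Product] -> SchemaRDD

val instruments = sc.parallelize(Seq(Instrument(Issue(1)), Instrument(Issue(2))))
instruments.registerTempTable("instruments")
sqlContext.sql("SELECT issue.a FROM instruments").collect().foreach(println)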

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, November 12, 2014 12:09 AM
To: Xiangrui Meng
Cc: Naveen Kumar Pokala; user@spark.apache.org
Subject: Re: scala.MatchError

Xiangrui is correct that is must be a java bean, also nested classes are not 
yet supported in java.

On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng 
men...@gmail.com wrote:
I think you need a Java bean class instead of a normal class. See
example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
(switch to the java tab). -Xiangrui

On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
 Hi,



 This is my Instrument java constructor.



 public Instrument(Issue issue, Issuer issuer, Issuing issuing) {

 super();

 this.issue = issue;

 this.issuer = issuer;

 this.issuing = issuing;

 }





 I am trying to create javaschemaRDD



 JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData,
 Instrument.class);



 Remarks:

 



 Instrument, Issue, Issuer, Issuing all are java classes



 distData is holding List<Instrument>





 I am getting the following error.







 Exception in thread Driver java.lang.reflect.InvocationTargetException

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:483)

 at
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

 Caused by: scala.MatchError: class sample.spark.test.Issue (of class
 java.lang.Class)

 at
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)

 at
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

 at
 org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)

 at
 org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)

 at sample.spark.test.SparkJob.main(SparkJob.java:33)

 ... 5 more



 Please help me.



 Regards,

 Naveen.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
[inline screenshot of the Spark SQL configuration properties table from the docs; image not preserved]

Hi,

How do I set the above properties on JavaSQLContext? I am not able to see a 
setConf method on the JavaSQLContext object.

I have added spark core jar and spark assembly jar to my build path. And I am 
using spark 1.1.0 and hadoop 2.4.0

--Naveen


RE: Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
Thanks Akhil.
-Naveen

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, November 12, 2014 6:38 PM
To: Naveen Kumar Pokala
Cc: user@spark.apache.org
Subject: Re: Spark SQL configurations

JavaSQLContext.sqlContext.setConf is available.
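
For example, a small sketch of that call (spark.sql.shuffle.partitions is one of the documented Spark SQL properties):

import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.api.java.JavaSQLContext

val jsc = new JavaSparkContext("local[*]", "sqlconf-demo")
val javaSqlCtx = new JavaSQLContext(jsc)
javaSqlCtx.sqlContext.setConf("spark.sql.shuffle.partitions", "50")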

Thanks
Best Regards

On Wed, Nov 12, 2014 at 5:14 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:
[inline screenshot of the Spark SQL configuration properties table from the docs; image not preserved]

Hi,

How do I set the above properties on JavaSQLContext? I am not able to see a 
setConf method on the JavaSQLContext object.

I have added spark core jar and spark assembly jar to my build path. And I am 
using spark 1.1.0 and hadoop 2.4.0

--Naveen



Snappy error with Spark SQL

2014-11-12 Thread Naveen Kumar Pokala
HI,

I am facing the following problem when I am trying to save my RDD as parquet 
File.


14/11/12 07:43:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
(TID 48,): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:236)
org.xerial.snappy.Snappy.clinit(Snappy.java:48)
parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:64)

org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)

org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)

parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:109)

parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:70)

parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:119)
parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:199)

parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:108)

parquet.hadoop.InternalParquetRecordWriter.flushStore(InternalParquetRecordWriter.java:146)

parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:110)
parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)

org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305)

org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)

org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
14/11/12 07:43:59 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 1.0 
(TID 51,): java.lang.NoClassDefFoundError: Could not initialize class 
org.xerial.snappy.Snappy
parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:64)

org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)

org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)

parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:109)

parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:70)

parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:119)
parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:199)

parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:108)

parquet.hadoop.InternalParquetRecordWriter.flushStore(InternalParquetRecordWriter.java:146)

parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:110)
parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)

org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305)

org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)

org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)


Please help me.

Regards,
Naveen.
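
A heavily hedged workaround sketch, assuming your build exposes the spark.sql.parquet.compression.codec property listed in the Spark SQL configuration docs: writing Parquet with gzip instead of Snappy avoids loading the native snappy library until the underlying libstdc++/GLIBCXX problem from the earlier thread is fixed. sqlContext and schemaRDD stand for the SQLContext and SchemaRDD already in your job.

// Switch Parquet compression away from the Snappy native library
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
schemaRDD.saveAsParquetFile("hdfs://master/data/out")   // hypothetical output path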





save as file

2014-11-11 Thread Naveen Kumar Pokala
Hi,

I am on Spark 1.1.0. I need help with saving an RDD as a JSON file.

How do I do that? And how do I mention the HDFS path in the program?


-Naveen
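
Spark 1.1 has no built-in "save as JSON" (SchemaRDD.toJSON arrived in a later release), so the usual approach is to map each element to a JSON string and call saveAsTextFile with an HDFS URI. A minimal sketch with a hypothetical case class and path, assuming an existing SparkContext sc:

case class Student(id: Int, name: String)
val rdd = sc.parallelize(Seq(Student(1, "a"), Student(2, "b")))
rdd.map(s => s"""{"id":${s.id},"name":"${s.name}"}""")
   .saveAsTextFile("hdfs://master/data/spark/students-json")   // HDFS path given as a URI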




scala.MatchError

2014-11-11 Thread Naveen Kumar Pokala
Hi,

This is my Instrument java constructor.

public Instrument(Issue issue, Issuer issuer, Issuing issuing) {
super();
this.issue = issue;
this.issuer = issuer;
this.issuing = issuing;
}


I am trying to create javaschemaRDD

JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, 
Instrument.class);

Remarks:


Instrument, Issue, Issuer, Issuing all are java classes

distData is holding List<Instrument>


I am getting the following error.



Exception in thread Driver java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: scala.MatchError: class sample.spark.test.Issue (of class 
java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at sample.spark.test.SparkJob.main(SparkJob.java:33)
... 5 more

Please help me.

Regards,
Naveen.


Parallelize on spark context

2014-11-06 Thread Naveen Kumar Pokala
Hi,

JavaRDD<Integer> distData = sc.parallelize(data);

On what basis does parallelize split the data into multiple datasets (partitions)? How do we 
control how many of these datasets are executed per executor?

For example, my data is a list of 1000 integers and I have a 2-node YARN 
cluster. It is dividing the data into 2 batches of size 500.

Regards,
Naveen.


RE: Parallelize on spark context

2014-11-06 Thread Naveen Kumar Pokala
Hi,

In the documentation I found something like this.

spark.default.parallelism

· Local mode: number of cores on the local machine
· Mesos fine grained mode: 8
· Others: total number of cores on all executor nodes or 2, whichever 
is larger


I am using a 2-node cluster with 48 cores (24+24). As per the above, the datasets 
should be of size 1000/48 = 20.83, i.e. around 20 or 21 each.

But it is dividing the data into 2 sets of size 500 each.

I have used the function sc.parallelize(data, 10), which gives 10 datasets of size 100, 
with 8 datasets executing on one node and 2 datasets on the other node.

How can I check how many cores are running to complete the tasks of those 8 datasets? (Is 
there any command or UI to check that?)

Regards,
Naveen.
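
A small sketch pulling the two points together, assuming a plain local run: the second argument to parallelize fixes the partition count, rdd.partitions.length lets you verify it, and the cores actually in use per executor are visible on the application web UI (driver host, port 4040, Executors tab) rather than through an RDD call.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partitions-demo").setMaster("local[4]"))
val data = (0 until 1000).toList
val defaultSplit = sc.parallelize(data)        // partition count = spark.default.parallelism
println(defaultSplit.partitions.length)
val tenSplit = sc.parallelize(data, 10)        // explicit request: 10 partitions
println(tenSplit.partitions.length)            // => 10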


From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of 
Holden Karau
Sent: Friday, November 07, 2014 12:46 PM
To: Naveen Kumar Pokala
Cc: user@spark.apache.org
Subject: Re: Parallelize on spark context

Hi Naveen,

So by default when we call parallelize it will be parallelized by the default 
number (which we can control with the property spark.default.parallelism) or if 
we just want a specific instance of parallelize to have a different number of 
partitions, we can instead call sc.parallelize(data, numpartitions). The 
default value of this is documented in 
http://spark.apache.org/docs/latest/configuration.html#spark-properties

Cheers,

Holden :)

On Thu, Nov 6, 2014 at 10:43 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:
Hi,

JavaRDD<Integer> distData = sc.parallelize(data);

On what basis does parallelize split the data into multiple datasets (partitions)? How do we 
control how many of these datasets are executed per executor?

For example, my data is a list of 1000 integers and I have a 2-node YARN 
cluster. It is dividing the data into 2 batches of size 500.

Regards,
Naveen.



--
Cell : 425-233-8271


Number cores split up

2014-11-05 Thread Naveen Kumar Pokala
Hi,

I have a 2-node YARN cluster and I am using Spark 1.1.0 to submit my tasks.

As per the Spark documentation, the number of cores defaults to the maximum cores available. 
So does that mean each node creates as many threads as it has cores to process the 
job assigned to that node?

For ex,

List<Integer> data = new ArrayList<Integer>();
for (int i = 0; i < 1000; i++)
    data.add(i);

JavaRDD<Integer> distData = sc.parallelize(data);

distData = distData.map(
    new Function<Integer, Integer>() {

        @Override
        public Integer call(Integer arg0) throws Exception {
            return arg0 * arg0;
        }

    }
);

distData.count();

The above program divides my RDD into 2 batches of size 500 and submits them to 
the executors.


1)  So each executor will use all the cores of the node's CPU to process a 500-element 
batch, am I right?


2)  If so, does it mean each executor uses multi-threading? Is that execution 
parallel or sequential on the node?


3)  How can I check how many cores an executor is using to process my jobs?


4)  Is there any way to control the batch division across nodes?


Please give some clarity on the above.

Thanks  Regards,
Naveen


Spark Debugging

2014-10-30 Thread Naveen Kumar Pokala
Hi,

  I have installed a 2-node Hadoop cluster (for example, on Unix machines A and 
B: A is the master node and a data node, B is a data node).

  I am submitting my driver programs through Spark 1.1.0 with bin/spark-submit 
from a PuTTY client on my Windows machine.

I want to debug my program from Eclipse on my local machine, but I am not able to 
find a way to do so.

Please let me know how to debug my driver program as well as the executor 
programs.


Regards,

Naveen.
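
A hedged sketch of the usual remote-debugging setup: have the executor (or driver) JVM open a JDWP socket and attach Eclipse's "Remote Java Application" debug configuration to that host and port. The port (5005) and suspend=n are arbitrary choices here.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("debuggable-job")
  // Each executor JVM will listen for a debugger on port 5005
  .set("spark.executor.extraJavaOptions",
       "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")
val sc = new SparkContext(conf)

For the driver launched through bin/spark-submit, the same agentlib string can be passed via --driver-java-options (or spark.driver.extraJavaOptions in cluster mode), and Eclipse then attaches to the driver's host and port.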