DirectFileOutputCommitter in Spark 2.3.1

2018-09-19 Thread Priya Ch
Hello Team,

I am trying to write a Dataset as a parquet file in Append mode, partitioned
by a few columns. However, since the job is time-consuming, I would like to
enable DirectFileOutputCommitter (i.e. bypass the writes to a temporary
folder).

The version of Spark I am using is 2.3.1.

Can someone please help with enabling the configuration that allows direct
writes to S3, for appending, writing new files and overwriting files?
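
For reference, a hedged sketch of how such settings are usually passed. The
committer algorithm key below is a standard Hadoop option, not a true
"direct" committer, but it skips the per-task rename into the job-level
temporary folder, which removes much of the commit cost on S3; the app name,
bucket paths and partition columns are placeholders, not values from this
thread.

import org.apache.spark.sql.SparkSession

// Hedged sketch: algorithm version 2 avoids the second rename pass of the
// default FileOutputCommitter. All paths and column names are placeholders.
val spark = SparkSession.builder()
  .appName("parquet-append-example")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

spark.read.parquet("s3a://source-bucket/input")        // placeholder input
  .write
  .mode("append")
  .partitionBy("year", "month")                        // placeholder columns
  .parquet("s3a://target-bucket/output")               // placeholder output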

Thanks,
Padma CH


Video analytics on Spark

2016-09-09 Thread Priya Ch
Hi All,

I have video surveillance data that needs to be processed in Spark. I am
going through Spark + OpenCV. How do I load .mp4 video into an RDD? Can we
do this directly, or does the video need to be converted to a SequenceFile?
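
One hedged starting point: Spark can read whole video files as binary blobs
with sc.binaryFiles; the actual frame decoding (OpenCV/JavaCV) would then
happen inside the map and is only hinted at here. The HDFS path below is a
placeholder.

import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: each element is (filePath, PortableDataStream) for one video.
val sc = new SparkContext(new SparkConf().setAppName("video-ingest"))

val videos = sc.binaryFiles("hdfs:///surveillance/videos/*.mp4")  // placeholder path

val sizes = videos.map { case (path, stream) =>
  val bytes = stream.toArray()   // raw file contents for this video
  // decode `bytes` into frames with an OpenCV/JavaCV call of your choice here
  (path, bytes.length)
}
sizes.collect().foreach(println)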

Thanks,
Padma CH


Re: Send real-time alert using Spark

2016-07-12 Thread Priya Ch
I mean that model training on the incoming data is taken care of by Spark.
For detected anomalies, we need to send an alert. Could we do this with Spark,
or would another framework like Akka or a REST API do it?

Thanks,
Padma CH
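
Along the lines of Marcin's reply below, a minimal sketch of one way to do
it: Spark flags the anomalies, and a plain JVM HTTP call (or any mail/SMS
client library) delivers the alert. The webhook URL and the `anomalies`
stream are hypothetical placeholders.

import java.net.{HttpURLConnection, URL}
import org.apache.spark.streaming.dstream.DStream

// Hedged sketch: POST each anomaly to an alerting webhook.
def sendAlert(message: String): Unit = {
  val conn = new URL("https://alerts.example.com/webhook")   // placeholder URL
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.getOutputStream.write(message.getBytes("UTF-8"))
  conn.getResponseCode          // force the request; check/log the status in real code
  conn.disconnect()
}

def alertOnAnomalies(anomalies: DStream[String]): Unit =
  anomalies.foreachRDD { rdd =>
    rdd.collect().foreach(sendAlert)   // anomalies are assumed to be few per batch
  }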

On Tue, Jul 12, 2016 at 7:30 PM, Marcin Tustin <mtus...@handybook.com>
wrote:

> Priya,
>
> You wouldn't necessarily "use spark" to send the alert. Spark is in an
> important sense one library among many. You can have your application use
> any other library available for your language to send the alert.
>
> Marcin
>
> On Tue, Jul 12, 2016 at 9:25 AM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> Hi All,
>>
>>  I am building a real-time anomaly detection system where I am using
>> k-means to detect anomalies. Now, in order to send an alert to a mobile or
>> an email, how do I send it using Spark itself?
>>
>> Thanks,
>> Padma CH
>>
>
>


Re: Spark Task failure with File segment length as negative

2016-07-06 Thread Priya Ch
Has anyone resolved this?


Thanks,
Padma CH

On Wed, Jun 22, 2016 at 4:39 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:

> Hi All,
>
> I am running a Spark application on 1.8 TB of data (stored in Hive
> tables). I am reading the data using HiveContext and processing it.
> The cluster has 5 nodes in total, with 25 cores and 250 GB per node. I
> am launching the application with 25 executors, with 5 cores each and 45 GB
> per executor. I have also specified the property
> spark.yarn.executor.memoryOverhead=2024.
>
> During the execution, tasks are lost and ShuffleMapTasks are re-submitted.
> I am seeing that tasks are failing with the following message -
>
> java.lang.IllegalArgumentException: requirement failed: File segment
> length cannot be negative (got -27045427)
>         at scala.Predef$.require(Predef.scala:233)
>         at org.apache.spark.storage.FileSegment.<init>(FileSegment.scala:28)
>         at org.apache.spark.storage.DiskBlockObjectWriter.fileSegment(DiskBlockObjectWriter.scala:220)
>         at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:184)
>         at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:398)
>         at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:206)
>         at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:89)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> I understood that it's because the shuffle block is > 2 GB: the Int value
> goes negative and throws the above exception.
>
> Can someone throw light on this? What is the fix for this?
>
> Thanks,
> Padma CH


Spark Task failure with File segment length as negative

2016-06-22 Thread Priya Ch
Hi All,

I am running a Spark application on 1.8 TB of data (stored in Hive
tables). I am reading the data using HiveContext and processing it.
The cluster has 5 nodes in total, with 25 cores and 250 GB per node. I
am launching the application with 25 executors, with 5 cores each and 45 GB
per executor. I have also specified the property
spark.yarn.executor.memoryOverhead=2024.

During the execution, tasks are lost and ShuffleMapTasks are re-submitted.
I am seeing that tasks are failing with the following message -

java.lang.IllegalArgumentException: requirement failed: File segment
length cannot be negative (got -27045427)
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.storage.FileSegment.<init>(FileSegment.scala:28)
        at org.apache.spark.storage.DiskBlockObjectWriter.fileSegment(DiskBlockObjectWriter.scala:220)
        at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:184)
        at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:398)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:206)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

I understood that it's because the shuffle block is > 2 GB: the Int value
goes negative and throws the above exception.

Can someone throw light on this? What is the fix for this?
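
For reference, the usual mitigation is to raise the shuffle parallelism so
that no single shuffle block approaches the 2 GB limit. A hedged sketch,
where the partition count and the query are placeholders rather than values
from this job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("large-shuffle-job"))
val hiveContext = new HiveContext(sc)

// Spread the 1.8 TB across many more reducers so each shuffle block stays
// well below 2 GB (the default of 200 partitions gives very large blocks here).
hiveContext.setConf("spark.sql.shuffle.partitions", "2000")

val df = hiveContext.sql("SELECT * FROM some_db.some_table")   // placeholder query
val processed = df.repartition(2000)                           // same idea for wide stages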

Thanks,
Padma CH


Spark Job Execution halts during shuffle...

2016-05-26 Thread Priya Ch
Hello Team,


 I am trying to join two RDDs, one of size 800 MB and the other of 190 MB.
During the join step, my job halts and I don't see any progress in the
execution.

This is the message I see on console -

INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
locations for shuffle 0 to :4
INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
locations for shuffle 1 to :4

After these messages, I don't see any progress. I am using Spark 1.6.0
with the YARN scheduler (running in YARN client mode). My cluster
configuration is a 3-node cluster (1 master and 2 slaves). Each slave has
1 TB of hard disk space, 300 GB of memory and 32 cores.

HDFS block size is 128 MB.
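
Given the sizes involved, one hedged option is a map-side (broadcast) join
that avoids the shuffle entirely, assuming the 190 MB side fits in memory
once collected as a map; the key/value types below are placeholders for the
real schema.

import org.apache.spark.rdd.RDD

// Hedged sketch: broadcast the small side and join map-side, so no shuffle
// is needed at all.
def broadcastJoin(bigRdd: RDD[(String, Int)],
                  smallRdd: RDD[(String, Int)]): RDD[(String, (Int, Int))] = {
  val smallMap = bigRdd.sparkContext.broadcast(smallRdd.collectAsMap())
  bigRdd.flatMap { case (k, v) =>
    smallMap.value.get(k).map(w => (k, (v, w)))   // keep only matching keys
  }
}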

Thanks,
Padma Ch


Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi All,

  I have two RDDs, A and B, where A is about 30 MB and B about 7 MB.
A.cartesian(B) is taking too much time. Is there any bottleneck in the
cartesian operation?

I am using Spark version 1.6.0.
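
Since B is only about 7 MB, a hedged alternative is to broadcast it and form
the cross product inside each partition of A, rather than using
RDD.cartesian, which shuffles every pair of partitions. A minimal sketch with
placeholder element types:

import org.apache.spark.rdd.RDD

// Hedged sketch: local cross product against a broadcast copy of B.
def localCartesian(a: RDD[String], b: RDD[String]): RDD[(String, String)] = {
  val bLocal = a.sparkContext.broadcast(b.collect())
  a.mapPartitions { iter =>
    iter.flatMap(x => bLocal.value.iterator.map(y => (x, y)))
  }
}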

Regards,
Padma Ch


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-06 Thread Priya Ch
Running 'lsof' will list the open files, but how do we find the root cause
behind opening too many files?

Thanks,
Padma CH

On Wed, Jan 6, 2016 at 8:39 AM, Hamel Kothari <hamelkoth...@gmail.com>
wrote:

> The "Too Many Files" part of the exception is just indicative of the fact
> that when that call was made, too many files were already open. It doesn't
> necessarily mean that that line is the source of all of the open files,
> that's just the point at which it hit its limit.
>
> What I would recommend is to try to run this code again and use "lsof" on
> one of the spark executors (perhaps run it in a for loop, writing the
> output to separate files) until it fails and see which files are being
> opened, if there's anything that seems to be taking up a clear majority
> that might key you in on the culprit.
>
> On Tue, Jan 5, 2016 at 9:48 PM Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> Yes, the FileInputStream is closed. Maybe I didn't show it in the
>> screenshot.
>>
>> As Spark implements a sort-based shuffle, there is a parameter called the
>> maximum merge factor which decides the number of files that can be merged
>> at once, and this avoids too many open files. I suspect that it is
>> something related to this.
>>
>> Can someone confirm this?
>>
>> On Tue, Jan 5, 2016 at 11:19 PM, Annabel Melongo <
>> melongo_anna...@yahoo.com> wrote:
>>
>>> Vijay,
>>>
>>> Are you closing the FileInputStream at the end of each loop
>>> (in.close())? My guess is that those streams aren't closed, hence the
>>> "too many open files" exception.
>>>
>>>
>>> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
>>> learnings.chitt...@gmail.com> wrote:
>>>
>>>
>>> Can some one throw light on this ?
>>>
>>> Regards,
>>> Padma Ch
>>>
>>> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch <learnings.chitt...@gmail.com>
>>> wrote:
>>>
>>> Chris, we are using Spark version 1.3.0. We have not set the
>>> spark.streaming.concurrentJobs parameter; it takes the default value.
>>>
>>> Vijay,
>>>
>>>   From the stack trace it is evident that
>>> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
>>> is throwing the exception. I opened the Spark source code and visited the
>>> line which throws this exception, i.e.
>>>
>>> [image: Inline image 1]
>>>
>>> The line which is marked in red is throwing the exception. The file is
>>> ExternalSorter.scala in the org.apache.spark.util.collection package.
>>>
>>> I went through the following blog
>>> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
>>> and understood that there is a merge factor which decides the number of
>>> on-disk files that can be merged. Is it in some way related to this?
>>>
>>> Regards,
>>> Padma CH
>>>
>>> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>
>>> and which version of Spark/Spark Streaming are you using?
>>>
>>> are you explicitly setting the spark.streaming.concurrentJobs to
>>> something larger than the default of 1?
>>>
>>> if so, please try setting that back to 1 and see if the problem still
>>> exists.
>>>
>>> this is a dangerous parameter to modify from the default - which is why
>>> it's not well-documented.
>>>
>>>
>>> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge <vijay.gha...@gmail.com>
>>> wrote:
>>>
>>> Few indicators -
>>>
>>> 1) During execution, check the total number of open files using the lsof
>>> command (needs root permissions). If it is a cluster, I am not sure.
>>> 2) Which exact line in the code is triggering this error? Can you paste
>>> that snippet?
>>>
>>>
>>> On Wednesday 23 December 2015, Priya Ch <learnings.chitt...@gmail.com>
>>> wrote:
>>>
>>> ulimit -n 65000
>>>
>>> fs.file-max = 65000 (in /etc/sysctl.conf)
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma <yash...@gmail.com> wrote:
>>>
>>> Could you share the ulimit for your setup please ?
>>> - Thanks, via mobile,  excuse brevity.
>>> On Dec 22, 2015 6:39 PM

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-06 Thread Priya Ch
The line of code which I highlighted in the screenshot is within the Spark
source code. Spark implements a sort-based shuffle, and the spilled files
are merged using merge sort.

Here is the link
https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf
which would convey the same.

On Wed, Jan 6, 2016 at 8:19 PM, Annabel Melongo <melongo_anna...@yahoo.com>
wrote:

> Priya,
>
> It would be helpful if you put the entire trace log along with your code
> to help determine the root cause of the error.
>
> Thanks
>
>
> On Wednesday, January 6, 2016 4:00 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Running 'lsof' will list the open files, but how do we find the root cause
> behind opening too many files?
>
> Thanks,
> Padma CH
>
> On Wed, Jan 6, 2016 at 8:39 AM, Hamel Kothari <hamelkoth...@gmail.com>
> wrote:
>
> The "Too Many Files" part of the exception is just indicative of the fact
> that when that call was made, too many files were already open. It doesn't
> necessarily mean that that line is the source of all of the open files,
> that's just the point at which it hit its limit.
>
> What I would recommend is to try to run this code again and use "lsof" on
> one of the spark executors (perhaps run it in a for loop, writing the
> output to separate files) until it fails and see which files are being
> opened, if there's anything that seems to be taking up a clear majority
> that might key you in on the culprit.
>
> On Tue, Jan 5, 2016 at 9:48 PM Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
> Yes, the FileInputStream is closed. Maybe I didn't show it in the
> screenshot.
>
> As Spark implements a sort-based shuffle, there is a parameter called the
> maximum merge factor which decides the number of files that can be merged
> at once, and this avoids too many open files. I suspect that it is
> something related to this.
>
> Can someone confirm this?
>
> On Tue, Jan 5, 2016 at 11:19 PM, Annabel Melongo <
> melongo_anna...@yahoo.com> wrote:
>
> Vijay,
>
> Are you closing the FileInputStream at the end of each loop (in.close())?
> My guess is that those streams aren't closed, hence the "too many open
> files" exception.
>
>
> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Can some one throw light on this ?
>
> Regards,
> Padma Ch
>
> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
> Chris, we are using Spark version 1.3.0. We have not set the
> spark.streaming.concurrentJobs parameter; it takes the default value.
>
> Vijay,
>
>   From the stack trace it is evident that
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
> is throwing the exception. I opened the Spark source code and visited the
> line which throws this exception, i.e.
>
> [image: Inline image 1]
>
> The line which is marked in red is throwing the exception. The file is
> ExternalSorter.scala in the org.apache.spark.util.collection package.
>
> I went through the following blog
> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
> and understood that there is a merge factor which decides the number of
> on-disk files that can be merged. Is it in some way related to this?
>
> Regards,
> Padma CH
>
> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly <ch...@fregly.com> wrote:
>
> and which version of Spark/Spark Streaming are you using?
>
> are you explicitly setting the spark.streaming.concurrentJobs to
> something larger than the default of 1?
>
> if so, please try setting that back to 1 and see if the problem still
> exists.
>
> this is a dangerous parameter to modify from the default - which is why
> it's not well-documented.
>
>
> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge <vijay.gha...@gmail.com>
> wrote:
>
> Few indicators -
>
> 1) During execution, check the total number of open files using the lsof
> command (needs root permissions). If it is a cluster, I am not sure.
> 2) Which exact line in the code is triggering this error? Can you paste
> that snippet?
>
>
> On Wednesday 23 December 2015, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
> ulimit -n 65000
>
> fs.file-max = 65000 (in /etc/sysctl.conf)
>
> Thanks,
> Padma Ch
>
> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma <yash...@gmail.com> wrote:
>
> Could you share the ulimit for your setup please ?
> - Thanks, via mobile,  excuse bre

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Yes, the FileInputStream is closed. Maybe I didn't show it in the
screenshot.

As Spark implements a sort-based shuffle, there is a parameter called the
maximum merge factor which decides the number of files that can be merged
at once, and this avoids too many open files. I suspect that it is
something related to this.

Can someone confirm this?

On Tue, Jan 5, 2016 at 11:19 PM, Annabel Melongo <melongo_anna...@yahoo.com>
wrote:

> Vijay,
>
> Are you closing the FileInputStream at the end of each loop (in.close())?
> My guess is that those streams aren't closed, hence the "too many open
> files" exception.
>
>
> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Can some one throw light on this ?
>
> Regards,
> Padma Ch
>
> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
> Chris, we are using Spark version 1.3.0. We have not set the
> spark.streaming.concurrentJobs parameter; it takes the default value.
>
> Vijay,
>
>   From the stack trace it is evident that
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
> is throwing the exception. I opened the Spark source code and visited the
> line which throws this exception, i.e.
>
> [image: Inline image 1]
>
> The line which is marked in red is throwing the exception. The file is
> ExternalSorter.scala in the org.apache.spark.util.collection package.
>
> I went through the following blog
> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
> and understood that there is a merge factor which decides the number of
> on-disk files that can be merged. Is it in some way related to this?
>
> Regards,
> Padma CH
>
> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly <ch...@fregly.com> wrote:
>
> and which version of Spark/Spark Streaming are you using?
>
> are you explicitly setting the spark.streaming.concurrentJobs to
> something larger than the default of 1?
>
> if so, please try setting that back to 1 and see if the problem still
> exists.
>
> this is a dangerous parameter to modify from the default - which is why
> it's not well-documented.
>
>
> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge <vijay.gha...@gmail.com>
> wrote:
>
> Few indicators -
>
> 1) During execution, check the total number of open files using the lsof
> command (needs root permissions). If it is a cluster, I am not sure.
> 2) Which exact line in the code is triggering this error? Can you paste
> that snippet?
>
>
> On Wednesday 23 December 2015, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
> ulimit -n 65000
>
> fs.file-max = 65000 (in /etc/sysctl.conf)
>
> Thanks,
> Padma Ch
>
> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma <yash...@gmail.com> wrote:
>
> Could you share the ulimit for your setup please ?
> - Thanks, via mobile,  excuse brevity.
> On Dec 22, 2015 6:39 PM, "Priya Ch" <learnings.chitt...@gmail.com> wrote:
>
> Jakob,
>
>Increased the settings like fs.file-max in /etc/sysctl.conf and also
> increased user limit in /etc/security/limits.conf. But still see the same
> issue.
>
> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky <joder...@gmail.com>
> wrote:
>
> It might be a good idea to see how many files are open and try increasing
> the open file limit (this is done on an os level). In some application
> use-cases it is actually a legitimate need.
>
> If that doesn't help, make sure you close any unused files and streams in
> your code. It will also be easier to help diagnose the issue if you send an
> error-reproducing snippet.
>
>
>
>
>
> --
> Regards,
> Vijay Gharge
>
>
>
>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>
>
>
>
>
>


Re: passing SparkContext as parameter

2015-09-21 Thread Priya Ch
Can I use this SparkContext on the executors?
In my application, I have a scenario of reading certain records from a DB
(Cassandra in our case) inside an RDD, and hence I need the SparkContext to
read from the DB.

If the SparkContext can't be sent to the executors, what is the workaround
for this?
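
For what it's worth, the usual workaround is to keep the SparkContext on the
driver and open a plain database connection inside each partition instead. A
hedged sketch using the DataStax Java driver, where the host, keyspace, table
and column names are placeholders:

import com.datastax.driver.core.Cluster
import org.apache.spark.rdd.RDD

// Hedged sketch: one connection per partition, lookups done on the executors.
def enrichFromCassandra(ids: RDD[String]): RDD[(String, Option[String])] =
  ids.mapPartitions { iter =>
    val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
    val session = cluster.connect("my_keyspace")
    val results = iter.map { id =>
      val row = session.execute(
        s"SELECT value FROM my_table WHERE id = '$id'").one()
      (id, Option(row).map(_.getString("value")))
    }.toList            // materialise before the connection is closed
    session.close()
    cluster.close()
    results.iterator
  }

With the spark-cassandra-connector on the classpath, its CassandraConnector
(or joinWithCassandraTable) helpers should achieve the same thing with proper
connection pooling.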

On Mon, Sep 21, 2015 at 3:06 PM, Petr Novak <oss.mli...@gmail.com> wrote:

> add @transient?
>
> On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> Hello All,
>>
>> How can I pass a SparkContext as a parameter to a method in an object?
>> Passing the SparkContext is giving me a TaskNotSerializable exception.
>>
>> How can I achieve this?
>>
>> Thanks,
>> Padma Ch
>>
>
>


Re: Spark Streaming..Exception

2015-09-14 Thread Priya Ch
Hi All,

 I came across the related old conversation on the above issue
(https://issues.apache.org/jira/browse/SPARK-5594). Is the issue fixed? I
tried different values for spark.cleaner.ttl (0s, -1s, 2000s, ...); none of
them worked. I also tried setting spark.streaming.unpersist -> true. What is
the possible solution for this? Is this a bug in Spark 1.3.0? Would changing
the scheduling mode to Standalone or Mesos work fine?

Please someone share your views on this.

On Sat, Sep 12, 2015 at 11:04 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:

> Hello All,
>
>  When I push messages into Kafka and read them into the streaming
> application, I see the following exception.
>  I am running the application on YARN and am nowhere broadcasting the
> message within the application - just reading the message, parsing it,
> populating fields in a class and then printing the DStream (using
> DStream.print).
>
>  I have no clue whether this is a cluster issue, a Spark version issue or a
> node issue. The strange part is that sometimes the message is processed,
> but sometimes I see the exception below -
>
> java.io.IOException: org.apache.spark.SparkException: Failed to get
> broadcast_5_piece0 of broadcast_5
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1155)
> at
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to get
> broadcast_5_piece0 of broadcast_5
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1152)
>
>
> I would be glad if someone can throw some light on this.
>
> Thanks,
> Padma Ch
>
>


Spark Streaming..Exception

2015-09-12 Thread Priya Ch
Hello All,

 When I push messages into Kafka and read them into the streaming
application, I see the following exception.
 I am running the application on YARN and am nowhere broadcasting the message
within the application - just reading the message, parsing it, populating
fields in a class and then printing the DStream (using DStream.print).

 I have no clue whether this is a cluster issue, a Spark version issue or a
node issue. The strange part is that sometimes the message is processed, but
sometimes I see the exception below -

java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_5_piece0 of broadcast_5
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1155)
at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get
broadcast_5_piece0 of broadcast_5
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1152)


I would be glad if someone can throw some light on this.

Thanks,
Padma Ch


Fwd: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread Priya Ch
Yes... union would be one solution. I am not doing any aggregation, hence
reduceByKey would not be useful. If I use groupByKey, messages with the same
key would be brought into one partition, but groupByKey is a very expensive
operation as it involves a shuffle. My ultimate goal is to write the messages
to Cassandra; if the messages with the same key are handled by different
streams, there would be concurrency issues. To resolve this, I can union the
DStreams and apply a hash partitioner so that all the same keys are brought
to a single partition, or do a groupByKey, which does the same.

As groupByKey is expensive, is there any workaround for this?
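
A minimal sketch of the union-plus-partitioner idea just described, assuming
(primaryKey, message) pairs; the types and the partition count are
placeholders:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hedged sketch: union the receiver streams and re-partition every batch by
// the primary key, so all records sharing a key land in the same partition
// (and can be written to Cassandra from a single task).
def coLocateByKey(ssc: StreamingContext,
                  streams: Seq[DStream[(String, String)]],
                  numPartitions: Int): DStream[(String, String)] =
  ssc.union(streams)
     .transform(_.partitionBy(new HashPartitioner(numPartitions)))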

On Thu, Jul 30, 2015 at 2:33 PM, Juan Rodríguez Hortalá 
juan.rodriguez.hort...@gmail.com wrote:

 Hi,

  Just my two cents. I understand your problem is that you have messages with
 the same key in two different dstreams. What I would do would be to make a
 union of all the dstreams with StreamingContext.union or several calls to
 DStream.union, and then I would create a pair dstream with the primary key as
 key, and then I'd use groupByKey or reduceByKey (or combineByKey etc.) to
 combine the messages with the same primary key.

 Hope that helps.

 Greetings,

 Juan


 2015-07-30 10:50 GMT+02:00 Priya Ch learnings.chitt...@gmail.com:

 Hi All,

  Can someone throw insights on this ?

 On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com
 wrote:



 Hi TD,

  Thanks for the info. I have the scenario like this.

  I am reading the data from kafka topic. Let's say kafka has 3
 partitions for the topic. In my streaming application, I would configure 3
 receivers with 1 thread each such that they would receive 3 dstreams (from
 3 partitions of kafka topic) and also I implement partitioner. Now there is
 a possibility of receiving messages with same primary key twice or more,
 one is at the time message is created and other times if there is an update
 to any fields for same message.

  If two messages M1 and M2 with the same primary key are read by 2 receivers,
 then even with the partitioner, Spark would still end up processing them in
 parallel, as they are altogether in different dstreams. How do we address
 this situation?

 Thanks,
 Padma Ch

 On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das t...@databricks.com
 wrote:

 You have to partition that data on the Spark Streaming by the primary
 key, and then make sure insert data into Cassandra atomically per key, or
 per set of keys in the partition. You can use the combination of the (batch
 time, and partition Id) of the RDD inside foreachRDD as the unique id for
 the data you are inserting. This will guard against multiple attempts to
 run the task that inserts into Cassandra.

 See
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations

 TD

 On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch 
 learnings.chitt...@gmail.com wrote:

 Hi All,

   I have a problem when writing streaming data to Cassandra. Our
  existing product is on Oracle DB, in which, while writing data, locks are
  maintained so that duplicates in the DB are avoided.

  But as Spark has a parallel processing architecture, if more than 1
  thread is trying to write the same data, i.e. with the same primary key, is
  there any scope to create duplicates? If yes, how do we address this
  problem, either from the Spark or the Cassandra side?

 Thanks,
 Padma Ch









Re: Writing streaming data to cassandra creates duplicates

2015-07-30 Thread Priya Ch
Hi All,

 Can someone throw insights on this ?

On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com
wrote:



 Hi TD,

  Thanks for the info. I have the scenario like this.

  I am reading the data from kafka topic. Let's say kafka has 3 partitions
 for the topic. In my streaming application, I would configure 3 receivers
 with 1 thread each such that they would receive 3 dstreams (from 3
 partitions of kafka topic) and also I implement partitioner. Now there is a
 possibility of receiving messages with same primary key twice or more, one
 is at the time message is created and other times if there is an update to
 any fields for same message.

  If two messages M1 and M2 with the same primary key are read by 2 receivers,
 then even with the partitioner, Spark would still end up processing them in
 parallel, as they are altogether in different dstreams. How do we address
 this situation?

 Thanks,
 Padma Ch

 On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das t...@databricks.com
 wrote:

 You have to partition that data on the Spark Streaming by the primary
 key, and then make sure insert data into Cassandra atomically per key, or
 per set of keys in the partition. You can use the combination of the (batch
 time, and partition Id) of the RDD inside foreachRDD as the unique id for
 the data you are inserting. This will guard against multiple attempts to
 run the task that inserts into Cassandra.

 See
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations

 TD

 On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch learnings.chitt...@gmail.com
 wrote:

 Hi All,

   I have a problem when writing streaming data to Cassandra. Our existing
  product is on Oracle DB, in which, while writing data, locks are maintained
  so that duplicates in the DB are avoided.

  But as Spark has a parallel processing architecture, if more than 1 thread
  is trying to write the same data, i.e. with the same primary key, is there
  any scope to create duplicates? If yes, how do we address this problem,
  either from the Spark or the Cassandra side?

 Thanks,
 Padma Ch







Fwd: Writing streaming data to cassandra creates duplicates

2015-07-28 Thread Priya Ch
Hi TD,

 Thanks for the info. I have the scenario like this.

 I am reading the data from kafka topic. Let's say kafka has 3 partitions
for the topic. In my streaming application, I would configure 3 receivers
with 1 thread each such that they would receive 3 dstreams (from 3
partitions of kafka topic) and also I implement partitioner. Now there is a
possibility of receiving messages with same primary key twice or more, one
is at the time message is created and other times if there is an update to
any fields for same message.

If two messages M1 and M2 with the same primary key are read by 2 receivers,
then even with the partitioner, Spark would still end up processing them in
parallel, as they are altogether in different dstreams. How do we address
this situation?

Thanks,
Padma Ch

On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das t...@databricks.com wrote:

 You have to partition that data on the Spark Streaming by the primary key,
 and then make sure insert data into Cassandra atomically per key, or per
 set of keys in the partition. You can use the combination of the (batch
 time, and partition Id) of the RDD inside foreachRDD as the unique id for
 the data you are inserting. This will guard against multiple attempts to
 run the task that inserts into Cassandra.

 See
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations

 TD
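
 A hedged sketch of the scheme described above, with the Cassandra upsert
 itself left as a placeholder: every write is tagged with the batch time and
 partition id, so a re-run of the same task produces the same rows instead of
 duplicates.

 import org.apache.spark.TaskContext
 import org.apache.spark.streaming.dstream.DStream

 // Hedged sketch: (batch time, partition id) acts as a deterministic
 // attempt key for idempotent writes.
 def writeIdempotently(stream: DStream[(String, String)]): Unit =
   stream.foreachRDD { (rdd, batchTime) =>
     rdd.foreachPartition { records =>
       val attemptKey = s"${batchTime.milliseconds}-${TaskContext.get().partitionId()}"
       records.foreach { case (primaryKey, value) =>
         val row = (attemptKey, primaryKey, value)
         // replace this println with the actual Cassandra upsert
         println(row)
       }
     }
   }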

 On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch learnings.chitt...@gmail.com
 wrote:

 Hi All,

   I have a problem when writing streaming data to Cassandra. Our existing
  product is on Oracle DB, in which, while writing data, locks are maintained
  so that duplicates in the DB are avoided.

  But as Spark has a parallel processing architecture, if more than 1 thread
  is trying to write the same data, i.e. with the same primary key, is there
  any scope to create duplicates? If yes, how do we address this problem,
  either from the Spark or the Cassandra side?

 Thanks,
 Padma Ch





Writing streaming data to cassandra creates duplicates

2015-07-26 Thread Priya Ch
Hi All,

 I have a problem when writing streaming data to Cassandra. Our existing
product is on Oracle DB, in which, while writing data, locks are maintained
so that duplicates in the DB are avoided.

But as Spark has a parallel processing architecture, if more than 1 thread is
trying to write the same data, i.e. with the same primary key, is there any
scope to create duplicates? If yes, how do we address this problem, either
from the Spark or the Cassandra side?

Thanks,
Padma Ch


Spark exception when sending message to akka actor

2014-12-22 Thread Priya Ch
Hi All,

I have Akka remote actors running on 2 nodes. I submitted the Spark
application from node1. In the Spark code, in one of the RDDs, I am sending a
message to the actor running on node1. My Spark code is as follows:




import scala.concurrent.Await
import scala.concurrent.duration._
import akka.actor.{Actor, ActorSelection, ActorSystem, PoisonPill, Props}
import akka.pattern.ask
import akka.util.Timeout
import org.apache.spark.{SparkConf, SparkContext}

class ActorClient extends Actor with Serializable
{
  import context._

  val currentActor: ActorSelection =
    context.system.actorSelection(
      "akka.tcp://ActorSystem@192.168.145.183:2551/user/MasterActor")
  implicit val timeout = Timeout(10 seconds)

  def receive =
  {
    case msg: String =>
      if (msg.contains("Spark"))
      {
        currentActor ! msg
        sender ! "Local"
      }
      else
      {
        println("Received.." + msg)
        val future = currentActor ? msg
        val result = Await.result(future, timeout.duration).asInstanceOf[String]
        if (result.contains("ACK"))
          sender ! "OK"
      }
    case PoisonPill => context.stop(self)
  }
}

object SparkExec extends Serializable
{
  implicit val timeout = Timeout(10 seconds)
  val actorSystem = ActorSystem("ClientActorSystem")
  val actor =
    actorSystem.actorOf(Props(classOf[ActorClient]), name = "ClientActor")

  def main(args: Array[String]) =
  {
    val conf = new SparkConf().setAppName("DeepLearningSpark")
    val sc = new SparkContext(conf)

    val textrdd =
      sc.textFile("hdfs://IMPETUS-DSRV02:9000/deeplearning/sample24k.csv")

    val rdd1 = textrdd.map { line =>
      println("In Map...")
      val future = actor ? "Hello..Spark"
      val result = Await.result(future, timeout.duration).asInstanceOf[String]
      if (result.contains("Local")) {
        println("Recieved in map" + result)
        // actorSystem.shutdown
      }
      (10)
    }

    val rdd2 = rdd1.map { x =>
      val future = actor ? "Done"
      val result = Await.result(future, timeout.duration).asInstanceOf[String]
      if (result.contains("OK"))
      {
        actorSystem.stop(actor)   // the original referred to `remoteActor` here
        actorSystem.shutdown
      }
      (2)
    }
    rdd2.saveAsTextFile("/home/padma/SparkAkkaOut")
  }
}

In my ActorClient actor, I identify the remote actor through actorSelection
and send the message. Once the messages are sent, in rdd2, after receiving an
ack from the remote actor, I kill the ActorClient actor and shut down the
ActorSystem.

The above code is throwing the following exception:




14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0
(TID 1, IMPETUS-DSRV05.impetus.co.in):
java.lang.ExceptionInInitializerError:
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166)
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:722)
14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, IMPETUS-DSRV05.impetus.co.in): java.lang.NoClassDefFoundError:
Could not initialize class com.impetus.spark.SparkExec$
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166)
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)


1gb file processing...task doesn't launch on all the node...Unseen exception

2014-11-14 Thread Priya Ch
Hi All,

  We have set up a 2-node cluster (NODE-DSRV05 and NODE-DSRV02), each node
having 32 GB of RAM, 1 TB of hard disk capacity and 8 CPU cores. We have set
up HDFS, which has 2 TB capacity, with a block size of 256 MB. When we try
to process a 1 GB file on Spark, we see the following exception:

14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 0.0 in stage
0.0 (TID 0, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 1.0 in stage
0.0 (TID 1, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:42 INFO scheduler.TaskSetManager: Starting task 2.0 in stage
0.0 (TID 2, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:43 INFO cluster.SparkDeploySchedulerBackend: Registered
executor: 
Actor[akka.tcp://sparkExecutor@IMPETUS-DSRV02:41124/user/Executor#539551156]
with ID 0
14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block
manager NODE-DSRV05.impetus.co.in:60432 with 2.1 GB RAM
14/11/14 17:01:43 INFO storage.BlockManagerMasterActor: Registering block
manager NODE-DSRV02:47844 with 2.1 GB RAM
14/11/14 17:01:43 INFO network.ConnectionManager: Accepted connection from [
NODE-DSRV05.impetus.co.in/192.168.145.195:51447]
14/11/14 17:01:43 INFO network.SendingConnection: Initiating connection to [
NODE-DSRV05.impetus.co.in/192.168.145.195:60432]
14/11/14 17:01:43 INFO network.SendingConnection: Connected to [
NODE-DSRV05.impetus.co.in/192.168.145.195:60432], 1 messages pending
14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_1_piece0
in memory on NODE-DSRV05.impetus.co.in:60432 (size: 17.1 KB, free: 2.1 GB)
14/11/14 17:01:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
in memory on NODE-DSRV05.impetus.co.in:60432 (size: 14.1 KB, free: 2.1 GB)
14/11/14 17:01:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, NODE-DSRV05.impetus.co.in): java.lang.NullPointerException:
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:609)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:722)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 0.1 in stage
0.0 (TID 3, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 0.0
(TID 1) on executor NODE-DSRV05.impetus.co.in:
java.lang.NullPointerException (null) [duplicate 1]
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 2.0 in stage 0.0
(TID 2) on executor NODE-DSRV05.impetus.co.in:
java.lang.NullPointerException (null) [duplicate 2]
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 2.1 in stage
0.0 (TID 4, NODE-DSRV05.impetus.co.in, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 1.1 in stage
0.0 (TID 5, NODE-DSRV02, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 0.0
(TID 3) on executor NODE-DSRV05.impetus.co.in:
java.lang.NullPointerException (null) [duplicate 3]
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 0.2 in stage
0.0 (TID 6, NODE-DSRV02, NODE_LOCAL, 1667 bytes)
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Lost task 2.1 in stage 0.0
(TID 4) on executor NODE-DSRV05.impetus.co.in:
java.lang.NullPointerException (null) [duplicate 4]
14/11/14 17:01:44 INFO scheduler.TaskSetManager: Starting task 2.2 in stage
0.0 (TID 7, NODE-DSRV02, NODE_LOCAL, 1667 bytes)


What I see is that it couldn't launch tasks on NODE-DSRV05 and is processing
everything on a single node, i.e. NODE-DSRV02. When we tried with 360 MB of
data, I don't see any exception, but the entire processing is still done by
only one node. I couldn't figure out where the issue lies.

Any suggestions on what kind of situations might cause such an issue?

Thanks,
Padma Ch


Breeze Library usage in Spark

2014-10-03 Thread Priya Ch
Hi Team,

When I am trying to use DenseMatrix from the breeze library in Spark, it
throws the following error:

java.lang.NoClassDefFoundError: breeze/storage/Zero


Can someone help me on this ?

Thanks,
Padma Ch


Fwd: Breeze Library usage in Spark

2014-10-03 Thread Priya Ch
Yes, I have included breeze 0.9 in the build.sbt file. I'll change this to
0.7. Apart from this, do we need to include the breeze jars explicitly in the
Spark context with sc.addJar()? And what about the dependencies
netlib-native_ref-linux-x86_64-1.1-natives.jar and
netlib-native_system-linux-x86_64-1.1-natives.jar - do they need to be
included in the classpath?
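
For reference, a hedged build.sbt sketch of keeping the breeze version
aligned with the one Spark already ships (0.7 for Spark 1.0.x, 0.9 for
1.1.x); the exact Spark version below is a placeholder.

// Hedged sketch, assuming Spark 1.0.x (which ships breeze 0.7). Marking the
// Spark artifacts "provided" keeps the cluster's own jars authoritative, so
// sc.addJar() for breeze should not be necessary.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.0.2" % "provided",
  "org.scalanlp"     %% "breeze"      % "0.7"
)

As far as I understand, the netlib native jars are only needed for native
BLAS acceleration; without them breeze falls back to a pure-JVM
implementation, so they are not required just to fix the
NoClassDefFoundError.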

On Fri, Oct 3, 2014 at 11:01 PM, David Hall d...@cs.berkeley.edu wrote:

 yeah, breeze.storage.Zero was introduced in either 0.8 or 0.9.

 On Fri, Oct 3, 2014 at 9:45 AM, Xiangrui Meng men...@gmail.com wrote:

 Did you add a different version of breeze to the classpath? In Spark
 1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze
 version you used is different from the one comes with Spark, you might
 see class not found. -Xiangrui

 On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch learnings.chitt...@gmail.com
 wrote:
  Hi Team,
 
   When I am trying to use DenseMatrix from the breeze library in Spark, it
  throws the following error:

   java.lang.NoClassDefFoundError: breeze/storage/Zero
 
 
  Can someone help me on this ?
 
  Thanks,
  Padma Ch
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Subscription request for developer community

2014-06-12 Thread Priya Ch
Please accept the request