Redirect Incubator pages

2014-04-04 Thread Andrew Ash
I occasionally see links to pages in the spark.incubator.apache.org domain.
 Can we HTTP 301 redirect that whole domain to spark.apache.org now that
the project has graduated?  The content seems identical.

That would also make the eventual decommission of the incubator domain much
easier as usage declines with time.

Thanks!
Andrew


Re: Spark on other parallel filesystems

2014-04-04 Thread Anand Avati
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia wrote:

> As long as the filesystem is mounted at the same path on every node, you
> should be able to just run Spark and use a file:// URL for your files.
>
> The only downside with running it this way is that Lustre won't expose
> data locality info to Spark, the way HDFS does. That may not matter if it's
> a network-mounted file system though.
>

Is the locality querying mechanism specific to HDFS mode, or is it possible
to implement plugins in Spark to query location in other ways on other
filesystems? I ask because glusterfs can expose the data location of a file
through virtual extended attributes, and I would be interested in making
Spark exploit that locality when the file location is specified as
glusterfs:// (or by querying the xattr blindly for file://). How much of a
difference does data locality make for Spark use cases anyway (since most
of the computation happens in memory)? Are there any numbers?

Thanks!
Avati


>
> Matei
>
> On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy 
> wrote:
>
>  All
>
>  Are there any drawbacks or technical challenges (or any information,
> really) related to using Spark directly on a global parallel filesystem
>  like Lustre/GPFS?
>
>  Any idea of what would be involved in doing a minimal proof of concept?
> Is it just possible to run Spark unmodified (without the HDFS substrate)
> for a start, or will that not work at all? I do know that it's possible to
> implement Tachyon on Lustre and get the HDFS interface - just looking at
> other options.
>
>  Venkat
>
>
>


Re: Spark on other parallel filesystems

2014-04-04 Thread Jeremy Freeman
We run Spark (in Standalone mode) on top of a network-mounted file system 
(NFS), rather than HDFS, and find it to work great. It required no modification 
or special configuration to set this up; as Matei says, we just point Spark to 
data using the file location.

-- Jeremy

On Apr 4, 2014, at 8:12 PM, Matei Zaharia  wrote:

> As long as the filesystem is mounted at the same path on every node, you 
> should be able to just run Spark and use a file:// URL for your files.
> 
> The only downside with running it this way is that Lustre won’t expose data 
> locality info to Spark, the way HDFS does. That may not matter if it’s a 
> network-mounted file system though.
> 
> Matei
> 
> On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy  wrote:
> 
>> All
>> 
>> Are there any drawbacks or technical challenges (or any information, really) 
>> related to using Spark directly on a global parallel filesystem  like 
>> Lustre/GPFS? 
>> 
>> Any idea of what would be involved in doing a minimal proof of concept? Is 
>> it just possible to run Spark unmodified (without the HDFS substrate) for a 
>> start, or will that not work at all? I do know that it’s possible to 
>> implement Tachyon on Lustre and get the HDFS interface – just looking at 
>> other options.
>> 
>> Venkat
> 



Re: Heartbeat exceeds

2014-04-04 Thread Patrick Wendell
If you look in the Spark UI, do you see any garbage collection happening?
My best guess is that some of the executors are going into GC and they are
timing out. You can manually increase the timeout by setting the Spark conf:

spark.storage.blockManagerSlaveTimeoutMs

to a higher value. In your case it is currently set to 45000 ms, i.e. 45 seconds.
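For example, a rough sketch of one way to set it when constructing the context
(Spark 0.9+ SparkConf style; the 120000 ms value is only an illustration):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the block manager heartbeat timeout from the 45000 ms default to 2 minutes
val conf = new SparkConf()
  .setAppName("ALS")
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")
val sc = new SparkContext(conf)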




On Fri, Apr 4, 2014 at 5:52 PM, Debasish Das wrote:

> Hi,
>
> In my ALS runs I am noticing messages that complain about heart beats:
>
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms
> exceeds 45000ms
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(12, machine2, 60714, 0) with no recent heart beats: 45328ms
> exceeds 45000ms
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(19, machine3, 39496, 0) with no recent heart beats: 53259ms
> exceeds 45000ms
>
> Is this some issue with the underlying JVM that Akka runs on? Can I
> increase the heartbeat timeout somehow to resolve these messages?
>
> Any more insight about the possible cause of the heartbeat failures would be
> helpful...
>
> I tried to re-run the job but it ultimately failed...
>
> Also I am noticing negative numbers in the stage duration:
>
>
>
>
> Any insights into the problem will be very helpful...
>
> Thanks.
> Deb
>


Re: How to create a RPM package

2014-04-04 Thread Patrick Wendell
We might be able to incorporate the maven rpm plugin into our build. If
that can be done in an elegant way it would be nice to have that
distribution target for people who want to try this with arbitrary Spark
versions...

Personally I have no familiarity with that plug-in, so I'm curious whether
anyone in the community has feedback from trying it.

- Patrick


On Fri, Apr 4, 2014 at 12:43 PM, Rahul Singhal wrote:

>   Hi Christophe,
>
>  Thanks for your reply and the spec file. I have solved my issue for now.
> I didn't want to rely on building Spark using the spec file (%build section)
> as I don't want to be maintaining the list of files that need to be
> packaged. I ended up adding maven build support to make-distribution.sh.
> This script produces a tarball which I can then use to create an RPM
> package.
>
>   Thanks,
> Rahul Singhal
>
>   From: Christophe Préaud 
> Reply-To: "user@spark.apache.org" 
> Date: Friday 4 April 2014 7:55 PM
> To: "user@spark.apache.org" 
> Subject: Re: How to create a RPM package
>
>   Hi Rahul,
>
> Spark will be available in Fedora 21 (see:
> https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently
> scheduled for 2014-10-14, but they have already produced spec files and
> source RPMs.
> If you are stuck with EL6 like me, you can have a look at the attached
> spec file, which you can probably adapt to your needs.
>
> Christophe.
>
> On 04/04/2014 09:10, Rahul Singhal wrote:
>
>  Hello Community,
>
>  This is my first mail to the list and I have a small question. The maven
> build page mentions a way to create a Debian package, but I was wondering if
> there is a simple way (preferably through maven) to create an RPM package. Is
> there a script (which is probably used for Spark releases) that I can get my
> hands on? Or should I write one on my own?
>
>  P.S. I don't want to use the "alien" software to convert a debian
> package to a RPM.
>
>   Thanks,
>  Rahul Singhal
>
>
>
>


Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
Hey Parviz,

There was a similar thread a while ago... I think that many companies like
to be discreet about the size of large clusters. But of course it would be
great if people wanted to share openly :)

For my part - I can say that Spark has been benchmarked on
hundreds-of-nodes clusters before and on jobs that crunch hundreds of
terabytes (uncompressed) of data.

- Patrick


On Fri, Apr 4, 2014 at 12:05 PM, Parviz Deyhim  wrote:

> Spark community,
>
>
> What's the size of the largest Spark cluster ever deployed? I've heard
> Yahoo is running Spark on several hundred nodes but don't know the actual
> number.
>
> can someone share?
>
> Thanks
>


exactly once

2014-04-04 Thread Bharath Bhushan
Does Spark in general assure exactly-once semantics? What happens to
those guarantees in the presence of updateStateByKey operations -- are
they also assured to be exactly-once?


Thanks
manku.timma at outlook dot com


Heartbeat exceeds

2014-04-04 Thread Debasish Das
Hi,

In my ALS runs I am noticing messages that complain about heart beats:

14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms
exceeds 45000ms
14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(12, machine2, 60714, 0) with no recent heart beats: 45328ms
exceeds 45000ms
14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(19, machine3, 39496, 0) with no recent heart beats: 53259ms
exceeds 45000ms

Is this some issue with the underlying JVM that Akka runs on? Can I
increase the heartbeat timeout somehow to resolve these messages?

Any more insight about the possible cause of the heartbeat failures would be
helpful...

I tried to re-run the job but it ultimately failed...

Also I am noticing negative numbers in the stage duration:




Any insights into the problem will be very helpful...

Thanks.
Deb


Re: Avro serialization

2014-04-04 Thread Ron Gonzalez
Thanks will take a look...

Sent from my iPad

> On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT  
> wrote:
> 
> We use avro objects in our project, and have a Kryo serializer for generic 
> Avro SpecificRecords. Take a look at:
> 
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/edu/berkeley/cs/amplab/adam/serialization/ADAMKryoRegistrator.scala
> 
> Also, Matt Massie has a good blog post about this at 
> http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
> 
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
> 
> 
>> On Thu, Apr 3, 2014 at 7:16 AM, Ian O'Connell  wrote:
>> Objects being transformed need to be one of these in flight. Source data can
>> just use the MapReduce input formats, so anything you can do with mapred. For
>> doing an Avro one of these you probably want something like:
>> https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/*ProtoBuf*
>>
>> or just whatever you're using at the moment to open them in an MR job could
>> probably be re-purposed
>> 
>> 
>>> On Thu, Apr 3, 2014 at 7:11 AM, Ron Gonzalez  wrote:
>>> 
>>> Hi,
>>>   I know that sources need to either be java serializable or use kryo 
>>> serialization.
>>>   Does anyone have sample code that reads, transforms and writes avro files 
>>> in spark?
>>> 
>>> Thanks,
>>> Ron
> 


Re: Spark on other parallel filesystems

2014-04-04 Thread Matei Zaharia
As long as the filesystem is mounted at the same path on every node, you should 
be able to just run Spark and use a file:// URL for your files.

The only downside with running it this way is that Lustre won’t expose data 
locality info to Spark, the way HDFS does. That may not matter if it’s a 
network-mounted file system though.
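As a minimal sketch (the mount point below is hypothetical; any path that is
identical on every node works the same way):

import org.apache.spark.SparkContext

val sc = new SparkContext("spark://master:7077", "LustreExample")
// Plain file:// URL, no HDFS involved; the path must exist on all workers
val logs = sc.textFile("file:///mnt/lustre/data/events.log")
println(logs.count())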

Matei

On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy  wrote:

> All
> 
> Are there any drawbacks or technical challenges (or any information, really) 
> related to using Spark directly on a global parallel filesystem  like 
> Lustre/GPFS? 
> 
> Any idea of what would be involved in doing a minimal proof of concept? Is it 
> just possible to run Spark unmodified (without the HDFS substrate) for a 
> start, or will that not work at all? I do know that it’s possible to 
> implement Tachyon on Lustre and get the HDFS interface – just looking at 
> other options.
> 
> Venkat



Spark on other parallel filesystems

2014-04-04 Thread Venkat Krishnamurthy
All

Are there any drawbacks or technical challenges (or any information, really) 
related to using Spark directly on a global parallel filesystem  like 
Lustre/GPFS?

Any idea of what would be involved in doing a minimal proof of concept? Is it 
just possible to run Spark unmodified (without the HDFS substrate) for a start, 
or will that not work at all? I do know that it’s possible to implement Tachyon 
on Lustre and get the HDFS interface – just looking at other options.

Venkat


Re: Spark output compression on HDFS

2014-04-04 Thread Azuryy
There is no compress type for snappy.


Sent from my iPhone5s

> On April 4, 2014, at 23:06, Konstantin Kudryavtsev
>  wrote:
> 
> Can anybody suggest how to change compression level (Record, Block) for 
> Snappy? 
> If it is possible, of course
> 
> thank you in advance
> 
> Thank you,
> Konstantin Kudryavtsev
> 
> 
>> On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev 
>>  wrote:
>> Thanks all, it works fine now and I managed to compress the output. However, I
>> am still stuck... How is it possible to set the compression type for Snappy?
>> I mean setting up record- or block-level compression for the output
>> 
>>> On Apr 3, 2014 1:15 AM, "Nicholas Chammas"  
>>> wrote:
>>> Thanks for pointing that out.
>>> 
>>> 
 On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra  
 wrote:
 First, you shouldn't be using spark.incubator.apache.org anymore, just 
 spark.apache.org.  Second, saveAsSequenceFile doesn't appear to exist in 
 the Python API at this point. 
 
 
> On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas 
>  wrote:
> Is this a Scala-only feature?
> 
> 
>> On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell  
>> wrote:
>> For textFile I believe we overload it and let you set a codec directly:
>> 
>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>> 
>> For saveAsSequenceFile yep, I think Mark is right, you need an option.
>> 
>> 
>>> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra  
>>> wrote:
>>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>> 
>>> The signature is 'def saveAsSequenceFile(path: String, codec: 
>>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a 
>>> Class, not an Option[Class].  
>>> 
>>> Try counts.saveAsSequenceFile(output, 
>>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>> 
>>> 
>>> 
 On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev 
  wrote:
 Hi there,
 
 
 
 I've started using Spark recently and am evaluating possible use cases in
 our company.

 I'm trying to save an RDD as a compressed SequenceFile. I'm able to save a
 non-compressed file by calling:
 
 
 
 
 
 counts.saveAsSequenceFile(output)
 where counts is my RDD of (IntWritable, Text). However, I didn't manage
 to compress the output. I tried several configurations and always got an
 exception:
 
 
 
 
 
 counts.saveAsSequenceFile(output, 
 classOf[org.apache.hadoop.io.compress.SnappyCodec])
 :21: error: type mismatch;
  found   : 
 Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
  required: Option[Class[_ <: 
 org.apache.hadoop.io.compress.CompressionCodec]]
   counts.saveAsSequenceFile(output, 
 classOf[org.apache.hadoop.io.compress.SnappyCodec])
 
  counts.saveAsSequenceFile(output, 
 classOf[org.apache.spark.io.SnappyCompressionCodec])
 :21: error: type mismatch;
  found   : 
 Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
  required: Option[Class[_ <: 
 org.apache.hadoop.io.compress.CompressionCodec]]
   counts.saveAsSequenceFile(output, 
 classOf[org.apache.spark.io.SnappyCompressionCodec])
 and it doesn't work even for Gzip:
 
 
 
 
 
  counts.saveAsSequenceFile(output, 
 classOf[org.apache.hadoop.io.compress.GzipCodec])
 :21: error: type mismatch;
  found   : 
 Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
  required: Option[Class[_ <: 
 org.apache.hadoop.io.compress.CompressionCodec]]
   counts.saveAsSequenceFile(output, 
 classOf[org.apache.hadoop.io.compress.GzipCodec])
 Could you please suggest a solution? Also, I didn't find how it is
 possible to specify compression parameters (i.e. the compression type for
 Snappy). I wonder if you could share code snippets for
 writing/reading an RDD with compression?
 
 Thank you in advance,
 
 Konstantin Kudryavtsev
> 


Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Make sure you initialize a log4j Log object on the workers and not on the 
driver program. If you’re somehow referencing a logInfo method on the driver 
program, the Log object might not get sent across the network correctly (though 
you’d usually get some other error there, like NotSerializableException). Maybe 
try using println() at first.

As Andrew said, the log output will go into the stdout and stderr files in the 
work directory on your worker. You can also access those from the Spark 
cluster’s web UI (click on a worker there, then click on stdout / stderr).
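A minimal sketch of that pattern, creating the Logger inside the closure so it is
initialized on the executor rather than captured from the driver (rdd here is any
existing RDD; the names are illustrative):

import org.apache.log4j.Logger

val tagged = rdd.map { record =>
  val log = Logger.getLogger("MyMapFunction")  // created on the worker, not serialized from the driver
  log.info("processing " + record)
  record
}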

Matei


On Apr 4, 2014, at 11:49 AM, John Salvatier  wrote:

> Is there a way to log exceptions inside a mapping function? logError and 
> logInfo seem to freeze things. 
> 
> 
> On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia  
> wrote:
> Exceptions should be sent back to the driver program and logged there (with a 
> SparkException thrown if a task fails more than 4 times), but there were some 
> bugs before where this did not happen for non-Serializable exceptions. We 
> changed it to pass back the stack traces only (as text), which should always 
> work. I’d recommend trying a newer Spark version, 0.8 should be easy to 
> upgrade to from 0.7.
> 
> Matei
> 
> On Apr 4, 2014, at 10:40 AM, John Salvatier  wrote:
> 
> > I'm trying to get a clear idea about how exceptions are handled in Spark? 
> > Is there somewhere where I can read about this? I'm on spark .7
> >
> > For some reason I was under the impression that such exceptions are 
> > swallowed and the value that produced them ignored but the exception is 
> > logged. However, right now we're seeing the task just re-tried over and 
> > over again in an infinite loop because there's a value that always 
> > generates an exception.
> >
> > John
> 
> 



Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Nicholas Chammas
Sweet, thanks for the instructions. This will do for resizing a dev cluster
that you can bring down at will.

I will open a JIRA issue about adding the functionality I described to
spark-ec2.


On Fri, Apr 4, 2014 at 3:43 PM, Matei Zaharia wrote:

> This can't be done through the script right now, but you can do it
> manually as long as the cluster is stopped. If the cluster is stopped, just
> go into the AWS Console, right click a slave and choose "launch more of
> these" to add more. Or select multiple slaves and delete them. When you run
> spark-ec2 start the next time to start your cluster, it will set it up on
> all the machines it finds in the mycluster-slaves security group.
>
> This is pretty hacky so it would definitely be good to add this feature;
> feel free to open a JIRA about it.
>
> Matei
>
> On Apr 4, 2014, at 12:16 PM, Nicholas Chammas 
> wrote:
>
> I would like to be able to use spark-ec2 to launch new slaves and add them
> to an existing, running cluster. Similarly, I would also like to remove
> slaves from an existing cluster.
>
> Use cases include:
>
>1. Oh snap, I sized my cluster incorrectly. Let me add/remove some
>slaves.
>2. During scheduled batch processing, I want to add some new slaves,
>perhaps on spot instances. When that processing is done, I want to kill
>them. (Cruel, I know.)
>
> I gather this is not possible at the moment. spark-ec2 appears to be able
> to launch new slaves for an existing cluster only if the master is stopped.
> I also do not see any ability to remove slaves from a cluster.
>
> Is that correct? Are there plans to add such functionality to spark-ec2 in
> the future?
>
> Nick
>
>
> --
> View this message in context: Having spark-ec2 join new slaves to existing cluster
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Andrew Or
Logging inside a map function shouldn't "freeze things." The messages
should be logged in the worker logs, since the code is executed on the
executors. If you throw a SparkException, however, it'll be propagated to
the driver after the task has failed 4 or more times (by default).
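As a rough sketch of handling a bad value inside the map itself, so the task is not
retried forever (lines is a hypothetical RDD[String]; println output ends up in the
executor's stdout log):

val parsed = lines.flatMap { line =>
  try {
    Some(line.toInt)
  } catch {
    case e: NumberFormatException =>
      println("Could not parse '" + line + "': " + e)  // visible in the worker's stdout
      None  // drop the bad record instead of failing the task
  }
}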

On Fri, Apr 4, 2014 at 11:57 AM, John Salvatier wrote:

> Btw, thank you for your help.
>
>
> On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier wrote:
>
>> Is there a way to log exceptions inside a mapping function? logError and
>> logInfo seem to freeze things.
>>
>>
>> On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia 
>> wrote:
>>
>>> Exceptions should be sent back to the driver program and logged there
>>> (with a SparkException thrown if a task fails more than 4 times), but there
>>> were some bugs before where this did not happen for non-Serializable
>>> exceptions. We changed it to pass back the stack traces only (as text),
>>> which should always work. I'd recommend trying a newer Spark version, 0.8
>>> should be easy to upgrade to from 0.7.
>>>
>>> Matei
>>>
>>> On Apr 4, 2014, at 10:40 AM, John Salvatier 
>>> wrote:
>>>
>>> > I'm trying to get a clear idea about how exceptions are handled in
>>> Spark? Is there somewhere where I can read about this? I'm on spark .7
>>> >
>>> > For some reason I was under the impression that such exceptions are
>>> swallowed and the value that produced them ignored but the exception is
>>> logged. However, right now we're seeing the task just re-tried over and
>>> over again in an infinite loop because there's a value that always
>>> generates an exception.
>>> >
>>> > John
>>>
>>>
>>
>


Re: reduceByKeyAndWindow Java

2014-04-04 Thread Eduardo Costa Alfaia

Hi Tathagata,

You are right, this code compiles, but I am having some problems with high
memory consumption. I sent some emails about this today, but no response
so far.


Thanks
On 4/4/14, 22:56, Tathagata Das wrote:
I haven't really compiled the code, but it looks good to me. Why? Is
there any problem you are facing?


TD


On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia
<e.costaalf...@unibs.it> wrote:



Hi guys,

I would like to know if this part of the code is right to use with a window.

JavaPairDStream<String, Integer> wordCounts = words.map(
  new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }).reduceByKeyAndWindow(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) { return i1 + i2; }
  },
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) { return i1 - i2; }
  },
  new Duration(60 * 5 * 1000),
  new Duration(1 * 1000)
);



Thanks

-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155






--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: reduceByKeyAndWindow Java

2014-04-04 Thread Tathagata Das
I haven't really compiled the code, but it looks good to me. Why? Is there
any problem you are facing?

TD


On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia  wrote:

>
> Hi guys,
>
> I would like to know if this part of the code is right to use with a window.
>
> JavaPairDStream<String, Integer> wordCounts = words.map(
>   new PairFunction<String, String, Integer>() {
>     @Override
>     public Tuple2<String, Integer> call(String s) {
>       return new Tuple2<String, Integer>(s, 1);
>     }
>   }).reduceByKeyAndWindow(
>   new Function2<Integer, Integer, Integer>() {
>     public Integer call(Integer i1, Integer i2) { return i1 + i2; }
>   },
>   new Function2<Integer, Integer, Integer>() {
>     public Integer call(Integer i1, Integer i2) { return i1 - i2; }
>   },
>   new Duration(60 * 5 * 1000),
>   new Duration(1 * 1000)
> );
>
>
>
> Thanks
>
> --
> Informativa sulla Privacy: http://www.unibs.it/node/8155
>


Re: How to create a RPM package

2014-04-04 Thread Rahul Singhal
Hi Christophe,

Thanks for your reply and the spec file. I have solved my issue for now. I
didn't want to rely on building Spark using the spec file (%build section) as I
don't want to be maintaining the list of files that need to be packaged. I
ended up adding maven build support to make-distribution.sh. This script
produces a tarball which I can then use to create an RPM package.

Thanks,
Rahul Singhal

From: Christophe Préaud <christophe.pre...@kelkoo.com>
Reply-To: "user@spark.apache.org" <user@spark.apache.org>
Date: Friday 4 April 2014 7:55 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: How to create a RPM package

Hi Rahul,

Spark will be available in Fedora 21 (see:
https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently
scheduled for 2014-10-14, but they have already produced spec files and source
RPMs.
If you are stuck with EL6 like me, you can have a look at the attached spec
file, which you can probably adapt to your needs.

Christophe.

On 04/04/2014 09:10, Rahul Singhal wrote:
Hello Community,

This is my first mail to the list and I have a small question. The maven build
page mentions a way to create a Debian package, but I was wondering if there is a
simple way (preferably through maven) to create an RPM package. Is there a
script (which is probably used for Spark releases) that I can get my hands on?
Or should I write one on my own?

P.S. I don't want to use the "alien" software to convert a debian package to a 
RPM.

Thanks,
Rahul Singhal





Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Matei Zaharia
This can’t be done through the script right now, but you can do it manually as 
long as the cluster is stopped. If the cluster is stopped, just go into the AWS 
Console, right click a slave and choose “launch more of these” to add more. Or 
select multiple slaves and delete them. When you run spark-ec2 start the next 
time to start your cluster, it will set it up on all the machines it finds in 
the mycluster-slaves security group.

This is pretty hacky so it would definitely be good to add this feature; feel 
free to open a JIRA about it.

Matei

On Apr 4, 2014, at 12:16 PM, Nicholas Chammas  
wrote:

> I would like to be able to use spark-ec2 to launch new slaves and add them to 
> an existing, running cluster. Similarly, I would also like to remove slaves 
> from an existing cluster.
> 
> Use cases include:
> 1. Oh snap, I sized my cluster incorrectly. Let me add/remove some slaves.
> 2. During scheduled batch processing, I want to add some new slaves, perhaps on
> spot instances. When that processing is done, I want to kill them. (Cruel, I
> know.)
> I gather this is not possible at the moment. spark-ec2 appears to be able to 
> launch new slaves for an existing cluster only if the master is stopped. I 
> also do not see any ability to remove slaves from a cluster.
> 
> Is that correct? Are there plans to add such functionality to spark-ec2 in 
> the future?
> 
> Nick
> 
> 
> View this message in context: Having spark-ec2 join new slaves to existing 
> cluster
> Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: example of non-line oriented input data?

2014-04-04 Thread Matei Zaharia
FYI, one thing we’ve added now is support for reading multiple text files from 
a directory as separate records: https://github.com/apache/spark/pull/327. This 
should remove the need for mapPartitions discussed here.
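A hedged sketch of how that could be used for the whole-file XML case discussed
earlier in this thread, assuming the new method is exposed as
SparkContext.wholeTextFiles and yields (filePath, fileContents) pairs; the
directory and XML layout below are hypothetical:

val files = sc.wholeTextFiles("file:///home/training/countries")
val countryNames = files.flatMap { case (path, xml) =>
  // Parse the whole file at once and pull out the name attribute of each country element
  (scala.xml.XML.loadString(xml) \\ "country").map(node => (node \ "@name").text)
}
println(countryNames.collect().mkString(", "))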

Avro and SequenceFiles look like they may not make it for 1.0, but there’s a 
chance that Parquet support with Spark SQL will, which should let you store 
binary data a bit better.

Matei

On Mar 19, 2014, at 3:12 PM, Jeremy Freeman  wrote:

> Another vote on this, support for simple SequenceFiles and/or Avro would be 
> terrific, as using plain text can be very space-inefficient, especially for 
> numerical data.
> 
> -- Jeremy
> 
> On Mar 19, 2014, at 5:24 PM, Nicholas Chammas  
> wrote:
> 
>> I'd second the request for Avro support in Python first, followed by Parquet.
>> 
>> 
>> On Wed, Mar 19, 2014 at 2:14 PM, Evgeny Shishkin  
>> wrote:
>> 
>> On 19 Mar 2014, at 19:54, Diana Carroll  wrote:
>> 
>>> Actually, thinking more on this question, Matei: I'd definitely say support 
>>> for Avro.  There's a lot of interest in this!!
>>> 
>> 
>> Agree, and parquet as default Cloudera Impala format.
>> 
>> 
>> 
>> 
>>> On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia  
>>> wrote:
>>> BTW one other thing — in your experience, Diana, which non-text 
>>> InputFormats would be most useful to support in Python first? Would it be 
>>> Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or 
>>> something else? I think a per-file text input format that does the stuff we 
>>> did here would also be good.
>>> 
>>> Matei
>>> 
>>> 
>>> On Mar 18, 2014, at 3:27 PM, Matei Zaharia  wrote:
>>> 
 Hi Diana,
 
 This seems to work without the iter() in front if you just return 
 treeiterator. What happened when you didn’t include that? Treeiterator 
 should return an iterator.
 
 Anyway, this is a good example of mapPartitions. It’s one where you want 
 to view the whole file as one object (one XML here), so you couldn’t 
 implement this using a flatMap, but you still want to return multiple 
 values. The MLlib example you saw needs Python 2.7 because unfortunately 
 that is a requirement for our Python MLlib support (see 
 http://spark.incubator.apache.org/docs/0.9.0/python-programming-guide.html#libraries).
  We’d like to relax this later but we’re using some newer features of 
 NumPy and Python. The rest of PySpark works on 2.6.
 
 In terms of the size in memory, here both the string s and the XML tree 
 constructed from it need to fit in, so you can’t work on very large 
 individual XML files. You may be able to use a streaming XML parser 
 instead to extract elements from the data in a streaming fashion, without
 ever materializing the whole tree.
 http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreader
  is one example.
 
 Matei
 
 On Mar 18, 2014, at 7:49 AM, Diana Carroll  wrote:
 
> Well, if anyone is still following this, I've gotten the following code 
> working which in theory should allow me to parse whole XML files: (the 
> problem was that I can't return the tree iterator directly.  I have to 
> call iter().  Why?)
> 
> import xml.etree.ElementTree as ET
>
> # two source files, format <country name="...">..
> mydata = sc.textFile("file:/home/training/countries*.xml")
>
> def parsefile(iterator):
>     s = ''
>     for i in iterator: s = s + str(i)
>     tree = ET.fromstring(s)
>     treeiterator = tree.getiterator("country")
>     # why do I have to convert an iterator to an iterator?  not sure but required
>     return iter(treeiterator)
>
> mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element:
> element.attrib).collect()
> 
> The output is what I expect:
> [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]
> 
> BUT I'm a bit concerned about the construction of the string "s".  How 
> big can my file be before converting it to a string becomes problematic?
> 
> 
> 
> On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll  
> wrote:
> Thanks, Matei.
> 
> In the context of this discussion, it would seem mapParitions is 
> essential, because it's the only way I'm going to be able to process each 
> file as a whole, in our example of a large number of small XML files 
> which need to be parsed as a whole file because records are not required 
> to be on a single line.
> 
> The theory makes sense but I'm still utterly lost as to how to implement 
> it.  Unfortunately there's only a single example of the use of 
> mapPartitions in any of the Python example programs, which is the log 
> regression example, which I can't run because it requires Python 2.7 and 
> I'm on Python 2.6.  (aside: I couldn't find any statement that Python 2.6 
> is unsupported

Re: Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia

Thanks Nicholas
On 4/4/14, 21:19, Nicholas Chammas wrote:
If you want more parallelism, you need more cores. So, use a machine
with more cores, or use a cluster of machines. spark-ec2 is the easiest
way to do this.


If you're stuck on a single machine with 2 cores, then set your 
default parallelism to 2. Setting it to a higher number won't do 
anything helpful.



On Fri, Apr 4, 2014 at 2:47 PM, Eduardo Costa Alfaia
<e.costaalf...@unibs.it> wrote:


What do you advise, Nicholas?

On 4/4/14, 19:05, Nicholas Chammas wrote:

If you're running on one machine with 2 cores, I believe all you
can get out of it are 2 concurrent tasks at any one time. So
setting your default parallelism to 20 won't help.


On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia
<e.costaalf...@unibs.it> wrote:

Hi all,

I have put this line in my spark-env.sh:
-Dspark.default.parallelism=20

 this parallelism level, is it correct?
 The machine's processor is a dual core.

Thanks

-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155






Informativa sulla Privacy: http://www.unibs.it/node/8155





--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you want more parallelism, you need more cores. So, use a machine with
more cores, or use a cluster of machines. spark-ec2 is the
easiest way to do this.

If you're stuck on a single machine with 2 cores, then set your default
parallelism to 2. Setting it to a higher number won't do anything helpful.
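For reference, a small sketch of the two usual knobs (the values are only examples
for a 2-core machine and "input.txt" is a placeholder path):

import org.apache.spark.{SparkConf, SparkContext}

// Set the default number of tasks when creating the context
val conf = new SparkConf().setAppName("example").set("spark.default.parallelism", "2")
val sc = new SparkContext(conf)

val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 2)  // the partition count can also be overridden per operation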


On Fri, Apr 4, 2014 at 2:47 PM, Eduardo Costa Alfaia  wrote:

>  What do you advise, Nicholas?
>
> On 4/4/14, 19:05, Nicholas Chammas wrote:
>
> If you're running on one machine with 2 cores, I believe all you can get
> out of it are 2 concurrent tasks at any one time. So setting your default
> parallelism to 20 won't help.
>
>
> On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia <
> e.costaalf...@unibs.it> wrote:
>
>> Hi all,
>>
>> I have put this line in my spark-env.sh:
>> -Dspark.default.parallelism=20
>>
>>  this parallelism level, is it correct?
>>  The machine's processor is a dual core.
>>
>> Thanks
>>
>> --
>> Informativa sulla Privacy: http://www.unibs.it/node/8155
>>
>
>
>
> Informativa sulla Privacy: http://www.unibs.it/node/8155
>


Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Nicholas Chammas
I would like to be able to use spark-ec2 to launch new slaves and add them
to an existing, running cluster. Similarly, I would also like to remove
slaves from an existing cluster.

Use cases include:

   1. Oh snap, I sized my cluster incorrectly. Let me add/remove some
   slaves.
   2. During scheduled batch processing, I want to add some new slaves,
   perhaps on spot instances. When that processing is done, I want to kill
   them. (Cruel, I know.)

I gather this is not possible at the moment. spark-ec2 appears to be able
to launch new slaves for an existing cluster only if the master is stopped.
I also do not see any ability to remove slaves from a cluster.

Is that correct? Are there plans to add such functionality to spark-ec2 in
the future?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Having-spark-ec2-join-new-slaves-to-existing-cluster-tp3783.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Largest Spark Cluster

2014-04-04 Thread Parviz Deyhim
Spark community,


What's the size of the largest Spark cluster ever deployed? I've heard
Yahoo is running Spark on several hundred nodes but don't know the actual
number.

can someone share?

Thanks


Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
Minor typo in the example.  The first SELECT statement should actually be:

sql("SELECT * FROM src")

where `src` is a Hive table with schema (key INT, value STRING).


On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust wrote:

>
> In such a construct, each operator builds on the previous one, including any
>> materialized results etc. If I use a SQL statement for each of them, I suspect the
>> later SQL statements will not leverage the earlier ones by any means - hence these
>> will be inefficient compared to the first approach. Let me know if this is not correct.
>>
>
> This is not correct.  When you run a SQL statement and register it as a
> table, it is the logical plan for this query that is used when this virtual
> table is referenced in later queries, not the results.  SQL queries are
> lazy, just like RDDs and DSL queries.  This is illustrated below.
>
>
> scala> sql("SELECT * FROM selectQuery")
> res3: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[12] at RDD at SchemaRDD.scala:93
> == Query Plan ==
> HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None
>
> scala> sql("SELECT * FROM src").registerAsTable("selectQuery")
>
> scala> sql("SELECT key FROM selectQuery")
> res5: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[24] at RDD at SchemaRDD.scala:93
> == Query Plan ==
> HiveTableScan [key#8], (MetastoreRelation default, src, None), None
>
> Even though the second query is running over the "results" of the first
> query (which requested all columns using *), the optimizer is still able to
> come up with an efficient plan that avoids reading "value" from the table,
> as can be seen by the arguments of the HiveTableScan.
>
> Note that if you call sqlContext.cacheTable("selectQuery") then you are
> correct.  The results will be materialized in an in-memory columnar format,
> and subsequent queries will be run over these materialized results.
>
>
>> The reason for building expressions is that the use case needs these to
>> be created on the fly based on some case class at runtime.
>>
>> I.e., I can't type these in REPL. The scala code will define some case
>> class A (a: ... , b: ..., c: ... ) where class name, member names and types
>> will be known before hand and the RDD will be defined on this. Then based
>> on user action, above pipeline needs to be constructed on fly. Thus the
>> expressions has to be constructed on fly from class members and other
>> predicates etc., most probably using expression constructors.
>>
>> Could you please share how expressions could be constructed using the
>> APIs on expression (and not on REPL) ?
>>
>
> I'm not sure I completely understand the use case here, but you should be
> able to construct symbols and use the DSL to create expressions at runtime,
> just like in the REPL.
>
> val attrName: String = "name"
> val addExpression: Expression = Symbol(attrName) + Symbol(attrName)
>
> There is currently no public API for constructing expressions manually
> other than SQL or the DSL.  While you could dig into
> org.apache.spark.sql.catalyst.expressions._, these APIs are considered
> internal, and *will not be stable in between versions*.
>
> Michael
>
>
>
>


Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Btw, thank you for your help.


On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier wrote:

> Is there a way to log exceptions inside a mapping function? logError and
> logInfo seem to freeze things.
>
>
> On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia wrote:
>
>> Exceptions should be sent back to the driver program and logged there
>> (with a SparkException thrown if a task fails more than 4 times), but there
>> were some bugs before where this did not happen for non-Serializable
>> exceptions. We changed it to pass back the stack traces only (as text),
>> which should always work. I'd recommend trying a newer Spark version, 0.8
>> should be easy to upgrade to from 0.7.
>>
>> Matei
>>
>> On Apr 4, 2014, at 10:40 AM, John Salvatier  wrote:
>>
>> > I'm trying to get a clear idea about how exceptions are handled in
>> Spark? Is there somewhere where I can read about this? I'm on spark .7
>> >
>> > For some reason I was under the impression that such exceptions are
>> swallowed and the value that produced them ignored but the exception is
>> logged. However, right now we're seeing the task just re-tried over and
>> over again in an infinite loop because there's a value that always
>> generates an exception.
>> >
>> > John
>>
>>
>


Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Is there a way to log exceptions inside a mapping function? logError and
logInfo seem to freeze things.


On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia wrote:

> Exceptions should be sent back to the driver program and logged there
> (with a SparkException thrown if a task fails more than 4 times), but there
> were some bugs before where this did not happen for non-Serializable
> exceptions. We changed it to pass back the stack traces only (as text),
> which should always work. I'd recommend trying a newer Spark version, 0.8
> should be easy to upgrade to from 0.7.
>
> Matei
>
> On Apr 4, 2014, at 10:40 AM, John Salvatier  wrote:
>
> > I'm trying to get a clear idea about how exceptions are handled in
> Spark? Is there somewhere where I can read about this? I'm on spark .7
> >
> > For some reason I was under the impression that such exceptions are
> swallowed and the value that produced them ignored but the exception is
> logged. However, right now we're seeing the task just re-tried over and
> over again in an infinite loop because there's a value that always
> generates an exception.
> >
> > John
>
>


Re: Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia

What do you advise, Nicholas?

On 4/4/14, 19:05, Nicholas Chammas wrote:
If you're running on one machine with 2 cores, I believe all you can 
get out of it are 2 concurrent tasks at any one time. So setting your 
default parallelism to 20 won't help.



On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia
<e.costaalf...@unibs.it> wrote:


Hi all,

I have put this line in my spark-env.sh:
-Dspark.default.parallelism=20

 this parallelism level, is it correct?
 The machine's processor is a dual core.

Thanks

-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155






--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
> In such a construct, each operator builds on the previous one, including any
> materialized results etc. If I use a SQL statement for each of them, I suspect the
> later SQL statements will not leverage the earlier ones by any means - hence these
> will be inefficient compared to the first approach. Let me know if this is not correct.
>

This is not correct.  When you run a SQL statement and register it as a
table, it is the logical plan for this query that is used when this virtual
table is referenced in later queries, not the results.  SQL queries are
lazy, just like RDDs and DSL queries.  This is illustrated below.


scala> sql("SELECT * FROM selectQuery")
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[12] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None

scala> sql("SELECT * FROM src").registerAsTable("selectQuery")

scala> sql("SELECT key FROM selectQuery")
res5: org.apache.spark.sql.SchemaRDD =
SchemaRDD[24] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#8], (MetastoreRelation default, src, None), None

Even though the second query is running over the "results" of the first
query (which requested all columns using *), the optimizer is still able to
come up with an efficient plan that avoids reading "value" from the table,
as can be seen by the arguments of the HiveTableScan.

Note that if you call sqlContext.cacheTable("selectQuery") then you are
correct.  The results will be materialized in an in-memory columnar format,
and subsequent queries will be run over these materialized results.


> The reason for building expressions is that the use case needs these to be
> created on the fly based on some case class at runtime.
>
> I.e., I can't type these in REPL. The scala code will define some case
> class A (a: ... , b: ..., c: ... ) where class name, member names and types
> will be known before hand and the RDD will be defined on this. Then based
> on user action, above pipeline needs to be constructed on fly. Thus the
> expressions has to be constructed on fly from class members and other
> predicates etc., most probably using expression constructors.
>
> Could you please share how expressions could be constructed using the APIs
> on expression (and not on REPL) ?
>

I'm not sure I completely understand the use case here, but you should be
able to construct symbols and use the DSL to create expressions at runtime,
just like in the REPL.

val attrName: String = "name"
val addExpression: Expression = Symbol(attrName) + Symbol(attrName)
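For instance, a hedged sketch of using such a runtime-built expression with the DSL
(people is a hypothetical SchemaRDD, and the SQL context implicits are assumed to be
imported as in the REPL session above):

// Column name known only at runtime; the literal 13 is converted implicitly
val attrName: String = "age"
val predicate: Expression = Symbol(attrName) > 13
val filtered = people.where(predicate)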

There is currently no public API for constructing expressions manually
other than SQL or the DSL.  While you could dig into
org.apache.spark.sql.catalyst.expressions._, these APIs are considered
internal, and *will not be stable in between versions*.

Michael


Re: Status of MLI?

2014-04-04 Thread Yi Zou
Hi, Evan,

Just noticed this thread; do you mind sharing more details regarding the
algorithms targeted at hyperparameter tuning/model selection, or a link to
the dev git repo for that work?

thanks,
yi


On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks wrote:

> Targeting 0.9.0 should work out of the box (just a change to the
> build.sbt) - I'll push some changes I've been sitting on to the public repo
> in the next couple of days.
>
>
> On Wed, Apr 2, 2014 at 4:05 AM, Krakna H  wrote:
>
>> Thanks for the update Evan! In terms of using MLI, I see that the Github
>> code is linked to Spark 0.8; will it not work with 0.9 (which is what I
>> have set up) or higher versions?
>>
>>
>> On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User
>> List] <[hidden email] 
>> > wrote:
>>
>>> Hi there,
>>>
>>> MLlib is the first component of MLbase - MLI and the higher levels of
>>> the stack are still being developed. Look for updates in terms of our
>>> progress on the hyperparameter tuning/model selection problem in the next
>>> month or so!
>>>
>>> - Evan
>>>
>>>
>>> On Tue, Apr 1, 2014 at 8:05 PM, Krakna H <[hidden 
>>> email]
>>> > wrote:
>>>
 Hi Nan,

 I was actually referring to MLI/MLBase (http://www.mlbase.org); is
 this being actively developed?

 I'm familiar with mllib and have been looking at its documentation.

 Thanks!


 On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] 
 <[hidden
 email] > wrote:

>  mllib has been part of Spark distribution (under mllib directory),
> also check http://spark.apache.org/docs/latest/mllib-guide.html
>
> and for JIRA, because of the recent migration to apache JIRA, I think
> all mllib-related issues should be under the Spark umbrella,
> https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
>
> --
> Nan Zhu
>
> On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:
>
> What is the current development status of MLI/MLBase? I see that the
> github repo is lying dormant (https://github.com/amplab/MLI) and JIRA
> has had no activity in the last 30 days (
> https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
> Is the plan to add a lot of this into mllib itself without needing a
> separate API?
>
> Thanks!
>
> --
> View this message in context: Status of MLI?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
>
> --
>  If you reply to this email, your message will be added to the
> discussion below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3611.html
>


 --
 View this message in context: Re: Status of MLI?
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

>>>
>>>
>>>
>>> --
>>>  If you reply to this email, your message will be added to the
>>> discussion below:
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3615.html
>>>
>>
>>
>> --
>> View this 

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Exceptions should be sent back to the driver program and logged there (with a 
SparkException thrown if a task fails more than 4 times), but there were some 
bugs before where this did not happen for non-Serializable exceptions. We 
changed it to pass back the stack traces only (as text), which should always 
work. I’d recommend trying a newer Spark version, 0.8 should be easy to upgrade 
to from 0.7.

Matei

On Apr 4, 2014, at 10:40 AM, John Salvatier  wrote:

> I'm trying to get a clear idea about how exceptions are handled in Spark? Is 
> there somewhere where I can read about this? I'm on spark .7
> 
> For some reason I was under the impression that such exceptions are swallowed 
> and the value that produced them ignored but the exception is logged. 
> However, right now we're seeing the task just re-tried over and over again in 
> an infinite loop because there's a value that always generates an exception.
> 
> John



How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
I'm trying to get a clear idea about how exceptions are handled in Spark?
Is there somewhere where I can read about this? I'm on spark .7

For some reason I was under the impression that such exceptions are
swallowed and the value that produced them ignored but the exception is
logged. However, right now we're seeing the task just re-tried over and
over again in an infinite loop because there's a value that always
generates an exception.

John


Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-04 Thread Marcelo Vanzin
Hi Francis,

This might be a long shot, but do you happen to have built spark on an
encrypted home dir?

(I was running into the same error when I was doing that. Rebuilding
on an unencrypted disk fixed the issue. This is a known issue /
limitation with ecryptfs. It's weird that the build doesn't fail, but
you do get warnings about the long file names.)


On Wed, Apr 2, 2014 at 3:26 AM, Francis.Hu  wrote:
> I stuck in a NoClassDefFoundError.  Any helps that would be appreciated.
>
> I download spark 0.9.0 source, and then run this command to build it :
> SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

>
> java.lang.NoClassDefFoundError:
> scala/tools/nsc/transform/UnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply$5$$anonfun$scala$tools$nsc$transform$UnCurry$UnCurryTransformer$$anonfun$$anonfun$$transformInConstructor$1$1

-- 
Marcelo


Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you're running on one machine with 2 cores, I believe all you can get
out of it are 2 concurrent tasks at any one time. So setting your default
parallelism to 20 won't help.


On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia <
e.costaalf...@unibs.it> wrote:

> Hi all,
>
> I have put this line in my spark-env.sh:
> -Dspark.default.parallelism=20
>
>  this parallelism level, is it correct?
>  The machine's processor is a dual core.
>
> Thanks
>
> --
> Informativa sulla Privacy: http://www.unibs.it/node/8155
>


RAM Increase

2014-04-04 Thread Eduardo Costa Alfaia

Hi Guys,

Could anyone explain this behavior to me? This is after 2 minutes of testing:

computer1 - worker
computer10 - worker
computer8 - driver (master)

computer1
 18:24:31 up 73 days,  7:14,  1 user,  load average: 3.93, 2.45, 1.14
                     total    used    free    shared    buffers    cached
Mem:                  3945    3925      19         0         18      1368
-/+ buffers/cache:            2539    1405
Swap:                    0       0       0
computer10
 18:22:38 up 44 days, 21:26,  2 users,  load average: 3.05, 2.20, 1.03
                     total    used    free    shared    buffers    cached
Mem:                  5897    5292     604         0         46      2707
-/+ buffers/cache:            2538    3358
Swap:                    0       0       0
computer8
 18:24:13 up 122 days, 22 min, 13 users,  load average: 1.10, 0.93, 0.82
                     total    used    free    shared    buffers    cached
Mem:                  5897    5841      55         0        113      2747
-/+ buffers/cache:            2980    2916
Swap:                    0       0       0

Thanks Guys

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-04 Thread Prasad
Hi Wisely,
Could you please post your pom.xml here.

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p3770.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Regarding Sparkcontext object

2014-04-04 Thread Daniel Siegmann
On Wed, Apr 2, 2014 at 7:11 PM, yh18190  wrote:

> Is it always needed that sparkcontext object be created in Main method of
> class.Is it necessary?Can we create "sc" object in other class and try to
> use it by passing this object through function and use it?
>

The Spark context can be initialized wherever you like and passed around
just as any other object. Just don't try to create multiple contexts
against "local" (without stopping the previous one first), or you may get
ArrayStoreExceptions (I learned that one the hard way).
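A minimal sketch of that pattern (everything here is hypothetical and only meant to
show the context being created in one place and used from another class):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class WordCounter(sc: SparkContext) {
  // Uses the context passed in from elsewhere
  def count(path: String): RDD[(String, Int)] =
    sc.textFile(path).flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)
}

object Main {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "pass-sc-around")
    try {
      new WordCounter(sc).count(args(0)).take(10).foreach(println)
    } finally {
      sc.stop()
    }
  }
}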

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia

Hi all,

I have put this line in my spark-env.sh:
-Dspark.default.parallelism=20

 this parallelism level, is it correct?
 The machine's processor is a dual core.

Thanks

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Thanks all for the update - I have actually built using those options every
which way I can think of, so perhaps this is something I am doing wrong in how
I upload the jar to our artifactory repo server. Does anyone have a working pom
file for publishing a Spark 0.9 Hadoop 2.X build to a maven repo
server?

cheers,
Erik


On Fri, Apr 4, 2014 at 7:54 AM, Rahul Singhal wrote:

>   Hi Erik,
>
>  I am working with TOT branch-0.9 (> 0.9.1) and the following works for
> me for maven build:
>
>  export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
> -XX:ReservedCodeCacheSize=512m"
> mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean
> package
>
>
>  And from http://spark.apache.org/docs/latest/running-on-yarn.html, for
> sbt build, you could try:
>
>  SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
>
>   Thanks,
> Rahul Singhal
>
>   From: Erik Freed 
> Reply-To: "user@spark.apache.org" 
> Date: Friday 4 April 2014 7:58 PM
> To: "user@spark.apache.org" 
> Subject: Hadoop 2.X Spark Client Jar 0.9.0 problem
>
>   Hi All,
>
>  I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps
> already being addressed, but I am having a devil of a time with a spark
> 0.9.0 client jar for hadoop 2.X. If I go to the site and download:
>
>
>    - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror or
>      direct file download
>
> I get a jar with what appears to be hadoop 1.0.4 that fails when using
> hadoop 2.3.0. I have tried repeatedly to build the source tree with the
> correct options per the documentation but always seemingly ending up with
> hadoop 1.0.4.  As far as I can tell the reason that the jar available on
> the web site doesn't have the correct hadoop client in it, is because the
> build itself is having that problem.
>
>  I am about to try to troubleshoot the build but wanted to see if anyone
> out there has encountered the same problem and/or if I am just doing
> something dumb (!)
>
>
>  Anyone else using hadoop 2.X? How do you get the right client jar if so?
>
>  cheers,
> Erik
>
>  --
> Erik James Freed
> CoDecision Software
> 510.859.3360
> erikjfr...@codecision.com
>
> 1480 Olympus Avenue
> Berkeley, CA
> 94708
>
> 179 Maria Lane
> Orcas, WA
> 98245
>



-- 
Erik James Freed
CoDecision Software
510.859.3360
erikjfr...@codecision.com

1480 Olympus Avenue
Berkeley, CA
94708

179 Maria Lane
Orcas, WA
98245


Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-04 Thread Ron Gonzalez
Hi,
  Can you explain a little more what's going on? Which one submits a job to the 
yarn cluster that creates an application master and spawns containers for the 
local jobs? I tried yarn-client and submitted to our yarn cluster and it seems 
to work that way.  Shouldn't Client.scala be running within the AppMaster 
instance in this run mode?
  How exactly does yarn-standalone work?
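
For what it's worth, a rough sketch of the distinction as I understand the 0.9
docs (hedged; the exact launch commands are in the running-on-yarn guide):

import org.apache.spark.{SparkConf, SparkContext}

// yarn-client: the driver runs in this JVM on the submitting machine, while
// the executors run in YARN containers on the cluster.
val sc = new SparkContext(
  new SparkConf().setMaster("yarn-client").setAppName("yarn-client-demo"))

// yarn-standalone (0.9 naming): the app is submitted through
// org.apache.spark.deploy.yarn.Client, and the driver itself runs inside the
// YARN ApplicationMaster on the cluster rather than on the submitting machine.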

Thanks,
Ron

Sent from my iPhone

> On Apr 3, 2014, at 11:19 AM, Kevin Markey  wrote:
> 
> We are now testing precisely what you ask about in our environment.  But 
> Sandy's questions are relevant.  The bigger issue is not Spark vs. Yarn but 
> "client" vs. "standalone" and where the client is located on the network 
> relative to the cluster.
> 
> The "client" options that locate the client/master remote from the cluster, 
> while useful for interactive queries, suffer from considerable network 
> traffic overhead as the master schedules and transfers data with the worker 
> nodes on the cluster.  The "standalone" options locate the master/client on 
> the cluster.  In yarn-standalone, the master is a thread contained by the 
> Yarn Resource Manager.  Lots less traffic, as the master is co-located with 
> the worker nodes on the cluster and its scheduling/data communication has 
> less latency.
> 
> In my comparisons between yarn-client and yarn-standalone (so as not to 
> conflate yarn vs Spark), yarn-client computation time is at least double 
> yarn-standalone!  At least for a job with lots of stages and lots of 
> client/worker communication, although rather few "collect" actions, so it's 
> mainly scheduling that's relevant here.
> 
> I'll be posting more information as I have it available.
> 
> Kevin
> 
> 
>> On 03/03/2014 03:48 PM, Sandy Ryza wrote:
>> Are you running in yarn-standalone mode or yarn-client mode?  Also, what 
>> YARN scheduler and what NodeManager heartbeat?  
>> 
>> 
>> On Sun, Mar 2, 2014 at 9:41 PM, polkosity  wrote:
>>> Thanks for the advice Mayur.
>>> 
>>> I thought I'd report back on the performance difference...  Spark standalone
>>> mode has executors processing at capacity in under a second :)
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 


Re: Spark output compression on HDFS

2014-04-04 Thread Konstantin Kudryavtsev
Can anybody suggest how to change the compression type (RECORD vs BLOCK) for
Snappy, if that is possible?
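
In case it helps, here is an untested sketch of one possible route: RECORD vs
BLOCK for SequenceFiles is a Hadoop-level setting, so it can be passed through a
JobConf to saveAsHadoopFile (old mapred API). `counts` and `output` are as in
the quoted snippets below, and `sc` is the SparkContext.

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, SequenceFileOutputFormat}

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setBoolean("mapred.output.compress", true)
FileOutputFormat.setOutputCompressorClass(jobConf, classOf[SnappyCodec])
// RECORD vs BLOCK is a SequenceFile/Hadoop setting rather than a Spark one.
SequenceFileOutputFormat.setOutputCompressionType(jobConf, CompressionType.BLOCK)

counts.saveAsHadoopFile(
  output,
  classOf[IntWritable],
  classOf[Text],
  classOf[SequenceFileOutputFormat[IntWritable, Text]],
  jobConf)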

thank you in advance

Thank you,
Konstantin Kudryavtsev


On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> Thanks all, it works fine now and I managed to compress output. However, I
> am still stuck... How is it possible to set the compression type for Snappy?
> I mean setting record- or block-level compression for the output.
>  On Apr 3, 2014 1:15 AM, "Nicholas Chammas" 
> wrote:
>
>> Thanks for pointing that out.
>>
>>
>> On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra wrote:
>>
>>> First, you shouldn't be using spark.incubator.apache.org anymore, just
>>> spark.apache.org.  Second, saveAsSequenceFile doesn't appear to exist
>>> in the Python API at this point.
>>>
>>>
>>> On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Is this a Scala-only feature?


 On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell wrote:

> For textFile I believe we overload it and let you set a codec directly:
>
>
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>
> For saveAsSequenceFile yep, I think Mark is right, you need an option.
>
>
> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra  > wrote:
>
>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>
>> The signature is 'def saveAsSequenceFile(path: String, codec:
>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a
>> Class, not an Option[Class].
>>
>> Try counts.saveAsSequenceFile(output,
>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>
>>
>>
>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>>
>>> I've started using Spark recently and evaluating possible use cases
>>> in our company.
>>>
>>> I'm trying to save RDD as compressed Sequence file. I'm able to save
>>> non-compressed file be calling:
>>>
>>>
>>>
>>>
>>> counts.saveAsSequenceFile(output)
>>>
>>> where counts is my RDD (IntWritable, Text). However, I didn't manage
>>> to compress output. I tried several configurations and always got 
>>> exception:
>>>
>>>
>>>
>>>
>>> counts.saveAsSequenceFile(output, 
>>> classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>> :21: error: type mismatch;
>>>  found   : 
>>> Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>  required: Option[Class[_ <: 
>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>   counts.saveAsSequenceFile(output, 
>>> classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>
>>>  counts.saveAsSequenceFile(output, 
>>> classOf[org.apache.spark.io.SnappyCompressionCodec])
>>> :21: error: type mismatch;
>>>  found   : 
>>> Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>  required: Option[Class[_ <: 
>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>   counts.saveAsSequenceFile(output, 
>>> classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>
>>> and it doesn't work even for Gzip:
>>>
>>>
>>>
>>>
>>>  counts.saveAsSequenceFile(output, 
>>> classOf[org.apache.hadoop.io.compress.GzipCodec])
>>> :21: error: type mismatch;
>>>  found   : 
>>> Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>  required: Option[Class[_ <: 
>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>   counts.saveAsSequenceFile(output, 
>>> classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>
>>> Could you please suggest solution? also, I didn't find how is it
>>> possible to specify compression parameters (i.e. compression type for
>>> Snappy). I wondered if you could share code snippets for writing/reading
>>> RDD with compression?
>>>
>>> Thank you in advance,
>>> Konstantin Kudryavtsev
>>>
>>
>>
>

>>>
>>


Re: how to save RDD partitions in different folders?

2014-04-04 Thread Konstantin Kudryavtsev
Hi Evan,

Could you please provide a code snippet? It is not clear to me: in Hadoop you
need to use the addNamedOutput method, and I am stuck on how to use it from
Spark.

Thank you,
Konstantin Kudryavtsev


On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks  wrote:

> Have a look at MultipleOutputs in the hadoop API. Spark can read and write
> to arbitrary hadoop formats.
>
> > On Apr 4, 2014, at 6:01 AM, dmpour23  wrote:
> >
> > Hi all,
> > Say I have an input file which I would like to partition using
> > HashPartitioner k times.
> >
> > Calling  rdd.saveAsTextFile(""hdfs://"); will save k files as part-0
> > part-k
> > Is there a way to save each partition in specific folders?
> >
> > i.e. src
> >  part0/part-0
> >  part1/part-1
> >  part1/part-k
> >
> > thanks
> > Dimitri
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


reduceByKeyAndWindow Java

2014-04-04 Thread Eduardo Costa Alfaia


Hi guys,

I would like to know if this piece of code is right for using a window.

JavaPairDStream<String, Integer> wordCounts = words.map(
  new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }).reduceByKeyAndWindow(
    // reduce function: add counts entering the window
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    // inverse reduce function: subtract counts leaving the window
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),  // window length: 5 minutes
    new Duration(1 * 1000)        // slide interval: 1 second
  );
// Note: the inverse-reduce form above requires checkpointing to be enabled
// on your JavaStreamingContext, e.g. ssc.checkpoint(...).



Thanks

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Amit Tewari
I believe you got to set following

SPARK_HADOOP_VERSION=2.2.0 (or whatever your version is)
SPARK_YARN=true

then type sbt/sbt assembly

If you are using Maven to compile

mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean
package


Hope this helps

-A


On Fri, Apr 4, 2014 at 7:28 AM, Erik Freed wrote:

> Hi All,
>
> I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps
> already being addressed, but I am having a devil of a time with a spark
> 0.9.0 client jar for hadoop 2.X. If I go to the site and download:
>
>
>    - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror or direct file download
>
> I get a jar with what appears to be hadoop 1.0.4 that fails when using
> hadoop 2.3.0. I have tried repeatedly to build the source tree with the
> correct options per the documentation but always seemingly ending up with
> hadoop 1.0.4.  As far as I can tell the reason that the jar available on
> the web site doesn't have the correct hadoop client in it, is because the
> build itself is having that problem.
>
> I am about to try to troubleshoot the build but wanted to see if anyone
> out there has encountered the same problem and/or if I am just doing
> something dumb (!)
>
>
> Anyone else using hadoop 2.X? How do you get the right client jar if so?
>
> cheers,
> Erik
>
> --
> Erik James Freed
> CoDecision Software
> 510.859.3360
> erikjfr...@codecision.com
>
> 1480 Olympus Avenue
> Berkeley, CA
> 94708
>
> 179 Maria Lane
> Orcas, WA
> 98245
>


Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Rahul Singhal
Hi Erik,

I am working with TOT branch-0.9 (> 0.9.1) and the following works for me for 
maven build:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package


And from http://spark.apache.org/docs/latest/running-on-yarn.html, for sbt 
build, you could try:

SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly

Thanks,
Rahul Singhal

From: Erik Freed <erikjfr...@codecision.com>
Reply-To: "user@spark.apache.org" <user@spark.apache.org>
Date: Friday 4 April 2014 7:58 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Hadoop 2.X Spark Client Jar 0.9.0 problem

Hi All,

I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps 
already being addressed, but I am having a devil of a time with a spark 0.9.0 
client jar for hadoop 2.X. If I go to the site and download:


  *   Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror or direct file download

I get a jar with what appears to be hadoop 1.0.4 that fails when using hadoop 
2.3.0. I have tried repeatedly to build the source tree with the correct 
options per the documentation but always seemingly ending up with hadoop 1.0.4. 
 As far as I can tell the reason that the jar available on the web site doesn't 
have the correct hadoop client in it, is because the build itself is having 
that problem.

I am about to try to troubleshoot the build but wanted to see if anyone out 
there has encountered the same problem and/or if I am just doing something dumb 
(!)


Anyone else using hadoop 2.X? How do you get the right client jar if so?

cheers,
Erik

--
Erik James Freed
CoDecision Software
510.859.3360
erikjfr...@codecision.com

1480 Olympus Avenue
Berkeley, CA
94708

179 Maria Lane
Orcas, WA
98245


Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Hi All,

I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps
already being addressed, but I am having a devil of a time with a spark
0.9.0 client jar for hadoop 2.X. If I go to the site and download:


   - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror or direct file download

I get a jar with what appears to be hadoop 1.0.4 that fails when using
hadoop 2.3.0. I have tried repeatedly to build the source tree with the
correct options per the documentation but always seemingly ending up with
hadoop 1.0.4.  As far as I can tell the reason that the jar available on
the web site doesn't have the correct hadoop client in it, is because the
build itself is having that problem.

I am about to try to troubleshoot the build but wanted to see if anyone out
there has encountered the same problem and/or if I am just doing something
dumb (!)


Anyone else using hadoop 2.X? How do you get the right client jar if so?

cheers,
Erik

-- 
Erik James Freed
CoDecision Software
510.859.3360
erikjfr...@codecision.com

1480 Olympus Avenue
Berkeley, CA
94708

179 Maria Lane
Orcas, WA
98245


Re: how to save RDD partitions in different folders?

2014-04-04 Thread Evan Sparks
Have a look at MultipleOutputs in the hadoop API. Spark can read and write to 
arbitrary hadoop formats. 
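
For what it's worth, an untested sketch of one way to do this from Spark, using
the old-API MultipleTextOutputFormat (rather than MultipleOutputs/addNamedOutput)
since it plugs straight into saveAsHadoopFile. The class name, `rdd`, `k` and
the output path below are placeholders, and records are routed by key value; if
you want one folder per partition id, key each record by its partition index
first (e.g. with mapPartitionsWithIndex).

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// Routes each record into a sub-directory named after its key.
class KeyBasedOutput extends MultipleTextOutputFormat[String, String] {
  // Drop the key from the written records; TextOutputFormat skips null keys.
  override def generateActualKey(key: String, value: String): String = null
  // e.g. key "0" -> "part0/part-00000", key "1" -> "part1/part-00000", ...
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    "part" + key + "/" + name
}

// rdd: RDD[(String, String)] whose key identifies the target folder.
val partitioned = rdd.partitionBy(new HashPartitioner(k))
partitioned.saveAsHadoopFile(
  "hdfs://namenode/path/output",
  classOf[String],
  classOf[String],
  classOf[KeyBasedOutput])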

> On Apr 4, 2014, at 6:01 AM, dmpour23  wrote:
> 
> Hi all,
> Say I have an input file which I would like to partition using
> HashPartitioner k times.
> 
> Calling  rdd.saveAsTextFile(""hdfs://"); will save k files as part-0
> part-k
> Is there a way to save each partition in specific folders?
> 
> i.e. src
>  part0/part-0 
>  part1/part-1
>  part1/part-k
> 
> thanks
> Dimitri
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Driver increase memory utilization

2014-04-04 Thread Eduardo Costa Alfaia

   Hi Guys,

Could anyone help me understand this driver behavior when I start the 
JavaNetworkWordCount?


computer8
 16:24:07 up 121 days, 22:21, 12 users,  load average: 0.66, 1.27, 1.55
total   used   free shared buffers 
cached

Mem:  5897   4341 1555 0227   2798
-/+ buffers/cache:   1315   4581
Swap:0  0  0

in 2 minutes

computer8
 16:23:08 up 121 days, 22:20, 12 users,  load average: 0.80, 1.43, 1.62
 total   used   free shared buffers 
cached

Mem:  5897   5866 30 0230   3255
-/+ buffers/cache:   2380   3516
Swap:0  0  0


Thanks

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: How to create a RPM package

2014-04-04 Thread Christophe Préaud

Hi Rahul,

Spark will be available in Fedora 21 (see: 
https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently 
scheduled on 2014-10-14 but they already have produced spec files and source 
RPMs.
If you are stuck with EL6 like me, you can have a look at the attached spec 
file, which you can probably adapt to your need.

Christophe.

On 04/04/2014 09:10, Rahul Singhal wrote:
Hello Community,

This is my first mail to the list and I have a small question. The maven build 
page
 mentions a way to create a debian package but I was wondering if there is a simple 
way (preferably through maven) to create a RPM package. Is there a script (which is 
probably used for spark releases) that I can get my hands on? Or should I write one 
on my own?

P.S. I don't want to use the "alien" software to convert a debian package to a 
RPM.

Thanks,
Rahul Singhal



Kelkoo SAS
Société par Actions Simplifiée (simplified joint-stock company)
Share capital: €4,168,964.30
Registered office: 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended solely for their
addressees. If you are not the intended recipient of this message, please delete
it and notify the sender.
Name: spark
Version:  0.9.0

# Build time settings
%global _full_version %{version}-incubating
%global _final_name %{name}-%{_full_version}
%global _spark_hadoop_version 2.2.0
%global _spark_dir /opt

Release:  2
Summary:  Lightning-fast cluster computing
Group:Development/Libraries
License:  ASL 2.0
URL:  http://spark.apache.org/
Source0:  http://www.eu.apache.org/dist/incubator/spark/%{_final_name}/%{_final_name}.tgz
BuildRequires: git
Requires:  /bin/bash
Requires:  /bin/sh
Requires:  /usr/bin/env

%description
Apache Spark is a fast and general engine for large-scale data processing.


%prep
%setup -q -n %{_final_name}


%build
SPARK_HADOOP_VERSION=%{_spark_hadoop_version} SPARK_YARN=true ./sbt/sbt assembly
find bin -type f -name '*.cmd' -exec rm -f {} \;


%install
mkdir -p ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/{conf,jars}
echo "Spark %{_full_version} built for Hadoop %{_spark_hadoop_version}" > "${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/RELEASE"
cp assembly/target/scala*/spark-assembly-%{_full_version}-hadoop%{_spark_hadoop_version}.jar ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/jars/spark-assembly-hadoop.jar
cp conf/*.template ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/conf
cp -r bin ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}
cp -r python ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}
cp -r sbin ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}


%files
%defattr(-,root,root,-)
%{_spark_dir}/%{name}

%changelog
* Mon Mar 31 2014 Christophe Préaud  0.9.0-2
- Use description and Summary from Fedora RPM

* Wed Mar 26 2014 Christophe Préaud  0.9.0-1
- first version with changelog :-)


Re: Spark 1.0.0 release plan

2014-04-04 Thread Tom Graves
Do we have a list of things we really want to get in for 1.X? Perhaps move
any JIRAs out to a 1.1 release if we aren't targeting them for 1.0.

It might be nice to send out reminders when these dates are approaching.

Tom
On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta  wrote:
 
Thanks a lot guys!





On Fri, Apr 4, 2014 at 5:34 AM, Patrick Wendell  wrote:

Btw - after that initial thread I proposed a slightly more detailed set of 
dates:
>
>https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage 
>
>
>- Patrick
>
>
>
>On Thu, Apr 3, 2014 at 11:28 AM, Matei Zaharia  wrote:
>
>Hey Bhaskar, this is still the plan, though QAing might take longer than 15 
>days. Right now since we’ve passed April 1st, the only features considered for 
>a merge are those that had pull requests in review before. (Some big ones are 
>things like annotating the public APIs and simplifying configuration). Bug 
>fixes and things like adding Python / Java APIs for new components will also 
>still be considered.
>>
>>
>>Matei
>>
>>
>>On Apr 3, 2014, at 10:30 AM, Bhaskar Dutta  wrote:
>>
>>Hi,
>>>
>>>
>>>Is there any change in the release plan for Spark 1.0.0-rc1 release date 
>>>from what is listed in the "Proposal for Spark Release Strategy" thread?
>>>== Tentative Release Window for 1.0.0 ==
Feb 1st - April 1st: General development
April 1st: Code freeze for new features
April 15th: RC1
>>>Thanks,
>>>Bhaskar
>>
>

How to start history tracking URL

2014-04-04 Thread zhxfl
I run the Spark client on YARN and use "history-daemon.sh start historyserver" to
start the history tracking URL, but it didn't work. Why?

how to save RDD partitions in different folders?

2014-04-04 Thread dmpour23
Hi all,
Say I have an input file which I would like to partition using
HashPartitioner k times.

Calling rdd.saveAsTextFile("hdfs://...") will save k files, part-0 through
part-k.
Is there a way to save each partition in specific folders?

i.e. src
  part0/part-0
  part1/part-1
  partk/part-k

thanks
Dimitri





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RAM high consume

2014-04-04 Thread Eduardo Costa Alfaia

 Hi all,

I am doing some tests using JavaNetworkWordCount and I have some
questions about the machine's performance; my tests run for
approximately 2 minutes.


Why does the free RAM decrease so significantly? I have done tests with 2 and 3
machines and got the same behavior.


What should I do to get better performance in this case?


# Start Test

computer1
 total   used   free sharedbuffers cached
Mem:  3945711 3233 0  3430
-/+ buffers/cache:276   3668
Swap:0  0  0

 14:42:50 up 73 days,  3:32,  2 users,  load average: 0.00, 0.06, 0.21


14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614314400 in memory on computer1.ant-net:60820 (size: 826.1 
KB, free: 542.9 MB)
14/04/04 14:24:56 INFO MemoryStore: ensureFreeSpace(845956) called with 
curMem=47278100, maxMem=825439027
14/04/04 14:24:56 INFO MemoryStore: Block input-0-1396614314400 stored 
as bytes to memory (size 826.1 KB, free 741.3 MB)
14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614314400 in memory on computer8.ant-net:49743 (size: 826.1 
KB, free: 741.3 MB)
14/04/04 14:24:56 INFO BlockManagerMaster: Updated info of block 
input-0-1396614314400
14/04/04 14:24:56 INFO TaskSetManager: Finished TID 272 in 84 ms on 
computer1.ant-net (progress: 0/1)

14/04/04 14:24:56 INFO TaskSchedulerImpl: Remove TaskSet 43.0 from pool
14/04/04 14:24:56 INFO DAGScheduler: Completed ResultTask(43, 0)
14/04/04 14:24:56 INFO DAGScheduler: Stage 43 (take at 
DStream.scala:594) finished in 0.088 s
14/04/04 14:24:56 INFO SparkContext: Job finished: take at 
DStream.scala:594, took 1.872875734 s

---
Time: 1396614289000 ms
---
(Santiago,1)
(liveliness,1)
(Sun,1)
(reapers,1)
(offer,3)
(BARBER,3)
(shrewdness,1)
(truism,1)
(hits,1)
(merchant,1)



# End Test
computer1
 total   used   free sharedbuffers cached
Mem:  3945   2209 1735 0  5773
-/+ buffers/cache:   1430   2514
Swap:0  0  0

 14:46:05 up 73 days,  3:35,  2 users,  load average: 2.69, 1.07, 0.55


14/04/04 14:26:57 INFO TaskSetManager: Starting task 183.0:0 as TID 696 
on executor 0: computer1.ant-net (PROCESS_LOCAL)
14/04/04 14:26:57 INFO TaskSetManager: Serialized task 183.0:0 as 1981 
bytes in 0 ms
14/04/04 14:26:57 INFO MapOutputTrackerMasterActor: Asked to send map 
output locations for shuffle 81 to sp...@computer1.ant-net:44817
14/04/04 14:26:57 INFO MapOutputTrackerMaster: Size of output statuses 
for shuffle 81 is 212 bytes
14/04/04 14:26:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614336600 on disk on computer1.ant-net:60820 (size: 1441.7 KB)
14/04/04 14:26:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614435200 in memory on computer1.ant-net:60820 (size: 1295.7 
KB, free: 589.3 KB)
14/04/04 14:26:57 INFO TaskSetManager: Finished TID 696 in 56 ms on 
computer1.ant-net (progress: 0/1)

14/04/04 14:26:57 INFO TaskSchedulerImpl: Remove TaskSet 183.0 from pool
14/04/04 14:26:57 INFO DAGScheduler: Completed ResultTask(183, 0)
14/04/04 14:26:57 INFO DAGScheduler: Stage 183 (take at 
DStream.scala:594) finished in 0.057 s
14/04/04 14:26:57 INFO SparkContext: Job finished: take at 
DStream.scala:594, took 1.575268894 s

---
Time: 1396614359000 ms
---
(hapless,9)
(reapers,8)
(amazed,113)
(feebleness,7)
(offer,148)
(rabble,27)
(exchanging,7)
(merchant,20)
(incentives,2)
(quarrel,48)
...


Thanks Guys

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Explain Add Input

2014-04-04 Thread Eduardo Costa Alfaia

Hi all,

Could anyone explain the lines below to me?

computer1 - worker
computer8 - driver(master)

14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5 
KB, free: 540.3 MB)
14/04/04 14:24:56 INFO MemoryStore: ensureFreeSpace(1292780) called with 
curMem=49555672, maxMem=825439027
14/04/04 14:24:56 INFO MemoryStore: Block input-0-1396614314800 stored 
as bytes to memory (size 1262.5 KB, free 738.7 MB)
14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614314800 in memory on computer8.ant-net:49743 (size: 1262.5 
KB, free: 738.7 MB)


Why does Spark add the same input block on computer8, which is the driver (master)?

Thanks guys

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: module not found: org.eclipse.paho#mqtt-client;0.4.0

2014-04-04 Thread Sean Owen
This ultimately means a problem with SSL in the version of Java you
are using to run SBT. If you look around the internet, you'll see a
bunch of discussion, most of which seems to boil down to reinstall, or
update, Java.
--
Sean Owen | Director, Data Science | London


On Fri, Apr 4, 2014 at 12:21 PM, Dear all  wrote:
>
> hello, all
>
>  I am new to Spark & Scala.
>
>  Yesterday my Spark installation failed, and the error message is below:
>
>   Can anyone help me: why can't the mqtt-client-0.4.0.pom be found? What
> should I do?
>
>  thanks a lot!
>
> command: sbt/sbt assembly
>
> [info] Updating
> {file:/Users/alick/spark/spark-0.9.0-incubating/}external-mqtt...
>
> [info] Resolving org.eclipse.paho#mqtt-client;0.4.0 ...
>
> [error] Server access Error: java.lang.RuntimeException: Unexpected error:
> java.security.InvalidAlgorithmParameterException: the trustAnchors parameter
> must be non-empty
> url=https://oss.sonatype.org/content/repositories/snapshots/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
>
> [error] Server access Error: java.lang.RuntimeException: Unexpected error:
> java.security.InvalidAlgorithmParameterException: the trustAnchors parameter
> must be non-empty
> url=https://oss.sonatype.org/service/local/staging/deploy/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
>
> [error] Server access Error: java.lang.RuntimeException: Unexpected error:
> java.security.InvalidAlgorithmParameterException: the trustAnchors parameter
> must be non-empty
> url=https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
>
> [warn] module not found: org.eclipse.paho#mqtt-client;0.4.0
>
> [warn]  local: tried
>
> [warn]
> /Users/alick/.ivy2/local/org.eclipse.paho/mqtt-client/0.4.0/ivys/ivy.xml
>
> [warn]  Local Maven Repo: tried
>
> [warn]  sonatype-snapshots: tried
>
> [warn]
> https://oss.sonatype.org/content/repositories/snapshots/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
>
> [warn]  sonatype-staging: tried
>
> [warn]
> https://oss.sonatype.org/service/local/staging/deploy/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
>
> [warn]  Eclipse Repo: tried
>
> [warn]
> https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
>
> [warn]  public: tried
>
> [warn]
> http://repo1.maven.org/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
>
>
>
>
>
>
>


module not found: org.eclipse.paho#mqtt-client;0.4.0

2014-04-04 Thread Dear all


hello, all


 I am new to Spark & Scala.

 Yesterday my Spark installation failed, and the error message is below:


  Can anyone help me: why can't the mqtt-client-0.4.0.pom be found? What should
I do?


 thanks a lot!


command: sbt/sbt assembly

[info] Updating 
{file:/Users/alick/spark/spark-0.9.0-incubating/}external-mqtt...

[info] Resolving org.eclipse.paho#mqtt-client;0.4.0 ...

[error] Server access Error: java.lang.RuntimeException: Unexpected error: 
java.security.InvalidAlgorithmParameterException: the trustAnchors parameter 
must be non-empty 
url=https://oss.sonatype.org/content/repositories/snapshots/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

[error] Server access Error: java.lang.RuntimeException: Unexpected error: 
java.security.InvalidAlgorithmParameterException: the trustAnchors parameter 
must be non-empty 
url=https://oss.sonatype.org/service/local/staging/deploy/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

[error] Server access Error: java.lang.RuntimeException: Unexpected error: 
java.security.InvalidAlgorithmParameterException: the trustAnchors parameter 
must be non-empty 
url=https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

[warn] module not found: org.eclipse.paho#mqtt-client;0.4.0

[warn]  local: tried

[warn]   
/Users/alick/.ivy2/local/org.eclipse.paho/mqtt-client/0.4.0/ivys/ivy.xml

[warn]  Local Maven Repo: tried

[warn]  sonatype-snapshots: tried

[warn]   
https://oss.sonatype.org/content/repositories/snapshots/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

[warn]  sonatype-staging: tried

[warn]   
https://oss.sonatype.org/service/local/staging/deploy/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

[warn]  Eclipse Repo: tried

[warn]   
https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

[warn]  public: tried

[warn]   
http://repo1.maven.org/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom








Re: Error when run Spark on mesos

2014-04-04 Thread Gino Mathews
Hi,

The issue was due to a Mesos version mismatch: I am using the latest Mesos 0.17.0,
but Spark uses 0.13.0.
I fixed it by updating SparkBuild.scala to the latest version.
However, I am now faced with errors in the Mesos worker threads.
I tried again after upgrading Spark to 0.9.1, but the issue persists. Thanks for
the upgrade info.
Please let me know if this is a Hadoop configuration issue. The configuration
files are attached herewith.
I have configured HDFS as hdfs://master:9000.

The error log in the per-task folder is as follows:

Failed to copy from HDFS: hadoop fs -copyToLocal 
'hdfs://master/user/hduser/spark-0.9.1.tar.gz' './spark-0.9.1.tar.gz'
copyToLocal: Call From slave03/192.168.0.100 to master:8020 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Failed to fetch executors
Fetching resources into 
'/tmp/mesos/slaves/201404021638-2315299008-5050-6539-1/frameworks/201404041508-2315299008-5050-19945-/executors/201404021638-2315299008-5050-6539-1/runs/efd1fc4c-ec01-470c-9dd0-c34cc8052ffe'
Fetching resource 'hdfs://master/user/hduser/spark-0.9.1.tar.gz'

The worker output is as follows:


I0403 20:21:11.007439 24158 slave.cpp:837] Launching task 158 for framework 
201404032011-2315299008-5050-13793-
I0403 20:21:11.009099 24158 slave.cpp:947] Queuing task '158' for executor 
201404021638-2315299008-5050-6539-1 of framework 
'201404032011-2315299008-5050-13793-
I0403 20:21:11.009193 24158 slave.cpp:728] Got assigned task 159 for framework 
201404032011-2315299008-5050-13793-
I0403 20:21:11.009204 24161 process_isolator.cpp:100] Launching 
201404021638-2315299008-5050-6539-1 (cd spark-0*; ./sbin/spark-executor) in 
/tmp/mesos/slaves/201404021638-2315299008-5050-6539-1/frameworks/201404032011-2315299008-5050-13793-/executors/201404021638-2315299008-5050-6539-1/runs/99e767c5-5de5-43ba-aebc-86c119b4878f
 with resources mem(*):512; cpus(*):1' for framework 
201404032011-2315299008-5050-13793-
I0403 20:21:11.009348 24158 slave.cpp:837] Launching task 159 for framework 
201404032011-2315299008-5050-13793-
I0403 20:21:11.009418 24158 slave.cpp:947] Queuing task '159' for executor 
201404021638-2315299008-5050-6539-1 of framework 
'201404032011-2315299008-5050-13793-
I0403 20:21:11.010208 24161 process_isolator.cpp:163] Forked executor at 25380
I0403 20:21:11.013329 24159 slave.cpp:2090] Monitoring executor 
201404021638-2315299008-5050-6539-1 of framework 
201404032011-2315299008-5050-13793- forked at pid 25380
I0403 20:21:12.994786 24161 process_isolator.cpp:482] Telling slave of 
terminated executor '201404021638-2315299008-5050-6539-1' of framework 
201404032011-2315299008-5050-13793-
I0403 20:21:12.998713 24163 slave.cpp:2146] Executor 
'201404021638-2315299008-5050-6539-1' of framework 
201404032011-2315299008-5050-13793- has exited with status 255
I0403 20:21:13.000233 24163 slave.cpp:1757] Handling status update TASK_LOST 
(UUID: 2bdef9d2-c323-4d21-9b33-fedf7f8e9729) for task 158 of framework 
201404032011-2315299008-5050-13793- from @0.0.0.0:0
I0403 20:21:13.000493 24162 status_update_manager.cpp:314] Received status 
update TASK_LOST (UUID: 2bdef9d2-c323-4d21-9b33-fedf7f8e9729) for task 158 of 
framework 201404032011-2315299008-5050-13793-
I0403 20:21:13.000701 24162 status_update_manager.cpp:367] Forwarding status 
update TASK_LOST (UUID: 2bdef9d2-c323-4d21-9b33-fedf7f8e9729) for task 158 of 
framework 201404032011-2315299008-5050-13793- to master@192.168.0.138:5050
I0403 20:21:13.001829 24163 slave.cpp:1757] Handling status update TASK_LOST 
(UUID: 4b9b4f38-a83d-40df-8c1c-a7d857c0ce86) for task 159 of framework 
201404032011-2315299008-5050-13793- from @0.0.0.0:0
I0403 20:21:13.002003 24162 status_update_manager.cpp:314] Received status 
update TASK_LOST (UUID: 4b9b4f38-a83d-40df-8c1c-a7d857c0ce86) for task 159 of 
framework 201404032011-2315299008-5050-13793-
I0403 20:21:13.002125 24162 status_update_manager.cpp:367] Forwarding status 
update TASK_LOST (UUID: 4b9b4f38-a83d-40df-8c1c-a7d857c0ce86) for task 159 of 
framework 201404032011-2315299008-5050-13793- to master@192.168.0.138:5050
I0403 20:21:13.006757 24157 status_update_manager.cpp:392] Received status 
update acknowledgement (UUID: 2bdef9d2-c323-4d21-9b33-fedf7f8e9729) for task 
158 of framework 201404032011-2315299008-5050-13793-
I0403 20:21:13.006875 24157 status_update_manager.cpp:392] Received status 
update acknowledgement (UUID: 4b9b4f38-a83d-40df-8c1c-a7d857c0ce86) for task 
159 of framework 201404032011-2315299008-5050-13793-
I0403 20:21:13.006990 24156 slave.cpp:2281] Cleaning up executor 
'201404021638-2315299008-5050-6539-1' of framework 
201404032011-2315299008-5050-13793-
I0403 20:21:13.007307 24162 gc.cpp:56] Scheduling 
'/tmp/mesos/slaves/201404021638-2315299008-5050-6539-1/frameworks/201404032011-2315299008-5050-13793-/executors/201404021638-2315299008-505

How to create a RPM package

2014-04-04 Thread Rahul Singhal
Hello Community,

This is my first mail to the list and I have a small question. The maven build 
page
 mentions a way to create a debian package but I was wondering if there is a 
simple way (preferably through maven) to create a RPM package. Is there a 
script (which is probably used for spark releases) that I can get my hands on? 
Or should I write one on my own?

P.S. I don't want to use the "alien" software to convert a debian package to a 
RPM.

Thanks,
Rahul Singhal