Re: Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello,

We were comparing the performance of some of our production Hive queries
between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both
Spark 0.9 and 1.1. We could see that the performance gains in Spark have been
good.

We tried a very simple query,
select count(*) from T where col3=123
in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark's
performance was 2x better than Hive's (120 sec vs 60 sec). Table T is
stored in S3 as a single 600 MB gzip file.

My question is, why is Spark faster than Hive here? In both cases,
the file will be downloaded and uncompressed, and the lines will be counted by a
single process. In the Hive case, the reducer will be an identity function
since hive.map.aggr is true.

Note that disk spills and network I/O are very low in Hive's case as well.
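
For reference, a minimal Scala sketch of the same count done directly on an RDD; the S3 path, the delimiter, and the column position are assumptions for illustration, not taken from the post. It also shows why a single gzip file ends up as one partition, and therefore one task, regardless of the engine:

    import org.apache.spark.{SparkConf, SparkContext}

    object GzipCountSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GzipCountSketch"))

        // A single gzip file is not splittable, so this RDD has exactly one
        // partition and the scan/count runs as a single task.
        val lines = sc.textFile("s3n://some-bucket/T/part-00000.gz") // hypothetical path
        println(s"partitions = ${lines.partitions.length}")          // expect 1

        // Rough equivalent of: select count(*) from T where col3=123,
        // assuming col3 is the third tab-separated column (an assumption).
        val matching = lines.filter(_.split("\t")(2) == "123").count()
        println(s"count = $matching")

        sc.stop()
      }
    }

Splitting the input into several gzip files (or using a splittable codec) would let either engine parallelise the scan.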


Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-22 Thread Niranda Perera
Hi,

I would like to know if there is an update on this.

rgds

On Mon, Jan 12, 2015 at 10:44 AM, Niranda Perera niranda.per...@gmail.com
wrote:

 Hi,

 I found out that Spark SQL currently supports only a relatively small subset
 of the SQL dialect.

 I would like to know the roadmap for the coming releases.

 And, are you focusing more on popularizing the 'Hive on Spark' SQL dialect
 or the Spark SQL dialect?

 Rgds
 --
 Niranda




-- 
Niranda


Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-22 Thread TJ Klein
Seems like it is a bug rather than a feature.
I filed a bug report: https://issues.apache.org/jira/browse/SPARK-5363



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-slow-working-Spark-1-2-fast-freezing-tp21278p21317.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
and post the code (if possible).
In a nutshell, your processing time > batch interval, resulting in an
ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic
is.

-kr, Gerard.

On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 This is not normal. It's a huge scheduling delay!! Can you tell me more
 about the application?
 - cluster setup, number of receivers, what's the computation, etc.

 On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:

 Hate to do this...but...erm...bump? Would really appreciate input from
 others using Streaming. Or at least some docs that would tell me if these
 are expected or not.

 --
 From: as...@live.com
 To: user@spark.apache.org
 Subject: Are these numbers abnormal for spark streaming?
 Date: Wed, 21 Jan 2015 11:26:31 +


 Hi Guys,
 I've got Spark Streaming set up for a low data rate system (using spark's
 features for analysis, rather than high throughput). Messages are coming in
 throughout the day, at around 1-20 per second (finger in the air
 estimate...not analysed yet).  In the spark streaming UI for the
 application, I'm getting the following after 17 hours.

 Streaming

- *Started at: *Tue Jan 20 16:58:43 GMT 2015
- *Time since start: *18 hours 24 minutes 34 seconds
- *Network receivers: *2
- *Batch interval: *2 seconds
- *Processed batches: *16482
- *Waiting batches: *1



 Statistics over last 100 processed batches

 Receiver Statistics

 Receiver | Status | Location | Records in last batch [2015/01/21 11:23:18] | Minimum rate [records/sec] | Median rate [records/sec] | Maximum rate [records/sec] | Last Error
 RmqReceiver-0 | ACTIVE | F | 14 | 4 | 7 | 27 | -
 RmqReceiver-1 | ACTIVE | BR | 12 | 4 | 7 | 26 | -

 Batch Processing Statistics

 Metric | Last batch | Minimum | 25th percentile | Median | 75th percentile | Maximum
 Processing Time | 3 s 994 ms | 157 ms | 4 s 16 ms | 4 s 961 ms | 5 s 3 ms | 5 s 171 ms
 Scheduling Delay | 9 h 15 min 4 s | 9 h 10 min 54 s | 9 h 11 min 56 s | 9 h 12 min 57 s | 9 h 14 min 5 s | 9 h 15 min 4 s
 Total Delay | 9 h 15 min 8 s | 9 h 10 min 58 s | 9 h 12 min | 9 h 13 min 2 s | 9 h 14 min 10 s | 9 h 15 min 8 s


 Are these normal? I was wondering what the scheduling delay and total
 delay terms are, and if it's normal for them to be 9 hours.

 I've got a standalone spark master and 4 spark nodes. The streaming app
 has been given 4 cores, and it's using 1 core per worker node. The
 streaming app is submitted from a 5th machine, and that machine has nothing
 but the driver running. The worker nodes are running alongside Cassandra
 (and reading and writing to it).

 Any insights would be appreciated.

 Regards,
 Ashic.





RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is 
deserialised to JSON. Certain fields are checked to decide if further processing 
is required. If so, it goes through a series of in-mem filters to check if more 
processing is required. If so, only then does the heavy work start. That 
consists of a few db queries, and potential updates to the db + a message on a 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and 
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing the batch interval to 15 seconds. I'll 
report back with findings.
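
For what it's worth, the usual trick for that cost is to pay the setup once per partition (or once per executor via a lazily initialised holder) rather than once per record. A minimal sketch, where Session and SessionHolder are hypothetical stand-ins for the real Cassandra session and prepared statements:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical stand-ins for an expensive session object; in the real
    // code these would be the Cassandra driver types.
    class Session { def execute(stmt: String, args: Any*): Unit = () }

    object SessionHolder {
      // One lazily created session per executor JVM, reused across batches.
      lazy val session: Session = new Session
    }

    def save(messages: DStream[String]): Unit = {
      messages.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // Per-partition setup happens here, once per partition per batch;
          // the lazy session above is only built once per executor.
          val session = SessionHolder.session
          records.foreach { record =>
            session.execute("INSERT INTO t (payload) VALUES (?)", record)
          }
        }
      }
    }

Note that the spark-cassandra-connector's withSessionDo already pools sessions; the part that usually still needs caching is the prepared statements.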

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible). In a nutshell, your processing time > batch 
interval, resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com 
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

Streaming
Started at: Tue Jan 20 16:58:43 GMT 2015
Time since start: 18 hours 24 minutes 34 seconds
Network receivers: 2
Batch interval: 2 seconds
Processed batches: 16482
Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics
Receiver | Status | Location | Records in last batch [2015/01/21 11:23:18] | Minimum rate [records/sec] | Median rate [records/sec] | Maximum rate [records/sec] | Last Error
RmqReceiver-0 | ACTIVE | F | 14 | 4 | 7 | 27 | -
RmqReceiver-1 | ACTIVE | BR | 12 | 4 | 7 | 26 | -

Batch Processing Statistics
Metric | Last batch | Minimum | 25th percentile | Median | 75th percentile | Maximum
Processing Time | 3 s 994 ms | 157 ms | 4 s 16 ms | 4 s 961 ms | 5 s 3 ms | 5 s 171 ms
Scheduling Delay | 9 h 15 min 4 s | 9 h 10 min 54 s | 9 h 11 min 56 s | 9 h 12 min 57 s | 9 h 14 min 5 s | 9 h 15 min 4 s
Total Delay | 9 h 15 min 8 s | 9 h 10 min 58 s | 9 h 12 min | 9 h 13 min 2 s | 9 h 14 min 10 s | 9 h 15 min 8 s

Are these normal? I was wondering what the scheduling delay and total delay 
terms are, and if it's normal for them to be 9 hours.

I've got a standalone spark master and 4 spark nodes. The streaming app has 
been given 4 cores, and it's using 1 core per worker node. The streaming app is 
submitted from a 5th machine, and that machine has nothing but the driver 
running. The worker nodes are running alongside Cassandra (and reading and 
writing to it).

Any insights would be appreciated.

Regards,
Ashic.

  



  

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Tathagata Das
This is not normal. It's a huge scheduling delay!! Can you tell me more
about the application?
- cluster setup, number of receivers, what's the computation, etc.

On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:

 Hate to do this...but...erm...bump? Would really appreciate input from
 others using Streaming. Or at least some docs that would tell me if these
 are expected or not.

 --
 From: as...@live.com
 To: user@spark.apache.org
 Subject: Are these numbers abnormal for spark streaming?
 Date: Wed, 21 Jan 2015 11:26:31 +


 Hi Guys,
 I've got Spark Streaming set up for a low data rate system (using spark's
 features for analysis, rather than high throughput). Messages are coming in
 throughout the day, at around 1-20 per second (finger in the air
 estimate...not analysed yet).  In the spark streaming UI for the
 application, I'm getting the following after 17 hours.

 Streaming

- *Started at: *Tue Jan 20 16:58:43 GMT 2015
- *Time since start: *18 hours 24 minutes 34 seconds
- *Network receivers: *2
- *Batch interval: *2 seconds
- *Processed batches: *16482
- *Waiting batches: *1



 Statistics over last 100 processed batches

 Receiver Statistics

 Receiver | Status | Location | Records in last batch [2015/01/21 11:23:18] | Minimum rate [records/sec] | Median rate [records/sec] | Maximum rate [records/sec] | Last Error
 RmqReceiver-0 | ACTIVE | F | 14 | 4 | 7 | 27 | -
 RmqReceiver-1 | ACTIVE | BR | 12 | 4 | 7 | 26 | -

 Batch Processing Statistics

 Metric | Last batch | Minimum | 25th percentile | Median | 75th percentile | Maximum
 Processing Time | 3 s 994 ms | 157 ms | 4 s 16 ms | 4 s 961 ms | 5 s 3 ms | 5 s 171 ms
 Scheduling Delay | 9 h 15 min 4 s | 9 h 10 min 54 s | 9 h 11 min 56 s | 9 h 12 min 57 s | 9 h 14 min 5 s | 9 h 15 min 4 s
 Total Delay | 9 h 15 min 8 s | 9 h 10 min 58 s | 9 h 12 min | 9 h 13 min 2 s | 9 h 14 min 10 s | 9 h 15 min 8 s


 Are these normal? I was wondering what the scheduling delay and total
 delay terms are, and if it's normal for them to be 9 hours.

 I've got a standalone spark master and 4 spark nodes. The streaming app
 has been given 4 cores, and it's using 1 core per worker node. The
 streaming app is submitted from a 5th machine, and that machine has nothing
 but the driver running. The worker nodes are running alongside Cassandra
 (and reading and writing to it).

 Any insights would be appreciated.

 Regards,
 Ashic.



Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Pankaj
http://spark.apache.org/docs/latest/


Follow this. It's easy to get started. Use a prebuilt version of Spark for
now :D

On Thu, Jan 22, 2015 at 5:06 PM, Sudipta Banerjee 
asudipta.baner...@gmail.com wrote:



 Hi Apache-Spark team ,

  What are the system requirements for installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.


 Thanks and regards,
 Sudipta




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: KNN for large data set

2015-01-22 Thread DEVAN M.S.
Thanks Xiangrui Meng, will try this.

Also, I found https://github.com/kaushikranjan/knnJoin.
Will this work with double data? Can we find the z-value of
*Vector(10.3,4.5,3,5)*?
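
For reference, here is a minimal sketch of the LSH + local k-nearest idea Xiangrui suggests below: hash each vector to a random-hyperplane signature, then compute exact neighbours only within each bucket. The object and method names and the use of plain Array[Double] features are assumptions for illustration.

    import scala.util.Random
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    object LshKnnSketch {
      type Point = (Long, Array[Double])   // (id, features)

      // Random-hyperplane signature: one bit per hyperplane.
      def signature(v: Array[Double], planes: Array[Array[Double]]): String =
        planes.map(p => if (p.zip(v).map { case (a, b) => a * b }.sum >= 0) '1' else '0').mkString

      def knnPerBucket(sc: SparkContext, points: RDD[Point],
                       dim: Int, bits: Int, k: Int): RDD[(Long, Seq[Long])] = {
        val rng = new Random(42)
        val planes = Array.fill(bits)(Array.fill(dim)(rng.nextGaussian()))
        val bPlanes = sc.broadcast(planes)

        points
          .map { case (id, v) => (signature(v, bPlanes.value), (id, v)) }
          .groupByKey()                      // all points sharing a signature
          .flatMap { case (_, bucket) =>
            val pts = bucket.toArray
            pts.map { case (id, v) =>
              // Exact k-nearest by Euclidean distance, but only within the bucket.
              val neighbours = pts
                .filter(_._1 != id)
                .sortBy { case (_, w) => math.sqrt(v.zip(w).map { case (a, b) => (a - b) * (a - b) }.sum) }
                .take(k)
                .map(_._1)
              (id, neighbours.toSeq)
            }
          }
      }
    }

With a single hash table, points that land in different buckets are never compared, so in practice one would repeat this with several independent sets of hyperplanes and union the candidate sets before ranking.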






On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng men...@gmail.com wrote:

 For large datasets, you need hashing in order to compute k-nearest
 neighbors locally. You can start with LSH + k-nearest in Google
 scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui

 On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote:
  Hi all,
 
  Please help me to find out best way for K-nearest neighbor using spark
 for
  large data sets.
 



RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

Streaming
Started at: Tue Jan 20 16:58:43 GMT 2015
Time since start: 18 hours 24 minutes 34 seconds
Network receivers: 2
Batch interval: 2 seconds
Processed batches: 16482
Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics
Receiver | Status | Location | Records in last batch [2015/01/21 11:23:18] | Minimum rate [records/sec] | Median rate [records/sec] | Maximum rate [records/sec] | Last Error
RmqReceiver-0 | ACTIVE | F | 14 | 4 | 7 | 27 | -
RmqReceiver-1 | ACTIVE | BR | 12 | 4 | 7 | 26 | -

Batch Processing Statistics
Metric | Last batch | Minimum | 25th percentile | Median | 75th percentile | Maximum
Processing Time | 3 s 994 ms | 157 ms | 4 s 16 ms | 4 s 961 ms | 5 s 3 ms | 5 s 171 ms
Scheduling Delay | 9 h 15 min 4 s | 9 h 10 min 54 s | 9 h 11 min 56 s | 9 h 12 min 57 s | 9 h 14 min 5 s | 9 h 15 min 4 s
Total Delay | 9 h 15 min 8 s | 9 h 10 min 58 s | 9 h 12 min | 9 h 13 min 2 s | 9 h 14 min 10 s | 9 h 15 min 8 s

Are these normal? I was wondering what the scheduling delay and total delay 
terms are, and if it's normal for them to be 9 hours.

I've got a standalone spark master and 4 spark nodes. The streaming app has 
been given 4 cores, and it's using 1 core per worker node. The streaming app is 
submitted from a 5th machine, and that machine has nothing but the driver 
running. The worker nodes are running alongside Cassandra (and reading and 
writing to it).

Any insights would be appreciated.

Regards,
Ashic.

  

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi TD,
Here's some information:

1. The cluster has one standalone master and 4 workers. Workers are co-hosted with 
Apache Cassandra. The master is set up with external ZooKeeper.
2. Each machine has 2 cores and 4 GB of RAM. This is for testing. All machines 
are VMware VMs. Spark has 2 GB dedicated to it on each node.
3. In addition to the streaming details, the master details as of now are given 
below. Only the streaming app is running.
4. I'm listening to two rabbitmq queues using a rabbitmq receiver (code: 
https://gist.github.com/ashic/b5edc7cfdc85aa60b066 ). Notifier code is here 
https://gist.github.com/ashic/9abd352c691eafc8c9f3 
5. The receivers are initialised with the following code:
val ssc = new StreamingContext(sc, Seconds(2))
val messages1 = ssc.receiverStream(new RmqReceiver("abc", "abc", "/", 
"vdclog03", "abc_input", None))
val messages2 = ssc.receiverStream(new RmqReceiver("abc", "abc", "/", 
"vdclog04", "abc_input", None))
val messages = messages1.union(messages2)
val notifier = new RabbitMQEventNotifier("vdclog03", "abc", 
"abc_output_events", "abc", "abc", "/")

6. Usage:

  messages.map(x => ScalaMessagePack.read[RadioMessage](x))
  .flatMap(InputMessageParser.parse(_).getEvents())
  .foreachRDD(x => {
  x.foreachPartition(x => {
cassandraConnector.withSessionDo(session => {
  val graphStorage = new CassandraGraphStorage(session)
  val notificationStorage = new CassandraNotificationStorage(session)
  val savingNotifier = new SavingNotifier(notifier, notificationStorage)

  x.foreach(eventWrapper => eventWrapper.event match {
// do some queries.
// save some stuff if needed to cassandra
// raise a message to a separate queue with a msg => Unit()
// operation.

7. The algorithm is simple: listen to messages from two separate RMQ queues and 
union them. For each message, check the message properties. 
If needed, query Cassandra for additional details (a graph search, but done in 
0.5-3 seconds, and rare - it shouldn't overwhelm anything at this low input rate).
If needed, save some info back into Cassandra (1-2 ms), and raise an event to 
the notifier.

I'm probably missing something basic, just wondering what. It has been running 
fine for about 42 hours now, but the numbers are a tad worrying.

Cheers,
Ashic.


Workers: 4
Cores: 8 Total, 4 Used
Memory: 8.0 GB Total, 2000.0 MB Used
Applications: 1 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

Workers
Id | Address | State | Cores | Memory
worker-20141208131918-VDCAPP50.AAA.local-44476 | VDCAPP50.AAA.local:44476 | ALIVE | 2 (1 Used) | 2.0 GB (500.0 MB Used)
worker-20141208132012-VDCAPP52.AAA.local-34349 | VDCAPP52.AAA.local:34349 | ALIVE | 2 (1 Used) | 2.0 GB (500.0 MB Used)
worker-20141208132136-VDCAPP53.AAA.local-54000 | VDCAPP53.AAA.local:54000 | ALIVE | 2 (1 Used) | 2.0 GB (500.0 MB Used)
worker-2014121627-VDCAPP49.AAA.local-57899 | VDCAPP49.AAA.local:57899 | ALIVE | 2 (1 Used) | 2.0 GB (500.0 MB Used)

Running Applications
ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration
app-20150120165844-0005 | App1 | 4 | 500.0 MB | 2015/01/20 16:58:44 | root | WAITING | 42.4 h

From: tathagata.das1...@gmail.com
Date: Thu, 22 Jan 2015 03:15:58 -0800
Subject: Re: Are these numbers abnormal for spark streaming?
To: as...@live.com; t...@databricks.com
CC: user@spark.apache.org

This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

Streaming
Started at: Tue Jan 20 16:58:43 GMT 2015
Time since start: 18 hours 24 minutes 34 seconds
Network receivers: 2
Batch interval: 2 seconds
Processed batches: 16482
Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics
Receiver | Status | Location | Records in last batch [2015/01/21 11:23:18] | Minimum rate [records/sec] | Median rate [records/sec] | Maximum rate [records/sec] | Last Error
RmqReceiver-0 | ACTIVE | F | 14 | 4 | 7 | 27 | -
RmqReceiver-1 | ACTIVE | BR | 12 | 4 | 7 | 26 | -

Batch Processing Statistics
Metric | Last batch | Minimum | 25th percentile | Median | 75th percentile | Maximum
Processing Time | 3 s 994 ms | 157 ms | 4 s 16 ms | 4 s 961 ms | 5 s 3 ms | 5 s 171 ms
Scheduling Delay | 9 h 15 min 4 s | 9 h 10 min 54 s | 9 h 11 min 56 s | 9 h 12 min 57 s | 9 h 14 min 5 s | 9 h 15 min 4

Fwd: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Hi Apache-Spark team ,

What are the system requirements for installing Hadoop and Apache Spark?
I have attached the screen shot of Gparted.


Thanks and regards,
Sudipta




-- 
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Hi,

Let me reword your request so you understand how (too) generic your question 
is:

"Hi, I have $10,000, please find me some means of transportation so I can get 
to work."

Please provide (a lot) more details. If you can't, consider using one of the 
pre-built express VMs from either Cloudera, Hortonworks or MapR, for example. 

Marco



 On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee asudipta.baner...@gmail.com 
 wrote:
 
 
 
 Hi Apache-Spark team ,
 
  What are the system requirements for installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.
 
 
 Thanks and regards,
 Sudipta 
 
 
 
 
 -- 
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture 
 Call me +919019578099
 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Trying to run SparkSQL over Spark Streaming

2015-01-22 Thread nirandap
Hi, 

I'm also trying to use the insertInto method, but I end up getting the
assertion error.

Is there any workaround to this?

rgds
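
One pattern that is often used instead of insertInto, shown here as a minimal sketch assuming Spark 1.2's SQLContext/SchemaRDD API and a hypothetical case class Record, is to register each micro-batch as a temporary table inside foreachRDD and query it with SQL:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    case class Record(id: Long, value: String)   // hypothetical schema

    def query(sc: SparkContext, records: DStream[Record]): Unit = {
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD          // Spark 1.2: RDD[Product] -> SchemaRDD

      records.foreachRDD { rdd =>
        rdd.registerTempTable("records")         // re-registered for every micro-batch
        val counts = sqlContext.sql("SELECT value, COUNT(*) FROM records GROUP BY value")
        counts.collect().foreach(println)
      }
    }

This sidesteps insertInto entirely; whether it fits depends on whether the data actually needs to accumulate in a persistent table.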



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p21316.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-22 Thread Robin East
Hi

There are many different variants of gradient descent mostly dealing with what 
the step size is and how it might be adjusted as the algorithm proceeds. Also 
if it uses a stochastic variant (as opposed to batch descent) then there are 
variations there too. I don’t know off-hand what MLlib’s detailed 
implementation is but no doubt there are differences between the two - perhaps 
someone with more knowledge of the internals could comment.

As you can tell from playing around with the parameters, step size is vitally 
important to the performance of the algorithm.
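
To illustrate the tuning point, here is a minimal sketch of the two knobs in MLlib's SGD-based linear regression, using the step size 0.5 and 1200 iterations mentioned in the quoted reply below; the toy data generation is an assumption for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    object SgdTuningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SgdTuningSketch"))

        // z = x + y, the same toy target as in the thread; the exact data
        // generation here is an assumption for illustration.
        val rng = new scala.util.Random(0)
        val data = sc.parallelize(Seq.fill(1000) {
          val x = rng.nextDouble(); val y = rng.nextDouble()
          LabeledPoint(x + y, Vectors.dense(x, y))
        }).cache()

        // The two knobs discussed above: more iterations and a larger step
        // size let SGD get much closer to the least-squares solution.
        val model = LinearRegressionWithSGD.train(data, 1200, 0.5)

        val mse = data.map { p =>
          val err = model.predict(p.features) - p.label; err * err
        }.mean()
        println(s"weights = ${model.weights}, intercept = ${model.intercept}, MSE = $mse")

        sc.stop()
      }
    }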


On 22 Jan 2015, at 06:44, Jacques Heunis jaaksem...@gmail.com wrote:

 Ah I see, thanks!
 I was just confused because given the same configuration, I would have 
 thought that Spark and Scikit would give more similar results, but I guess 
 this is simply not the case (as in your example, in order to get spark to 
 give an mse sufficiently close to scikit's you have to give it a 
 significantly larger step and iteration count).
 
 Would that then be a result of MLLib and Scikit differing slightly in their 
 exact implementation of the optimizer? Or rather a case of (as you say) 
 Scikit being a far more mature system (and therefore that MLLib would 'get 
 better' over time)? Surely it is far from ideal that to get the same results 
 you need more iterations (IE more computation), or do you think that that is 
 simply coincidence and that given a different model/dataset it may be the 
 other way around?
 
 I ask because I encountered this situation on other, larger datasets, so this 
 is not an isolated case (though being the simplest example I could think of I 
 would imagine that it's somewhat indicative of general behaviour)
 
 On Thu, Jan 22, 2015 at 1:57 AM, Robin East robin.e...@xense.co.uk wrote:
 I don’t get those results. I get:
 
 spark         0.14
 scikit-learn  0.85
 
 The scikit-learn mse is due to the very low eta0 setting. Tweak that to 0.1 
 and push iterations to 400 and you get a mse ~= 0. Of course the coefficients 
 are both ~1 and the intercept ~0. Similarly if you change the mllib step size 
 to 0.5 and number of iterations to 1200 you again get a very low mse. One of 
 the issues with SGD is you have to tweak these parameters to tune the 
 algorithm.
 
 FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn. MLLib is 
 nowhere as mature as scikit learn. However if you have large datasets that 
 won’t sensibly fit the scikit-learn in-core model MLLib is one of the top 
 choices. Similarly if you are running proof of concepts that you are 
 eventually going to scale up to production environments then there is a 
 definite argument for using MLlib at both the PoC and production stages.
 
 
 On 21 Jan 2015, at 20:39, JacquesH jaaksem...@gmail.com wrote:
 
  I've recently been trying to get to know Apache Spark as a replacement for
  Scikit Learn, however it seems to me that even in simple cases, Scikit
  converges to an accurate model far faster than Spark does.
  For example I generated 1000 data points for a very simple linear function
  (z=x+y) with the following script:
 
  http://pastebin.com/ceRkh3nb
 
  I then ran the following Scikit script:
 
  http://pastebin.com/1aECPfvq
 
  And then this Spark script: (with spark-submit filename, no other
  arguments)
 
  http://pastebin.com/s281cuTL
 
  Strangely though, the error given by spark is an order of magnitude larger
  than that given by Scikit (0.185 and 0.045 respectively) despite the two
  models having a nearly identical setup (as far as I can tell)
  I understand that this is using SGD with very few iterations and so the
  results may differ but I wouldn't have thought that it would be anywhere
  near such a large difference or such a large error, especially given the
  exceptionally simple data.
 
  Is there something I'm misunderstanding in Spark? Is it not correctly
  configured? Surely I should be getting a smaller error than that?
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Is-Apache-Spark-less-accurate-than-Scikit-Learn-tp21301.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 



spark-shell has syntax error on windows.

2015-01-22 Thread Vladimir Protsenko
I have a problem running the Spark shell on Windows 7. I took the following
steps:

1. downloaded and installed Scala 2.11.5
2. downloaded Spark 1.2.0 via git clone git://github.com/apache/spark.git
3. ran dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean
package (in git bash)

After installation I tried to run spark-shell.cmd in a cmd shell and it says
there is a syntax error in the file. What can I do to fix the problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-has-syntax-error-on-windows-tp21313.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



unable to write SequenceFile using saveAsNewAPIHadoopFile

2015-01-22 Thread Skanda
Hi All,

I'm using the saveAsNewAPIHadoopFile API to write SequenceFiles but I'm
getting the following runtime exception:

*Exception in thread main org.apache.spark.SparkException: Task not
serializable*
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
at org.apache.spark.rdd.RDD.map(RDD.scala:271)
at
org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:102)
at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:45)
at XoanonKMeansV2.main(XoanonKMeansV2.java:125)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
*Caused by: java.io.NotSerializableException:
org.apache.hadoop.io.IntWritable*
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 13 more

Pls find below the code snippet:

joiningKeyPlusPredictedPoint.mapToPair(
        new PairFunction<Tuple2<String, Integer>, Text, IntWritable>() {
            Text text = new Text();
            IntWritable intwritable = new IntWritable();

            @Override
            public Tuple2<Text, IntWritable> call(
                    Tuple2<String, Integer> tuple) throws Exception {
                text.set(tuple._1);
                intwritable.set(tuple._2);
                return new Tuple2<Text, IntWritable>(text, intwritable);
            }
        })

        *.saveAsNewAPIHadoopFile("/mllib/data/clusteroutput_seq",
        Text.class, IntWritable.class, SequenceFileOutputFormat.class);*

Regards,
Skanda


GraphX: ShortestPaths does not terminate on a grid graph

2015-01-22 Thread NicolasC


Hello,

I try to execute a simple program that runs the ShortestPaths algorithm
(org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph.
I use Spark 1.2.0 downloaded from spark.apache.org.

The program's code is the following :

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.graphx.lib.ShortestPaths.SPMap
import org.apache.spark.graphx.util.GraphGenerators

object GraphXGridSP {

  def main(args : Array[String]) : Unit = {
    val appname : String = "GraphXGridShortestPath"
    val conf = new SparkConf().setAppName(appname)

    val sc = new SparkContext(conf)

    val gridSize : Int = 70
    val nPartitions : Int = 4

    val graph = GraphGenerators.gridGraph(sc, gridSize, gridSize).
      partitionBy(PartitionStrategy.EdgePartition2D, nPartitions)

    val landmarks : Seq[VertexId] = Seq(0)
    val graph2 : Graph[SPMap, Double] = ShortestPaths.run(graph, landmarks)
    graph2.vertices.count
  }
}

This program runs for more than 2 hours when the grid size is 70x70 as above, and 
is then killed by the resource manager of the cluster (Torque). After 5-6 minutes of 
execution, the Spark master UI does not even respond.

I use a cluster of 5 nodes (4 workers, 1 executor per node).

For a grid size of 30x30, the program terminates in about 20 seconds, and for a 
grid size
of 50x50 it finishes in about 80 seconds. The problem appears for a grid size 
of 70x70 and
above.

What's wrong with this program ?

Thanks for any help.

Regards.
Nicolas.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: unable to write SequenceFile using saveAsNewAPIHadoopFile

2015-01-22 Thread Sean Owen
First, as an aside, I am pretty sure you cannot reuse one Text and
IntWritable object here. Spark does not necessarily finish with one's value
before the next call(). Although that should not be directly related to the
serialization problem, I suspect it is: your function is not serializable
since it contains references to these cached writables. I think removing
them fixes both problems.
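
A minimal Scala sketch of the same idea (the RDD type and method name are placeholders): create the Writables per record inside the map, so the closure captures no Hadoop Writable fields and no object is reused across records:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def writeSequenceFile(pairs: RDD[(String, Int)]): Unit = {
      pairs
        // New Text/IntWritable per record: nothing non-serializable is
        // captured in the closure, and no mutable object is shared.
        .map { case (k, v) => (new Text(k), new IntWritable(v)) }
        .saveAsNewAPIHadoopFile(
          "/mllib/data/clusteroutput_seq",            // path from the original post
          classOf[Text],
          classOf[IntWritable],
          classOf[SequenceFileOutputFormat[Text, IntWritable]])
    }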
On Jan 22, 2015 9:42 AM, Skanda skanda.ganapa...@gmail.com wrote:

 Hi All,

 I'm using the saveAsNewAPIHadoopFile API to write SequenceFiles but I'm
 getting the following runtime exception:

 *Exception in thread main org.apache.spark.SparkException: Task not
 serializable*
 at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
 at org.apache.spark.rdd.RDD.map(RDD.scala:271)
 at
 org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:102)
 at
 org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:45)
 at XoanonKMeansV2.main(XoanonKMeansV2.java:125)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 *Caused by: java.io.NotSerializableException:
 org.apache.hadoop.io.IntWritable*
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 ... 13 more

 Pls find below the code snippet:

 joiningKeyPlusPredictedPoint.mapToPair(
 new PairFunctionTuple2String, Integer, Text,
 IntWritable() {
 Text text = new Text();
 IntWritable intwritable = new IntWritable();

 @Override
 public Tuple2Text, IntWritable call(
 Tuple2String, Integer tuple) throws
 Exception {
 text.set(tuple._1);
 intwritable.set(tuple._2);
 return new Tuple2Text, IntWritable(text,
 intwritable);
 }
 })

 *.saveAsNewAPIHadoopFile(/mllib/data/clusteroutput_seq,
 Text.class, IntWritable.class, SequenceFileOutputFormat.class);*

 Regards,
 Skanda



Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Petar Zecevic


Ok, thanks for the clarifications. I didn't know this list has to remain 
as the only official list.


Nabble is really not the best solution in the world, but we're stuck 
with it, I guess.


That's it from me on this subject.

Petar


On 22.1.2015. 3:55, Nicholas Chammas wrote:


I think a few things need to be laid out clearly:

 1. This mailing list is the “official” user discussion platform. That
is, it is sponsored and managed by the ASF.
 2. Users are free to organize independent discussion platforms
focusing on Spark, and there is already one such platform in Stack
Overflow under the |apache-spark| and related tags. Stack Overflow
works quite well.
 3. The ASF will not agree to deprecating or migrating this user list
to a platform that they do not control.
 4. This mailing list has grown to an unwieldy size and discussions
are hard to find or follow; discussion tooling is also lacking. We
want to improve the utility and user experience of this mailing list.
 5. We don’t want to fragment this “official” discussion community.
 6. Nabble is an independent product not affiliated with the ASF. It
offers a slightly better interface to the Apache mailing list
archives.

So to respond to some of your points, pzecevic:

Apache user group could be frozen (not accepting new questions, if
that’s possible) and redirect users to Stack Overflow (automatic
reply?).

From what I understand of the ASF’s policies, this is not possible. :( 
This mailing list must remain the official Spark user discussion platform.


Other thing, about new Stack Exchange site I proposed earlier. If
a new site is created, there is no problem with guidelines, I
think, because Spark community can apply different guidelines for
the new site.

I think Stack Overflow and the various Spark tags are working fine. I 
don’t see a compelling need for a Stack Exchange dedicated to Spark, 
either now or in the near future. Also, I doubt a Spark-specific site 
can pass the 4 tests in the Area 51 FAQ 
http://area51.stackexchange.com/faq:


  * Almost all Spark questions are on-topic for Stack Overflow
  * Stack Overflow already exists, it already has a tag for Spark, and
nobody is complaining
  * You’re not creating such a big group that you don’t have enough
experts to answer all possible questions
  * There’s a high probability that users of Stack Overflow would
enjoy seeing the occasional question about Spark

I think complaining won’t be sufficient. :)

Someone expressed a concern that they won’t allow creating a
project-specific site, but there already exist some
project-specific sites, like Tor, Drupal, Ubuntu…

The communities for these projects are many, many times larger than 
the Spark community is or likely ever will be, simply due to the 
nature of the problems they are solving.


What we need is an improvement to this mailing list. We need better 
tooling than Nabble to sit on top of the Apache archives, and we also 
need some way to control the volume and quality of mail on the list so 
that it remains a useful resource for the majority of users.


Nick

​

On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com 
mailto:petar.zece...@gmail.com wrote:


Hi,
I tried to find the last reply by Nick Chammas (that I received in the
digest) using the Nabble web interface, but I cannot find it
(perhaps he
didn't reply directly to the user list?). That's one example of
Nabble's
usability.

Anyhow, I wanted to add my two cents...

Apache user group could be frozen (not accepting new questions, if
that's
possible) and redirect users to Stack Overflow (automatic reply?). Old
questions remain (and are searchable) on Nabble, new questions go
to Stack
Exchange, so no need for migration. That's the idea, at least, as
I'm not
sure if that's technically doable... Is it?
dev mailing list could perhaps stay on Nabble (it's not that
busy), or have
a special tag on Stack Exchange.

Other thing, about new Stack Exchange site I proposed earlier. If
a new site
is created, there is no problem with guidelines, I think, because
Spark
community can apply different guidelines for the new site.

There is a FAQ about creating new sites:
http://area51.stackexchange.com/faq
It says: Stack Exchange sites are free to create and free to use.
All we
ask is that you have an enthusiastic, committed group of expert
users who
check in regularly, asking and answering questions.
I think this requirement is satisfied...
Someone expressed a concern that they won't allow creating a
project-specific site, but there already exist some
project-specific sites,
like Tor, Drupal, Ubuntu...

Later, though, the FAQ also says:
If Y already exists, it already has a tag for X, and nobody is
complaining
(then you should not create a new 

Re: unable to write SequenceFile using saveAsNewAPIHadoopFile

2015-01-22 Thread Skanda
Yeah, it worked like a charm!! Thank you!

On Thu, Jan 22, 2015 at 2:28 PM, Sean Owen so...@cloudera.com wrote:

 First as an aside I am pretty sure you cannot reuse one Text and
 IntWritable object here. Spark does not necessarily finish with one's value
 before the next call(). Although it should not be directly related to the
 serialization problem I suspect it is. Your function is not serializable
 since it contains references to these cached writables. I think removing
 them fixes both problems.
 On Jan 22, 2015 9:42 AM, Skanda skanda.ganapa...@gmail.com wrote:

 Hi All,

 I'm using the saveAsNewAPIHadoopFile API to write SequenceFiles but I'm
 getting the following runtime exception:

 *Exception in thread main org.apache.spark.SparkException: Task not
 serializable*
 at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
 at
 org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
 at org.apache.spark.rdd.RDD.map(RDD.scala:271)
 at
 org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:102)
 at
 org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:45)
 at XoanonKMeansV2.main(XoanonKMeansV2.java:125)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 *Caused by: java.io.NotSerializableException:
 org.apache.hadoop.io.IntWritable*
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 ... 13 more

 Pls find below the code snippet:

 joiningKeyPlusPredictedPoint.mapToPair(
 new PairFunctionTuple2String, Integer, Text,
 IntWritable() {
 Text text = new Text();
 IntWritable intwritable = new IntWritable();

 @Override
 public Tuple2Text, IntWritable call(
 Tuple2String, Integer tuple) throws
 Exception {
 text.set(tuple._1);
 intwritable.set(tuple._2);
 return new Tuple2Text, IntWritable(text,
 intwritable);
 }
 })

 *.saveAsNewAPIHadoopFile(/mllib/data/clusteroutput_seq,
 Text.class, IntWritable.class, SequenceFileOutputFormat.class);*

 Regards,
 Skanda




Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Petar Zecevic


But voting is done on dev list, right? That could stay there...

Overlay might be a fine solution, too, but that still gives two user 
lists (SO and Nabble+overlay).



On 22.1.2015. 10:42, Sean Owen wrote:


Yes, there is some project business like votes of record on releases 
that needs to be carried on in standard, simple accessible place and 
SO is not at all suitable.


Nobody is stuck with Nabble. The suggestion is to enable a different 
overlay on the existing list. SO remains a place you can ask questions 
too. So I agree with Nick's take.


BTW are there perhaps plans to split this mailing list into 
subproject-specific lists? That might also help tune in/out the subset 
of conversations of interest.


On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com 
mailto:petar.zece...@gmail.com wrote:



Ok, thanks for the clarifications. I didn't know this list has to
remain as the only official list.

Nabble is really not the best solution in the world, but we're
stuck with it, I guess.

That's it from me on this subject.

Petar


On 22.1.2015. 3:55, Nicholas Chammas wrote:


I think a few things need to be laid out clearly:

 1. This mailing list is the “official” user discussion platform.
That is, it is sponsored and managed by the ASF.
 2. Users are free to organize independent discussion platforms
focusing on Spark, and there is already one such platform in
Stack Overflow under the |apache-spark| and related tags.
Stack Overflow works quite well.
 3. The ASF will not agree to deprecating or migrating this user
list to a platform that they do not control.
 4. This mailing list has grown to an unwieldy size and
discussions are hard to find or follow; discussion tooling is
also lacking. We want to improve the utility and user
experience of this mailing list.
 5. We don’t want to fragment this “official” discussion community.
 6. Nabble is an independent product not affiliated with the ASF.
It offers a slightly better interface to the Apache mailing
list archives.

So to respond to some of your points, pzecevic:

Apache user group could be frozen (not accepting new
questions, if that’s possible) and redirect users to Stack
Overflow (automatic reply?).

From what I understand of the ASF’s policies, this is not
possible. :( This mailing list must remain the official Spark
user discussion platform.

Other thing, about new Stack Exchange site I proposed
earlier. If a new site is created, there is no problem with
guidelines, I think, because Spark community can apply
different guidelines for the new site.

I think Stack Overflow and the various Spark tags are working
fine. I don’t see a compelling need for a Stack Exchange
dedicated to Spark, either now or in the near future. Also, I
doubt a Spark-specific site can pass the 4 tests in the Area 51
FAQ http://area51.stackexchange.com/faq:

  * Almost all Spark questions are on-topic for Stack Overflow
  * Stack Overflow already exists, it already has a tag for
Spark, and nobody is complaining
  * You’re not creating such a big group that you don’t have
enough experts to answer all possible questions
  * There’s a high probability that users of Stack Overflow would
enjoy seeing the occasional question about Spark

I think complaining won’t be sufficient. :)

Someone expressed a concern that they won’t allow creating a
project-specific site, but there already exist some
project-specific sites, like Tor, Drupal, Ubuntu…

The communities for these projects are many, many times larger
than the Spark community is or likely ever will be, simply due to
the nature of the problems they are solving.

What we need is an improvement to this mailing list. We need
better tooling than Nabble to sit on top of the Apache archives,
and we also need some way to control the volume and quality of
mail on the list so that it remains a useful resource for the
majority of users.

Nick

​

On Wed Jan 21 2015 at 3:13:21 PM pzecevic
petar.zece...@gmail.com mailto:petar.zece...@gmail.com wrote:

Hi,
I tried to find the last reply by Nick Chammas (that I
received in the
digest) using the Nabble web interface, but I cannot find it
(perhaps he
didn't reply directly to the user list?). That's one example
of Nabble's
usability.

Anyhow, I wanted to add my two cents...

Apache user group could be frozen (not accepting new
questions, if that's
possible) and redirect users to Stack Overflow (automatic
reply?). Old
questions remain (and are searchable) on Nabble, new
questions go to Stack
Exchange, so no need for 

Re: Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-01-22 Thread thanhtien522
Update: I deployed standalone Spark on localhost, set the master to
spark://localhost:7077, and ran into the same issue.
I don't know how to solve it.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-java-lang-ClassCastException-SerializedLambda-to-org-apache-spark-api-java-function-Fu1-tp21261p21315.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Sean Owen
Yes, there is some project business like votes of record on releases that
needs to be carried on in standard, simple accessible place and SO is not
at all suitable.

Nobody is stuck with Nabble. The suggestion is to enable a different
overlay on the existing list. SO remains a place you can ask questions too.
So I agree with Nick's take.

BTW are there perhaps plans to split this mailing list into
subproject-specific lists? That might also help tune in/out the subset of
conversations of interest.
On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com wrote:


 Ok, thanks for the clarifications. I didn't know this list has to remain
 as the only official list.

 Nabble is really not the best solution in the world, but we're stuck with
 it, I guess.

 That's it from me on this subject.

 Petar


 On 22.1.2015. 3:55, Nicholas Chammas wrote:

  I think a few things need to be laid out clearly:

1. This mailing list is the “official” user discussion platform. That
is, it is sponsored and managed by the ASF.
2. Users are free to organize independent discussion platforms
focusing on Spark, and there is already one such platform in Stack Overflow
under the apache-spark and related tags. Stack Overflow works quite
well.
3. The ASF will not agree to deprecating or migrating this user list
to a platform that they do not control.
4. This mailing list has grown to an unwieldy size and discussions are
hard to find or follow; discussion tooling is also lacking. We want to
improve the utility and user experience of this mailing list.
5. We don’t want to fragment this “official” discussion community.
6. Nabble is an independent product not affiliated with the ASF. It
offers a slightly better interface to the Apache mailing list archives.

 So to respond to some of your points, pzecevic:

 Apache user group could be frozen (not accepting new questions, if that’s
 possible) and redirect users to Stack Overflow (automatic reply?).

 From what I understand of the ASF’s policies, this is not possible. :(
 This mailing list must remain the official Spark user discussion platform.

 Other thing, about new Stack Exchange site I proposed earlier. If a new
 site is created, there is no problem with guidelines, I think, because
 Spark community can apply different guidelines for the new site.

 I think Stack Overflow and the various Spark tags are working fine. I
 don’t see a compelling need for a Stack Exchange dedicated to Spark, either
 now or in the near future. Also, I doubt a Spark-specific site can pass the
 4 tests in the Area 51 FAQ http://area51.stackexchange.com/faq:

- Almost all Spark questions are on-topic for Stack Overflow
- Stack Overflow already exists, it already has a tag for Spark, and
nobody is complaining
- You’re not creating such a big group that you don’t have enough
experts to answer all possible questions
- There’s a high probability that users of Stack Overflow would enjoy
seeing the occasional question about Spark

 I think complaining won’t be sufficient. :)

 Someone expressed a concern that they won’t allow creating a
 project-specific site, but there already exist some project-specific sites,
 like Tor, Drupal, Ubuntu…

 The communities for these projects are many, many times larger than the
 Spark community is or likely ever will be, simply due to the nature of the
 problems they are solving.

 What we need is an improvement to this mailing list. We need better
 tooling than Nabble to sit on top of the Apache archives, and we also need
 some way to control the volume and quality of mail on the list so that it
 remains a useful resource for the majority of users.

 Nick
 ​

 On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com wrote:

 Hi,
 I tried to find the last reply by Nick Chammas (that I received in the
 digest) using the Nabble web interface, but I cannot find it (perhaps he
 didn't reply directly to the user list?). That's one example of Nabble's
 usability.

 Anyhow, I wanted to add my two cents...

 Apache user group could be frozen (not accepting new questions, if that's
 possible) and redirect users to Stack Overflow (automatic reply?). Old
 questions remain (and are searchable) on Nabble, new questions go to Stack
 Exchange, so no need for migration. That's the idea, at least, as I'm not
 sure if that's technically doable... Is it?
 dev mailing list could perhaps stay on Nabble (it's not that busy), or
 have
 a special tag on Stack Exchange.

 Other thing, about new Stack Exchange site I proposed earlier. If a new
 site
 is created, there is no problem with guidelines, I think, because Spark
 community can apply different guidelines for the new site.

 There is a FAQ about creating new sites:
 http://area51.stackexchange.com/faq
 It says: Stack Exchange sites are free to create and free to use. All we
 ask is that you have an enthusiastic, committed group of expert users who
 check 

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Sudipta Banerjee
Hi Ashic Mahtab,

Are Cassandra and ZooKeeper installed as part of the YARN
architecture, or are they installed in a separate layer alongside Apache Spark?

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote:

 Hi Guys,
 So I changed the interval to 15 seconds. There are obviously a lot more
 messages per batch, but (I think) it looks a lot healthier. Can you see any
 major warning signs? I think that with 2-second intervals, the setup /
 teardown per partition was what was causing the delays.
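
 For reference, the change being described is just the StreamingContext batch duration; a minimal sketch:

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 15-second batches instead of 2-second ones: per-partition setup such as
    // session and prepared-statement initialisation is amortised over far
    // more records per batch.
    def createContext(sc: SparkContext): StreamingContext =
      new StreamingContext(sc, Seconds(15))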

 Streaming

- *Started at: *Thu Jan 22 13:23:12 GMT 2015
- *Time since start: *1 hour 17 minutes 16 seconds
- *Network receivers: *2
- *Batch interval: *15 seconds
- *Processed batches: *309
- *Waiting batches: *0



 Statistics over last 100 processed batches

 Receiver Statistics

 Receiver | Status | Location | Records in last batch [2015/01/22 14:40:29] | Minimum rate [records/sec] | Median rate [records/sec] | Maximum rate [records/sec] | Last Error
 RmqReceiver-0 | ACTIVE | VDCAPP53.foo.local | 2.6 K | 29 | 106 | 295 | -
 RmqReceiver-1 | ACTIVE | VDCAPP50.bar.local | 2.6 K | 29 | 107 | 291 | -

 Batch Processing Statistics

 Metric | Last batch | Minimum | 25th percentile | Median | 75th percentile | Maximum
 Processing Time | 4 s 812 ms | 4 s 698 ms | 4 s 738 ms | 4 s 761 ms | 4 s 788 ms | 5 s 802 ms
 Scheduling Delay | 2 ms | 0 ms | 3 ms | 3 ms | 4 ms | 9 ms
 Total Delay | 4 s 814 ms | 4 s 701 ms | 4 s 739 ms | 4 s 764 ms | 4 s 792 ms | 5 s 809 ms


 Regards,
 Ashic.
 --
 From: as...@live.com
 To: gerard.m...@gmail.com
 CC: user@spark.apache.org
 Subject: RE: Are these numbers abnormal for spark streaming?
 Date: Thu, 22 Jan 2015 12:32:05 +


 Hi Gerard,
 Thanks for the response.

 The messages get deserialised from msgpack format, and one of the strings
 is deserialised to JSON. Certain fields are checked to decide if further
 processing is required. If so, it goes through a series of in mem filters
 to check if more processing is required. If so, only then does the heavy
 work start. That consists of a few db queries, and potential updates to the
 db + message on message queue. The majority of messages don't need
 processing. The messages needing processing at peak are about three every
 other second.

 One possible thing that might be happening is the session initialisation
 and prepared statement initialisation for each partition. I can resort to
 some tricks, but I think I'll try increasing batch interval to 15 seconds.
 I'll report back with findings.

 Thanks,
 Ashic.

 --
 From: gerard.m...@gmail.com
 Date: Thu, 22 Jan 2015 12:30:08 +0100
 Subject: Re: Are these numbers abnormal for spark streaming?
 To: tathagata.das1...@gmail.com
 CC: as...@live.com; t...@databricks.com; user@spark.apache.org

 and post the code (if possible).
 In a nutshell, your processing time  batch interval,  resulting in an
 ever-increasing delay that will end up in a crash.
 3 secs to process 14 messages looks like a lot. Curious what the job logic
 is.

 -kr, Gerard.

 On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 This is not normal. It's a huge scheduling delay! Can you tell me more
 about the application?
 - cluster setup, number of receivers, what's the computation, etc.

 On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:

 Hate to do this...but...erm...bump? Would really appreciate input from
 others using Streaming. Or at least some docs that would tell me if these
 are expected or not.

 --
 From: as...@live.com
 To: user@spark.apache.org
 Subject: Are these numbers abnormal for spark streaming?
 Date: Wed, 21 Jan 2015 11:26:31 +


 Hi Guys,
 I've got Spark Streaming set up for a low data rate system (using spark's
 features for analysis, rather than high throughput). Messages are coming in
 throughout the day, at around 1-20 per second (finger in the air
 estimate...not analysed yet).  In the spark streaming UI for the
 application, I'm getting the following after 17 hours.

 Streaming

- *Started at: *Tue Jan 20 16:58:43 GMT 2015
- *Time since start: *18 hours 24 minutes 34 seconds
- *Network receivers: *2
- *Batch interval: *2 seconds
- *Processed batches: *16482
- *Waiting batches: *1



 Statistics over last 100 processed batches

 Receiver Statistics

   Receiver       Status  Location  Records in last batch    Min rate   Median rate   Max rate   Last Error
                                    [2015/01/21 11:23:18]    [rec/s]    [rec/s]       [rec/s]
   RmqReceiver-0  ACTIVE  F         14                       4          7             27         -
   RmqReceiver-1  ACTIVE  BR        12                       4          7             26         -

 Batch Processing Statistics

   Metric            Last batch   Minimum   25th pct   Median     75th pct   Maximum
   Processing Time   3s 994ms     157ms     4s 16ms    4s 961ms   5s 3ms     5s 171ms

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Gerard Maas
I've have been contributing to SO for a while now.  Here're few
observations I'd like to contribute to the discussion:

The level of questions on SO is often more entry-level. Harder questions
(those that require expertise in a certain area) remain unanswered for a
while. The same questions here on the list (as they are often cross-posted)
receive a faster turnaround.
Roughly speaking, there're two groups of questions: Implementing things on
Spark and Running Spark.  The second one is borderline on SO guidelines as
they often involve cluster setups, long logs and little idea of what's
going on (mind you, often those questions come from people starting with
Spark)

In my opinion, Stack Overflow offers a better Q/A experience, in
particular, they have tooling in place to reduce duplicates, something that
often overloads this list (same getting started issues or how to map,
filter, flatmap over and over again).  That said, this list offers a
richer forum, where the expertise pool is a lot deeper.
Also, while SO is fairly strict in requiring posters to show a minimal
amount of effort in the question being asked, this list is quite forgiving
of the same behaviour. That is probably one element that makes the list
'lower impedance'.
One additional thing on SO is that the [apache-spark] tag is a 'low rep'
tag. Neither questions nor answers get significant voting, reducing the
'rep gaming' factor  (discouraging participation?)

Thinking about how to improve both platforms, SO [apache-spark] and this ML,
and bring the list back to manageable message volumes, we could
implement some 'load balancing' policies:
- encourage new users to use Stack Overflow, in particular, redirect newbie
questions to SO the friendly way: did you search SO already? or link to
an existing question.
  - most how to map, flatmap, filter, aggregate, reduce, ... would fall
under  this category
- encourage domain experts to hang out on SO more often (my impression is that
MLlib and GraphX are fairly underserved)
- have an 'escalation process' in place, where we could post
'interesting/hard/bug' questions from SO back to the list (or encourage the
poster to do so)
- update our community guidelines on [
http://spark.apache.org/community.html] to implement such policies.

Those are just some ideas on how to improve the community and better serve
the newcomers while avoiding overload of our existing expertise pool.

kr, Gerard.


On Thu, Jan 22, 2015 at 10:42 AM, Sean Owen so...@cloudera.com wrote:

 Yes, there is some project business like votes of record on releases that
 needs to be carried on in standard, simple accessible place and SO is not
 at all suitable.

 Nobody is stuck with Nabble. The suggestion is to enable a different
 overlay on the existing list. SO remains a place you can ask questions too.
 So I agree with Nick's take.

 BTW are there perhaps plans to split this mailing list into
 subproject-specific lists? That might also help tune in/out the subset of
 conversations of interest.
 On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com wrote:


 Ok, thanks for the clarifications. I didn't know this list has to remain
 as the only official list.

 Nabble is really not the best solution in the world, but we're stuck with
 it, I guess.

 That's it from me on this subject.

 Petar


 On 22.1.2015. 3:55, Nicholas Chammas wrote:

  I think a few things need to be laid out clearly:

1. This mailing list is the “official” user discussion platform. That
is, it is sponsored and managed by the ASF.
2. Users are free to organize independent discussion platforms
focusing on Spark, and there is already one such platform in Stack 
 Overflow
under the apache-spark and related tags. Stack Overflow works quite
well.
3. The ASF will not agree to deprecating or migrating this user list
to a platform that they do not control.
4. This mailing list has grown to an unwieldy size and discussions
are hard to find or follow; discussion tooling is also lacking. We want to
improve the utility and user experience of this mailing list.
5. We don’t want to fragment this “official” discussion community.
6. Nabble is an independent product not affiliated with the ASF. It
offers a slightly better interface to the Apache mailing list archives.

 So to respond to some of your points, pzecevic:

 Apache user group could be frozen (not accepting new questions, if that’s
 possible) and redirect users to Stack Overflow (automatic reply?).

 From what I understand of the ASF’s policies, this is not possible. :(
 This mailing list must remain the official Spark user discussion platform.

 Other thing, about new Stack Exchange site I proposed earlier. If a new
 site is created, there is no problem with guidelines, I think, because
 Spark community can apply different guidelines for the new site.

 I think Stack Overflow and the various Spark tags are working fine. I
 don’t see a compelling need for a Stack 

Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello,

We were comparing performance of some of our production hive queries
between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
Spark 0.9 and 1.1. We could see that the performance gains have been good
in Spark.

We tried a very simple query,
select count(*) from T where col3=123
in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
performance had been 2x better than Hive (120sec vs 60sec). Table T is
stored in S3 and contains 600MB single GZIP file.

My question is, why Spark is faster than Hive here? In both of the cases,
the file will be downloaded, uncompressed and lines will be counted by a
single process. For Hive case, reducer will be identity function
since hive.map.aggr is true.

Note that disk spills and network I/O are very less for Hive's case as well,

--
Regards,
Saumitra Shahapure


Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-22 Thread Sean Owen
If you are using CDH, you would be shutting down services with
Cloudera Manager. I believe you can do it manually using Linux
'services' if you do the steps correctly across your whole cluster.
I'm not sure if the stock stop-all.sh script is supposed to work.
Certainly, if you are using CM, by far the easiest is to start/stop
all of these things in CM.

On Wed, Jan 21, 2015 at 6:08 PM, Su She suhsheka...@gmail.com wrote:
 Hello Sean  Akhil,

 I tried running the stop-all.sh script on my master and I got this message:

 localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
 chown: changing ownership of
 `/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/logs': Operation
 not permitted
 no org.apache.spark.deploy.master.Master to stop

 I am running Spark (Yarn) via Cloudera Manager. I tried stopping it from
 Cloudera Manager first, but it looked like it was only stopping the history
 server, so I started Spark again and tried ./stop-all.sh and got the above
 message.

 Also, what is the command for shutting down storage or can I simply stop
 hdfs in Cloudera Manager?

 Thank you for the help!



 On Sat, Jan 17, 2015 at 12:58 PM, Su She suhsheka...@gmail.com wrote:

 Thanks Akhil and Sean for the responses.

 I will try shutting down spark, then storage and then the instances.
 Initially, when hdfs was in safe mode, I waited for 1 hour and the problem
 still persisted. I will try this new method.

 Thanks!



 On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen so...@cloudera.com wrote:

 You would not want to turn off storage underneath Spark. Shut down
 Spark first, then storage, then shut down the instances. Reverse the
 order when restarting.

 HDFS will be in safe mode for a short time after being started before
 it becomes writeable. I would first check that it's not just that.
 Otherwise, find out why the cluster went into safe mode from the logs,
 fix it, and then leave safe mode.

 On Sat, Jan 17, 2015 at 9:03 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:
  Safest way would be to first shutdown HDFS and then shutdown Spark
  (call
  stop-all.sh would do) and then shutdown the machines.
 
  You can execute the following command to leave safe mode:
 
  hdfs dfsadmin -safemode leave
 
 
 
  Thanks
  Best Regards
 
  On Sat, Jan 17, 2015 at 8:31 AM, Su She suhsheka...@gmail.com wrote:
 
  Hello Everyone,
 
  I am encountering trouble running Spark applications when I shut down
  my
  EC2 instances. Everything else seems to work except Spark. When I try
  running a simple Spark application, like sc.parallelize() I get the
  message
  that hdfs name node is in safemode.
 
  Has anyone else had this issue? Is there a proper protocol I should be
  following to turn off my spark nodes?
 
  Thank you!
 
 
 




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [mllib] Decision Tree - prediction probabilites of label classes

2015-01-22 Thread Sean Owen
You are right that this isn't implemented. I presume you could propose
a PR for this. The impurity calculator implementations already receive
category counts. The only drawback I see is having to store N
probabilities at each leaf, not 1.
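
The arithmetic itself is only a normalization once those counts are exposed; a tiny
hypothetical Scala sketch using the counts from the example below (plain arithmetic,
not an existing MLlib API as of 1.2):

    // Hypothetical per-leaf category counts for labels 0.0, 1.0, 2.0
    val counts = Array(4.0, 2.0, 3.0)
    val probs = counts.map(_ / counts.sum)   // 4/9, 2/9, 3/9
    println(probs.mkString(", "))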

On Wed, Jan 21, 2015 at 3:36 PM, Zsolt Tóth toth.zsolt@gmail.com wrote:
 Hi,

 I use DecisionTree for multi class classification.
 I can get the probability of the predicted label for every node in the
 decision tree from node.predict().prob(). Is it possible to retrieve or
 count the probability of every possible label class in the node?
 To be more clear:
 Say in Node A there are 4 of label 0.0, 2 of label 1.0 and 3 of label 2.0.
 If I'm correct predict.prob() is 4/9 in this case. I need the values 2/9 and
 3/9 for the 2 other labels.

 It would be great to retrieve the exact count of label classes ([4,2,3] in
 the example) but I don't think thats possible now. Is something like this
 planned for a future release?

 Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Nicholas Chammas
I agree with Sean that a Spark-specific Stack Exchange likely won't help
and almost certainly won't make it out of Area 51. The idea certainly
sounds nice from our perspective as Spark users, but it doesn't mesh with
the structure of Stack Exchange or the criteria for creating new sites.

On Thu Jan 22 2015 at 1:23:14 PM Sean Owen so...@cloudera.com wrote:

 FWIW I am a moderator for datascience.stackexchange.com, and even that
 hasn't really achieved the critical mass that SE sites are supposed
 to: http://area51.stackexchange.com/proposals/55053/data-science

 I think a Spark site would have a lot less traffic. One annoyance is
 that people can't figure out when to post on SO vs Data Science vs
 Cross Validated. A Spark site would have the same problem,
 fragmentation and cross posting with SO. I don't think this would be
 accepted as a StackExchange site and don't think it helps.

 On Thu, Jan 22, 2015 at 6:16 PM, pierred pie...@demartines.com wrote:
 
  A dedicated stackexchange site for Apache Spark sounds to me like the
  logical solution.  Less trolling, more enthusiasm, and with the
  participation of the people on this list, I think it would very quickly
  become the reference for many technical questions, as well as a great
  vehicle to promote the awesomeness of Spark.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




RE: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Wang, Ningjun (LNG-NPV)
Sean

You said


- If you know that this number is too high you can request a number of
partitions when you read it.

How to do that? Can you give a code snippet? I want to read it into 8 
partitions, so I do

val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)

However rdd2 contains thousands of partitions instead of 8 partitions.


Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, January 21, 2015 2:32 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: sparkcontext.objectFile return thousands of partitions


You have 8 files, not 8 partitions. It does not follow that they should be read 
as 8 partitions since they are presumably large and so you would be stuck using 
at most 8 tasks in parallel to process. The number of partitions is determined 
by Hadoop input splits and generally makes a partition per block of data. If 
you know that this number is too high you can request a number of partitions 
when you read it. Don't coalesce, just read the desired number from the start.
On Jan 21, 2015 4:32 PM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:
Why sc.objectFile(…) return a Rdd with thousands of partitions?

I save a rdd to file system using

rdd.saveAsObjectFile("file:///tmp/mydir")

Note that the rdd contains 7 millions object. I check the directory 
/tmp/mydir/, it contains 8 partitions

part-0  part-2  part-4  part-6  _SUCCESS
part-1  part-3  part-5  part-7

I then load the rdd back using

val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)

I expect rdd2 to have 8 partitions. But from the master UI, I see that rdd2 has 
over 1000 partitions. This is very inefficient. How can I limit it to 8 
partitions just like what is stored on the file system?

Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
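
For reference, a minimal spark-shell sketch (Spark 1.x Scala API, paths as above):
the second argument to objectFile is only a minimum-partitions hint passed down to
the Hadoop input format, so it cannot force the count below the number of input
splits. Under that assumption, coalescing after the read (or writing fewer, larger
files) is one practical way to end up with exactly 8 partitions:

    import org.apache.spark.mllib.regression.LabeledPoint

    // minPartitions is a lower bound; the real count comes from the input splits.
    val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
    println(rdd2.partitions.length)          // can be in the thousands

    // Merge the many small partitions down to 8 without a full shuffle.
    val rdd8 = rdd2.coalesce(8)
    println(rdd8.partitions.length)          // 8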



Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Nicholas Chammas
we could implement some ‘load balancing’ policies:

I think Gerard’s suggestions are good. We need some “official” buy-in from
the project’s maintainers and heavy contributors and we should move forward
with them.

I know that at least Josh Rosen, Sean Owen, and Tathagata Das, who are
active on this list, are also active on SO
http://stackoverflow.com/tags/apache-spark/topusers. So perhaps we’re
already part of the way there.

Nick
​

On Thu Jan 22 2015 at 5:32:40 AM Gerard Maas gerard.m...@gmail.com wrote:

 I've have been contributing to SO for a while now.  Here're few
 observations I'd like to contribute to the discussion:

 The level of questions on SO is often more entry-level. Harder questions
 (those that require expertise in a certain area) remain unanswered for a
 while. The same questions here on the list (as they are often cross-posted)
 receive a faster turnaround.
 Roughly speaking, there're two groups of questions: Implementing things on
 Spark and Running Spark.  The second one is borderline on SO guidelines as
 they often involve cluster setups, long logs and little idea of what's
 going on (mind you, often those questions come from people starting with
 Spark)

 In my opinion, Stack Overflow offers a better Q/A experience, in
 particular, they have tooling in place to reduce duplicates, something that
 often overloads this list (same getting started issues or how to map,
 filter, flatmap over and over again).  That said, this list offers a
 richer forum, where the expertise pool is a lot deeper.
 Also, while SO is fairly strict in requiring posters to show a minimal
 amount of effort in the question being asked, this list is quite forgiving
 of the same behaviour. That is probably one element that makes the list
 'lower impedance'.
 One additional thing on SO is that the [apache-spark] tag is a 'low rep'
 tag. Neither questions nor answers get significant voting, reducing the
 'rep gaming' factor  (discouraging participation?)

 Thinking about how to improve both platforms, SO [apache-spark] and this
 ML, and bring the list back to manageable message volumes, we could
 implement some 'load balancing' policies:
 - encourage new users to use Stack Overflow, in particular, redirect
 newbie questions to SO the friendly way: did you search SO already? or
 link to an existing question.
   - most how to map, flatmap, filter, aggregate, reduce, ... would fall
 under  this category
 - encourage domain experts to hang out on SO more often (my impression is
 that MLlib and GraphX are fairly underserved)
 - have an 'escalation process' in place, where we could post
 'interesting/hard/bug' questions from SO back to the list (or encourage the
 poster to do so)
 - update our community guidelines on [
 http://spark.apache.org/community.html] to implement such policies.

 Those are just some ideas on how to improve the community and better serve
 the newcomers while avoiding overload of our existing expertise pool.

 kr, Gerard.


 On Thu, Jan 22, 2015 at 10:42 AM, Sean Owen so...@cloudera.com wrote:

 Yes, there is some project business like votes of record on releases that
 needs to be carried on in standard, simple accessible place and SO is not
 at all suitable.

 Nobody is stuck with Nabble. The suggestion is to enable a different
 overlay on the existing list. SO remains a place you can ask questions too.
 So I agree with Nick's take.

 BTW are there perhaps plans to split this mailing list into
 subproject-specific lists? That might also help tune in/out the subset of
 conversations of interest.
 On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com
 wrote:


 Ok, thanks for the clarifications. I didn't know this list has to remain
 as the only official list.

 Nabble is really not the best solution in the world, but we're stuck
 with it, I guess.

 That's it from me on this subject.

 Petar


 On 22.1.2015. 3:55, Nicholas Chammas wrote:

  I think a few things need to be laid out clearly:

1. This mailing list is the “official” user discussion platform.
That is, it is sponsored and managed by the ASF.
2. Users are free to organize independent discussion platforms
focusing on Spark, and there is already one such platform in Stack 
 Overflow
under the apache-spark and related tags. Stack Overflow works quite
well.
3. The ASF will not agree to deprecating or migrating this user list
to a platform that they do not control.
4. This mailing list has grown to an unwieldy size and discussions
are hard to find or follow; discussion tooling is also lacking. We want 
 to
improve the utility and user experience of this mailing list.
5. We don’t want to fragment this “official” discussion community.
6. Nabble is an independent product not affiliated with the ASF. It
offers a slightly better interface to the Apache mailing list archives.

 So to respond to some of your points, pzecevic:

 Apache user group could be frozen (not accepting new 

Re: spark-shell has syntax error on windows.

2015-01-22 Thread Yana Kadiyska
I am not sure if you get the same exception as I do -- spark-shell2.cmd
works fine for me. Windows 7 as well. I've never bothered looking to fix it
as it seems spark-shell just calls spark-shell2 anyway...

On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com
wrote:

 I have a problem with running spark shell in windows 7. I made the
 following
 steps:

 1. downloaded and installed Scala 2.11.5
 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git
 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests
 clean
 package (in git bash)

 After installation I tried to run spark-shell.cmd in a cmd shell and it says
 there is a syntax error in the file. What could I do to fix the problem?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-has-syntax-error-on-windows-tp21313.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Another quick question... I've got 4 nodes with 2 cores each. I've assigned the
streaming app 4 cores. It seems to be using one per node. I imagine forwarding
from the receivers to the executors is causing unnecessary processing. Is
there a way to specify that I want 2 cores from the same machines to be
involved (even better if this can be specified during spark-submit)?

Thanks,
Ashic.
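
For reference, one standalone-scheduler setting that may be relevant here, offered as
an assumption rather than a verified fix: by default the standalone master spreads an
application's cores across as many workers as possible, while setting
spark.deploy.spreadOut to false makes it consolidate them onto as few workers as it
can. It is a master-side, cluster-wide setting, so it cannot be passed per job on
spark-submit:

    # conf/spark-env.sh on the standalone master (restart the master afterwards)
    SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"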

From: as...@live.com
To: gerard.m...@gmail.com; asudipta.baner...@gmail.com
CC: user@spark.apache.org; tathagata.das1...@gmail.com
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 15:40:17 +




Yup...looks like it. I can do some tricks to reduce setup costs further, but 
this is much better than where I was yesterday. Thanks for your awesome input :)

-Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 16:34:38 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: asudipta.baner...@gmail.com
CC: as...@live.com; user@spark.apache.org; tathagata.das1...@gmail.com

Given that the process, and in particular the setup of connections, is bound
to the number of partitions (in x.foreachPartition { x => ??? }), I think it would
be worth trying to reduce them.
Increasing 'spark.streaming.blockInterval' will do the trick (you can read
the tuning details here: http://www.virdata.com/tuning-spark/#Partitions)
-kr, Gerard.
On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote:
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761
secs. I think there's evidence that setup costs are quite high in this case
and increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com 
wrote:
Hi Ashic Mahtab,

Are Cassandra and ZooKeeper installed as part of the YARN architecture, or are
they installed in a separate layer alongside Apache Spark?

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote:



Hi Guys,
So I changed the interval to 15 seconds. There's obviously a lot more messages 
per batch, but (I think) it looks a lot healthier. Can you see any major 
warning signs? I think that with 2 second intervals, the setup / teardown per 
partition was what was causing the delays.

Streaming
  Started at: Thu Jan 22 13:23:12 GMT 2015
  Time since start: 1 hour 17 minutes 16 seconds
  Network receivers: 2
  Batch interval: 15 seconds
  Processed batches: 309
  Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics

  Receiver       Status  Location            Records in last batch    Min rate   Median rate   Max rate   Last Error
                                              [2015/01/22 14:40:29]    [rec/s]    [rec/s]       [rec/s]
  RmqReceiver-0  ACTIVE  VDCAPP53.foo.local  2.6 K                    29         106           295        -
  RmqReceiver-1  ACTIVE  VDCAPP50.bar.local  2.6 K                    29         107           291        -

Batch Processing Statistics

  Metric            Last batch   Minimum    25th pct   Median     75th pct   Maximum
  Processing Time   4s 812ms     4s 698ms   4s 738ms   4s 761ms   4s 788ms   5s 802ms
  Scheduling Delay  2ms          0ms        3ms        3ms        4ms        9ms
  Total Delay       4s 814ms     4s 701ms   4s 739ms   4s 764ms   4s 792ms   5s 809ms
Regards,
Ashic.
From: as...@live.com
To: gerard.m...@gmail.com
CC: user@spark.apache.org
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 12:32:05 +




Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is
deserialised to json. Certain fields are checked to decide if further processing
is required. If so, it goes through a series of in mem filters to check if more 
processing is required. If so, only then does the heavy work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible).In a nutshell, your processing time  batch 
interval,  resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com 
wrote:
This is not normal. It's a huge scheduling delay! Can you tell me more about
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:



Hate to do 

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sean Owen
Yes, this isn't a well-formed question, and got maybe the response it
deserved, but the tone is veering off the rails. I just got a much
ruder reply from Sudipta privately, which I will not forward. Sudipta,
I suggest you take the responses you've gotten so far as about as much
answer as can be had here and do some work yourself, and come back
with much more specific questions, and it will all be helpful and
polite again.

On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
asudipta.baner...@gmail.com wrote:
 Hi Marco,

 Thanks for the confirmation. Please let me know what further detail
 you need to answer a very specific question  WHAT IS THE MINIMUM HARDWARE
 CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN  on a system?
 Please let me know if you need any further information and if you dont know
 please drive across with the $1 to Sir Paco Nathan and get me the
 answer.

 Thanks and Regards,
 Sudipta

 On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote:

 Hi,

 Let me reword your request so you understand how (too) generic your
 question is

 Hi, I have $10,000, please find me some means of transportation so I can
 get to work.

 Please provide (a lot) more details. If you can't, consider using one of
 the pre-built express VMs from either Cloudera, Hortonworks or MapR, for
 example.

 Marco



  On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
 
 
 
  Hi Apache-Spark team ,
 
  What are the system requirements installing Hadoop and Apache Spark?
  I have attached the screen shot of Gparted.
 
 
  Thanks and regards,
  Sudipta
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
  Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu
Hi
I get this exception when I run a Spark test case on my local machine:

An exception or error caused a run to abort: 
org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions;
java.lang.NoSuchMethodError: 
org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions;

In my test case I have these Spark-related imports:
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.TestSuiteBase
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions

-Adrian
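
This particular NoSuchMethodError is the classic symptom of compiling against one
Spark/Scala version and running the test against another (for example a 1.1 streaming
jar on the classpath next to 1.2 core, or a Scala 2.10/2.11 mismatch). A minimal sbt
sketch, assuming Spark 1.2.0 on Scala 2.10; the 'tests' classifier is what provides
helpers such as TestSuiteBase:

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.2.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.2.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.2.0" % "test" classifier "tests"
    )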



Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Frank Austin Nothaft
Venkat,

No problem!

 So, creating a custom InputFormat or using sc.binaryFiles alone is not the 
 right solution.   We also need the modified version of RDD.pipe to support 
 binary data?  Is my understanding correct?  

Yep! That is correct. The custom InputFormat allows Spark to load binary 
formatted data from disk/HDFS/S3/etc…, but then the default RDD.pipe 
reads/writes text to a pipe, so you’d need the custom mapPartitions call.

 If yes, this can be added as new enhancement Jira request?

The code that I have right now is fairly custom to my application, but if there 
was interest, I would be glad to port it for the Spark core.

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466
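
For reference, a simplified, hypothetical sketch of a mapPartitions-based binary pipe
(not the avocado code linked below; the paths are made up, the sox command is the one
discussed later in this thread, and sox is assumed to be installed on every worker):

    import java.io.BufferedOutputStream
    import scala.io.Source
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("BinaryPipeSketch"))
    val wavs = sc.binaryFiles("hdfs:///user/someone/wavfiles")   // (path, PortableDataStream)

    val stats = wavs.mapPartitions { iter =>
      iter.map { case (path, pds) =>
        val proc = new ProcessBuilder("/usr/local/bin/sox", "-t", "wav", "-", "-n", "stats")
          .redirectErrorStream(true)
          .start()
        val bytes = pds.toArray()                 // the raw file contents
        // Feed stdin from its own thread so a full stdout buffer cannot deadlock us.
        val writer = new Thread(new Runnable {
          def run(): Unit = {
            val out = new BufferedOutputStream(proc.getOutputStream)
            out.write(bytes)
            out.close()
          }
        })
        writer.start()
        val output = Source.fromInputStream(proc.getInputStream).mkString
        writer.join()
        proc.waitFor()
        (path, output)
      }
    }

    stats.collect().foreach { case (p, s) => println(s"$p\n$s") }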

On Jan 22, 2015, at 7:11 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote:

 Thanks Frank for your response. 
  
 So, creating a custom InputFormat or using sc.binaryFiles alone is not the 
 right solution.   We also need the modified version of RDD.pipe to support 
 binary data?  Is my understanding correct?  
  
 If yes, this can be added as new enhancement Jira request? 
  
 Nick:  What’s your take on this?
  
 Regards,
 Venkat Ankam
  
  
 From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu] 
 Sent: Wednesday, January 21, 2015 12:30 PM
 To: Venkat, Ankam
 Cc: Nick Allen; user@spark.apache.org
 Subject: Re: How to 'Pipe' Binary Data in Apache Spark
  
 Hi Venkat/Nick,
  
 The Spark RDD.pipe method pipes text data into a subprocess and then receives 
 text data back from that process. Once you have the binary data loaded into 
 an RDD properly, to pipe binary data to/from a subprocess (e.g., you want the 
 data in the pipes to contain binary, not text), you need to implement your 
 own, modified version of RDD.pipe. The implementation of RDD.pipe spawns a 
 process per partition (IIRC), as well as threads for writing to and reading 
 from the process (as well as stderr for the process). When writing via 
 RDD.pipe, Spark calls *.toString on the object, and pushes that text 
 representation down the pipe. There is an example of how to pipe binary data 
 from within a mapPartitions call using the Scala API in lines 107-177 of this 
 file. This specific code contains some nastiness around the packaging of 
 downstream libraries that we rely on in that project, so I’m not sure if it 
 is the cleanest way, but it is a workable way.
  
 Regards,
  
 Frank Austin Nothaft
 fnoth...@berkeley.edu
 fnoth...@eecs.berkeley.edu
 202-340-0466
  
 On Jan 21, 2015, at 9:17 AM, Venkat, Ankam ankam.ven...@centurylink.com 
 wrote:
 
 
 I am trying to solve similar problem.  I am using option # 2 as suggested by 
 Nick. 
  
 I have created an RDD with sc.binaryFiles for a list of .wav files.  But, I 
 am not able to pipe it to the external programs. 
  
 For example:
 sq = sc.binaryFiles(wavfiles)  <-- All .wav files stored on “wavfiles” directory on HDFS
 sq.keys().collect()  <-- works fine.  Shows the list of file names.
 sq.values().collect()  <-- works fine.  Shows the content of the files.
 sq.values().pipe(lambda x: subprocess.call(['/usr/local/bin/sox', '-t' 'wav', '-', '-n', 'stats'])).collect()  <-- Does not work.  Tried different options.
 AttributeError: 'function' object has no attribute 'read'
  
 Any suggestions?
  
 Regards,
 Venkat Ankam
  
 From: Nick Allen [mailto:n...@nickallen.org] 
 Sent: Friday, January 16, 2015 11:46 AM
 To: user@spark.apache.org
 Subject: Re: How to 'Pipe' Binary Data in Apache Spark
  
 I just wanted to reiterate the solution for the benefit of the community.
  
 The problem is not from my use of 'pipe', but that 'textFile' cannot be used 
 to read in binary data. (Doh) There are a couple options to move forward.
  
 1. Implement a custom 'InputFormat' that understands the binary input data. 
 (Per Sean Owen)
  
 2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a 
 single record. This will impact performance as it prevents the use of more 
 than one mapper on the file's data.
  
 In my specific case for #1 I can only find one project from RIPE-NCC 
 (https://github.com/RIPE-NCC/hadoop-pcap) that does this. Unfortunately, it 
 appears to only support a limited set of network protocols.
  
  
 On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen n...@nickallen.org wrote:
 Per your last comment, it appears I need something like this:
  
 https://github.com/RIPE-NCC/hadoop-pcap
  
 Thanks a ton.  That get me oriented in the right direction.  
  
 On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen so...@cloudera.com wrote:
 Well it looks like you're reading some kind of binary file as text.
 That isn't going to work, in Spark or elsewhere, as binary data is not
 even necessarily the valid encoding of a string. There are no line
 breaks to delimit lines and thus elements of the RDD.
 
 Your input has some record structure (or else it's not really useful
 to put it into an RDD). You can encode this as a SequenceFile and read
 it with objectFile.

Re: spark streaming with checkpoint

2015-01-22 Thread Balakrishnan Narendran
Thank you Jerry,
   Does the window operation create new RDDs for each slide duration? I am
asking this because I see a constant increase in memory even when no logs
are received.
If not checkpointing, is there any alternative that you would suggest?


On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai saisai.s...@intel.com wrote:

  Hi,



 Since you have such a large window (24 hours), the increase in memory is
 expected, because the window operation will cache the RDDs within this
 window in memory. So for your requirement, memory should be enough to hold
 24 hours of data.



 I don’t think checkpoint in Spark Streaming can alleviate such problem,
 because checkpoint are mainly for fault tolerance.



 Thanks

 Jerry



 *From:* balu.naren [mailto:balu.na...@gmail.com]
 *Sent:* Tuesday, January 20, 2015 7:17 PM
 *To:* user@spark.apache.org
 *Subject:* spark streaming with checkpoint



 I am a beginner to spark streaming. So have a basic doubt regarding
 checkpoints. My use case is to calculate the no of unique users by day. I
 am using reduce by key and window for this. Where my window duration is 24
 hours and slide duration is 5 mins. I am updating the processed record to
 mongodb. Currently I am replace the existing record each time. But I see
 the memory is slowly increasing over time and kills the process after 1 and
 1/2 hours(in aws small instance). The DB write after the restart clears all
 the old data. So I understand checkpoint is the solution for this. But my
 doubt is

- What should my check point duration be..? As per documentation it
says 5-10 times of slide duration. But I need the data of entire day. So it
is ok to keep 24 hrs.
- Where ideally should the checkpoint be..? Initially when I receive
the stream or just before the window operation or after the data reduction
has taken place.


 Appreciate your help.
 Thank you
  --

 View this message in context: spark streaming with checkpoint
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
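
For reference, a minimal sketch of the windowed count with an inverse reduce function
(Spark 1.x Scala streaming API; the socket source, paths and numbers below are
assumptions). The inverse form lets Spark subtract the slide that just expired instead
of recomputing the whole 24-hour window, at the cost of mandatory checkpointing; the
filter function drops keys whose count has fallen to zero so the distinct count stays
honest:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val conf = new SparkConf().setAppName("WindowedUniqueUsersSketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/windowed-users")   // required for the inverse-reduce form

    val users = ssc.socketTextStream("localhost", 9999)    // one user id per line (assumed input)

    val perUserCounts = users.map(u => (u, 1L)).reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,            // fold in the newest slide
      (a: Long, b: Long) => a - b,            // subtract the slide that just expired
      Minutes(24 * 60),                       // window length: 24 hours
      Minutes(5),                             // slide interval: 5 minutes
      4,                                      // shuffle partitions (assumed)
      (kv: (String, Long)) => kv._2 > 0)      // drop users whose count fell to zero

    // Distinct users currently in the window = number of keys.
    perUserCounts.count().print()

    ssc.start()
    ssc.awaitTermination()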



RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Sudipta,
Standalone spark master. Separate Zookeeper cluster. 4 worker nodes with 
cassandra + spark on each. No hadoop / hdfs / yarn.

Regards,
Ashic.

Date: Thu, 22 Jan 2015 20:42:43 +0530
Subject: Re: Are these numbers abnormal for spark streaming?
From: asudipta.baner...@gmail.com
To: as...@live.com
CC: gerard.m...@gmail.com; user@spark.apache.org; tathagata.das1...@gmail.com

Hi Ashic Mahtab,

Are Cassandra and ZooKeeper installed as part of the YARN architecture, or are
they installed in a separate layer alongside Apache Spark?

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote:



Hi Guys,
So I changed the interval to 15 seconds. There's obviously a lot more messages 
per batch, but (I think) it looks a lot healthier. Can you see any major 
warning signs? I think that with 2 second intervals, the setup / teardown per 
partition was what was causing the delays.

Streaming
  Started at: Thu Jan 22 13:23:12 GMT 2015
  Time since start: 1 hour 17 minutes 16 seconds
  Network receivers: 2
  Batch interval: 15 seconds
  Processed batches: 309
  Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics

  Receiver       Status  Location            Records in last batch    Min rate   Median rate   Max rate   Last Error
                                              [2015/01/22 14:40:29]    [rec/s]    [rec/s]       [rec/s]
  RmqReceiver-0  ACTIVE  VDCAPP53.foo.local  2.6 K                    29         106           295        -
  RmqReceiver-1  ACTIVE  VDCAPP50.bar.local  2.6 K                    29         107           291        -

Batch Processing Statistics

  Metric            Last batch   Minimum    25th pct   Median     75th pct   Maximum
  Processing Time   4s 812ms     4s 698ms   4s 738ms   4s 761ms   4s 788ms   5s 802ms
  Scheduling Delay  2ms          0ms        3ms        3ms        4ms        9ms
  Total Delay       4s 814ms     4s 701ms   4s 739ms   4s 764ms   4s 792ms   5s 809ms
Regards,
Ashic.
From: as...@live.com
To: gerard.m...@gmail.com
CC: user@spark.apache.org
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 12:32:05 +




Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is
deserialised to json. Certain fields are checked to decide if further processing
is required. If so, it goes through a series of in mem filters to check if more 
processing is required. If so, only then does the heavy work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible).In a nutshell, your processing time  batch 
interval,  resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com 
wrote:
This is not normal. It's a huge scheduling delay! Can you tell me more about
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

Streaming
  Started at: Tue Jan 20 16:58:43 GMT 2015
  Time since start: 18 hours 24 minutes 34 seconds
  Network receivers: 2
  Batch interval: 2 seconds
  Processed batches: 16482
  Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics

  Receiver       Status  Location  Records in last batch    Min rate   Median rate   Max rate   Last Error
                                   [2015/01/21 11:23:18]    [rec/s]    [rec/s]       [rec/s]
  RmqReceiver-0  ACTIVE  F         14                       4          7             27         -
  RmqReceiver-1  ACTIVE  BR        12                       4          7             26         -

Batch Processing Statistics

  Metric            Last batch   Minimum      25th pct     Median       75th pct     Maximum
  Processing Time   3s 994ms     157ms        4s 16ms      4s 961ms     5s 3ms       5s 171ms
  Scheduling Delay  9h 15m 4s    9h 10m 54s   9h 11m 56s   9h 12m 57s   9h 14m 5s    9h 15m 4s

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
Given that the process, and in particular the setup of connections, is
bound to the number of partitions (in x.foreachPartition { x => ??? }), I
think it would be worth trying to reduce them.
Increasing 'spark.streaming.blockInterval' will do the trick (you can
read the tuning details here:
http://www.virdata.com/tuning-spark/#Partitions)

-kr, Gerard.
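
For completeness, a minimal way to set it (hypothetical values; in Spark 1.x the value
is interpreted as milliseconds, with 200 ms as the default, so a 15 s batch at the
default produces roughly 75 blocks, i.e. partitions, per receiver per batch, while
1000 ms produces about 15):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("StreamingApp")
      .set("spark.streaming.blockInterval", "1000")   // milliseconds
    val ssc = new StreamingContext(conf, Seconds(15))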

On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote:

 So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in
 4.761 secs.
 I think there's evidence that setup costs are quite high in this case and
 increasing the batch interval is helping.

 On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee 
 asudipta.baner...@gmail.com wrote:

 Hi Ashic Mahtab,

 Are Cassandra and ZooKeeper installed as part of the YARN architecture, or are
 they installed in a separate layer alongside Apache Spark?

 Thanks and Regards,
 Sudipta

 On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote:

 Hi Guys,
 So I changed the interval to 15 seconds. There's obviously a lot more
 messages per batch, but (I think) it looks a lot healthier. Can you see any
 major warning signs? I think that with 2 second intervals, the setup /
 teardown per partition was what was causing the delays.

 Streaming

- *Started at: *Thu Jan 22 13:23:12 GMT 2015
- *Time since start: *1 hour 17 minutes 16 seconds
- *Network receivers: *2
- *Batch interval: *15 seconds
- *Processed batches: *309
- *Waiting batches: *0



 Statistics over last 100 processed batches

 Receiver Statistics

   Receiver       Status  Location            Records in last batch    Min rate   Median rate   Max rate   Last Error
                                               [2015/01/22 14:40:29]    [rec/s]    [rec/s]       [rec/s]
   RmqReceiver-0  ACTIVE  VDCAPP53.foo.local  2.6 K                    29         106           295        -
   RmqReceiver-1  ACTIVE  VDCAPP50.bar.local  2.6 K                    29         107           291        -

 Batch Processing Statistics

   Metric            Last batch   Minimum    25th pct   Median     75th pct   Maximum
   Processing Time   4s 812ms     4s 698ms   4s 738ms   4s 761ms   4s 788ms   5s 802ms
   Scheduling Delay  2ms          0ms        3ms        3ms        4ms        9ms
   Total Delay       4s 814ms     4s 701ms   4s 739ms   4s 764ms   4s 792ms   5s 809ms


 Regards,
 Ashic.
 --
 From: as...@live.com
 To: gerard.m...@gmail.com
 CC: user@spark.apache.org
 Subject: RE: Are these numbers abnormal for spark streaming?
 Date: Thu, 22 Jan 2015 12:32:05 +


 Hi Gerard,
 Thanks for the response.

 The messages get desrialised from msgpack format, and one of the strings
 is desrialised to json. Certain fields are checked to decide if further
 processing is required. If so, it goes through a series of in mem filters
 to check if more processing is required. If so, only then does the heavy
 work start. That consists of a few db queries, and potential updates to the
 db + message on message queue. The majority of messages don't need
 processing. The messages needing processing at peak are about three every
 other second.

 One possible things that might be happening is the session
 initialisation and prepared statement initialisation for each partition. I
 can resort to some tricks, but I think I'll try increasing batch interval
 to 15 seconds. I'll report back with findings.

 Thanks,
 Ashic.

 --
 From: gerard.m...@gmail.com
 Date: Thu, 22 Jan 2015 12:30:08 +0100
 Subject: Re: Are these numbers abnormal for spark streaming?
 To: tathagata.das1...@gmail.com
 CC: as...@live.com; t...@databricks.com; user@spark.apache.org

 and post the code (if possible).
 In a nutshell, your processing time  batch interval,  resulting in an
 ever-increasing delay that will end up in a crash.
 3 secs to process 14 messages looks like a lot. Curious what the job
 logic is.

 -kr, Gerard.

 On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 This is not normal. Its a huge scheduling delay!! Can you tell me more
 about the application?
 - cluser setup, number of receivers, whats the computation, etc.

 On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:

 Hate to do this...but...erm...bump? Would really appreciate input from
 others using Streaming. Or at least some docs that would tell me if these
 are expected or not.

 --
 From: as...@live.com
 To: user@spark.apache.org
 Subject: Are these numbers abnormal for spark streaming?
 Date: Wed, 21 Jan 2015 11:26:31 +


 Hi Guys,
 I've got Spark Streaming set up for a low data rate system (using
 spark's features for analysis, rather than high throughput). Messages are
 coming in throughout the day, at around 1-20 per second (finger in the air
 estimate...not analysed yet).  In the spark streaming UI for the
 application, I'm getting the following after 17 hours.

 Streaming

- *Started at: *Tue Jan 20 16:58:43 GMT 2015
- 

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Venkat, Ankam
How much time would it take to port it?

Spark committers:  Please let us know your thoughts.

Regards,
Venkat

From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu]
Sent: Thursday, January 22, 2015 9:08 AM
To: Venkat, Ankam
Cc: Nick Allen; user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

Venkat,

No problem!

So, creating a custom InputFormat or using sc.binaryFiles alone is not the 
right solution.   We also need the modified version of RDD.pipe to support 
binary data?  Is my understanding correct?

Yep! That is correct. The custom InputFormat allows Spark to load binary 
formatted data from disk/HDFS/S3/etc..., but then the default RDD.pipe 
reads/writes text to a pipe, so you'd need the custom mapPartitions call.


If yes, this can be added as new enhancement Jira request?

The code that I have right now is fairly custom to my application, but if there 
was interest, I would be glad to port it for the Spark core.

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Jan 22, 2015, at 7:11 AM, Venkat, Ankam 
ankam.ven...@centurylink.com wrote:


Thanks Frank for your response.

So, creating a custom InputFormat or using sc.binaryFiles alone is not the 
right solution.   We also need the modified version of RDD.pipe to support 
binary data?  Is my understanding correct?

If yes, this can be added as new enhancement Jira request?

Nick:  What's your take on this?

Regards,
Venkat Ankam


From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu]
Sent: Wednesday, January 21, 2015 12:30 PM
To: Venkat, Ankam
Cc: Nick Allen; user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

Hi Venkat/Nick,

The Spark RDD.pipe method pipes text data into a subprocess and then receives 
text data back from that process. Once you have the binary data loaded into an 
RDD properly, to pipe binary data to/from a subprocess (e.g., you want the data 
in the pipes to contain binary, not text), you need to implement your own, 
modified version of RDD.pipe. The implementation of RDD.pipe
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala)
spawns a process per partition (IIRC), as well as threads for
writing to and reading from the process (as well as stderr for the process).
When writing via RDD.pipe, Spark calls *.toString on the object, and pushes
that text representation down the pipe. There is an example of how to pipe
binary data from within a mapPartitions call using the Scala API in lines
107-177 of this file
(https://github.com/bigdatagenomics/avocado/blob/master/avocado-core/src/main/scala/org/bdgenomics/avocado/genotyping/ExternalGenotyper.scala).
libraries that we rely on in that project, so I'm not sure if it is the 
cleanest way, but it is a workable way.

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Jan 21, 2015, at 9:17 AM, Venkat, Ankam 
ankam.ven...@centurylink.com wrote:



I am trying to solve similar problem.  I am using option # 2 as suggested by 
Nick.

I have created an RDD with sc.binaryFiles for a list of .wav files.  But, I am 
not able to pipe it to the external programs.

For example:
 sq = sc.binaryFiles(wavfiles)  -- All .wav files stored on wavfiles 
 directory on HDFS
 sq.keys().collect() -- works fine.  Shows the list of file names.
 sq.values().collect() -- works fine.  Shows the content of the files.
 sq.values().pipe(lambda x: subprocess.call(['/usr/local/bin/sox', '-t' 
 'wav', '-', '-n', 'stats'])).collect() -- Does not work.  Tried different 
 options.
AttributeError: 'function' object has no attribute 'read'

Any suggestions?

Regards,
Venkat Ankam

From: Nick Allen [mailto:n...@nickallen.org]
Sent: Friday, January 16, 2015 11:46 AM
To: user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

I just wanted to reiterate the solution for the benefit of the community.

The problem is not from my use of 'pipe', but that 'textFile' cannot be used to 
read in binary data. (Doh) There are a couple options to move forward.

1. Implement a custom 'InputFormat' that understands the binary input data. 
(Per Sean Owen)

2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single 
record. This will impact performance as it prevents the use of more than one 
mapper on the file's data.

In my specific case for #1 I can only find one project from RIPE-NCC 
(https://github.com/RIPE-NCC/hadoop-pcap) that does this. Unfortunately, it 
appears to only support a limited set of network protocols.


On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen 

Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Kane Kim
I'm trying to process a large dataset. Mapping/filtering works OK, but
as soon as I try to reduceByKey, I get out-of-memory errors:

http://pastebin.com/70M5d0Bn

Any ideas how I can fix that?

Thanks.
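
A common first step for heap errors in reduceByKey is to raise the number of reduce
partitions so that no single task has to hold too much map output at once; a
hypothetical sketch (the path, key extraction and partition count below are
placeholders, not taken from the pastebin):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(new SparkConf().setAppName("ReduceByKeySketch"))

    val pairs = sc.textFile("hdfs:///big/dataset")
      .map(line => (line.split('\t')(0), 1L))

    // The second argument is the number of reduce partitions; more partitions
    // means smaller per-task state during the shuffle.
    val counts = pairs.reduceByKey(_ + _, 400)

    counts.saveAsTextFile("hdfs:///big/dataset-counts")

    // Also worth trying on submission: more executor memory, e.g.
    //   spark-submit --executor-memory 4g ...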

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySpark Client

2015-01-22 Thread Chris Beavers
Hey Andrew,

Thanks for the response. Is this the issue you're referring to (the
duplicate linked there has an associated patch):
https://issues.apache.org/jira/browse/SPARK-5162 ?

Just to confirm that I understand this: with this patch, Python jobs can be
submitted to YARN, and a node from the cluster will act as the driver,
meaning that the Python version of the submission client vs. cluster
shouldn't be an issue?

Thanks,
Chris

On Tue, Jan 20, 2015 at 10:34 AM, Andrew Or and...@databricks.com wrote:

 Hi Chris,

 Short answer is no, not yet.

 Longer answer is that PySpark only supports client mode, which means your
 driver runs on the same machine as your submission client. By corollary
 this means your submission client must currently depend on all of Spark and
 its dependencies. There is a patch that supports this for *cluster* mode
 (as opposed to client mode), which would be the first step towards what you
 want.

 -Andrew

 2015-01-20 8:36 GMT-08:00 Chris Beavers cbeav...@trifacta.com:

 Hey all,

 Is there any notion of a lightweight python client for submitting jobs to
 a Spark cluster remotely? If I essentially install Spark on the client
 machine, and that machine has the same OS, same version of Python, etc.,
 then I'm able to communicate with the cluster just fine. But if Python
 versions differ slightly, then I start to see a lot of opaque errors that
 often bubble up as EOFExceptions. Furthermore, this just seems like a very
 heavy weight way to set up a client.

 Does anyone have any suggestions for setting up a thin pyspark client on
 a node which doesn't necessarily conform to the homogeneity of the target
 Spark cluster?

 Best,
 Chris





RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Guys,
So I changed the interval to 15 seconds. There's obviously a lot more messages 
per batch, but (I think) it looks a lot healthier. Can you see any major 
warning signs? I think that with 2 second intervals, the setup / teardown per 
partition was what was causing the delays.

Streaming
  Started at: Thu Jan 22 13:23:12 GMT 2015
  Time since start: 1 hour 17 minutes 16 seconds
  Network receivers: 2
  Batch interval: 15 seconds
  Processed batches: 309
  Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics

  Receiver       Status  Location            Records in last batch    Min rate   Median rate   Max rate   Last Error
                                              [2015/01/22 14:40:29]    [rec/s]    [rec/s]       [rec/s]
  RmqReceiver-0  ACTIVE  VDCAPP53.foo.local  2.6 K                    29         106           295        -
  RmqReceiver-1  ACTIVE  VDCAPP50.bar.local  2.6 K                    29         107           291        -

Batch Processing Statistics

  Metric            Last batch   Minimum    25th pct   Median     75th pct   Maximum
  Processing Time   4s 812ms     4s 698ms   4s 738ms   4s 761ms   4s 788ms   5s 802ms
  Scheduling Delay  2ms          0ms        3ms        3ms        4ms        9ms
  Total Delay       4s 814ms     4s 701ms   4s 739ms   4s 764ms   4s 792ms   5s 809ms
Regards,
Ashic.
From: as...@live.com
To: gerard.m...@gmail.com
CC: user@spark.apache.org
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 12:32:05 +




Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is
deserialised to json. Certain fields are checked to decide if further processing
is required. If so, it goes through a series of in mem filters to check if more 
processing is required. If so, only then does the heavy work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible things that might be happening is the session initialisation and 
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible). In a nutshell, your processing time > batch 
interval, resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com 
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

Streaming
 - Started at: Tue Jan 20 16:58:43 GMT 2015
 - Time since start: 18 hours 24 minutes 34 seconds
 - Network receivers: 2
 - Batch interval: 2 seconds
 - Processed batches: 16482
 - Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics (records in last batch [2015/01/21 11:23:18]; rates in records/sec)
  RmqReceiver-0  ACTIVE  F   14  min 4  median 7  max 27  last error: -
  RmqReceiver-1  ACTIVE  BR  12  min 4  median 7  max 26  last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
  Processing Time:  3 seconds 994 ms / 157 ms / 4 seconds 16 ms / 4 seconds 961 ms / 5 seconds 3 ms / 5 seconds 171 ms
  Scheduling Delay: 9 hours 15 minutes 4 seconds / 9 hours 10 minutes 54 seconds / 9 hours 11 minutes 56 seconds / 9 hours 12 minutes 57 seconds / 9 hours 14 minutes 5 seconds / 9 hours 15 minutes 4 seconds
  Total Delay:      9 hours 15 minutes 8 seconds / 9 hours 10 minutes 58 seconds / 9 hours 12 minutes / 9 hours 13 minutes 2 seconds / 9 hours 14 minutes 10 seconds / 9 hours 15 minutes 8 seconds

Are these normal? I was wondering what the scheduling delay and total delay 
terms are, and if it's normal for them to be 9 hours.

I've got a standalone spark master and 4 spark nodes. The streaming app has 
been given 4 cores, and it's using 1 core per worker node. The streaming app is 
submitted from a 5th machine, and that machine has nothing 

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Hi Marco,

Thanks for the confirmation. Please let me know what is the "lot more
detail" you need to answer a very specific question: WHAT IS THE MINIMUM
HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN on a
system? Please let me know if you need any further information, and if you
don't know, please drive across with the $1 to Sir Paco Nathan and get me
the answer.

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote:

 Hi,

 Let me reword your request so you understand how (too) generic your
 question is

 Hi, I have $10,000, please find me some means of transportation so I can
 get to work.

 Please provide (a lot) more details. If you can't, consider using one of
 the pre-built express VMs from either Cloudera, Hortonworks or MapR, for
 example.

 Marco



  On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee 
 asudipta.baner...@gmail.com wrote:
 
 
 
  Hi Apache-Spark team ,
 
  What are the system requirements installing Hadoop and Apache Spark?
  I have attached the screen shot of Gparted.
 
 
  Thanks and regards,
  Sudipta
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
  Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org




-- 
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099


Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761
seconds.
I think there's evidence that setup costs are quite high in this case and
increasing the batch interval is helping.

On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee 
asudipta.baner...@gmail.com wrote:

 Hi Ashic Mahtab,

 The Cassandra and the Zookeeper are they installed as a part of Yarn
 architecture or are they installed in a separate layer with Apache Spark .

 Thanks and Regards,
 Sudipta

 On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote:

 Hi Guys,
 So I changed the interval to 15 seconds. There's obviously a lot more
 messages per batch, but (I think) it looks a lot healthier. Can you see any
 major warning signs? I think that with 2 second intervals, the setup /
 teardown per partition was what was causing the delays.

 Streaming
   - Started at: Thu Jan 22 13:23:12 GMT 2015
   - Time since start: 1 hour 17 minutes 16 seconds
   - Network receivers: 2
   - Batch interval: 15 seconds
   - Processed batches: 309
   - Waiting batches: 0

 Statistics over last 100 processed batches

 Receiver Statistics (records in last batch [2015/01/22 14:40:29]; rates in records/sec)
   RmqReceiver-0  ACTIVE  VDCAPP53.foo.local  2.6 K  min 29  median 106  max 295  last error: -
   RmqReceiver-1  ACTIVE  VDCAPP50.bar.local  2.6 K  min 29  median 107  max 291  last error: -

 Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
   Processing Time:  4 seconds 812 ms / 4 seconds 698 ms / 4 seconds 738 ms / 4 seconds 761 ms / 4 seconds 788 ms / 5 seconds 802 ms
   Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
   Total Delay:      4 seconds 814 ms / 4 seconds 701 ms / 4 seconds 739 ms / 4 seconds 764 ms / 4 seconds 792 ms / 5 seconds 809 ms


 Regards,
 Ashic.
 --
 From: as...@live.com
 To: gerard.m...@gmail.com
 CC: user@spark.apache.org
 Subject: RE: Are these numbers abnormal for spark streaming?
 Date: Thu, 22 Jan 2015 12:32:05 +


 Hi Gerard,
 Thanks for the response.

 The messages get deserialised from msgpack format, and one of the strings
 is deserialised to json. Certain fields are checked to decide if further
 processing is required. If so, it goes through a series of in mem filters
 to check if more processing is required. If so, only then does the heavy
 work start. That consists of a few db queries, and potential updates to the
 db + message on message queue. The majority of messages don't need
 processing. The messages needing processing at peak are about three every
 other second.

 One possible thing that might be happening is the session initialisation
 and prepared statement initialisation for each partition. I can resort to
 some tricks, but I think I'll try increasing batch interval to 15 seconds.
 I'll report back with findings.

 Thanks,
 Ashic.

 --
 From: gerard.m...@gmail.com
 Date: Thu, 22 Jan 2015 12:30:08 +0100
 Subject: Re: Are these numbers abnormal for spark streaming?
 To: tathagata.das1...@gmail.com
 CC: as...@live.com; t...@databricks.com; user@spark.apache.org

 and post the code (if possible).
 In a nutshell, your processing time > batch interval, resulting in an
 ever-increasing delay that will end up in a crash.
 3 secs to process 14 messages looks like a lot. Curious what the job
 logic is.

 -kr, Gerard.

 On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 This is not normal. It's a huge scheduling delay!! Can you tell me more
 about the application?
 - cluster setup, number of receivers, what's the computation, etc.

 On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:

 Hate to do this...but...erm...bump? Would really appreciate input from
 others using Streaming. Or at least some docs that would tell me if these
 are expected or not.

 --
 From: as...@live.com
 To: user@spark.apache.org
 Subject: Are these numbers abnormal for spark streaming?
 Date: Wed, 21 Jan 2015 11:26:31 +


 Hi Guys,
 I've got Spark Streaming set up for a low data rate system (using spark's
 features for analysis, rather than high throughput). Messages are coming in
 throughout the day, at around 1-20 per second (finger in the air
 estimate...not analysed yet).  In the spark streaming UI for the
 application, I'm getting the following after 17 hours.

 Streaming
   - Started at: Tue Jan 20 16:58:43 GMT 2015
   - Time since start: 18 hours 24 minutes 34 seconds
   - Network receivers: 2
   - Batch interval: 2 seconds
   - Processed batches: 16482
   - Waiting batches: 1

 Statistics over last 100 processed batches

 Receiver Statistics (records in last batch [2015/01/21 11:23:18]; rates in records/sec)

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Yup...looks like it. I can do some tricks to reduce setup costs further, but 
this is much better than where I was yesterday. Thanks for your awesome input :)

-Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 16:34:38 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: asudipta.baner...@gmail.com
CC: as...@live.com; user@spark.apache.org; tathagata.das1...@gmail.com

Given that the process, and in particular the setup of connections, is bound 
to the number of partitions (in x.foreachPartition{ x => ???}), I think it would 
be worth trying to reduce them.
Increasing 'spark.streaming.blockInterval' will do the trick (you can read 
the tuning details here:  http://www.virdata.com/tuning-spark/#Partitions)
-kr, Gerard.
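
For illustration, a minimal Scala sketch of the pattern being discussed: pay the
session / prepared-statement setup once per partition inside foreachPartition
instead of once per record. The Session trait, createSession and the byte-array
element type are placeholders, not the actual code from this job.

import org.apache.spark.streaming.dstream.DStream

// Placeholder client API standing in for the real session / prepared statements.
trait Session { def execute(record: Array[Byte]): Unit; def close(): Unit }
def createSession(): Session = ???  // expensive: connection handshake, statement preparation

// One session per partition instead of one per record; a longer batch interval
// also amortises this cost over more records per partition.
def writeStream(stream: DStream[Array[Byte]]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val session = createSession()
      try records.foreach(r => session.execute(r))
      finally session.close()
    }
  }

Raising spark.streaming.blockInterval (milliseconds in these releases, 200 by
default) is what reduces the number of blocks, and hence partitions, created per
batch.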
On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote:
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 
seconds. I think there's evidence that setup costs are quite high in this case 
and increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com 
wrote:
Hi Ashic Mahtab,

The Cassandra and the Zookeeper are they installed as a part of Yarn 
architecture or are they installed in a separate layer with Apache Spark .

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote:



Hi Guys,
So I changed the interval to 15 seconds. There's obviously a lot more messages 
per batch, but (I think) it looks a lot healthier. Can you see any major 
warning signs? I think that with 2 second intervals, the setup / teardown per 
partition was what was causing the delays.

Streaming
 - Started at: Thu Jan 22 13:23:12 GMT 2015
 - Time since start: 1 hour 17 minutes 16 seconds
 - Network receivers: 2
 - Batch interval: 15 seconds
 - Processed batches: 309
 - Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (records in last batch [2015/01/22 14:40:29]; rates in records/sec)
  RmqReceiver-0  ACTIVE  VDCAPP53.foo.local  2.6 K  min 29  median 106  max 295  last error: -
  RmqReceiver-1  ACTIVE  VDCAPP50.bar.local  2.6 K  min 29  median 107  max 291  last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
  Processing Time:  4 seconds 812 ms / 4 seconds 698 ms / 4 seconds 738 ms / 4 seconds 761 ms / 4 seconds 788 ms / 5 seconds 802 ms
  Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
  Total Delay:      4 seconds 814 ms / 4 seconds 701 ms / 4 seconds 739 ms / 4 seconds 764 ms / 4 seconds 792 ms / 5 seconds 809 ms
Regards,
Ashic.
From: as...@live.com
To: gerard.m...@gmail.com
CC: user@spark.apache.org
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 12:32:05 +




Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is 
deserialised to json. Certain fields are checked to decide if further processing 
is required. If so, it goes through a series of in mem filters to check if more 
processing is required. If so, only then does the heavy work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and 
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible). In a nutshell, your processing time > batch 
interval, resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das tathagata.das1...@gmail.com 
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following 

RE: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Babu, Prashanth
Sudipta,

Use the Docker image [1] and play around with Hadoop and Spark in the VM for a 
while.
Decide on your use case(s) and then you can move ahead for installing on a 
cluster, etc.
This Docker image has all you want [HDFS + MapReduce + Spark + YARN].

All the best!

[1]: https://github.com/sequenceiq/docker-spark

From: Sudipta Banerjee [mailto:asudipta.baner...@gmail.com]
Sent: 22 January 2015 14:51
To: Marco Shaw
Cc: user@spark.apache.org
Subject: Re: Spark Team - Paco Nathan said that your team can help

Hi Marco,
Thanks for the confirmation. Please let me know what are the lot more detail 
you need to answer a very specific question  WHAT IS THE MINIMUM HARDWARE 
CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN  on a system? Please 
let me know if you need any further information and if you dont know please 
drive across with the $1 to Sir Paco Nathan and get me the answer.
Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw 
marco.s...@gmail.com wrote:
Hi,

Let me reword your request so you understand how (too) generic your question 
is

Hi, I have $10,000, please find me some means of transportation so I can get 
to work.

Please provide (a lot) more details. If you can't, consider using one of the 
pre-built express VMs from either Cloudera, Hortonworks or MapR, for example.

Marco



 On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee 
 asudipta.baner...@gmail.com wrote:



 Hi Apache-Spark team ,

 What are the system requirements installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.


 Thanks and regards,
 Sudipta




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099
 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



--
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099

__
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.


RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Venkat, Ankam
Thanks Frank for your response.

So, creating a custom InputFormat or using sc.binaryFiles alone is not the 
right solution. We would also need a modified version of RDD.pipe to support 
binary data. Is my understanding correct?

If yes, can this be added as a new enhancement Jira request?

Nick:  What's your take on this?

Regards,
Venkat Ankam
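
As a rough illustration of the mapPartitions route Frank describes below, here
is a hedged Scala sketch that spawns the external command once per partition and
writes raw bytes to its stdin from a separate thread. The sox command line is
only an example, and stderr/exit-code handling is omitted.

import java.io.BufferedOutputStream
import scala.io.Source
import org.apache.spark.rdd.RDD

// Sketch: pipe binary payloads to an external program, one process per partition.
def pipeBinary(rdd: RDD[Array[Byte]], command: Seq[String]): RDD[String] =
  rdd.mapPartitions { records =>
    val proc = new ProcessBuilder(command: _*).start()

    // Feed the binary records into the subprocess's stdin on a separate thread,
    // so we can read its stdout below without deadlocking.
    new Thread("binary-pipe-stdin") {
      override def run(): Unit = {
        val out = new BufferedOutputStream(proc.getOutputStream)
        try records.foreach(r => out.write(r)) finally out.close()
      }
    }.start()

    // Whatever text the subprocess prints becomes the output records.
    Source.fromInputStream(proc.getInputStream).getLines()
  }

// Illustrative call only (assumes the .wav files fit in memory as byte arrays):
// pipeBinary(sc.binaryFiles("wavfiles").map(_._2.toArray),
//            Seq("/usr/local/bin/sox", "-t", "wav", "-", "-n", "stats"))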


From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu]
Sent: Wednesday, January 21, 2015 12:30 PM
To: Venkat, Ankam
Cc: Nick Allen; user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

Hi Venkat/Nick,

The Spark RDD.pipe method pipes text data into a subprocess and then receives 
text data back from that process. Once you have the binary data loaded into an 
RDD properly, to pipe binary data to/from a subprocess (e.g., you want the data 
in the pipes to contain binary, not text), you need to implement your own, 
modified version of RDD.pipe. The implementation of RDD.pipe
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala)
spawns a process per partition (IIRC), as well as threads for
writing to and reading from the process (as well as stderr for the process).
When writing via RDD.pipe, Spark calls *.toString on the object, and pushes
that text representation down the pipe. There is an example of how to pipe
binary data from within a mapPartitions call using the Scala API in lines
107-177 of this file
(https://github.com/bigdatagenomics/avocado/blob/master/avocado-core/src/main/scala/org/bdgenomics/avocado/genotyping/ExternalGenotyper.scala).
 This specific code contains some nastiness around the packaging of downstream 
libraries that we rely on in that project, so I'm not sure if it is the 
cleanest way, but it is a workable way.

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Jan 21, 2015, at 9:17 AM, Venkat, Ankam 
ankam.ven...@centurylink.com wrote:


I am trying to solve similar problem.  I am using option # 2 as suggested by 
Nick.

I have created an RDD with sc.binaryFiles for a list of .wav files.  But, I am 
not able to pipe it to the external programs.

For example:
 sq = sc.binaryFiles(wavfiles)  -- All .wav files stored on wavfiles 
 directory on HDFS
 sq.keys().collect() -- works fine.  Shows the list of file names.
 sq.values().collect() -- works fine.  Shows the content of the files.
 sq.values().pipe(lambda x: subprocess.call(['/usr/local/bin/sox', '-t' 
 'wav', '-', '-n', 'stats'])).collect() -- Does not work.  Tried different 
 options.
AttributeError: 'function' object has no attribute 'read'

Any suggestions?

Regards,
Venkat Ankam

From: Nick Allen [mailto:n...@nickallen.org]
Sent: Friday, January 16, 2015 11:46 AM
To: user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark

I just wanted to reiterate the solution for the benefit of the community.

The problem is not from my use of 'pipe', but that 'textFile' cannot be used to 
read in binary data. (Doh) There are a couple options to move forward.

1. Implement a custom 'InputFormat' that understands the binary input data. 
(Per Sean Owen)

2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single 
record. This will impact performance as it prevents the use of more than one 
mapper on the file's data.

In my specific case for #1 I can only find one project from RIPE-NCC 
(https://github.com/RIPE-NCC/hadoop-pcap) that does this. Unfortunately, it 
appears to only support a limited set of network protocols.


On Fri, Jan 16, 2015 at 10:40 AM, Nick Allen 
n...@nickallen.org wrote:
Per your last comment, it appears I need something like this:

https://github.com/RIPE-NCC/hadoop-pcap

Thanks a ton.  That get me oriented in the right direction.

On Fri, Jan 16, 2015 at 10:20 AM, Sean Owen 
so...@cloudera.com wrote:
Well it looks like you're reading some kind of binary file as text.
That isn't going to work, in Spark or elsewhere, as binary data is not
even necessarily the valid encoding of a string. There are no line
breaks to delimit lines and thus elements of the RDD.

Your input has some record structure (or else it's not really useful
to put it into an RDD). You can encode this as a SequenceFile and read
it with objectFile.

You could also write a custom InputFormat that knows how to parse pcap
records directly.

On Fri, Jan 16, 2015 at 3:09 PM, Nick Allen 
n...@nickallen.org wrote:
 I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe
 that binary data to an external program that will translate it to
 string/text data. Unfortunately, it seems that Spark is mangling the binary
 data before it gets passed to the external program.

 This code is representative of what I am trying to do. What am I doing
 wrong? 

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Jerry Lam
Hi Sudipta,

I would also like to suggest asking this question on the Cloudera mailing list,
since you have HDFS, MAPREDUCE and Yarn requirements. Spark can work with
HDFS and YARN but it is more like a client to those clusters. Cloudera can
provide services to answer your question more clearly. I'm not affiliated
with Cloudera but it seems they are the only one who is very active in the
spark project and provides a hadoop distribution.

HTH,

Jerry

btw, who is Paco Nathan?

On Thu, Jan 22, 2015 at 10:03 AM, Babu, Prashanth 
prashanth.b...@nttdata.com wrote:

  Sudipta,



 Use the Docker image [1] and play around with Hadoop and Spark in the VM
 for a while.

 Decide on your use case(s) and then you can move ahead for installing on a
 cluster, etc.

 This Docker image has all you want [HDFS + MapReduce + Spark + YARN].



 All the best!



 [1]: https://github.com/sequenceiq/docker-spark



 *From:* Sudipta Banerjee [mailto:asudipta.baner...@gmail.com]
 *Sent:* 22 January 2015 14:51
 *To:* Marco Shaw
 *Cc:* user@spark.apache.org
 *Subject:* Re: Spark Team - Paco Nathan said that your team can help



 Hi Marco,

 Thanks for the confirmation. Please let me know what are the lot more
 detail you need to answer a very specific question  WHAT IS THE MINIMUM
 HARDWARE CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN  on a
 system? Please let me know if you need any further information and if you
 dont know please drive across with the $1 to Sir Paco Nathan and get me
 the answer.

 Thanks and Regards,

 Sudipta



 On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote:

 Hi,

 Let me reword your request so you understand how (too) generic your
 question is

 Hi, I have $10,000, please find me some means of transportation so I can
 get to work.

 Please provide (a lot) more details. If you can't, consider using one of
 the pre-built express VMs from either Cloudera, Hortonworks or MapR, for
 example.

 Marco




  On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee 
 asudipta.baner...@gmail.com wrote:
 
 
 
  Hi Apache-Spark team ,
 
  What are the system requirements installing Hadoop and Apache Spark?
  I have attached the screen shot of Gparted.
 
 
  Thanks and regards,
  Sudipta
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099

  Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 

  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org




 --

 Sudipta Banerjee

 Consultant, Business Analytics and Cloud Based Architecture

 Call me +919019578099

 __
 Disclaimer: This email and any attachments are sent in strictest confidence
 for the sole use of the addressee and may contain legally privileged,
 confidential, and proprietary data. If you are not the intended recipient,
 please advise the sender by replying promptly to this email and then delete
 and destroy this email and any attachments without any further use, copying
 or forwarding.



RE: spark 1.1.0 save data to hdfs failed

2015-01-22 Thread ey-chih chow
I looked into the namenode log and found this message:
2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header or 
version mismatch from 10.33.140.233:53776 got version 9 expected version 4
What should I do to fix this?
Thanks.
Ey-Chih
From: eyc...@hotmail.com
To: yuzhih...@gmail.com
CC: user@spark.apache.org
Subject: RE: spark 1.1.0 save data to hdfs failed
Date: Wed, 21 Jan 2015 23:12:56 -0800




The hdfs release should be hadoop 1.0.4.
Ey-Chih Chow 

Date: Wed, 21 Jan 2015 16:56:25 -0800
Subject: Re: spark 1.1.0 save data to hdfs failed
From: yuzhih...@gmail.com
To: eyc...@hotmail.com
CC: user@spark.apache.org

What hdfs release are you using ?
Can you check namenode log around time of error below to see if there is some 
clue ?
Cheers
On Wed, Jan 21, 2015 at 4:51 PM, ey-chih chow eyc...@hotmail.com wrote:
Hi,



I used the following fragment of a scala program to save data to hdfs:



contextAwareEvents
  .map(e => (new AvroKey(e), null))
  .saveAsNewAPIHadoopFile("hdfs://" + masterHostname + ":9000/ETL/output/" + dateDir,
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration)



But it failed with the following error messages.  Is there any people who

can help?  Thanks.



Ey-Chih Chow



=



Exception in thread main java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at

org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)

at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)

Caused by: java.io.IOException: Failed on local exception:

java.io.EOFException; Host Details : local host is:

ip-10-33-140-157/10.33.140.157; destination host is:

ec2-54-203-58-2.us-west-2.compute.amazonaws.com:9000;

at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)

at org.apache.hadoop.ipc.Client.call(Client.java:1415)

at org.apache.hadoop.ipc.Client.call(Client.java:1364)

at

org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)

at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)

at

org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:744)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at

org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)

at

org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)

at com.sun.proxy.$Proxy15.getFileInfo(Unknown Source)

at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1925)

at

org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1079)

at

org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1075)

at

org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)

at

org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1075)

at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)

at

org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:145)

at

org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:900)

at

org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:832)

at com.crowdstar.etl.ParseAndClean$.main(ParseAndClean.scala:101)

at com.crowdstar.etl.ParseAndClean.main(ParseAndClean.scala)

... 6 more

Caused by: java.io.EOFException

at java.io.DataInputStream.readInt(DataInputStream.java:392)

at

org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1055)

at org.apache.hadoop.ipc.Client$Connection.run(Client.java:950)



===











--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-0-save-data-to-hdfs-failed-tp21305.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Sean McNamara
Hi Kane-

http://spark.apache.org/docs/latest/tuning.html has excellent information that 
may be helpful.  In particular increasing the number of tasks may help, as well 
as confirming that you don’t have more data than you're expecting landing on a 
key.

Also, if you are using spark < 1.2.0, setting spark.shuffle.manager=sort was a 
huge help for many of our shuffle heavy workloads (this is the default in 1.2.0 
now)

Cheers,

Sean
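
To make those two suggestions concrete, here is a hedged Scala sketch; the input
path, key extraction and the partition count of 2000 are placeholders, not tuned
values.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val conf = new SparkConf()
  .setAppName("large-reduceByKey")
  .set("spark.shuffle.manager", "sort")     // only needed on pre-1.2.0 releases
val sc = new SparkContext(conf)

val counts = sc.textFile("hdfs:///data/input")
  .map(line => (line.split("\t")(0), 1L))   // placeholder key extraction
  .reduceByKey(_ + _, 2000)                 // explicit task count spreads the shuffle
counts.saveAsTextFile("hdfs:///data/output")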


On Jan 22, 2015, at 3:15 PM, Kane Kim 
kane.ist...@gmail.commailto:kane.ist...@gmail.com wrote:

I'm trying to process a large dataset, mapping/filtering works ok, but
as long as I try to reduceByKey, I get out of memory errors:

http://pastebin.com/70M5d0Bn

Any ideas how I can fix that?

Thanks.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: reducing number of output files

2015-01-22 Thread Sean Owen
One output file is produced per partition. If you want fewer, use
coalesce() before saving the RDD.
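
A minimal sketch (paths are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("coalesce-before-save"))
val rdd = sc.textFile("hdfs:///data/input")

// One file is written per partition, so collapsing to 8 partitions yields 8 files.
// coalesce(1) gives a single file, but then all data passes through one task.
rdd.coalesce(8).saveAsTextFile("hdfs:///data/output")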

On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote:
 How I can reduce number of output files? Is there a parameter to 
 saveAsTextFile?

 Thanks.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Results never return to driver | Spark Custom Reader

2015-01-22 Thread Harihar Nahak
Hi All, 

I wrote a custom reader to read a DB, and it is able to return key and value
as expected, but after it finishes it never returns to the driver.

here is output of worker log : 
15/01/23 15:51:38 INFO worker.ExecutorRunner: Launch command: java -cp
::/usr/local/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/local/hadoop/etc/hadoop
-XX:MaxPermSize=128m -Dspark.driver.port=53484 -Xms1024M -Xmx1024M
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkDriver@VM90:53484/user/CoarseGrainedScheduler 6 VM99
4 app-20150123155114-
akka.tcp://sparkWorker@VM99:44826/user/Worker
15/01/23 15:51:47 INFO worker.Worker: Executor app-20150123155114-/6
finished with state EXITED message Command exited with code 1 exitStatus 1
15/01/23 15:51:47 WARN remote.ReliableDeliverySupervisor: Association with
remote system [akka.tcp://sparkExecutor@VM99:57695] has failed, address is
now gated for [5000] ms. Reason is: [Disassociated].
15/01/23 15:51:47 INFO actor.LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40143.96.25.29%3A35065-4#-915179653]
was not delivered. [3] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
15/01/23 15:51:49 INFO worker.Worker: Asked to kill unknown executor
app-20150123155114-/6

If someone notices any clue to fix this, I will really appreciate it. 



-
--Harihar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Results-never-return-to-driver-Spark-Custom-Reader-tp21328.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



processing large dataset

2015-01-22 Thread Kane Kim
I'm trying to process 5TB of data, not doing anything fancy, just
map/filter and reduceByKey. I spent the whole day today trying to get it
processed, but never succeeded. I've tried to deploy to ec2 with the
script provided with spark, on pretty beefy machines (100 r3.2xlarge
nodes). I'm really frustrated that spark doesn't work out of the box for
anything bigger than a word count sample. One big problem is that the
defaults are not suitable for processing big datasets; the provided ec2
script could do a better job, knowing the instance type requested. Second,
it takes hours to figure out what is wrong when a spark job fails after
almost finishing its processing. Even after raising all limits as per
https://spark.apache.org/docs/latest/tuning.html it still fails (now
with: "error communicating with MapOutputTracker").

After all I have only one question - how to get spark tuned up for
processing terabytes of data and is there a way to make this
configuration easier and more transparent?

Thanks.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: KNN for large data set

2015-01-22 Thread Sudipta Banerjee
Hi Devan and Xiangrui,

Can you please explain the cost and optimization function of the KNN
algorithm that is being used?

Thanks and Regards,
Sudipta
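
Plain kNN is a lazy, non-parametric method, so there is no training objective
being optimised; the cost is just the distance metric evaluated at query time.
The "LSH + local kNN" idea that Xiangrui suggests below can be sketched roughly
as follows (random-hyperplane hashing with squared Euclidean distance; all names
and parameter values are illustrative, not taken from any particular library):

import scala.util.Random
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

// Bucket points with random hyperplanes, then brute-force kNN only among
// points that share a bucket.
def approxKnn(points: RDD[(Long, Array[Double])],   // (id, feature vector)
              dim: Int, k: Int, nPlanes: Int = 16, seed: Long = 42L)
    : RDD[(Long, Seq[Long])] = {
  val rnd = new Random(seed)
  val planes = Array.fill(nPlanes, dim)(rnd.nextGaussian())

  def bucket(v: Array[Double]): Int =
    planes.zipWithIndex.foldLeft(0) { case (sig, (p, i)) =>
      val dot = (0 until dim).map(j => v(j) * p(j)).sum
      if (dot >= 0) sig | (1 << i) else sig
    }

  def sqDist(a: Array[Double], b: Array[Double]): Double =
    (0 until dim).map(j => (a(j) - b(j)) * (a(j) - b(j))).sum

  points.map { case (id, v) => (bucket(v), (id, v)) }
    .groupByKey()   // assumes each bucket is small enough to process locally
    .flatMap { case (_, members) =>
      val arr = members.toArray
      arr.map { case (id, v) =>
        val nearest = arr.filter(_._1 != id).sortBy(m => sqDist(v, m._2)).take(k).map(_._1)
        (id, nearest.toSeq)
      }
    }
}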

On Thu, Jan 22, 2015 at 6:59 PM, DEVAN M.S. msdeva...@gmail.com wrote:

 Thanks Xiangrui Meng will try this.

 And, found this https://github.com/kaushikranjan/knnJoin also.
 Will this work with double data ? Can we find out z value of
 *Vector(10.3,4.5,3,5)* ?






 On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng men...@gmail.com wrote:

 For large datasets, you need hashing in order to compute k-nearest
 neighbors locally. You can start with LSH + k-nearest in Google
 scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui

 On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote:
  Hi all,
 
  Please help me to find out best way for K-nearest neighbor using spark
 for
  large data sets.
 





-- 
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099


Exception in parsley pyspark cassandra hadoop connector

2015-01-22 Thread Nishant Sinha
I am following the repo on github about pyspark cassandra connector at

https://github.com/Parsely/pyspark-cassandra

On executing the line :


./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py run test


It ends up with an exception:

ERROR Executor: Exception in task 9.0 in stage 2.0 (TID 14)
java.io.NotSerializableException:
scala.collection.convert.Wrappers$MapWrapper


I am unable to figure out the cause of the exception

Thanks,
Nishant


Re: spark 1.1.0 save data to hdfs failed

2015-01-22 Thread Sean Owen
It means your client app is using Hadoop 2.x and your HDFS is Hadoop 1.x.

On Thu, Jan 22, 2015 at 10:32 PM, ey-chih chow eyc...@hotmail.com wrote:
 I looked into the namenode log and found this message:

 2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header
 or version mismatch from 10.33.140.233:53776 got version 9 expected version
 4

 What should I do to fix this?

 Thanks.

 Ey-Chih

 
 From: eyc...@hotmail.com
 To: yuzhih...@gmail.com
 CC: user@spark.apache.org
 Subject: RE: spark 1.1.0 save data to hdfs failed
 Date: Wed, 21 Jan 2015 23:12:56 -0800

 The hdfs release should be hadoop 1.0.4.

 Ey-Chih Chow

 
 Date: Wed, 21 Jan 2015 16:56:25 -0800
 Subject: Re: spark 1.1.0 save data to hdfs failed
 From: yuzhih...@gmail.com
 To: eyc...@hotmail.com
 CC: user@spark.apache.org

 What hdfs release are you using ?

 Can you check namenode log around time of error below to see if there is
 some clue ?

 Cheers

 On Wed, Jan 21, 2015 at 4:51 PM, ey-chih chow eyc...@hotmail.com wrote:

 Hi,

 I used the following fragment of a scala program to save data to hdfs:

 contextAwareEvents
 .map(e => (new AvroKey(e), null))
 .saveAsNewAPIHadoopFile("hdfs://" + masterHostname + ":9000/ETL/output/"
 + dateDir,
 classOf[AvroKey[GenericRecord]],
 classOf[NullWritable],
 classOf[AvroKeyOutputFormat[GenericRecord]],
 job.getConfiguration)

 But it failed with the following error messages.  Is there any people who
 can help?  Thanks.

 Ey-Chih Chow

 =

 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
 at
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: java.io.IOException: Failed on local exception:
 java.io.EOFException; Host Details : local host is:
 ip-10-33-140-157/10.33.140.157; destination host is:
 ec2-54-203-58-2.us-west-2.compute.amazonaws.com:9000;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
 at org.apache.hadoop.ipc.Client.call(Client.java:1415)
 at org.apache.hadoop.ipc.Client.call(Client.java:1364)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
 at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:744)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
 at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy15.getFileInfo(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1925)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1079)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1075)
 at
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1075)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:145)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:900)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:832)
 at com.crowdstar.etl.ParseAndClean$.main(ParseAndClean.scala:101)
 at com.crowdstar.etl.ParseAndClean.main(ParseAndClean.scala)
 ... 6 more
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at
 org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1055)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:950)

 ===





 --
 View this message in 

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-22 Thread Ankur Dave
At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote:
 I try to execute a simple program that runs the ShortestPaths algorithm
 (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph.
 I use Spark 1.2.0 downloaded from spark.apache.org.

 This program runs more than 2 hours when the grid size is 70x70 as above, and 
 is then killed
 by the resource manager of the cluster (Torque). After a 5-6 minutes of 
 execution, the
 Spark master UI does not even respond.

 For a grid size of 30x30, the program terminates in about 20 seconds, and for 
 a grid size
 of 50x50 it finishes in about 80 seconds. The problem appears for a grid size 
 of 70x70 and
 above.

Unfortunately this problem is due to a Spark bug. In later iterations of 
iterative algorithms, the lineage maintained for fault tolerance grows long and 
causes Spark to consume increasing amounts of resources for scheduling and task 
serialization.

The workaround is to checkpoint the graph periodically, which writes it to 
stable storage and interrupts the lineage chain before it grows too long.

If you're able to recompile Spark, you can do this by applying the patch to 
GraphX at the end of this mail, and before running graph algorithms, calling

sc.setCheckpointDir("/tmp")

to set the checkpoint directory as desired.

Ankur
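
A usage sketch under those assumptions (sc stands for an existing SparkContext;
the paths and the single landmark vertex id are placeholders):

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.lib.ShortestPaths

// Set the checkpoint directory before running the algorithm so the patched
// Pregel loop below can truncate the lineage every 25 iterations.
sc.setCheckpointDir("hdfs:///tmp/graphx-checkpoints")

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/grid-70x70-edges.txt")
val result = ShortestPaths.run(graph, Seq(0L))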

=== patch begins here ===

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala 
b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
index 5e55620..1fbbb87 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
@@ -126,6 +126,8 @@ object Pregel extends Logging {
 // Loop
 var prevG: Graph[VD, ED] = null
 var i = 0
+val checkpoint = g.vertices.context.getCheckpointDir.nonEmpty
+val checkpointFrequency = 25
 while (activeMessages > 0 && i < maxIterations) {
   // Receive the messages. Vertices that didn't get any messages do not 
appear in newVerts.
   val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
@@ -139,6 +141,14 @@ object Pregel extends Logging {
   // get to send messages. We must cache messages so it can be 
materialized on the next line,
   // allowing us to uncache the previous iteration.
   messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, 
activeDirection))).cache()
+
+  if (checkpoint && i % checkpointFrequency == checkpointFrequency - 1) {
+logInfo("Checkpointing in iteration " + i)
+g.vertices.checkpoint()
+g.edges.checkpoint()
+messages.checkpoint()
+  }
+
   // The call to count() materializes `messages`, `newVerts`, and the 
vertices of `g`. This
   // hides oldMessages (depended on by newVerts), newVerts (depended on by 
messages), and the
   // vertices of prevG (depended on by newVerts, oldMessages, and the 
vertices of g).

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



save a histogram to a file

2015-01-22 Thread SK
Hi,
histogram() returns an object that is a  pair of Arrays. There appears to be
no saveAsTextFile() for this paired object.

Currently I am using the following to save the output to a file:

val hist = a.histogram(10)

val arr1 = sc.parallelize(hist._1).saveAsTextFile("file1")
val arr2 = sc.parallelize(hist._2).saveAsTextFile("file2")

Is there a simpler way to save the histogram() result to a file?

thanks
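
Since histogram() already returns two small local arrays on the driver, one
option is to skip the extra RDDs and write a single local file. A hedged sketch
(output file name is arbitrary):

import java.io.PrintWriter

val (buckets, counts) = a.histogram(10)   // buckets has one more element than counts
val out = new PrintWriter("histogram.txt")
try {
  buckets.zip(counts).foreach { case (lowerBound, count) =>
    out.println(lowerBound + "\t" + count)
  }
  out.println(buckets.last)               // upper bound of the final bucket
} finally {
  out.close()
}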



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/save-a-histogram-to-a-file-tp21324.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark streaming with checkpoint

2015-01-22 Thread Shao, Saisai
Hi,

A new RDD will be created in each slide duration; if there’s no data coming, an 
empty RDD will be generated.

I’m not sure there’s a way to alleviate your problem from the Spark side. Does your 
application design have to build such a large window? Can you change your 
implementation, if that is easy for you?

I think it’s better and easier for you to change your implementation rather than 
rely on Spark to handle this.

Thanks
Jerry
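
One possible restructuring, purely as a hedged sketch and not something proposed
in this thread, is to keep only a per-day aggregate as streaming state (for
example with updateStateByKey) instead of holding a 24-hour window of raw
records. The events DStream of (day, userId) pairs is assumed, and
updateStateByKey itself requires a checkpoint directory to be set.

import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

// State per day = set of user ids seen so far; the count of uniques is its size.
// Old days should eventually be expired by returning None from the update function.
def dailyUniques(events: DStream[(String, String)]): DStream[(String, Long)] =
  events
    .updateStateByKey[Set[String]] { (newUsers: Seq[String], state: Option[Set[String]]) =>
      Some(state.getOrElse(Set.empty[String]) ++ newUsers)
    }
    .mapValues(users => users.size.toLong)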

From: Balakrishnan Narendran [mailto:balu.na...@gmail.com]
Sent: Friday, January 23, 2015 12:19 AM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: spark streaming with checkpoint

Thank you Jerry,
   Does the window operation create new RDDs for each slide duration..? I 
am asking this because i see a constant increase in memory even when there is 
no logs received.
If not checkpoint is there any alternative that you would suggest.?


On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai 
saisai.s...@intel.com wrote:
Hi,

Seems you have such a large window (24 hours), so the phenomena of memory 
increasing is expectable, because of window operation will cache the RDD within 
this window in memory. So for your requirement, memory should be enough to hold 
the data of 24 hours.

I don’t think checkpoint in Spark Streaming can alleviate such problem, because 
checkpoint are mainly for fault tolerance.

Thanks
Jerry

From: balu.naren [mailto:balu.na...@gmail.com]
Sent: Tuesday, January 20, 2015 7:17 PM
To: user@spark.apache.org
Subject: spark streaming with checkpoint


I am a beginner to spark streaming. So have a basic doubt regarding 
checkpoints. My use case is to calculate the no of unique users by day. I am 
using reduce by key and window for this. Where my window duration is 24 hours 
and slide duration is 5 mins. I am updating the processed record to mongodb. 
Currently I am replacing the existing record each time. But I see the memory is 
slowly increasing over time and it kills the process after 1 and 1/2 hours (in an aws 
small instance). The DB write after the restart clears all the old data. So I 
understand checkpoint is the solution for this. But my doubt is

  *   What should my check point duration be..? As per documentation it says 
5-10 times of slide duration. But I need the data of entire day. So it is ok to 
keep 24 hrs.
  *   Where ideally should the checkpoint be..? Initially when I receive the 
stream or just before the window operation or after the data reduction has 
taken place.

Appreciate your help.
Thank you


View this message in context: spark streaming with checkpoint
(http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html)
Sent from the Apache Spark User List mailing list archive
(http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com.



Re: reading a csv dynamically

2015-01-22 Thread Imran Rashid
Spark can definitely process data with optional fields.  It kinda depends
on what you want to do with the results -- its more of a object design /
knowing scala types question.

Eg., scala has a built in type Option specifically for handling optional
data, which works nicely in pattern matching  functional programming.
Just to save myself some typing, I'm going to show an example with 2 or 3
fields:

val myProcessedRdd: RDD[(String, Double, Option[Double])] =
  sc.textFile("file.csv").map { txt =>
    val split = txt.split(",")
    val opt = if (split.length == 3) Some(split(2).toDouble) else None
    (split(0), split(1).toDouble, opt)
  }

then eg., say in a later processing step, you want to make the 3rd field
have a default of 6.9, you'd do something like:

myProcessedRdd.map { case (name, count, ageOpt) =>  // arbitrary variable
  // names I'm just making up
  val age = ageOpt.getOrElse(6.9)
   ...
}

You might be interested in reading up on Scala's Option type, and how you
can use it.  There are a lot of other options too, eg. the Either type if
you want to track 2 alternatives, Try for keeping track of errors, etc.
You can play around with all of them outside of spark.  Of course you could
do similar things in Java as well without these types.  You just need to write
your own container for dealing w/ missing data, which could be really
simple in your use case.

I would advise against first creating a key w/ the number of fields, and
then doing a groupByKey.  Since you are only expecting 2 different lengths,
al the data will only go to two tasks, so this will not scale very well.
And though the data is now grouped by length, its all in one RDD, so you've
still got to figure out what to do with both record lengths.

Imran
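
For the Either alternative mentioned above, a small hedged illustration (the
file name and field layout are the same made-up ones as in the snippet above):

import org.apache.spark.rdd.RDD

// Left keeps the raw malformed line around for inspection; Right is the parsed row.
// (Numeric parse failures would still throw; wrap in Try to capture those too.)
val parsed: RDD[Either[String, (String, Double, Option[Double])]] =
  sc.textFile("file.csv").map { txt =>
    val split = txt.split(",")
    if (split.length == 2 || split.length == 3) {
      val opt = if (split.length == 3) Some(split(2).toDouble) else None
      Right((split(0), split(1).toDouble, opt))
    } else {
      Left(txt)
    }
  }

val badLines = parsed.collect { case Left(line) => line }
val goodRows = parsed.collect { case Right(row) => row }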


On Wed, Jan 21, 2015 at 6:46 PM, Pankaj Narang pankajnaran...@gmail.com
wrote:

 Yes I think you need to create one map first which will keep the number of
 values in every line. Now you can group all the records with same number of
 values. Now you know how many types of arrays you will have.


 val dataRDD = sc.textFile("file.csv")
 val dataLengthRDD = dataRDD.map(line => (line.split(",").length, line))
 val groupedData = dataLengthRDD.groupByKey()

 now you can process the groupedData as it will have arrays of length x in
 one RDD.

 groupByKey([numTasks])  When called on a dataset of (K, V) pairs, returns a
 dataset of (K, Iterable<V>) pairs.


 I hope this helps

 Regards
 Pankaj
 Infoshore Software
 India




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/reading-a-csv-dynamically-tp21304p21307.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Missing output partition file in S3

2015-01-22 Thread Nicolas Mai
Hi,

My team is using Spark 1.0.1 and the project we're working on needs to
compute exact numbers, which are then saved to S3, to be reused later in
other Spark jobs to compute other numbers. The problem we noticed yesterday:
one of the output partition files in S3 was missing :/ (some part-00218)...
The problem only occurred once, and cannot be reproed. However because of
this incident, our numbers may not be reliable.

From the Spark logs (from the cluster which generated the files with the
missing partition), we noticed some errors appearing multiple times:
- Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException:
s3://xxx/_temporary/_attempt_201501142002__m_000368_12139/part-00368:
No such file or directory.
at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
at
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
at
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
at
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
at 
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
at
org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:785)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

And:
- WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(3,
ip-10-152-30-234.ec2.internal, 48973, 0) with no recent heart beats: 72614ms
exceeds 45000ms

Questions:
- Do those errors explain why the output partition file was missing?
(knowing that we still get those errors in our logs).
- Is there a way to detect data loss during runtime, and then stop our Spark
job completely ASAP if it happens?

Thanks,
Nicolas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Missing-output-partition-file-in-S3-tp21326.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to make spark partition sticky, i.e. stay with node?

2015-01-22 Thread mingyu
Also, Setting spark.locality.wait=100 did not work for me.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-spark-partition-sticky-i-e-stay-with-node-tp21322p21325.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Imran Rashid
I think you should also just be able to provide an input format that never
splits the input data.  This has come up before on the list, but I couldn't
find it.*

I think this should work, but I can't try it out at the moment.  Can you
please try and let us know if it works?

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class TextFormatNoSplits extends TextInputFormat {
  override def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

def textFileNoSplits(sc: SparkContext, path: String): RDD[String] = {
  // note: this is just a copy of sc.textFile, with a different InputFormatClass
  sc.hadoopFile(path, classOf[TextFormatNoSplits], classOf[LongWritable],
    classOf[Text]).map(pair => pair._2.toString).setName(path)
}


* yes I realize the irony given the recent discussion about mailing list
vs. stackoverflow ...
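
Usage sketch for both routes, assuming sc is an existing SparkContext and
textFileNoSplits is the helper above (paths are the ones from the question):

import org.apache.spark.mllib.regression.LabeledPoint

// Route 1 (text input): never split files, so partitions == number of input files.
val unsplit = textFileNoSplits(sc, "file:///tmp/mydir")

// Route 2: accept the large partition count on read, then collapse it.
val eightPartitions = sc.objectFile[LabeledPoint]("file:///tmp/mydir").coalesce(8)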

On Thu, Jan 22, 2015 at 11:01 AM, Sean Owen so...@cloudera.com wrote:

 Yes, that second argument is what I was referring to, but yes it's a
 *minimum*, oops, right. OK, you will want to coalesce then, indeed.

 On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV)
 ningjun.w...@lexisnexis.com wrote:
  Ø  If you know that this number is too high you can request a number of
  partitions when you read it.
 
 
 
  How to do that? Can you give a code snippet? I want to read it into 8
  partitions, so I do
 
 
 
  val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
 
  However rdd2 contains thousands of partitions instead of 8 partitions
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




RE: spark 1.1.0 save data to hdfs failed

2015-01-22 Thread ey-chih chow
Thanks.  But after I replaced the maven dependency from

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.5.0-cdh5.2.0</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
  </exclusions>
</dependency>

to

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>1.0.4</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
  </exclusions>
</dependency>

the warning message still shows up in the namenode log.  Is there anything else 
I need to do?


Thanks.


Ey-Chih Chow


 From: so...@cloudera.com
 Date: Thu, 22 Jan 2015 22:34:22 +
 Subject: Re: spark 1.1.0 save data to hdfs failed
 To: eyc...@hotmail.com
 CC: yuzhih...@gmail.com; user@spark.apache.org
 
 It means your client app is using Hadoop 2.x and your HDFS is Hadoop 1.x.
 
 On Thu, Jan 22, 2015 at 10:32 PM, ey-chih chow eyc...@hotmail.com wrote:
  I looked into the namenode log and found this message:
 
  2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header
  or version mismatch from 10.33.140.233:53776 got version 9 expected version
  4
 
  What should I do to fix this?
 
  Thanks.
 
  Ey-Chih
 
  
  From: eyc...@hotmail.com
  To: yuzhih...@gmail.com
  CC: user@spark.apache.org
  Subject: RE: spark 1.1.0 save data to hdfs failed
  Date: Wed, 21 Jan 2015 23:12:56 -0800
 
  The hdfs release should be hadoop 1.0.4.
 
  Ey-Chih Chow
 
  
  Date: Wed, 21 Jan 2015 16:56:25 -0800
  Subject: Re: spark 1.1.0 save data to hdfs failed
  From: yuzhih...@gmail.com
  To: eyc...@hotmail.com
  CC: user@spark.apache.org
 
  What hdfs release are you using ?
 
  Can you check namenode log around time of error below to see if there is
  some clue ?
 
  Cheers
 
  On Wed, Jan 21, 2015 at 4:51 PM, ey-chih chow eyc...@hotmail.com wrote:
 
  Hi,
 
  I used the following fragment of a scala program to save data to hdfs:
 
  contextAwareEvents
  .map(e = (new AvroKey(e), null))
  .saveAsNewAPIHadoopFile(hdfs:// + masterHostname + :9000/ETL/output/
  + dateDir,
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[GenericRecord]],
  job.getConfiguration)
 
  But it failed with the following error messages.  Is there any people who
  can help?  Thanks.
 
  Ey-Chih Chow
 
  =
 
  Exception in thread main java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at
  org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
  at
  org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
  Caused by: java.io.IOException: Failed on local exception:
  java.io.EOFException; Host Details : local host is:
  ip-10-33-140-157/10.33.140.157; destination host is:
  ec2-54-203-58-2.us-west-2.compute.amazonaws.com:9000;
  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
  at org.apache.hadoop.ipc.Client.call(Client.java:1415)
  at org.apache.hadoop.ipc.Client.call(Client.java:1364)

Re: reducing number of output files

2015-01-22 Thread DEVAN M.S.
rdd.coalesce(1) will coalesce the RDD and produce only one output file;
coalesce(2) will give two, and so on.
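For example, a minimal sketch of the pattern (assuming an existing RDD named rdd;
the output path is a placeholder):

rdd.coalesce(1).saveAsTextFile("hdfs:///output/single-file")

Note that coalescing to a single partition funnels all the data through one task,
so it is best reserved for small outputs.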
On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote:

 One output file is produced per partition. If you want fewer, use
 coalesce() before saving the RDD.

 On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote:
  How I can reduce number of output files? Is there a parameter to
 saveAsTextFile?
 
  Thanks.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Using third party libraries in pyspark

2015-01-22 Thread Davies Liu
You need to install these libraries on all the slaves, or submit via
spark-submit:

spark-submit --py-files  xxx
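
For example (the file names here are placeholders, not from this thread):

spark-submit --py-files deps.zip my_job.py

where deps.zip bundles the third-party modules so they get shipped to every executor.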

On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote:
 Hi,
   I might be asking something very trivial, but whats the recommend way of
 using third party libraries.
 I am using tables to read hdf5 format file..
 And here is the error trace:


 print rdd.take(2)
   File /tmp/spark/python/pyspark/rdd.py, line , in take
 res = self.context.runJob(self, takeUpToNumLeft, p, True)
   File /tmp/spark/python/pyspark/context.py, line 818, in runJob
 it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd,
 javaPartitions, allowLocal)
   File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
 line 538, in __call__
   File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line
 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.runJob.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
 (TID 3, srv-108-23.720.rdio): org.apache.spark.api.python.PythonException:
 Traceback (most recent call last):
   File
 /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py,
 line 90, in main
 command = pickleSer._read_with_length(infile)
   File
 /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
 line 151, in _read_with_length
 return self.loads(obj)
   File
 /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
 line 396, in loads
 return cPickle.loads(obj)
   File
 /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py,
 line 825, in subimport
 __import__(name)
 ImportError: ('No module named tables', function subimport at 0x47e1398,
 ('tables',))

 Though, import tables works fine on the local python shell.. but seems like
 every thing is being pickled.. Are we expected to send all the files as
 helper files? that doesn't seems right?
 Thanks


 --


 Mohit

 When you want success as badly as you want the air, then you will get it.
 There is no other secret of success.
 -Socrates

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: processing large dataset

2015-01-22 Thread Jörn Franke
Did you try it with a smaller subset of the data first?
On 23 Jan 2015 at 05:54, Kane Kim kane.ist...@gmail.com wrote:

 I'm trying to process 5TB of data, not doing anything fancy, just
 map/filter and reduceByKey. Spent whole day today trying to get it
 processed, but never succeeded. I've tried to deploy to ec2 with the
 script provided with spark on pretty beefy machines (100 r3.2xlarge
 nodes). Really frustrated that spark doesn't work out of the box for
 anything bigger than word count sample. One big problem is that
 defaults are not suitable for processing big datasets, provided ec2
 script could do a better job, knowing instance type requested. Second
 it takes hours to figure out what is wrong, when spark job fails
 almost finished processing. Even after raising all limits as per
 https://spark.apache.org/docs/latest/tuning.html it still fails (now
 with: error communicating with MapOutputTracker).

 After all I have only one question - how to get spark tuned up for
 processing terabytes of data and is there a way to make this
 configuration easier and more transparent?

 Thanks.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Sean Owen
Yes, that second argument is what I was referring to, but yes it's a
*minimum*, oops, right. OK, you will want to coalesce then, indeed.
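
A rough sketch of what that could look like (assuming LabeledPoint is imported as
in the original code; the path is the one from the question below):

val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir").coalesce(8)
println(rdd2.partitions.length)  // expect 8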

On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
 Ø  If you know that this number is too high you can request a number of
 partitions when you read it.



 How to do that? Can you give a code snippet? I want to read it into 8
 partitions, so I do



 val rdd2 = sc.objectFile[LabeledPoint]( (“file:///tmp/mydir”, 8)

 However rdd2 contains thousands of partitions instead of 8 partitions


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to make spark partition sticky, i.e. stay with node?

2015-01-22 Thread mingyu
I posted a question on Stack Overflow and haven't gotten any answer yet.
http://stackoverflow.com/questions/28079037/how-to-make-spark-partition-sticky-i-e-stay-with-node

Is there a way to make a partition stay with a node in Spark Streaming? I
need this because I have to load a large amount of partition-specific auxiliary
data for processing the stream. I noticed that the partitions move among the
nodes. I cannot afford to move the large auxiliary data around.

Thanks,

Mingyu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-spark-partition-sticky-i-e-stay-with-node-tp21322.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Using third party libraries in pyspark

2015-01-22 Thread Mohit Singh
Hi,
  I might be asking something very trivial, but what's the recommended way of
using third-party libraries?
I am using the tables module to read an HDF5 format file.
And here is the error trace:


print rdd.take(2)
  File /tmp/spark/python/pyspark/rdd.py, line , in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File /tmp/spark/python/pyspark/context.py, line 818, in runJob
it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd,
javaPartitions, allowLocal)
  File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
line 538, in __call__
  File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 3, srv-108-23.720.rdio):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File
/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py,
line 90, in main
command = pickleSer._read_with_length(infile)
  File
/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
line 151, in _read_with_length
return self.loads(obj)
  File
/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
line 396, in loads
return cPickle.loads(obj)
  File
/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py,
line 825, in subimport
__import__(name)
ImportError: ('No module named tables', function subimport at 0x47e1398,
('tables',))

Though import tables works fine in the local Python shell, it seems like
everything is being pickled. Are we expected to send all the files as
helper files? That doesn't seem right.
Thanks


-- 


Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates


Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Marcelo Vanzin
On Thu, Jan 22, 2015 at 10:21 AM, Sean Owen so...@cloudera.com wrote:
 I think a Spark site would have a lot less traffic. One annoyance is
 that people can't figure out when to post on SO vs Data Science vs
 Cross Validated.

Another is that a lot of the discussions we see on the Spark users
list would be closed very quickly at Stack Overflow. Long and abstract
discussions are generally a good recipe to get your question closed.
Which is an argument for why Discourse would be more appropriate, I
guess.

Finally, maybe I'm showing my age, but I really dislike having to
follow lots of different places. What would happen is that,
personally, I'd end up either ignoring any new discussion forum, or
just treating it like a mailing list and doing everything by e-mail.
Now get off my lawn.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId(4)

2015-01-22 Thread Zijing Guo
Hi,
I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast 
method. When I call the broadcast function on a small dataset on a 5-node 
cluster, I get the "Error sending message as driverActor is null" error 
after broadcasting the variables several times (the apps run under JBoss). 
Any help would be appreciated.
Thanks,
Edwin



Durablility of Spark Streaming Applications

2015-01-22 Thread Wang, Daniel
I deployed Spark Streaming applications to a standalone cluster. After a cluster 
restart, all the deployed applications are gone and I cannot see any 
applications through the Spark Web UI.



How can I make the Spark Streaming applications durable so that they auto-restart 
after a cluster restart?



Thanks,

Daniel



Re: spark streaming with checkpoint

2015-01-22 Thread Jörn Franke
Maybe you are using the wrong approach - try something like HyperLogLog or bitmap
structures, as you can find them, for instance, in Redis. They are much
smaller.
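As one minimal illustration of the same idea within Spark itself (my own sketch,
not the Redis route mentioned above): the RDD API exposes a HyperLogLog-based
estimator, so a per-batch estimate of unique users can be computed without keeping
all raw ids in memory. Here userIds is a hypothetical RDD[String] of user ids for
one batch:

val approxUniqueUsers = userIds.countApproxDistinct(0.05)  // roughly 5% relative error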
On 22 Jan 2015 at 17:19, Balakrishnan Narendran balu.na...@gmail.com wrote:

 Thank you Jerry,
    Does the window operation create new RDDs for each slide
 duration? I am asking this because I see a constant increase in memory
 even when no logs are received.
 If not checkpointing, is there any alternative that you would suggest?


 On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai saisai.s...@intel.com
 wrote:

  Hi,



  It seems you have a very large window (24 hours), so the increase in memory
 is expected, because the window operation will cache the RDDs within this
 window in memory. So for your requirement, memory should be large
 enough to hold the data of 24 hours.



 I don’t think checkpoint in Spark Streaming can alleviate such problem,
 because checkpoint are mainly for fault tolerance.



 Thanks

 Jerry



 *From:* balu.naren [mailto:balu.na...@gmail.com]
 *Sent:* Tuesday, January 20, 2015 7:17 PM
 *To:* user@spark.apache.org
 *Subject:* spark streaming with checkpoint



  I am a beginner to Spark Streaming, so I have a basic doubt regarding
 checkpoints. My use case is to calculate the number of unique users per day. I
 am using reduceByKey and window for this, where my window duration is 24
 hours and the slide duration is 5 mins. I am writing the processed record to
 MongoDB; currently I replace the existing record each time. But I see
 the memory slowly increasing over time, which kills the process after about
 1.5 hours (on an AWS small instance). The DB write after the restart clears all
 the old data. So I understand checkpointing is the solution for this. But my
 doubts are

- What should my checkpoint duration be? As per the documentation it
should be 5-10 times the slide duration. But I need the data of the entire day,
so is it ok to keep it at 24 hrs?
- Where ideally should the checkpoint be? Initially when I receive
the stream, just before the window operation, or after the data reduction
has taken place?


 Appreciate your help.
 Thank you
  --

 View this message in context: spark streaming with checkpoint
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.





Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Silvio Fiorito
Nick,

Have you tried https://github.com/kaitoy/pcap4j

I’ve used this in a Spark app already and didn’t have any issues. My use case 
was slightly different than yours, but you should give it a try.
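
Separately, if the mangling really comes from reading the capture with textFile, one
alternative sketch in Scala (my suggestion, not something from this thread; it assumes
Spark 1.2+ and a hypothetical decodeToCsv helper) is to read the file as raw bytes and
decode them yourself:

val raw = sc.binaryFiles("hdfs:///captures/binary-data.dat")
val csv = raw.map { case (path, stream) => decodeToCsv(stream.toArray()) }
csv.saveAsTextFile("hdfs:///captures/text-data.csv")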

From: Nick Allen n...@nickallen.orgmailto:n...@nickallen.org
Date: Friday, January 16, 2015 at 10:09 AM
To: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: How to 'Pipe' Binary Data in Apache Spark


I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe 
that binary data to an external program that will translate it to string/text 
data. Unfortunately, it seems that Spark is mangling the binary data before it 
gets passed to the external program.

This code is representative of what I am trying to do. What am I doing wrong? 
How can I pipe binary data in Spark?  Maybe it is getting corrupted when I read 
it in initially with 'textFile'?

bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")

Specifically, I am trying to use Spark to transform pcap (packet capture) data 
to text/csv so that I can perform an analysis on it.

Thanks!

--
Nick Allen n...@nickallen.orgmailto:n...@nickallen.org


Installing Spark Standalone to a Cluster

2015-01-22 Thread riginos
I have downloaded spark-1.2.0.tgz on each of my nodes and executed ./sbt/sbt
assembly on each of them.  Then I execute ./sbin/start-master.sh on my master
and ./bin/spark-class org.apache.spark.deploy.worker.Worker
spark://IP:PORT on the workers.
However, when I go to http://localhost:8080 I cannot see any worker. Why
is that? Did I do something wrong with the installation/deployment of Spark?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21319.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Sean Owen
NoSuchMethodError almost always means that you have compiled some code
against one version of a library but are running against another. I
wonder if you are including different versions of Spark in your
project, or running against a cluster on an older version?

On Thu, Jan 22, 2015 at 3:57 PM, Adrian Mocanu
amoc...@verticalscope.com wrote:
 Hi

 I get this exception when I run a Spark test case on my local machine:



 An exception or error caused a run to abort:
 org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions;

 java.lang.NoSuchMethodError:
 org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/dstream/PairDStreamFunctions;



 In my test case I have these Spark related imports imports:

 import org.apache.spark.streaming.StreamingContext._

 import org.apache.spark.streaming.TestSuiteBase

 import org.apache.spark.streaming.dstream.DStream

 import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions



 -Adrian



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Nicos
Folks,
Just a gentle reminder we owe to ourselves:
- this is a public forum and we need to behave accordingly, it is not place to 
vent frustration in rude way
- getting attention here is an earned privilege and not entitlement
- this is not a “Platinum Support” department of your vendor rather and open 
source collaboration forum where people volunteer their time to pay attention 
to your needs
- there are still many gray areas so be patient and articulate questions in as 
much details as possible if you want to get quick help and not just be 
perceived as a smart a$$

FYI - Paco Nathan is a well respected Spark evangelist and many people, 
including myself, owe to his passion for jumping on Spark platform promise. 
People like Sean Owen keep us believing in things when we feel like hitting the 
dead-end.

Please, be respectful of what connections you are prized with and act civilized.

Have a great day!
- Nicos


 On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote:
 
 Yes, this isn't a well-formed question, and got maybe the response it
 deserved, but the tone is veering off the rails. I just got a much
 ruder reply from Sudipta privately, which I will not forward. Sudipta,
 I suggest you take the responses you've gotten so far as about as much
 answer as can be had here and do some work yourself, and come back
 with much more specific questions, and it will all be helpful and
 polite again.
 
 On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
 asudipta.baner...@gmail.com wrote:
 Hi Marco,
 
 Thanks for the confirmation. Please let me know what are the lot more detail
 you need to answer a very specific question  WHAT IS THE MINIMUM HARDWARE
 CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN  on a system?
 Please let me know if you need any further information and if you dont know
 please drive across with the $1 to Sir Paco Nathan and get me the
 answer.
 
 Thanks and Regards,
 Sudipta
 
 On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote:
 
 Hi,
 
 Let me reword your request so you understand how (too) generic your
 question is
 
 Hi, I have $10,000, please find me some means of transportation so I can
 get to work.
 
 Please provide (a lot) more details. If you can't, consider using one of
 the pre-built express VMs from either Cloudera, Hortonworks or MapR, for
 example.
 
 Marco
 
 
 
 On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
 asudipta.baner...@gmail.com wrote:
 
 
 
 Hi Apache-Spark team ,
 
 What are the system requirements installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.
 
 
 Thanks and regards,
 Sudipta
 
 
 
 
 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099
 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 
 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread pierred
Love it!

There is a reason why SO is so effective and popular.  Search is excellent,
you can quickly find very thoughtful answers about sometimes thorny
problems, and it is easy to contribute, format code, etc.  Perhaps the most
useful feature is that the best answers naturally bubble up to the top, so
these are the ones you see first.

One annoyance is the troll phenomenon, see e.g.
http://michael.richter.name/blogs/why-i-no-longer-contribute-to-stackoverflow
(that also mentions other pet peeves about SO).  That phenomenon is, IMHO,
most prevalent on the stackoverflow itself, perhaps less so on other
stackexchange sites.

At the same time, I do appreciate the pressure to provide well-written,
concise, and for the posterity questions and answers.  That peer pressure
is what, to a good extent, makes the material on SO so valuable and useful. 
It is probably a tricky balance to strike.

A dedicated stackexchange site for Apache Spark sounds to me like the
logical solution.  Less trolling, more enthusiasm, and with the
participation of the people on this list, I think it would very quickly
become the reference for many technical questions, as well as a great
vehicle to promote the awesomeness of Spark.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-tp20851p21321.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Sean Owen
FWIW I am a moderator for datascience.stackexchange.com, and even that
hasn't really achieved the critical mass that SE sites are supposed
to: http://area51.stackexchange.com/proposals/55053/data-science

I think a Spark site would have a lot less traffic. One annoyance is
that people can't figure out when to post on SO vs Data Science vs
Cross Validated. A Spark site would have the same problem,
fragmentation and cross posting with SO. I don't think this would be
accepted as a StackExchange site and don't think it helps.

On Thu, Jan 22, 2015 at 6:16 PM, pierred pie...@demartines.com wrote:

 A dedicated stackexchange site for Apache Spark sounds to me like the
 logical solution.  Less trolling, more enthusiasm, and with the
 participation of the people on this list, I think it would very quickly
 become the reference for many technical questions, as well as a great
 vehicle to promote the awesomeness of Spark.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu

I use spark 1.1.0-SNAPSHOT and the test I'm running is in local mode. My test 
case uses org.apache.spark.streaming.TestSuiteBase

val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided" excludeAll(
val sparkStreaming = "org.apache.spark" % "spark-streaming_2.10" % "1.1.0-SNAPSHOT" % "provided" excludeAll(
val sparkCassandra = "com.tuplejump" % "calliope_2.10" % "0.9.0-C2-EA"
  exclude("org.apache.cassandra", "cassandra-all")
  exclude("org.apache.cassandra", "cassandra-thrift")
val casAll = "org.apache.cassandra" % "cassandra-all" % "2.0.3" intransitive()
val casThrift = "org.apache.cassandra" % "cassandra-thrift" % "2.0.3" intransitive()
val sparkStreamingFromKafka = "org.apache.spark" % "spark-streaming-kafka_2.10" % "0.9.1" excludeAll(


-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: January-22-15 11:39 AM
To: Adrian Mocanu
Cc: u...@spark.incubator.apache.org
Subject: Re: Exception: NoSuchMethodError: 
org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

NoSuchMethodError almost always means that you have compiled some code against 
one version of a library but are running against another. I wonder if you are 
including different versions of Spark in your project, or running against a 
cluster on an older version?

On Thu, Jan 22, 2015 at 3:57 PM, Adrian Mocanu amoc...@verticalscope.com 
wrote:
 Hi

 I get this exception when I run a Spark test case on my local machine:



 An exception or error caused a run to abort:
 org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lo
 rg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lsca
 la/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/
 dstream/PairDStreamFunctions;

 java.lang.NoSuchMethodError:
 org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lo
 rg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lsca
 la/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/
 dstream/PairDStreamFunctions;



 In my test case I have these Spark related imports imports:

 import org.apache.spark.streaming.StreamingContext._

 import org.apache.spark.streaming.TestSuiteBase

 import org.apache.spark.streaming.dstream.DStream

 import
 org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions



 -Adrian



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu
I use spark 1.1.0-SNAPSHOT

val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided" excludeAll(
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: January-22-15 11:39 AM
To: Adrian Mocanu
Cc: u...@spark.incubator.apache.org
Subject: Re: Exception: NoSuchMethodError: 
org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

NoSuchMethodError almost always means that you have compiled some code against 
one version of a library but are running against another. I wonder if you are 
including different versions of Spark in your project, or running against a 
cluster on an older version?

On Thu, Jan 22, 2015 at 3:57 PM, Adrian Mocanu amoc...@verticalscope.com 
wrote:
 Hi

 I get this exception when I run a Spark test case on my local machine:



 An exception or error caused a run to abort:
 org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lo
 rg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lsca
 la/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/
 dstream/PairDStreamFunctions;

 java.lang.NoSuchMethodError:
 org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lo
 rg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lsca
 la/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/
 dstream/PairDStreamFunctions;



 In my test case I have these Spark related imports imports:

 import org.apache.spark.streaming.StreamingContext._

 import org.apache.spark.streaming.TestSuiteBase

 import org.apache.spark.streaming.dstream.DStream

 import 
 org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions



 -Adrian




Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Hi Nicos, Taking forward your argument,please be a smart a$$ and dont use
unprofessional language just for the sake of being a moderator.
Paco Nathan is respected for the dignity he carries in sharing his
knowledge and making it available free for a$$es like us right!
So just mind your tongue next time you put such a$$ in your mouth.

Best Regards,
Sudipta

On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis ikon...@me.com wrote:

 Folks,
 Just a gentle reminder we owe to ourselves:
 - this is a public forum and we need to behave accordingly, it is not
 place to vent frustration in rude way
 - getting attention here is an earned privilege and not entitlement
 - this is not a “Platinum Support” department of your vendor rather and
 open source collaboration forum where people volunteer their time to pay
 attention to your needs
 - there are still many gray areas so be patient and articulate questions
 in as much details as possible if you want to get quick help and not just
 be perceived as a smart a$$

 FYI - Paco Nathan is a well respected Spark evangelist and many people,
 including myself, owe to his passion for jumping on Spark platform promise.
 People like Sean Owen keep us believing in things when we feel like hitting
 the dead-end.

 Please, be respectful of what connections you are prized with and act
 civilized.

 Have a great day!
 - Nicos


  On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote:
 
  Yes, this isn't a well-formed question, and got maybe the response it
  deserved, but the tone is veering off the rails. I just got a much
  ruder reply from Sudipta privately, which I will not forward. Sudipta,
  I suggest you take the responses you've gotten so far as about as much
  answer as can be had here and do some work yourself, and come back
  with much more specific questions, and it will all be helpful and
  polite again.
 
  On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
  Hi Marco,
 
  Thanks for the confirmation. Please let me know what are the lot more
 detail
  you need to answer a very specific question  WHAT IS THE MINIMUM
 HARDWARE
  CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN  on a system?
  Please let me know if you need any further information and if you dont
 know
  please drive across with the $1 to Sir Paco Nathan and get me the
  answer.
 
  Thanks and Regards,
  Sudipta
 
  On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com
 wrote:
 
  Hi,
 
  Let me reword your request so you understand how (too) generic your
  question is
 
  Hi, I have $10,000, please find me some means of transportation so I
 can
  get to work.
 
  Please provide (a lot) more details. If you can't, consider using one
 of
  the pre-built express VMs from either Cloudera, Hortonworks or MapR,
 for
  example.
 
  Marco
 
 
 
  On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
 
 
 
  Hi Apache-Spark team ,
 
  What are the system requirements installing Hadoop and Apache Spark?
  I have attached the screen shot of Gparted.
 
 
  Thanks and regards,
  Sudipta
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
  Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 




-- 
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099


Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Thank you very much Marco! Really appreciate your support.

On Thu, Jan 22, 2015 at 10:57 PM, Marco Shaw marco.s...@gmail.com wrote:

 (Starting over...)

 The best place to look for the requirements would be at the individual
 pages of each technology.

 As for absolute minimum requirements, I would suggest 50GB of disk space
 and at least 8GB of memory.  This is the absolute minimum.

 Architecting a solution like you are looking for is very complex.  If
 you are just looking for a proof-of-concept consider a Docker image or
 going to Cloudera/Hortonworks/MapR and look for their express VMs which
 can usually run on Oracle Virtualbox or VMware.

 Marco


 On Thu, Jan 22, 2015 at 7:36 AM, Sudipta Banerjee 
 asudipta.baner...@gmail.com wrote:



 Hi Apache-Spark team ,

 What are the system requirements installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.


 Thanks and regards,
 Sudipta




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099


Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Sudipta - Please don't ever come here or post here again.

On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee 
asudipta.baner...@gmail.com wrote:

 Hi Nicos, Taking forward your argument,please be a smart a$$ and dont use
 unprofessional language just for the sake of being a moderator.
 Paco Nathan is respected for the dignity he carries in sharing his
 knowledge and making it available free for a$$es like us right!
 So just mind your tongue next time you put such a$$ in your mouth.

 Best Regards,
 Sudipta

 On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis ikon...@me.com wrote:

 Folks,
 Just a gentle reminder we owe to ourselves:
 - this is a public forum and we need to behave accordingly, it is not
 place to vent frustration in rude way
 - getting attention here is an earned privilege and not entitlement
 - this is not a “Platinum Support” department of your vendor rather and
 open source collaboration forum where people volunteer their time to pay
 attention to your needs
 - there are still many gray areas so be patient and articulate questions
 in as much details as possible if you want to get quick help and not just
 be perceived as a smart a$$

 FYI - Paco Nathan is a well respected Spark evangelist and many people,
 including myself, owe to his passion for jumping on Spark platform promise.
 People like Sean Owen keep us believing in things when we feel like hitting
 the dead-end.

 Please, be respectful of what connections you are prized with and act
 civilized.

 Have a great day!
 - Nicos


  On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote:
 
  Yes, this isn't a well-formed question, and got maybe the response it
  deserved, but the tone is veering off the rails. I just got a much
  ruder reply from Sudipta privately, which I will not forward. Sudipta,
  I suggest you take the responses you've gotten so far as about as much
  answer as can be had here and do some work yourself, and come back
  with much more specific questions, and it will all be helpful and
  polite again.
 
  On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
  Hi Marco,
 
  Thanks for the confirmation. Please let me know what are the lot more
 detail
  you need to answer a very specific question  WHAT IS THE MINIMUM
 HARDWARE
  CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN  on a
 system?
  Please let me know if you need any further information and if you dont
 know
  please drive across with the $1 to Sir Paco Nathan and get me the
  answer.
 
  Thanks and Regards,
  Sudipta
 
  On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com
 wrote:
 
  Hi,
 
  Let me reword your request so you understand how (too) generic your
  question is
 
  Hi, I have $10,000, please find me some means of transportation so I
 can
  get to work.
 
  Please provide (a lot) more details. If you can't, consider using one
 of
  the pre-built express VMs from either Cloudera, Hortonworks or MapR,
 for
  example.
 
  Marco
 
 
 
  On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
 
 
 
  Hi Apache-Spark team ,
 
  What are the system requirements installing Hadoop and Apache Spark?
  I have attached the screen shot of Gparted.
 
 
  Thanks and regards,
  Sudipta
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
  Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099



Re: Installing Spark Standalone to a Cluster

2015-01-22 Thread Yana Kadiyska
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're
missing --master. In addition, it's a good idea to pass with --master
exactly the spark master's endpoint as shown on your UI under
http://localhost:8080. But that should do it. If that's not working, you
can look at the Worker log and see where it's trying to connect to and if
it's getting any errors.

On Thu, Jan 22, 2015 at 12:06 PM, riginos samarasrigi...@gmail.com wrote:

 I have downloaded spark-1.2.0.tgz on each of my node and execute ./sbt/sbt
 assembly on each of them.  So I execute. /sbin/start-master.sh on my master
 and ./bin/spark-class org.apache.spark.deploy.worker.Worker
 spark://IP:PORT.
 Althought when I got to http://localhost:8080 I cannot see any worker. Why
 is that? Do I do something wrong with the installation deploy of the spark?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21319.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Would Join on PairRDD's result in co-locating data by keys?

2015-01-22 Thread Ankur Srivastava
Hi,

I wanted to understand how a join on two pair RDDs works. Would it result
in shuffling the data from both RDDs so that records with the same key land in
the same partition? If that is the case, would it be better to use the partitionBy
function to partition the RDDs (by the join attribute) at creation, for less shuffling?
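
For reference, a minimal sketch of the pattern being asked about (names and the
partition count are placeholders):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(16)
val left  = leftPairs.partitionBy(partitioner).cache()
val right = rightPairs.partitionBy(partitioner).cache()
val joined = left.join(right)  // both sides share the partitioner, so the join itself
                               // does not need a full shuffle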

Thanks

Ankur


Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId

2015-01-22 Thread Edwin
I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast
method. When I call the broadcast function on a small dataset on a 5-node
cluster, I get the "Error sending message as driverActor is null" error
after broadcasting the variables several times (the apps run under JBoss).

Any help would be appreciate.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-broadcast-error-Error-sending-message-as-driverActor-is-null-message-UpdateBlockInfo-Bld-tp21320.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Joseph Ottinger
Sudipta, with all due respect... "don't respond to me if you don't like
what I say" is not the same as not being a jerk about it. One earns social
capital by being respectful and by respecting the social norms during
interaction; from everything I've seen, you've been demanding and
disrespectful (although some of it's been in private, so I don't know what
those interactions have looked like, nor do I care).

Bottom line, you could do a lot to mitigate negative responses in what has
been a supportive environment, by all appearances. That's on you, not the
community.

On Thu, Jan 22, 2015 at 12:35 PM, Sudipta Banerjee 
asudipta.baner...@gmail.com wrote:

 Dont ever reply to my queries :D

 On Thu, Jan 22, 2015 at 11:02 PM, Lukas Nalezenec 
 lukas.naleze...@firma.seznam.cz wrote:

  +1


 On 22.1.2015 18:30, Marco Shaw wrote:

  Sudipta - Please don't ever come here or post here again.

 On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee 
 asudipta.baner...@gmail.com wrote:

   Hi Nicos, Taking forward your argument,please be a smart a$$ and dont
 use unprofessional language just for the sake of being a moderator.
  Paco Nathan is respected for the dignity he carries in sharing his
 knowledge and making it available free for a$$es like us right!
  So just mind your tongue next time you put such a$$ in your mouth.

  Best Regards,
  Sudipta

 On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis ikon...@me.com
 wrote:

 Folks,
 Just a gentle reminder we owe to ourselves:
 - this is a public forum and we need to behave accordingly, it is not
 place to vent frustration in rude way
 - getting attention here is an earned privilege and not entitlement
 - this is not a “Platinum Support” department of your vendor rather and
 open source collaboration forum where people volunteer their time to pay
 attention to your needs
 - there are still many gray areas so be patient and articulate
 questions in as much details as possible if you want to get quick help and
 not just be perceived as a smart a$$

 FYI - Paco Nathan is a well respected Spark evangelist and many people,
 including myself, owe to his passion for jumping on Spark platform promise.
 People like Sean Owen keep us believing in things when we feel like hitting
 the dead-end.

 Please, be respectful of what connections you are prized with and act
 civilized.

 Have a great day!
 - Nicos


  On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote:
 
Yes, this isn't a well-formed question, and got maybe the response
 it
  deserved, but the tone is veering off the rails. I just got a much
  ruder reply from Sudipta privately, which I will not forward. Sudipta,
  I suggest you take the responses you've gotten so far as about as much
  answer as can be had here and do some work yourself, and come back
  with much more specific questions, and it will all be helpful and
  polite again.
 
  On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
  Hi Marco,
 
  Thanks for the confirmation. Please let me know what are the lot
 more detail
  you need to answer a very specific question  WHAT IS THE MINIMUM
 HARDWARE
  CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN  on a
 system?
  Please let me know if you need any further information and if you
 dont know
  please drive across with the $1 to Sir Paco Nathan and get me the
  answer.
 
  Thanks and Regards,
  Sudipta
 
  On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com
 wrote:
 
  Hi,
 
  Let me reword your request so you understand how (too) generic your
  question is
 
  Hi, I have $10,000, please find me some means of transportation so
 I can
  get to work.
 
  Please provide (a lot) more details. If you can't, consider using
 one of
  the pre-built express VMs from either Cloudera, Hortonworks or
 MapR, for
  example.
 
  Marco
 
 
 
  On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
 
 
 
  Hi Apache-Spark team ,
 
  What are the system requirements installing Hadoop and Apache
 Spark?
  I have attached the screen shot of Gparted.
 
 
  Thanks and regards,
  Sudipta
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
  Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
 
 -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 




 --
  Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099






 --
 Sudipta Banerjee
 Consultant, Business 

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
(Starting over...)

The best place to look for the requirements would be at the individual
pages of each technology.

As for absolute minimum requirements, I would suggest 50GB of disk space
and at least 8GB of memory.  This is the absolute minimum.

Architecting a solution like you are looking for is very complex.  If you
are just looking for a proof-of-concept consider a Docker image or going to
Cloudera/Hortonworks/MapR and look for their express VMs which can
usually run on Oracle Virtualbox or VMware.

Marco


On Thu, Jan 22, 2015 at 7:36 AM, Sudipta Banerjee 
asudipta.baner...@gmail.com wrote:



 Hi Apache-Spark team ,

 What are the system requirements installing Hadoop and Apache Spark?
 I have attached the screen shot of Gparted.


 Thanks and regards,
 Sudipta




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Lukas Nalezenec

+1

On 22.1.2015 18:30, Marco Shaw wrote:

Sudipta - Please don't ever come here or post here again.

On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee 
asudipta.baner...@gmail.com mailto:asudipta.baner...@gmail.com wrote:


Hi Nicos, Taking forward your argument,please be a smart a$$ and
dont use unprofessional language just for the sake of being a
moderator.
Paco Nathan is respected for the dignity he carries in sharing his
knowledge and making it available free for a$$es like us right!
So just mind your tongue next time you put such a$$ in your mouth.

Best Regards,
Sudipta

On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis ikon...@me.com
mailto:ikon...@me.com wrote:

Folks,
Just a gentle reminder we owe to ourselves:
- this is a public forum and we need to behave accordingly, it
is not place to vent frustration in rude way
- getting attention here is an earned privilege and not
entitlement
- this is not a “Platinum Support” department of your vendor
rather and open source collaboration forum where people
volunteer their time to pay attention to your needs
- there are still many gray areas so be patient and articulate
questions in as much details as possible if you want to get
quick help and not just be perceived as a smart a$$

FYI - Paco Nathan is a well respected Spark evangelist and
many people, including myself, owe to his passion for jumping
on Spark platform promise. People like Sean Owen keep us
believing in things when we feel like hitting the dead-end.

Please, be respectful of what connections you are prized with
and act civilized.

Have a great day!
- Nicos


 On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com
mailto:so...@cloudera.com wrote:

 Yes, this isn't a well-formed question, and got maybe the
response it
 deserved, but the tone is veering off the rails. I just got
a much
 ruder reply from Sudipta privately, which I will not
forward. Sudipta,
 I suggest you take the responses you've gotten so far as
about as much
 answer as can be had here and do some work yourself, and
come back
 with much more specific questions, and it will all be
helpful and
 polite again.

 On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
 asudipta.baner...@gmail.com
mailto:asudipta.baner...@gmail.com wrote:
 Hi Marco,

 Thanks for the confirmation. Please let me know what are
the lot more detail
 you need to answer a very specific question  WHAT IS THE
MINIMUM HARDWARE
 CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN 
on a system?

 Please let me know if you need any further information and
if you dont know
 please drive across with the $1 to Sir Paco Nathan and
get me the
 answer.

 Thanks and Regards,
 Sudipta

 On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw
marco.s...@gmail.com mailto:marco.s...@gmail.com wrote:

 Hi,

 Let me reword your request so you understand how (too)
generic your
 question is

 Hi, I have $10,000, please find me some means of
transportation so I can
 get to work.

 Please provide (a lot) more details. If you can't,
consider using one of
 the pre-built express VMs from either Cloudera,
Hortonworks or MapR, for
 example.

 Marco



 On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
 asudipta.baner...@gmail.com
mailto:asudipta.baner...@gmail.com wrote:



 Hi Apache-Spark team ,

 What are the system requirements installing Hadoop and
Apache Spark?
 I have attached the screen shot of Gparted.


 Thanks and regards,
 Sudipta




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099 tel:%2B919019578099
 Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png


-
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
mailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail:
user-h...@spark.apache.org mailto:user-h...@spark.apache.org




 --
 Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099 

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Dont ever reply to my queries :D

On Thu, Jan 22, 2015 at 11:02 PM, Lukas Nalezenec 
lukas.naleze...@firma.seznam.cz wrote:

  +1


 On 22.1.2015 18:30, Marco Shaw wrote:

  Sudipta - Please don't ever come here or post here again.

 On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee 
 asudipta.baner...@gmail.com wrote:

   Hi Nicos, Taking forward your argument,please be a smart a$$ and dont
 use unprofessional language just for the sake of being a moderator.
  Paco Nathan is respected for the dignity he carries in sharing his
 knowledge and making it available free for a$$es like us right!
  So just mind your tongue next time you put such a$$ in your mouth.

  Best Regards,
  Sudipta

 On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis ikon...@me.com wrote:

 Folks,
 Just a gentle reminder we owe to ourselves:
 - this is a public forum and we need to behave accordingly, it is not
 place to vent frustration in rude way
 - getting attention here is an earned privilege and not entitlement
 - this is not a “Platinum Support” department of your vendor rather and
 open source collaboration forum where people volunteer their time to pay
 attention to your needs
 - there are still many gray areas so be patient and articulate questions
 in as much details as possible if you want to get quick help and not just
 be perceived as a smart a$$

 FYI - Paco Nathan is a well respected Spark evangelist and many people,
 including myself, owe to his passion for jumping on Spark platform promise.
 People like Sean Owen keep us believing in things when we feel like hitting
 the dead-end.

 Please, be respectful of what connections you are prized with and act
 civilized.

 Have a great day!
 - Nicos


  On Jan 22, 2015, at 7:49 AM, Sean Owen so...@cloudera.com wrote:
 
Yes, this isn't a well-formed question, and got maybe the response
 it
  deserved, but the tone is veering off the rails. I just got a much
  ruder reply from Sudipta privately, which I will not forward. Sudipta,
  I suggest you take the responses you've gotten so far as about as much
  answer as can be had here and do some work yourself, and come back
  with much more specific questions, and it will all be helpful and
  polite again.
 
  On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
  Hi Marco,
 
  Thanks for the confirmation. Please let me know what are the lot more
 detail
  you need to answer a very specific question  WHAT IS THE MINIMUM
 HARDWARE
  CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN  on a
 system?
  Please let me know if you need any further information and if you
 dont know
  please drive across with the $1 to Sir Paco Nathan and get me the
  answer.
 
  Thanks and Regards,
  Sudipta
 
  On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com
 wrote:
 
  Hi,
 
  Let me reword your request so you understand how (too) generic your
  question is
 
  Hi, I have $10,000, please find me some means of transportation so
 I can
  get to work.
 
  Please provide (a lot) more details. If you can't, consider using
 one of
  the pre-built express VMs from either Cloudera, Hortonworks or MapR,
 for
  example.
 
  Marco
 
 
 
  On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
  asudipta.baner...@gmail.com wrote:
 
 
 
  Hi Apache-Spark team ,
 
  What are the system requirements installing Hadoop and Apache Spark?
  I have attached the screen shot of Gparted.
 
 
  Thanks and regards,
  Sudipta
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
  Screenshot - Wednesday 21 January 2015 - 10:55:29 IST.png
 
 
 -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 
  --
  Sudipta Banerjee
  Consultant, Business Analytics and Cloud Based Architecture
  Call me +919019578099
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 




 --
  Sudipta Banerjee
 Consultant, Business Analytics and Cloud Based Architecture
 Call me +919019578099






-- 
Sudipta Banerjee
Consultant, Business Analytics and Cloud Based Architecture
Call me +919019578099


Re: Using third party libraries in pyspark

2015-01-22 Thread Felix C
Python couldn't find your module. Do you have it on each worker node? You 
will need to have it on each one.

--- Original Message ---

From: Davies Liu dav...@databricks.com
Sent: January 22, 2015 9:12 PM
To: Mohit Singh mohit1...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Using third party libraries in pyspark

You need to install these libraries on all the slaves, or submit via
spark-submit:

spark-submit --py-files  xxx

On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote:
 Hi,
   I might be asking something very trivial, but whats the recommend way of
 using third party libraries.
 I am using tables to read hdf5 format file..
 And here is the error trace:


 print rdd.take(2)
   File /tmp/spark/python/pyspark/rdd.py, line , in take
 res = self.context.runJob(self, takeUpToNumLeft, p, True)
   File /tmp/spark/python/pyspark/context.py, line 818, in runJob
 it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd,
 javaPartitions, allowLocal)
   File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
 line 538, in __call__
   File /tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line
 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling
 z:org.apache.spark.api.python.PythonRDD.runJob.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
 (TID 3, srv-108-23.720.rdio): org.apache.spark.api.python.PythonException:
 Traceback (most recent call last):
   File
 /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py,
 line 90, in main
 command = pickleSer._read_with_length(infile)
   File
 /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
 line 151, in _read_with_length
 return self.loads(obj)
   File
 /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py,
 line 396, in loads
 return cPickle.loads(obj)
   File
 /hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py,
 line 825, in subimport
 __import__(name)
 ImportError: ('No module named tables', function subimport at 0x47e1398,
 ('tables',))

 Though import tables works fine in the local Python shell, it seems like
 everything is being pickled. Are we expected to send all the files as
 helper files? That doesn't seem right.
 Thanks


 --


 Mohit

 When you want success as badly as you want the air, then you will get it.
 There is no other secret of success.
 -Socrates


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: processing large dataset

2015-01-22 Thread Russell Jurney
Often when this happens to me, it is actually an exception thrown while
parsing a few records. This is easy to miss, as the error messages aren't
always informative. I would be blaming Spark, when in reality the problem was
missing fields in a CSV file.

As has been said, make a file with a few records and see if your job works.
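
A small PySpark sketch of that approach; the input path and field layout are
made up for illustration. The idea is to pull a tiny slice of the data first
and wrap the parsing so malformed records are counted instead of killing the
stage.

  from pyspark import SparkContext

  sc = SparkContext(appName="dry-run-on-a-sample")

  def parse(line):
      # guard against short or malformed CSV rows rather than letting one
      # bad record abort the whole job
      try:
          fields = line.split(",")
          return [("ok", (fields[0], int(fields[2])))]
      except (IndexError, ValueError):
          return [("bad", line)]

  # dry-run on a small sample before launching the full job
  sample = sc.textFile("s3n://my-bucket/input/part-00000").take(1000)
  parsed = sc.parallelize(sample).flatMap(parse)

  bad = parsed.filter(lambda kv: kv[0] == "bad").count()
  print("%d malformed records in the sample" % bad)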

On Thursday, January 22, 2015, Jörn Franke jornfra...@gmail.com wrote:

 Did you try it with a smaller subset of the data first?
 On 23 Jan 2015 at 05:54, Kane Kim kane.ist...@gmail.com wrote:

 I'm trying to process 5TB of data, not doing anything fancy, just
 map/filter and reduceByKey. I spent the whole day today trying to get it
 processed, but never succeeded. I've tried to deploy to EC2 with the
 script provided with Spark, on pretty beefy machines (100 r3.2xlarge
 nodes). I'm really frustrated that Spark doesn't work out of the box for
 anything bigger than the word count sample. One big problem is that the
 defaults are not suitable for processing big datasets; the provided EC2
 script could do a better job, knowing the instance type requested. Second,
 it takes hours to figure out what is wrong when a Spark job fails after
 almost finishing processing. Even after raising all limits as per
 https://spark.apache.org/docs/latest/tuning.html it still fails (now
 with: error communicating with MapOutputTracker).

 After all this I have only one question - how do I tune Spark for
 processing terabytes of data, and is there a way to make this
 configuration easier and more transparent?

 Thanks.




-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
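
On the tuning question upthread, a hedged sketch of the settings that usually
matter for a large reduceByKey in Spark 1.x. The values are guesses for a
cluster of this size, not recommendations, and the paths are illustrative; the
"error communicating with MapOutputTracker" symptom was, at the time,
frequently addressed by raising spark.akka.frameSize and spark.akka.askTimeout
when a job had a very large number of partitions.

  from pyspark import SparkConf, SparkContext

  # illustrative values only; tune against your own cluster and data
  conf = (SparkConf()
          .setAppName("5tb-reduce")
          .set("spark.executor.memory", "48g")
          .set("spark.default.parallelism", "2400")   # roughly 3x the core count
          .set("spark.serializer",
               "org.apache.spark.serializer.KryoSerializer")
          .set("spark.akka.frameSize", "128")         # MB, a Spark 1.x setting
          .set("spark.akka.askTimeout", "120"))       # seconds

  sc = SparkContext(conf=conf)

  lines = sc.textFile("s3n://my-bucket/5tb-input/")
  counts = (lines.map(lambda l: (l.split(",")[0], 1))
                 .reduceByKey(lambda a, b: a + b, numPartitions=2400))
  counts.saveAsTextFile("s3n://my-bucket/output/")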