Spark: All masters are unresponsive!

2014-07-07 Thread Sameer Tilak
Hi All,
I am having a few issues with stability and scheduling. When I use spark-shell 
to submit my application, I get the following error message and spark-shell 
crashes. I have a small 4-node cluster for PoC. I tried both manual and 
script-based cluster setup. I also tried using the FQDN to specify the 
master node, but no luck.
14/07/07 23:44:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[6] at map at JaccardScore.scala:83)
14/07/07 23:44:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:0 as 2322 bytes in 0 ms
14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:1 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:1 as 2322 bytes in 0 ms
14/07/07 23:44:35 INFO Executor: Running task ID 1
14/07/07 23:44:35 INFO Executor: Running task ID 2
14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
14/07/07 23:44:35 INFO HadoopRDD: Input split: hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:0+97239389
14/07/07 23:44:35 INFO HadoopRDD: Input split: hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:97239389+97239390
14/07/07 23:44:54 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...
14/07/07 23:45:14 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...
14/07/07 23:45:35 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/07/07 23:45:35 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
14/07/07 23:45:35 WARN HadoopRDD: Exception in RecordReader.close()
java.io.IOException: Filesystem closed
  at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
  at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:2135)
  at java.io.FilterInputStream.close(FilterInputStream.java:181)
  at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
  at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:208)
  at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
  at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:193)
  at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
  at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
14/07/07 23:45:35 ERROR Executor: Exception in task ID 2
java.io.IOException: Filesystem closed
  at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
  at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2213)
  at java.io.DataInputStream.read(DataInputStream.java:100)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growab

Spark SQL registerAsTable requires a Java Class?

2014-07-07 Thread Ionized
The Java API requires a Java Class to register as table.

// Apply a schema to an RDD of JavaBeans and register it as a table.
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");

If instead of a JavaRDD of JavaBeans I had a JavaRDD of Lists (along with the
knowledge of column names and types that go along with the List) and wanted to
write a general-purpose registerAsTable, is there any other way besides using
ASM and dynamically creating Java classes?


Error and doubts in using MLlib Naive Bayes for text classification

2014-07-07 Thread Rahul Bhojwani
Hello,

I am a novice. I want to classify text into two classes. For this
purpose I want to use the Naive Bayes model. I am using Python for it.

Here are the problems I am facing:

*Problem 1:* I wanted to use all words as features for the bag-of-words
model, which means my features will be counts of individual words. In this
case, whenever a new word appears in the test data (one that was never present
in the training data), I need to increase the size of the feature vector to
incorporate that word as well. Correct me if I am wrong. Can I do that with
the present MLlib NaiveBayes, or what is the way in which I can incorporate
this?
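
For reference, below is a minimal sketch of the usual workaround: build a fixed
vocabulary from the training data only and drop (or hash) unseen test words, so
the feature dimension never changes. The names here (build_vocab, featurize) are
illustrative helpers, not MLlib API.

import numpy as np

def build_vocab(tokenized_docs):
    # Assign one feature index per distinct word seen in the training data.
    vocab = {}
    for tokens in tokenized_docs:
        for w in tokens:
            if w not in vocab:
                vocab[w] = len(vocab)
    return vocab

def featurize(tokens, vocab):
    # Fixed-length count vector; words never seen in training are ignored,
    # so train and test vectors always have the same dimension.
    vec = np.zeros(len(vocab))
    for w in tokens:
        idx = vocab.get(w)
        if idx is not None:
            vec[idx] += 1.0
    return vec

An alternative that avoids storing a vocabulary at all is the hashing trick: map
each word to hash(w) % num_features for some fixed num_features, so unseen words
still land inside a fixed-size vector.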

*Problem 2:* As I was not able to proceed with all words, I did some
pre-processing and figured out a few features from the text. But using this
also gives errors.
Right now I am testing with only one feature from the text, namely the count
of positive words. I am submitting the code below, along with the error:


#Code

import tokenizer
import gettingWordLists as gl
from pyspark.mllib.classification import NaiveBayes
from numpy import array
from pyspark import SparkContext, SparkConf

conf = (SparkConf().setMaster("local[6]").setAppName("My app").set("spark.executor.memory", "1g"))

sc = SparkContext(conf=conf)

# Getting the positive dict:
pos_list = []
pos_list = gl.getPositiveList()
tok = tokenizer.Tokenizer(preserve_case=False)


train_data = []

with open("training_file.csv", "r") as train_file:
    for line in train_file:
        tokens = line.split(",")
        msg = tokens[0]
        sentiment = tokens[1]
        count = 0
        tokens = set(tok.tokenize(msg))
        for i in tokens:
            if i.encode('utf-8') in pos_list:
                count += 1
        if sentiment.__contains__('NEG'):
            label = 0.0
        else:
            label = 1.0
        feature = []
        feature.append(label)
        feature.append(float(count))
        train_data.append(feature)


model = NaiveBayes.train(sc.parallelize(array(train_data)))
print model.pi
print model.theta
print "\n\n\n\n\n", model.predict(array([5.0]))

##


*This is the output:*

[-2.24512292 -0.11195389]
[[ 0.]
 [ 0.]]

Traceback (most recent call last):
  File "naive_bayes_analyser.py", line 77, in <module>
    print "\n\n\n\n\n", model.predict(array([5.0]))
  File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
    return numpy.argmax(self.pi + dot(x, self.theta))
ValueError: matrices are not aligned

##

*Problem 3*: As you can see, the output for model.pi is negative, i.e. the prior
probabilities are negative. Can someone explain that as well? Is it the log of
the probability?
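
As a side note, a quick check is possible under the assumption that model.pi
holds the log of the class priors (which is how MLlib's Naive Bayes stores them):
exponentiating should recover probabilities that sum to 1.

import numpy as np

log_priors = np.array([-2.24512292, -0.11195389])  # values from the output above
priors = np.exp(log_priors)
print priors        # roughly [0.106, 0.894]
print priors.sum()  # roughly 1.0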



Thanks,
-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Re: Spark RDD Disk Persistence

2014-07-07 Thread Lizhengbing (bing, BIPA)
You might let your data be stored in Tachyon.

From: Jahagirdar, Madhu [mailto:madhu.jahagir...@philips.com]
Sent: July 8, 2014 10:16
To: user@spark.apache.org
Subject: Spark RDD Disk Persistence

Should I use disk-based persistence for RDDs? If the machine goes down during 
program execution, would the data be intact and not lost the next time I rerun 
the program?

Regards,
Madhu Jahagirdar


The information contained in this message may be confidential and legally 
protected under applicable law. The message is intended solely for the 
addressee(s). If you are not the intended recipient, you are hereby notified 
that any use, forwarding, dissemination, or reproduction of this message is 
strictly prohibited and may be unlawful. If you are not the intended recipient, 
please contact the sender by return e-mail and destroy all copies of the 
original message.


Re: the Pre-built packages for CDH4 can not support yarn ?

2014-07-07 Thread Matei Zaharia
They are for CDH4 without YARN, since YARN is experimental in that. You can 
download one of the Hadoop 2 packages if you want to run on YARN. Or you might 
have to build specifically against CDH4's version of YARN if that doesn't work.

Matei

On Jul 7, 2014, at 9:37 PM, ch huang  wrote:

> hi,maillist :
> i download the pre-built spark packages for CDH4 ,but it say can not 
> support yarn ,why? i need build it by myself with yarn support enable?



Re: Is the order of messages guaranteed in a DStream?

2014-07-07 Thread Mayur Rustagi
If you receive data through multiple receivers across the cluster, I don't
think any order can be guaranteed. Order in distributed systems is tough.

On Tuesday, July 8, 2014, Yan Fang  wrote:

> I know the order of processing DStream is guaranteed. Wondering if the
> order of messages in one DStream is guaranteed. My gut feeling is yes for
> the question because RDD is immutable. Some simple tests prove this. Want
> to hear from "authority" to persuade myself. Thank you.
>
> Best,
>
> Fang, Yan
> yanfang...@gmail.com
> 
> +1 (206) 849-4108
>


-- 
Sent from Gmail Mobile


Re: Spark Installation

2014-07-07 Thread Krishna Sankar
Couldn't find any reference to CDH in pom.xml - profiles or the
hadoop.version. Am also wondering how the CDH-compatible artifact was
compiled.
Cheers



On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S 
wrote:

> Hi All,
>
> Does anyone know what the command line arguments to mvn are to generate
> the pre-built binary for Spark on Hadoop 2-CDH5?
>
> I would like to pull in a recent bug fix in spark-master and rebuild the
> binaries in the exact same way that was used for that provided on the
> website.
>
> I have tried the following:
>
> mvn install -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1
>
> And it doesn't quite work.
>
> Any thoughts anyone?
>
>


Seattle Spark Meetup slides: xPatterns, Fun Things, and Machine Learning Streams - next is Interactive OLAP

2014-07-07 Thread Denny Lee
Apologies for the delay, but we’ve had a bunch of great slides and sessions at 
Seattle Spark Meetup these past couple of months, including Claudiu Barbura’s 
"xPatterns on Spark, Shark, Mesos, and Tachyon"; Paco Nathan’s "Fun Things You 
Can Do with Spark 1.0"; and "Machine Learning Streams with Spark 1.0" by the 
fine folks at Ubix!

Come by for the next Seattle Spark Meetup session on Wednesday July 16th, 2014 
at the WhitePages office in downtown Seattle for Evan Chan’s “Interactive OLAP 
Queries using Cassandra and Spark”!

For more information, please reference this blog post http://wp.me/pHDEa-w4 or 
join the Seattle Spark Meetup at http://www.meetup.com/Seattle-Spark-Meetup/

Enjoy!
Denny



Re: Spark Installation

2014-07-07 Thread Jaideep Dhok
Hi Srikrishna,
You can use the make-distribution script in Spark to generate the binary.
Example - ./make-distribution.sh --tgz --hadoop HADOOP_VERSION

The above script calls maven, so you can look into it to get the exact mvn
command too.

Thanks,
Jaideep


On Tue, Jul 8, 2014 at 8:37 AM, Srikrishna S 
wrote:

> Hi All,
>
> Does anyone know what the command line arguments to mvn are to generate
> the pre-built binary for Spark on Hadoop 2-CDH5?
>
> I would like to pull in a recent bug fix in spark-master and rebuild the
> binaries in the exact same way that was used for that provided on the
> website.
>
> I have tried the following:
>
> mvn install -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1
>
> And it doesn't quite work.
>
> Any thoughts anyone?
>
>

-- 
_
The information contained in this communication is intended solely for the 
use of the individual or entity to whom it is addressed and others 
authorized to receive it. It may contain confidential or legally privileged 
information. If you are not the intended recipient you are hereby notified 
that any disclosure, copying, distribution or taking any action in reliance 
on the contents of this information is strictly prohibited and may be 
unlawful. If you have received this communication in error, please notify 
us immediately by responding to this email and then delete it from your 
system. The firm is neither liable for the proper and complete transmission 
of the information contained in this communication nor for any delay in its 
receipt.


Spark Installation

2014-07-07 Thread Srikrishna S
Hi All,

Does anyone know what the command line arguments to mvn are to generate the
pre-built binary for Spark on Hadoop 2-CDH5?

I would like to pull in a recent bug fix in spark-master and rebuild the
binaries in the exact same way that was used for that provided on the
website.

I have tried the following:

mvn install -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1

And it doesn't quite work.

Any thoughts anyone?


Help for the large number of the input data files

2014-07-07 Thread innowireless TaeYun Kim
Hi,

 

Help with implementation best practices is needed.

The operating environment is as follows:

 

- Log data file arrives irregularly.

- The size of a log data file is from 3.9KB to 8.5MB. The average is about
1MB.

- The number of records of a data file is from 13 lines to 22000 lines. The
average is about 2700 lines.

- Data file must be post-processed before aggregation.

- Post-processing algorithm can be changed.

- Post-processed files are managed separately from the original data files, since
the post-processing algorithm might be changed.

- Daily aggregation is performed. All post-processed data files must be
filtered record-by-record, and aggregations (average, max, min) are calculated.

- Since aggregation is fine-grained, the number of records after the
aggregation is not so small. It can be about half of the number of the
original records.

- At some point, the number of post-processed files can be about 200,000.

- A data file should be able to be deleted individually.

 

In a test, I tried to process 160,000 post-processed files with Spark, starting
with sc.textFile() with a glob path, and it failed with an OutOfMemory exception
on the driver process.

 

What is the best practice to handle this kind of data?

Should I use HBase instead of plain files to save post-processed data?

 

Thank you.

 



RE: Spark RDD Disk Persistence

2014-07-07 Thread Shao, Saisai
Hi Madhu,

I don't think you can reuse the persisted RDD the next time you run the 
program, because the folder for RDD materialization will be changed, and Spark 
will also lose the information about how to retrieve the previously persisted RDD.

AFAIK Spark has a fault tolerance mechanism: node failure will lead to 
recomputation of the affected partitions.

Thanks
Jerry

From: Jahagirdar, Madhu [mailto:madhu.jahagir...@philips.com]
Sent: Tuesday, July 08, 2014 10:16 AM
To: user@spark.apache.org
Subject: Spark RDD Disk Persistence

Should I use disk-based persistence for RDDs? If the machine goes down during 
program execution, would the data be intact and not lost the next time I rerun 
the program?

Regards,
Madhu Jahagirdar


The information contained in this message may be confidential and legally 
protected under applicable law. The message is intended solely for the 
addressee(s). If you are not the intended recipient, you are hereby notified 
that any use, forwarding, dissemination, or reproduction of this message is 
strictly prohibited and may be unlawful. If you are not the intended recipient, 
please contact the sender by return e-mail and destroy all copies of the 
original message.


Re: SparkSQL - Partitioned Parquet

2014-07-07 Thread Michael Armbrust
The only partitioning that is currently supported is through Hive
partitioned tables.  Supporting this for parquet as well is on our radar,
but probably won't happen for 1.1.


On Sun, Jul 6, 2014 at 10:00 PM, Raffael Marty  wrote:

> Does SparkSQL support partitioned parquet tables? How do I save to a
> partitioned parquet file from within Python?
>
>  table.saveAsParquetFile("table.parquet")
>
> This call doesn’t seem to support a partition argument. Or does my
> schemaRDD have to be setup a specific way?
>


Spark RDD Disk Persistence

2014-07-07 Thread Jahagirdar, Madhu
Should I use disk-based persistence for RDDs? If the machine goes down during 
program execution, would the data be intact and not lost the next time I rerun 
the program?

Regards,
Madhu Jahagirdar


The information contained in this message may be confidential and legally 
protected under applicable law. The message is intended solely for the 
addressee(s). If you are not the intended recipient, you are hereby notified 
that any use, forwarding, dissemination, or reproduction of this message is 
strictly prohibited and may be unlawful. If you are not the intended recipient, 
please contact the sender by return e-mail and destroy all copies of the 
original message.


Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
Here is a simple example of registering an RDD of Products as a table.  It
is important that all of the fields are val defined in the constructor and
that you implement canEqual, productArity and productElement.

class Record(val x1: String) extends Product with Serializable {
  def canEqual(that: Any) = that.isInstanceOf[Record]
  def productArity = 1
  def productElement(n: Int) = n match {
    case 0 => x1
  }
}

sparkContext.parallelize(new Record("a") :: Nil).registerAsTable("records")

sql("SELECT x1 FROM records").collect()


On Mon, Jul 7, 2014 at 6:39 PM, Haoming Zhang 
wrote:

> Hi Michael,
>
> Thanks for the reply.
>
> Actually, last week I tried to play with the Product interface, but I'm not
> really sure whether I did it correctly or not. Here is what I did:
>
> 1. Created an abstract class A with Product interface, which has 20
> parameters,
> 2. Created case class B extends A, and B has 20 parameters.
>
> I can get all the parameters of A, and also B's parameters, via the
> productElement function. I am just curious whether it is possible to convert
> this kind of case class to a schema, because I need to use the .registerAsTable
> function to insert the case classes into a table.
>
> Best,
> Haoming
>
> --
> From: mich...@databricks.com
> Date: Mon, 7 Jul 2014 17:52:34 -0700
>
> Subject: Re: SparkSQL with sequence file RDDs
> To: user@spark.apache.org
>
> We know Scala 2.11 has removed the limitation on the number of parameters, but
> Spark 1.0 is not compatible with it. So now we are considering using Java
> beans instead of Scala case classes.
>
>
> You can also manually create a class that implements Scala's Product
> interface.  Finally, SPARK-2179 will give you a
> programmatic, non-class-based way to describe the schema.  Someone is
> working on this now.
>


Re: Pig 0.13, Spark, Spork

2014-07-07 Thread 张包峰
Hi guys, previously I checked out the old "Spork" and updated it to Hadoop 2.0, 
Scala 2.10.3 and Spark 0.9.1; see my GitHub project: 
https://github.com/pelick/flare-spork


It is also highly experimental, and just directly maps Pig physical 
operations to Spark RDD transformations/actions. It works for simple requests. 
:)


I am also interested in the progress of Spork. Is it being developed at Twitter 
in a non-open-source way?


--
Thanks
Zhang Baofeng
Blog | Github | Weibo | LinkedIn




 




-- Original message --
From: "Mayur Rustagi"
Sent: Monday, July 7, 2014, 11:55 PM
To: "user@spark.apache.org"

Subject: Re: Pig 0.13, Spark, Spork



That version is old :). We are not forking pig but cleanly separating out pig 
execution engine. Let me know if you are willing to give it a go.



Also would love to know what features of pig you are using ? 
 


Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com
 @mayur_rustagi





 

On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux  wrote:
 I saw a wiki page from your company but with an old version of Spark:
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
 


I have no reason to use it yet but I am interested in the state of the 
initiative.
What's your point of view (personal and/or professional) about the Pig 0.13 
release?
Is the pluggable execution engine flexible enough in order to avoid having 
Spork as a fork of Pig? Pig + Spark + Fork = Spork :D
 

As a (for now) external observer, I am glad to see competition in that space. 
It can only be good for the community in the end.

 Bertrand Dechoux
 

On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi  wrote:
 Hi,
We have fixed many major issues around Spork and deploying it with some 
customers. We would be happy to provide a working version for you to try out. We 
are looking for more folks to try it out and submit bugs.
 

Regards
Mayur 


Mayur Rustagi
Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com
 @mayur_rustagi





 

On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux  wrote:
 Hi,

I was wondering what was the state of the Pig+Spark initiative now that the 
execution engine of Pig is pluggable? Granted, it was done in order to use Tez 
but could it be used by Spark? I know about a 'theoretical' project called 
Spork but I don't know any stable and maintained version of it.
 

Regards

Bertrand Dechoux

RE: The number of cores vs. the number of executors

2014-07-07 Thread innowireless TaeYun Kim
For your information, I've attached the Ganglia monitoring screen capture on
the Stack Overflow question.

Please see:
http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors

 

From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr] 
Sent: Tuesday, July 08, 2014 9:43 AM
To: user@spark.apache.org
Subject: The number of cores vs. the number of executors

 

Hi,

 

I'm trying to understand the relationship of the number of cores and the
number of executors when running a Spark job on YARN.

 

The test environment is as follows:

 

- # of data nodes: 3

- Data node machine spec:

- CPU: Core i7-4790 (# of cores: 4, # of threads: 8)

- RAM: 32GB (8GB x 4)

- HDD: 8TB (2TB x 4)

- Network: 1Gb

 

- Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair ->
reduceByKey -> map -> saveAsTextFile

 

- input data

- type: single text file

- size: 165GB

- # of lines: 454,568,833

 

- output

- # of lines after second filter: 310,640,717

- # of lines of the result file: 99,848,268

- size of the result file: 41GB

 

The job was run with following configurations:

 

1) --master yarn-client --executor-memory 19G --executor-cores 7
--num-executors 3  (executors per data node, use as much as cores)

2) --master yarn-client --executor-memory 19G --executor-cores 4
--num-executors 3  (# of cores reduced)

3) --master yarn-client --executor-memory 4G --executor-cores 2
--num-executors 12  (less core, more executor)

 

- elapsed times:

1) 50 min 15 sec

2) 55 min 48 sec

3) 31 min 23 sec

 

To my surprise, 3) was much faster.

I thought that 1) would be faster, since there would be less inter-executor
communication when shuffling.

Although the # of cores of 1) is fewer than that of 3), the # of cores is not the
key factor, since 2) did perform well.

 

How can I understand this result?

 

Thanks.

 

 

 

 



Is the order of messages guaranteed in a DStream?

2014-07-07 Thread Yan Fang
I know the order of processing DStream is guaranteed. Wondering if the
order of messages in one DStream is guaranteed. My gut feeling is yes for
the question because RDD is immutable. Some simple tests prove this. Want
to hear from "authority" to persuade myself. Thank you.

Best,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


RE: SparkSQL with sequence file RDDs

2014-07-07 Thread Haoming Zhang
Hi Michael,

Thanks for the reply.

Actually, last week I tried to play with the Product interface, but I'm not really 
sure whether I did it correctly or not. Here is what I did:

1. Created an abstract class A with Product interface, which has 20 parameters,
2. Created case class B extends A, and B has 20 parameters.

I can get all the parameters of A, and also B's parameters, via the productElement 
function. I am just curious whether it is possible to convert this kind of case 
class to a schema, because I need to use the .registerAsTable function to insert 
the case classes into a table.

Best,
Haoming

From: mich...@databricks.com
Date: Mon, 7 Jul 2014 17:52:34 -0700
Subject: Re: SparkSQL with sequence file RDDs
To: user@spark.apache.org



We know Scala 2.11 has removed the limitation on the number of parameters, but 
Spark 1.0 is not compatible with it. So now we are considering using Java beans 
instead of Scala case classes.



You can also manually create a class that implements Scala's Product interface. 
Finally, SPARK-2179 will give you a programmatic, non-class-based way to 
describe the schema.  Someone is working on this now.

  

the Pre-built packages for CDH4 can not support yarn ?

2014-07-07 Thread ch huang
hi, maillist:
I downloaded the pre-built Spark packages for CDH4, but it says it cannot
support YARN. Why? Do I need to build it myself with YARN support enabled?


Re: Spark SQL user defined functions

2014-07-07 Thread Michael Armbrust
The names of the directories that are created for the metastore are
different ("metastore" vs "metastore_db"), but that should be it.  Really
we should get rid of LocalHiveContext as it is mostly redundant and the
current state is kind of confusing.  I've created a JIRA to figure this out
before the 1.1 release.


On Mon, Jul 7, 2014 at 12:25 AM, Martin Gammelsæter <
martingammelsae...@gmail.com> wrote:

> Hi again, and thanks for your reply!
>
> On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust 
> wrote:
> >
> >> Sweet. Any idea about when this will be merged into master?
> >
> >
> > It is probably going to be a couple of weeks.  There is a fair amount of
> > cleanup that needs to be done.  It works though and we used it in most of
> > the demos at the spark summit.  Mostly I just need to add tests and move
> it
> > out of HiveContext (there is no good reason for that code to depend on
> > HiveContext). So you could also just try working with that branch.
> >
> >>
> >> This is probably a stupid question, but can you query Spark SQL tables
> >> from a (local?) hive context? In which case using that could be a
> >> workaround until the PR is merged.
> >
> >
> > Yeah, this is kind of subtle.  In a HiveContext, SQL Tables are just an
> > additional catalog that sits on top of the metastore.  All the query
> > execution occurs in the same code path, including the use of the Hive
> > Function Registry, independent of where the table comes from.  So for
> your
> > use case you can just create a hive context, which will create a local
> > metastore automatically if no hive-site.xml is present.
>
> Nice, that sounds like it'll solve my problems. Just for clarity, is
> LocalHiveContext and HiveContext equal if no hive-site.xml is present,
> or are there still differences?
>
> --
> Best regards,
> Martin Gammelsæter
>


Re: usage question for spark run on YARN

2014-07-07 Thread DB Tsai
spark-client mode runs the driver in your application's JVM, while
spark-cluster mode runs the driver in the YARN cluster.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jul 7, 2014 at 5:44 PM, Cheng Ju Chuang
 wrote:
> Hi,
>
>
>
> I am running some simple samples for my project. Right now the Spark sample
> is running on Hadoop 2.2 with YARN. My question is: what is the main
> difference when we run as spark-client versus spark-cluster, other than the
> different way to submit our job? And what is the specific way to configure
> the job, e.g. running on particular nodes which have more resources?
>
>
>
> Thank you very much
>
>
>
> Sincerely,
>
> Cheng-Ju Chuang


Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Tobias Pfeiffer
Juan,

I am doing something similar, just not "insert into SQL database", but
"issue some RPC call". I think mapPartitions() may be helpful to you. You
could do something like

dstream.mapPartitions(iter => {
  val db = new DbConnection()
  // maybe only do the above if !iter.isEmpty
  iter.map(item => {
    db.call(...)
    // do some cleanup if !iter.hasNext here
    item
  })
}).count() // force output

Keep in mind though that the whole idea about RDDs is that operations are
idempotent and in theory could be run on multiple hosts (to take the result
from the fastest server) or multiple times (to deal with failures/timeouts)
etc., which is maybe something you want to deal with in your SQL.

Tobias



On Tue, Jul 8, 2014 at 3:40 AM, Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com> wrote:

> Hi list,
>
> I'm writing a Spark Streaming program that reads from a kafka topic,
> performs some transformations on the data, and then inserts each record in
> a database with foreachRDD. I was wondering which is the best way to handle
> the connection to the database so each worker, or even each task, uses a
> different connection to the database, and then database inserts/updates
> would be performed in parallel.
> - I understand that using a final variable in the driver code is not a
> good idea because then the communication with the database would be
> performed in the driver code, which leads to a bottleneck, according to
> http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/
> - I think creating a new connection in the call() method of the Function
> passed to foreachRDD is also a bad idea, because then I wouldn't be reusing
> the connection to the database for each batch RDD in the DStream
> - I'm not sure that a broadcast variable with the connection handler is a
> good idea in case the target database is distributed, because if the same
> handler is used for all the nodes of the Spark cluster then that could have
> a negative effect in the data locality of the connection to the database.
> - From
> http://apache-spark-user-list.1001560.n3.nabble.com/Database-connection-per-worker-td1280.html
> I understand that by using an static variable and referencing it in the
> call() method of the Function passed to foreachRDD we get a different
> connection per Spark worker, I guess it's because there is a different JVM
> per worker. But then all the tasks in the same worker would share the same
> database handler object, am I right?
> - Another idea is using updateStateByKey() using the database handler as
> the state, but I guess that would only work for Serializable database
> handlers, and for example not for an org.apache.hadoop.hbase.client.HTable
> object.
>
> So my question is, which is the best way to get a connection to an
> external database per task in Spark Streaming? Or at least per worker. In
> http://apache-spark-user-list.1001560.n3.nabble.com/Connecting-to-an-inmemory-database-from-Spark-td1343.html
> there is a partial solution to this question, but there the database
> handler object is missing. This other question
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Shared-hashmaps-td3247.html
> is closer to mine, but there is no answer for it yet
>
> Thanks in advance,
>
> Greetings,
>
> Juan
>
>


Powered By Spark: Can you please add our org?

2014-07-07 Thread Alex Gaudio
Hi,

Sailthru is also using Spark.  Could you please add us to the Powered By
Spark  page
when you have a chance?

Organization Name: Sailthru
URL: www.sailthru.com
Short Description: Our data science platform uses Spark to build predictive
models and recommendation systems for marketing automation and
personalization


Thank you!
Alex


Re: how to set spark.executor.memory and heap size

2014-07-07 Thread Alex Gaudio
Hi All,


This is a bit late, but I found it helpful.  Piggy-backing on Wang Hao's
comment, spark will ignore the "spark.executor.memory" setting if you add
it to SparkConf via:

conf.set("spark.executor.memory", "1g")


What you actually should do depends on how you run spark.  I found some
"official" documentation for this in a bug report here:

https://issues.apache.org/jira/browse/SPARK-1264
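
As a concrete illustration (a minimal PySpark sketch, not an authoritative
recommendation), this is the programmatic pattern being discussed; per the JIRA
above, whether the setting takes effect from SparkConf depends on how the
application is launched, so passing it at launch time (e.g. spark-submit
--executor-memory 1g) or in spark-env.sh may be the safer route.

from pyspark import SparkConf, SparkContext

# Setting executor memory in code; depending on the deploy mode this may be
# ignored (see the JIRA above), in which case use the submit-time flag or
# spark-env.sh instead. The app name below is just illustrative.
conf = (SparkConf()
        .setAppName("memory-config-demo")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)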



Alex






On Fri, Jun 13, 2014 at 10:40 AM, Hao Wang  wrote:

> Hi, Laurent
>
> You could set spark.executor.memory and heap size by the following methods:
>
> 1. in you conf/spark-env.sh:
> *export SPARK_WORKER_MEMORY=38g*
> *export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit
> -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m"*
>
> 2. you could also add modification for executor memory and java opts in 
> *spark-submit
> *parameters.
>
> Check the Spark configuration and tuning docs; you could find full
> answers there.
>
>
> Regards,
> Wang Hao(王灏)
>
> CloudTeam | School of Software Engineering
> Shanghai Jiao Tong University
> Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
> Email:wh.s...@gmail.com
>
>
> On Thu, Jun 12, 2014 at 6:29 PM, Laurent T 
> wrote:
>
>> Hi,
>>
>> Can you give us a little more insight on how you used that file to solve
>> your problem ?
>> We're having the same OOM as you were and haven't been able to solve it
>> yet.
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p7469.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: Data loading to Parquet using spark

2014-07-07 Thread Soren Macbeth
I typed "spark parquet" into google and the top results was this blog post
about reading and writing parquet files from spark

http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/


On Mon, Jul 7, 2014 at 5:23 PM, Michael Armbrust 
wrote:

> SchemaRDDs, provided by Spark SQL, have a saveAsParquetFile command.  You
> can turn a normal RDD into a SchemaRDD using the techniques described here:
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> This should work with Impala, but if you run into any issues please let me
> know.
>
>
> On Sun, Jul 6, 2014 at 5:30 PM, Shaikh Riyaz  wrote:
>
>> Hi,
>>
>> We are planning to use spark to load data to Parquet and this data will
>> be query by Impala for present visualization through Tableau.
>>
>> Can we achieve this flow? How to load data to Parquet from spark? Will
>> impala be able to access the data loaded by spark?
>>
>> I will greatly appreciate if someone can help with the example to achieve
>> the goal.
>>
>> Thanks in advance.
>>
>> --
>> Regards,
>>
>> Riyaz
>>
>>
>


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread DB Tsai
Actually, the mode that requires installing the jar on each individual node is
standalone mode, which works for both MR1 and MR2. Cloudera and
Hortonworks currently support Spark in this way, as far as I know.

For both yarn-cluster or yarn-client, Spark will distribute the jars
through distributed cache and each executor can find the jars there.

On Jul 7, 2014 6:23 AM, "Chester @work"  wrote:
>
> In YARN cluster mode, you can either have Spark on all the cluster nodes or 
> supply the Spark jar yourself. In the 2nd case, you don't need to install Spark 
> on the cluster at all, as you supply the Spark assembly as well as your app jar 
> together.
>
> I hope this make it clear
>
> Chester
>
> Sent from my iPhone
>
> On Jul 7, 2014, at 5:05 AM, Konstantin Kudryavtsev 
>  wrote:
>
> thank you Krishna!
>
> Could you please explain why I need to install Spark on each node if the Spark 
> official site says: If you have a Hadoop 2 cluster, you can run Spark without 
> any installation needed
>
> I have HDP 2 (YARN) and that's why I hope I don't need to install spark on 
> each node
>
> Thank you,
> Konstantin Kudryavtsev
>
>
> On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar  wrote:
>>
>> Konstantin,
>>
>> You need to install the hadoop rpms on all nodes. If it is Hadoop 2, the 
>> nodes would have hdfs & YARN.
>> Then you need to install Spark on all nodes. I haven't had experience with 
>> HDP, but the tech preview might have installed Spark as well.
>> In the end, one should have hdfs,yarn & spark installed on all the nodes.
>> After installations, check the web console to make sure hdfs, yarn & spark 
>> are running.
>> Then you are ready to start experimenting/developing spark applications.
>>
>> HTH.
>> Cheers
>> 
>>
>>
>> On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev 
>>  wrote:
>>>
>>> guys, I'm not talking about running spark on VM, I don have problem with it.
>>>
>>> I confused in the next:
>>> 1) Hortonworks describe installation process as RPMs on each node
>>> 2) spark home page said that everything I need is YARN
>>>
>>> And I'm in stucj with understanding what I need to do to run spark on yarn 
>>> (do I need RPMs installations or only build spark on edge node?)
>>>
>>>
>>> Thank you,
>>> Konstantin Kudryavtsev
>>>
>>>
>>> On Mon, Jul 7, 2014 at 4:34 AM, Robert James  wrote:

 I can say from my experience that getting Spark to work with Hadoop 2
 is not for the beginner; after solving one problem after another
 (dependencies, scripts, etc.), I went back to Hadoop 1.

 Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
 why, but, given so, Hadoop 2 has too many bumps

 On 7/6/14, Marco Shaw  wrote:
 > That is confusing based on the context you provided.
 >
 > This might take more time than I can spare to try to understand.
 >
 > For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
 >
 > Cloudera's CDH 5 express VM includes Spark, but the service isn't 
 > running by
 > default.
 >
 > I can't remember for MapR...
 >
 > Marco
 >
 >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
 >>  wrote:
 >>
 >> Marco,
 >>
 >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
 >> can try
 >> from
 >> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
 >>
 >> On other hand, http://spark.apache.org/ said "
 >> Integrated with Hadoop
 >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
 >> existing Hadoop data.
 >>
 >> If you have a Hadoop 2 cluster, you can run Spark without any 
 >> installation
 >> needed. "
 >>
 >> And this is confusing for me... do I need rpm installation on not?...
 >>
 >>
 >> Thank you,
 >> Konstantin Kudryavtsev
 >>
 >>
 >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
 >>> wrote:
 >>> Can you provide links to the sections that are confusing?
 >>>
 >>> My understanding, the HDP1 binaries do not need YARN, while the HDP2
 >>> binaries do.
 >>>
 >>> Now, you can also install Hortonworks Spark RPM...
 >>>
 >>> For production, in my opinion, RPMs are better for manageability.
 >>>
  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
   wrote:
 
  Hello, thanks for your message... I'm confused, Hortonworhs suggest
  install spark rpm on each node, but on Spark main page said that yarn
  enough and I don't need to install it... What the difference?
 
  sent from my HTC
 
 > On Jul 6, 2014 8:34 PM, "vs"  wrote:
 > Konstantin,
 >
 > HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can
 > try
 > from
 > http://hortonworks.com/wp-content/uploads/2014/05/SparkTech

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
>
> We know Scala 2.11 has removed the limitation on the number of parameters, but
> Spark 1.0 is not compatible with it. So now we are considering using Java
> beans instead of Scala case classes.
>

You can also manually create a class that implements Scala's Product
interface.  Finally, SPARK-2179 will give you a
programmatic, non-class-based way to describe the schema.  Someone is
working on this now.


RE: SparkSQL with sequence file RDDs

2014-07-07 Thread Haoming Zhang
Hi Gray,

Like Michael mentioned, you need to take care of the Scala case classes or Java 
beans, because SparkSQL needs the schema.

Currently we are trying to insert our data into HBase with Scala 2.10.4 and Spark 
1.0.

All the data are tables. We created one case class for each row, which means 
the number of parameters of the case class should be the same as the number of 
columns. But Scala 2.10.4 has a limitation that the max parameter number for a 
case class is 22. So here the problem occurs. If the table is small, and the 
column number is less than 22, everything will be fine. But if we get a larger 
table with more than 22 columns, then an error will be reported.

We know Scala 2.11 has removed the limitation on the number of parameters, but 
Spark 1.0 is not compatible with it. So now we are considering using Java beans 
instead of Scala case classes.

Best,
Haoming



From: mich...@databricks.com
Date: Mon, 7 Jul 2014 17:12:42 -0700
Subject: Re: SparkSQL with sequence file RDDs
To: user@spark.apache.org

I haven't heard any reports of this yet, but I don't see any reason why it 
wouldn't work. You'll need to manually convert the objects that come out of the 
sequence file into something where SparkSQL can detect the schema (i.e. scala 
case classes or java beans) before you can register the RDD as a table.


If you run into any issues please let me know.

On Mon, Jul 7, 2014 at 12:36 PM, Gary Malouf  wrote:


Has anyone reported issues using SparkSQL with sequence files (all of our data 
is in this format within HDFS)?  We are considering whether to burn the time 
upgrading to Spark 1.0 from 0.9 now and this is a main decision point for us.  



  

usage question for spark run on YARN

2014-07-07 Thread Cheng Ju Chuang
Hi,

I am running some simple samples for my project. Right now the Spark sample is 
running on Hadoop 2.2 with YARN. My question is: what is the main difference when 
we run as spark-client versus spark-cluster, other than the different way to 
submit our job? And what is the specific way to configure the job, e.g. running 
on particular nodes which have more resources?

Thank you very much

Sincerely,
Cheng-Ju Chuang


The number of cores vs. the number of executors

2014-07-07 Thread innowireless TaeYun Kim
Hi,

 

I'm trying to understand the relationship of the number of cores and the
number of executors when running a Spark job on YARN.

 

The test environment is as follows:

 

- # of data nodes: 3

- Data node machine spec:

- CPU: Core i7-4790 (# of cores: 4, # of threads: 8)

- RAM: 32GB (8GB x 4)

- HDD: 8TB (2TB x 4)

- Network: 1Gb

 

- Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair ->
reduceByKey -> map -> saveAsTextFile

 

- input data

- type: single text file

- size: 165GB

- # of lines: 454,568,833

 

- output

- # of lines after second filter: 310,640,717

- # of lines of the result file: 99,848,268

- size of the result file: 41GB

 

The job was run with following configurations:

 

1) --master yarn-client --executor-memory 19G --executor-cores 7
--num-executors 3  (executors per data node, use as much as cores)

2) --master yarn-client --executor-memory 19G --executor-cores 4
--num-executors 3  (# of cores reduced)

3) --master yarn-client --executor-memory 4G --executor-cores 2
--num-executors 12  (less core, more executor)

 

- elapsed times:

1) 50 min 15 sec

2) 55 min 48 sec

3) 31 min 23 sec

 

To my surprise, 3) was much faster.

I thought that 1) would be faster, since there would be less inter-executor
communication when shuffling.

Although the # of cores of 1) is fewer than that of 3), the # of cores is not the
key factor, since 2) did perform well.

 

How can I understand this result?

 

Thanks.

 

 

 

 



Re: Data loading to Parquet using spark

2014-07-07 Thread Michael Armbrust
SchemaRDDs, provided by Spark SQL, have a saveAsParquetFile command.  You
can turn a normal RDD into a SchemaRDD using the techniques described here:
http://spark.apache.org/docs/latest/sql-programming-guide.html

This should work with Impala, but if you run into any issues please let me
know.
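
For readers coming from Python, here is a minimal PySpark sketch of the same flow
(assuming Spark 1.0's Python SQL API, with illustrative paths and field names;
schema inference from dicts and the exact layout expected by Impala should be
double-checked against your setup):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="to-parquet")   # app name is illustrative
sqlContext = SQLContext(sc)

# Turn a normal RDD into a SchemaRDD by inferring the schema from dicts,
# then write it out as Parquet for Impala/Tableau to query.
rows = sc.textFile("hdfs:///data/input.tsv") \
         .map(lambda line: line.split("\t")) \
         .map(lambda p: {"id": p[0], "value": float(p[1])})
schema_rdd = sqlContext.inferSchema(rows)
schema_rdd.saveAsParquetFile("hdfs:///data/output.parquet")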


On Sun, Jul 6, 2014 at 5:30 PM, Shaikh Riyaz  wrote:

> Hi,
>
> We are planning to use spark to load data to Parquet and this data will be
> query by Impala for present visualization through Tableau.
>
> Can we achieve this flow? How to load data to Parquet from spark? Will
> impala be able to access the data loaded by spark?
>
> I will greatly appreciate if someone can help with the example to achieve
> the goal.
>
> Thanks in advance.
>
> --
> Regards,
>
> Riyaz
>
>


Re: Spark SQL : Join throws exception

2014-07-07 Thread Yin Huai
Hi Subacini,

Just want to follow up on this issue. SPARK-2339 has been merged into the
master and 1.0 branch.

Thanks,

Yin


On Tue, Jul 1, 2014 at 2:00 PM, Yin Huai  wrote:

> Seems it is a bug. I have opened
> https://issues.apache.org/jira/browse/SPARK-2339 to track it.
>
> Thank you for reporting it.
>
> Yin
>
>
> On Tue, Jul 1, 2014 at 12:06 PM, Subacini B  wrote:
>
>> Hi All,
>>
>> Running this join query
>>  sql("SELECT * FROM  A_TABLE A JOIN  B_TABLE B WHERE
>> A.status=1").collect().foreach(println)
>>
>> throws
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 1.0:3 failed 4 times, most recent failure:
>> Exception failure in TID 12 on host X.X.X.X: 
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>> No function to evaluate expression. type: UnresolvedAttribute, tree:
>> 'A.status
>>
>> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.eval(unresolved.scala:59)
>>
>> org.apache.spark.sql.catalyst.expressions.Equals.eval(predicates.scala:147)
>>
>> org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:100)
>>
>> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
>>
>> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
>> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:137)
>>
>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
>> org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
>> org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
>>
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>>
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> org.apache.spark.scheduler.Task.run(Task.scala:51)
>>
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>> java.lang.Thread.run(Thread.java:695)
>> Driver stacktrace:
>>
>> Can someone help me.
>>
>> Thanks in advance.
>>
>>
>


Re: Comparative study

2014-07-07 Thread Soumya Simanta


Daniel, 

Do you mind sharing the size of your cluster and the production data volumes ? 

Thanks
Soumya 

> On Jul 7, 2014, at 3:39 PM, Daniel Siegmann  wrote:
> 
> From a development perspective, I vastly prefer Spark to MapReduce. The 
> MapReduce API is very constrained; Spark's API feels much more natural to me. 
> Testing and local development is also very easy - creating a local Spark 
> context is trivial and it reads local files. For your unit tests you can just 
> have them create a local context and execute your flow with some test data. 
> Even better, you can do ad-hoc work in the Spark shell and if you want that 
> in your production code it will look exactly the same.
> 
> Unfortunately, the picture isn't so rosy when it gets to production. In my 
> experience, Spark simply doesn't scale to the volumes that MapReduce will 
> handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be 
> better, but I haven't had the opportunity to try them. I find jobs tend to 
> just hang forever for no apparent reason on large data sets (but smaller than 
> what I push through MapReduce).
> 
> I am hopeful the situation will improve - Spark is developing quickly - but 
> if you have large amounts of data you should proceed with caution.
> 
> Keep in mind there are some frameworks for Hadoop which can hide the ugly 
> MapReduce with something very similar in form to Spark's API; e.g. Apache 
> Crunch. So you might consider those as well.
> 
> (Note: the above is with Spark 1.0.0.)
> 
> 
> 
>> On Mon, Jul 7, 2014 at 11:07 AM,  wrote:
>> Hello Experts,
>> 
>>  
>> 
>> I am doing some comparative study on the below:
>> 
>>  
>> 
>> Spark vs Impala
>> 
>> Spark vs MapREduce . Is it worth migrating from existing MR implementation 
>> to Spark?
>> 
>>  
>> 
>>  
>> 
>> Please share your thoughts and expertise.
>> 
>>  
>> 
>>  
>> 
>> Thanks,
>> Santosh
>> 
>> 
>> 
>> This message is for the designated recipient only and may contain 
>> privileged, proprietary, or otherwise confidential information. If you have 
>> received it in error, please notify the sender immediately and delete the 
>> original. Any other use of the e-mail by you is prohibited. Where allowed by 
>> local law, electronic communications with Accenture and its affiliates, 
>> including e-mail and instant messaging (including content), may be scanned 
>> by our systems for the purposes of information security and assessment of 
>> internal compliance with Accenture policy. 
>> __
>> 
>> www.accenture.com
> 
> 
> 
> -- 
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
> 
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io


Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
I haven't heard any reports of this yet, but I don't see any reason why it
wouldn't work. You'll need to manually convert the objects that come out of
the sequence file into something where SparkSQL can detect the schema (i.e.
scala case classes or java beans) before you can register the RDD as a
table.

If you run into any issues please let me know.


On Mon, Jul 7, 2014 at 12:36 PM, Gary Malouf  wrote:

> Has anyone reported issues using SparkSQL with sequence files (all of our
> data is in this format within HDFS)?  We are considering whether to burn
> the time upgrading to Spark 1.0 from 0.9 now and this is a main decision
> point for us.
>


Re: Comparative study

2014-07-07 Thread Sean Owen
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon  wrote:

> For Scala API on map/reduce (hadoop engine) there's a library called
> "Scalding". It's built on top of Cascading. If you have a huge dataset or
> if you consider using map/reduce engine for your job, for any reason, you
> can try Scalding.
>

PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs
on Spark too, not just M/R.


Re: Comparative study

2014-07-07 Thread Nabeel Memon
For Scala API on map/reduce (hadoop engine) there's a library called
"Scalding". It's built on top of Cascading. If you have a huge dataset or
if you consider using map/reduce engine for your job, for any reason, you
can try Scalding.

However, Spark vs Impala doesn't make sense to me. It should've really been
Shark vs Impala. Both are SQL querying engines built on top of Spark and
Hadoop (map/reduce engine) respectively.


On Mon, Jul 7, 2014 at 4:06 PM,  wrote:

>  Thanks Daniel for sharing this info.
>
>
>
> Regards,
> Santosh Karthikeyan
>
>
>
> *From:* Daniel Siegmann [mailto:daniel.siegm...@velos.io]
> *Sent:* Tuesday, July 08, 2014 1:10 AM
> *To:* user@spark.apache.org
> *Subject:* Re: Comparative study
>
>
>
> From a development perspective, I vastly prefer Spark to MapReduce. The
> MapReduce API is very constrained; Spark's API feels much more natural to
> me. Testing and local development is also very easy - creating a local
> Spark context is trivial and it reads local files. For your unit tests you
> can just have them create a local context and execute your flow with some
> test data. Even better, you can do ad-hoc work in the Spark shell and if
> you want that in your production code it will look exactly the same.
>
> Unfortunately, the picture isn't so rosy when it gets to production. In my
> experience, Spark simply doesn't scale to the volumes that MapReduce will
> handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be
> better, but I haven't had the opportunity to try them. I find jobs tend to
> just hang forever for no apparent reason on large data sets (but smaller
> than what I push through MapReduce).
>
> I am hopeful the situation will improve - Spark is developing quickly -
> but if you have large amounts of data you should proceed with caution.
>
> Keep in mind there are some frameworks for Hadoop which can hide the ugly
> MapReduce with something very similar in form to Spark's API; e.g. Apache
> Crunch. So you might consider those as well.
>
> (Note: the above is with Spark 1.0.0.)
>
>
>
>
>
> On Mon, Jul 7, 2014 at 11:07 AM, 
> wrote:
>
> Hello Experts,
>
>
>
> I am doing some comparative study on the below:
>
>
>
> Spark vs Impala
>
> Spark vs MapREduce . Is it worth migrating from existing MR implementation
> to Spark?
>
>
>
>
>
> Please share your thoughts and expertise.
>
>
>
>
>
> Thanks,
> Santosh
>
>
>  --
>
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
>
> __
>
> www.accenture.com
>
>
>
>
> --
>
> Daniel Siegmann, Software Developer
> Velos
>
> Accelerating Machine Learning
>
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io
>


Re: master attempted to re-register the worker and then took all workers as unregistered

2014-07-07 Thread Nan Zhu
Hey, Cheney,  

Is the problem still occurring?

Sorry for the delay, I’m starting to look at this issue,  

Best,  

--  
Nan Zhu


On Tuesday, May 6, 2014 at 10:06 PM, Cheney Sun wrote:

> Hi Nan,
>  
> In the worker's log, I see the following exception thrown when it tries to launch 
> an executor. (The SPARK_HOME is wrongly specified on purpose, so there is no 
> such file "/usr/local/spark1/bin/compute-classpath.sh".)  
> After the exception was thrown several times, the worker was requested to 
> kill the executor. Following the killing, the worker tries to register again 
> with the master, but the master rejects the registration with the WARN message 
> "Got heartbeat from unregistered worker worker-20140504140005-host-spark-online001".
>  
> Looks like the issue wasn't fixed in 0.9.1. Do you know any pull request 
> addressing this issue? Thanks.
>  
> java.io.IOException: Cannot run program 
> "/usr/local/spark1/bin/compute-classpath.sh" 
> (in directory "."): error=2, No such file or directory  
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:600)
> at 
> org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:58)
> at 
> org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:37)
> at 
> org.apache.spark.deploy.worker.ExecutorRunner.getCommandSeq(ExecutorRunner.scala:104)
> at 
> org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:119)
> at 
> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:59)
> Caused by: java.io.IOException: error=2, No such file or directory
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
> at java.lang.ProcessImpl.start(ProcessImpl.java:130)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
> ... 6 more
> ..
> 14/05/04 21:35:45 INFO Worker: Asked to kill executor 
> app-20140504213545-0034/18
> 14/05/04 21:35:45 INFO Worker: Executor app-20140504213545-0034/18 finished 
> with state FAILED message class java.io.IOException: Cannot run program 
> "/usr/local/spark1/bin/compute-classpath.sh (http://compute-classpath.sh)" 
> (in directory "."): error=2, No such file or directory
> 14/05/04 21:35:45 ERROR OneForOneStrategy: key not found: 
> app-20140504213545-0034/18
> java.util.NoSuchElementException: key not found: app-20140504213545-0034/18
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/05/04 21:35:45 INFO Worker: Starting Spark worker 
> host-spark-online001:7078 with 10 cores, 28.0 GB RAM
> 14/05/04 21:35:45 INFO Worker: Spark home: /usr/local/spark-0.9.1-cdh4.2.0
> 14/05/04 21:35:45 INFO WorkerWebUI: Started Worker web UI at 
> http://host-spark-online001:8081
> 14/05/04 21:35:45 INFO Worker: Connecting to master 
> spark://host-spark-online001:7077...
> 14/05/04 21:35:45 INFO Worker: Successfully registered with master 
> spark://host-spark-online001:7077
>  
>  



memory leak query

2014-07-07 Thread Michael Lewis
Hi,

I hope someone can help, as I'm not sure if I'm using Spark correctly. 
Basically, in the simple example below 
I create an RDD which is just a sequence of random numbers. I then have a loop 
where I just invoke rdd.count(), and 
what I can see is that the memory use always nudges upwards.

If I attach YourKit to the JVM, I can see the garbage collector in action, but 
eventually the JVM runs out of memory.

Can anyone spot if I am doing something wrong? (Obviously the example is 
slightly contrived, but basically I 
have an RDD with a set of numbers and I'd like to submit lots of jobs that 
perform some calculation; this was 
the simplest case I could create that would exhibit the same memory issue.)

Regards & Thanks,
Mike


import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import scala.util.Random

object SparkTest {
  def main(args: Array[String]) {
    println("spark memory test")

    val jars = Seq("spark-test-1.0-SNAPSHOT.jar")

    val sparkConfig: SparkConf = new SparkConf()
      .setMaster("local")
      .setAppName("tester")
      .setJars(jars)

    val sparkContext = new SparkContext(sparkConfig)

    // An RDD of 120 random numbers split across 10 partitions.
    val list = Seq.fill(120)(Random.nextInt)
    val rdd: RDD[Int] = sparkContext.makeRDD(list, 10)

    // Submit the same job repeatedly; memory use creeps up with each iteration.
    for (i <- 1 to 100) {
      rdd.count()
    }
    sparkContext.stop()
  }
}



Re: reading compress lzo files

2014-07-07 Thread Nicholas Chammas
I found it quite painful to figure out all the steps required and have
filed SPARK-2394  to
track improving this. Perhaps I have been going about it the wrong way, but
it seems way more painful than it should be to set up a Spark cluster built
using spark-ec2 to read LZO-compressed input.

Nick


Cannot create dir in Tachyon when running Spark with OFF_HEAP caching (FileDoesNotExistException)

2014-07-07 Thread Teng Long
Hi guys,

I'm running Spark 1.0.0 with Tachyon 0.4.1, both in single-node mode.
Tachyon's own tests (./bin/tachyon runTests) pass, and manual file
system operations like mkdir work well. But when I tried to run a very
simple Spark task with the RDD persisted as OFF_HEAP, I got the following
FileDoesNotExistException error.

My platform is Ubuntu 12.04 x64.

More information:

The Spark task is simply run in the interactive shell:
-
import org.apache.spark.storage.StorageLevel
val tf = sc.textFile("README.md") // the file is there in the directory.
tf.persist(StorageLevel.OFF_HEAP)
tf.count()
--
I tried other storage levels like MEMORY_ONLY, MEMORY_AND_DISK, etc., and they all
worked fine.

The same error happened on both Tachyon 0.4.1 and Tachyon 0.4.1-thrifty, for
both the binary and the build-from-source versions.

Even more information:

I further tracked down the "Connecting local worker @" message right before the
exception, and found that the problem might be in the tachyon.client.connect()
method. 

Did any of you guys have this problem? If yes, how did you solve it?

Thanks!

-
Exception output:
-


14/07/07 14:03:47 INFO : Trying to connect master @
datanode6/10.10.10.46:19998
14/07/07 14:03:47 INFO : User registered at the master
datanode6/10.10.10.46:19998 got UserId 21
14/07/07 14:03:47 INFO : Trying to get local worker host : datanode6
14/07/07 14:03:47 INFO : Connecting local worker @
datanode6.ssi.samsung.com/10.10.10.46:29998
14/07/07 14:03:47 INFO :
FileDoesNotExistException(message://spark-704334db-270a-48c5-ac52-e646f1ea1aa0//spark-tachyon-20140707140347-6c0a)//spark-704334db-270a-48c5-ac52-e646f1ea1aa0//spark-tachyon-20140707140347-6c0a
14/07/07 14:03:47 INFO storage.TachyonBlockManager: Created tachyon
directory at null
14/07/07 14:03:47 WARN storage.BlockManager: Putting block rdd_1_0 failed
14/07/07 14:03:47 ERROR executor.Executor: Exception in task ID 0
java.lang.NullPointerException
at 
org.apache.spark.util.Utils$.registerShutdownDeleteDir(Utils.scala:196)
at
org.apache.spark.storage.TachyonBlockManager$$anonfun$addShutdownHook$1.apply(TachyonBlockManager.scala:137)
at
org.apache.spark.storage.TachyonBlockManager$$anonfun$addShutdownHook$1.apply(TachyonBlockManager.scala:137)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.storage.TachyonBlockManager.addShutdownHook(TachyonBlockManager.scala:137)
at
org.apache.spark.storage.TachyonBlockManager.<init>(TachyonBlockManager.scala:60)
at
org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:69)
at
org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:64)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:681)
at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
14/07/07 14:03:47 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/07 14:03:47 WARN scheduler.TaskSetManager: Loss was due to
java.lang.NullPointerException
java.lang.NullPointerException
at 
org.apache.spark.util.Utils$.registerShutdownDeleteDir(Utils.scala:196)
at
org.apache.spark.storage.TachyonBlockManager$$anonfun$addShutdownHook$1.apply(TachyonBlockManager.scala:137)
at
org.apache.spark.storage.TachyonBlockManager$$anonfun$addShutdownHook$1.apply(TachyonBlockManager.scala:137)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.storage.TachyonBlockManager.addShutdownHook(TachyonBlockManager.scala:137)
at
org.apache.spark.storage.TachyonBlockManager.<init>(TachyonBlockManager.scala:60)
at
org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:69)
at
org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:64)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:681)
at org.apache.spark.storage.BlockManage

RE: Spark logging strategy on YARN

2014-07-07 Thread Andrew Lee
Hi Kudryavtsev,
Here's what I am doing as a common practice and reference. I don't want to call 
it a best practice, since that requires a lot of customer experience and 
feedback, but from a development and operations standpoint it is great to 
separate the YARN container logs from the Spark logs.
Event Log - Use HistoryServer to take a look at the workflow, overall resource 
usage, etc for the Job.

Spark Log - Provides readable info on settings and configuration, and is covered 
by the event logs. You can customize this in the 'conf' folder with your own 
log4j.properties file. This won't be picked up by your YARN container since 
your Hadoop may be referring to a different log4j file somewhere else.
Stderr/Stdout log - This is actually picked up by the YARN container and you 
won't be able to override this unless you override the one in the resource 
folder (yarn/common/src/main/resources/log4j-spark-container.properties) during 
the build process and include it in your build (JAR file).
One thing I haven't tried yet is to separate that resource file into a separate 
JAR and include it in the ext jar options on HDFS to suppress the log. This is 
more about exploiting the CLASSPATH search behavior to override the YARN log4j 
settings without building JARs that include the YARN container log4j settings; I 
don't know if this is a good practice though. Just some ideas that give people 
flexibility, but probably not a good practice.
Anyone else have ideas? thoughts?








> From: kudryavtsev.konstan...@gmail.com
> Subject: Spark logging strategy on YARN
> Date: Thu, 3 Jul 2014 22:26:48 +0300
> To: user@spark.apache.org
> 
> Hi all,
> 
> Could you please share your best practices on writing logs in Spark? I'm 
> running it on YARN, so when I check the logs I'm a bit confused… 
> Currently, I'm using System.err.println to put a message in the log and access 
> it via the YARN history server. But I don't like this approach… I'd like to use 
> log4j/slf4j and write to a more concrete place… any practices?
> 
> Thank you in advance
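
For reference, a minimal log4j.properties of the kind Andrew describes dropping into Spark's 'conf' folder could look like the sketch below (the log file path, appender choice, and logger names here are illustrative assumptions, not taken from this thread):

# conf/log4j.properties (illustrative sketch)
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/spark-app.log
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# quiet down chatty components
log4j.logger.org.eclipse.jetty=WARN

As noted above, a file like this only affects the Spark-side logging; the YARN container's stdout/stderr is still governed by whichever log4j configuration YARN itself picks up.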
  

acl for spark ui

2014-07-07 Thread Koert Kuipers
i was testing using the acl for spark ui in secure mode on yarn in client
mode.

it works great. my spark 1.0.0 configuration has:
spark.authenticate = true
spark.ui.acls.enable = true
spark.ui.view.acls = koert
spark.ui.filters =
org.apache.hadoop.security.authentication.server.AuthenticationFilter
spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/mybox@MYDOMAIN
,kerberos.keytab=/some/keytab"

i confirmed that i can access the ui from firefox after doing kinit.

however i also saw this in the logs of my driver program:
2014-07-07 17:21:56 DEBUG server.Server: RESPONSE /broadcast_0  401
handled=true

and

2014-07-07 17:21:56 DEBUG server.Server: REQUEST
/jars/somejar-assembly-0.1-SNAPSHOT.jar on BlockingHttpConnection@3d6396f5
,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParse\
r{s=-5,l=10,c=0},r=1
2014-07-07 17:21:56 DEBUG server.Server: RESPONSE
/jars/somejar-assembly-0.1-SNAPSHOT.jar  401 handled=true

what does this mean? is the webserver also responsible for handing out
other stuff such as broadcast variables and jars, and is this now being
rejected by my servlet filter? thats not good... the 401 response is
exactly the same one i see when i try to access the website after kdestroy.
for example:

2014-07-07 17:35:08 DEBUG server.AuthenticationFilter: Request [
http://mybox:5001/] triggering authentication
2014-07-07 17:35:08 DEBUG server.Server: RESPONSE /  401 handled=true


RE: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)

2014-07-07 Thread Andrew Lee
Hi Suren,
It showed up after a while when I touched the APPLICATION_COMPLETE file in the 
event log folders.
I checked the source code, and it looks like it re-scans (polls) the 
folders every 10 seconds (configurable)?
Not sure what exactly triggers that 'refresh'; I may need to do more digging.
Thanks.


Date: Thu, 3 Jul 2014 06:56:46 -0400
Subject: Re: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN 
mode)
From: suren.hira...@velos.io
To: user@spark.apache.org

I've had some odd behavior with jobs showing up in the history server in 1.0.0. 
Failed jobs do show up but it seems they can show up minutes or hours later. I 
see in the history server logs messages about bad task ids. But then eventually 
the jobs show up.

This might be your situation.
Anecdotally, if you click on the job in the Spark Master GUI after it is done, 
this may help it show up in the history server faster. Haven't reliably tested 
this though. May just be a coincidence of timing.

-Suren


On Wed, Jul 2, 2014 at 8:01 PM, Andrew Lee  wrote:




Hi All,
I have HistoryServer up and running, and it is great.
Is it possible to also enable the HistoryServer to parse failed jobs' events by 
default as well?

I get "No Completed Applications Found" if job fails.

=
Event Log Location: hdfs:///user/test01/spark/logs/
No Completed Applications Found=
The reason is that it is good to run the HistoryServer to keep track of 
performance and resource usage for each completed job, but I find it even more 
useful when a job fails. I can identify which stage failed, etc., instead of 
sifting through the logs 
from the Resource Manager. The same event log is only available while the 
Application Master is still active; once the job fails, the Application Master 
is killed and I lose GUI access. Even though I have the event log in JSON 
format, I can't open it with 
the HistoryServer.
This is very helpful especially for long running jobs that last for 2-18 hours 
that generates Gigabytes of logs.
So I have 2 questions:

1. Any reason why we only render completed jobs? Why can't we bring in all jobs 
and choose from the GUI? Like a time machine to restore the status from the 
Application Master?








./core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala

    val logInfos = logDirs
      .sortBy { dir => getModificationTime(dir) }
      .map { dir => (dir, EventLoggingListener.parseLoggingInfo(dir.getPath, fileSystem)) }
      .filter { case (dir, info) => info.applicationComplete }

2. If I force to touch a file "APPLICATION_COMPLETE" in the failed job event 
log folder, will this cause any problem?
















  


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063

E: suren.hira...@velos.io
W: www.velos.io



  

Re: Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster

2014-07-07 Thread jackxucs
Hi Mosharaf,

Thanks a lot for the detailed reply. The reason I am using Torrent is mainly
because I would like to have something different to compare against
IP multicast/broadcast. I will need to transmit larger data sizes as well.

I tried to increase spark.executor.memory to 2g and spark.akka.frameSize to
1000 but it seems the same problem remains. Is there any other possibility
that you can think of?

Thanks,
Jack 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-BroadcastTest-scala-with-TorrentBroadcastFactory-in-a-standalone-cluster-tp8736p8955.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
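
For reference, the settings mentioned in this thread can be applied on the SparkConf before the context is created. This is only a sketch using the values quoted above; it is not Jack's actual code:

import org.apache.spark.{SparkConf, SparkContext}

// Select the Torrent broadcast implementation and raise the limits
// discussed in the thread (2g executor memory, 1000 MB Akka frame size).
val conf = new SparkConf()
  .setAppName("BroadcastTest")
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
  .set("spark.executor.memory", "2g")
  .set("spark.akka.frameSize", "1000")
val sc = new SparkContext(conf)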


Spark shell error messages and app exit issues

2014-07-07 Thread Sameer Tilak
Hi All,
When I run my application, it runs for a while and gives me part of the
output correctly. I then get the following error, and then the spark shell exits.
14/07/07 13:54:53 INFO SendingConnection: Initiating connection to [localhost.localdomain/127.0.0.1:57423]
14/07/07 13:54:53 INFO ConnectionManager: Accepted connection from [localhost.localdomain/127.0.0.1]
14/07/07 13:54:53 INFO SendingConnection: Connected to [localhost.localdomain/127.0.0.1:57423], 2 messages pending
14/07/07 13:54:53 INFO BlockManager: Removing block taskresult_14
14/07/07 13:54:53 INFO BlockManager: Removing block taskresult_13
14/07/07 13:54:53 INFO MemoryStore: Block taskresult_14 of size 12174859 dropped from memory (free 296532358)
14/07/07 13:54:53 INFO MemoryStore: Block taskresult_13 of size 12115603 dropped from memory (free 308647961)
14/07/07 13:54:53 INFO BlockManagerInfo: Removed taskresult_14 on pzxnvm2018.dcld.pldc.kp.org:50924 in memory (size: 11.6 MB, free: 282.9 MB)
14/07/07 13:54:53 INFO BlockManagerMaster: Updated info of block taskresult_14
14/07/07 13:54:53 INFO BlockManagerInfo: Removed taskresult_13 on pzxnvm2018.a.b.org:50924 in memory (size: 11.6 MB, free: 294.4 MB)
14/07/07 13:54:53 INFO BlockManagerMaster: Updated info of block taskresult_13
14/07/07 13:54:54 INFO TaskSetManager: Finished TID 13 in 3043 ms on localhost (progress: 1/2)
14/07/07 13:54:54 INFO DAGScheduler: Completed ResultTask(7, 0)
14/07/07 13:54:54 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool
14/07/07 13:54:54 INFO DAGScheduler: Failed to run collect at JaccardScore.scala:84
14/07/07 13:54:54 INFO TaskSchedulerImpl: Cancelling stage 7
org.apache.spark.SparkException: Job aborted due to stage failure: Exception while deserializing and fetching task: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

scala> 14/07/07 13:54:57 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...
14/07/07 13:55:17 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...
14/07/07 13:55:37 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/07/07 13:55:37 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.

Logs for the Master node:
14/07/07 13:51:55 INFO Master: akka.tcp://spark@localhost:45063 got disassociated, removing it.
14/07/07 13:54:38 ERROR EndpointWriter: dropping message [class akka.actor.SelectChildName] for non-local recipient [Actor[akka.tcp://sparkMaster@pzxnvm2018:7077/]] arriving at [akka.tcp://sparkMaster@pzxnvm2018:7077] inbound addresses are [akka.tcp://sparkmas...@pzxnvm2018.a.b.org:7077]
14/07/07 13:54:57 ERROR EndpointWriter: dropping message [class akka.actor.SelectChildName] for non-local recipient [Actor[akka.tcp://sparkMaster@pzxnvm2018:7077/]] arriving at [akka.tcp://sparkMaster@pzxnvm2018:7077] inbound addresses are [akka.tcp://sparkmas...@pzxnvm2018.a.b.org:7077]
14/07/07 13:55:17 ERROR EndpointWriter: dropping message [class akka.actor.SelectChildName] for non-local recipient [Actor[akka.tcp://sparkMaster@pzxnvm2018:7077/]] arriving at [akka.tcp://sparkMaster@pzxnvm2018:7077] i

Re: NoSuchMethodError in KafkaReciever

2014-07-07 Thread mcampbell
xtrahotsauce wrote
> I had this same problem as well.  I ended up just adding the necessary
> code
> in KafkaUtil and compiling my own spark jar.  Something like this for the
> "raw" stream:
> 
>   def createRawStream(
>   jssc: JavaStreamingContext,
>   kafkaParams: JMap[String, String],
>   topics: JMap[String, JInt]
>): JavaPairDStream[Array[Byte], Array[Byte]] = {
> new KafkaInputDStream[Array[Byte], Array[Byte], DefaultDecoder,
> DefaultDecoder](
>   jssc.ssc, kafkaParams.toMap,
> Map(topics.mapValues(_.intValue()).toSeq: _*),
> StorageLevel.MEMORY_AND_DISK_SER_2)
>   }


I had this same problem, and this solution also worked for me so thanks for
this!

One question...  what is this doing?

> Map(topics.mapValues(_.intValue()).toSeq: _*),

it appears to be converting the incoming Map[String, Integer] to a
Map[String, Int].  I'm not seeing the purpose of it...  help?  (I'm a
bit of a Scala newbie.)
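
In case it helps, here is a small self-contained sketch (all names made up) of what that line does step by step: it takes the java.util.Map[String, Integer] handed in through the Java-friendly API, unboxes the values to Scala Ints, and rebuilds a plain immutable Scala Map[String, Int] from the resulting pairs.

import scala.collection.JavaConverters._

object TopicsMapConversion {
  def main(args: Array[String]): Unit = {
    // A Java map, as it would arrive through the Java API.
    val topics: java.util.Map[String, java.lang.Integer] = new java.util.HashMap()
    topics.put("events", 2)

    // Wrap the Java map as a Scala map (the snippet above relies on an
    // implicit conversion; asScala makes the same step explicit).
    val wrapped = topics.asScala

    // Unbox java.lang.Integer values into Scala Ints.
    val unboxed = wrapped.mapValues(_.intValue())

    // Rebuild an immutable Scala Map from the (key, value) pairs;
    // this is what Map(... .toSeq: _*) does in the snippet above.
    val scalaTopics: Map[String, Int] = Map(unboxed.toSeq: _*)

    println(scalaTopics) // Map(events -> 2)
  }
}

So the two maps are not quite the same type: the incoming one holds boxed java.lang.Integer values behind a Java interface, while the result is an immutable Scala Map[String, Int].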




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-in-KafkaReciever-tp2209p8953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Chester Chen
@Andrew

  Yes, the link points to the same redirected URL:


 http://localhost/proxy/application_1404443455764_0010/


  I suspect it is something to do with the cluster setup. I will let you know
once I find something.

Chester


On Mon, Jul 7, 2014 at 1:07 PM, Andrew Or  wrote:

> @Yan, the UI should still work. As long as you look into the container
> that launches the driver, you will find the SparkUI address and port. Note
> that in yarn-cluster mode the Spark driver doesn't actually run in the
> Application Manager; just like the executors, it runs in a container that
> is launched by the Resource Manager after the Application Master requests
> the container resources. In contrast, in yarn-client mode, your driver is
> not launched in a container, but in the client process that launched your
> application (i.e. spark-submit), so the stdout of this program directly
> contains the SparkUI messages.
>
> @Chester, I'm not sure what has gone wrong as there are many factors at
> play here. When you go to the Resource Manager UI, does the "application URL"
> link point you to the same SparkUI address as indicated in the logs? If so,
> this is the correct behavior. However, I believe the redirect error has
> little to do with Spark itself, but more to do with how you set up the
> cluster. I have actually run into this myself, but I haven't found a
> workaround. Let me know if you find anything.
>
>
>
>
> 2014-07-07 12:07 GMT-07:00 Chester Chen :
>
> As Andrew explained, the port is random rather than 4040, as the Spark
>> driver is started in the Application Master and the port is randomly selected.
>>
>>
>> But I have the similar UI issue. I am running Yarn Cluster mode against
>> my local CDH5 cluster.
>>
>> The log states
>> "14/07/07 11:59:29 INFO ui.SparkUI: Started SparkUI at
>> http://10.0.0.63:58750
>>
>> "
>>
>>
>> but when you click the Spark UI link (ApplicationMaster or
>>
>> http://10.0.0.63:58750), I get a 404 with the redirect URI
>>
>>
>>
>>  http://localhost/proxy/application_1404443455764_0010/
>>
>>
>>
>> Looking at the Spark code, I notice that the "proxy" is really a variable to 
>> get the proxy from the yarn-site.xml HTTP address. But when I specified the 
>> value in yarn-site.xml, it still doesn't work for me.
>>
>>
>>
>> Oddly enough, it works for my co-worker on a Pivotal HD cluster, so I 
>> am still looking at what's different in terms of cluster setup or 
>> something else.
>>
>>
>> Chester
>>
>>
>>
>>
>>
>> On Mon, Jul 7, 2014 at 11:42 AM, Andrew Or  wrote:
>>
>>> I will assume that you are running in yarn-cluster mode. Because the
>>> driver is launched in one of the containers, it doesn't make sense to
>>> expose port 4040 for the node that contains the container. (Imagine if
>>> multiple driver containers are launched on the same node. This will cause a
>>> port collision). If you're launching Spark from a gateway node that is
>>> physically near your worker nodes, then you can just launch your
>>> application in yarn-client mode, in which case the SparkUI will always be
>>> started on port 4040 on the node that you ran spark-submit on. The reason
>>> why sometimes you see the red text is because it appears only on the driver
>>> containers, not the executor containers. This is because SparkUI belongs to
>>> the SparkContext, which only exists on the driver.
>>>
>>> Andrew
>>>
>>>
>>> 2014-07-07 11:20 GMT-07:00 Yan Fang :
>>>
>>> Hi guys,

 Not sure if you  have similar issues. Did not find relevant tickets in
 JIRA. When I deploy the Spark Streaming to YARN, I have following two
 issues:

 1. The UI port is random. It is not default 4040. I have to look at the
 container's log to check the UI port. Is this supposed to be this way?

 2. Most of the time, the UI does not work. The difference between logs
 are (I ran the same program):






 *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server14/07/03
 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/03 11:38:50 INFO
 server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
 14/07/03 11:38:51 INFO
 executor.CoarseGrainedExecutorBackend: Got assigned task 0 14/07/03
 11:38:51 INFO executor.Executor: Running task ID 0...*

 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
 14/07/02 16:55:32 INFO server.AbstractConnector: Started
 SocketConnector@0.0.0.0:14211




 *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter:
 org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter14/07/02 16:55:32
 INFO server.Server: jetty-8.y.z-SNAPSHOT14/07/02 16:55:32 INFO
 server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
  14/07/02 16:55:32 INFO
 ui.SparkUI: Started SparkUI at http://myNodeName:21867
>

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Thank you, Andrew. That makes sense for me now. I was confused by "In
yarn-cluster mode, the Spark driver runs inside an application master
process which is managed by YARN on the cluster" in
http://spark.apache.org/docs/latest/running-on-yarn.html . After
your explanation, it's clear now. Thank you.

Best,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Mon, Jul 7, 2014 at 1:07 PM, Andrew Or  wrote:

> @Yan, the UI should still work. As long as you look into the container
> that launches the driver, you will find the SparkUI address and port. Note
> that in yarn-cluster mode the Spark driver doesn't actually run in the
> Application Manager; just like the executors, it runs in a container that
> is launched by the Resource Manager after the Application Master requests
> the container resources. In contrast, in yarn-client mode, your driver is
> not launched in a container, but in the client process that launched your
> application (i.e. spark-submit), so the stdout of this program directly
> contains the SparkUI messages.
>
> @Chester, I'm not sure what has gone wrong as there are many factors at
> play here. When you go to the Resource Manager UI, does the "application URL"
> link point you to the same SparkUI address as indicated in the logs? If so,
> this is the correct behavior. However, I believe the redirect error has
> little to do with Spark itself, but more to do with how you set up the
> cluster. I have actually run into this myself, but I haven't found a
> workaround. Let me know if you find anything.
>
>
>
>
> 2014-07-07 12:07 GMT-07:00 Chester Chen :
>
> As Andrew explained, the port is random rather than 4040, as the Spark
>> driver is started in the Application Master and the port is randomly selected.
>>
>>
>> But I have the similar UI issue. I am running Yarn Cluster mode against
>> my local CDH5 cluster.
>>
>> The log states
>> "14/07/07 11:59:29 INFO ui.SparkUI: Started SparkUI at
>> http://10.0.0.63:58750
>>
>>
>> "
>>
>>
>> but when you click the Spark UI link (ApplicationMaster or
>>
>> http://10.0.0.63:58750), I get a 404 with the redirect URI
>>
>>
>>
>>
>>  http://localhost/proxy/application_1404443455764_0010/
>>
>>
>>
>> Looking at the Spark code, I notice that the "proxy" is really a variable to 
>> get the proxy from the yarn-site.xml HTTP address. But when I specified the 
>> value in yarn-site.xml, it still doesn't work for me.
>>
>>
>>
>> Oddly enough, it works for my co-worker on a Pivotal HD cluster, so I 
>> am still looking at what's different in terms of cluster setup or 
>> something else.
>>
>>
>> Chester
>>
>>
>>
>>
>>
>> On Mon, Jul 7, 2014 at 11:42 AM, Andrew Or  wrote:
>>
>>> I will assume that you are running in yarn-cluster mode. Because the
>>> driver is launched in one of the containers, it doesn't make sense to
>>> expose port 4040 for the node that contains the container. (Imagine if
>>> multiple driver containers are launched on the same node. This will cause a
>>> port collision). If you're launching Spark from a gateway node that is
>>> physically near your worker nodes, then you can just launch your
>>> application in yarn-client mode, in which case the SparkUI will always be
>>> started on port 4040 on the node that you ran spark-submit on. The reason
>>> why sometimes you see the red text is because it appears only on the driver
>>> containers, not the executor containers. This is because SparkUI belongs to
>>> the SparkContext, which only exists on the driver.
>>>
>>> Andrew
>>>
>>>
>>> 2014-07-07 11:20 GMT-07:00 Yan Fang :
>>>
>>> Hi guys,

 Not sure if you  have similar issues. Did not find relevant tickets in
 JIRA. When I deploy the Spark Streaming to YARN, I have following two
 issues:

 1. The UI port is random. It is not default 4040. I have to look at the
 container's log to check the UI port. Is this supposed to be this way?

 2. Most of the time, the UI does not work. The difference between logs
 are (I ran the same program):






 *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server14/07/03
 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/03 11:38:50 INFO
 server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
 14/07/03 11:38:51 INFO
 executor.CoarseGrainedExecutorBackend: Got assigned task 0 14/07/03
 11:38:51 INFO executor.Executor: Running task ID 0...*

 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
 14/07/02 16:55:32 INFO server.AbstractConnector: Started
 SocketConnector@0.0.0.0:14211




 *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter:
 org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter14/07/02 16:55:32
 INFO server.Server: jetty-8.y.z-SNAPSHOT14/07/02 16:55:32 INFO
 server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:218

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
@Yan, the UI should still work. As long as you look into the container that
launches the driver, you will find the SparkUI address and port. Note that
in yarn-cluster mode the Spark driver doesn't actually run in the
Application Manager; just like the executors, it runs in a container that
is launched by the Resource Manager after the Application Master requests
the container resources. In contrast, in yarn-client mode, your driver is
not launched in a container, but in the client process that launched your
application (i.e. spark-submit), so the stdout of this program directly
contains the SparkUI messages.

@Chester, I'm not sure what has gone wrong as there are many factors at
play here. When you go to the Resource Manager UI, does the "application URL"
link point you to the same SparkUI address as indicated in the logs? If so,
this is the correct behavior. However, I believe the redirect error has
little to do with Spark itself, but more to do with how you set up the
cluster. I have actually run into this myself, but I haven't found a
workaround. Let me know if you find anything.




2014-07-07 12:07 GMT-07:00 Chester Chen :

> As Andrew explained, the port is random rather than 4040, as the Spark
> driver is started in the Application Master and the port is randomly selected.
>
>
> But I have the similar UI issue. I am running Yarn Cluster mode against my
> local CDH5 cluster.
>
> The log states
> "14/07/07 11:59:29 INFO ui.SparkUI: Started SparkUI at
> http://10.0.0.63:58750
>
> "
>
>
> but when you click the Spark UI link (ApplicationMaster or
>
> http://10.0.0.63:58750), I get a 404 with the redirect URI
>
>
>  http://localhost/proxy/application_1404443455764_0010/
>
>
>
> Looking at the Spark code, I notice that the "proxy" is really a variable to get 
> the proxy from the yarn-site.xml HTTP address. But when I specified the value 
> in yarn-site.xml, it still doesn't work for me.
>
>
>
> Oddly enough, it works for my co-worker on a Pivotal HD cluster, so I am 
> still looking at what's different in terms of cluster setup or something 
> else.
>
>
> Chester
>
>
>
>
>
> On Mon, Jul 7, 2014 at 11:42 AM, Andrew Or  wrote:
>
>> I will assume that you are running in yarn-cluster mode. Because the
>> driver is launched in one of the containers, it doesn't make sense to
>> expose port 4040 for the node that contains the container. (Imagine if
>> multiple driver containers are launched on the same node. This will cause a
>> port collision). If you're launching Spark from a gateway node that is
>> physically near your worker nodes, then you can just launch your
>> application in yarn-client mode, in which case the SparkUI will always be
>> started on port 4040 on the node that you ran spark-submit on. The reason
>> why sometimes you see the red text is because it appears only on the driver
>> containers, not the executor containers. This is because SparkUI belongs to
>> the SparkContext, which only exists on the driver.
>>
>> Andrew
>>
>>
>> 2014-07-07 11:20 GMT-07:00 Yan Fang :
>>
>> Hi guys,
>>>
>>> Not sure if you  have similar issues. Did not find relevant tickets in
>>> JIRA. When I deploy the Spark Streaming to YARN, I have following two
>>> issues:
>>>
>>> 1. The UI port is random. It is not default 4040. I have to look at the
>>> container's log to check the UI port. Is this supposed to be this way?
>>>
>>> 2. Most of the time, the UI does not work. The difference between logs
>>> are (I ran the same program):
>>>
>>>
>>>
>>>
>>>
>>>
>>> *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server14/07/03
>>> 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/03 11:38:50 INFO
>>> server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
>>> 14/07/03 11:38:51 INFO
>>> executor.CoarseGrainedExecutorBackend: Got assigned task 0 14/07/03
>>> 11:38:51 INFO executor.Executor: Running task ID 0...*
>>>
>>> 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
>>> 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/07/02 16:55:32 INFO server.AbstractConnector: Started
>>> SocketConnector@0.0.0.0:14211
>>>
>>>
>>>
>>>
>>> *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter:
>>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter14/07/02 16:55:32
>>> INFO server.Server: jetty-8.y.z-SNAPSHOT14/07/02 16:55:32 INFO
>>> server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
>>>  14/07/02 16:55:32 INFO
>>> ui.SparkUI: Started SparkUI at http://myNodeName:21867
>>> 14/07/02 16:55:32 INFO
>>> cluster.YarnClusterScheduler: Created YarnClusterScheduler*
>>>
>>> When the red part comes, the UI works sometime. Any ideas? Thank you.
>>>
>>> Best,
>>>
>>> Fang, Yan
>>> yanfang...@gmail.com
>>> +1 (206) 849-4108
>>>
>>
>>
>


RE: Comparative study

2014-07-07 Thread santosh.viswanathan
Thanks Daniel for sharing this info.

Regards,
Santosh Karthikeyan

From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Tuesday, July 08, 2014 1:10 AM
To: user@spark.apache.org
Subject: Re: Comparative study

From a development perspective, I vastly prefer Spark to MapReduce. The 
MapReduce API is very constrained; Spark's API feels much more natural to me. 
Testing and local development is also very easy - creating a local Spark 
context is trivial and it reads local files. For your unit tests you can just 
have them create a local context and execute your flow with some test data. 
Even better, you can do ad-hoc work in the Spark shell and if you want that in 
your production code it will look exactly the same.
Unfortunately, the picture isn't so rosy when it gets to production. In my 
experience, Spark simply doesn't scale to the volumes that MapReduce will 
handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be 
better, but I haven't had the opportunity to try them. I find jobs tend to just 
hang forever for no apparent reason on large data sets (but smaller than what I 
push through MapReduce).
I am hopeful the situation will improve - Spark is developing quickly - but if 
you have large amounts of data you should proceed with caution.
Keep in mind there are some frameworks for Hadoop which can hide the ugly 
MapReduce with something very similar in form to Spark's API; e.g. Apache 
Crunch. So you might consider those as well.
(Note: the above is with Spark 1.0.0.)


On Mon, Jul 7, 2014 at 11:07 AM, 
mailto:santosh.viswanat...@accenture.com>> 
wrote:
Hello Experts,

I am doing some comparative study on the below:

Spark vs Impala
Spark vs MapReduce. Is it worth migrating from the existing MR implementation to 
Spark?


Please share your thoughts and expertise.


Thanks,
Santosh






--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: 
www.velos.io


Re: Comparative study

2014-07-07 Thread Daniel Siegmann
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained; Spark's API feels much more natural to
me. Testing and local development is also very easy - creating a local
Spark context is trivial and it reads local files. For your unit tests you
can just have them create a local context and execute your flow with some
test data. Even better, you can do ad-hoc work in the Spark shell and if
you want that in your production code it will look exactly the same.

Unfortunately, the picture isn't so rosy when it gets to production. In my
experience, Spark simply doesn't scale to the volumes that MapReduce will
handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be
better, but I haven't had the opportunity to try them. I find jobs tend to
just hang forever for no apparent reason on large data sets (but smaller
than what I push through MapReduce).

I am hopeful the situation will improve - Spark is developing quickly - but
if you have large amounts of data you should proceed with caution.

Keep in mind there are some frameworks for Hadoop which can hide the ugly
MapReduce with something very similar in form to Spark's API; e.g. Apache
Crunch. So you might consider those as well.

(Note: the above is with Spark 1.0.0.)



On Mon, Jul 7, 2014 at 11:07 AM,  wrote:

>  Hello Experts,
>
>
>
> I am doing some comparative study on the below:
>
>
>
> Spark vs Impala
>
> Spark vs MapReduce. Is it worth migrating from the existing MR implementation
> to Spark?
>
>
>
>
>
> Please share your thoughts and expertise.
>
>
>
>
>
> Thanks,
> Santosh
>
> --
>
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


SparkSQL with sequence file RDDs

2014-07-07 Thread Gary Malouf
Has anyone reported issues using SparkSQL with sequence files (all of our
data is in this format within HDFS)?  We are considering whether to burn
the time upgrading to Spark 1.0 from 0.9 now and this is a main decision
point for us.
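
For what it's worth, the rough shape of this in 1.0 is to read the sequence file, map it into a case class RDD, and register it as a table. The sketch below is illustrative only: the key/value types, the tab-separated record layout, and the path are assumptions, not a report on how well it works at scale.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Record(id: Long, name: String)

object SequenceFileSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("seqfile-sql"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD conversion

    // Assumed layout: (LongWritable key, Text value) with tab-separated fields in the value.
    val records = sc.sequenceFile("hdfs:///data/records.seq", classOf[LongWritable], classOf[Text])
      .map { case (_, v) =>
        val fields = v.toString.split("\t")
        Record(fields(0).toLong, fields(1))
      }

    records.registerAsTable("records")
    sqlContext.sql("SELECT name FROM records WHERE id > 10").collect().foreach(println)
    sc.stop()
  }
}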


Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
Thanks - that did solve my error, but instead got a different one:
  java.lang.NoClassDefFoundError:
org/apache/hadoop/mapreduce/lib/input/FileInputFormat

It seems like with that setting, spark can't find Hadoop.

On 7/7/14, Koert Kuipers  wrote:
> spark has a setting to put user jars in front of classpath, which should do
> the trick.
> however i had no luck with this. see here:
>
> https://issues.apache.org/jira/browse/SPARK-1863
>
>
>
> On Mon, Jul 7, 2014 at 1:31 PM, Robert James 
> wrote:
>
>> spark-submit includes a spark-assembly uber jar, which has older
>> versions of many common libraries.  These conflict with some of the
>> dependencies we need.  I have been racking my brain trying to find a
>> solution (including experimenting with ProGuard), but haven't been
>> able to: when we use spark-submit, we get NoSuchMethodErrors, even though
>> the code compiles fine, because the runtime classes are different than
>> the compile time classes!
>>
>> Can someone recommend a solution? We are using scala, sbt, and
>> sbt-assembly, but are happy using another tool (please provide
>> instructions how to).
>>
>


Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Chester Chen
I don't have experience deploying to EC2.  can you use add.jar conf to add
the missing jar at runtime ?   I haven't tried this myself. Just a guess.


On Mon, Jul 7, 2014 at 12:16 PM, Chester Chen  wrote:

> with "provided" scope, you need to provide the "provided" jars at the
> runtime yourself. I guess in this case Hadoop jar files.
>
>
> On Mon, Jul 7, 2014 at 12:13 PM, Robert James 
> wrote:
>
>> Thanks - that did solve my error, but instead got a different one:
>>   java.lang.NoClassDefFoundError:
>> org/apache/hadoop/mapreduce/lib/input/FileInputFormat
>>
>> It seems like with that setting, spark can't find Hadoop.
>>
>> On 7/7/14, Koert Kuipers  wrote:
>> > spark has a setting to put user jars in front of classpath, which
>> should do
>> > the trick.
>> > however i had no luck with this. see here:
>> >
>> > https://issues.apache.org/jira/browse/SPARK-1863
>> >
>> >
>> >
>> > On Mon, Jul 7, 2014 at 1:31 PM, Robert James 
>> > wrote:
>> >
>> >> spark-submit includes a spark-assembly uber jar, which has older
>> >> versions of many common libraries.  These conflict with some of the
>> >> dependencies we need.  I have been racking my brain trying to find a
>> >> solution (including experimenting with ProGuard), but haven't been
>> >> able to: when we use spark-submit, we get NoSuchMethodErrors, even though
>> >> the code compiles fine, because the runtime classes are different than
>> >> the compile time classes!
>> >>
>> >> Can someone recommend a solution? We are using scala, sbt, and
>> >> sbt-assembly, but are happy using another tool (please provide
>> >> instructions how to).
>> >>
>> >
>>
>
>


Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Chester Chen
with "provided" scope, you need to provide the "provided" jars at the
runtime yourself. I guess in this case Hadoop jar files.


On Mon, Jul 7, 2014 at 12:13 PM, Robert James 
wrote:

> Thanks - that did solve my error, but instead got a different one:
>   java.lang.NoClassDefFoundError:
> org/apache/hadoop/mapreduce/lib/input/FileInputFormat
>
> It seems like with that setting, spark can't find Hadoop.
>
> On 7/7/14, Koert Kuipers  wrote:
> > spark has a setting to put user jars in front of classpath, which should
> do
> > the trick.
> > however i had no luck with this. see here:
> >
> > https://issues.apache.org/jira/browse/SPARK-1863
> >
> >
> >
> > On Mon, Jul 7, 2014 at 1:31 PM, Robert James 
> > wrote:
> >
> >> spark-submit includes a spark-assembly uber jar, which has older
> >> versions of many common libraries.  These conflict with some of the
> >> dependencies we need.  I have been racking my brain trying to find a
> >> solution (including experimenting with ProGuard), but haven't been
> >> able to: when we use spark-submit, we get NoSuchMethodErrors, even though
> >> the code compiles fine, because the runtime classes are different than
> >> the compile time classes!
> >>
> >> Can someone recommend a solution? We are using scala, sbt, and
> >> sbt-assembly, but are happy using another tool (please provide
> >> instructions how to).
> >>
> >
>


Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
Chester - I'm happy rebuilding Spark, but then how can I deploy it to EC2?


On 7/7/14, Chester Chen  wrote:
> Have you tried to change the spark SBT scripts? You can change the
> dependency scope to "provided".  This is similar to compile scope, except the JDK
> or container needs to provide the dependency at runtime.
>
> This assume the Spark will work with the new version of common libraries.
>
> Of course, this is not a general solution even if it works (it may not work).
>
> Chester
>
>
>
>
> On Mon, Jul 7, 2014 at 10:31 AM, Robert James 
> wrote:
>
>> spark-submit includes a spark-assembly uber jar, which has older
>> versions of many common libraries.  These conflict with some of the
>> dependencies we need.  I have been racking my brain trying to find a
>> solution (including experimenting with ProGuard), but haven't been
>> able to: when we use spark-submit, we get NoSuchMethodErrors, even though
>> the code compiles fine, because the runtime classes are different than
>> the compile time classes!
>>
>> Can someone recommend a solution? We are using scala, sbt, and
>> sbt-assembly, but are happy using another tool (please provide
>> instructions how to).
>>
>
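
To make the "provided" suggestion concrete, a build.sbt sketch could look like the following (artifact names and versions here are assumptions; adjust them to your build). With sbt-assembly, "provided" dependencies stay on the compile classpath but are left out of the fat jar, so the versions already on the cluster are the ones used at runtime. That is also why Robert then saw a NoClassDefFoundError for Hadoop classes: whatever is marked "provided" has to actually be present on the runtime classpath.

// build.sbt sketch (hypothetical versions)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "1.0.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"
)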


Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Chester Chen
As Andrew explained, the port is random rather than 4040, as the Spark
driver is started in the Application Master and the port is randomly selected.


But I have the similar UI issue. I am running Yarn Cluster mode against my
local CDH5 cluster.

The log states
"14/07/07 11:59:29 INFO ui.SparkUI: Started SparkUI at
http://10.0.0.63:58750

"


but when you click the Spark UI link (ApplicationMaster or

http://10.0.0.63:58750), I get a 404 with the redirect URI


 http://localhost/proxy/application_1404443455764_0010/



Looking at the Spark code, I notice that the "proxy" is really a variable
to get the proxy from the yarn-site.xml HTTP address. But when I
specified the value in yarn-site.xml, it still doesn't work for me.



Oddly enough, it works for my co-worker on a Pivotal HD cluster,
so I am still looking at what's different in terms of cluster
setup or something else.


Chester





On Mon, Jul 7, 2014 at 11:42 AM, Andrew Or  wrote:

> I will assume that you are running in yarn-cluster mode. Because the
> driver is launched in one of the containers, it doesn't make sense to
> expose port 4040 for the node that contains the container. (Imagine if
> multiple driver containers are launched on the same node. This will cause a
> port collision). If you're launching Spark from a gateway node that is
> physically near your worker nodes, then you can just launch your
> application in yarn-client mode, in which case the SparkUI will always be
> started on port 4040 on the node that you ran spark-submit on. The reason
> why sometimes you see the red text is because it appears only on the driver
> containers, not the executor containers. This is because SparkUI belongs to
> the SparkContext, which only exists on the driver.
>
> Andrew
>
>
> 2014-07-07 11:20 GMT-07:00 Yan Fang :
>
> Hi guys,
>>
>> Not sure if you  have similar issues. Did not find relevant tickets in
>> JIRA. When I deploy the Spark Streaming to YARN, I have following two
>> issues:
>>
>> 1. The UI port is random. It is not default 4040. I have to look at the
>> container's log to check the UI port. Is this supposed to be this way?
>>
>> 2. Most of the time, the UI does not work. The difference between logs
>> are (I ran the same program):
>>
>>
>>
>>
>>
>>
>> *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server14/07/03
>> 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/03 11:38:50 INFO
>> server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
>> 14/07/03 11:38:51 INFO
>> executor.CoarseGrainedExecutorBackend: Got assigned task 0 14/07/03
>> 11:38:51 INFO executor.Executor: Running task ID 0...*
>>
>> 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
>> 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 14/07/02 16:55:32 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:14211
>>
>>
>>
>>
>> *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter:
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter14/07/02 16:55:32
>> INFO server.Server: jetty-8.y.z-SNAPSHOT14/07/02 16:55:32 INFO
>> server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
>>  14/07/02 16:55:32 INFO
>> ui.SparkUI: Started SparkUI at http://myNodeName:21867
>> 14/07/02 16:55:32 INFO
>> cluster.YarnClusterScheduler: Created YarnClusterScheduler*
>>
>> When the red part comes, the UI works sometime. Any ideas? Thank you.
>>
>> Best,
>>
>> Fang, Yan
>> yanfang...@gmail.com
>> +1 (206) 849-4108
>>
>
>


Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Hi Andrew,

Thanks for the quick reply. It works with the yarn-client mode.

One question about the yarn-cluster mode: actually I was checking the AM
for the log, since the spark driver is running in the AM, the UI should
also work, right? But that is not true in my case.

Best,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Mon, Jul 7, 2014 at 11:42 AM, Andrew Or  wrote:

> I will assume that you are running in yarn-cluster mode. Because the
> driver is launched in one of the containers, it doesn't make sense to
> expose port 4040 for the node that contains the container. (Imagine if
> multiple driver containers are launched on the same node. This will cause a
> port collision). If you're launching Spark from a gateway node that is
> physically near your worker nodes, then you can just launch your
> application in yarn-client mode, in which case the SparkUI will always be
> started on port 4040 on the node that you ran spark-submit on. The reason
> why sometimes you see the red text is because it appears only on the driver
> containers, not the executor containers. This is because SparkUI belongs to
> the SparkContext, which only exists on the driver.
>
> Andrew
>
>
> 2014-07-07 11:20 GMT-07:00 Yan Fang :
>
> Hi guys,
>>
>> Not sure if you  have similar issues. Did not find relevant tickets in
>> JIRA. When I deploy the Spark Streaming to YARN, I have following two
>> issues:
>>
>> 1. The UI port is random. It is not default 4040. I have to look at the
>> container's log to check the UI port. Is this supposed to be this way?
>>
>> 2. Most of the time, the UI does not work. The difference between logs
>> are (I ran the same program):
>>
>>
>>
>>
>>
>>
>> *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server14/07/03
>> 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/03 11:38:50 INFO
>> server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
>> 14/07/03 11:38:51 INFO
>> executor.CoarseGrainedExecutorBackend: Got assigned task 0 14/07/03
>> 11:38:51 INFO executor.Executor: Running task ID 0...*
>>
>> 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
>> 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 14/07/02 16:55:32 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:14211
>>
>>
>>
>>
>> *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter:
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter14/07/02 16:55:32
>> INFO server.Server: jetty-8.y.z-SNAPSHOT14/07/02 16:55:32 INFO
>> server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
>>  14/07/02 16:55:32 INFO
>> ui.SparkUI: Started SparkUI at http://myNodeName:21867
>> 14/07/02 16:55:32 INFO
>> cluster.YarnClusterScheduler: Created YarnClusterScheduler*
>>
>> When the red part comes, the UI works sometime. Any ideas? Thank you.
>>
>> Best,
>>
>> Fang, Yan
>> yanfang...@gmail.com
>> +1 (206) 849-4108
>>
>
>


[no subject]

2014-07-07 Thread Juan Rodríguez Hortalá
Hi all,

I'm writing a Spark Streaming program that uses reduceByKeyAndWindow(), and
when I change the window length or sliding interval I get the following
exceptions, running in local mode:


14/07/06 13:03:46 ERROR actor.OneForOneStrategy: key not found:
1404677026000 ms
java.util.NoSuchElementException: key not found: 1404677026000 ms
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at
org.apache.spark.streaming.dstream.ReceiverInputDStream.getReceivedBlockInfo(ReceiverInputDStream.scala:77)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:225)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:223)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:223)
at org.apache.spark.streaming.scheduler.JobGenerator.org
$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:76)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

It looks like an issue with cleaning up some kind of internal checkpoint. Is
there any path I should clean to get rid of these exceptions?
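
For context, a minimal sketch of the calls involved (the durations, names and
checkpoint path below are placeholders, not my actual code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))      // batch interval
ssc.checkpoint("/tmp/streaming-checkpoint")            // checkpoint directory: the kind of path I suspect needs clearing
val pairs = ssc.socketTextStream("localhost", 9999).map(w => (w, 1))
// windowLength (30s) and slidingInterval (10s) are kept as multiples of the batch interval (2s)
val windowed = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
windowed.print()
ssc.start()
ssc.awaitTermination()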

Thanks a lot for your help,

Greetings,

Juan


Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
I will assume that you are running in yarn-cluster mode. Because the driver
is launched in one of the containers, it doesn't make sense to expose port
4040 on the node that hosts that container. (Imagine multiple driver
containers being launched on the same node; that would cause a port
collision.) If you're launching Spark from a gateway node that is
physically near your worker nodes, then you can just launch your
application in yarn-client mode, in which case the SparkUI will always be
started on port 4040 on the node where you ran spark-submit. The reason
you sometimes see the red text is that it appears only in the driver
containers, not the executor containers. This is because the SparkUI belongs to
the SparkContext, which only exists on the driver.
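
For example, a yarn-client submission looks roughly like this (the example
class, jar path and resource settings are just placeholders):

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client \
  --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 \
  ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10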

Andrew


2014-07-07 11:20 GMT-07:00 Yan Fang :

> Hi guys,
>
> Not sure if you  have similar issues. Did not find relevant tickets in
> JIRA. When I deploy the Spark Streaming to YARN, I have following two
> issues:
>
> 1. The UI port is random. It is not the default 4040. I have to look at the
> container's log to check the UI port. Is it supposed to be this way?
>
> 2. Most of the time, the UI does not work. The differences between the logs are
> (I ran the same program):
>
>
>
>
>
>
> *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server14/07/03
> 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/03 11:38:50 INFO
> server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
> 14/07/03 11:38:51 INFO
> executor.CoarseGrainedExecutorBackend: Got assigned task 0 14/07/03
> 11:38:51 INFO executor.Executor: Running task ID 0...*
>
> 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
> 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 14/07/02 16:55:32 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:14211
>
>
>
>
> *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter:
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter14/07/02 16:55:32
> INFO server.Server: jetty-8.y.z-SNAPSHOT14/07/02 16:55:32 INFO
> server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
>  14/07/02 16:55:32 INFO
> ui.SparkUI: Started SparkUI at http://myNodeName:21867
> 14/07/02 16:55:32 INFO
> cluster.YarnClusterScheduler: Created YarnClusterScheduler*
>
> When the part marked in red appears, the UI sometimes works. Any ideas? Thank you.
>
> Best,
>
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
>


Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Juan Rodríguez Hortalá
Hi list,

I'm writing a Spark Streaming program that reads from a kafka topic,
performs some transformations on the data, and then inserts each record in
a database with foreachRDD. I was wondering which is the best way to handle
the connection to the database so each worker, or even each task, uses a
different connection to the database, and then database inserts/updates
would be performed in parallel.
- I understand that using a final variable in the driver code is not a good
idea because then the communication with the database would be performed in
the driver code, which leads to a bottleneck, according to
http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/
- I think creating a new connection in the call() method of the Function
passed to foreachRDD is also a bad idea, because then I wouldn't be reusing
the connection to the database for each batch RDD in the DStream
- I'm not sure that a broadcast variable with the connection handler is a
good idea in case the target database is distributed, because if the same
handler is used for all the nodes of the Spark cluster then that could have
a negative effect on the data locality of the connection to the database.
- From
http://apache-spark-user-list.1001560.n3.nabble.com/Database-connection-per-worker-td1280.html
I understand that by using a static variable and referencing it in the
call() method of the Function passed to foreachRDD we get a different
connection per Spark worker, I guess it's because there is a different JVM
per worker. But then all the tasks in the same worker would share the same
database handler object, am I right?
- Another idea is using updateStateByKey() using the database handler as
the state, but I guess that would only work for Serializable database
handlers, and for example not for an org.apache.hadoop.hbase.client.HTable
object.

So my question is, which is the best way to get a connection to an external
database per task in Spark Streaming? Or at least per worker. In
http://apache-spark-user-list.1001560.n3.nabble.com/Connecting-to-an-inmemory-database-from-Spark-td1343.html
there is a partial solution to this question, but there the database
handler object is missing. This other question
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Shared-hashmaps-td3247.html
is closer to mine, but there is no answer for it yet.
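
For concreteness, here is a minimal sketch of the per-worker pattern I am
considering, combining foreachRDD with foreachPartition and a lazily
initialized handler held in a singleton object (ConnectionPool and its JDBC
URL are hypothetical placeholders, not a real library):

import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: one lazily created connection per executor JVM.
object ConnectionPool {
  lazy val connection: java.sql.Connection =
    java.sql.DriverManager.getConnection("jdbc:placeholder://host/db")
}

def saveToDb(stream: DStream[(String, String)]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Runs on the executor: every task in this JVM reuses the same connection.
      val conn = ConnectionPool.connection
      records.foreach { record =>
        () // the real conn-based insert/update for this record would go here
      }
    }
  }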

Thanks in advance,

Greetings,

Juan


Re: Kafka - streaming from multiple topics

2014-07-07 Thread Sergey Malov
I opened a JIRA issue with Spark, as an improvement though, not as a bug.
Hopefully someone there will notice it.

From: Tobias Pfeiffer <t...@preferred.jp>
Reply-To: "user@spark.apache.org" <user@spark.apache.org>
Date: Thursday, July 3, 2014 at 9:41 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Kafka - streaming from multiple topics

Sergey,


On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov <sma...@collective.com> wrote:
On the other hand, under the hood the KafkaInputDStream which is created with this
KafkaUtils call calls ConsumerConnector.createMessageStream, which returns a
Map[String, List[KafkaStream]] keyed by topic. It is, however, not exposed.

I wonder if this is a bug. After all, KafkaUtils.createStream() returns a 
DStream[(String, String)], which pretty much looks like it should be a (topic 
-> message) mapping. However, for me, the key is always null. Maybe you could 
consider filing a bug/wishlist report?
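
As a workaround sketch in the meantime (ssc, zkQuorum and group are
placeholders from your own setup), one can create one receiver per topic and
tag the records before unioning them, so the topic information is not lost:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def taggedStreams(ssc: StreamingContext, zkQuorum: String, group: String,
                  topics: Seq[String]) = {
  val perTopic = topics.map { topic =>
    KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> 1))
      .map { case (_, msg) => (topic, msg) }   // tag each message with its topic
  }
  ssc.union(perTopic)                          // one DStream of (topic, message)
}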

Tobias



Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Hi guys,

Not sure if you  have similar issues. Did not find relevant tickets in
JIRA. When I deploy the Spark Streaming to YARN, I have following two
issues:

1. The UI port is random. It is not the default 4040. I have to look at the
container's log to check the UI port. Is it supposed to be this way?

2. Most of the time, the UI does not work. The differences between the logs are
(I ran the same program):






*14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server14/07/03
11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/03 11:38:50 INFO
server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
14/07/03 11:38:51 INFO
executor.CoarseGrainedExecutorBackend: Got assigned task 0 14/07/03
11:38:51 INFO executor.Executor: Running task ID 0...*

14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/02 16:55:32 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:14211




*14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter:
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter14/07/02 16:55:32
INFO server.Server: jetty-8.y.z-SNAPSHOT14/07/02 16:55:32 INFO
server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
 14/07/02 16:55:32 INFO
ui.SparkUI: Started SparkUI at http://myNodeName:21867
14/07/02 16:55:32 INFO
cluster.YarnClusterScheduler: Created YarnClusterScheduler*

When the part marked in red appears, the UI sometimes works. Any ideas? Thank you.

Best,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


Error while launching spark cluster manaually

2014-07-07 Thread Sameer Tilak
Hi All, I am having the following issue -- may be fqdn/ip resolution issue, but
not sure, any help with this will be great!

On the master node I get the following error: I start master using
./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to
/apps/software/spark-1.0.0-bin-hadoop1/sbin/../logs/spark-p529444-org.apache.spark.deploy.master.Master-1-pzxnvm2018.out

bash-4.1$ tail
/apps/software/spark-1.0.0-bin-hadoop1/sbin/../logs/spark-p529444-org.apache.spark.deploy.master.Master-1-pzxnvm2018.out
14/07/07 11:03:16 INFO SecurityManager: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/07/07 11:03:16 INFO SecurityManager: Changing view acls to: userid
14/07/07 11:03:16 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(userid)
14/07/07 11:03:16 INFO Slf4jLogger: Slf4jLogger started
14/07/07 11:03:16 INFO Remoting: Starting remoting
14/07/07 11:03:17 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkMaster@pzxnvm2018:7077]
14/07/07 11:03:17 INFO Master: Starting Spark master at spark://pzxnvm2018:7077
14/07/07 11:03:17 INFO MasterWebUI: Started MasterWebUI at
http://pzxnvm2018.a.b.org(masterfqdn):8080
14/07/07 11:03:17 INFO Master: I have been elected leader! New state: ALIVE
14/07/07 11:07:53 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140707110744-/10 on hostPort worker3fqdn:54921 with 4 cores, 512.0 MB RAM
14/07/07 11:07:53 INFO AppClient$ClientActor: Executor updated:
app-20140707110744-/10 is now RUNNING
14/07/07 11:07:54 INFO AppClient$ClientActor: Executor updated:
app-20140707110744-/8 is now FAILED (Command exited with code 1)
14/07/07 11:07:54 INFO SparkDeploySchedulerBackend: Executor
app-20140707110744-/8 removed: Command exited with code 1
14/07/07 11:07:54 INFO AppClient$ClientActor: Executor added:
app-20140707110744-/11 on worker-20140707110701-worker3fqdn-49287
(worker3fqdn:49287) with 4 cores

On the worker node I get the following error. I start the workers node with the
following command:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://pzxnvm2018:7077

worker4ip" "4" "akka.tcp://sparkWorker@worker4ip:49287/user/Worker"
"app-20140707110744-"
14/07/07 11:07:49 INFO Worker: Executor app-20140707110744-/3 finished with
state FAILED message Command exited with code 1 exitStatus 1
14/07/07 11:07:49 INFO Worker: Asked to launch executor
app-20140707110744-/6 for ApproxStrMatch
14/07/07 11:07:49 INFO ExecutorRunner: Launch command:
"/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp"
"::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf"
"-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@localhost:59792/user/CoarseGrainedScheduler" "6"
"pzxnvm2024.dcld.pldc.kp.org" "4"
"akka.tcp://sparkwor...@pzxnvm2024.dcld.pldc.kp.org:49287/user/Worker"
"app-20140707110744-"
14/07/07 11:07:51 INFO Worker: Executor app-20140707110744-/6 finished with
state FAILED message Command exited with code 1 exitStatus 1

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Chester Chen
Have you tried changing the Spark SBT build scripts? You can change the
dependency scope to "provided". This is similar to compile scope, except that the
JDK or the container needs to provide the dependency at runtime.

This assumes that Spark will work with the new version of the common libraries.

Of course, this is not a general solution even if it works (and it may not work).
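
For illustration, this is roughly what a "provided" scope looks like in an sbt
build (the artifact and version are just an example; the same idea applies
whether you mark the conflicting common libraries as provided in Spark's own
build or mark Spark itself as provided in the application build):

// build.sbt sketch: "provided" keeps the jar out of the assembly, so the
// runtime environment has to supply it instead.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"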

Chester




On Mon, Jul 7, 2014 at 10:31 AM, Robert James 
wrote:

> spark-submit includes a spark-assembly uber jar, which has older
> versions of many common libraries.  These conflict with some of the
> dependencies we need.  I have been racking my brain trying to find a
> solution (including experimenting with ProGuard), but haven't been
> able to: when we use spark-submit, we get NoMethodErrors, even though
> the code compiles fine, because the runtime classes are different than
> the compile time classes!
>
> Can someone recommend a solution? We are using scala, sbt, and
> sbt-assembly, but are happy using another tool (please provide
> instructions how to).
>


Re: tiers of caching

2014-07-07 Thread Andrew Or
Others have also asked for this on the mailing list, and hence there's a
related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur
brings up a good point in that any current implementation of in-memory
shuffles will compete with application RDD blocks. I think we should
definitely add this at some point. In terms of a timeline, we already have
many features lined up for 1.1, however, so it will likely be after that.


2014-07-07 10:13 GMT-07:00 Ankur Dave :

> I think tiers/priorities for caching are a very good idea and I'd be
> interested to see what others think. In addition to letting libraries cache
> RDDs liberally, it could also unify memory management across other parts of
> Spark. For example, small shuffles benefit from explicitly keeping the
> shuffle outputs in memory rather than writing it to disk, possibly due to
> filesystem overhead. To prevent in-memory shuffle outputs from competing
> with application RDDs, Spark could mark them as lower-priority and specify
> that they should be dropped to disk when memory runs low.
>
> Ankur 
>
>


Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Koert Kuipers
Spark has a setting to put user jars in front of the classpath, which should do
the trick.
However, I had no luck with this. See here:

https://issues.apache.org/jira/browse/SPARK-1863
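
If I remember right, the setting in question is the experimental
spark.files.userClassPathFirst flag (treat the exact property name as an
assumption and check the configuration docs for your release). Setting it
looks like:

import org.apache.spark.SparkConf

// Assumed, experimental property: prefer user-added jars over Spark's own
// classes on the executors. As noted above and in SPARK-1863, it may not help.
val conf = new SparkConf().set("spark.files.userClassPathFirst", "true")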



On Mon, Jul 7, 2014 at 1:31 PM, Robert James  wrote:

> spark-submit includes a spark-assembly uber jar, which has older
> versions of many common libraries.  These conflict with some of the
> dependencies we need.  I have been racking my brain trying to find a
> solution (including experimenting with ProGuard), but haven't been
> able to: when we use spark-submit, we get NoMethodErrors, even though
> the code compiles fine, because the runtime classes are different than
> the compile time classes!
>
> Can someone recommend a solution? We are using scala, sbt, and
> sbt-assembly, but are happy using another tool (please provide
> instructions how to).
>


spark-assembly libraries conflict with application libraries

2014-07-07 Thread Robert James
spark-submit includes a spark-assembly uber jar, which has older
versions of many common libraries.  These conflict with some of the
dependencies we need.  I have been racking my brain trying to find a
solution (including experimenting with ProGuard), but haven't been
able to: when we use spark-submit, we get NoMethodErrors, even though
the code compiles fine, because the runtime classes are different than
the compile time classes!

Can someone recommend a solution? We are using scala, sbt, and
sbt-assembly, but are happy using another tool (please provide
instructions how to).


spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
spark-submit includes a spark-assembly uber jar, which has older
versions of many common libraries.  These conflict with some of the
dependencies we need.  I have been racking my brain trying to find a
solution (including experimenting with ProGuard), but haven't been
able to: when we use spark-submit, we get NoMethodErrors, even though
the code compiles fine, because the runtime classes are different than
the compile time classes!

Can someone recommend a solution? We are using scala, sbt, and
sbt-assembly, but are happy using another tool (please provide
instructions how to).


Re: tiers of caching

2014-07-07 Thread Ankur Dave
I think tiers/priorities for caching are a very good idea and I'd be
interested to see what others think. In addition to letting libraries cache
RDDs liberally, it could also unify memory management across other parts of
Spark. For example, small shuffles benefit from explicitly keeping the
shuffle outputs in memory rather than writing it to disk, possibly due to
filesystem overhead. To prevent in-memory shuffle outputs from competing
with application RDDs, Spark could mark them as lower-priority and specify
that they should be dropped to disk when memory runs low.

Ankur 


tiers of caching

2014-07-07 Thread Koert Kuipers
I noticed that some algorithms, such as GraphX, liberally cache RDDs for
efficiency, which makes sense. However, it can also leave a long trail of
unused yet cached RDDs that might push other RDDs out of memory.

In a long-lived Spark context I would like to decide which RDDs stick
around. Would it make sense to create tiers of caching, to distinguish
RDDs explicitly cached by the application from RDDs that are temporarily
cached by algorithms, so as to make sure these temporary caches don't push
application RDDs out of memory?


Re: Dense to sparse vector converter

2014-07-07 Thread Xiangrui Meng
No, but it should be easy to add one. -Xiangrui
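
In the meantime, a rough sketch of doing it by hand with the Vectors factory
(written against the 1.0 linalg API as I understand it; the helper name is
made up):

import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}

// Keep only the non-zero entries and rebuild a sparse vector of the same size.
def toSparse(dv: DenseVector): Vector = {
  val nonZero = dv.toArray.zipWithIndex.filter { case (v, _) => v != 0.0 }
  Vectors.sparse(dv.size, nonZero.map(_._2), nonZero.map(_._1))
}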

On Mon, Jul 7, 2014 at 12:37 AM, Ulanov, Alexander
 wrote:
> Hi,
>
>
>
> Is there a method in Spark/MLlib to convert DenseVector to SparseVector?
>
>
>
> Best regards, Alexander


Re: Execution stalls in LogisticRegressionWithSGD

2014-07-07 Thread Xiangrui Meng
It seems to me a setup issue. I just tested news20.binary (1355191
features) on a 2-node EC2 cluster and it worked well. I added one line
to conf/spark-env.sh:

export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20 "

and launched spark-shell with "--driver-memory 20g". Could you re-try
with an EC2 setup? If it still doesn't work, please attach all your
code and logs.

Best,
Xiangrui

On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar  wrote:
> Hi Xiangrui,
>
> 1) Yes, I used the same build (compiled locally from source) to the host
> that has (master, slave1) and the second host with slave2.
>
> 2) The execution was successful when run in local mode with reduced number
> of partitions. Does this imply issues communicating/coordinating across
> processes (i.e. driver, master and workers)?
>
> Thanks,
> Bharath
>
>
>
> On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng  wrote:
>>
>> Hi Bharath,
>>
>> 1) Did you sync the spark jar and conf to the worker nodes after build?
>> 2) Since the dataset is not large, could you try local mode first
>> using `spark-summit --driver-memory 12g --master local[*]`?
>> 3) Try to use less number of partitions, say 5.
>>
>> If the problem is still there, please attach the full master/worker log
>> files.
>>
>> Best,
>> Xiangrui
>>
>> On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar 
>> wrote:
>> > Xiangrui,
>> >
>> > Leaving the frameSize unspecified led to an error message (and failure)
>> > stating that the task size (~11M) was larger. I hence set it to an
>> > arbitrarily large value ( I realize 500 was unrealistic & unnecessary in
>> > this case). I've now set the size to 20M and repeated the runs. The
>> > earlier
>> > runs were on an uncached RDD. Caching the RDD (and setting
>> > spark.storage.memoryFraction=0.5) resulted in marginal speed up of
>> > execution, but the end result remained the same. The cached RDD size is
>> > as
>> > follows:
>> >
>> > RDD Name: 1084
>> > Storage Level: Memory Deserialized 1x Replicated
>> > Cached Partitions: 80
>> > Fraction Cached: 100%
>> > Size in Memory: 165.9 MB
>> > Size in Tachyon: 0.0 B
>> > Size on Disk: 0.0 B
>> >
>> >
>> >
>> > The corresponding master logs were:
>> >
>> > 14/07/04 06:29:34 INFO Master: Removing executor
>> > app-20140704062238-0033/1
>> > because it is EXITED
>> > 14/07/04 06:29:34 INFO Master: Launching executor
>> > app-20140704062238-0033/2
>> > on worker worker-20140630124441-slave1-40182
>> > 14/07/04 06:29:34 INFO Master: Removing executor
>> > app-20140704062238-0033/0
>> > because it is EXITED
>> > 14/07/04 06:29:34 INFO Master: Launching executor
>> > app-20140704062238-0033/3
>> > on worker worker-20140630102913-slave2-44735
>> > 14/07/04 06:29:37 INFO Master: Removing executor
>> > app-20140704062238-0033/2
>> > because it is EXITED
>> > 14/07/04 06:29:37 INFO Master: Launching executor
>> > app-20140704062238-0033/4
>> > on worker worker-20140630124441-slave1-40182
>> > 14/07/04 06:29:37 INFO Master: Removing executor
>> > app-20140704062238-0033/3
>> > because it is EXITED
>> > 14/07/04 06:29:37 INFO Master: Launching executor
>> > app-20140704062238-0033/5
>> > on worker worker-20140630102913-slave2-44735
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
>> > 14/07/04 06:29:39 INFO LocalActorRef: Message
>> > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
>> > from
>> > Actor[akka://sparkMaster/deadLetters] to
>> >
>> > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260]
>> > was not delivered. [39] dead letters encountered. This logging can be
>> > turned
>> > off or adjusted with configuration settings 'akka.log-dead-letters' and
>> > 'akka.log-dead-letters-during-shutdown'.
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> > [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
>> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> > akka.remote.EndpointAssociationException: Association failed with
>> > [akka.tcp://spark@slave2:45172]
>> > Caused by:
>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> > Connection refused: slave2/10.3.1.135:45172
>> > ]
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> > [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
>> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> > akka.remote.EndpointAssociationException: Asso

Re: Control number of tasks per stage

2014-07-07 Thread Daniel Siegmann
The default number of tasks when reading files is based on how the files
are split among the nodes. Beyond that, the default number of tasks after a
shuffle is based on the property spark.default.parallelism. (see
http://spark.apache.org/docs/latest/configuration.html).

You can use RDD.repartition to increase or decrease the number of tasks (or
RDD.coalesce, but you must set shuffle to true if you want to increase the
partitions). Other RDD methods which cause a shuffle usually have a
parameter to set the number of tasks.
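
A small illustration of those knobs (rdd is any RDD, pairs any pair RDD, and
the partition counts are arbitrary):

val spread = rdd.repartition(100)              // shuffle into 100 partitions
val shrunk = rdd.coalesce(10)                  // reduce partitions without a full shuffle
val grown  = rdd.coalesce(100, shuffle = true) // growing the count requires shuffle = true
val counts = pairs.reduceByKey(_ + _, 50)      // shuffle operations accept a numPartitions argument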



On Mon, Jul 7, 2014 at 11:25 AM, Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> Hi all,
>
> Is there any way to control the number of tasks per stage?
>
> Currently I see a situation where only 2 tasks are created per stage and each
> of them is very slow, while at the same time the cluster has a huge number of unused
> nodes.
>
>
> Thank you,
> Konstantin Kudryavtsev
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Java sample for using cassandra-driver-spark

2014-07-07 Thread Piotr Kołaczkowski
Hi, we're planning to add a basic Java-API very soon, possibly this week.
There's a ticket for it here:
https://github.com/datastax/cassandra-driver-spark/issues/11

We're open to any ideas. Just let us know what you need the API to have in
the comments.

Regards,
Piotr Kołaczkowski


2014-07-05 0:48 GMT+02:00 M Singh :

> Hi:
>
> Is there a Java sample fragment for using cassandra-driver-spark ?
>
> Thanks
>



-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
3975 Freedom Circle
Santa Clara, CA 95054, USA


Control number of tasks per stage

2014-07-07 Thread Konstantin Kudryavtsev
Hi all,

Is there any way to control the number of tasks per stage?

Currently I see a situation where only 2 tasks are created per stage and each
of them is very slow, while at the same time the cluster has a huge number of
unused nodes.


Thank you,
Konstantin Kudryavtsev


Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
That version is old :).
We are not forking Pig but cleanly separating out the Pig execution engine. Let
me know if you are willing to give it a go.

Also, I would love to know which features of Pig you are using.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux  wrote:

> I saw a wiki page from your company but with an old version of Spark.
>
> http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
>
> I have no reason to use it yet but I am interested in the state of the
> initiative.
> What's your point of view (personal and/or professional) about the Pig
> 0.13 release?
> Is the pluggable execution engine flexible enough in order to avoid having
> Spork as a fork of Pig? Pig + Spark + Fork = Spork :D
>
> As a (for now) external observer, I am glad to see competition in that
> space. It can only be good for the community in the end.
>
> Bertrand Dechoux
>
>
> On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi 
> wrote:
>
>> Hi,
>> We have fixed many major issues around Spork & deploying it with some
>> customers. Would be happy to provide a working version to you to try out.
>> We are looking for more folks to try it out & submit bugs.
>>
>> Regards
>> Mayur
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>>
>> On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux 
>> wrote:
>>
>>> Hi,
>>>
>>> I was wondering what was the state of the Pig+Spark initiative now that
>>> the execution engine of Pig is pluggable? Granted, it was done in order to
>>> use Tez but could it be used by Spark? I know about a 'theoretical' project
>>> called Spork but I don't know any stable and maintained version of it.
>>>
>>> Regards
>>>
>>> Bertrand Dechoux
>>>
>>
>>
>


Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
I saw a wiki page from your company but with an old version of Spark.
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1

I have no reason to use it yet but I am interested in the state of the
initiative.
What's your point of view (personal and/or professional) about the Pig 0.13
release?
Is the pluggable execution engine flexible enough in order to avoid having
Spork as a fork of Pig? Pig + Spark + Fork = Spork :D

As a (for now) external observer, I am glad to see competition in that
space. It can only be good for the community in the end.

Bertrand Dechoux


On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi 
wrote:

> Hi,
> We have fixed many major issues around Spork & deploying it with some
> customers. Would be happy to provide a working version to you to try out.
> We are looking for more folks to try it out & submit bugs.
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux 
> wrote:
>
>> Hi,
>>
>> I was wondering what was the state of the Pig+Spark initiative now that
>> the execution engine of Pig is pluggable? Granted, it was done in order to
>> use Tez but could it be used by Spark? I know about a 'theoretical' project
>> called Spork but I don't know any stable and maintained version of it.
>>
>> Regards
>>
>> Bertrand Dechoux
>>
>
>


Comparative study

2014-07-07 Thread santosh.viswanathan
Hello Experts,

I am doing some comparative study on the below:

Spark vs Impala
Spark vs MapReduce. Is it worth migrating from an existing MR implementation to
Spark?


Please share your thoughts and expertise.


Thanks,
Santosh





Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
Hi,
We have fixed many major issues around Spork & deploying it with some
customers. Would be happy to provide a working version to you to try out.
We are looking for more folks to try it out & submit bugs.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux  wrote:

> Hi,
>
> I was wondering what was the state of the Pig+Spark initiative now that
> the execution engine of Pig is pluggable? Granted, it was done in order to
> use Tez but could it be used by Spark? I know about a 'theoretical' project
> called Spork but I don't know any stable and maintained version of it.
>
> Regards
>
> Bertrand Dechoux
>


Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
Hi,

I was wondering what was the state of the Pig+Spark initiative now that the
execution engine of Pig is pluggable? Granted, it was done in order to use
Tez but could it be used by Spark? I know about a 'theoretical' project
called Spork but I don't know any stable and maintained version of it.

Regards

Bertrand Dechoux


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Hi Chester,

Thank you very much, it is clear now - just two different ways to support
Spark on a cluster.

Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 3:22 PM, Chester @work  wrote:

> In YARN cluster mode, you can either have Spark on all the cluster nodes
> or supply the Spark jar yourself. In the second case, you don't need to install
> Spark on the cluster at all, as you supply the Spark assembly as well as your app
> jar together.
>
> I hope this makes it clear
>
> Chester
>
> Sent from my iPhone
>
> On Jul 7, 2014, at 5:05 AM, Konstantin Kudryavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> thank you Krishna!
>
>  Could you please explain why do I need install spark on each node if
> Spark official site said: If you have a Hadoop 2 cluster, you can run
> Spark without any installation needed
>
> I have HDP 2 (YARN) and that's why I hope I don't need to install spark on
> each node
>
> Thank you,
> Konstantin Kudryavtsev
>
>
> On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar 
> wrote:
>
>> Konstantin,
>>
>>1. You need to install the hadoop rpms on all nodes. If it is Hadoop
>>2, the nodes would have hdfs & YARN.
>>2. Then you need to install Spark on all nodes. I haven't had
>>experience with HDP, but the tech preview might have installed Spark as
>>well.
>>3. In the end, one should have hdfs,yarn & spark installed on all the
>>nodes.
>>4. After installations, check the web console to make sure hdfs, yarn
>>& spark are running.
>>5. Then you are ready to start experimenting/developing spark
>>applications.
>>
>> HTH.
>> Cheers
>> 
>>
>>
>> On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>>> guys, I'm not talking about running spark on VM, I don have problem with
>>> it.
>>>
>>> I confused in the next:
>>> 1) Hortonworks describe installation process as RPMs on each node
>>> 2) spark home page said that everything I need is YARN
>>>
>>> And I'm in stucj with understanding what I need to do to run spark on
>>> yarn (do I need RPMs installations or only build spark on edge node?)
>>>
>>>
>>> Thank you,
>>> Konstantin Kudryavtsev
>>>
>>>
>>> On Mon, Jul 7, 2014 at 4:34 AM, Robert James 
>>> wrote:
>>>
 I can say from my experience that getting Spark to work with Hadoop 2
 is not for the beginner; after solving one problem after another
 (dependencies, scripts, etc.), I went back to Hadoop 1.

 Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
 why, but, given so, Hadoop 2 has too many bumps

 On 7/6/14, Marco Shaw  wrote:
 > That is confusing based on the context you provided.
 >
 > This might take more time than I can spare to try to understand.
 >
 > For sure, you need to add Spark to run it in/on the HDP 2.1 express
 VM.
 >
 > Cloudera's CDH 5 express VM includes Spark, but the service isn't
 running by
 > default.
 >
 > I can't remember for MapR...
 >
 > Marco
 >
 >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
 >>  wrote:
 >>
 >> Marco,
 >>
 >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that
 you
 >> can try
 >> from
 >>
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
 >>
 >> On other hand, http://spark.apache.org/ said "
 >> Integrated with Hadoop
 >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
 >> existing Hadoop data.
 >>
 >> If you have a Hadoop 2 cluster, you can run Spark without any
 installation
 >> needed. "
 >>
 >> And this is confusing for me... do I need rpm installation on not?...
 >>
 >>
 >> Thank you,
 >> Konstantin Kudryavtsev
 >>
 >>
 >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
 >>> wrote:
 >>> Can you provide links to the sections that are confusing?
 >>>
 >>> My understanding, the HDP1 binaries do not need YARN, while the HDP2
 >>> binaries do.
 >>>
 >>> Now, you can also install Hortonworks Spark RPM...
 >>>
 >>> For production, in my opinion, RPMs are better for manageability.
 >>>
  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
   wrote:
 
  Hello, thanks for your message... I'm confused, Hortonworhs suggest
  install spark rpm on each node, but on Spark main page said that
 yarn
  enough and I don't need to install it... What the difference?
 
  sent from my HTC
 
 > On Jul 6, 2014 8:34 PM, "vs"  wrote:
 > Konstantin,
 >
 > HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
 can
 > try
 > from
 >
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 >
 > Let me kn

Possible bug in Spark Streaming :: TextFileStream

2014-07-07 Thread Luis Ángel Vicente Sánchez
I have a basic spark streaming job that is watching a folder, processing
any new file and updating a column family in cassandra using the new
cassandra-spark-driver.

I think there is a problem with SparkStreamingContext.textFileStream... if
I start my job in local mode with no files in the folder that is watched
and then I copy a bunch of files, sometimes spark is continually processing
those files again and again.

I have noticed that it usually happens when spark doesn't detect all new
files in one go... i.e. I copied 6 files and spark detected 3 of them as
new and processed them; then it detected the other 3 as new and processed
them. After it finished processing all 6 files, it detected the first
3 files as new again and processed them... then the other 3... and again...
and again... and again.

Should I raise a JIRA issue?

Regards,

Luis


Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-07 Thread Koert Kuipers
You could do the deep check only if the hashcodes are the same, and design
the hashcodes so that they do not take all elements into account.

The alternative seems to be putting cache statements all over GraphX, as is
currently the case, which is trouble for any long-lived application where
caching is carefully managed. I think? I am currently forced to call
unpersist on vertices after almost every intermediate graph
transformation, or accept my RDD cache getting polluted.
On Jul 7, 2014 12:03 AM, "Ankur Dave"  wrote:

> Well, the alternative is to do a deep equality check on the index arrays,
> which would be somewhat expensive since these are pretty large arrays (one
> element per vertex in the graph). But, in case the reference equality check
> fails, it actually might be a good idea to do the deep check before
> resorting to the slow code path.
>
> Ankur 
>


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Chester @work
In YARN cluster mode, you can either have Spark on all the cluster nodes or
supply the Spark jar yourself. In the second case, you don't need to install Spark
on the cluster at all, as you supply the Spark assembly as well as your app jar
together.

I hope this makes it clear

Chester

Sent from my iPhone

> On Jul 7, 2014, at 5:05 AM, Konstantin Kudryavtsev 
>  wrote:
> 
> thank you Krishna!
> 
> Could you please explain why do I need install spark on each node if Spark 
> official site said: If you have a Hadoop 2 cluster, you can run Spark without 
> any installation needed
> 
> I have HDP 2 (YARN) and that's why I hope I don't need to install spark on 
> each node 
> 
> Thank you,
> Konstantin Kudryavtsev
> 
> 
>> On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar  wrote:
>> Konstantin,
>> You need to install the hadoop rpms on all nodes. If it is Hadoop 2, the 
>> nodes would have hdfs & YARN.
>> Then you need to install Spark on all nodes. I haven't had experience with 
>> HDP, but the tech preview might have installed Spark as well.
>> In the end, one should have hdfs,yarn & spark installed on all the nodes.
>> After installations, check the web console to make sure hdfs, yarn & spark 
>> are running.
>> Then you are ready to start experimenting/developing spark applications.
>> HTH.
>> Cheers
>> 
>> 
>> 
>>> On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev 
>>>  wrote:
>>> guys, I'm not talking about running spark on VM, I don have problem with it.
>>> 
>>> I confused in the next:
>>> 1) Hortonworks describe installation process as RPMs on each node
>>> 2) spark home page said that everything I need is YARN
>>> 
>>> And I'm in stucj with understanding what I need to do to run spark on yarn 
>>> (do I need RPMs installations or only build spark on edge node?)
>>> 
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
>>> 
 On Mon, Jul 7, 2014 at 4:34 AM, Robert James  
 wrote:
 I can say from my experience that getting Spark to work with Hadoop 2
 is not for the beginner; after solving one problem after another
 (dependencies, scripts, etc.), I went back to Hadoop 1.
 
 Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
 why, but, given so, Hadoop 2 has too many bumps
 
 On 7/6/14, Marco Shaw  wrote:
 > That is confusing based on the context you provided.
 >
 > This might take more time than I can spare to try to understand.
 >
 > For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
 >
 > Cloudera's CDH 5 express VM includes Spark, but the service isn't 
 > running by
 > default.
 >
 > I can't remember for MapR...
 >
 > Marco
 >
 >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
 >>  wrote:
 >>
 >> Marco,
 >>
 >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
 >> can try
 >> from
 >> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
 >>
 >> On other hand, http://spark.apache.org/ said "
 >> Integrated with Hadoop
 >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
 >> existing Hadoop data.
 >>
 >> If you have a Hadoop 2 cluster, you can run Spark without any 
 >> installation
 >> needed. "
 >>
 >> And this is confusing for me... do I need rpm installation on not?...
 >>
 >>
 >> Thank you,
 >> Konstantin Kudryavtsev
 >>
 >>
 >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
 >>> wrote:
 >>> Can you provide links to the sections that are confusing?
 >>>
 >>> My understanding, the HDP1 binaries do not need YARN, while the HDP2
 >>> binaries do.
 >>>
 >>> Now, you can also install Hortonworks Spark RPM...
 >>>
 >>> For production, in my opinion, RPMs are better for manageability.
 >>>
  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
   wrote:
 
  Hello, thanks for your message... I'm confused, Hortonworhs suggest
  install spark rpm on each node, but on Spark main page said that yarn
  enough and I don't need to install it... What the difference?
 
  sent from my HTC
 
 > On Jul 6, 2014 8:34 PM, "vs"  wrote:
 > Konstantin,
 >
 > HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can
 > try
 > from
 > http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 >
 > Let me know if you see issues with the tech preview.
 >
 > "spark PI example on HDP 2.0
 >
 > I downloaded spark 1.0 pre-build from
 > http://spark.apache.org/downloads.html
 > (for HDP2)
 > The run example from spark web-site:
 > ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>

spark-submit conflicts with dependencies

2014-07-07 Thread Robert James
When I use spark-submit (along with spark-ec2), I get dependency
conflicts.  spark-assembly includes older versions of apache commons
codec and httpclient, and these conflict with many of the libs our
software uses.

Is there any way to resolve these?  Or, if we use the precompiled
spark, can we simply not use newer versions of these libs, or anything
that depends on them?


Re: Spark memory optimization

2014-07-07 Thread Surendranauth Hiraman
Using persist() is a sort of a "hack" or a hint (depending on your
perspective :-)) to make the RDD use disk, not memory. As I mentioned
though, the disk io has consequences, mainly (I think) making sure you have
enough disks to not let io be a bottleneck.
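
For reference, the hint itself is just the following (rdd being whichever RDD
you want kept off the memory cache):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.DISK_ONLY)  // store this RDD's partitions only on disk, not in memory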

Increasing partitions I think is the other common approach people take,
from what I've read.

For alternatives, if your data is in HDFS or you just want to stick with
Map/Reduce, then the higher level abstractions on top of M/R you might want
to look at include the following, which have both Scala and Java
implementations in some cases.

Scalding (Scala API on top of Cascading and it seems is the most active of
such projects, at least on the surface)
Scoobi
Scrunch (Scala wrapper around Crunch)

There are other parallel distributed frameworks outside of the Hadoop
ecosystem, of course.

-Suren




On Mon, Jul 7, 2014 at 7:31 AM, Igor Pernek  wrote:

> Thanks guys! Actually, I'm not doing any caching (at least I'm not calling
> cache/persist), do I still need to use the DISK_ONLY storage level?
> However, I do use reduceByKey and sortByKey. Mayur, you mentioned that
> sortByKey requires data to fit the memory. Is there any way to work around
> this (maybe by increasing the number of partitions or something similar?).
>
> What alternative would you suggest, if Spark is not the way to go with
> this kind of scenario. As mentioned, what I like about spark is its high
> level of abstraction of parallelization. I'm ready to sacrifice speed (if
> the slowdown is not too big - I'm doing batch processing, nothing
> real-time) for code simplicity and readability.
>
>
>
> On Fri, Jul 4, 2014 at 3:16 PM, Surendranauth Hiraman <
> suren.hira...@velos.io> wrote:
> >
> > When using DISK_ONLY, keep in mind that disk I/O is pretty high. Make
> sure you are writing to multiple disks for best operation. And even with
> DISK_ONLY, we've found that there is a minimum threshold for executor ram
> (spark.executor.memory), which for us seemed to be around 8 GB.
> >
> > If you find that, with enough disks, you still have errors/exceptions
> getting the flow to finish, first check iostat to see if disk is the
> bottleneck.
> >
> > Then, you may want to try tuning some or all of the following, which
> affect buffers and timeouts. For us, because we did not have enough disks
> to start out, the io bottleneck caused timeouts and other errors. In the
> end, IMHO, it's probably best to solve the problem by adding disks than by
> tuning the parameters, because it seemed that the i/o bottlenecks
> eventually backed up the processing.
> >
> > //conf.set("spark.shuffle.consolidateFiles","true")
> >
> > //conf.set("spark.shuffle.file.buffer.kb", "200")// does
> doubling this help? should increase in-memory buffer to decrease disk writes
> > //conf.set("spark.reducer.maxMbInFlight", "96") // does
> doubling this help? should allow for more simultaneous shuffle data to be
> read from remotes
> >
> > // because we use disk-only, we should be able to reverse the
> default memory usage settings
> > //conf.set("spark.shuffle.memoryFraction","0.6") // default 0.3
> > //conf.set("spark.storage.memoryFraction","0.3")   // default 0.6
> >
> > //conf.set("spark.worker.timeout","180")
> >
> > // akka settings
> > //conf.set("spark.akka.threads", "300")   // number of akka
> actors
> > //conf.set("spark.akka.timeout", "180")   // we saw a problem
> with smaller numbers
> > //conf.set("spark.akka.frameSize", "100")  // not sure if we
> need to up this. Default is 10.
> > //conf.set("spark.akka.batchSize", "30")
> > //conf.set("spark.akka.askTimeout", "30") // supposedly this is
> important for high cpu/io load
> >
> > // block manager
> > //conf.set("spark.storage.blockManagerTimeoutIntervalMs",
> "18")
> > //conf.set("spark.blockManagerHeartBeatMs", "8")
> >
> >
> >
> >
> >
> >
> > On Fri, Jul 4, 2014 at 8:52 AM, Mayur Rustagi 
> wrote:
> >>
> >> I would go with Spark only if you are certain that you are going to
> scale out in the near future.
> >> You can change the default storage of RDD to DISK_ONLY, that might
> remove issues around any rdd leveraging memory. Thr are some functions
> particularly sortbykey that require data to fit in memory to work, so you
> may be hitting some of those walls too.
> >> Regards
> >> Mayur
> >>
> >> Mayur Rustagi
> >> Ph: +1 (760) 203 3257
> >> http://www.sigmoidanalytics.com
> >> @mayur_rustagi
> >>
> >>
> >>
> >> On Fri, Jul 4, 2014 at 2:36 PM, Igor Pernek  wrote:
> >>>
> >>> Hi all!
> >>>
> >>> I have a folder with 150 G of txt files (around 700 files, on average
> each 200 MB).
> >>>
> >>> I'm using scala to process the files and calculate some aggregate
> statistics in the end. I see two possible approaches to do that: - manually
> loop through all the files, do the calculations per file and merge the
> results in 

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Thank you, Krishna!

Could you please explain why I need to install Spark on each node if the Spark
official site says: "If you have a Hadoop 2 cluster, you can run Spark
without any installation needed"?

I have HDP 2 (YARN) and that's why I hope I don't need to install Spark on
each node.

Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar  wrote:

> Konstantin,
>
>1. You need to install the hadoop rpms on all nodes. If it is Hadoop
>2, the nodes would have hdfs & YARN.
>2. Then you need to install Spark on all nodes. I haven't had
>experience with HDP, but the tech preview might have installed Spark as
>well.
>3. In the end, one should have hdfs,yarn & spark installed on all the
>nodes.
>4. After installations, check the web console to make sure hdfs, yarn
>& spark are running.
>5. Then you are ready to start experimenting/developing spark
>applications.
>
> HTH.
> Cheers
> 
>
>
> On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
>> guys, I'm not talking about running spark on VM, I don have problem with
>> it.
>>
>> I confused in the next:
>> 1) Hortonworks describe installation process as RPMs on each node
>> 2) spark home page said that everything I need is YARN
>>
>> And I'm in stucj with understanding what I need to do to run spark on
>> yarn (do I need RPMs installations or only build spark on edge node?)
>>
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>>
>> On Mon, Jul 7, 2014 at 4:34 AM, Robert James 
>> wrote:
>>
>>> I can say from my experience that getting Spark to work with Hadoop 2
>>> is not for the beginner; after solving one problem after another
>>> (dependencies, scripts, etc.), I went back to Hadoop 1.
>>>
>>> Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
>>> why, but, given so, Hadoop 2 has too many bumps
>>>
>>> On 7/6/14, Marco Shaw  wrote:
>>> > That is confusing based on the context you provided.
>>> >
>>> > This might take more time than I can spare to try to understand.
>>> >
>>> > For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
>>> >
>>> > Cloudera's CDH 5 express VM includes Spark, but the service isn't
>>> running by
>>> > default.
>>> >
>>> > I can't remember for MapR...
>>> >
>>> > Marco
>>> >
>>> >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
>>> >>  wrote:
>>> >>
>>> >> Marco,
>>> >>
>>> >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that
>>> you
>>> >> can try
>>> >> from
>>> >>
>>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>>> >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
>>> >>
>>> >> On other hand, http://spark.apache.org/ said "
>>> >> Integrated with Hadoop
>>> >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
>>> >> existing Hadoop data.
>>> >>
>>> >> If you have a Hadoop 2 cluster, you can run Spark without any
>>> installation
>>> >> needed. "
>>> >>
>>> >> And this is confusing for me... do I need rpm installation on not?...
>>> >>
>>> >>
>>> >> Thank you,
>>> >> Konstantin Kudryavtsev
>>> >>
>>> >>
>>> >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
>>> >>> wrote:
>>> >>> Can you provide links to the sections that are confusing?
>>> >>>
>>> >>> My understanding, the HDP1 binaries do not need YARN, while the HDP2
>>> >>> binaries do.
>>> >>>
>>> >>> Now, you can also install Hortonworks Spark RPM...
>>> >>>
>>> >>> For production, in my opinion, RPMs are better for manageability.
>>> >>>
>>>  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
>>>   wrote:
>>> 
>>>  Hello, thanks for your message... I'm confused, Hortonworhs suggest
>>>  install spark rpm on each node, but on Spark main page said that
>>> yarn
>>>  enough and I don't need to install it... What the difference?
>>> 
>>>  sent from my HTC
>>> 
>>> > On Jul 6, 2014 8:34 PM, "vs"  wrote:
>>> > Konstantin,
>>> >
>>> > HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
>>> can
>>> > try
>>> > from
>>> >
>>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>>> >
>>> > Let me know if you see issues with the tech preview.
>>> >
>>> > "spark PI example on HDP 2.0
>>> >
>>> > I downloaded spark 1.0 pre-build from
>>> > http://spark.apache.org/downloads.html
>>> > (for HDP2)
>>> > The run example from spark web-site:
>>> > ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>> > --master
>>> > yarn-cluster --num-executors 3 --driver-memory 2g
>>> --executor-memory 2g
>>> > --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
>>> >
>>> > I got error:
>>> > Application application_1404470405736_0044 failed 3 times due to AM
>>> > Container for appattempt_1404470405736_0044_03 exited with
>>> > exitCode: 1
>>> > due to: Exception from contain

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Krishna Sankar
Konstantin,

   1. You need to install the hadoop rpms on all nodes. If it is Hadoop 2,
   the nodes would have hdfs & YARN.
   2. Then you need to install Spark on all nodes. I haven't had experience
   with HDP, but the tech preview might have installed Spark as well.
   3. In the end, one should have hdfs,yarn & spark installed on all the
   nodes.
   4. After installations, check the web console to make sure hdfs, yarn &
   spark are running.
   5. Then you are ready to start experimenting/developing spark
   applications.

HTH.
Cheers



On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> guys, I'm not talking about running spark on VM, I don have problem with
> it.
>
> I confused in the next:
> 1) Hortonworks describe installation process as RPMs on each node
> 2) spark home page said that everything I need is YARN
>
> And I'm in stucj with understanding what I need to do to run spark on yarn
> (do I need RPMs installations or only build spark on edge node?)
>
>
> Thank you,
> Konstantin Kudryavtsev
>
>
> On Mon, Jul 7, 2014 at 4:34 AM, Robert James 
> wrote:
>
>> I can say from my experience that getting Spark to work with Hadoop 2
>> is not for the beginner; after solving one problem after another
>> (dependencies, scripts, etc.), I went back to Hadoop 1.
>>
>> Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
>> why, but, given so, Hadoop 2 has too many bumps
>>
>> On 7/6/14, Marco Shaw  wrote:
>> > That is confusing based on the context you provided.
>> >
>> > This might take more time than I can spare to try to understand.
>> >
>> > For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
>> >
>> > Cloudera's CDH 5 express VM includes Spark, but the service isn't
>> running by
>> > default.
>> >
>> > I can't remember for MapR...
>> >
>> > Marco
>> >
>> >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
>> >>  wrote:
>> >>
>> >> Marco,
>> >>
>> >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that
>> you
>> >> can try
>> >> from
>> >>
>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>> >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
>> >>
>> >> On other hand, http://spark.apache.org/ said "
>> >> Integrated with Hadoop
>> >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
>> >> existing Hadoop data.
>> >>
>> >> If you have a Hadoop 2 cluster, you can run Spark without any
>> installation
>> >> needed. "
>> >>
>> >> And this is confusing for me... do I need rpm installation on not?...
>> >>
>> >>
>> >> Thank you,
>> >> Konstantin Kudryavtsev
>> >>
>> >>
>> >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
>> >>> wrote:
>> >>> Can you provide links to the sections that are confusing?
>> >>>
>> >>> My understanding, the HDP1 binaries do not need YARN, while the HDP2
>> >>> binaries do.
>> >>>
>> >>> Now, you can also install Hortonworks Spark RPM...
>> >>>
>> >>> For production, in my opinion, RPMs are better for manageability.
>> >>>
>>  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
>>   wrote:
>> 
>>  Hello, thanks for your message... I'm confused, Hortonworhs suggest
>>  install spark rpm on each node, but on Spark main page said that yarn
>>  enough and I don't need to install it... What the difference?
>> 
>>  sent from my HTC
>> 
>> > On Jul 6, 2014 8:34 PM, "vs"  wrote:
>> > Konstantin,
>> >
>> > HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
>> can
>> > try
>> > from
>> >
>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>> >
>> > Let me know if you see issues with the tech preview.
>> >
>> > "spark PI example on HDP 2.0
>> >
>> > I downloaded spark 1.0 pre-build from
>> > http://spark.apache.org/downloads.html
>> > (for HDP2)
>> > The run example from spark web-site:
>> > ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>> > --master
>> > yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory
>> 2g
>> > --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
>> >
>> > I got error:
>> > Application application_1404470405736_0044 failed 3 times due to AM
>> > Container for appattempt_1404470405736_0044_03 exited with
>> > exitCode: 1
>> > due to: Exception from container-launch:
>> > org.apache.hadoop.util.Shell$ExitCodeException:
>> > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
>> > at org.apache.hadoop.util.Shell.run(Shell.java:379)
>> > at
>> >
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
>> > at
>> >
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>> > at
>> >
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLau

which Spark package(wrt. graphX) I should install to do graph computation on cluster?

2014-07-07 Thread Yifan LI
Hi,

I am planning to do graph (social network) computation on a cluster (Hadoop has
been installed), but while there is a "pre-built" package for Hadoop, I am NOT
sure whether GraphX is included in it.

Or should I install another released version (one that obviously includes
GraphX)?

Looking for your reply! :)

Best,
Yifan
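
GraphX ships as part of the main Spark assembly, so the pre-built packages should already include it; a minimal sanity check in spark-shell might look like the sketch below (Spark 1.0 GraphX API assumed; the edge data is made up for illustration):

import org.apache.spark.graphx.{Edge, Graph}

// A tiny made-up social graph: 1 follows 2, 2 follows 3.
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph.fromEdges(edges, defaultValue = "user")

// If this prints 3, GraphX is on the classpath and working.
println(graph.vertices.count())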

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-07 Thread sparkuser2345
Thank you for all the replies! 

Realizing that I can't distribute the modelling with different
cross-validation folds to the cluster nodes this way (but to the threads
only), I decided not to create nfolds data sets but to parallelize the
calculation (threadwise) over folds and to zip the original dataset with a
sequence of indices indicating fold division: 
 
import org.apache.spark.mllib.classification.SVMWithSGD

// orig_data (LabeledPoints) zipped with a fold index per point
val data = sc.parallelize(orig_data zip fold_division)

(1 to nfolds).par.map { fold_i =>
  val svmAlg    = new SVMWithSGD()
  val tr_data   = data.filter(x => x._2 != fold_i).map(x => x._1)
  val test_data = data.filter(x => x._2 == fold_i).map(x => x._1)
  val model     = svmAlg.run(tr_data)
  val labelAndPreds = test_data.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }
  // error rate on the held-out (test) fold
  val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / test_data.count
  trainErr
}

Really looking forward to the new functionalities in Spark 1.1!  
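
Building on the snippet above and on the single-pass idea Christopher Nguyen describes in the thread quoted below, a rough sketch of training the k models and then scoring every held-out fold in one pass over the data (everything beyond the names already used above - data, nfolds, sc - is an assumption):

import org.apache.spark.SparkContext._                      // pair-RDD functions (pre-1.3 Spark)
import org.apache.spark.mllib.classification.SVMWithSGD

// Train one model per fold; .par gives driver-side thread parallelism.
val models = (1 to nfolds).par.map { fold_i =>
  val trainingData = data.filter(_._2 != fold_i).map(_._1)
  fold_i -> new SVMWithSGD().run(trainingData)
}.toList.toMap

val bcModels = sc.broadcast(models)

// One pass over the zipped (point, fold) data: each point is scored by the
// model of its own fold, and errors are aggregated per fold.
val foldErrorRates = data.map { case (point, fold) =>
  val wrong = if (bcModels.value(fold).predict(point.features) != point.label) 1.0 else 0.0
  (fold, (wrong, 1L))
}.reduceByKey { case ((w1, n1), (w2, n2)) => (w1 + w2, n1 + n2) }
 .mapValues { case (wrong, n) => wrong / n }
 .collectAsMap()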



Nick Pentreath wrote
> For linear models the 3rd option is by far most efficient and I suspect
> what Evan is alluding to. 
> 
> 
> Unfortunately it's not directly possible with the classes in MLlib now, so
> you'll have to roll your own using underlying sgd / bfgs primitives.
> —
> Sent from Mailbox
> 
>> On Sat, Jul 5, 2014 at 10:45 AM, Christopher Nguyen <ctn@...> wrote:
> 
>> Hi sparkuser2345,
>> I'm inferring the problem statement is something like "how do I make this
>> complete faster (given my compute resources)?"
>> Several comments.
>> First, Spark only allows launching parallel tasks from the driver, not
>> from
>> workers, which is why you're seeing the exception when you try. Whether
>> the
>> latter is a sensible/doable idea is another discussion, but I can
>> appreciate why many people assume this should be possible.
>> Second, on optimization, you may be able to apply Sean's idea about
>> (thread) parallelism at the driver, combined with the knowledge that
>> often
>> these cluster tasks bottleneck while competing for the same resources at
>> the same time (cpu vs disk vs network, etc.) You may be able to achieve
>> some performance optimization by randomizing these timings. This is not
>> unlike GMail randomizing user storage locations around the world for load
>> balancing. Here, you would partition each of your RDDs into a different
>> number of partitions, making some tasks larger than others, and thus some
>> may be in cpu-intensive map while others are shuffling data around the
>> network. This is rather cluster-specific; I'd be interested in what you
>> learn from such an exercise.
>> Third, I find it useful always to consider doing as much as possible in
>> one
>> pass, subject to memory limits, e.g., mapPartitions() vs map(), thus
>> minimizing map/shuffle/reduce boundaries with their context switches and
>> data shuffling. In this case, notice how you're running the
>> training+prediction k times over mostly the same rows, with map/reduce
>> boundaries in between. While the training phase is sealed in this
>> context,
>> you may be able to improve performance by collecting all the k models
>> together, and do a [m x k] predictions all at once which may end up being
>> faster.
>> Finally, as implied from the above, for the very common k-fold
>> cross-validation pattern, the algorithm itself might be written to be
>> smart
>> enough to take both train and test data and "do the right thing" within
>> itself, thus obviating the need for the user to prepare k data sets and
>> running over them serially, and likely saving a lot of repeated
>> computations in the right internal places.
>> Enjoy,
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao
>> linkedin.com/in/ctnguyen
>> On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen <sowen@...> wrote:
>>> If you call .par on data_kfolded it will become a parallel collection in
>>> Scala and so the maps will happen in parallel .
>>> On Jul 5, 2014 9:35 AM, "sparkuser2345" <hm.spark.user@...> wrote:
>>>
 Hi,

 I am trying to fit a logistic regression model with cross validation in
 Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded
 where
 each element is a pair of RDDs containing the training and test data:

 (training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint],
 test_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint])

 scala> data_kfolded
 res21: Array[(org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint])] =
 Array((MappedRDD[9] at map at <console>:24,MappedRDD[7] at map at <console>:23), (MappedRDD[13] at map at <console>:24,MappedRDD[11] at map at <console>:23), (MappedRDD[17] at map at <console>:24,MappedRDD[15] at map at <console>:23))

 Everything works fine when using data_kfolded:

 val validationErrors =
 data_kf

Re: Spark memory optimization

2014-07-07 Thread Igor Pernek
Thanks guys! Actually, I'm not doing any caching (at least I'm not calling
cache/persist) - do I still need to use the DISK_ONLY storage level?
However, I do use reduceByKey and sortByKey. Mayur, you mentioned that
sortByKey requires data to fit in memory. Is there any way to work around
this (maybe by increasing the number of partitions or something similar)?

What alternative would you suggest if Spark is not the way to go with this
kind of scenario? As mentioned, what I like about Spark is its high level
of abstraction over parallelization. I'm ready to sacrifice speed (if the
slowdown is not too big - I'm doing batch processing, nothing real-time)
for code simplicity and readability.
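
If it helps, a rough sketch of the partition-count idea for the sort (the input path, the key extraction, and the partition count of 400 are all assumptions):

import org.apache.spark.SparkContext._            // pair-RDD functions (pre-1.3 Spark)
import org.apache.spark.storage.StorageLevel

// Assumed input: TSV lines keyed by their first column.
val lines = sc.textFile("/data/tsv-folder")
val pairs = lines.map { line => (line.split("\t")(0), line) }
  .persist(StorageLevel.DISK_ONLY)                // cached blocks spill to disk, not RAM

// More output partitions => smaller per-task data sets during the shuffle/sort.
val sorted = pairs.sortByKey(ascending = true, numPartitions = 400)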


On Fri, Jul 4, 2014 at 3:16 PM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:
>
> When using DISK_ONLY, keep in mind that disk I/O is pretty high. Make
sure you are writing to multiple disks for best operation. And even with
DISK_ONLY, we've found that there is a minimum threshold for executor ram
(spark.executor.memory), which for us seemed to be around 8 GB.
>
> If you find that, with enough disks, you still have errors/exceptions
getting the flow to finish, first check iostat to see if disk is the
bottleneck.
>
> Then, you may want to try tuning some or all of the following, which
affect buffers and timeouts. For us, because we did not have enough disks
to start out, the io bottleneck caused timeouts and other errors. In the
end, IMHO, it's probably best to solve the problem by adding disks rather than by
tuning the parameters, because it seemed that the i/o bottlenecks
eventually backed up the processing.
>
> //conf.set("spark.shuffle.consolidateFiles","true")
>
> //conf.set("spark.shuffle.file.buffer.kb", "200")// does
doubling this help? should increase in-memory buffer to decrease disk writes
> //conf.set("spark.reducer.maxMbInFlight", "96") // does
doubling this help? should allow for more simultaneous shuffle data to be
read from remotes
>
> // because we use disk-only, we should be able to reverse the
default memory usage settings
> //conf.set("spark.shuffle.memoryFraction","0.6") // default 0.3
> //conf.set("spark.storage.memoryFraction","0.3")   // default 0.6
>
> //conf.set("spark.worker.timeout","180")
>
> // akka settings
> //conf.set("spark.akka.threads", "300")   // number of akka actors
> //conf.set("spark.akka.timeout", "180")   // we saw a problem
with smaller numbers
> //conf.set("spark.akka.frameSize", "100")  // not sure if we need
to up this. Default is 10.
> //conf.set("spark.akka.batchSize", "30")
> //conf.set("spark.akka.askTimeout", "30") // supposedly this is
important for high cpu/io load
>
> // block manager
> //conf.set("spark.storage.blockManagerTimeoutIntervalMs",
"18")
> //conf.set("spark.blockManagerHeartBeatMs", "8")
>
>
>
>
>
>
> On Fri, Jul 4, 2014 at 8:52 AM, Mayur Rustagi 
wrote:
>>
>> I would go with Spark only if you are certain that you are going to
scale out in the near future.
>> You can change the default storage of RDD to DISK_ONLY, that might
remove issues around any RDD leveraging memory. There are some functions,
particularly sortByKey, that require data to fit in memory to work, so you
may be hitting some of those walls too.
>> Regards
>> Mayur
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi
>>
>>
>>
>> On Fri, Jul 4, 2014 at 2:36 PM, Igor Pernek  wrote:
>>>
>>> Hi all!
>>>
>>> I have a folder with 150 G of txt files (around 700 files, on average
each 200 MB).
>>>
>>> I'm using scala to process the files and calculate some aggregate
statistics in the end. I see two possible approaches to do that: - manually
loop through all the files, do the calculations per file and merge the
results in the end - read the whole folder to one RDD, do all the
operations on this single RDD and let spark do all the parallelization
>>>
>>> I'm leaning towards the second approach as it seems cleaner (no need
for parallelization specific code), but I'm wondering if my scenario will
fit the constraints imposed by my hardware and data. I have one workstation
with 16 threads and 64 GB of RAM available (so the parallelization will be
strictly local between different processor cores). I might scale the
infrastructure with more machines later on, but for now I would just like
to focus on tunning the settings for this one workstation scenario.
>>>
>>> The code I'm using: - reads TSV files, and extracts meaningful data to
(String, String, String) triplets - afterwards some filtering, mapping and
grouping is performed - finally, the data is reduced and some aggregates
are calculated
>>>
>>> I've been able to run this code with a single file (~200 MB of data),
however I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java out of heap exception when adding more data (the application
breaks
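
For reference, a bare-bones sketch of the second approach described above - one RDD for the whole folder, with an explicit partition count to keep per-task data small (the path, column positions, and partition count are assumptions):

import org.apache.spark.SparkContext._                  // pair-RDD functions (pre-1.3 Spark)

// Read every TSV file in the folder as a single RDD, split into many partitions.
val lines = sc.textFile("/data/tsv-folder", 512)

// Extract (String, String, String) triplets, skipping malformed rows.
val triplets = lines.map(_.split("\t")).collect {
  case cols if cols.length >= 3 => (cols(0), cols(1), cols(2))
}

// Example aggregate: count occurrences of each (first, second) column pair.
val counts = triplets.map { case (a, b, _) => ((a, b), 1L) }.reduceByKey(_ + _)
counts.take(10).foreach(println)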

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Guys, I'm not talking about running Spark on a VM, I don't have a problem with that.

What confuses me is the following:
1) Hortonworks describes the installation process as RPMs on each node
2) The Spark home page says that everything I need is YARN

And I'm stuck on understanding what I need to do to run Spark on YARN
(do I need the RPM installations, or only to build Spark on an edge node?)


Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 4:34 AM, Robert James  wrote:

> I can say from my experience that getting Spark to work with Hadoop 2
> is not for the beginner; after solving one problem after another
> (dependencies, scripts, etc.), I went back to Hadoop 1.
>
> Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
> why, but, given that, Hadoop 2 has too many bumps.
>
> On 7/6/14, Marco Shaw  wrote:
> > That is confusing based on the context you provided.
> >
> > This might take more time than I can spare to try to understand.
> >
> > For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
> >
> > Cloudera's CDH 5 express VM includes Spark, but the service isn't
> running by
> > default.
> >
> > I can't remember for MapR...
> >
> > Marco
> >
> >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
> >>  wrote:
> >>
> >> Marco,
> >>
> >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
> >> can try
> >> from
> >>
> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
> >>  HDP 2.1 means YARN; at the same time they propose to install an rpm
> >>
> >> On other hand, http://spark.apache.org/ said "
> >> Integrated with Hadoop
> >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
> >> existing Hadoop data.
> >>
> >> If you have a Hadoop 2 cluster, you can run Spark without any
> installation
> >> needed. "
> >>
> >> And this is confusing for me... do I need the rpm installation or not?...
> >>
> >>
> >> Thank you,
> >> Konstantin Kudryavtsev
> >>
> >>
> >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
> >>> wrote:
> >>> Can you provide links to the sections that are confusing?
> >>>
> >>> My understanding is that the HDP1 binaries do not need YARN, while the HDP2
> >>> binaries do.
> >>>
> >>> Now, you can also install Hortonworks Spark RPM...
> >>>
> >>> For production, in my opinion, RPMs are better for manageability.
> >>>
>  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
>   wrote:
> 
>  Hello, thanks for your message... I'm confused: Hortonworks suggests
>  installing the Spark rpm on each node, but the Spark main page says that
>  YARN is enough and I don't need to install it... What is the difference?
> 
>  sent from my HTC
> 
> > On Jul 6, 2014 8:34 PM, "vs"  wrote:
> > Konstantin,
> >
> > HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can
> > try
> > from
> >
> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
> >
> > Let me know if you see issues with the tech preview.
> >
> > "spark PI example on HDP 2.0
> >
> > I downloaded spark 1.0 pre-build from
> > http://spark.apache.org/downloads.html
> > (for HDP2)
> > The run example from spark web-site:
> > ./bin/spark-submit --class org.apache.spark.examples.SparkPi
> > --master
> > yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory
> 2g
> > --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
> >
> > I got error:
> > Application application_1404470405736_0044 failed 3 times due to AM
> > Container for appattempt_1404470405736_0044_03 exited with
> > exitCode: 1
> > due to: Exception from container-launch:
> > org.apache.hadoop.util.Shell$ExitCodeException:
> > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> > at org.apache.hadoop.util.Shell.run(Shell.java:379)
> > at
> >
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> > at
> >
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
> > at
> >
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
> > at
> >
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:744)
> > .Failing this attempt.. Failing the application.
> >
> > Unknown/unsupported param List(--executor-memory, 2048,
> > --executor-cores, 1,
> > --num-executors, 3)
> > Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
> > Options:
> >   --jar JAR_PATH 

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-07 Thread Eustache DIEMERT
Ok, I've tried to add the intercept term myself (code here [1]), but with
no luck.

It seems that adding a column of ones doesn't help with convergence either.

I may have missed something in the coding as I'm quite a noob in Scala, but
printing the data seems to indicate I succeeded in adding the ones column.

Has anyone here had success with this code on real-world datasets?

[1] https://github.com/oddskool/mllib-samples/tree/ridge (in the ridge
branch)
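
For reference, the "column of ones" trick described above can be sketched like this (Spark 1.0 Vector-based MLlib API assumed; whether it actually helps convergence is, as noted, another matter):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Append a constant 1.0 feature to every point so the weight learned for it
// plays the role of the intercept that RidgeRegressionWithSGD.train() does not fit.
def withBiasColumn(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
  data.map(p => LabeledPoint(p.label, Vectors.dense(p.features.toArray :+ 1.0)))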




2014-07-07 9:08 GMT+02:00 Eustache DIEMERT :

> Well, why not, but IMHO MLlib Logistic Regression is unusable right now.
> The inability to use the intercept is just a no-go. I could hack a column of
> ones to inject the intercept into the data but frankly it's a pity to have
> to do so.
>
>
> 2014-07-05 23:04 GMT+02:00 DB Tsai :
>
> You may try LBFGS to have more stable convergence. In spark 1.1, we will
>> be able to use LBFGS instead of GD in training process.
>> On Jul 4, 2014 1:23 PM, "Thomas Robert"  wrote:
>>
>>> Hi all,
>>>
>>> I too am having some issues with *RegressionWithSGD algorithms.
>>>
>>> Concerning your issue Eustache, this could be due to the fact that these
>>> regression algorithms use a fixed step (that is divided by
>>> sqrt(iteration)). During my tests, quite often, the algorithm diverged to an
>>> infinite cost, I guess because the step was too big. I reduced it and
>>> managed to get good results on a very simple generated dataset.
>>>
>>> But I was wondering if anyone here had some advice concerning the use
>>> of these regression algorithms, for example how to choose a good step and
>>> number of iterations? I wonder if I'm using those right...
>>>
>>> Thanks,
>>>
>>> --
>>>
>>> *Thomas ROBERT*
>>> www.creativedata.fr
>>>
>>>
>>> 2014-07-03 16:16 GMT+02:00 Eustache DIEMERT :
>>>
 Printing the model show the intercept is always 0 :(

 Should I open a bug for that ?


 2014-07-02 16:11 GMT+02:00 Eustache DIEMERT :

> Hi list,
>
> I'm benchmarking MLlib for a regression task [1] and get strange
> results.
>
> Namely, using RidgeRegressionWithSGD it seems the predicted points
> miss the intercept:
>
> {code}
> val trainedModel = RidgeRegressionWithSGD.train(trainingData, 1000)
> ...
> valuesAndPreds.take(10).map(t => println(t))
> {code}
>
> output:
>
> (2007.0,-3.784588726958493E75)
> (2003.0,-1.9562390324037716E75)
> (2005.0,-4.147413202985629E75)
> (2003.0,-1.524938024096847E75)
> ...
>
> If I change the parameters (step size, regularization and iterations)
> I get NaNs more often than not:
> (2007.0,NaN)
> (2003.0,NaN)
> (2005.0,NaN)
> ...
>
> On the other hand, the DecisionTree model gives sensible results.
>
> I see there is a `setIntercept()` method in abstract class
> GeneralizedLinearAlgorithm that seems to trigger the use of the intercept
> but I'm unable to use it from the public interface :(
>
> Any help appreciated :)
>
> Eustache
>
> [1] https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD
>

>>>
>


Re: Broadcast variable in Spark Java application

2014-07-07 Thread Cesar Arevalo
Hi Praveen:

It may be easier for other people to help you if you provide more details about 
what you are doing. It may be worthwhile to also mention which spark version 
you are using. And if you can share the code which doesn't work for you, that 
may also give others more clues as to what you are doing wrong.

I've found that following the spark programming guide online usually gives me 
enough information, but I guess you've already tried that.

Best,
-Cesar

> On Jul 7, 2014, at 12:41 AM, Praveen R  wrote:
> 
> I need a variable to be broadcast from the driver to the executor processes in my
> Spark Java application. I tried using the Spark broadcast mechanism to achieve
> this, but no luck there.
> 
> Could someone help me do this, and perhaps share some code?
> 
> Thanks,
> Praveen R


Broadcast variable in Spark Java application

2014-07-07 Thread Praveen R
I need a variable to be broadcast from the driver to the executor processes in my
Spark Java application. I tried using the Spark broadcast mechanism to achieve
this, but no luck there.

Could someone help me do this, and perhaps share some code?

Thanks,
Praveen R
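
For what it's worth, the mechanism looks roughly like the sketch below (shown in Scala to match the other examples in this digest; the Java API has the same shape via JavaSparkContext.broadcast(...) and Broadcast.value()). The lookup table and IDs are made up for illustration:

// Driver-side data that every task needs read-only access to.
val lookup = Map("a" -> 1, "b" -> 2)

// Ship it once per executor instead of once per task closure.
val bcLookup = sc.broadcast(lookup)

val ids = sc.parallelize(Seq("a", "b", "c"))
val resolved = ids.map(id => (id, bcLookup.value.getOrElse(id, -1)))  // read .value inside tasks
resolved.collect().foreach(println)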


Dense to sparse vector converter

2014-07-07 Thread Ulanov, Alexander
Hi,

Is there a method in Spark/MLlib to convert DenseVector to SparseVector?

Best regards, Alexander
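
Assuming no built-in converter in the MLlib version at hand, a small helper along these lines (a sketch, not an official API) does the job:

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}

// Keep only the non-zero entries of a DenseVector.
def toSparse(v: DenseVector): SparseVector = {
  val nonZero = v.toArray.zipWithIndex.filter { case (value, _) => value != 0.0 }
  val indices = nonZero.map(_._2)
  val values  = nonZero.map(_._1)
  Vectors.sparse(v.size, indices, values).asInstanceOf[SparseVector]
}

// Example: toSparse(new DenseVector(Array(0.0, 3.5, 0.0, 1.2)))
//          => (4,[1,3],[3.5,1.2])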


Re: Spark SQL user defined functions

2014-07-07 Thread Martin Gammelsæter
Hi again, and thanks for your reply!

On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust  wrote:
>
>> Sweet. Any idea about when this will be merged into master?
>
>
> It is probably going to be a couple of weeks.  There is a fair amount of
> cleanup that needs to be done.  It works though and we used it in most of
> the demos at the spark summit.  Mostly I just need to add tests and move it
> out of HiveContext (there is no good reason for that code to depend on
> HiveContext). So you could also just try working with that branch.
>
>>
>> This is probably a stupid question, but can you query Spark SQL tables
>> from a (local?) hive context? In which case using that could be a
>> workaround until the PR is merged.
>
>
> Yeah, this is kind of subtle.  In a HiveContext, SQL Tables are just an
> additional catalog that sits on top of the metastore.  All the query
> execution occurs in the same code path, including the use of the Hive
> Function Registry, independent of where the table comes from.  So for your
> use case you can just create a hive context, which will create a local
> metastore automatically if no hive-site.xml is present.

Nice, that sounds like it'll solve my problems. Just for clarity, are
LocalHiveContext and HiveContext equivalent if no hive-site.xml is present,
or are there still differences?

-- 
Best regards,
Martin Gammelsæter
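
The workflow Michael describes might look roughly like the sketch below (Spark 1.0 method names assumed, e.g. hql and registerAsTable, which were later renamed; the Person schema and data are made up):

import org.apache.spark.sql.hive.HiveContext

// With no hive-site.xml on the classpath, this creates a local metastore.
val hiveContext = new HiveContext(sc)
import hiveContext._                                  // brings in createSchemaRDD

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 41)))
people.registerAsTable("people")                      // registers a Spark SQL table

// Queries run through the same path (including the Hive function registry)
// regardless of where the table comes from.
val adults = hiveContext.hql("SELECT name FROM people WHERE age > 35")
adults.collect().foreach(println)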


Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-07 Thread Eustache DIEMERT
Well, why not, but IMHO MLlib Logistic Regression is unusable right now.
The inability to use the intercept is just a no-go. I could hack a column of
ones to inject the intercept into the data but frankly it's a pity to have
to do so.


2014-07-05 23:04 GMT+02:00 DB Tsai :

> You may try LBFGS to have more stable convergence. In spark 1.1, we will
> be able to use LBFGS instead of GD in training process.
> On Jul 4, 2014 1:23 PM, "Thomas Robert"  wrote:
>
>> Hi all,
>>
>> I too am having some issues with *RegressionWithSGD algorithms.
>>
>> Concerning your issue Eustache, this could be due to the fact that these
>> regression algorithms use a fixed step (that is divided by
>> sqrt(iteration)). During my tests, quite often, the algorithm diverged to an
>> infinite cost, I guess because the step was too big. I reduced it and
>> managed to get good results on a very simple generated dataset.
>>
>> But I was wondering if anyone here had some advice concerning the use of
>> these regression algorithms, for example how to choose a good step and
>> number of iterations? I wonder if I'm using those right...
>>
>> Thanks,
>>
>> --
>>
>> *Thomas ROBERT*
>> www.creativedata.fr
>>
>>
>> 2014-07-03 16:16 GMT+02:00 Eustache DIEMERT :
>>
>>> Printing the model show the intercept is always 0 :(
>>>
>>> Should I open a bug for that ?
>>>
>>>
>>> 2014-07-02 16:11 GMT+02:00 Eustache DIEMERT :
>>>
 Hi list,

 I'm benchmarking MLlib for a regression task [1] and get strange
 results.

 Namely, using RidgeRegressionWithSGD it seems the predicted points miss
 the intercept:

 {code}
 val trainedModel = RidgeRegressionWithSGD.train(trainingData, 1000)
 ...
 valuesAndPreds.take(10).map(t => println(t))
 {code}

 output:

 (2007.0,-3.784588726958493E75)
 (2003.0,-1.9562390324037716E75)
 (2005.0,-4.147413202985629E75)
 (2003.0,-1.524938024096847E75)
 ...

 If I change the parameters (step size, regularization and iterations) I
 get NaNs more often than not:
 (2007.0,NaN)
 (2003.0,NaN)
 (2005.0,NaN)
 ...

 On the other hand, the DecisionTree model gives sensible results.

 I see there is a `setIntercept()` method in abstract class
 GeneralizedLinearAlgorithm that seems to trigger the use of the intercept
 but I'm unable to use it from the public interface :(

 Any help appreciated :)

 Eustache

 [1] https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD

>>>
>>

