RE: How to compile Spark with private build of Hadoop

2016-03-08 Thread Lu, Yingqi
Thank you for the quick reply. I am very new to Maven and always use the 
default settings. Could you please be a little more specific in the instructions?

I think all the jar files from the Hadoop build are located under 
Hadoop-3.0.0-SNAPSHOT/share/hadoop. Which ones do I need to use to compile Spark, 
and how should I change the pom.xml?

Thanks,
Lucy

From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Monday, March 07, 2016 11:15 PM
To: Lu, Yingqi ; user 
Subject: Re: How to compile Spark with private build of Hadoop

I think you can set up your own Maven repository and deploy your modified 
Hadoop binary jars there under your own version number. Then you can add your 
repository to the Spark pom.xml and build with
mvn -Dhadoop.version=


fightf...@163.com

From: Lu, Yingqi
Date: 2016-03-08 15:09
To: user@spark.apache.org
Subject: How to compile Spark with private build of Hadoop
Hi All,

I am new to Spark and I have a question about compiling it. I have modified the 
trunk version of the Hadoop source code. How can I compile Spark (standalone mode) 
against my modified version of Hadoop (HDFS, Hadoop-common, etc.)?

Thanks a lot for your help!

Thanks,
Lucy






Re: Spark Twitter streaming

2016-03-08 Thread Imre Nagi
Do you mean listening to the twitter stream data? Maybe you can use the
Twitter Stream API or Twitter Search API for this purpose.

Imre

On Tue, Mar 8, 2016 at 2:54 PM, Soni spark  wrote:

> Hello friends,
>
> I need urgent help.
>
> I am using Spark Streaming to get tweets from Twitter and load the data into
> HDFS. I want to find out the tweet source, i.e. whether it came from the web,
> mobile web, Facebook, etc. Could you please help me with the logic?
>
> Thanks
> Soniya
>


Re: OOM exception during Broadcast

2016-03-08 Thread Olivier Girardot
Java's default serialization is not the most efficient way of handling
ser/deser. Did you try switching to Kryo serialization?
c.f.
https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/
if you need a tutorial.

This should help in terms of both CPU and memory usage, if you want to
achieve best performance turn on registration too and register manually
what you need.
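
A minimal sketch of that Kryo setup, assuming plain SparkConf-based configuration
(the registered classes are just illustrative examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // optional: fail fast on anything that has not been registered
  .set("spark.kryo.registrationRequired", "true")
  // register the classes you actually ship around (illustrative examples)
  .registerKryoClasses(Array(
    classOf[Array[String]],
    classOf[scala.collection.immutable.HashMap[String, String]]))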
Plus if you have some kind of key value clearly partitioned you can try
https://github.com/amplab/spark-indexedrdd

cheers,
Olivier.

2016-03-08 6:09 GMT+01:00 Arash :

> The driver memory is set at 40G and OOM seems to be happening on the
> executors. I might try a different broadcast block size (vs 4m) as Takeshi
> suggested to see if it makes a difference.
>
> On Mon, Mar 7, 2016 at 6:54 PM, Tristan Nixon 
> wrote:
>
>> Yeah, the spark engine is pretty clever and it's best not to prematurely
>> optimize. It would be interesting to profile your join vs. the collect on
>> the smaller dataset. I suspect that the join is faster (even before you
>> broadcast it back out).
>>
>> I’m also curious about the broadcast OOM - did you try expanding the
>> driver memory?
>>
>> On Mar 7, 2016, at 8:28 PM, Arash  wrote:
>>
>> So I just implemented the logic through a standard join (without collect
>> and broadcast) and it's working great.
>>
>> The idea behind trying the broadcast was that since the other side of
>> join is a much larger dataset, the process might be faster through collect
>> and broadcast, since it avoids the shuffle of the bigger dataset.
>>
>> I think the join is working much better in this case so I'll probably
>> just use that, still a bit curious as why the error is happening.
>>
>> On Mon, Mar 7, 2016 at 5:55 PM, Tristan Nixon 
>> wrote:
>>
>>> I’m not sure I understand - if it was already distributed over the
>>> cluster in an RDD, why would you want to collect and then re-send it as a
>>> broadcast variable? Why not simply use the RDD that is already distributed
>>> on the worker nodes?
>>>
>>> On Mar 7, 2016, at 7:44 PM, Arash  wrote:
>>>
>>> Hi Tristan,
>>>
>>> This is not static, I actually collect it from an RDD to the driver.
>>>
>>> On Mon, Mar 7, 2016 at 5:42 PM, Tristan Nixon 
>>> wrote:
>>>
 Hi Arash,

 is this static data?  Have you considered including it in your jars and
 de-serializing it from jar on each worker node?
 It’s not pretty, but it’s a workaround for serialization troubles.

 On Mar 7, 2016, at 5:29 PM, Arash  wrote:

 Hello all,

 I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes
 but haven't been able to make it work so far.

 It looks like the executors start to run out of memory during
 deserialization. This behavior only shows itself when the number of
 partitions is above a few 10s, the broadcast does work for 10 or 20
 partitions.

 I'm using the following setup to observe the problem:

 val tuples: Array[((String, String), (String, String))]  // ~ 10M
 tuples
 val tuplesBc = sc.broadcast(tuples)
 val numsRdd = sc.parallelize(1 to 5000, 100)
 numsRdd.map(n => tuplesBc.value.head).count()

 If I set the number of partitions for numsRDD to 20, the count goes
 through successfully, but at 100, I'll start to get errors such as:

 16/03/07 19:35:32 WARN scheduler.TaskSetManager: Lost task 77.0 in
 stage 1.0 (TID 1677, xxx.ec2.internal): java.lang.OutOfMemoryError: Java
 heap space
 at
 java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3472)
 at
 java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3278)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1789)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
 scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:

Re: Best way to merge files from streaming jobs

2016-03-08 Thread Sumedh Wale

On Saturday 05 March 2016 02:39 AM, Jelez Raditchkov wrote:

> My streaming job is creating files on S3. The problem is that those files end
> up very small if I just write them to S3 directly. This is why I use
> coalesce() to reduce the number of files and make them larger.

RDD.coalesce, right? It accepts whether or not to shuffle as an argument, and if
you are only reducing the number of partitions it should not cause a shuffle.

dstream.foreachRDD { rdd =>
  val numParts = rdd.partitions.length
  // halve the partitions; the coalesced RDD still needs an action (e.g. a save)
  val coalesced = rdd.coalesce(numParts / 2, shuffle = false)
  // coalesced.saveAsTextFile(...) or a similar write goes here
}

Another option is to combine multiple RDDs. Set an appropriate remember duration
(StreamingContext.remember), store the RDDs in a fixed-size list/array, and then
process all the cached RDDs in one go whenever the list is full (combining them
with RDD.zipPartitions). You may have to keep the remember duration somewhat
larger than the duration corresponding to the list size, to account for
processing time.
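
A minimal sketch of that buffering idea, assuming every batch RDD has the same
number of partitions (which zipPartitions requires) and that remember() has been
set to cover at least the buffered batches:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

val buffer = ArrayBuffer[RDD[String]]()
dstream.foreachRDD { rdd =>
  buffer += rdd
  if (buffer.size == 4) {                        // combine every 4 batches (illustrative)
    val combined = buffer.reduce { (a, b) =>
      a.zipPartitions(b)((i1, i2) => i1 ++ i2)   // concatenate matching partitions
    }
    // combined has the same partition count but ~4x the data per partition;
    // write it out here, then reset the buffer
    buffer.clear()
  }
}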


  
> However, coalesce shuffles data and my job processing time ends up higher than
> sparkBatchIntervalMilliseconds.
>
> I have observed that if I coalesce the number of partitions to be equal to the
> cores in the cluster I get less shuffling - but that is unsubstantiated.
> Is there any dependency/rule between number of executors, number of cores etc.
> that I can use to minimize shuffling and at the same time achieve a minimum
> number of output files per batch? What is the best practice?


I think most DStreams (Kafka direct streams can be an exception) will create as
many partitions as the total number of executor cores (spark.default.parallelism).
Perhaps that is why you are seeing the above behaviour. It looks like the shuffle
should be avoidable in your case, but if you coalesce down to fewer partitions you
will likely not use the full processing power.


thanks
-- 
Sumedh Wale
SnappyData (http://www.snappydata.io)
  


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SSL support for Spark Thrift Server

2016-03-08 Thread Sumedh Wale

On Saturday 05 March 2016 02:46 AM, Sourav Mazumder wrote:

Hi All,

While starting the Spark Thrift Server I don't see any option to start 
it with SSL support.


Is that support currently there ?



It uses HiveServer2 so the SSL settings in hive-site.xml should work: 
https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-SSLEncryption



Regards,
Sourav


thanks

--
Sumedh Wale
SnappyData (http://www.snappydata.io)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to compile Spark with private build of Hadoop

2016-03-08 Thread Saisai Shao
I think the first step is to publish your in-house Hadoop jars to your local
Maven or Ivy repo, and then change the Spark build profiles, e.g. -Phadoop-2.x
(you could use 2.7, or you may have to change the pom file if you hit jar
conflicts) plus -Dhadoop.version=3.0.0-SNAPSHOT, to build against your specified
version.

Thanks
Saisai


Re: RE: How to compile Spark with private build of Hadoop

2016-03-08 Thread fightf...@163.com
Hi, there,
You may try using Nexus to set up a local Maven repository. I think this link 
would be helpful:
http://www.sonatype.org/nexus/2015/02/27/setup-local-nexus-repository-and-deploy-war-file-from-maven/

Once the repository is set up, you can use the maven-deploy-plugin to deploy your 
customized Hadoop jars and their pom.xml files to the Nexus repository. Check this 
link for reference:
https://books.sonatype.com/nexus-book/reference/staging-deployment.html




fightf...@163.com
 
From: Lu, Yingqi
Date: 2016-03-08 15:23
To: fightf...@163.com; user
Subject: RE: How to compile Spark with private build of Hadoop
Thank you for the quick reply. I am very new to maven and always use the 
default settings. Can you please be a little more specific on the instructions?
 
I think all the jar files from Hadoop build are located at 
Hadoop-3.0.0-SNAPSHOT/share/hadoop. Which ones I need to use to compile Spark 
and how can I change the pom.xml?
 
Thanks,
Lucy
 
From: fightf...@163.com [mailto:fightf...@163.com] 
Sent: Monday, March 07, 2016 11:15 PM
To: Lu, Yingqi ; user 
Subject: Re: How to compile Spark with private build of Hadoop
 
I think you can establish your own maven repository and deploy your modified 
hadoop binary jar 
with your modified version number. Then you can add your repository in spark 
pom.xml and use 
mvn -Dhadoop.version=
 


fightf...@163.com
 
From: Lu, Yingqi
Date: 2016-03-08 15:09
To: user@spark.apache.org
Subject: How to compile Spark with private build of Hadoop
Hi All,
 
I am new to Spark and I have a question regarding to compile Spark. I modified 
trunk version of Hadoop source code. How can I compile Spark (standalone mode) 
with my modified version of Hadoop (HDFS, Hadoop-common and etc.)?
 
Thanks a lot for your help!
 
Thanks,
Lucy
 
 
 
 


Spark structured streaming

2016-03-08 Thread Praveen Devarao
Hi,

I would like to get my hands on the structured streaming feature 
coming out in Spark 2.0. I have tried looking around for code samples to 
get started but am not able to find any. The only things I could look into 
are the test cases that have been committed under the JIRA umbrella 
https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't 
lead to an example, as they seem to work off internal classes.

Could anyone point me to some resources, or pointers in the code, that I 
can start with to understand structured streaming from a consumability 
angle?

Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



Re: how to implement ALS with csv file? getting error while calling Rating class

2016-03-08 Thread Nick Pentreath
As I mentioned, using that *train* method returns the user and item factor
RDDs, as opposed to an ALSModel instance. You first need to construct a
model manually yourself. This is exactly why it's marked as *DeveloperApi*,
since it is not user-friendly and not strictly part of the ML pipeline
approach.

If you really want to use it, this should work:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating

val conf = new SparkConf().setAppName("ALSWithStringID").setMaster("local[4]")
val sc = new SparkContext(conf)
val sql = new SQLContext(sc)
// Name,Value1,Value2.
val rdd = sc.parallelize(Seq(
  Rating[String]("foo", "1", 4.0f),
  Rating[String]("foo", "2", 2.0f),
  Rating[String]("bar", "1", 5.0f),
  Rating[String]("bar", "3", 1.0f)
))
val als = new ALS()
val (userFactors, itemFactors) = ALS.train(rdd)   // note have not
synced up training params with ALS instance params above.

import sql.implicits._
val userDF = userFactors.toDF("id", "features")
val itemDF = itemFactors.toDF("id", "features")
val model = new ALSModel(als.uid, als.getRank, userDF, itemDF)
  .setParent(als)
  .setUserCol("user")
  .setItemCol("item")

val pred = model.transform(rdd.toDF("user", "item", "rating"))
println(pred.show())


Note that you will need to be careful to sync up parameters between the
ALS.train and ALS instance and ALSModel. Note also that ml.ALS only
supports *transform* (which makes predictions for a set of user and item
columns in a DataFrame), and doesn't yet support the other predict methods
available in mllib.ALS


On Mon, 7 Mar 2016 at 21:25 Shishir Anshuman 
wrote:

> Hello Nick,
>
> I used *ml* instead of *mllib* for ALS and Rating. But now it gives me an
> error while using *predict()* from
> *org.apache.spark.mllib.recommendation.MatrixFactorizationModel.*
>
> I have attached the code and the error screenshot.
>
> Thank you.
>
> On Mon, Mar 7, 2016 at 12:40 PM, Nick Pentreath 
> wrote:
>
>> As you've pointed out, Rating requires user and item ids in Int form. So
>> you will need to map String user ids to integers.
>>
>> See this thread for example:
>> https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJgQjQ9GhGqpg1=hvxpfrs+59elfj9f7knhp8nyqnh1ut_6...@mail.gmail.com%3E
>> .
>>
>> There is a DeveloperApi method
>> in org.apache.spark.ml.recommendation.ALS that takes Rating with generic
>> type (can be String) for user id and item id. However that is a little more
>> involved, and for larger scale data will be a lot less efficient.
>>
>> Something like this for example:
>>
>> import org.apache.spark.ml.recommendation.ALS
>> import org.apache.spark.ml.recommendation.ALS.Rating
>>
>> val conf = new 
>> SparkConf().setAppName("ALSWithStringID").setMaster("local[4]")
>> val sc = new SparkContext(conf)
>> // Name,Value1,Value2.
>> val rdd = sc.parallelize(Seq(
>>   Rating[String]("foo", "1", 4.0f),
>>   Rating[String]("foo", "2", 2.0f),
>>   Rating[String]("bar", "1", 5.0f),
>>   Rating[String]("bar", "3", 1.0f)
>> ))
>> val (userFactors, itemFactors) = ALS.train(rdd)
>>
>>
>> As you can see, you just get the factor RDDs back, and if you want an
>> ALSModel you will have to construct it yourself.
>>
>>
>> On Sun, 6 Mar 2016 at 18:23 Shishir Anshuman 
>> wrote:
>>
>>> I am new to apache Spark, and I want to implement the Alternating Least
>>> Squares algorithm. The data set is stored in a csv file in the format:
>>> *Name,Value1,Value2*.
>>>
>>> When I read the csv file, I get
>>> *java.lang.NumberFormatException.forInputString* error because the
>>> Rating class needs the parameters in the format: *(user: Int, product:
>>> Int, rating: Double)* and the first column of my file contains *Name*.
>>>
>>> Please suggest me a way to overcome this issue.
>>>
>>
>


Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Marius Soutier

> On 04.03.2016, at 22:39, Cody Koeninger  wrote:
> 
> The only other valid use of messageHandler that I can think of is
> catching serialization problems on a per-message basis.  But with the
> new Kafka consumer library, that doesn't seem feasible anyway, and
> could be handled with a custom (de)serializer.

What do you mean, that doesn't seem feasible? Do you mean when using a custom 
deserializer? Right now I'm catching serialization problems in the message 
handler; after your proposed change I'd catch them in `map()`.
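
A minimal sketch of that message-handler pattern with the current (Spark 1.x)
direct stream API; MyEvent and parse are hypothetical stand-ins, and ssc,
kafkaParams and fromOffsets are assumed to be defined elsewhere in the job:

import kafka.message.MessageAndMetadata
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.util.Try

case class MyEvent(payload: String)                       // hypothetical event type
def parse(bytes: Array[Byte]): MyEvent =                  // hypothetical parser
  MyEvent(new String(bytes, "UTF-8"))

val stream = KafkaUtils.createDirectStream[
    String, Array[Byte], StringDecoder, DefaultDecoder, Option[MyEvent]](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, Array[Byte]]) => Try(parse(mmd.message())).toOption)

// bad records arrive as None and can simply be dropped downstream
val events = stream.flatMap(opt => opt)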


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen,

I've spent a few hours on the changes related to streaming dataframes
(included in SPARK-8360) and concluded that it's currently only
possible to read.stream(), but not write.stream(), since there are no
streaming Sinks yet.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao  wrote:
> Hi,
>
> I would like to get my hands on the structured streaming feature
> coming out in Spark 2.0. I have tried looking around for code samples to get
> started but am not able to find any. Only few things I could look into is
> the test cases that have been committed under the JIRA umbrella
> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
> lead to building a example code as they seem to be working out of internal
> classes.
>
> Could anyone point me to some resources or pointers in code that I
> can start with to understand structured streaming from a consumability
> angle.
>
> Thanking You
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to compile Spark with private build of Hadoop

2016-03-08 Thread Steve Loughran

On 8 Mar 2016, at 07:23, Lu, Yingqi <yingqi...@intel.com> wrote:

Thank you for the quick reply. I am very new to maven and always use the 
default settings. Can you please be a little more specific on the instructions?

I think all the jar files from Hadoop build are located at 
Hadoop-3.0.0-SNAPSHOT/share/hadoop. Which ones I need to use to compile Spark 
and how can I change the pom.xml?

Thanks,
Lucy





It's simple to do this locally; no need for a remote server.

You do need to do this every morning, and do not try to run a build over midnight, 
as that confuses Maven. Just bear in mind that on the first build Maven does each 
day, it will try to fetch snapshots remotely if they aren't local.

1. in hadoop-trunk:
mvn install -DskipTests

This will publish the 3.0.0-SNAPSHOT JARs into ~/.m2/repository, where they 
will be picked up by dependent builds for the rest of the day

2. in spark

mvn install -DskipTests -Phadoop-2.6 -Dhadoop.version=3.0.0-SNAPSHOT

That turns on the Hadoop 2.6+ profile while setting the Hadoop version to build 
against to the 3.0.0-SNAPSHOT you built in step (1)

3. Go and have a coffee; wait for the spark build to finish

That's all you need to do to get a version of spark built with your hadoop 
version.

It may be that spark fails to compile against Hadoop 2.9.0-SNAPSHOT or 
3.0.0-SNAPSHOT. If that happens, consider it a regression in Hadoop, file a bug 
there. I've been working with Hadoop 2.8.0-SNAPSHOT without problems, except 
for where, in the split of HDFS in to client and server JARs/POMs 
(hadoop-hdfs-client and hadoop-hdfs), the client JAR had left out some classes 
I expected to be available. That's been fixed, but don't be afraid to complain 
yourself if you find a problem: it's in the nightly build phase where 
regressions can be fixed within 24h


SparkFiles.get() returns with driver path Instead of Worker Path

2016-03-08 Thread ashikvc
I am trying to play a little bit with apache-spark cluster mode.
So my cluster consists of a driver in my machine and a worker and manager in
host machine(separate machine).

I send a textfile using `sparkContext.addFile(filepath)` where the filepath
is the path of my text file in local machine for which I get the following
output:

INFO Utils: Copying /home/files/data.txt to
/tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt

INFO SparkContext: Added file /home/files/data.txt at
http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649

But when I try to access the same file using `SparkFiles.get("data.txt")`, I
get the path to file in my driver instead of worker.
I am setting my file like this

SparkConf conf = new
SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
conf.setJars(new String[]{"jars/SparkWorker.jar"});
JavaSparkContext sparkContext = new JavaSparkContext(conf);
sparkContext.addFile("/home/files/data.txt");
List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();
I am getting FileNotFoundException here.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkFiles-get-returns-with-driver-path-Instead-of-Worker-Path-tp26428.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark structured streaming

2016-03-08 Thread Praveen Devarao
Thanks Jacek for the pointer.

Any idea which package can be used in .format()? The test cases seem to 
work off the DefaultSource class defined within the 
DataFrameReaderWriterSuite [org.apache.spark.sql.streaming.test.DefaultSource].

Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:   Jacek Laskowski 
To: Praveen Devarao/India/IBM@IBMIN
Cc: user , dev 
Date:   08/03/2016 04:17 pm
Subject:Re: Spark structured streaming



Hi Praveen,

I've spent few hours on the changes related to streaming dataframes
(included in the SPARK-8360) and concluded that it's currently only
possible to read.stream(), but not write.stream() since there are no
streaming Sinks yet.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao  
wrote:
> Hi,
>
> I would like to get my hands on the structured streaming feature
> coming out in Spark 2.0. I have tried looking around for code samples to 
get
> started but am not able to find any. Only few things I could look into 
is
> the test cases that have been committed under the JIRA umbrella
> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
> lead to building a example code as they seem to be working out of 
internal
> classes.
>
> Could anyone point me to some resources or pointers in code that 
I
> can start with to understand structured streaming from a consumability
> angle.
>
> Thanking You
> 
-
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
> 
-
> "Courage doesn't always roar. Sometimes courage is the quiet voice at 
the
> end of the day saying I will try again"

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







Quetions about Actor model of Computation.

2016-03-08 Thread Minglei Zhang
Hello, experts.

I am a student, and recently I read a paper about *Actor Model of
Computation: Scalable Robust Information System*. I am trying to understand its
concepts, but the following sentence confuses me.

# Messages are the unit of communication *1*

Reference to *1*:

All physically computable functions can be implemented using the lambda
calculus. It is a consequence of the Actor Model that *there are some
computations that cannot be implemented in the lambda calculus*.

*My question*: what are the computations that cannot be implemented in the
lambda calculus?

thanks.
Minglei.


Output the data to external database at particular time in spark streaming

2016-03-08 Thread Abhishek Anand
I have a Spark Streaming job where I am aggregating the data with
reduceByKeyAndWindow using an inverse function.

I am keeping the data in memory for up to 2 hours, and in order to output the
reduced data to external storage I need to conditionally push the data to the
DB, say at the 15th minute of each hour.

How can this be achieved?


Regards,
Abhi


[Streaming + MLlib] Is it only Linear regression supported by online learning?

2016-03-08 Thread diplomatic Guru
Hello all,

I'm using Random Forest for my (batch) machine learning job, and I would like to
do online prediction in a streaming job. However, the documentation only mentions
a linear algorithm for online regression. Can we not use other algorithms?


Spark ML Interaction

2016-03-08 Thread amarouni
Hi,

Did anyone here manage to write an example of the following ML feature
transformer
http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/Interaction.html
?
It's not documented on the official Spark ML features pages but it can
be found in the package API javadocs.

Thanks,

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark ML Interaction

2016-03-08 Thread Nick Pentreath
Could you create a JIRA to add an example and documentation?

Thanks

On Tue, 8 Mar 2016 at 16:18, amarouni  wrote:

> Hi,
>
> Did anyone here manage to write an example of the following ML feature
> transformer
>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/Interaction.html
> ?
> It's not documented on the official Spark ML features pages but it can
> be found in the package API javadocs.
>
> Thanks,
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Cody Koeninger
No, looks like you'd have to catch them in the serializer and have the
serializer return option or something. The new consumer builds a buffer
full of records, not one at a time.
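
A minimal sketch of that idea for the new (0.9+) Kafka consumer, using a
hypothetical MyEvent type and parse function; a record that fails to deserialize
comes back as None instead of throwing:

import java.util
import org.apache.kafka.common.serialization.Deserializer
import scala.util.Try

case class MyEvent(payload: String)                      // hypothetical event type

class SafeEventDeserializer extends Deserializer[Option[MyEvent]] {
  // hypothetical parsing logic; replace with your real decoding
  private def parse(bytes: Array[Byte]): MyEvent = MyEvent(new String(bytes, "UTF-8"))

  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {}

  override def deserialize(topic: String, data: Array[Byte]): Option[MyEvent] =
    Try(parse(data)).toOption                            // bad records become None

  override def close(): Unit = {}
}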
On Mar 8, 2016 4:43 AM, "Marius Soutier"  wrote:

>
> > On 04.03.2016, at 22:39, Cody Koeninger  wrote:
> >
> > The only other valid use of messageHandler that I can think of is
> > catching serialization problems on a per-message basis.  But with the
> > new Kafka consumer library, that doesn't seem feasible anyway, and
> > could be handled with a custom (de)serializer.
>
> What do you mean, that doesn't seem feasible? You mean when using a custom
> deserializer? Right now I'm catching serialization problems in the message
> handler, after your proposed change I'd catch them in `map()`.
>
>


Re: Quetions about Actor model of Computation.

2016-03-08 Thread Ted Yu
This seems related:

the second paragraph under Implementation and theory
https://en.wikipedia.org/wiki/Closure_(computer_programming)

On Tue, Mar 8, 2016 at 4:49 AM, Minglei Zhang  wrote:

> hello, experts.
>
> I am a student. and recently, I read a paper about *Actor Model of
> Computation:Scalable Robust Information System*. In the paper I am trying
> to understand it's concept. But with the following sentence makes me
> confusing.
>
> # Messages are the unit of communication *1*
>
> Reference to *1*
>
> All physically computable functions can be implemented using the lambda
> calculus. It is a consequence of the Actor Model that* there are some
> computations that cannot be implemented in the lambda calculus*.
>
> *My question*, What are there computations that cannot be implemented in
> Lambda calculus ?
>
> thanks.
> Minglei.
>


Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen,

I don't really know. I think TD or Michael should know, as they were
personally involved in the task (as far as I could figure out from the JIRA
and the changes). Ping people on the JIRA so they notice your question(s).

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Mar 8, 2016 at 12:32 PM, Praveen Devarao  wrote:
> Thanks Jacek for the pointer.
>
> Any idea which package can be used in .format(). The test cases seem to work
> out of the DefaultSource class defined within the DataFrameReaderWriterSuite
> [org.apache.spark.sql.streaming.test.DefaultSource]
>
> Thanking You
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From:Jacek Laskowski 
> To:Praveen Devarao/India/IBM@IBMIN
> Cc:user , dev 
> Date:08/03/2016 04:17 pm
> Subject:Re: Spark structured streaming
> 
>
>
>
> Hi Praveen,
>
> I've spent few hours on the changes related to streaming dataframes
> (included in the SPARK-8360) and concluded that it's currently only
> possible to read.stream(), but not write.stream() since there are no
> streaming Sinks yet.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao 
> wrote:
>> Hi,
>>
>> I would like to get my hands on the structured streaming feature
>> coming out in Spark 2.0. I have tried looking around for code samples to
>> get
>> started but am not able to find any. Only few things I could look into is
>> the test cases that have been committed under the JIRA umbrella
>> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
>> lead to building a example code as they seem to be working out of internal
>> classes.
>>
>> Could anyone point me to some resources or pointers in code that I
>> can start with to understand structured streaming from a consumability
>> angle.
>>
>> Thanking You
>>
>> -
>> Praveen Devarao
>> Spark Technology Centre
>> IBM India Software Labs
>>
>> -
>> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
>> end of the day saying I will try again"
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



FileAlreadyExistsException and Streaming context

2016-03-08 Thread Peter Halliday
I’m getting a FileAlreadyExistsException. I’ve tried setting the save to 
SaveMode.Overwrite, and setting spark.hadoop.validateOutputSpecs to false. 
However, I wonder if these settings are being ignored because I’m using 
Spark Streaming. We aren’t using checkpointing though. Here’s the stack 
trace: http://pastebin.com/AqBFXkga 

Peter Halliday

Spark on RAID

2016-03-08 Thread Eddie Esquivel
Hello All,
In the Spark documentation under "Hardware Requirements" it very clearly
states:

We recommend having *4-8 disks* per node, configured *without* RAID (just
as separate mount points)

My question is: why not RAID? What is the argument/reason for not using RAID?

Thanks!
-Eddie


Re: Spark on RAID

2016-03-08 Thread Alex Kozlov
Parallel disk IO?  But the effect should be less noticeable compared to
Hadoop which reads/writes a lot.  Much depends on how often Spark persists
on disk.  Depends on the specifics of the RAID controller as well.

If you write to HDFS as opposed to local file system this may be a big
factor as well.

On Tue, Mar 8, 2016 at 8:34 AM, Eddie Esquivel 
wrote:

> Hello All,
> In the Spark documentation under "Hardware Requirements" it very clearly
> states:
>
> We recommend having *4-8 disks* per node, configured *without* RAID (just
> as separate mount points)
>
> My question is why not raid? What is the argument\reason for not using
> Raid?
>
> Thanks!
> -Eddie
>

--
Alex Kozlov


Re: Quetions about Actor model of Computation.

2016-03-08 Thread Minglei Zhang
thanks Ted gives a quick reply. I will see it you mentioned.

Best Regards.

2016-03-08 23:26 GMT+08:00 Ted Yu :

> This seems related:
>
> the second paragraph under Implementation and theory
> https://en.wikipedia.org/wiki/Closure_(computer_programming)
>
> On Tue, Mar 8, 2016 at 4:49 AM, Minglei Zhang 
> wrote:
>
>> hello, experts.
>>
>> I am a student. and recently, I read a paper about *Actor Model of
>> Computation:Scalable Robust Information System*. In the paper I am
>> trying to understand it's concept. But with the following sentence makes me
>> confusing.
>>
>> # Messages are the unit of communication *1*
>>
>> Reference to *1*
>>
>> All physically computable functions can be implemented using the lambda
>> calculus. It is a consequence of the Actor Model that* there are some
>> computations that cannot be implemented in the lambda calculus*.
>>
>> *My question*, What are there computations that cannot be implemented in
>> Lambda calculus ?
>>
>> thanks.
>> Minglei.
>>
>
>


Re: how to implement and deploy robust streaming apps

2016-03-08 Thread Xinh Huynh
If you would like an overview of Spark Stream and fault tolerance, these
slides are great (Slides 24+ focus on fault tolerance; Slide 52 is on
resilience to traffic spikes):
http://www.lightbend.com/blog/four-things-to-know-about-reliable-spark-streaming-typesafe-databricks

This recent Spark Summit talk is all about backpressure and dynamic
scaling:
https://spark-summit.org/east-2016/events/building-robust-scalable-and-adaptive-applications-on-spark-streaming/

From the Spark docs, backpressure works by placing a limit on the receiving
rate, and this limit is adjusted dynamically based on processing times. If
there is a burst and the data source generates events at a higher rate,
those extra events will get backed up in the data source. So, how much
buffering is available in the data source? For instance, Kafka can use HDFS
as a huge buffer, with capacity to buffer traffic spikes. Spark itself
doesn't handle the buffering of unprocessed events, so in some cases, Kafka
(or some other storage) is placed between the data source and Spark to
provide a buffer.
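
A minimal sketch of the relevant settings (the rate values are illustrative
assumptions, not recommendations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-with-backpressure")
  .set("spark.streaming.backpressure.enabled", "true")
  // hard ceilings that still apply even with backpressure on
  .set("spark.streaming.receiver.maxRate", "10000")           // receiver-based streams
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")  // Kafka direct streams
val ssc = new StreamingContext(conf, Seconds(10))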

Xinh


On Mon, Mar 7, 2016 at 2:10 PM, Andy Davidson  wrote:

> One of the challenges we need to prepare for with streaming apps is bursty
> data. Typically we need to estimate our worst case data load and make sure
> we have enough capacity
>
>
> It not obvious what best practices are with spark streaming.
>
>
>- we have implemented check pointing as described in the prog guide
>- Use stand alone cluster manager and spark-submit
>- We use the mgmt console to kill drives when needed
>
>
>- we plan to configure write ahead spark.streaming.backpressure.enabled
> to true.
>
>
>- our application runs a single unreliable receive
>   - We run multiple implementation configured to partition the input
>
>
> As long as our processing time is < our windowing time everything is fine
>
> In the streaming systems I have worked on in the past we scaled out by
> using load balancers and proxy farms to create buffering capacity. Its not
> clear how to scale out spark
>
> In our limited testing it seems like we have a single app configured to
> receive a predefined portion of the data. Once it is started we cannot add
> additional resources. Adding cores and memory does not seem to increase our
> capacity
>
>
> Kind regards
>
> Andy
>
>
>


Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread ayan guha
Why not compare the current time in every batch and, if it meets a certain
condition, emit the data?
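
A minimal sketch of that per-batch check (dstream and writeToDb are assumed
names; note the batch boundary may not line up exactly with the 15th minute):

import org.apache.spark.streaming.Time

dstream.foreachRDD { (rdd, time: Time) =>
  // minute-of-hour of this batch's scheduled time
  val minuteOfHour = (time.milliseconds / 60000) % 60
  if (minuteOfHour == 15) {
    writeToDb(rdd)   // hypothetical sink for the reduced data
  }
}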
On 9 Mar 2016 00:19, "Abhishek Anand"  wrote:

> I have a spark streaming job where I am aggregating the data by doing
> reduceByKeyAndWindow with inverse function.
>
> I am keeping the data in memory for upto 2 hours and In order to output
> the reduced data to an external storage I conditionally need to puke the
> data to DB say at every 15th minute of the each hour.
>
> How can this be achieved.
>
>
> Regards,
> Abhi
>


Re: SparkFiles.get() returns with driver path Instead of Worker Path

2016-03-08 Thread Tristan Nixon
My understanding of the model is that you’re supposed to execute 
SparkFiles.get(…) on each worker node, not on the driver.

Since you already know where the files are on the driver, if you want to load 
these into an RDD with SparkContext.textFile, then that will distribute the data 
out to the workers; there's no need to use SparkContext.addFile to do this.

If you have some functions that run on workers that expects local file 
resources, then you can use SparkContext.addFile to distribute the files into 
worker local storage, then you can execute SparkFiles.get separately on each 
worker to retrieve these local files (it will give different paths on each 
worker).
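
A minimal sketch of that second pattern, assuming sc is the SparkContext and the
file path is illustrative:

import org.apache.spark.SparkFiles
import scala.io.Source

sc.addFile("/home/files/data.txt")                 // driver side: ship the file to every executor

val lineCounts = sc.parallelize(1 to 4, 4).map { _ =>
  // worker side: SparkFiles.get resolves to this executor's local copy
  val localPath = SparkFiles.get("data.txt")
  Source.fromFile(localPath).getLines().size
}.collect()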

> On Mar 8, 2016, at 5:31 AM, ashikvc  wrote:
> 
> I am trying to play a little bit with apache-spark cluster mode.
> So my cluster consists of a driver in my machine and a worker and manager in
> host machine(separate machine).
> 
> I send a textfile using `sparkContext.addFile(filepath)` where the filepath
> is the path of my text file in local machine for which I get the following
> output:
> 
>INFO Utils: Copying /home/files/data.txt to
> /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
> 
>INFO SparkContext: Added file /home/files/data.txt at
> http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
> 
> But when I try to access the same file using `SparkFiles.get("data.txt")`, I
> get the path to file in my driver instead of worker.
> I am setting my file like this
> 
>SparkConf conf = new
> SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
>conf.setJars(new String[]{"jars/SparkWorker.jar"});
>JavaSparkContext sparkContext = new JavaSparkContext(conf);
>sparkContext.addFile("/home/files/data.txt");
>List file
> =sparkContext.textFile(SparkFiles.get("data.txt")).collect();
> I am getting FileNotFoundException here.
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkFiles-get-returns-with-driver-path-Instead-of-Worker-Path-tp26428.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread Ted Yu
That may miss the 15th minute of the hour (with non-trivial deviation),
right ?

On Tue, Mar 8, 2016 at 8:50 AM, ayan guha  wrote:

> Why not compare current time in every batch and it meets certain condition
> emit the data?
> On 9 Mar 2016 00:19, "Abhishek Anand"  wrote:
>
>> I have a spark streaming job where I am aggregating the data by doing
>> reduceByKeyAndWindow with inverse function.
>>
>> I am keeping the data in memory for upto 2 hours and In order to output
>> the reduced data to an external storage I conditionally need to puke the
>> data to DB say at every 15th minute of the each hour.
>>
>> How can this be achieved.
>>
>>
>> Regards,
>> Abhi
>>
>


Re: Spark on RAID

2016-03-08 Thread Mark Hamstra
One issue is that RAID levels providing data replication are not necessary
since HDFS already replicates blocks on multiple nodes.

On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov  wrote:

> Parallel disk IO?  But the effect should be less noticeable compared to
> Hadoop which reads/writes a lot.  Much depends on how often Spark persists
> on disk.  Depends on the specifics of the RAID controller as well.
>
> If you write to HDFS as opposed to local file system this may be a big
> factor as well.
>
> On Tue, Mar 8, 2016 at 8:34 AM, Eddie Esquivel  > wrote:
>
>> Hello All,
>> In the Spark documentation under "Hardware Requirements" it very clearly
>> states:
>>
>> We recommend having *4-8 disks* per node, configured *without* RAID
>> (just as separate mount points)
>>
>> My question is why not raid? What is the argument\reason for not using
>> Raid?
>>
>> Thanks!
>> -Eddie
>>
>
> --
> Alex Kozlov
>


PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
I am trying to define a UDF to calculate octet_length of a string but I am 
having some trouble getting it right. Does anyone have a working version of 
this already/any pointers?

I am using Spark 1.5.2/Python 2.7.

Thanks


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
Meant to include:

I have this function which seems to work, but I am not sure if it is always 
correct:

def octet_length(s):
    return len(s.encode('utf8'))

# note: registerFunction defaults to a StringType return type; pass
# pyspark.sql.types.IntegerType() as a third argument to get an integer back
sqlContext.registerFunction('octet_length', lambda x: octet_length(x))

> On Mar 8, 2016, at 12:30 PM, Cramblit, Ross (Reuters News) 
>  wrote:
> 
> I am trying to define a UDF to calculate octet_length of a string but I am 
> having some trouble getting it right. Does anyone have a working version of 
> this already/any pointers?
> 
> I am using Spark 1.5.2/Python 2.7.
> 
> Thanks
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Yuval.Itzchakov
Hi, 
I'm using Spark 1.6.0, and according to the documentation, dynamic
allocation and spark shuffle service should be enabled.

When I submit a spark job via the following:

spark-submit \
--master  \
--deploy-mode cluster \
--executor-cores 3 \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.dynamicAllocation.minExecutors=2" \
--conf "spark.dynamicAllocation.maxExecutors=24" \
--conf "spark.shuffle.service.enabled=true" \
--conf "spark.executor.memory=8g" \
--conf "spark.driver.memory=10g" \
--class SparkJobRunner
/opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar

I'm seeing error logs from the workers being unable to connect to the
shuffle service:

16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to external
shuffle server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to 
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
at
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
at
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
at org.apache.spark.executor.Executor.(Executor.scala:85)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
at
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I verified all relevant ports are open. Has anyone else experienced such a
failure?

Yuval.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Analyzing json Data streams using sparkSQL in spark streaming returns java.lang.ClassNotFoundException

2016-03-08 Thread Nesrine BEN MUSTAPHA
Hello,

I am trying to use Spark SQL to analyse JSON data streams within a standalone
application.

Here is the code snippet that receives the streaming data:

final JavaReceiverInputDStream<String> lines =
    streamCtx.socketTextStream("localhost", Integer.parseInt(args[0]),
        StorageLevel.MEMORY_AND_DISK_SER_2());

lines.foreachRDD((rdd) -> {

    final JavaRDD<String> jsonElements = rdd.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(final String line) throws Exception {
            return Arrays.asList(line.split("\n"));
        }
    }).filter(new Function<String, Boolean>() {
        @Override
        public Boolean call(final String v1) throws Exception {
            return v1.length() > 0;
        }
    });

    //System.out.println("Data Received = " + jsonElements.collect().size());

    final SQLContext sqlContext = JavaSQLContextSingleton.getInstance(rdd.context());

    final DataFrame dfJsonElement = sqlContext.read().json(jsonElements);

    executeSQLOperations(sqlContext, dfJsonElement);
});

streamCtx.start();
streamCtx.awaitTermination();
}


I got the following error when the line that builds the DataFrame from the JSON
RDD (sqlContext.read().json(jsonElements)) is executed:

java.lang.ClassNotFoundException:
com.intrinsec.common.spark.SQLStreamingJsonAnalyzer$2
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)


Re: Installing Spark on Mac

2016-03-08 Thread Aida
Hi all,

Thanks everyone for your responses; really appreciate it.

Eduardo - I tried your suggestions but ran into some issues, please see
below:

ukdrfs01:Spark aidatefera$ cd spark-1.6.0
ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
Using `mvn` from path: /usr/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M;
support was removed in 8.0
[INFO] Scanning for projects...
[INFO]

[INFO] Reactor Build Order:
[INFO] 
[INFO] Spark Project Parent POM
[INFO] Spark Project Test Tags
[INFO] Spark Project Launcher
[INFO] Spark Project Networking
[INFO] Spark Project Shuffle Streaming Service
[INFO] Spark Project Unsafe
[INFO] Spark Project Core
[INFO] Spark Project Bagel
[INFO] Spark Project GraphX
[INFO] Spark Project Streaming
[INFO] Spark Project Catalyst
[INFO] Spark Project SQL
[INFO] Spark Project ML Library
[INFO] Spark Project Tools
[INFO] Spark Project Hive
[INFO] Spark Project Docker Integration Tests
[INFO] Spark Project REPL
[INFO] Spark Project Assembly
[INFO] Spark Project External Twitter
[INFO] Spark Project External Flume Sink
[INFO] Spark Project External Flume
[INFO] Spark Project External Flume Assembly
[INFO] Spark Project External MQTT
[INFO] Spark Project External MQTT Assembly
[INFO] Spark Project External ZeroMQ
[INFO] Spark Project External Kafka
[INFO] Spark Project Examples
[INFO] Spark Project External Kafka Assembly
[INFO] 
[INFO]

[INFO] Building Spark Project Parent POM 1.6.0
[INFO]

[INFO] 
[INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
spark-parent_2.10 ---
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
spark-parent_2.10 ---
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion
failed with message:
Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
[INFO]

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. FAILURE [0.821s]
[INFO] Spark Project Test Tags ... SKIPPED
[INFO] Spark Project Launcher  SKIPPED
[INFO] Spark Project Networking .. SKIPPED
[INFO] Spark Project Shuffle Streaming Service ... SKIPPED
[INFO] Spark Project Unsafe .. SKIPPED
[INFO] Spark Project Core  SKIPPED
[INFO] Spark Project Bagel ... SKIPPED
[INFO] Spark Project GraphX .. SKIPPED
[INFO] Spark Project Streaming ... SKIPPED
[INFO] Spark Project Catalyst  SKIPPED
[INFO] Spark Project SQL . SKIPPED
[INFO] Spark Project ML Library .. SKIPPED
[INFO] Spark Project Tools ... SKIPPED
[INFO] Spark Project Hive  SKIPPED
[INFO] Spark Project Docker Integration Tests  SKIPPED
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External Flume Assembly . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project External MQTT Assembly .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] Spark Project External Kafka Assembly . SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 1.745s
[INFO] Finished at: Tue Mar 08 18:01:48 GMT 2016
[INFO] Final Memory: 19M/183M
[INFO]

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce
(enforce-versions) on project spark-parent_2.10: Some Enforcer rules have
failed. Look above for specific messages explaining why the rule failed. ->
[Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please
read the following

Not able to write data after running fpgrowth in pyspark

2016-03-08 Thread goutham koneru
Hello,

I am using FPGrowth to generate frequent item sets, and the model is working
fine. If I select n rows I am able to see the data.

When I try to save the data using any of the methods like write.orc, saveAsTable,
or saveAsParquet, it takes an unusual amount of time to save the data.

If I save data before running the model, all the methods work perfectly; I only
see this issue after running the model. Below is the sample code I am using. Can
you let me know if there is anything I am doing wrong, or whether any
configuration changes need to be made?

rdd_buckets.write.orc('/data/buckets') -- *before running the model this
works and writes data in less than 2 minutes.*

transactions = rdd_buckets.rdd.map(lambda line: line.buckets.split('::'))
model = FPGrowth.train(transactions, minSupport=0.01,numPartitions=200)
result = model.freqItemsets().toDF()
size_1_buckets = result.filter(F.size(result.items) == 1)
size_2_buckets = result.filter(F.size(result.items) == 2)
size_1_buckets.registerTempTable('size_1_buckets')
hive_context.sql("use buckets")
hive_context.sql("create table size_1_buckets as select * from
size_1_buckets" ) -- *this step takes long time (10 hours) to complete the
writing process.*
size_2_buckets.registerTempTable('size_2_buckets')
hive_context.sql("use buckets")
hive_context.sql("create table size_2_buckets as select * from
size_2_buckets" ) -- *this step takes **takes long time (10 hours) ** to
complete the writing process.*

Below is the command that we are using to submit the job.

/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client
--num-executors 10 --conf spark.executor.memory=10g --conf
spark.yarn.queue=batch --conf spark.rpc.askTimeout=100s --conf
spark.driver.memory=2g --conf spark.kryoserializer.buffer.max=256m --conf
spark.executor.cores=5 --conf spark.driver.cores=4 python/buckets.py

Thanks,
Goutham.


RE: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Silvio Fiorito
You’ve started the external shuffle service on all worker nodes, correct? Can 
you confirm they’re still running and haven’t exited?







From: Yuval.Itzchakov
Sent: Tuesday, March 8, 2016 12:41 PM
To: user@spark.apache.org
Subject: Using dynamic allocation and shuffle service in Standalone Mode



Hi,
I'm using Spark 1.6.0, and according to the documentation, dynamic
allocation and spark shuffle service should be enabled.

When I submit a spark job via the following:

spark-submit \
--master  \
--deploy-mode cluster \
--executor-cores 3 \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.dynamicAllocation.minExecutors=2" \
--conf "spark.dynamicAllocation.maxExecutors=24" \
--conf "spark.shuffle.service.enabled=true" \
--conf "spark.executor.memory=8g" \
--conf "spark.driver.memory=10g" \
--class SparkJobRunner
/opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar

I'm seeing error logs from the workers being unable to connect to the
shuffle service:

16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to external
shuffle server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to 
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
at
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
at
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
at org.apache.spark.executor.Executor.(Executor.scala:85)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
at
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I verified all relevant ports are open. Has anyone else experienced such a
failure?

Yuval.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Suniti Singh
Please check the document for the configuration -
http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup


On Tue, Mar 8, 2016 at 10:14 AM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> You’ve started the external shuffle service on all worker nodes, correct?
> Can you confirm they’re still running and haven’t exited?
>
>
>
>
>
>
>
> *From: *Yuval.Itzchakov 
> *Sent: *Tuesday, March 8, 2016 12:41 PM
> *To: *user@spark.apache.org
> *Subject: *Using dynamic allocation and shuffle service in Standalone Mode
>
>
> Hi,
> I'm using Spark 1.6.0, and according to the documentation, dynamic
> allocation and spark shuffle service should be enabled.
>
> When I submit a spark job via the following:
>
> spark-submit \
> --master  \
> --deploy-mode cluster \
> --executor-cores 3 \
> --conf "spark.streaming.backpressure.enabled=true" \
> --conf "spark.dynamicAllocation.enabled=true" \
> --conf "spark.dynamicAllocation.minExecutors=2" \
> --conf "spark.dynamicAllocation.maxExecutors=24" \
> --conf "spark.shuffle.service.enabled=true" \
> --conf "spark.executor.memory=8g" \
> --conf "spark.driver.memory=10g" \
> --class SparkJobRunner
>
> /opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar
>
> I'm seeing error logs from the workers being unable to connect to the
> shuffle service:
>
> 16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to external
> shuffle server, will retry 2 more times after waiting 5 seconds...
> java.io.IOException: Failed to connect to 
> at
>
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at
>
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
> at
>
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
> at
>
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at
>
> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
> at
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
> at org.apache.spark.executor.Executor.(Executor.scala:85)
> at
>
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
> at
>
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> I verified all relevant ports are open. Has anyone else experienced such a
> failure?
>
> Yuval.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Analyzing json Data streams using sparkSQL in spark streaming returns java.lang.ClassNotFoundException

2016-03-08 Thread Tristan Nixon
this is a bit strange, because you’re trying to create an RDD inside of a 
foreach function (the jsonElements). This executes on the workers, and so will 
actually produce a different instance in each JVM on each worker, not one 
single RDD referenced by the driver, which is what I think you’re trying to get.

Why don’t you try something like:

JavaDStream<String> jsonElements = lines.flatMap( … )

and just skip the lines.foreach?
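
For what it's worth, a minimal sketch of that shape (a sketch only — it reuses the JavaSQLContextSingleton and executeSQLOperations helpers from your code, assumes Java 8 lambdas, and needs java.util.Arrays imported):

JavaDStream<String> jsonElements = lines
    .flatMap(line -> Arrays.asList(line.split("\n")))  // one JSON document per line
    .filter(line -> line.length() > 0);                // drop empty lines

jsonElements.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // one SQLContext per JVM, as in the original JavaSQLContextSingleton helper
        final SQLContext sqlContext = JavaSQLContextSingleton.getInstance(rdd.context());
        final DataFrame dfJsonElement = sqlContext.read().json(rdd);
        executeSQLOperations(sqlContext, dfJsonElement);
    }
});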

> On Mar 8, 2016, at 11:59 AM, Nesrine BEN MUSTAPHA 
>  wrote:
> 
> Hello,
> 
> I tried to use sparkSQL to analyse json data streams within a standalone 
> application. 
> 
> here the code snippet that receive the streaming data: 
> final JavaReceiverInputDStream<String> lines = 
> streamCtx.socketTextStream("localhost", Integer.parseInt(args[0]), 
> StorageLevel.MEMORY_AND_DISK_SER_2());
> 
> lines.foreachRDD((rdd) -> {
> 
> final JavaRDD<String> jsonElements = rdd.flatMap(new FlatMapFunction<String, String>() {
> 
> @Override
> 
> public Iterable<String> call(final String line)
> 
> throws Exception {
> 
> return Arrays.asList(line.split("\n"));
> 
> }
> 
> }).filter(new Function<String, Boolean>() {
> 
> @Override
> 
> public Boolean call(final String v1)
> 
> throws Exception {
> 
> return v1.length() > 0;
> 
> }
> 
> });
> 
> //System.out.println("Data Received = " + jsonElements.collect().size());
> 
> final SQLContext sqlContext = 
> JavaSQLContextSingleton.getInstance(rdd.context());
> 
> final DataFrame dfJsonElement = sqlContext.read().json(jsonElements); 
> 
> executeSQLOperations(sqlContext, dfJsonElement);
> 
> });
> 
> streamCtx.start();
> 
> streamCtx.awaitTermination();
> 
> }
> 
> 
> 
> 
> 
> 
> 
> 
> 
> I got the following error when the red line is executed:
> 
> java.lang.ClassNotFoundException: 
> com.intrinsec.common.spark.SQLStreamingJsonAnalyzer$2
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> 
> 
> 
> 
> 
> 



Re: Installing Spark on Mac

2016-03-08 Thread Aida
I tried sbt/sbt package; it seemed to run fine until it didn't, and I was wondering
whether the error below has to do with my JVM version. Any thoughts? Thanks

ukdrfs01:~ aidatefera$ cd Spark
ukdrfs01:Spark aidatefera$ cd spark-1.6.0
ukdrfs01:spark-1.6.0 aidatefera$ sbt/sbt package
NOTE: The sbt/sbt script has been relocated to build/sbt.
  Please update references to point to the new location.

  Invoking 'build/sbt package' now ...

Attempting to fetch sbt
Launching sbt from build/sbt-launch-0.13.7.jar
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m;
support was removed in 8.0
Getting org.scala-sbt sbt 0.13.7 ...
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.7/jars/sbt.jar
...
[SUCCESSFUL ] org.scala-sbt#sbt;0.13.7!sbt.jar (1785ms)
downloading
https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
...
[SUCCESSFUL ] org.scala-lang#scala-library;2.10.4!scala-library.jar
(3639ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/main/0.13.7/jars/main.jar
...
[SUCCESSFUL ] org.scala-sbt#main;0.13.7!main.jar (3302ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/compiler-interface/0.13.7/jars/compiler-interface-src.jar
...
[SUCCESSFUL ]
org.scala-sbt#compiler-interface;0.13.7!compiler-interface-src.jar (1865ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/compiler-interface/0.13.7/jars/compiler-interface-bin.jar
...
[SUCCESSFUL ]
org.scala-sbt#compiler-interface;0.13.7!compiler-interface-bin.jar (1993ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/precompiled-2_8_2/0.13.7/jars/compiler-interface-bin.jar
...
[SUCCESSFUL ]
org.scala-sbt#precompiled-2_8_2;0.13.7!compiler-interface-bin.jar (2020ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/precompiled-2_9_2/0.13.7/jars/compiler-interface-bin.jar
...
[SUCCESSFUL ]
org.scala-sbt#precompiled-2_9_2;0.13.7!compiler-interface-bin.jar (2067ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/precompiled-2_9_3/0.13.7/jars/compiler-interface-bin.jar
...
[SUCCESSFUL ]
org.scala-sbt#precompiled-2_9_3;0.13.7!compiler-interface-bin.jar (2297ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/actions/0.13.7/jars/actions.jar
...
[SUCCESSFUL ] org.scala-sbt#actions;0.13.7!actions.jar (2195ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/main-settings/0.13.7/jars/main-settings.jar
...
[SUCCESSFUL ] org.scala-sbt#main-settings;0.13.7!main-settings.jar 
(2232ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/interface/0.13.7/jars/interface.jar
...
[SUCCESSFUL ] org.scala-sbt#interface;0.13.7!interface.jar (1977ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/io/0.13.7/jars/io.jar
...
[SUCCESSFUL ] org.scala-sbt#io;0.13.7!io.jar (2013ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/ivy/0.13.7/jars/ivy.jar
...
[SUCCESSFUL ] org.scala-sbt#ivy;0.13.7!ivy.jar (2336ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/launcher-interface/0.13.7/jars/launcher-interface.jar
...
[SUCCESSFUL ]
org.scala-sbt#launcher-interface;0.13.7!launcher-interface.jar (1728ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/logging/0.13.7/jars/logging.jar
...
[SUCCESSFUL ] org.scala-sbt#logging;0.13.7!logging.jar (1979ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/logic/0.13.7/jars/logic.jar
...
[SUCCESSFUL ] org.scala-sbt#logic;0.13.7!logic.jar (1878ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/process/0.13.7/jars/process.jar
...
[SUCCESSFUL ] org.scala-sbt#process;0.13.7!process.jar (1816ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/run/0.13.7/jars/run.jar
...
[SUCCESSFUL ] org.scala-sbt#run;0.13.7!run.jar (1943ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/command/0.13.7/jars/command.jar
...
[SUCCESSFUL ] org.scala-sbt#command;0.13.7!command.jar (2554ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/classpath/0.13.7/jars/classpath.jar
...
[SUCCESSFUL ] org.scala-sbt#classpath;0.13.7!classpath.jar (2275ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/completion/0.13.7/jars/completion.jar
...
[SUCCESSFUL ] org.scala-sbt#completion;0.13.7!completion.jar (2443ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/api/0.13.7/jars/api.jar
...
[SUCCESSFUL ] org.scala-sbt#api;0.13.7!api.jar (2558ms)
downloading
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/compiler-integ

Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
This is in active development, so there is not much that can be done from
an end user perspective.  In particular the only sink that is available in
apache/master is a testing sink that just stores the data in memory.  We
are working on a parquet-based file sink and will eventually support all
of the Data Source API file formats (text, json, csv, orc, parquet).

On Tue, Mar 8, 2016 at 7:38 AM, Jacek Laskowski  wrote:

> Hi Praveen,
>
> I don't really know. I think TD or Michael should know as they
> personally involved in the task (as far as I could figure it out from
> the JIRA and the changes). Ping people on the JIRA so they notice your
> question(s).
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Mar 8, 2016 at 12:32 PM, Praveen Devarao 
> wrote:
> > Thanks Jacek for the pointer.
> >
> > Any idea which package can be used in .format(). The test cases seem to
> work
> > out of the DefaultSource class defined within the
> DataFrameReaderWriterSuite
> > [org.apache.spark.sql.streaming.test.DefaultSource]
> >
> > Thanking You
> >
> -
> > Praveen Devarao
> > Spark Technology Centre
> > IBM India Software Labs
> >
> -
> > "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> > end of the day saying I will try again"
> >
> >
> >
> > From:Jacek Laskowski 
> > To:Praveen Devarao/India/IBM@IBMIN
> > Cc:user , dev 
> > Date:08/03/2016 04:17 pm
> > Subject:Re: Spark structured streaming
> > 
> >
> >
> >
> > Hi Praveen,
> >
> > I've spent few hours on the changes related to streaming dataframes
> > (included in the SPARK-8360) and concluded that it's currently only
> > possible to read.stream(), but not write.stream() since there are no
> > streaming Sinks yet.
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao 
> > wrote:
> >> Hi,
> >>
> >> I would like to get my hands on the structured streaming feature
> >> coming out in Spark 2.0. I have tried looking around for code samples to
> >> get
> >> started but am not able to find any. Only few things I could look into
> is
> >> the test cases that have been committed under the JIRA umbrella
> >> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
> >> lead to building a example code as they seem to be working out of
> internal
> >> classes.
> >>
> >> Could anyone point me to some resources or pointers in code
> that I
> >> can start with to understand structured streaming from a consumability
> >> angle.
> >>
> >> Thanking You
> >>
> >>
> -
> >> Praveen Devarao
> >> Spark Technology Centre
> >> IBM India Software Labs
> >>
> >>
> -
> >> "Courage doesn't always roar. Sometimes courage is the quiet voice at
> the
> >> end of the day saying I will try again"
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Installing Spark on Mac

2016-03-08 Thread Eduardo Costa Alfaia
Hi Aida,
The build detected Maven version 3.0.3, but 3.3.3 or newer is required. Update Maven and
try again.
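
If it helps, one way past the enforcer check on a Mac (a sketch that assumes Homebrew; any Maven 3.3.3+ install works):

brew update && brew install maven     # or install Maven 3.3.x manually from maven.apache.org
mvn -version                          # should now report 3.3.3 or newer
cd spark-1.6.0
build/mvn -DskipTests clean package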
Il 08/Mar/2016 14:06, "Aida"  ha scritto:

> Hi all,
>
> Thanks everyone for your responses; really appreciate it.
>
> Eduardo - I tried your suggestions but ran into some issues, please see
> below:
>
> ukdrfs01:Spark aidatefera$ cd spark-1.6.0
> ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
> Using `mvn` from path: /usr/bin/mvn
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
> MaxPermSize=512M;
> support was removed in 8.0
> [INFO] Scanning for projects...
> [INFO]
> 
> [INFO] Reactor Build Order:
> [INFO]
> [INFO] Spark Project Parent POM
> [INFO] Spark Project Test Tags
> [INFO] Spark Project Launcher
> [INFO] Spark Project Networking
> [INFO] Spark Project Shuffle Streaming Service
> [INFO] Spark Project Unsafe
> [INFO] Spark Project Core
> [INFO] Spark Project Bagel
> [INFO] Spark Project GraphX
> [INFO] Spark Project Streaming
> [INFO] Spark Project Catalyst
> [INFO] Spark Project SQL
> [INFO] Spark Project ML Library
> [INFO] Spark Project Tools
> [INFO] Spark Project Hive
> [INFO] Spark Project Docker Integration Tests
> [INFO] Spark Project REPL
> [INFO] Spark Project Assembly
> [INFO] Spark Project External Twitter
> [INFO] Spark Project External Flume Sink
> [INFO] Spark Project External Flume
> [INFO] Spark Project External Flume Assembly
> [INFO] Spark Project External MQTT
> [INFO] Spark Project External MQTT Assembly
> [INFO] Spark Project External ZeroMQ
> [INFO] Spark Project External Kafka
> [INFO] Spark Project Examples
> [INFO] Spark Project External Kafka Assembly
> [INFO]
> [INFO]
> 
> [INFO] Building Spark Project Parent POM 1.6.0
> [INFO]
> 
> [INFO]
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
> spark-parent_2.10 ---
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
> spark-parent_2.10 ---
> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion
> failed with message:
> Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM .. FAILURE [0.821s]
> [INFO] Spark Project Test Tags ... SKIPPED
> [INFO] Spark Project Launcher  SKIPPED
> [INFO] Spark Project Networking .. SKIPPED
> [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
> [INFO] Spark Project Unsafe .. SKIPPED
> [INFO] Spark Project Core  SKIPPED
> [INFO] Spark Project Bagel ... SKIPPED
> [INFO] Spark Project GraphX .. SKIPPED
> [INFO] Spark Project Streaming ... SKIPPED
> [INFO] Spark Project Catalyst  SKIPPED
> [INFO] Spark Project SQL . SKIPPED
> [INFO] Spark Project ML Library .. SKIPPED
> [INFO] Spark Project Tools ... SKIPPED
> [INFO] Spark Project Hive  SKIPPED
> [INFO] Spark Project Docker Integration Tests  SKIPPED
> [INFO] Spark Project REPL  SKIPPED
> [INFO] Spark Project Assembly  SKIPPED
> [INFO] Spark Project External Twitter  SKIPPED
> [INFO] Spark Project External Flume Sink . SKIPPED
> [INFO] Spark Project External Flume .. SKIPPED
> [INFO] Spark Project External Flume Assembly . SKIPPED
> [INFO] Spark Project External MQTT ... SKIPPED
> [INFO] Spark Project External MQTT Assembly .. SKIPPED
> [INFO] Spark Project External ZeroMQ . SKIPPED
> [INFO] Spark Project External Kafka .. SKIPPED
> [INFO] Spark Project Examples  SKIPPED
> [INFO] Spark Project External Kafka Assembly . SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 1.745s
> [INFO] Finished at: Tue Mar 08 18:01:48 GMT 2016
> [INFO] Final Memory: 19M/183M
> [INFO]
> 
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce
> (enforce-versions) on project spark-parent_2.10: Some Enforcer rules have
> failed. Look above for specific messages explaining why the rule failed. ->
> [Help

Not able to save data after running fpgrowth in pyspark

2016-03-08 Thread goutham koneru
Hello,

I am using FPGrowth to generate frequent item sets and the model is working
fine. If I select n rows I am able to see the data.

When I try to save the data using any of the methods like write.orc,
saveAsTable, or saveAsParquet, it takes an unusually long time to save
the data.

If I save the data before running the model, all the methods work perfectly.
Only after running the model do I see this issue. Below is the sample code I
am using. Can you let me know if there is anything wrong with what I am doing,
or whether any configuration changes need to be made?

rdd_buckets.write.orc('/data/buckets') -- *before running the model this
works and writes data in less than 2 minutes.*

transactions = rdd_buckets.rdd.map(lambda line: line.buckets.split('::'))
model = FPGrowth.train(transactions, minSupport=0.01,numPartitions=200)
result = model.freqItemsets().toDF()
size_1_buckets = result.filter(F.size(result.items) == 1)
size_2_buckets = result.filter(F.size(result.items) == 2)
size_1_buckets.registerTempTable('size_1_buckets')
hive_context.sql("use buckets")
hive_context.sql("create table size_1_buckets as select * from
size_1_buckets" ) -- *this step takes long time (10 hours) to complete the
writing process.*
size_2_buckets.registerTempTable('size_2_buckets')
hive_context.sql("use buckets")
hive_context.sql("create table size_2_buckets as select * from
size_2_buckets" ) -- *this step takes **takes long time (10 hours) ** to
complete the writing process.*

Below is the command that we are using to submit the job.

/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client
--num-executors 10 --conf spark.executor.memory=10g --conf
spark.yarn.queue=batch --conf spark.rpc.askTimeout=100s --conf
spark.driver.memory=2g --conf spark.kryoserializer.buffer.max=256m --conf
spark.executor.cores=5 --conf spark.driver.cores=4 python/buckets.py

Thanks,
Goutham.


Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
You said you downloaded a prebuilt version.

You shouldn't have to mess with maven or building spark at all.  All
you need is a jvm, which it looks like you already have installed.

You should be able to follow the instructions at

http://spark.apache.org/docs/latest/

and

http://spark.apache.org/docs/latest/spark-standalone.html

If you want standalone mode (master and several worker processes on
your machine) rather than local mode (single process on your machine),
you need to set up passwordless ssh to localhost

http://stackoverflow.com/questions/7134535/setup-passphraseless-ssh-to-localhost-on-os-x
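
Concretely, the two options look roughly like this (a sketch; use whatever spark:// URL the master log actually prints):

# Local mode (single process) - nothing extra to set up:
./bin/spark-shell --master "local[*]"

# Standalone mode on one machine:
./sbin/start-master.sh                          # master web UI on http://localhost:8080
./sbin/start-slave.sh spark://localhost:7077    # start one worker against the master URL
./bin/spark-shell --master spark://localhost:7077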



On Tue, Mar 8, 2016 at 12:45 PM, Eduardo Costa Alfaia
 wrote:
> Hi Aida,
> The installation has detected a maven version 3.0.3. Update to 3.3.3 and try
> again.
>
> Il 08/Mar/2016 14:06, "Aida"  ha scritto:
>>
>> Hi all,
>>
>> Thanks everyone for your responses; really appreciate it.
>>
>> Eduardo - I tried your suggestions but ran into some issues, please see
>> below:
>>
>> ukdrfs01:Spark aidatefera$ cd spark-1.6.0
>> ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
>> Using `mvn` from path: /usr/bin/mvn
>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>> MaxPermSize=512M;
>> support was removed in 8.0
>> [INFO] Scanning for projects...
>> [INFO]
>> 
>> [INFO] Reactor Build Order:
>> [INFO]
>> [INFO] Spark Project Parent POM
>> [INFO] Spark Project Test Tags
>> [INFO] Spark Project Launcher
>> [INFO] Spark Project Networking
>> [INFO] Spark Project Shuffle Streaming Service
>> [INFO] Spark Project Unsafe
>> [INFO] Spark Project Core
>> [INFO] Spark Project Bagel
>> [INFO] Spark Project GraphX
>> [INFO] Spark Project Streaming
>> [INFO] Spark Project Catalyst
>> [INFO] Spark Project SQL
>> [INFO] Spark Project ML Library
>> [INFO] Spark Project Tools
>> [INFO] Spark Project Hive
>> [INFO] Spark Project Docker Integration Tests
>> [INFO] Spark Project REPL
>> [INFO] Spark Project Assembly
>> [INFO] Spark Project External Twitter
>> [INFO] Spark Project External Flume Sink
>> [INFO] Spark Project External Flume
>> [INFO] Spark Project External Flume Assembly
>> [INFO] Spark Project External MQTT
>> [INFO] Spark Project External MQTT Assembly
>> [INFO] Spark Project External ZeroMQ
>> [INFO] Spark Project External Kafka
>> [INFO] Spark Project Examples
>> [INFO] Spark Project External Kafka Assembly
>> [INFO]
>> [INFO]
>> 
>> [INFO] Building Spark Project Parent POM 1.6.0
>> [INFO]
>> 
>> [INFO]
>> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
>> spark-parent_2.10 ---
>> [INFO]
>> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
>> spark-parent_2.10 ---
>> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion
>> failed with message:
>> Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
>> [INFO]
>> 
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Parent POM .. FAILURE
>> [0.821s]
>> [INFO] Spark Project Test Tags ... SKIPPED
>> [INFO] Spark Project Launcher  SKIPPED
>> [INFO] Spark Project Networking .. SKIPPED
>> [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
>> [INFO] Spark Project Unsafe .. SKIPPED
>> [INFO] Spark Project Core  SKIPPED
>> [INFO] Spark Project Bagel ... SKIPPED
>> [INFO] Spark Project GraphX .. SKIPPED
>> [INFO] Spark Project Streaming ... SKIPPED
>> [INFO] Spark Project Catalyst  SKIPPED
>> [INFO] Spark Project SQL . SKIPPED
>> [INFO] Spark Project ML Library .. SKIPPED
>> [INFO] Spark Project Tools ... SKIPPED
>> [INFO] Spark Project Hive  SKIPPED
>> [INFO] Spark Project Docker Integration Tests  SKIPPED
>> [INFO] Spark Project REPL  SKIPPED
>> [INFO] Spark Project Assembly  SKIPPED
>> [INFO] Spark Project External Twitter  SKIPPED
>> [INFO] Spark Project External Flume Sink . SKIPPED
>> [INFO] Spark Project External Flume .. SKIPPED
>> [INFO] Spark Project External Flume Assembly . SKIPPED
>> [INFO] Spark Project External MQTT ... SKIPPED
>> [INFO] Spark Project External MQTT Assembly .. SKIPPED
>> [INFO] Spark Project External ZeroMQ . SKIPPED
>> [INFO] Spark Project External Kafka .. SKIP

Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Yuval Itzchakov
Actually, I assumed that setting the flag in the spark job would turn on
the shuffle service in the workers. I now understand that assumption was
wrong.

Is there any way to set the flag via the driver? Or must I manually set it
via spark-env.sh on each worker?


On Tue, Mar 8, 2016, 20:14 Silvio Fiorito 
wrote:

> You’ve started the external shuffle service on all worker nodes, correct?
> Can you confirm they’re still running and haven’t exited?
>
>
>
>
>
>
>
> *From: *Yuval.Itzchakov 
> *Sent: *Tuesday, March 8, 2016 12:41 PM
> *To: *user@spark.apache.org
> *Subject: *Using dynamic allocation and shuffle service in Standalone Mode
>
>
> Hi,
> I'm using Spark 1.6.0, and according to the documentation, dynamic
> allocation and spark shuffle service should be enabled.
>
> When I submit a spark job via the following:
>
> spark-submit \
> --master  \
> --deploy-mode cluster \
> --executor-cores 3 \
> --conf "spark.streaming.backpressure.enabled=true" \
> --conf "spark.dynamicAllocation.enabled=true" \
> --conf "spark.dynamicAllocation.minExecutors=2" \
> --conf "spark.dynamicAllocation.maxExecutors=24" \
> --conf "spark.shuffle.service.enabled=true" \
> --conf "spark.executor.memory=8g" \
> --conf "spark.driver.memory=10g" \
> --class SparkJobRunner
>
> /opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar
>
> I'm seeing error logs from the workers being unable to connect to the
> shuffle service:
>
> 16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to external
> shuffle server, will retry 2 more times after waiting 5 seconds...
> java.io.IOException: Failed to connect to 
> at
>
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> at
>
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
> at
>
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
> at
>
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at
>
> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
> at
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
> at org.apache.spark.executor.Executor.(Executor.scala:85)
> at
>
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
> at
>
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> I verified all relevant ports are open. Has anyone else experienced such a
> failure?
>
> Yuval.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Silvio Fiorito
There’s a script to start it up under sbin, start-shuffle-service.sh. Run that 
on each of your worker nodes.
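
Roughly (a sketch, assuming a standard Spark 1.6 layout on each worker):

cd $SPARK_HOME
./sbin/start-shuffle-service.sh
jps        # an ExternalShuffleService process should now be listed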



From: Yuval Itzchakov
Sent: Tuesday, March 8, 2016 2:17 PM
To: Silvio Fiorito; 
user@spark.apache.org
Subject: Re: Using dynamic allocation and shuffle service in Standalone Mode

Actually, I assumed that setting the flag in the spark job would turn on the 
shuffle service in the workers. I now understand that assumption was wrong.

Is there any way to set the flag via the driver? Or must I manually set it via 
spark-env.sh on each worker?


On Tue, Mar 8, 2016, 20:14 Silvio Fiorito 
mailto:silvio.fior...@granturing.com>> wrote:

You’ve started the external shuffle service on all worker nodes, correct? Can 
you confirm they’re still running and haven’t exited?







From: Yuval.Itzchakov
Sent: Tuesday, March 8, 2016 12:41 PM
To: user@spark.apache.org
Subject: Using dynamic allocation and shuffle service in Standalone Mode



Hi,
I'm using Spark 1.6.0, and according to the documentation, dynamic
allocation and spark shuffle service should be enabled.

When I submit a spark job via the following:

spark-submit \
--master  \
--deploy-mode cluster \
--executor-cores 3 \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.dynamicAllocation.minExecutors=2" \
--conf "spark.dynamicAllocation.maxExecutors=24" \
--conf "spark.shuffle.service.enabled=true" \
--conf "spark.executor.memory=8g" \
--conf "spark.driver.memory=10g" \
--class SparkJobRunner
/opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar

I'm seeing error logs from the workers being unable to connect to the
shuffle service:

16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to external
shuffle server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to 
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
at
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
at
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
at org.apache.spark.executor.Executor.<init>(Executor.scala:85)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
at
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I verified all relevant ports are open. Has anyone else experienced such a
failure?

Yuval.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



Re: SparkFiles.get() returns with driver path Instead of Worker Path

2016-03-08 Thread Tristan Nixon
Based on your code:

sparkContext.addFile("/home/files/data.txt");
List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();

I’m assuming the file in “/home/files/data.txt” exists and is readable in the 
driver’s filesystem.
Did you try just doing this:

List<String> file = sparkContext.textFile("/home/files/data.txt").collect();
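
If you do need the worker-local copy, the pattern looks roughly like this (a sketch only; Java 8 lambdas assumed, and the parallelize RDD is just a stand-in for whatever work actually runs on your workers):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkFiles;

sparkContext.addFile("/home/files/data.txt");       // driver ships the file to every executor

List<Integer> lineCounts = sparkContext
    .parallelize(Arrays.asList(1, 2, 3))            // any RDD whose tasks run on the workers
    .map(x -> {
        // SparkFiles.get is called inside the task, so it resolves to the
        // worker-local copy of data.txt (a different path on each worker)
        String localPath = SparkFiles.get("data.txt");
        return Files.readAllLines(Paths.get(localPath)).size();
    })
    .collect();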

> On Mar 8, 2016, at 1:20 PM, Ashik Vetrivelu  wrote:
> 
> Hey, yeah I also tried by setting sc.textFile() with a local path and it 
> still throws the exception when trying to use collect().
> 
> Sorry I am new to spark and I am just messing around with it.
> 
> On Mar 8, 2016 10:23 PM, "Tristan Nixon"  > wrote:
> My understanding of the model is that you’re supposed to execute 
> SparkFiles.get(…) on each worker node, not on the driver.
> 
> Since you already know where the files are on the driver, if you want to load 
> these into an RDD with SparkContext.textFile, then this will distribute it 
> out to the workers, there’s no need to use SparkContext.add to do this.
> 
> If you have some functions that run on workers that expects local file 
> resources, then you can use SparkContext.addFile to distribute the files into 
> worker local storage, then you can execute SparkFiles.get separately on each 
> worker to retrieve these local files (it will give different paths on each 
> worker).
> 
> > On Mar 8, 2016, at 5:31 AM, ashikvc  > > wrote:
> >
> > I am trying to play a little bit with apache-spark cluster mode.
> > So my cluster consists of a driver in my machine and a worker and manager in
> > host machine(separate machine).
> >
> > I send a textfile using `sparkContext.addFile(filepath)` where the filepath
> > is the path of my text file in local machine for which I get the following
> > output:
> >
> >INFO Utils: Copying /home/files/data.txt to
> > /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
> >
> >INFO SparkContext: Added file /home/files/data.txt at
> > http://192.XX.XX.164:58143/files/data.txt 
> >  with timestamp 1457432207649
> >
> > But when I try to access the same file using `SparkFiles.get("data.txt")`, I
> > get the path to file in my driver instead of worker.
> > I am setting my file like this
> >
> >SparkConf conf = new
> > SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
> >conf.setJars(new String[]{"jars/SparkWorker.jar"});
> >JavaSparkContext sparkContext = new JavaSparkContext(conf);
> >sparkContext.addFile("/home/files/data.txt");
> >List file
> > =sparkContext.textFile(SparkFiles.get("data.txt")).collect();
> > I am getting FileNotFoundException here.
> >
> >
> >
> >
> >
> > --
> > View this message in context: 
> > http://apache-spark-user-list.1001560.n3.nabble.com/SparkFiles-get-returns-with-driver-path-Instead-of-Worker-Path-tp26428.html
> >  
> > 
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > 
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > 
> >
> 



Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Andrew Or
Hi Yuval, if you start the Workers with `spark.shuffle.service.enabled =
true` then the workers will each start a shuffle service automatically. No
need to start the shuffle services yourself separately.
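
For reference, one way to do that is via conf/spark-env.sh on every worker, then restart the workers (a sketch):

# conf/spark-env.sh
export SPARK_WORKER_OPTS="-Dspark.shuffle.service.enabled=true"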

-Andrew

2016-03-08 11:21 GMT-08:00 Silvio Fiorito :

> There’s a script to start it up under sbin, start-shuffle-service.sh. Run
> that on each of your worker nodes.
>
>
>
>
>
>
>
> *From: *Yuval Itzchakov 
> *Sent: *Tuesday, March 8, 2016 2:17 PM
> *To: *Silvio Fiorito ;
> user@spark.apache.org
> *Subject: *Re: Using dynamic allocation and shuffle service in Standalone
> Mode
>
>
> Actually, I assumed that setting the flag in the spark job would turn on
> the shuffle service in the workers. I now understand that assumption was
> wrong.
>
> Is there any way to set the flag via the driver? Or must I manually set it
> via spark-env.sh on each worker?
>
>
> On Tue, Mar 8, 2016, 20:14 Silvio Fiorito 
> wrote:
>
>> You’ve started the external shuffle service on all worker nodes, correct?
>> Can you confirm they’re still running and haven’t exited?
>>
>>
>>
>>
>>
>>
>>
>> *From: *Yuval.Itzchakov 
>> *Sent: *Tuesday, March 8, 2016 12:41 PM
>> *To: *user@spark.apache.org
>> *Subject: *Using dynamic allocation and shuffle service in Standalone
>> Mode
>>
>>
>> Hi,
>> I'm using Spark 1.6.0, and according to the documentation, dynamic
>> allocation and spark shuffle service should be enabled.
>>
>> When I submit a spark job via the following:
>>
>> spark-submit \
>> --master  \
>> --deploy-mode cluster \
>> --executor-cores 3 \
>> --conf "spark.streaming.backpressure.enabled=true" \
>> --conf "spark.dynamicAllocation.enabled=true" \
>> --conf "spark.dynamicAllocation.minExecutors=2" \
>> --conf "spark.dynamicAllocation.maxExecutors=24" \
>> --conf "spark.shuffle.service.enabled=true" \
>> --conf "spark.executor.memory=8g" \
>> --conf "spark.driver.memory=10g" \
>> --class SparkJobRunner
>>
>> /opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar
>>
>> I'm seeing error logs from the workers being unable to connect to the
>> shuffle service:
>>
>> 16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to
>> external
>> shuffle server, will retry 2 more times after waiting 5 seconds...
>> java.io.IOException: Failed to connect to 
>> at
>>
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>> at
>>
>> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>> at
>>
>> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>> at
>>
>> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
>> at
>> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> at
>>
>> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
>> at
>> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
>> at org.apache.spark.executor.Executor.(Executor.scala:85)
>> at
>>
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
>> at
>>
>> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>> at
>>
>> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> I verified all relevant ports are open. Has anyone else experienced such a
>> failure?
>>
>> Yuval.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: No event log in /tmp/spark-events

2016-03-08 Thread Andrew Or
Hi Patrick,

I think he means just write `/tmp/sparkserverlog` instead of
`file:/tmp/sparkserverlog`. However, I think both should work. What mode
are you running in, client mode (the default) or cluster mode? If the
latter, your driver will run on the cluster, and so your event logs won't
be on the machine you ran spark-submit from. Also, are you running
standalone, YARN or Mesos?

As Jeff commented above, if event log is in fact enabled you should see the
log message from EventLoggingListener. If the log message is not present in
your driver logs, it's likely that the configurations in your
spark-defaults.conf are not passed correctly.
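
For comparison, a minimal spark-defaults.conf sketch that should produce the EventLoggingListener message in the driver log (the log directory must already exist; paths are illustrative):

spark.eventLog.enabled   true
spark.eventLog.dir       file:/tmp/spark-events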

-Andrew

2016-03-03 19:57 GMT-08:00 PatrickYu :

> alvarobrandon wrote
> > Just write /tmp/sparkserverlog without the file part.
>
> I don't get your point, what's mean of 'without the file part'
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/No-event-log-in-tmp-spark-events-tp26318p26394.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Yuval Itzchakov
Great.
Thanks a lot Silvio.

On Tue, Mar 8, 2016, 21:39 Andrew Or  wrote:

> Hi Yuval, if you start the Workers with `spark.shuffle.service.enabled =
> true` then the workers will each start a shuffle service automatically. No
> need to start the shuffle services yourself separately.
>
> -Andrew
>
> 2016-03-08 11:21 GMT-08:00 Silvio Fiorito :
>
>> There’s a script to start it up under sbin, start-shuffle-service.sh. Run
>> that on each of your worker nodes.
>>
>>
>>
>>
>>
>>
>>
>> *From: *Yuval Itzchakov 
>> *Sent: *Tuesday, March 8, 2016 2:17 PM
>> *To: *Silvio Fiorito ;
>> user@spark.apache.org
>> *Subject: *Re: Using dynamic allocation and shuffle service in
>> Standalone Mode
>>
>>
>> Actually, I assumed that setting the flag in the spark job would turn on
>> the shuffle service in the workers. I now understand that assumption was
>> wrong.
>>
>> Is there any way to set the flag via the driver? Or must I manually set
>> it via spark-env.sh on each worker?
>>
>>
>> On Tue, Mar 8, 2016, 20:14 Silvio Fiorito 
>> wrote:
>>
>>> You’ve started the external shuffle service on all worker nodes,
>>> correct? Can you confirm they’re still running and haven’t exited?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From: *Yuval.Itzchakov 
>>> *Sent: *Tuesday, March 8, 2016 12:41 PM
>>> *To: *user@spark.apache.org
>>> *Subject: *Using dynamic allocation and shuffle service in Standalone
>>> Mode
>>>
>>>
>>> Hi,
>>> I'm using Spark 1.6.0, and according to the documentation, dynamic
>>> allocation and spark shuffle service should be enabled.
>>>
>>> When I submit a spark job via the following:
>>>
>>> spark-submit \
>>> --master  \
>>> --deploy-mode cluster \
>>> --executor-cores 3 \
>>> --conf "spark.streaming.backpressure.enabled=true" \
>>> --conf "spark.dynamicAllocation.enabled=true" \
>>> --conf "spark.dynamicAllocation.minExecutors=2" \
>>> --conf "spark.dynamicAllocation.maxExecutors=24" \
>>> --conf "spark.shuffle.service.enabled=true" \
>>> --conf "spark.executor.memory=8g" \
>>> --conf "spark.driver.memory=10g" \
>>> --class SparkJobRunner
>>>
>>> /opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar
>>>
>>> I'm seeing error logs from the workers being unable to connect to the
>>> shuffle service:
>>>
>>> 16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to
>>> external
>>> shuffle server, will retry 2 more times after waiting 5 seconds...
>>> java.io.IOException: Failed to connect to 
>>> at
>>>
>>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>>> at
>>>
>>> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>>> at
>>>
>>> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>>> at
>>>
>>> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
>>> at
>>> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>> at
>>>
>>> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
>>> at
>>> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
>>> at org.apache.spark.executor.Executor.(Executor.scala:85)
>>> at
>>>
>>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
>>> at
>>>
>>> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>>> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>> at
>>>
>>> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> I verified all relevant ports are open. Has anyone else experienced such
>>> a
>>> failure?
>>>
>>> Yuval.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


Re: Best practices of maintaining a long running SparkContext

2016-03-08 Thread Zhong Wang
+spark-users

We are using Zeppelin (http://zeppelin.incubator.apache.org) as our UI to
run spark jobs. Zeppelin maintains a long running SparkContext, and we run
into a couple of issues:
--
1. Dynamic resource allocation keeps removing and registering executors,
even though no jobs are running
2. EventLogging doesn't work due to an HDFS lease issue. Similar to this:
https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3ccae6kwsp_c00gksmnx0obu5aouxphdjs-syqywt-jfi3psvc...@mail.gmail.com%3E
3. The SparkUI is getting slower due to a large number of history jobs
4. Cached data disappears mysteriously (shown on the Storage page, but not on
the Executors page)

The aim of this thread is not to resolve specific issues (though any ideas on
the listed issues are welcome), but to hear suggestions about best
practices for maintaining a long-running SparkContext from both the Spark
and Zeppelin communities.

Thanks,
Zhong

On Tue, Mar 8, 2016 at 11:13 AM, Zhong Wang  wrote:

> Thanks for your insights, Deenar. I think this is really helpful to users
> who want to run Zeppelin as a service.
>
> The caching issue we experienced seems to be a Spark bug, because I see
> some inconsistent states through the SparkUI, but thanks for pointing out
> the potential reasons.
>
> I am still interested in for the people who run Zeppelin as a service,
> whether you have experienced bugs or memory leaks, and how did you deal
> with these.
>
> Thanks!
>
> Zhong
>
> On Tue, Mar 8, 2016 at 8:17 AM, Deenar Toraskar  > wrote:
>
>> 1) You should turn dynamic allocation on see
>> http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
>> to maximise utilisation of your cluster resources. This might be a reason
>> you are seeing cached data disappearing.
>> 2) If other processes cache data and the amount of data cached is larger
>> than your cluster memory, Spark will evict some cached data from memory.
>> 3) If you are using Kerberos authentication, you need a process that
>> renews tickets.
>>
>> Deenar
>>
>> On 8 March 2016 at 01:35, Zhong Wang  wrote:
>>
>>> Hi zeppelin-users,
>>>
>>> Because Zeppelin relies on a long running SparkContext, it is quite
>>> important to make it stable to improve availability. From my experience, I
>>> run into a couple of issues if I run a SparkContext for several days,
>>> including:
>>> --
>>> 1. EventLoggong doest work due to HDFS lease issue. Similar to this:
>>> https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3ccae6kwsp_c00gksmnx0obu5aouxphdjs-syqywt-jfi3psvc...@mail.gmail.com%3E
>>> 2. SparkUI is getting slower due to large number of history jobs
>>> 3. Cached data is gone mystically
>>>
>>> They may not be Zeppelin issues, but I would like to hear the problems
>>> you run into, and your experience of how to deal with maintaining a long
>>> running SparkContext.
>>>
>>> I know that we can do some cleanups periodically by restarting the spark
>>> interpreter, but I am wondering whether there are better ways.
>>>
>>> Thanks!
>>>
>>> Zhong
>>>
>>
>>
>


Re: Spark ML - Scaling logistic regression for many features

2016-03-08 Thread Daniel Siegmann
Just for the heck of it I tried the old MLlib implementation, but it had
the same scalability problem.

Anyone familiar with the logistic regression implementation who could weigh
in?
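
For anyone following along, the RDD-based MLlib route suggested in the quoted thread below looks roughly like this in Java (a sketch only; "training" is assumed to be an existing JavaRDD<LabeledPoint> whose features are SparseVectors):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.regression.LabeledPoint;

public static LogisticRegressionModel fit(JavaRDD<LabeledPoint> training) {
    LogisticRegressionWithLBFGS lr = new LogisticRegressionWithLBFGS();
    lr.setNumClasses(2);              // binary classification
    lr.optimizer().setRegParam(1.0);  // same regParam as the ml.* snippet quoted below
    return lr.run(training.rdd());    // run() takes the underlying Scala RDD
}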

On Mon, Mar 7, 2016 at 5:35 PM, Michał Zieliński <
zielinski.mich...@gmail.com> wrote:

> We're using SparseVector columns in a DataFrame, so they are definitely
> supported. But maybe for LR some implicit magic is happening inside.
>
> On 7 March 2016 at 23:04, Devin Jones  wrote:
>
>> I could be wrong but its possible that toDF populates a dataframe which I
>> understand do not support sparsevectors at the moment.
>>
>> If you use the MlLib logistic regression implementation (not ml) you can
>> pass the RDD[LabeledPoint] data type directly to the learner.
>>
>>
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>>
>> Only downside is that you can't use the pipeline framework from spark ml.
>>
>> Cheers,
>> Devin
>>
>>
>>
>> On Mon, Mar 7, 2016 at 4:54 PM, Daniel Siegmann <
>> daniel.siegm...@teamaol.com> wrote:
>>
>>> Yes, it is a SparseVector. Most rows only have a few features, and all
>>> the rows together only have tens of thousands of features, but the vector
>>> size is ~ 20 million because that is the largest feature.
>>>
>>> On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones 
>>> wrote:
>>>
 Hi,

 Which data structure are you using to train the model? If you haven't
 tried yet, you should consider the SparseVector


 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector


 On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <
 daniel.siegm...@teamaol.com> wrote:

> I recently tried to train a model using
> org.apache.spark.ml.classification.LogisticRegression on a data set
> where the feature vector size was around ~20 million. It did *not* go
> well. It took around 10 hours to train on a substantial cluster.
> Additionally, it pulled a lot data back to the driver - I eventually set 
> --conf
> spark.driver.memory=128g --conf spark.driver.maxResultSize=112g when
> submitting.
>
> Attempting the same application on the same cluster with the feature
> vector size reduced to 100k took only ~ 9 minutes. Clearly there is an
> issue with scaling to large numbers of features. I'm not doing anything
> fancy in my app, here's the relevant code:
>
> val lr = new LogisticRegression().setRegParam(1)
> val model = lr.fit(trainingSet.toDF())
>
> In comparison, a coworker trained a logistic regression model on her
> *laptop* using the Java library liblinear in just a few minutes.
> That's with the ~20 million-sized feature vectors. This suggests to me
> there is some issue with Spark ML's implementation of logistic regression
> which is limiting its scalability.
>
> Note that my feature vectors are *very* sparse. The maximum feature
> is around 20 million, but I think there are only 10's of thousands of
> features.
>
> Has anyone run into this? Any idea where the bottleneck is or how this
> problem might be solved?
>
> One solution of course is to implement some dimensionality reduction.
> I'd really like to avoid this, as it's just another thing to deal with -
> not so hard to put it into the trainer, but then anything doing scoring
> will need the same logic. Unless Spark ML supports this out of the box? An
> easy way to save / load a model along with the dimensionality reduction
> logic so when transform is called on the model it will handle the
> dimensionality reduction transparently?
>
> Any advice would be appreciated.
>
> ~Daniel Siegmann
>


>>>
>>
>


Re: Best practices of maintaining a long running SparkContext

2016-03-08 Thread Mich Talebzadeh
Hi,

I have recently started experimenting with Zeppelin and run it on TCP port
21999 (configurable in zeppelin-env.sh). The daemon seems to be stable.
However, I have noticed that it goes stale from time to time, and also that
killing the UI does not stop the job properly. Sometimes it is also
necessary to kill the connections at the OS level, as "zeppelin-daemon.sh stop"
does not stop it. The logs are pretty informative. A SparkContext is
created whenever you start a new job in the UI, as below:

-- Create new SparkContext local[*] ---
[Stage 5:==> (688 + 12) /
12544]

The interpreter screen is pretty useful for changing the configuration
parameters. To be honest, I am not sure how many concurrent Spark contexts
one can run. In my case only one process runs and the rest remain pending in
the queue until the running one completes.

I would be interested to know how many concurrent jobs you can run through
the UI.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 8 March 2016 at 20:41, Zhong Wang  wrote:

> +spark-users
>
> We are using Zeppelin (http://zeppelin.incubator.apache.org) as our UI to
> run spark jobs. Zeppelin maintains a long running SparkContext, and we run
> into a couple of issues:
> --
> 1. Dynamic resource allocation keeps removing and registering executors,
> even though no jobs are running
> 2. EventLogging doesn't work due to HDFS lease issue. Similar to this:
> https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3ccae6kwsp_c00gksmnx0obu5aouxphdjs-syqywt-jfi3psvc...@mail.gmail.com%3E
> 3. SparkUI is getting slower due to large number of history jobs
> 4. Cached data is gone mystically (shown in the Storage page, but not in
> the Executor page)
>
> The aim of this thread is not resolve specific issues (though any ideas on
> the listed issue will be welcome), but to hear suggestions about the best
> practices of maintaining a long running SparkContext from both the Spark
> and Zeppelin community.
>
> Thanks,
> Zhong
>
> On Tue, Mar 8, 2016 at 11:13 AM, Zhong Wang 
> wrote:
>
>> Thanks for your insights, Deenar. I think this is really helpful to users
>> who want to run Zeppelin as a service.
>>
>> The caching issue we experienced seems to be a Spark bug, because I see
>> some inconsistent states through the SparkUI, but thanks for pointing out
>> the potential reasons.
>>
>> I am still interested in for the people who run Zeppelin as a service,
>> whether you have experienced bugs or memory leaks, and how did you deal
>> with these.
>>
>> Thanks!
>>
>> Zhong
>>
>> On Tue, Mar 8, 2016 at 8:17 AM, Deenar Toraskar <
>> deenar.toras...@gmail.com> wrote:
>>
>>> 1) You should turn dynamic allocation on see
>>> http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
>>> to maximise utilisation of your cluster resources. This might be a reason
>>> you are seeing cached data disappearing.
>>> 2) If other processes cache data and the amount of data cached is larger
>>> than your cluster memory, Spark will evict some cached data from memory.
>>> 3) If you are using Kerberos authentication, you need a process that
>>> renews tickets.
>>>
>>> Deenar
>>>
>>> On 8 March 2016 at 01:35, Zhong Wang  wrote:
>>>
 Hi zeppelin-users,

 Because Zeppelin relies on a long running SparkContext, it is quite
 important to make it stable to improve availability. From my experience, I
 run into a couple of issues if I run a SparkContext for several days,
 including:
 --
 1. EventLoggong doest work due to HDFS lease issue. Similar to this:
 https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3ccae6kwsp_c00gksmnx0obu5aouxphdjs-syqywt-jfi3psvc...@mail.gmail.com%3E
 2. SparkUI is getting slower due to large number of history jobs
 3. Cached data is gone mystically

 They may not be Zeppelin issues, but I would like to hear the problems
 you run into, and your experience of how to deal with maintaining a long
 running SparkContext.

 I know that we can do some cleanups periodically by restarting the
 spark interpreter, but I am wondering whether there are better ways.

 Thanks!

 Zhong

>>>
>>>
>>
>


How to add a custom jar file to the Spark driver?

2016-03-08 Thread Gerhard Fiedler
We're running Spark 1.6.0 on EMR, in YARN client mode. We run Python code, but 
we want to add a custom jar file to the driver.

When running on a local one-node standalone cluster, we just use 
spark.driver.extraClassPath and everything works:

spark-submit --conf spark.driver.extraClassPath=/path/to/our/custom/jar/*  
our-python-script.py

But on EMR, this value is set to something that is needed to make their 
installation of Spark work. Setting it to point to our custom jar overwrites 
the original setting rather than adding to it and breaks Spark.

Our current workaround is to capture whatever EMR sets 
spark.driver.extraClassPath to once, then use that path and add our jar file to 
it. Of course this breaks when EMR changes this path in their cluster settings, 
and we wouldn't necessarily notice that easily. This is how it looks:

spark-submit --conf 
spark.driver.extraClassPath=/path/to/our/custom/jar/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
  our-python-script.py

We prefer not to do this...

We tried the spark-submit argument --jars, but it didn't seem to do anything. 
Like this:

spark-submit --jars /path/to/our/custom/jar/file.jar  our-python-script.py

We also tried to set CLASSPATH, but it doesn't seem to have any impact:

export CLASSPATH=/path/to/our/custom/jar/*
spark-submit  our-python-script.py

When using SPARK_CLASSPATH, we got warnings that it is deprecated, and the 
messages also seemed to imply that it affects the same configuration that is 
set by spark.driver.extraClassPath.


So, my question is: Is there a clean way to add a custom jar file to a Spark 
configuration?

Thanks,
Gerhard



Re: Installing Spark on Mac

2016-03-08 Thread Aida Tefera
Hi Cody, thanks for your reply

I tried "sbt/sbt clean assembly" in the Terminal; somehow I still end up with 
errors.

I have looked at the links below; they don't give much detail on how to install it 
before executing "./sbin/start-master.sh"

Thanks,

Aida
Sent from my iPhone

> On 8 Mar 2016, at 19:02, Cody Koeninger  wrote:
> 
> You said you downloaded a prebuilt version.
> 
> You shouldn't have to mess with maven or building spark at all.  All
> you need is a jvm, which it looks like you already have installed.
> 
> You should be able to follow the instructions at
> 
> http://spark.apache.org/docs/latest/
> 
> and
> 
> http://spark.apache.org/docs/latest/spark-standalone.html
> 
> If you want standalone mode (master and several worker processes on
> your machine) rather than local mode (single process on your machine),
> you need to set up passwordless ssh to localhost
> 
> http://stackoverflow.com/questions/7134535/setup-passphraseless-ssh-to-localhost-on-os-x
> 
> 
> 
> On Tue, Mar 8, 2016 at 12:45 PM, Eduardo Costa Alfaia
>  wrote:
>> Hi Aida,
>> The installation has detected a maven version 3.0.3. Update to 3.3.3 and try
>> again.
>> 
>>> On 08/Mar/2016 14:06, "Aida"  wrote:
>>> 
>>> Hi all,
>>> 
>>> Thanks everyone for your responses; really appreciate it.
>>> 
>>> Eduardo - I tried your suggestions but ran into some issues, please see
>>> below:
>>> 
>>> ukdrfs01:Spark aidatefera$ cd spark-1.6.0
>>> ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
>>> Using `mvn` from path: /usr/bin/mvn
>>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>>> MaxPermSize=512M;
>>> support was removed in 8.0
>>> [INFO] Scanning for projects...
>>> [INFO]
>>> 
>>> [INFO] Reactor Build Order:
>>> [INFO]
>>> [INFO] Spark Project Parent POM
>>> [INFO] Spark Project Test Tags
>>> [INFO] Spark Project Launcher
>>> [INFO] Spark Project Networking
>>> [INFO] Spark Project Shuffle Streaming Service
>>> [INFO] Spark Project Unsafe
>>> [INFO] Spark Project Core
>>> [INFO] Spark Project Bagel
>>> [INFO] Spark Project GraphX
>>> [INFO] Spark Project Streaming
>>> [INFO] Spark Project Catalyst
>>> [INFO] Spark Project SQL
>>> [INFO] Spark Project ML Library
>>> [INFO] Spark Project Tools
>>> [INFO] Spark Project Hive
>>> [INFO] Spark Project Docker Integration Tests
>>> [INFO] Spark Project REPL
>>> [INFO] Spark Project Assembly
>>> [INFO] Spark Project External Twitter
>>> [INFO] Spark Project External Flume Sink
>>> [INFO] Spark Project External Flume
>>> [INFO] Spark Project External Flume Assembly
>>> [INFO] Spark Project External MQTT
>>> [INFO] Spark Project External MQTT Assembly
>>> [INFO] Spark Project External ZeroMQ
>>> [INFO] Spark Project External Kafka
>>> [INFO] Spark Project Examples
>>> [INFO] Spark Project External Kafka Assembly
>>> [INFO]
>>> [INFO]
>>> 
>>> [INFO] Building Spark Project Parent POM 1.6.0
>>> [INFO]
>>> 
>>> [INFO]
>>> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
>>> spark-parent_2.10 ---
>>> [INFO]
>>> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
>>> spark-parent_2.10 ---
>>> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion
>>> failed with message:
>>> Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
>>> [INFO]
>>> 
>>> [INFO] Reactor Summary:
>>> [INFO]
>>> [INFO] Spark Project Parent POM .. FAILURE
>>> [0.821s]
>>> [INFO] Spark Project Test Tags ... SKIPPED
>>> [INFO] Spark Project Launcher  SKIPPED
>>> [INFO] Spark Project Networking .. SKIPPED
>>> [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
>>> [INFO] Spark Project Unsafe .. SKIPPED
>>> [INFO] Spark Project Core  SKIPPED
>>> [INFO] Spark Project Bagel ... SKIPPED
>>> [INFO] Spark Project GraphX .. SKIPPED
>>> [INFO] Spark Project Streaming ... SKIPPED
>>> [INFO] Spark Project Catalyst  SKIPPED
>>> [INFO] Spark Project SQL . SKIPPED
>>> [INFO] Spark Project ML Library .. SKIPPED
>>> [INFO] Spark Project Tools ... SKIPPED
>>> [INFO] Spark Project Hive  SKIPPED
>>> [INFO] Spark Project Docker Integration Tests  SKIPPED
>>> [INFO] Spark Project REPL  SKIPPED
>>> [INFO] Spark Project Assembly  SKIPPED
>>> [INFO] Spark Project External Twitter  SKIPPED
>>> [INFO] S

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
That's what I'm saying, there is no "installing" necessary for
pre-built packages.  Just unpack it and change directory into it.

What happens when you do

./bin/spark-shell --master local[2]

or

./bin/start-all.sh



On Tue, Mar 8, 2016 at 3:45 PM, Aida Tefera  wrote:
> Hi Cody, thanks for your reply
>
> I tried "sbt/sbt clean assembly" in the Terminal; somehow I still end up with 
> errors.
>
> I have looked at the below links, doesn't give much detail on how to install 
> it before executing "./sbin/start-master.sh"
>
> Thanks,
>
> Aida
> Sent from my iPhone
>
>> On 8 Mar 2016, at 19:02, Cody Koeninger  wrote:
>>
>> You said you downloaded a prebuilt version.
>>
>> You shouldn't have to mess with maven or building spark at all.  All
>> you need is a jvm, which it looks like you already have installed.
>>
>> You should be able to follow the instructions at
>>
>> http://spark.apache.org/docs/latest/
>>
>> and
>>
>> http://spark.apache.org/docs/latest/spark-standalone.html
>>
>> If you want standalone mode (master and several worker processes on
>> your machine) rather than local mode (single process on your machine),
>> you need to set up passwordless ssh to localhost
>>
>> http://stackoverflow.com/questions/7134535/setup-passphraseless-ssh-to-localhost-on-os-x
>>
>>
>>
>> On Tue, Mar 8, 2016 at 12:45 PM, Eduardo Costa Alfaia
>>  wrote:
>>> Hi Aida,
>>> The installation has detected a maven version 3.0.3. Update to 3.3.3 and try
>>> again.
>>>
>>> On 08/Mar/2016 14:06, "Aida"  wrote:

 Hi all,

 Thanks everyone for your responses; really appreciate it.

 Eduardo - I tried your suggestions but ran into some issues, please see
 below:

 ukdrfs01:Spark aidatefera$ cd spark-1.6.0
 ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
 Using `mvn` from path: /usr/bin/mvn
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
 MaxPermSize=512M;
 support was removed in 8.0
 [INFO] Scanning for projects...
 [INFO]
 
 [INFO] Reactor Build Order:
 [INFO]
 [INFO] Spark Project Parent POM
 [INFO] Spark Project Test Tags
 [INFO] Spark Project Launcher
 [INFO] Spark Project Networking
 [INFO] Spark Project Shuffle Streaming Service
 [INFO] Spark Project Unsafe
 [INFO] Spark Project Core
 [INFO] Spark Project Bagel
 [INFO] Spark Project GraphX
 [INFO] Spark Project Streaming
 [INFO] Spark Project Catalyst
 [INFO] Spark Project SQL
 [INFO] Spark Project ML Library
 [INFO] Spark Project Tools
 [INFO] Spark Project Hive
 [INFO] Spark Project Docker Integration Tests
 [INFO] Spark Project REPL
 [INFO] Spark Project Assembly
 [INFO] Spark Project External Twitter
 [INFO] Spark Project External Flume Sink
 [INFO] Spark Project External Flume
 [INFO] Spark Project External Flume Assembly
 [INFO] Spark Project External MQTT
 [INFO] Spark Project External MQTT Assembly
 [INFO] Spark Project External ZeroMQ
 [INFO] Spark Project External Kafka
 [INFO] Spark Project Examples
 [INFO] Spark Project External Kafka Assembly
 [INFO]
 [INFO]
 
 [INFO] Building Spark Project Parent POM 1.6.0
 [INFO]
 
 [INFO]
 [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
 spark-parent_2.10 ---
 [INFO]
 [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
 spark-parent_2.10 ---
 [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion
 failed with message:
 Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
 [INFO]
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. FAILURE
 [0.821s]
 [INFO] Spark Project Test Tags ... SKIPPED
 [INFO] Spark Project Launcher  SKIPPED
 [INFO] Spark Project Networking .. SKIPPED
 [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
 [INFO] Spark Project Unsafe .. SKIPPED
 [INFO] Spark Project Core  SKIPPED
 [INFO] Spark Project Bagel ... SKIPPED
 [INFO] Spark Project GraphX .. SKIPPED
 [INFO] Spark Project Streaming ... SKIPPED
 [INFO] Spark Project Catalyst  SKIPPED
 [INFO] Spark Project SQL . SKIPPED
 [INFO] Spark Project ML Library .. SKIPPED
 [INFO] Spark Project Tools 

Re: Installing Spark on Mac

2016-03-08 Thread Aida Tefera
Ok, once I downloaded the pre-built version, I created a directory for it and 
named it Spark

When I try ./bin/start-all.sh 

It comes back with : no such file or directory 

When I try ./bin/spark-shell --master local[2]

I get: no such file or directory
Failed to find spark assembly, you need to build Spark before running this 
program



Sent from my iPhone

> On 8 Mar 2016, at 21:50, Cody Koeninger  wrote:
> 
> That's what I'm saying, there is no "installing" necessary for
> pre-built packages.  Just unpack it and change directory into it.
> 
> What happens when you do
> 
> ./bin/spark-shell --master local[2]
> 
> or
> 
> ./bin/start-all.sh
> 
> 
> 
>> On Tue, Mar 8, 2016 at 3:45 PM, Aida Tefera  wrote:
>> Hi Cody, thanks for your reply
>> 
>> I tried "sbt/sbt clean assembly" in the Terminal; somehow I still end up 
>> with errors.
>> 
>> I have looked at the below links, doesn't give much detail on how to install 
>> it before executing "./sbin/start-master.sh"
>> 
>> Thanks,
>> 
>> Aida
>> Sent from my iPhone
>> 
>>> On 8 Mar 2016, at 19:02, Cody Koeninger  wrote:
>>> 
>>> You said you downloaded a prebuilt version.
>>> 
>>> You shouldn't have to mess with maven or building spark at all.  All
>>> you need is a jvm, which it looks like you already have installed.
>>> 
>>> You should be able to follow the instructions at
>>> 
>>> http://spark.apache.org/docs/latest/
>>> 
>>> and
>>> 
>>> http://spark.apache.org/docs/latest/spark-standalone.html
>>> 
>>> If you want standalone mode (master and several worker processes on
>>> your machine) rather than local mode (single process on your machine),
>>> you need to set up passwordless ssh to localhost
>>> 
>>> http://stackoverflow.com/questions/7134535/setup-passphraseless-ssh-to-localhost-on-os-x
>>> 
>>> 
>>> 
>>> On Tue, Mar 8, 2016 at 12:45 PM, Eduardo Costa Alfaia
>>>  wrote:
 Hi Aida,
 The installation has detected a maven version 3.0.3. Update to 3.3.3 and 
 try
 again.
 
 On 08/Mar/2016 14:06, "Aida"  wrote:
> 
> Hi all,
> 
> Thanks everyone for your responses; really appreciate it.
> 
> Eduardo - I tried your suggestions but ran into some issues, please see
> below:
> 
> ukdrfs01:Spark aidatefera$ cd spark-1.6.0
> ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
> Using `mvn` from path: /usr/bin/mvn
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
> MaxPermSize=512M;
> support was removed in 8.0
> [INFO] Scanning for projects...
> [INFO]
> 
> [INFO] Reactor Build Order:
> [INFO]
> [INFO] Spark Project Parent POM
> [INFO] Spark Project Test Tags
> [INFO] Spark Project Launcher
> [INFO] Spark Project Networking
> [INFO] Spark Project Shuffle Streaming Service
> [INFO] Spark Project Unsafe
> [INFO] Spark Project Core
> [INFO] Spark Project Bagel
> [INFO] Spark Project GraphX
> [INFO] Spark Project Streaming
> [INFO] Spark Project Catalyst
> [INFO] Spark Project SQL
> [INFO] Spark Project ML Library
> [INFO] Spark Project Tools
> [INFO] Spark Project Hive
> [INFO] Spark Project Docker Integration Tests
> [INFO] Spark Project REPL
> [INFO] Spark Project Assembly
> [INFO] Spark Project External Twitter
> [INFO] Spark Project External Flume Sink
> [INFO] Spark Project External Flume
> [INFO] Spark Project External Flume Assembly
> [INFO] Spark Project External MQTT
> [INFO] Spark Project External MQTT Assembly
> [INFO] Spark Project External ZeroMQ
> [INFO] Spark Project External Kafka
> [INFO] Spark Project Examples
> [INFO] Spark Project External Kafka Assembly
> [INFO]
> [INFO]
> 
> [INFO] Building Spark Project Parent POM 1.6.0
> [INFO]
> 
> [INFO]
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
> spark-parent_2.10 ---
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
> spark-parent_2.10 ---
> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion
> failed with message:
> Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM .. FAILURE
> [0.821s]
> [INFO] Spark Project Test Tags ... SKIPPED
> [INFO] Spark Project Launcher  SKIPPED
> [INFO] Spark Project Networking .. SKIPPED
> [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
> [INFO] Spark Project Unsafe 

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread ayan guha
Yes, if it falls within the batch. But if the requirement is to flush
everything up to the 15th minute of the hour, then it should work.
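
For example, a rough sketch of that idea (untested; assumes the windowed DStream
is called windowedStream and writePartitionToDb is a placeholder for the actual
DB writer):

windowedStream.foreachRDD { (rdd, time) =>
  // time is the batch time in milliseconds; only write out when this batch
  // falls on the 15th minute of the hour
  val minuteOfHour = (time.milliseconds / 60000L) % 60
  if (minuteOfHour == 15) {
    rdd.foreachPartition { partition =>
      writePartitionToDb(partition)  // hypothetical DB writer
    }
  }
}
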
On 9 Mar 2016 04:01, "Ted Yu"  wrote:

> That may miss the 15th minute of the hour (with non-trivial deviation),
> right ?
>
> On Tue, Mar 8, 2016 at 8:50 AM, ayan guha  wrote:
>
>> Why not compare current time in every batch and it meets certain
>> condition emit the data?
>> On 9 Mar 2016 00:19, "Abhishek Anand"  wrote:
>>
>>> I have a spark streaming job where I am aggregating the data by doing
>>> reduceByKeyAndWindow with inverse function.
>>>
>>> I am keeping the data in memory for up to 2 hours, and in order to output
>>> the reduced data to external storage I conditionally need to push the
>>> data to the DB, say at the 15th minute of each hour.
>>>
>>> How can this be achieved.
>>>
>>>
>>> Regards,
>>> Abhi
>>>
>>
>


Hive Context: Hive Metastore Client

2016-03-08 Thread Alex F
As of Spark 1.6.0 it is now possible to create new Hive Context sessions
sharing various components, but right now the Hive Metastore client is
shared among all new Hive Context sessions.

Are there any plans to create individual Metastore Clients for each Hive
Context?

Related to the question above: are there any plans to create an interface
for customizing the username that the Metastore Client uses to connect to
the Hive Metastore? Right now it either uses the user specified in an
environment variable or the application's process owner.
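
To make the sharing concrete, this is roughly the pattern I mean (a sketch
against the Spark 1.6 Scala API):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)     // sets up the (single) Hive metastore client
val session2 = hiveContext.newSession()   // new session: its own conf, UDFs and temp tables,
                                          // but the underlying metastore client stays shared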


Streaming job delays

2016-03-08 Thread jleaniz
Hi,

I have a streaming application that reads batches from Flume, does some
transformations and then writes parquet files to HDFS.

The problem I have right now is that the scheduling delays are really, really
high, and get even higher as time goes on. I have seen them go up to 24 hours. The
processing time for each batch is usually steady at 50s or less.

The workers and master are pretty much idle most of the time. Any ideas why
the scheduling time would be so high when the processing time is low?

Thanks

Juan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-job-delays-tp26433.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread jleaniz
You've got to start the shuffle service on all your workers. There's a script
for that in the 'sbin' directory.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430p26434.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
http://spark.apache.org/downloads.html

Make sure you selected a "Choose a package type" option that says pre-built

In my case,  spark-1.6.0-bin-hadoop2.4.tgz

bash-3.2$ cd ~/Downloads/

bash-3.2$ tar -xzvf spark-1.6.0-bin-hadoop2.4.tgz

bash-3.2$ cd spark-1.6.0-bin-hadoop2.4/

bash-3.2$ ./bin/spark-shell

Works fine


On Tue, Mar 8, 2016 at 4:01 PM, Aida Tefera  wrote:
> Ok, once I downloaded the pre built version, I created a directory for it and 
> named Spark
>
> When I try ./bin/start-all.sh
>
> It comes back with : no such file or directory
>
> When I try ./bin/spark-shell --master local[2]
>
> I get: no such file or directory
> Failed to find spark assembly, you need to build Spark before running this 
> program
>
>
>
> Sent from my iPhone
>
>> On 8 Mar 2016, at 21:50, Cody Koeninger  wrote:
>>
>> That's what I'm saying, there is no "installing" necessary for
>> pre-built packages.  Just unpack it and change directory into it.
>>
>> What happens when you do
>>
>> ./bin/spark-shell --master local[2]
>>
>> or
>>
>> ./bin/start-all.sh
>>
>>
>>
>>> On Tue, Mar 8, 2016 at 3:45 PM, Aida Tefera  wrote:
>>> Hi Cody, thanks for your reply
>>>
>>> I tried "sbt/sbt clean assembly" in the Terminal; somehow I still end up 
>>> with errors.
>>>
>>> I have looked at the below links, doesn't give much detail on how to 
>>> install it before executing "./sbin/start-master.sh"
>>>
>>> Thanks,
>>>
>>> Aida
>>> Sent from my iPhone
>>>
 On 8 Mar 2016, at 19:02, Cody Koeninger  wrote:

 You said you downloaded a prebuilt version.

 You shouldn't have to mess with maven or building spark at all.  All
 you need is a jvm, which it looks like you already have installed.

 You should be able to follow the instructions at

 http://spark.apache.org/docs/latest/

 and

 http://spark.apache.org/docs/latest/spark-standalone.html

 If you want standalone mode (master and several worker processes on
 your machine) rather than local mode (single process on your machine),
 you need to set up passwordless ssh to localhost

 http://stackoverflow.com/questions/7134535/setup-passphraseless-ssh-to-localhost-on-os-x



 On Tue, Mar 8, 2016 at 12:45 PM, Eduardo Costa Alfaia
  wrote:
> Hi Aida,
> The installation has detected a maven version 3.0.3. Update to 3.3.3 and 
> try
> again.
>
> On 08/Mar/2016 14:06, "Aida"  wrote:
>>
>> Hi all,
>>
>> Thanks everyone for your responses; really appreciate it.
>>
>> Eduardo - I tried your suggestions but ran into some issues, please see
>> below:
>>
>> ukdrfs01:Spark aidatefera$ cd spark-1.6.0
>> ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
>> Using `mvn` from path: /usr/bin/mvn
>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>> MaxPermSize=512M;
>> support was removed in 8.0
>> [INFO] Scanning for projects...
>> [INFO]
>> 
>> [INFO] Reactor Build Order:
>> [INFO]
>> [INFO] Spark Project Parent POM
>> [INFO] Spark Project Test Tags
>> [INFO] Spark Project Launcher
>> [INFO] Spark Project Networking
>> [INFO] Spark Project Shuffle Streaming Service
>> [INFO] Spark Project Unsafe
>> [INFO] Spark Project Core
>> [INFO] Spark Project Bagel
>> [INFO] Spark Project GraphX
>> [INFO] Spark Project Streaming
>> [INFO] Spark Project Catalyst
>> [INFO] Spark Project SQL
>> [INFO] Spark Project ML Library
>> [INFO] Spark Project Tools
>> [INFO] Spark Project Hive
>> [INFO] Spark Project Docker Integration Tests
>> [INFO] Spark Project REPL
>> [INFO] Spark Project Assembly
>> [INFO] Spark Project External Twitter
>> [INFO] Spark Project External Flume Sink
>> [INFO] Spark Project External Flume
>> [INFO] Spark Project External Flume Assembly
>> [INFO] Spark Project External MQTT
>> [INFO] Spark Project External MQTT Assembly
>> [INFO] Spark Project External ZeroMQ
>> [INFO] Spark Project External Kafka
>> [INFO] Spark Project Examples
>> [INFO] Spark Project External Kafka Assembly
>> [INFO]
>> [INFO]
>> 
>> [INFO] Building Spark Project Parent POM 1.6.0
>> [INFO]
>> 
>> [INFO]
>> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
>> spark-parent_2.10 ---
>> [INFO]
>> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
>> spark-parent_2.10 ---
>> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion
>> failed with message:
>> Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
>> [INFO]
>> -

Re: Installing Spark on Mac

2016-03-08 Thread Jakob Odersky
I've had some issues myself with the user-provided-Hadoop version.
If you just want to get started, I would recommend downloading
Spark (pre-built, with any of the Hadoop versions) as Cody suggested.

A simple step-by-step guide:

1. curl http://apache.arvixe.com/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
-O

2. tar -xzf spark-1.6.0-bin-hadoop2.6.tgz

3. cd spark-1.6.0-bin-hadoop2.6

4. ./bin/spark-shell --master local[2]

On Tue, Mar 8, 2016 at 2:01 PM, Aida Tefera  wrote:
> Ok, once I downloaded the pre built version, I created a directory for it and 
> named Spark
>
> When I try ./bin/start-all.sh
>
> It comes back with : no such file or directory
>
> When I try ./bin/spark-shell --master local[2]
>
> I get: no such file or directory
> Failed to find spark assembly, you need to build Spark before running this 
> program
>
>
>
> Sent from my iPhone
>
>> On 8 Mar 2016, at 21:50, Cody Koeninger  wrote:
>>
>> That's what I'm saying, there is no "installing" necessary for
>> pre-built packages.  Just unpack it and change directory into it.
>>
>> What happens when you do
>>
>> ./bin/spark-shell --master local[2]
>>
>> or
>>
>> ./bin/start-all.sh
>>
>>
>>
>>> On Tue, Mar 8, 2016 at 3:45 PM, Aida Tefera  wrote:
>>> Hi Cody, thanks for your reply
>>>
>>> I tried "sbt/sbt clean assembly" in the Terminal; somehow I still end up 
>>> with errors.
>>>
>>> I have looked at the below links, doesn't give much detail on how to 
>>> install it before executing "./sbin/start-master.sh"
>>>
>>> Thanks,
>>>
>>> Aida
>>> Sent from my iPhone
>>>
 On 8 Mar 2016, at 19:02, Cody Koeninger  wrote:

 You said you downloaded a prebuilt version.

 You shouldn't have to mess with maven or building spark at all.  All
 you need is a jvm, which it looks like you already have installed.

 You should be able to follow the instructions at

 http://spark.apache.org/docs/latest/

 and

 http://spark.apache.org/docs/latest/spark-standalone.html

 If you want standalone mode (master and several worker processes on
 your machine) rather than local mode (single process on your machine),
 you need to set up passwordless ssh to localhost

 http://stackoverflow.com/questions/7134535/setup-passphraseless-ssh-to-localhost-on-os-x



 On Tue, Mar 8, 2016 at 12:45 PM, Eduardo Costa Alfaia
  wrote:
> Hi Aida,
> The installation has detected a maven version 3.0.3. Update to 3.3.3 and 
> try
> again.
>
> On 08/Mar/2016 14:06, "Aida"  wrote:
>>
>> Hi all,
>>
>> Thanks everyone for your responses; really appreciate it.
>>
>> Eduardo - I tried your suggestions but ran into some issues, please see
>> below:
>>
>> ukdrfs01:Spark aidatefera$ cd spark-1.6.0
>> ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
>> Using `mvn` from path: /usr/bin/mvn
>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>> MaxPermSize=512M;
>> support was removed in 8.0
>> [INFO] Scanning for projects...
>> [INFO]
>> 
>> [INFO] Reactor Build Order:
>> [INFO]
>> [INFO] Spark Project Parent POM
>> [INFO] Spark Project Test Tags
>> [INFO] Spark Project Launcher
>> [INFO] Spark Project Networking
>> [INFO] Spark Project Shuffle Streaming Service
>> [INFO] Spark Project Unsafe
>> [INFO] Spark Project Core
>> [INFO] Spark Project Bagel
>> [INFO] Spark Project GraphX
>> [INFO] Spark Project Streaming
>> [INFO] Spark Project Catalyst
>> [INFO] Spark Project SQL
>> [INFO] Spark Project ML Library
>> [INFO] Spark Project Tools
>> [INFO] Spark Project Hive
>> [INFO] Spark Project Docker Integration Tests
>> [INFO] Spark Project REPL
>> [INFO] Spark Project Assembly
>> [INFO] Spark Project External Twitter
>> [INFO] Spark Project External Flume Sink
>> [INFO] Spark Project External Flume
>> [INFO] Spark Project External Flume Assembly
>> [INFO] Spark Project External MQTT
>> [INFO] Spark Project External MQTT Assembly
>> [INFO] Spark Project External ZeroMQ
>> [INFO] Spark Project External Kafka
>> [INFO] Spark Project Examples
>> [INFO] Spark Project External Kafka Assembly
>> [INFO]
>> [INFO]
>> 
>> [INFO] Building Spark Project Parent POM 1.6.0
>> [INFO]
>> 
>> [INFO]
>> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
>> spark-parent_2.10 ---
>> [INFO]
>> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @
>> spark-parent_2.10 ---
>> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion
>> failed with message:
>> Detected Maven

Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Mich Talebzadeh
Hi,

What do you mean by Hive Metastore Client? Are you referring to the Hive server
login, much like beeline?

Spark uses hive-site.xml to get the details of the Hive metastore and the login
to the metastore, which could be any database. Mine is Oracle, and as far as
I know even in Hive 2, hive-site.xml has an entry for
javax.jdo.option.ConnectionUserName that specifies the username to use against
the metastore database. These are all multi-threaded JDBC connections to the
database, using the same login as shown below:

LOGINSID/serial# LOGGED IN S HOST   OS PID Client PID
PROGRAM   MEM/KB  Logical I/O Physical I/O ACT
 --- --- -- -- --
---    ---
INFO
---
HIVEUSER 67,6160 08/03 08:11 rhes564oracle/20539   hduser/1234
JDBC Thin Clien1,017   370 N
HIVEUSER 89,6421 08/03 08:11 rhes564oracle/20541   hduser/1234
JDBC Thin Clien1,081  5280 N
HIVEUSER 112,561 08/03 10:45 rhes564oracle/24624   hduser/1234
JDBC Thin Clien  889   370 N
HIVEUSER 131,881108/03 08:11 rhes564oracle/20543   hduser/1234
JDBC Thin Clien1,017   370 N
HIVEUSER 47,3011408/03 10:45 rhes564oracle/24626   hduser/1234
JDBC Thin Clien1,017   370 N
HIVEUSER 170,895508/03 08:11 rhes564oracle/20545   hduser/1234
JDBC Thin Clien1,017  3230 N

As I understand it, what you are suggesting is that each Spark user uses a
different login to connect to the Hive metastore. As of now there is only one
login that connects to the Hive metastore, shared among all

2016-03-08T23:08:01,890 INFO  [pool-5-thread-72]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test tbl=t
2016-03-08T23:18:10,432 INFO  [pool-5-thread-81]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
ip=50.140.197.216   cmd=source:50.140.197.216 get_tables: db=asehadoop
pat=.*

And this is an entry in the Hive log when a connection is made through the Zeppelin UI

2016-03-08T23:20:13,546 INFO  [pool-5-thread-84]: metastore.HiveMetaStore
(HiveMetaStore.java:newRawStore(499)) - 84: Opening raw store with
implementation class:org.apache.hadoop.hive.metastore.ObjectStore
2016-03-08T23:20:13,547 INFO  [pool-5-thread-84]: metastore.ObjectStore
(ObjectStore.java:initialize(318)) - ObjectStore, initialize called
2016-03-08T23:20:13,550 INFO  [pool-5-thread-84]:
metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:(142)) - Using
direct SQL, underlying DB is ORACLE
2016-03-08T23:20:13,550 INFO  [pool-5-thread-84]: metastore.ObjectStore
(ObjectStore.java:setConf(301)) - Initialized ObjectStore

I am not sure there is currently such a plan to allow different logins
to the Hive Metastore. But it would add another level of security, though I am
not sure how this would be authenticated.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 8 March 2016 at 22:23, Alex F  wrote:

> As of Spark 1.6.0 it is now possible to create new Hive Context sessions
> sharing various components but right now the Hive Metastore Client is
> shared amongst each new Hive Context Session.
>
> Are there any plans to create individual Metastore Clients for each Hive
> Context?
>
> Related to the question above are there any plans to create an interface
> for customizing the username that the Metastore Client uses to connect to
> the Hive Metastore? Right now it either uses the user specified in an
> environment variable or the application's process owner.
>


Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Alex
Yes, when creating a Hive Context a Hive Metastore client should be 
created with a user that the Spark application will talk to the *remote* 
Hive Metastore with. We would like to add a custom authorization plugin 
to our remote Hive Metastore to authorize the query requests that the 
spark application is submitting which would also add authorization for 
any other applications hitting the Hive Metastore. Furthermore we would 
like to extend this so that we can submit "jobs" to our Spark 
application that will allow us to run against the metastore as different 
users while leveraging the abilities of our Spark cluster. But as you 
mentioned, only one login connects to the Hive Metastore and it is shared among 
all HiveContext sessions.


Likely the authentication would have to be completed either through a 
secured Hive Metastore (Kerberos) or by having the requests go through 
HiveServer2.


--Alex

On 3/8/2016 3:13 PM, Mich Talebzadeh wrote:

Hi,

What do you mean by Hive Metastore Client? Are you referring to Hive 
server login much like beeline?


Spark uses hive-site.xml to get the details of Hive metastore and the 
login to the metastore which could be any database. Mine is Oracle and 
as far as I know even in  Hive 2, hive-site.xml has an entry for 
javax.jdo.option.ConnectionUserName that specifies username to use 
against metastore database. These are all multi-threaded JDBC 
connections to the database, the same login as shown below:


LOGIN SID/serial# LOGGED IN S HOST   OS PID Client PID 
PROGRAM   MEM/KB  Logical I/O Physical I/O ACT
 --- --- -- -- 
-- ---   
 ---

INFO
---
HIVEUSER 67,6160 08/03 08:11 rhes564oracle/20539 
hduser/1234JDBC Thin Clien1,017 370 N
HIVEUSER 89,6421 08/03 08:11 rhes564oracle/20541 
hduser/1234JDBC Thin Clien1,081 5280 N
HIVEUSER 112,561 08/03 10:45 rhes564oracle/24624 
hduser/1234JDBC Thin Clien  889 370 N
HIVEUSER 131,881108/03 08:11 rhes564oracle/20543 
hduser/1234JDBC Thin Clien1,017 370 N
HIVEUSER 47,3011408/03 10:45 rhes564oracle/24626 
hduser/1234JDBC Thin Clien1,017 370 N
HIVEUSER 170,895508/03 08:11 rhes564oracle/20545 
hduser/1234JDBC Thin Clien1,017 3230 N


As I understand what you are suggesting is that each Spark user uses 
different login to connect to Hive metastore. As of now there is only 
one login that connects to Hive metastore shared among all


2016-03-08T23:08:01,890 INFO  [pool-5-thread-72]: HiveMetaStore.audit 
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser 
ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test 
tbl=t
2016-03-08T23:18:10,432 INFO  [pool-5-thread-81]: HiveMetaStore.audit 
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser 
ip=50.140.197.216   cmd=source:50.140.197.216 get_tables: 
db=asehadoop pat=.*


And this is an entry in the Hive log when a connection is made through the 
Zeppelin UI


2016-03-08T23:20:13,546 INFO  [pool-5-thread-84]: 
metastore.HiveMetaStore (HiveMetaStore.java:newRawStore(499)) - 84: 
Opening raw store with implementation 
class:org.apache.hadoop.hive.metastore.ObjectStore
2016-03-08T23:20:13,547 INFO  [pool-5-thread-84]: 
metastore.ObjectStore (ObjectStore.java:initialize(318)) - 
ObjectStore, initialize called
2016-03-08T23:20:13,550 INFO  [pool-5-thread-84]: 
metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:(142)) - 
Using direct SQL, underlying DB is ORACLE
2016-03-08T23:20:13,550 INFO  [pool-5-thread-84]: 
metastore.ObjectStore (ObjectStore.java:setConf(301)) - Initialized 
ObjectStore


I am not sure there is currently such plan to have different logins 
allowed to Hive Metastore. But it will add another level of security. 
Though I am not sure how this would be authenticated.


HTH



Dr Mich Talebzadeh

LinkedIn 
/https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/


http://talebzadehmich.wordpress.com 


On 8 March 2016 at 22:23, Alex F > wrote:


As of Spark 1.6.0 it is now possible to create new Hive Context
sessions sharing various components but right now the Hive
Metastore Client is shared amongst each new Hive Context Session.

Are there any plans to create individual Metastore Clients for
each Hive Context?

Related to the question above are there any plans to create an
interface for customizing the username that the Metastore Client
uses to connect to the Hive Metastore? Right now it either uses
the user specified in an environment variable or the application's
process owner.






SparkML. RandomForest scalability question.

2016-03-08 Thread Eugene Morozov
Hi,

I have a 4-node cluster: one master (which also hosts the HDFS namenode) and 3 workers
(with 3 colocated HDFS datanodes). Each worker has only 2 cores, and
spark.executor.memory is 2.3g.
The input file spans two HDFS blocks, with the block size configured as 64MB.

I train a random forest regression with numTrees=50 and maxDepth=10, and I
measure the time for different values of default parallelism. I don't see the
expected speed-up with increased parallelism. Is the following
expected?

parallelism, time (minutes)
2, 27.0
3, 20.5
4, 23.8
5, 21.4
6, 19.9
12, 22.6
24, 29.7

I saw Spark does a pretty good job of scheduling tasks as evenly as
possible. For parallelism 2 and 3, tasks always run on different machines,
and for all other parallelism values cached RDD blocks are split evenly across
the cluster. All data is cached and kept deserialized 1x.

I realise there shouldn't be any boost for parallelism 24 with my setup,
but I've measured it out of curiosity. I'd expect to see some boost with
parallelism 4 and 5, though.

There might be some disk contention (HDDs are in place), as there are pretty
large shuffle writes (300 to 600 MB), but that would apply to all
parallelism values above 3. I've monitored disk usage with the atop utility and
haven't noticed any contention there.

I realise this might just as well be poor measurement, as I've run this only
once and timed it. Usually the recommended way is to measure it several
times and get the mean, sigma, etc.

Has anyone experienced similar behaviour? Could you give me any advice or an
explanation of what's happening?
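
For reference, the training call looks roughly like this (a sketch; parameters
other than numTrees and maxDepth are illustrative values, using the RDD-based
MLlib API):

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// data: RDD[LabeledPoint] read from the two HDFS blocks, cached before training
val model = RandomForest.trainRegressor(
  data,              // input
  Map[Int, Int](),   // categoricalFeaturesInfo -- illustrative: all features continuous
  50,                // numTrees
  "auto",            // featureSubsetStrategy
  "variance",        // impurity
  10,                // maxDepth
  32)                // maxBins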
--
Be well!
Jean Morozov


Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Mich Talebzadeh
The current scenario resembles a three-tier architecture but without the
security of the second tier. In a typical three-tier setup, the users connecting
to the application server (read: Hive Server 2) are independently
authenticated and, if OK, the second tier creates new .NET-type or JDBC
threads to connect to the database, much like multi-threading. The problem, I
believe, is that Hive Server 2 does not have that concept of handling
individual logins yet. Hive Server 2 should also be able to handle LDAP logins.
It is a useful layer to have.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 8 March 2016 at 23:28, Alex  wrote:

> Yes, when creating a Hive Context a Hive Metastore client should be
> created with a user that the Spark application will talk to the *remote*
> Hive Metastore with. We would like to add a custom authorization plugin to
> our remote Hive Metastore to authorize the query requests that the spark
> application is submitting which would also add authorization for any other
> applications hitting the Hive Metastore. Furthermore we would like to
> extend this so that we can submit "jobs" to our Spark application that will
> allow us to run against the metastore as different users while leveraging
> the abilities of our spark cluster. But as you mentioned only one login
> connects to the Hive Metastore is shared among all HiveContext sessions.
>
> Likely the authentication would have to be completed either through a
> secured Hive Metastore (Kerberos) or by having the requests go through
> HiveServer2.
>
> --Alex
>
>
> On 3/8/2016 3:13 PM, Mich Talebzadeh wrote:
>
> Hi,
>
> What do you mean by Hive Metastore Client? Are you referring to Hive
> server login much like beeline?
>
> Spark uses hive-site.xml to get the details of Hive metastore and the
> login to the metastore which could be any database. Mine is Oracle and as
> far as I know even in  Hive 2, hive-site.xml has an entry for
> javax.jdo.option.ConnectionUserName that specifies username to use against
> metastore database. These are all multi-threaded JDBC connections to the
> database, the same login as shown below:
>
> LOGINSID/serial# LOGGED IN S HOST   OS PID Client PID
> PROGRAM   MEM/KB  Logical I/O Physical I/O ACT
>  --- --- -- -- --
> ---    ---
> INFO
> ---
> HIVEUSER 67,6160 08/03 08:11 rhes564oracle/20539   hduser/1234
> JDBC Thin Clien1,017   370 N
> HIVEUSER 89,6421 08/03 08:11 rhes564oracle/20541   hduser/1234
> JDBC Thin Clien1,081  5280 N
> HIVEUSER 112,561 08/03 10:45 rhes564oracle/24624   hduser/1234
> JDBC Thin Clien  889   370 N
> HIVEUSER 131,881108/03 08:11 rhes564oracle/20543   hduser/1234
> JDBC Thin Clien1,017   370 N
> HIVEUSER 47,3011408/03 10:45 rhes564oracle/24626   hduser/1234
> JDBC Thin Clien1,017   370 N
> HIVEUSER 170,895508/03 08:11 rhes564oracle/20545   hduser/1234
> JDBC Thin Clien1,017  3230 N
>
> As I understand what you are suggesting is that each Spark user uses
> different login to connect to Hive metastore. As of now there is only one
> login that connects to Hive metastore shared among all
>
> 2016-03-08T23:08:01,890 INFO  [pool-5-thread-72]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test tbl=t
> 2016-03-08T23:18:10,432 INFO  [pool-5-thread-81]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
> ip=50.140.197.216   cmd=source:50.140.197.216 get_tables: db=asehadoop
> pat=.*
>
> And this is an entry in Hive log when connection is made theough Zeppelin
> UI
>
> 2016-03-08T23:20:13,546 INFO  [pool-5-thread-84]: metastore.HiveMetaStore
> (HiveMetaStore.java:newRawStore(499)) - 84: Opening raw store with
> implementation class:org.apache.hadoop.hive.metastore.ObjectStore
> 2016-03-08T23:20:13,547 INFO  [pool-5-thread-84]: metastore.ObjectStore
> (ObjectStore.java:initialize(318)) - ObjectStore, initialize called
> 2016-03-08T23:20:13,550 INFO  [pool-5-thread-84]:
> metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:(142)) - Using
> direct SQL, underlying DB is ORACLE
> 2016-03-08T23:20:13,550 INFO  [pool-5-thread-84]: metastore.ObjectStore
> (ObjectStore.java:setConf(301)) - Initialized ObjectStore
>
> I am not sure there is currently such plan to have different logins
> allowed to Hive Metastore. But it will add another level of security.
> Though I am not sure how this would be authenticated.
>

Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Alex
I agree it is a useful layer. During my investigations into 
individual user connections from a Spark application, I ran some 
tests with HiveServer2: using Beeline I was able to authenticate the 
users passed in correctly, but when it came down to authorizing the 
queries on the metastore, they were all using the initial user connection 
that HiveServer2 had made with the Hive Metastore.


My intention is that, should we get access to the Hive Metastore 
client and its configuration through the Hive Context, we could 
create new HiveContext sessions, each with its own connection to the 
Hive Metastore, have the authorization for the query be completed on 
the Metastore itself, and handle the authentication of the users ourselves, 
acting as the second tier.


It sounds like this functionality is not likely to be implemented any 
time soon though so we will have to find a solution in the meantime.


Thanks,
Alex

On 3/8/2016 4:00 PM, Mich Talebzadeh wrote:
The current scenario resembles a three tier architecture but without 
the security of second tier. In a typical three-tier you have users 
connecting to the application server (read Hive server2) 
are independently authenticated and if OK, the second tier creates new 
,NET type or JDBC threads to connect to database much like 
multi-threading. The problem I believe is that Hive server 2 does not 
have that concept of handling the individual loggings yet. Hive server 
2 should be able to handle LDAP logins as well. It is a useful layer 
to have.


Dr Mich Talebzadeh

LinkedIn 
/https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/


http://talebzadehmich.wordpress.com 


On 8 March 2016 at 23:28, Alex > wrote:


Yes, when creating a Hive Context a Hive Metastore client should
be created with a user that the Spark application will talk to the
*remote* Hive Metastore with. We would like to add a custom
authorization plugin to our remote Hive Metastore to authorize the
query requests that the spark application is submitting which
would also add authorization for any other applications hitting
the Hive Metastore. Furthermore we would like to extend this so
that we can submit "jobs" to our Spark application that will allow
us to run against the metastore as different users while
leveraging the abilities of our spark cluster. But as you
mentioned only one login connects to the Hive Metastore is shared
among all HiveContext sessions.

Likely the authentication would have to be completed either
through a secured Hive Metastore (Kerberos) or by having the
requests go through HiveServer2.

--Alex


On 3/8/2016 3:13 PM, Mich Talebzadeh wrote:

Hi,

What do you mean by Hive Metastore Client? Are you referring to
Hive server login much like beeline?

Spark uses hive-site.xml to get the details of Hive metastore and
the login to the metastore which could be any database. Mine is
Oracle and as far as I know even in  Hive 2, hive-site.xml has an
entry for javax.jdo.option.ConnectionUserName that specifies
username to use against metastore database. These are all
multi-threaded JDBC connections to the database, the same login
as shown below:

LOGIN SID/serial# LOGGED IN S HOST   OS PID Client
PID PROGRAM   MEM/KB  Logical I/O Physical I/O ACT
 --- --- -- --
-- ---  
 ---
INFO
---
HIVEUSER 67,6160 08/03 08:11 rhes564 oracle/20539  
hduser/1234JDBC Thin Clien1,017   37 0 N
HIVEUSER 89,6421 08/03 08:11 rhes564 oracle/20541  
hduser/1234JDBC Thin Clien1,081  528 0 N
HIVEUSER 112,561 08/03 10:45 rhes564 oracle/24624  
hduser/1234JDBC Thin Clien  889   37 0 N
HIVEUSER 131,881108/03 08:11 rhes564 oracle/20543  
hduser/1234JDBC Thin Clien1,017   37 0 N
HIVEUSER 47,3011408/03 10:45 rhes564 oracle/24626  
hduser/1234JDBC Thin Clien1,017   37 0 N
HIVEUSER 170,895508/03 08:11 rhes564 oracle/20545  
hduser/1234JDBC Thin Clien1,017  323 0 N


As I understand what you are suggesting is that each Spark user
uses different login to connect to Hive metastore. As of now
there is only one login that connects to Hive metastore shared
among all

2016-03-08T23:08:01,890 INFO  [pool-5-thread-72]:
HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(280)) -
ugi=hduser  ip=50.140.197.217 cmd=source:50.140.197.217
get_table : db=test tbl=t
2016-03-08T23:18:10,432 INFO  [pool-5-thread-81]:
HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(280))

Saving multiple outputs in the same job

2016-03-08 Thread Andy Sloane
We have a somewhat complex pipeline which has multiple output files on
HDFS, and we'd like the materialization of those outputs to happen
concurrently.

Internal to Spark, any "save" call creates a new "job", which runs
synchronously -- that is, the line of code after your save() executes once
the job completes, executing the entire dependency DAG to produce it. Same
with foreach, collect, count, etc.

The files we want to save have overlapping dependencies. For us to create
multiple outputs concurrently, we have a few options that I can see:
 - Spawn a thread for each file we want to save, letting Spark run the jobs
somewhat independently. This has the downside of various concurrency bugs
(e.g. SPARK-4454, and more recently SPARK-13631) and also causes RDDs
up the dependency graph to get independently, uselessly recomputed.
 - Analyze our own dependency graph, materialize (by checkpointing or
saving) common dependencies, and then executing the two saves in threads.
 - Implement our own saves to HDFS as side-effects inside mapPartitions
(which is how save actually works internally anyway, modulo committing
logic to handle speculative execution), yielding an empty dummy RDD for
each thing we want to save, and then run foreach or count on the union of
all the dummy RDDs, which causes Spark to schedule the entire DAG we're
interested in.

Currently we are doing a little of #1 and a little of #3, depending on who
originally wrote the code. #2 is probably closer to what we're supposed to
be doing, but IMO Spark is already able to produce a good execution plan
and we shouldn't have to do that.

AFAIK, there's no way to do what I *actually* want in Spark, which is to
have some control over which saves go into which jobs, and then execute the
jobs directly. I can envision a new version of the various save functions
which take an extra job argument, or something, or some way to defer and
unblock job creation in the spark context.

Ideas?
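
For concreteness, a rough sketch of option #1 (placeholder RDD names and paths;
persisting the shared upstream RDD first avoids the useless recomputation
mentioned above):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

shared.persist()  // 'shared' is the common upstream RDD both outputs depend on

val saves = Seq(
  Future { outputA.saveAsTextFile("hdfs:///out/a") },  // each save runs as its own job
  Future { outputB.saveAsTextFile("hdfs:///out/b") }   // in a separate thread
)
Await.result(Future.sequence(saves), Duration.Inf)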


Confusing RDD function

2016-03-08 Thread Hemminger Jeff
I'm currently developing a Spark Streaming application.

I have a function that receives an RDD and an object instance as
parameters, and returns an RDD:

def doTheThing(a: RDD[A], b: B): RDD[C]


Within the function, I do some processing within a map of the RDD.
Like this:


def doTheThing(a: RDD[A], b: B): RDD[C] = {

  a.combineByKey(...).map(b.function(_))

}


I combine the RDD by key, then map the results calling a function of
instance b, and return the results.

Here is where I ran into trouble.

In a unit test running Spark in memory, I was able to convince myself that
this worked well.

But in our development environment, the returned RDD results were empty and
b.function(_) was never executed.

However, when I added an otherwise useless foreach:


def doTheThing(a: RDD[A], b: B): RDD[C] = {

  val results = a.combineByKey(...).map(b.function(_))

  results.foreach( p => p )

  results

}


Then it works.

So, basically, adding an extra foreach iteration appears to cause
b.function(_) to execute and return results correctly.

I find this confusing. Can anyone shed some light on why this would be?

Thank you,
Jeff
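
P.S. I understand that RDD transformations such as map are lazy and only run
once an action forces evaluation, which would be consistent with the foreach
making it work; I just find it surprising in this context. A trivial sketch of
that laziness (illustrative names):

val mapped = rdd.map(expensiveFunction)  // nothing executes yet; map only builds lineage
mapped.count()                           // an action: only now does expensiveFunction run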


updating the Books section on the Spark documentation page

2016-03-08 Thread Mohammed Guller
Hi -

The Spark documentation page (http://spark.apache.org/documentation.html) has 
links to books covering Spark. What is the process for adding a new book to 
that list?

Thanks,
Mohammed
Author: Big Data Analytics with Spark




pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Andy Davidson

I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python notebook
that reads a data frame from Cassandra.

I connect to Cassandra using an ssh tunnel running on port 9043. CQLSH works;
however, I cannot figure out how to configure my notebook. I have tried
various hacks. Any idea what I am doing wrong?

: java.io.IOException: Failed to open native connection to Cassandra at
{192.168.1.126}:9042



Thanks in advance

Andy



$ extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
--packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.11"

$ export PYSPARK_PYTHON=python3
$ export PYSPARK_DRIVER_PYTHON=python3
$ IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs $*



In [15]:
sqlContext.setConf("spark.cassandra.connection.host", "127.0.0.1:9043")
df = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="time_series", keyspace="notification")\
    .load()

df.printSchema()
df.show()
---
Py4JJavaError Traceback (most recent call last)
 in ()
  1 
sqlContext.setConf("spark.cassandra.connection.host","localhost:9043")
> 2 df = sqlContext.read.format("org.apache.spark.sql.cassandra")
.options(table="kv", keyspace="notification").load()
  3 
  4 df.printSchema()
  5 df.show()

/Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/pyspa
rk/sql/readwriter.py in load(self, path, format, schema, **options)
137 return self._df(self._jreader.load(path))
138 else:
--> 139 return self._df(self._jreader.load())
140 
141 @since(1.4)

/Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/lib/p
y4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814 
815 for temp_arg in temp_args:

/Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/pyspa
rk/sql/utils.py in deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/lib/p
y4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client,
target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling o280.load.
: java.io.IOException: Failed to open native connection to Cassandra at
{192.168.1.126}:9042
at 
com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$conn
ector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(Cassand
raConnector.scala:148)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(Cassand
raConnector.scala:148)
at 
com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCo
untedCache.scala:31)
at 
com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.sca
la:56)
at 
com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraCon
nector.scala:81)
at 
com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraC
onnector.scala:109)
at 
com.datastax.spark.connector.rdd.partitioner.CassandraRDDPartitioner$.getTok
enFactory(CassandraRDDPartitioner.scala:184)
at 
org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourc
eRelation.scala:267)
at 
org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.sc
ala:57)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(Resolve
dDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.jav

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Ted Yu
Have you contacted the spark-cassandra-connector mailing list?

I wonder where the port 9042 came from.

Cheers

On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson  wrote:

>
> I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python
> notebook that reads a data frame from Cassandra.
>
> *I connect to cassadra using an ssh tunnel running on port 9043.* CQLSH
> works how ever I can not figure out how to configure my notebook. I have
> tried various hacks any idea what I am doing wrong
>
> : java.io.IOException: Failed to open native connection to Cassandra at 
> {192.168.1.126}:9042
>
>
>
>
> Thanks in advance
>
> Andy
>
>
>
> $ extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
> --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.11"
>
> $ export PYSPARK_PYTHON=python3
> $ export PYSPARK_DRIVER_PYTHON=python3
> $ IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs $*
>
>
>
> In [15]:
> 1
>
> sqlContext.setConf("spark.cassandra.connection.host”,”127.0.0.1:9043")
>
> 2
>
> df = sqlContext.read\
>
> 3
>
> .format("org.apache.spark.sql.cassandra")\
>
> 4
>
> .options(table=“time_series", keyspace="notification")\
>
> 5
>
> .load()
>
> 6
>
> ​
>
> 7
>
> df.printSchema()
>
> 8
>
> df.show()
>
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input> in <module>()
>       1 sqlContext.setConf("spark.cassandra.connection.host","localhost:9043")
> ----> 2 df = sqlContext.read.format("org.apache.spark.sql.cassandra")
>                 .options(table="kv", keyspace="notification").load()
>       3
>       4 df.printSchema()
>       5 df.show()
>
> /Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
>     137             return self._df(self._jreader.load(path))
>     138         else:
> --> 139             return self._df(self._jreader.load())
>
> /Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     811         answer = self.gateway_client.send_command(command)
>     812         return_value = get_return_value(
> --> 813             answer, self.gateway_client, self.target_id, self.name)
>
> /Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/utils.py in deco(*a, **kw)
>      44         try:
> ---> 45             return f(*a, **kw)
>      46         except py4j.protocol.Py4JJavaError as e:
>
> /Users/andrewdavidson/workSpace/spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     307                 "An error occurred while calling {0}{1}{2}.\n".
> --> 308                 format(target_id, ".", name), value)
>
> Py4JJavaError: An error occurred while calling o280.load.
> : java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
>   at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
>   at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
>   at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
>   at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
>   at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
>   at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
>   at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
>   at com.datastax.spark.connector.rdd.partitioner.CassandraRDDPartitioner$.getTokenFactory(CassandraRDDPartitioner.scala:184)
>   at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:267)
>   at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:57)
>   at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   ...

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Andy Davidson
Hi Ted

I believe by default cassandra listens on 9042

From:  Ted Yu 
Date:  Tuesday, March 8, 2016 at 6:11 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: pyspark spark-cassandra-connector java.io.IOException: Failed
to open native connection to Cassandra at {192.168.1.126}:9042

> Have you contacted spark-cassandra-connector related mailing list ?
> 
> I wonder where the port 9042 came from.
> 
> Cheers
> 
> On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson 
> wrote:
>> 
>> I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python notebook
>> that reads a data frame from Cassandra.
>> 
>> I connect to Cassandra using an ssh tunnel running on port 9043. CQLSH works,
>> however I can not figure out how to configure my notebook. I have tried
>> various hacks. Any idea what I am doing wrong?
>> 
>> : java.io.IOException: Failed to open native connection to Cassandra at
>> {192.168.1.126}:9042
>> 
>> 
>> 
>> Thanks in advance
>> 
>> Andy
>> 
>> 
>> 
>> $ extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
>> --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.11"
>> 
>> $ export PYSPARK_PYTHON=python3
>> $ export PYSPARK_DRIVER_PYTHON=python3
>> $ IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs $*
>> 
>> 
>> 
>> In [15]:
>> sqlContext.setConf("spark.cassandra.connection.host", "127.0.0.1:9043")
>> df = sqlContext.read\
>>     .format("org.apache.spark.sql.cassandra")\
>>     .options(table="time_series", keyspace="notification")\
>>     .load()
>>
>> df.printSchema()
>> df.show()
>> ---------------------------------------------------------------------------
>> Py4JJavaError                             Traceback (most recent call last)
>> <ipython-input> in <module>()
>>       1 sqlContext.setConf("spark.cassandra.connection.host","localhost:9043")
>> ----> 2 df = sqlContext.read.format("org.apache.spark.sql.cassandra")
>>                 .options(table="kv", keyspace="notification").load()
>> ...
>> Py4JJavaError: An error occurred while calling o280.load.
>> : java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
>>   at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
>>   at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
>>   ...

Re: Confusing RDD function

2016-03-08 Thread Manoj Awasthi
Spark RDDs are computed lazily, so unless an 'action' is applied that
mandates the computation, there won't be any computation. You can
read more in the Spark docs.
On Mar 9, 2016 7:11 AM, "Hemminger Jeff"  wrote:

>
> I'm currently developing a Spark Streaming application.
>
> I have a function that receives an RDD and an object instance as  a
> parameter, and returns an RDD:
>
> def doTheThing(a: RDD[A], b: B): RDD[C]
>
>
> Within the function, I do some processing within a map of the RDD.
> Like this:
>
>
> def doTheThing(a: RDD[A], b: B): RDD[C] {
>
>   a.combineByKey(...).map(b.function(_))
>
> }
>
>
> I combine the RDD by key, then map the results calling a function of
> instance b, and return the results.
>
> Here is where I ran into trouble.
>
> In a unit test running Spark in memory, I was able to convince myself that
> this worked well.
>
> But in our development environment, the returned RDD results were empty
> and b.function(_) was never executed.
>
> However, when I added an otherwise useless foreach:
>
>
> doTheThing(a: RDD[A], b: B): RDD[C] {
>
>   val results = a.combineByKey(...).map(b.function(_))
>
>   results.foreach( p => p )
>
>   results
>
> }
>
>
> Then it works.
>
> So, basically, adding an extra foreach iteration appears to cause
> b.function(_) to execute and returns results correctly.
>
> I find this confusing. Can anyone shed some light on why this would be?
>
> Thank you,
> Jeff
>
>
>
>


Re: Confusing RDD function

2016-03-08 Thread Jakob Odersky
Hi Jeff,

> But in our development environment, the returned RDD results were empty and 
> b.function(_) was never executed
what do you mean by "the returned RDD results were empty", did you try
running a foreach, collect or any other action on the returned RDD[C]?

Spark provides two kinds of operations on RDDs:
1. transformations, which return a new RDD and are lazy and
2. actions that actually run an RDD and return some kind of result.
In your example above, 'map' is a transformation and thus is not
actually applied until some action (like 'foreach') is called on the
resulting RDD.
You can find more information in the Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.
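
For example, a minimal sketch (assuming nothing beyond an existing SparkContext
named sc; the println is only there to make the evaluation order visible):

  val nums = sc.parallelize(1 to 3)
  val doubled = nums.map { n => println(s"mapping $n"); n * 2 }  // lazy: nothing runs yet
  doubled.count()                             // action: the map function runs here
  doubled.collect().foreach(println)          // another action: prints 2, 4 and 6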

best,
--Jakob

On Tue, Mar 8, 2016 at 5:41 PM, Hemminger Jeff  wrote:
>
> I'm currently developing a Spark Streaming application.
>
> I have a function that receives an RDD and an object instance as  a
> parameter, and returns an RDD:
>
> def doTheThing(a: RDD[A], b: B): RDD[C]
>
>
> Within the function, I do some processing within a map of the RDD.
> Like this:
>
>
> def doTheThing(a: RDD[A], b: B): RDD[C] {
>
>   a.combineByKey(...).map(b.function(_))
>
> }
>
>
> I combine the RDD by key, then map the results calling a function of
> instance b, and return the results.
>
> Here is where I ran into trouble.
>
> In a unit test running Spark in memory, I was able to convince myself that
> this worked well.
>
> But in our development environment, the returned RDD results were empty and
> b.function(_) was never executed.
>
> However, when I added an otherwise useless foreach:
>
>
> doTheThing(a: RDD[A], b: B): RDD[C] {
>
>   val results = a.combineByKey(...).map(b.function(_))
>
>   results.foreach( p => p )
>
>   results
>
> }
>
>
> Then it works.
>
> So, basically, adding an extra foreach iteration appears to cause
> b.function(_) to execute and returns results correctly.
>
> I find this confusing. Can anyone shed some light on why this would be?
>
> Thank you,
> Jeff
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Confusing RDD function

2016-03-08 Thread Hemminger Jeff
Thank you, yes that makes sense.
I was aware of transformations and actions, but did not realize foreach was
an action. I've found the exhaustive list here
http://spark.apache.org/docs/latest/programming-guide.html#actions
and it's clear to me again.

Thank you for your help!

On Wed, Mar 9, 2016 at 11:37 AM, Jakob Odersky  wrote:

> Hi Jeff,
>
> > But in our development environment, the returned RDD results were empty
> and b.function(_) was never executed
> what do you mean by "the returned RDD results were empty", did you try
> running a foreach, collect or any other action on the returned RDD[C]?
>
> Spark provides two kinds of operations on RDDs:
> 1. transformations, which return a new RDD and are lazy and
> 2. actions that actually run an RDD and return some kind of result.
> In your example above, 'map' is a transformation and thus is not
> actually applied until some action (like 'foreach') is called on the
> resulting RDD.
> You can find more information in the Spark Programming Guide
> http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.
>
> best,
> --Jakob
>
> On Tue, Mar 8, 2016 at 5:41 PM, Hemminger Jeff  wrote:
> >
> > I'm currently developing a Spark Streaming application.
> >
> > I have a function that receives an RDD and an object instance as  a
> > parameter, and returns an RDD:
> >
> > def doTheThing(a: RDD[A], b: B): RDD[C]
> >
> >
> > Within the function, I do some processing within a map of the RDD.
> > Like this:
> >
> >
> > def doTheThing(a: RDD[A], b: B): RDD[C] {
> >
> >   a.combineByKey(...).map(b.function(_))
> >
> > }
> >
> >
> > I combine the RDD by key, then map the results calling a function of
> > instance b, and return the results.
> >
> > Here is where I ran into trouble.
> >
> > In a unit test running Spark in memory, I was able to convince myself
> that
> > this worked well.
> >
> > But in our development environment, the returned RDD results were empty
> and
> > b.function(_) was never executed.
> >
> > However, when I added an otherwise useless foreach:
> >
> >
> > doTheThing(a: RDD[A], b: B): RDD[C] {
> >
> >   val results = a.combineByKey(...).map(b.function(_))
> >
> >   results.foreach( p => p )
> >
> >   results
> >
> > }
> >
> >
> > Then it works.
> >
> > So, basically, adding an extra foreach iteration appears to cause
> > b.function(_) to execute and returns results correctly.
> >
> > I find this confusing. Can anyone shed some light on why this would be?
> >
> > Thank you,
> > Jeff
> >
> >
> >
>


Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
Hi All,

When a Spark job is running, one of the Spark Executors on Node A has some
partitions cached. Later, for some other stage, the Scheduler tries to assign a
task to Node A to process a cached partition (PROCESS_LOCAL). But meanwhile
Node A is occupied with some other tasks and is busy. The Scheduler waits for the
spark.locality.wait interval, times out, and tries to find some other node B which
is NODE_LOCAL. The executor on Node B will then fetch the cached partition from
Node A, which adds network IO to that node and also some extra CPU for the I/O.
Eventually, every node has a task that is waiting to fetch some cached partition
from Node A, and so the Spark job / cluster is basically blocked on a single node.

Spark JIRA is created https://issues.apache.org/jira/browse/SPARK-13718

Beginning with Spark 1.2, Spark introduced the External Shuffle Service to let
executors fetch shuffle files from an external service instead of from each
other, which offloads that load from the Spark Executors.

We want to check whether a similar kind of external service is implemented for
transferring cached partitions to other executors.


Thanks, Prabhu Joseph


Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Reynold Xin
You just want to be able to replicate hot cached blocks right?
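
(For reference, the closest existing knob is the replicated storage levels. A
minimal sketch, where hotRdd stands in for whichever RDD holds the hot blocks:

  import org.apache.spark.storage.StorageLevel
  hotRdd.persist(StorageLevel.MEMORY_ONLY_2)   // keep each cached partition on two executors

Note that this replicates every partition of that RDD, not only the hot ones.)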

On Tuesday, March 8, 2016, Prabhu Joseph  wrote:

> Hi All,
>
> When a Spark Job is running, and one of the Spark Executor on Node A
> has some partitions cached. Later for some other stage, Scheduler tries to
> assign a task to Node A to process a cached partition (PROCESS_LOCAL). But
> meanwhile the Node A is occupied with some other
> tasks and got busy. Scheduler waits for spark.locality.wait interval and
> times out and tries to find some other node B which is NODE_LOCAL. The
> executor on Node B will try to get the cached partition from Node A which
> adds network IO to node and also some extra CPU for I/O. Eventually,
> every node will have a task that is waiting to fetch some cached partition
> from node A and so the spark job / cluster is basically blocked on a single
> node.
>
> Spark JIRA is created https://issues.apache.org/jira/browse/SPARK-13718
>
> Beginning from Spark 1.2, Spark introduced External Shuffle Service to
> enable executors fetch shuffle files from an external service instead of
> from each other which will offload the load on Spark Executors.
>
> We want to check whether a similar thing of an External Service is
> implemented for transferring the cached partition to other executors.
>
>
> Thanks, Prabhu Joseph
>
>
>


S3 Zip File Loading Advice

2016-03-08 Thread Benjamin Kim
I am wondering if anyone can help.

Our company stores zipped CSV files in S3, which has been a big headache from 
the start. I was wondering if anyone has created a way to iterate through 
several subdirectories (s3n://events/2016/03/01/00, s3n://2016/03/01/01, etc.) 
in S3 to find the newest files and load them. It would be a big bonus to 
include the unzipping of the file in the process so that the CSV can be loaded 
directly into a dataframe for further processing. I’m pretty sure that the S3 
part of this request is not uncommon. I would think the file being zipped is 
uncommon. If anyone can help, I would truly be grateful for I am new to Scala 
and Spark. This would be a great help in learning.

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Saurabh Bajaj
Hi Andy,

I believe you need to set the host and port settings separately
spark.cassandra.connection.host
spark.cassandra.connection.port
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters

Looking at the logs, it seems your port config is not being set and it's
falling back to default.
Let me know if that helps.

Saurabh Bajaj

On Tue, Mar 8, 2016 at 6:25 PM, Andy Davidson  wrote:

> Hi Ted
>
> I believe by default cassandra listens on 9042
>
> From: Ted Yu 
> Date: Tuesday, March 8, 2016 at 6:11 PM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: pyspark spark-cassandra-connector java.io.IOException:
> Failed to open native connection to Cassandra at {192.168.1.126}:9042
>
> Have you contacted spark-cassandra-connector related mailing list ?
>
> I wonder where the port 9042 came from.
>
> Cheers
>
> On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>>
>> I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python
>> notebook that reads a data frame from Cassandra.
>>
>> *I connect to Cassandra using an ssh tunnel running on port 9043.* CQLSH
>> works, however I can not figure out how to configure my notebook. I have
>> tried various hacks. Any idea what I am doing wrong?
>>
>> : java.io.IOException: Failed to open native connection to Cassandra at 
>> {192.168.1.126}:9042
>>
>>
>>
>>
>> Thanks in advance
>>
>> Andy
>>
>>
>>
>> $ extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
>> --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.11"
>>
>> $ export PYSPARK_PYTHON=python3
>> $ export PYSPARK_DRIVER_PYTHON=python3
>> $ IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs $*
>>
>>
>>
>> In [15]:
>> sqlContext.setConf("spark.cassandra.connection.host", "127.0.0.1:9043")
>> df = sqlContext.read\
>>     .format("org.apache.spark.sql.cassandra")\
>>     .options(table="time_series", keyspace="notification")\
>>     .load()
>>
>> df.printSchema()
>> df.show()
>>
>> ---------------------------------------------------------------------------
>> Py4JJavaError                             Traceback (most recent call last)
>> <ipython-input> in <module>()
>>       1 sqlContext.setConf("spark.cassandra.connection.host","localhost:9043")
>> ----> 2 df = sqlContext.read.format("org.apache.spark.sql.cassandra")
>>                 .options(table="kv", keyspace="notification").load()
>> ...
>> Py4JJavaError: An error occurred while calling o280.load.
>> : java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
>>   at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
>>   at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
>>   ...

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Ted Yu
From cassandra.yaml:

native_transport_port: 9042

FYI

On Tue, Mar 8, 2016 at 9:13 PM, Saurabh Bajaj 
wrote:

> Hi Andy,
>
> I believe you need to set the host and port settings separately
> spark.cassandra.connection.host
> spark.cassandra.connection.port
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters
>
> Looking at the logs, it seems your port config is not being set and it's
> falling back to default.
> Let me know if that helps.
>
> Saurabh Bajaj
>
> On Tue, Mar 8, 2016 at 6:25 PM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>> Hi Ted
>>
>> I believe by default cassandra listens on 9042
>>
>> From: Ted Yu 
>> Date: Tuesday, March 8, 2016 at 6:11 PM
>> To: Andrew Davidson 
>> Cc: "user @spark" 
>> Subject: Re: pyspark spark-cassandra-connector java.io.IOException:
>> Failed to open native connection to Cassandra at {192.168.1.126}:9042
>>
>> Have you contacted spark-cassandra-connector related mailing list ?
>>
>> I wonder where the port 9042 came from.
>>
>> Cheers
>>
>> On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson <
>> a...@santacruzintegration.com> wrote:
>>
>>>
>>> I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python
>>> notebook that reads a data frame from Cassandra.
>>>
>>> *I connect to Cassandra using an ssh tunnel running on port 9043.* CQLSH
>>> works, however I can not figure out how to configure my notebook. I have
>>> tried various hacks. Any idea what I am doing wrong?
>>>
>>> : java.io.IOException: Failed to open native connection to Cassandra at 
>>> {192.168.1.126}:9042
>>>
>>>
>>>
>>>
>>> Thanks in advance
>>>
>>> Andy
>>>
>>>
>>>
>>> $ extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
>>> --packages
>>> datastax:spark-cassandra-connector:1.6.0-M1-s_2.11"
>>>
>>> $ export PYSPARK_PYTHON=python3
>>> $ export PYSPARK_DRIVER_PYTHON=python3
>>> $ IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $extraPkgs $*
>>>
>>>
>>>
>>> In [15]:
>>> sqlContext.setConf("spark.cassandra.connection.host", "127.0.0.1:9043")
>>> df = sqlContext.read\
>>>     .format("org.apache.spark.sql.cassandra")\
>>>     .options(table="time_series", keyspace="notification")\
>>>     .load()
>>>
>>> df.printSchema()
>>> df.show()
>>>
>>> ---------------------------------------------------------------------------
>>> Py4JJavaError                             Traceback (most recent call last)
>>> <ipython-input> in <module>()
>>>       1 sqlContext.setConf("spark.cassandra.connection.host","localhost:9043")
>>> ----> 2 df = sqlContext.read.format("org.apache.spark.sql.cassandra")
>>>                 .options(table="kv", keyspace="notification").load()
>>> ...
>>> Py4JJavaError: An error occurred while calling o280.load.
>>> : java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042
>>>   at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
>>>   at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
>>>   ...

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread Saurabh Bajaj
You can call *foreachRDD*(*func*) on the output from the final stage, then
check the time: if it is the 15th minute of the hour, flush the output to the
DB; otherwise, don't.
Let me know if that approach works.
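
A rough sketch of what I mean (Scala; windowedStream stands for your
reduceByKeyAndWindow output and writeToDb for whatever sink you use; as Ted and
Ayan discuss below, the exact minute check may need a tolerance depending on
your batch interval):

  windowedStream.foreachRDD { (rdd, time) =>
    val cal = java.util.Calendar.getInstance()
    cal.setTimeInMillis(time.milliseconds)
    if (cal.get(java.util.Calendar.MINUTE) == 15) {             // the 15th minute of the hour
      rdd.foreachPartition(partition => writeToDb(partition))   // hypothetical sink
    }
  }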

On Tue, Mar 8, 2016 at 2:10 PM, ayan guha  wrote:

> Yes if it falls within the batch. But if the requirement is flush
> everything till 15th min of the hour, then it should work.
> On 9 Mar 2016 04:01, "Ted Yu"  wrote:
>
>> That may miss the 15th minute of the hour (with non-trivial deviation),
>> right ?
>>
>> On Tue, Mar 8, 2016 at 8:50 AM, ayan guha  wrote:
>>
>>> Why not compare current time in every batch and it meets certain
>>> condition emit the data?
>>> On 9 Mar 2016 00:19, "Abhishek Anand"  wrote:
>>>
 I have a spark streaming job where I am aggregating the data by doing
 reduceByKeyAndWindow with inverse function.

 I am keeping the data in memory for upto 2 hours and In order to output
 the reduced data to an external storage I conditionally need to puke the
 data to DB say at every 15th minute of the each hour.

 How can this be achieved.


 Regards,
 Abhi

>>>
>>


Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
I don't want to just replicate all cached blocks. I am trying to find a way
to solve the issue which I mentioned in the above mail. Having replicas for all
cached blocks will add more cost for customers.



On Wed, Mar 9, 2016 at 9:50 AM, Reynold Xin  wrote:

> You just want to be able to replicate hot cached blocks right?
>
>
> On Tuesday, March 8, 2016, Prabhu Joseph 
> wrote:
>
>> Hi All,
>>
>> When a Spark Job is running, and one of the Spark Executor on Node A
>> has some partitions cached. Later for some other stage, Scheduler tries to
>> assign a task to Node A to process a cached partition (PROCESS_LOCAL). But
>> meanwhile the Node A is occupied with some other
>> tasks and got busy. Scheduler waits for spark.locality.wait interval and
>> times out and tries to find some other node B which is NODE_LOCAL. The
>> executor on Node B will try to get the cached partition from Node A which
>> adds network IO to node and also some extra CPU for I/O. Eventually,
>> every node will have a task that is waiting to fetch some cached
>> partition from node A and so the spark job / cluster is basically blocked
>> on a single node.
>>
>> Spark JIRA is created https://issues.apache.org/jira/browse/SPARK-13718
>>
>> Beginning from Spark 1.2, Spark introduced External Shuffle Service to
>> enable executors fetch shuffle files from an external service instead of
>> from each other which will offload the load on Spark Executors.
>>
>> We want to check whether a similar thing of an External Service is
>> implemented for transferring the cached partition to other executors.
>>
>>
>> Thanks, Prabhu Joseph
>>
>>
>>


reading the parquet file

2016-03-08 Thread Angel Angel
Hello Sir/Madam,

I am writing a Spark application on Spark 1.4.0.

I have one text file with a size of 8 GB. I saved that file in Parquet format:

val df2 = sc.textFile("/root/Desktop/database_200/database_200.txt")
  .map(_.split(","))
  .map(p => Table(p(0), p(1).trim.toInt, p(2).trim.toInt, p(3)))
  .toDF()

df2.write.parquet("hdfs://hadoopm0:8020/tmp/input1/database4.parquet")

After that I did the following operations:

val df1 = sqlContext.read.parquet("hdfs://hadoopm0:8020/tmp/input1/database4.parquet")

var a = 0
var k = df1.filter(df1("Address").equalTo(Array_Ele(0)))

for (a <- 2 until 2720 by 2) {
  var temp = df1.filter(df1("Address").equalTo(Array_Ele(a)))
  var temp1 = temp.select(temp("Address"), temp("Couple_time") - Array_Ele(a + 1),
    temp("WT_ID"), temp("WT_Name"))
  k = k.unionAll(temp1)
}

val WT_ID_Sort = k.groupBy("WT_ID").count().sort(desc("count"))

WT_ID_Sort.show()

After that I am getting the following warning and my task is disconnected
again and again.

[inline screenshot of the warnings omitted]

I need to do many iterative operations on that df1 DataFrame.

So can anyone help me to solve this problem?

Thanks in advance.

Thanks.


RE: How to add a custom jar file to the Spark driver?

2016-03-08 Thread Wang, Daoyuan
Hi Gerhard,

How does EMR set its conf for Spark? I think if you set both SPARK_CLASSPATH and
spark.driver.extraClassPath, Spark would ignore SPARK_CLASSPATH.
I think you can do this by reading the configuration from SparkConf, then adding
your custom settings to the corresponding key, and using the updated SparkConf to
instantiate your SparkContext.

Thanks,
Daoyuan

From: Gerhard Fiedler [mailto:gfied...@algebraixdata.com]
Sent: Wednesday, March 09, 2016 5:41 AM
To: user@spark.apache.org
Subject: How to add a custom jar file to the Spark driver?

We're running Spark 1.6.0 on EMR, in YARN client mode. We run Python code, but 
we want to add a custom jar file to the driver.

When running on a local one-node standalone cluster, we just use 
spark.driver.extraClassPath and everything works:

spark-submit --conf spark.driver.extraClassPath=/path/to/our/custom/jar/*  
our-python-script.py

But on EMR, this value is set to something that is needed to make their 
installation of Spark work. Setting it to point to our custom jar overwrites 
the original setting rather than adding to it and breaks Spark.

Our current workaround is to capture whatever EMR sets spark.driver.extraClassPath
to once, then use that path and add our jar file to it. Of course this breaks when
EMR changes this path in their cluster settings, and we wouldn't necessarily notice
that easily. This is how it looks:

spark-submit --conf 
spark.driver.extraClassPath=/path/to/our/custom/jar/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
  our-python-script.py

We prefer not to do this...

We tried the spark-submit argument --jars, but it didn't seem to do anything. 
Like this:

spark-submit --jars /path/to/our/custom/jar/file.jar  our-python-script.py

We also tried to set CLASSPATH, but it doesn't seem to have any impact:

export CLASSPATH=/path/to/our/custom/jar/*
spark-submit  our-python-script.py

When using SPARK_CLASSPATH, we got warnings that it is deprecated, and the 
messages also seemed to imply that it affects the same configuration that is 
set by spark.driver.extraClassPath.


So, my question is: Is there a clean way to add a custom jar file to a Spark 
configuration?

Thanks,
Gerhard



Re: Saving multiple outputs in the same job

2016-03-08 Thread Jan Štěrba
Hi Andy,

It's nice to see that we are not the only ones with the same issues. So
far we have not gone as far as you have. What we have done is that we
cache whatever dataframes/RDDs are shared for computing the different
outputs. This has brought us quite a speedup, but we still see that
saving some large output blocks all other computation, even though the
save uses only one executor and the rest of the cluster is just waiting.

I was thinking about trying something similar to what you are
describing in 1), but I am sad to see it is riddled with bugs, and to me
it seems like going against Spark in a way.
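
For concreteness, the kind of thing I have in mind (only a rough sketch with
made-up names: the shared parent is cached and materialized once, then the two
saves are submitted from separate threads so Spark can schedule those jobs
concurrently):

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  val shared = buildSharedDataFrame(sqlContext).cache()  // hypothetical builder of the common ancestor
  shared.count()                                         // force the cache before forking the saves

  val saves = Seq(
    Future { shared.filter("kind = 'a'").write.parquet("hdfs:///out/a") },
    Future { shared.filter("kind = 'b'").write.parquet("hdfs:///out/b") }
  )
  saves.foreach(Await.result(_, Duration.Inf))

But as noted, the concurrency bugs (e.g. SPARK-4454) are what make me hesitant.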

Hope someone can help in resolving this.

Cheers.

Jan
--
Jan Sterba
https://twitter.com/honzasterba | http://flickr.com/honzasterba |
http://500px.com/honzasterba


On Wed, Mar 9, 2016 at 2:31 AM, Andy Sloane  wrote:
> We have a somewhat complex pipeline which has multiple output files on HDFS,
> and we'd like the materialization of those outputs to happen concurrently.
>
> Internal to Spark, any "save" call creates a new "job", which runs
> synchronously -- that is, the line of code after your save() executes once
> the job completes, executing the entire dependency DAG to produce it. Same
> with foreach, collect, count, etc.
>
> The files we want to save have overlapping dependencies. For us to create
> multiple outputs concurrently, we have a few options that I can see:
>  - Spawn a thread for each file we want to save, letting Spark run the jobs
> somewhat independently. This has the downside of various concurrency bugs
> (e.g. SPARK-4454, and more recently SPARK-13631) and also causes RDDs up the
> dependency graph to get independently, uselessly recomputed.
>  - Analyze our own dependency graph, materialize (by checkpointing or
> saving) common dependencies, and then executing the two saves in threads.
>  - Implement our own saves to HDFS as side-effects inside mapPartitions
> (which is how save actually works internally anyway, modulo committing logic
> to handle speculative execution), yielding an empty dummy RDD for each thing
> we want to save, and then run foreach or count on the union of all the dummy
> RDDs, which causes Spark to schedule the entire DAG we're interested in.
>
> Currently we are doing a little of #1 and a little of #3, depending on who
> originally wrote the code. #2 is probably closer to what we're supposed to
> be doing, but IMO Spark is already able to produce a good execution plan and
> we shouldn't have to do that.
>
> AFAIK, there's no way to do what I *actually* want in Spark, which is to
> have some control over which saves go into which jobs, and then execute the
> jobs directly. I can envision a new version of the various save functions
> which take an extra job argument, or something, or some way to defer and
> unblock job creation in the spark context.
>
> Ideas?
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: updating the Books section on the Spark documentation page

2016-03-08 Thread Jan Štěrba
You could try creating a pull-request on github.

-Jan
--
Jan Sterba
https://twitter.com/honzasterba | http://flickr.com/honzasterba |
http://500px.com/honzasterba


On Wed, Mar 9, 2016 at 2:45 AM, Mohammed Guller  wrote:
> Hi -
>
>
>
> The Spark documentation page (http://spark.apache.org/documentation.html)
> has links to books covering Spark. What is the process for adding a new book
> to that list?
>
>
>
> Thanks,
>
> Mohammed
>
> Author: Big Data Analytics with Spark
>
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: S3 Zip File Loading Advice

2016-03-08 Thread Hemant Bhanawat
https://issues.apache.org/jira/browse/SPARK-3586 talks about creating a
file dstream which can monitor new files recursively, but this
functionality has not been added yet.

I don't see an easy way out. You will have to create your folders based on
a timeline (it looks like you are already doing that) and run a new job over
the new folders created in each interval. This will have to be automated
using an external script.
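
For the unzipping part, one pattern that should work (a sketch only; zip entries
are not splittable, so each zip file is read by a single task, and the path glob
needs adjusting to whatever folders your script selects):

  import java.util.zip.ZipInputStream
  import scala.io.Source

  val zipped = sc.binaryFiles("s3n://events/2016/03/01/*/*.zip")

  val csvLines = zipped.flatMap { case (path, stream) =>
    val zis = new ZipInputStream(stream.open())
    Iterator.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap(_ => Source.fromInputStream(zis).getLines())
  }

  // csvLines is an RDD[String] of raw CSV rows; split the rows yourself (or feed
  // them to the spark-csv package) to get a DataFrame for further processing.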

Hemant Bhanawat 
www.snappydata.io

On Wed, Mar 9, 2016 at 10:33 AM, Benjamin Kim  wrote:

> I am wondering if anyone can help.
>
> Our company stores zipped CSV files in S3, which has been a big headache
> from the start. I was wondering if anyone has created a way to iterate
> through several subdirectories (s3n://events/2016/03/01/00,
> s3n://2016/03/01/01, etc.) in S3 to find the newest files and load them. It
> would be a big bonus to include the unzipping of the file in the process so
> that the CSV can be loaded directly into a dataframe for further
> processing. I’m pretty sure that the S3 part of this request is not
> uncommon. I would think the file being zipped is uncommon. If anyone can
> help, I would truly be grateful for I am new to Scala and Spark. This would
> be a great help in learning.
>
> Thanks,
> Ben
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>