Broadcast variables: when should I use them?

2015-01-26 Thread frodo777
Hello.

I have a number of static Arrays and Maps in my Spark Streaming driver
program.
They are simple collections, initialized with integer values and strings
directly in the code. There is no RDD/DStream involvement here.
I do not expect them to contain more than 100 entries, each.
They are used in several subsequent parallel operations.

The question is:
Should I convert them into broadcast variables?

Thanks and regards.
-Bob






R: Broadcast variables: when should I use them?

2015-01-26 Thread Paolo Platter
Hi,

Yes, if they are not big, it's good practice to broadcast them to avoid
serializing them each time they are captured in a closure.
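For instance, a minimal sketch of that pattern, assuming a SparkContext named sc
and a small illustrative lookup Map (the names below are not from the original post):

// broadcast a small, read-only collection once instead of capturing it in every closure
val countryCodes: Map[String, Int] = Map("US" -> 1, "DE" -> 2, "ES" -> 3)
val bcCodes = sc.broadcast(countryCodes)
val lines = sc.parallelize(Seq("US", "DE", "US"))
// the closure only captures the lightweight broadcast handle; executors fetch the Map once
val coded = lines.map(code => bcCodes.value.getOrElse(code, -1))
coded.collect().foreach(println)

Executors then reuse the cached value across tasks instead of deserializing a fresh copy
with every task.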

Paolo

Sent from my Windows Phone

From: frodo777 (roberto.vaquer...@bitmonlab.com)
Sent: 26/01/2015 14:34
To: user@spark.apache.org
Subject: Broadcast variables: when should I use them?

Hello.

I have a number of static Arrays and Maps in my Spark Streaming driver
program.
They are simple collections, initialized with integer values and strings
directly in the code. There is no RDD/DStream involvement here.
I do not expect them to contain more than 100 entries, each.
They are used in several subsequent parallel operations.

The question is:
Should I convert them into broadcast variables?

Thanks and regards.
-Bob






spark context not picking up default hadoop filesystem

2015-01-26 Thread jamborta
hi all,

I am trying to create a spark context programmatically, using
org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop
config that is created during the process is not picking up core-site.xml,
so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in
spark-env.sh, and core-site.xml is in the conf folder. The whole thing
works if it is executed through spark shell.

Just wondering where spark is picking up the hadoop config path from?

many thanks,






HW imbalance

2015-01-26 Thread Antony Mayi
Hi,
Is it possible to mix hosts with (significantly) different specs within a 
cluster (without wasting the extra resources)? For example, having 10 nodes with 
36GB RAM/10 CPUs and now trying to add 3 hosts with 128GB/10 CPUs - is there a way to 
utilize the extra memory in Spark executors (as my understanding is that all Spark 
executors must have the same memory)?
thanks, Antony.

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Håkan Jonsson
Thanks. Turns out this is a proxy problem somehow. Sorry to bother you.

/Håkan

On Mon Jan 26 2015 at 11:02:18 AM Franc Carter franc.car...@rozettatech.com
wrote:


 AMIs are specific to an AWS region, so the ami-id of the Spark AMI in
 us-west will be different, if it exists at all. I can't remember where, but I have a
 memory of seeing somewhere that the AMI was only in us-east.

 cheers

 On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson haj...@gmail.com wrote:

 Thanks,

 I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and
 1.2 with the same command:

 ./spark-ec2 -k foo -i bar.pem launch mycluster

 By default this launches in us-east-1. I tried changing the region
 using:

 -r us-west-1 but that had the same result:

 Could not resolve AMI at:
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm

 Looking up
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a
 browser results in the same AMI ID as yours. If I search for ami-7a320f3f
 AMI in AWS, I can't find any such image. I tried searching in all regions I
 could find in the AWS console.

 The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI
 in us-west-1.

 Strange. Not sure what to do.

 /Håkan


 On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke charles.fed...@gmail.com
 wrote:

 I definitely have Spark 1.2 running within EC2 using the spark-ec2
 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later.

 What parameters are you using when you execute spark-ec2?


 I am launching in the us-west-1 region (ami-7a320f3f) which may explain
 things.

 On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote:

 Hi,

 When I try to launch a standalone cluster on EC2 using the scripts in the
 ec2 directory for Spark 1.2, I get the following error:

 Could not resolve AMI at:
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm

 It seems there is not yet any AMI available on EC2. Any ideas when there
 will be one?

 This works without problems for version 1.1. Starting up a cluster using
 these scripts is so simple and straightforward, so I am really missing it
 on
 1.2.

 /Håkan









 --

 Franc Carter | Systems Architect | Rozetta Technology

 franc.car...@rozettatech.com | www.rozettatechnology.com

 Tel: +61 2 8355 2515

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215

 AUSTRALIA




[SQL] Self join with ArrayType columns problems

2015-01-26 Thread Pierre B
Using Spark 1.2.0, we are facing some weird behaviour when performing self
join on a table with some ArrayType field. 
(potential bug ?) 

I have set up a minimal non working example here: 
https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
  
In a nutshell, if the ArrayType column used for the pivot is created
manually in the StructType definition, everything works as expected. 
However, if the ArrayType pivot column is obtained by a sql query (be it by
using a array wrapper, or using a collect_list operator for instance),
then results are completely off. 

Could anyone have a look as this really is a blocking issue. 

Thanks! 

Cheers 

P.






Re: spark context not picking up default hadoop filesystem

2015-01-26 Thread Akhil Das
Ah, I think when running locally you should give the full HDFS URL, like:

val logs = sc.textFile("hdfs://akhldz:9000/sigmoid/logs")
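For completeness, a hedged sketch of the programmatic alternative when core-site.xml is
not being picked up (host and port are copied from the line above; adjust for your cluster):

// point the context's Hadoop configuration at HDFS explicitly
sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://akhldz:9000")  // Hadoop 2.x key; fs.default.name on Hadoop 1.x
val logs2 = sc.textFile("/sigmoid/logs")  // this path now resolves against HDFS instead of the local file system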



Thanks
Best Regards

On Mon, Jan 26, 2015 at 9:36 PM, Tamas Jambor jambo...@gmail.com wrote:

 thanks for the reply. I have tried to add SPARK_CLASSPATH, I got a warning
 that it was deprecated (didn't solve the problem), also tried to run with
 --driver-class-path, which did not work either. I am trying this locally.




 On Mon Jan 26 2015 at 15:04:03 Akhil Das ak...@sigmoidanalytics.com
 wrote:

 You can also trying adding the core-site.xml in the SPARK_CLASSPATH, btw
 are you running the application locally? or in standalone mode?

 Thanks
 Best Regards

 On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote:

 hi all,

 I am trying to create a spark context programmatically, using
 org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the
 hadoop
 config that is created during the process is not picking up
 core-site.xml,
 so it defaults back to the local file-system. I have set HADOOP_CONF_DIR
 in
 spark-env.sh, also core-site.xml in in the conf folder. The whole thing
 works if it is executed through spark shell.

 Just wondering where spark is picking up the hadoop config path from?

 many thanks,







Re: Issues when combining Spark and a third party java library

2015-01-26 Thread Akhil Das
It's more like Spark is not able to find the Hadoop jars. Try setting
HADOOP_CONF_DIR and also make sure the *-site.xml files are available on the
CLASSPATH/SPARK_CLASSPATH.
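As a quick, hedged sanity check before digging into the pom, you can probe from the driver
(or a spark-shell) for the exact class the log reports as missing:

try {
  Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo")
  println("Hadoop 2.x mapred classes are on the classpath")
} catch {
  case _: ClassNotFoundException =>
    // the new dependencies have likely pulled in or shadowed an incompatible Hadoop version
    println("InputSplitWithLocationInfo missing - check the hadoop-client/hadoop-mapreduce-client dependencies")
}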

Thanks
Best Regards

On Mon, Jan 26, 2015 at 7:28 PM, Staffan staffan.arvids...@gmail.com
wrote:

 I'm using Maven and Eclipse to build my project. I'm letting Maven download
 all the things I need for running everything, which has worked fine up
 until
 now. I need to use the CDK library (https://github.com/egonw/cdk,
 http://sourceforge.net/projects/cdk/) and once I add the dependencies to
 my
 pom.xml Spark starts to complain (this is without calling any function or
 importing any new library into my code, only by introducing new
 dependencies
 to the pom.xml). Trying to set up a SparkContext give me the following
 errors in the log:

 [main] DEBUG org.apache.spark.rdd.HadoopRDD - SplitLocationInfo and other
 new Hadoop classes are unavailable. Using the older Hadoop location info
 code.
 java.lang.ClassNotFoundException:
 org.apache.hadoop.mapred.InputSplitWithLocationInfo
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:191)
 at

 org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.init(HadoopRDD.scala:381)
 at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:391)
 at org.apache.spark.rdd.HadoopRDD$.init(HadoopRDD.scala:390)
 at org.apache.spark.rdd.HadoopRDD$.clinit(HadoopRDD.scala)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:159)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
 at org.apache.spark.rdd.RDD.foreach(RDD.scala:765)

 later in the log:
 [Executor task launch worker-0] DEBUG
 org.apache.spark.deploy.SparkHadoopUtil - Couldn't find method for
 retrieving thread-level FileSystem input data
 java.lang.NoSuchMethodException:
 org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()
 at java.lang.Class.getDeclaredMethod(Class.java:2009)
 at org.apache.spark.util.Utils$.invoke(Utils.scala:1733)
 at

 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
 at

 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at

 org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
 at

 org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:138)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 There has also been issues related to HADOOP_HOME not being set etc., but
 which seems to be intermittent and 

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Michael Armbrust
You can create a partitioned hive table using Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
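A hedged sketch of that approach with the Spark 1.2-era API, assuming an existing
SparkContext sc; the table, columns and path follow the layout described in the question:

import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
hiveCtx.sql("CREATE TABLE IF NOT EXISTS events (id INT, payload STRING) " +
  "PARTITIONED BY (y INT, m INT, d INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
hiveCtx.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (y=2015, m=1, d=25) " +
  "LOCATION 's3n://blah/y=2015/m=01/d=25'")
// the WHERE clause on partition columns prunes the scan to the matching directories
hiveCtx.sql("SELECT COUNT(*) FROM events WHERE y = 2015 AND m = 1").collect()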

On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates da...@codeaholics.org wrote:

 Hi,

 I've got a bunch of data stored in S3 under directories like this:

 s3n://blah/y=2015/m=01/d=25/lots-of-files.csv

 In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that
 it only scans the necessary directories for files to read.

 As far as I can tell from searching and reading the docs, the right way of
 loading this data into Spark is to use sc.textFile("s3n://blah/*/*/*/")

 1) Is there any way in Spark to access y, m and d as fields? In Hive, you
 declare them in the schema, but you don't put them in the CSV files - their
 values are extracted from the path.
 2) Is there any way to get Spark to use the y, m and d fields to minimise
 the files it transfers from S3?

 Thanks,

 Danny.



Re: spark context not picking up default hadoop filesystem

2015-01-26 Thread Tamas Jambor
thanks for the reply. I have tried to add SPARK_CLASSPATH, I got a warning
that it was deprecated (didn't solve the problem), also tried to run with
--driver-class-path, which did not work either. I am trying this locally.



On Mon Jan 26 2015 at 15:04:03 Akhil Das ak...@sigmoidanalytics.com wrote:

 You can also trying adding the core-site.xml in the SPARK_CLASSPATH, btw
 are you running the application locally? or in standalone mode?

 Thanks
 Best Regards

 On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote:

 hi all,

 I am trying to create a spark context programmatically, using
 org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the
 hadoop
 config that is created during the process is not picking up core-site.xml,
 so it defaults back to the local file-system. I have set HADOOP_CONF_DIR
 in
 spark-env.sh, also core-site.xml in in the conf folder. The whole thing
 works if it is executed through spark shell.

 Just wondering where spark is picking up the hadoop config path from?

 many thanks,







Re: Lost task - connection closed

2015-01-26 Thread octavian.ganea
Here is the first error I get at the executors: 

15/01/26 17:27:04 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception
in thread Thread[handle-message-executor-16,5,main]
java.lang.StackOverflowError
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at
java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1840)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)

If you have any pointers for me on how to debug this, that would be very
useful. I tried running with both spark 1.2.0 and 1.1.1, getting the same
error.







Re: spark context not picking up default hadoop filesystem

2015-01-26 Thread Akhil Das
You can also try adding the core-site.xml to the SPARK_CLASSPATH. Btw,
are you running the application locally or in standalone mode?

Thanks
Best Regards

On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote:

 hi all,

 I am trying to create a spark context programmatically, using
 org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the
 hadoop
 config that is created during the process is not picking up core-site.xml,
 so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in
 spark-env.sh, also core-site.xml in in the conf folder. The whole thing
 works if it is executed through spark shell.

 Just wondering where spark is picking up the hadoop config path from?

 many thanks,







Re: HW imbalance

2015-01-26 Thread Antony Mayi
Should have said I am running as yarn-client. All I can see is specifying the
generic executor memory that is then to be used in all containers.

On Monday, 26 January 2015, 16:48, Charles Feduke charles.fed...@gmail.com wrote:

 You should look at using Mesos. This should abstract away the individual hosts 
into a pool of resources and make the different physical specifications 
manageable.

I haven't tried configuring Spark Standalone mode to have different specs on 
different machines but based on spark-env.sh.template:
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y)

it looks like you should be able to mix. (It's not clear to me whether 
SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where the 
config file resides.)

On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid 
wrote:

Hi,
is it possible to mix hosts with (significantly) different specs within a 
cluster (without wasting the extra resources)? for example having 10 nodes with 
36GB RAM/10CPUs now trying to add 3 hosts with 128GB/10CPUs - is there a way to 
utilize the extra memory by spark executors (as my understanding is all spark 
executors must have same memory).
thanks,Antony.


 


Re: cannot run spark-shell interactively against cluster from remote host - confusing memory warnings

2015-01-26 Thread Akhil Das
When you say remote cluster, you need to make sure of a few things:

- No firewall/network rule is blocking any connection (simply ping from the
local machine to the remote IP and vice versa).
- Make sure all ports (unless you specify them manually) are open.
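If those check out, a hedged sketch of the driver-side settings that often matter in this
situation (the addresses are placeholders; spark.driver.host must be an address of the
submitting machine that the workers can route back to):

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // placeholder master URL
  .setAppName("remote-connectivity-test")
  .set("spark.driver.host", "192.0.2.10")  // placeholder: the submitting machine's routable address
val sc = new SparkContext(conf)
sc.parallelize(1 to 1000).count()          // hangs with the resource warning if executors cannot reach the driver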

You can also refer this discussion,
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html

Hope it helps.

Thanks
Best Regards

On Sun, Jan 25, 2015 at 2:40 AM, Joseph Lust jl...@mc10inc.com wrote:

  I've set up a Spark cluster in the last few weeks and everything is
 working, but *I cannot run spark-shell interactively against the cluster
 from a remote host*

- Deploy .jar to cluster from remote (laptop) spark-submit and have it
run – Check
- Run .jar on spark-shell locally – Check
- Run same .jar on spark-shell on master server – Check
- Run spark-shell interactively against cluster on master server –
Check
- Run spark-shell interactively from remote (laptop) against cluster –
*FAIL*

  It seems other people have faced this same issue:

 http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-working-local-but-not-remote-td19727.html

  I’m getting the same warnings about memory, despite plenty of memory
 being available for the job to run (see above working cases)

  “WARN TaskSchedulerImpl: Initial job has not accepted any resources;
 check your cluster UI to ensure that workers are registered and have
 sufficient memory”

  Some have suggested it has to do with conflicts of Jars on the class
 path and that Spark is providing spurious memory error messages while the
 problem is really class path conflicts.

 http://apache-spark-user-list.1001560.n3.nabble.com/WARN-ClusterScheduler-Initial-job-has-not-accepted-any-resources-check-your-cluster-UI-to-ensure-thay-td374.html#a396

  Details:

- Cluster: 1 master, 3 workers on 4GB/4 core Ubuntu 14.04 LTS
- Local (aka remote laptop) MacBook Pro 10.10.1
- All running HotSpot Java (build 1.8.0_31-b13 and build 1.8.0_25-b17)
- All running spark-1.2.0-bin-hadoop2.4
- Using Standalone cluster manager


  Cluster UI: (screenshot omitted)

  Even when I clamp down to the most restrictive amounts, 1 core, 1
 executor, 128m (of 3G available), it still says I don’t have the resources:

   Start Console example
 $ spark-shell --executor-memory 128m --total-executor-cores 1
 --driver-cores 1 --master spark://:7077

  15/01/24 15:57:29 INFO SparkILoop: Created spark context..
 Spark context available as sc.

  scala> val rdd = sc.parallelize(1 to 1000);
 rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
 parallelize at <console>:12
 scala> rdd.count

  15/01/24 15:58:20 INFO BlockManagerMaster: Updated info of block
 broadcast_0_piece0
 15/01/24 15:58:20 INFO SparkContext: Created broadcast 0 from broadcast at
 DAGScheduler.scala:838
 15/01/24 15:58:20 INFO DAGScheduler: Submitting 2 missing tasks from Stage
 0 (ParallelCollectionRDD[0] at parallelize at console:12)
 15/01/24 15:58:20 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
 15/01/24 15:58:35 WARN TaskSchedulerImpl: Initial job has not accepted any
 resources; check your cluster UI to ensure that workers are registered and
 have sufficient memory
   End console example

  So, can anyone tell me if remote interactive spark-shell on a Standalone
 cluster even works? Thanks for your help.

  Cluster UI below showing job is running on cluster, is using a driver
 app and worker, and that there are plenty of cores and GB of memory free.


  Sincerely,
 Joe Lust



Error when cache partitioned Parquet table

2015-01-26 Thread ZHENG, Xu-dong
Hi all,

I meet below error when I cache a partitioned Parquet table. It seems that,
Spark is trying to extract the partitioned key in the Parquet file, so it
is not found. But other query could run successfully, even request the
partitioned key. Is it a bug in SparkSQL? Is there any workaround for it?
Thank you!

java.util.NoSuchElementException: key not found: querydate
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at 
org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$3.apply(ParquetTableOperations.scala:142)
at 
org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$3.apply(ParquetTableOperations.scala:142)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:142)
at 
org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:127)
at 
org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

-- 
郑旭东
ZHENG, Xu-dong


Re: HW imbalance

2015-01-26 Thread Charles Feduke
You should look at using Mesos. This should abstract away the individual
hosts into a pool of resources and make the different physical
specifications manageable.

I haven't tried configuring Spark Standalone mode to have different specs
on different machines but based on spark-env.sh.template:

# - SPARK_WORKER_CORES, to set the number of cores to use on this machine

# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give
executors (e.g. 1000m, 2g)

# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g.
-Dx=y)

it looks like you should be able to mix. (It's not clear to me whether
SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where
the config file resides.)

On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid
wrote:

 Hi,

 is it possible to mix hosts with (significantly) different specs within a
 cluster (without wasting the extra resources)? for example having 10 nodes
 with 36GB RAM/10CPUs now trying to add 3 hosts with 128GB/10CPUs - is there
 a way to utilize the extra memory by spark executors (as my understanding
 is all spark executors must have same memory).

 thanks,
 Antony.



Re: [SQL] Self join with ArrayType columns problems

2015-01-26 Thread Michael Armbrust
It seems likely that there is some sort of bug related to the reuse of
array objects that are returned by UDFs.  Can you open a JIRA?

I'll also note that the sql method on HiveContext does run HiveQL
(configured by spark.sql.dialect) and the hql method has been deprecated
since 1.1 (and will probably be removed in 1.3).  The errors are probably
because array and collect_set are Hive UDFs and thus not available in a
SQLContext.
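For illustration, a hedged sketch of that difference, assuming a Hive table named src is
already available (the table name is only an example):

import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
// resolves: collect_set is a Hive UDAF, available through the HiveContext
hiveCtx.sql("SELECT key, collect_set(value) FROM src GROUP BY key").take(5)
// the same statement through a plain org.apache.spark.sql.SQLContext would fail to resolve collect_set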

On Mon, Jan 26, 2015 at 5:44 AM, Dean Wampler deanwamp...@gmail.com wrote:

 You are creating a HiveContext, then using the sql method instead of hql.
 Is that deliberate?

 The code doesn't work if you replace HiveContext with SQLContext. Lots of
 exceptions are thrown, but I don't have time to investigate now.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Mon, Jan 26, 2015 at 7:17 AM, Pierre B 
 pierre.borckm...@realimpactanalytics.com wrote:

 Using Spark 1.2.0, we are facing some weird behaviour when performing self
 join on a table with some ArrayType field.
 (potential bug ?)

 I have set up a minimal non working example here:
 https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
 https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
 
 In a nutshell, if the ArrayType column used for the pivot is created
 manually in the StructType definition, everything works as expected.
 However, if the ArrayType pivot column is obtained by a sql query (be it
 by
 using a array wrapper, or using a collect_list operator for instance),
 then results are completely off.

 Could anyone have a look as this really is a blocking issue.

 Thanks!

 Cheers

 P.








Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Hi,

I've got a bunch of data stored in S3 under directories like this:

s3n://blah/y=2015/m=01/d=25/lots-of-files.csv

In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that
it only scans the necessary directories for files to read.

As far as I can tell from searching and reading the docs, the right way of
loading this data into Spark is to use sc.textFile("s3n://blah/*/*/*/")

1) Is there any way in Spark to access y, m and d as fields? In Hive, you
declare them in the schema, but you don't put them in the CSV files - their
values are extracted from the path.
2) Is there any way to get Spark to use the y, m and d fields to minimise
the files it transfers from S3?

Thanks,

Danny.


Re: [SQL] Self join with ArrayType columns problems

2015-01-26 Thread Dean Wampler
You are creating a HiveContext, then using the sql method instead of hql.
Is that deliberate?

The code doesn't work if you replace HiveContext with SQLContext. Lots of
exceptions are thrown, but I don't have time to investigate now.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Mon, Jan 26, 2015 at 7:17 AM, Pierre B 
pierre.borckm...@realimpactanalytics.com wrote:

 Using Spark 1.2.0, we are facing some weird behaviour when performing self
 join on a table with some ArrayType field.
 (potential bug ?)

 I have set up a minimal non working example here:
 https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
 https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
 
 In a nutshell, if the ArrayType column used for the pivot is created
 manually in the StructType definition, everything works as expected.
 However, if the ArrayType pivot column is obtained by a sql query (be it by
 using a array wrapper, or using a collect_list operator for instance),
 then results are completely off.

 Could anyone have a look as this really is a blocking issue.

 Thanks!

 Cheers

 P.







Issues when combining Spark and a third party java library

2015-01-26 Thread Staffan
I'm using Maven and Eclipse to build my project. I'm letting Maven download
all the things I need for running everything, which has worked fine up until
now. I need to use the CDK library (https://github.com/egonw/cdk,
http://sourceforge.net/projects/cdk/) and once I add the dependencies to my
pom.xml Spark starts to complain (this is without calling any function or
importing any new library into my code, only by introducing new dependencies
to the pom.xml). Trying to set up a SparkContext gives me the following
errors in the log: 

[main] DEBUG org.apache.spark.rdd.HadoopRDD - SplitLocationInfo and other
new Hadoop classes are unavailable. Using the older Hadoop location info
code.
java.lang.ClassNotFoundException:
org.apache.hadoop.mapred.InputSplitWithLocationInfo
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at
org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.init(HadoopRDD.scala:381)
at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:391)
at org.apache.spark.rdd.HadoopRDD$.init(HadoopRDD.scala:390)
at org.apache.spark.rdd.HadoopRDD$.clinit(HadoopRDD.scala)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:159)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:765)

later in the log:
[Executor task launch worker-0] DEBUG
org.apache.spark.deploy.SparkHadoopUtil - Couldn't find method for
retrieving thread-level FileSystem input data
java.lang.NoSuchMethodException:
org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()
at java.lang.Class.getDeclaredMethod(Class.java:2009)
at org.apache.spark.util.Utils$.invoke(Utils.scala:1733)
at
org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
at
org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
at
org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:138)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

There have also been issues related to HADOOP_HOME not being set etc., but
these seem to be intermittent and only occur sometimes.


After testing different versions of both CDK and Spark, I've found out that
the Spark version 0.9.1 and earlier DO NOT have this problem, so there is
something in the newer versions of Spark that do not play well with
others... However, I need the functionality in the later versions of Spark
so this do not solve my problem. Anyone willing 

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Cheng Lian
Currently no if you don't want to use Spark SQL's HiveContext. But we're 
working on adding partitioning support to the external data sources API, 
with which you can create, for example, partitioned Parquet tables 
without using Hive.
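In the meantime, a hedged sketch of one workaround (not from this thread), assuming a
SparkContext sc and the y=/m=/d= layout from the original question - it recovers the
partition values from the paths, but still reads every file:

// wholeTextFiles yields (path, content) pairs, so the partition values can be parsed from the path
val raw = sc.wholeTextFiles("s3n://blah/y=*/m=*/d=*/*.csv")
val dated = raw.flatMap { case (path, content) =>
  val re = ".*/y=(\\d+)/m=(\\d+)/d=(\\d+)/.*".r
  path match {
    case re(y, m, d) => content.split("\n").toSeq.map(line => (y.toInt, m.toInt, d.toInt, line))
    case _           => Seq.empty[(Int, Int, Int, String)]
  }
}
dated.filter { case (y, m, _, _) => y == 2015 && m == 1 }.count()

This addresses the first question (exposing y, m and d as fields) but not the second,
since every object under the prefix is still fetched from S3.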


Cheng

On 1/26/15 8:47 AM, Danny Yates wrote:

Thanks Michael.

I'm not actually using Hive at the moment - in fact, I'm trying to 
avoid it if I can. I'm just wondering whether Spark has anything 
similar I can leverage?


Thanks






Re: SVD in pyspark ?

2015-01-26 Thread Joseph Bradley
Hi Andreas,

There unfortunately is not a Python API yet for distributed matrices or
their operations.  Here's the JIRA to follow to stay up-to-date on it:
https://issues.apache.org/jira/browse/SPARK-3956

There are internal wrappers (used to create the Python API), but they are
not really public APIs.  The bigger challenge is creating/storing the
distributed matrix in Python.
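For reference, a hedged sketch of the Scala API that a Python wrapper would eventually
need to expose (the tiny matrix is purely illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = true)  // k = 2 leading singular values
println(svd.s)                                // singular values; svd.U and svd.V hold the factors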

Joseph

On Sun, Jan 25, 2015 at 11:32 AM, Chip Senkbeil chip.senkb...@gmail.com
wrote:

 Hi Andreas,

 With regard to the notebook interface,  you can use the Spark Kernel (
 https://github.com/ibm-et/spark-kernel) as the backend for an IPython 3.0
 notebook. The kernel is designed to be the foundation for interactive
 applications connecting to Apache Spark and uses the IPython 5.0 message
 protocol - used by IPython 3.0 - to communicate.

 See the getting started section here:
 https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel

 It discusses getting IPython connected to a Spark Kernel. If you have any
 more questions, feel free to ask!

 Signed,
 Chip Senkbeil
 IBM Emerging Technologies Software Engineer

 On Sun Jan 25 2015 at 1:12:32 PM Andreas Rhode m.a.rh...@gmail.com
 wrote:

 Is the distributed SVD functionality exposed to Python yet?

 Seems it's only available to scala or java, unless I am missing something,
 looking for a pyspark equivalent to
 org.apache.spark.mllib.linalg.SingularValueDecomposition

 In case it's not there yet, is there a way to make a wrapper to call from
 python into the corresponding java/scala code? The reason for using python
 instead of just directly  scala is that I like to take advantage of the
 notebook interface for visualization.

 As an aside, is there an IPython-notebook-like interface for the Scala-based REPL?

 Thanks

 Andreas







Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Chris Gore
Good to hear there will be partitioning support.  I've had some success loading 
partitioned data specified with Unix globbing format, i.e.:

sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00")

would load dates 2014-11-24 through 2014-11-30.  Not the most ideal solution, 
but it seems to work for loading data from a range.
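Another hedged variant of the same idea: build the day paths explicitly and hand
sc.textFile a comma-separated list, which it accepts (bucket and layout are illustrative):

val days = (24 to 30).map(d => f"s3n://bucket/directory/dt=2014-11-$d%02dT00-00-00")
val rdd = sc.textFile(days.mkString(","))  // comma-separated paths are read as one RDD
rdd.count()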

Best,
Chris

 On Jan 26, 2015, at 10:55 AM, Cheng Lian lian.cs@gmail.com wrote:
 
 Currently no if you don't want to use Spark SQL's HiveContext. But we're 
 working on adding partitioning support to the external data sources API, with 
 which you can create, for example, partitioned Parquet tables without using 
 Hive.
 
 Cheng
 
 On 1/26/15 8:47 AM, Danny Yates wrote:
 Thanks Michael.
 
 I'm not actually using Hive at the moment - in fact, I'm trying to avoid it 
 if I can. I'm just wondering whether Spark has anything similar I can 
 leverage?
 
 Thanks
 
 



Spark (Streaming?) holding on to Mesos resources

2015-01-26 Thread Gerard Maas
Hi,

We are observing with some regularity that our Spark jobs, running as Mesos
frameworks, are hoarding resources and not releasing them, resulting in
resource starvation for all jobs running on the Mesos cluster.

For example:
This is a job that has spark.cores.max = 4 and spark.executor.memory=3g

ID                    Framework     Host                   CPUs  Mem
…5050-16506-1146497   FooStreaming  dnode-4.hdfs.private   7     13.4 GB
…5050-16506-1146495   FooStreaming  dnode-0.hdfs.private   1     6.4 GB
…5050-16506-1146491   FooStreaming  dnode-5.hdfs.private   7     11.9 GB
…5050-16506-1146449   FooStreaming  dnode-3.hdfs.private   7     4.9 GB
…5050-16506-1146247   FooStreaming  dnode-1.hdfs.private   0.5   5.9 GB
…5050-16506-1146226   FooStreaming  dnode-2.hdfs.private   3     7.9 GB
…5050-16506-1144069   FooStreaming  dnode-3.hdfs.private   1     8.7 GB
…5050-16506-1133091   FooStreaming  dnode-5.hdfs.private   1     1.7 GB
…5050-16506-1133090   FooStreaming  dnode-2.hdfs.private   5     5.2 GB
…5050-16506-1133089   FooStreaming  dnode-1.hdfs.private   6.5   6.3 GB
…5050-16506-1133088   FooStreaming  dnode-4.hdfs.private   1     251 MB
…5050-16506-1133087   FooStreaming  dnode-0.hdfs.private   6.4   6.8 GB

The only way to release the resources is by manually finding the process in
the cluster and killing it. The jobs are often streaming but also batch
jobs show this behavior. We have more streaming jobs than batch, so stats
are biased.
Any ideas of what's up here? Hopefully some very bad ugly bug that has been
fixed already and that will urge us to upgrade our infra?

Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0

-kr, Gerard.


Re: HW imbalance

2015-01-26 Thread Sandy Ryza
Hi Antony,

Unfortunately, all executors for any single Spark application must have the
same amount of memory.  It's possibly to configure YARN with different
amounts of memory for each host (using
yarn.nodemanager.resource.memory-mb), so other apps might be able to take
advantage of the extra memory.

-Sandy

On Mon, Jan 26, 2015 at 8:34 AM, Michael Segel msegel_had...@hotmail.com
wrote:

  If you're running YARN, then you should be able to mix and match where YARN
 is managing the resources available on the node.

 Having said that… it depends on which version of Hadoop/YARN.

 If you’re running Hortonworks and Ambari, then setting up multiple
 profiles may not be straight forward. (I haven’t seen the latest version of
 Ambari)

 So in theory, one profile would be for your smaller 36GB of ram, then one
 profile for your 128GB sized machines.
 Then as your request resources for your spark job, it should schedule the
 jobs based on the cluster’s available resources.
 (At least in theory.  I haven’t tried this so YMMV)

 HTH

 -Mike

 On Jan 26, 2015, at 4:25 PM, Antony Mayi antonym...@yahoo.com.INVALID
 wrote:

 should have said I am running as yarn-client. all I can see is specifying
 the generic executor memory that is then to be used in all containers.


   On Monday, 26 January 2015, 16:48, Charles Feduke 
 charles.fed...@gmail.com wrote:



 You should look at using Mesos. This should abstract away the individual
 hosts into a pool of resources and make the different physical
 specifications manageable.

 I haven't tried configuring Spark Standalone mode to have different specs
 on different machines but based on spark-env.sh.template:

 # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
 # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give
 executors (e.g. 1000m, 2g)
 # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g.
 -Dx=y)
 it looks like you should be able to mix. (Its not clear to me whether
 SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where
 the config file resides.)

 On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid
 wrote:

 Hi,

 is it possible to mix hosts with (significantly) different specs within a
 cluster (without wasting the extra resources)? for example having 10 nodes
 with 36GB RAM/10CPUs now trying to add 3 hosts with 128GB/10CPUs - is there
 a way to utilize the extra memory by spark executors (as my understanding
 is all spark executors must have same memory).

 thanks,
 Antony.







Re: Spark webUI - application details page

2015-01-26 Thread spark08011
Where is the history server running? Is it running on the same node as the
logs directory?








Re: Error when cache partitioned Parquet table

2015-01-26 Thread Sadhan Sood
Hi Xu-dong,

That's probably because your table's partition paths don't look like
hdfs://somepath/key=value/*.parquet. Spark is trying to extract the
partition key's value from the path while caching and hence the exception
is being thrown since it can't find one.

On Mon, Jan 26, 2015 at 10:45 AM, ZHENG, Xu-dong dong...@gmail.com wrote:

 Hi all,

 I meet below error when I cache a partitioned Parquet table. It seems
 that, Spark is trying to extract the partitioned key in the Parquet file,
 so it is not found. But other query could run successfully, even request
 the partitioned key. Is it a bug in SparkSQL? Is there any workaround for
 it? Thank you!

 java.util.NoSuchElementException: key not found: querydate
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at scala.collection.AbstractMap.apply(Map.scala:58)
   at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$3.apply(ParquetTableOperations.scala:142)
   at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$3.apply(ParquetTableOperations.scala:142)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:142)
   at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:127)
   at 
 org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:724)

 --
 郑旭东
 ZHENG, Xu-dong




Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Thanks Michael.

I'm not actually using Hive at the moment - in fact, I'm trying to avoid it
if I can. I'm just wondering whether Spark has anything similar I can
leverage?

Thanks


Re: Lost task - connection closed

2015-01-26 Thread Aaron Davidson
It looks like something weird is going on with your object serialization,
perhaps a funny form of self-reference which is not detected by
ObjectOutputStream's typical loop avoidance. That, or you have some data
structure like a linked list with a parent pointer and many thousands of
elements.

Assuming the stack trace is coming from an executor, it is probably a
problem with the objects you're sending back as results, so I would
carefully examine these and maybe try serializing some using
ObjectOutputStream manually.

If your program looks like
foo.map { row => doComplexOperation(row) }.take(10)

you can also try changing it to
foo.map { row => doComplexOperation(row); 1 }.take(10)

to avoid serializing the result of that complex operation, which should
help narrow down where exactly the problematic objects are coming from.
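A hedged sketch of that manual check; the doComplexOperation stub below just stands in for
whatever your real map function returns:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
def doComplexOperation(row: String): AnyRef = row.reverse  // stub - replace with the real operation
val result = doComplexOperation("sample row")
val bytes = new ByteArrayOutputStream()
val out = new ObjectOutputStream(bytes)
out.writeObject(result)  // a StackOverflowError here reproduces the executor failure locally
out.close()
println(s"serialized size: ${bytes.size()} bytes")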

On Mon, Jan 26, 2015 at 8:31 AM, octavian.ganea octavian.ga...@inf.ethz.ch
wrote:

 Here is the first error I get at the executors:

 15/01/26 17:27:04 ERROR ExecutorUncaughtExceptionHandler: Uncaught
 exception
 in thread Thread[handle-message-executor-16,5,main]
 java.lang.StackOverflowError
 at

 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
 at

 java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1840)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)

 If you have any pointers for me on how to debug this, that would be very
 useful. I tried running with both spark 1.2.0 and 1.1.1, getting the same
 error.




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Lost-task-connection-closed-tp21361p21371.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 

large data set to get rid of exceeds Integer.MAX_VALUE error

2015-01-26 Thread freedafeng
Hi,

This seems to be a known issue (see here:
http://apache-spark-user-list.1001560.n3.nabble.com/ALS-failure-with-size-gt-Integer-MAX-VALUE-td19982.html)

The data set is about 1.5 TB. There are 14 region servers. I am not
sure how many regions there are for this data set, but very likely each
region will have much more than 2 GB of data. In this case, repartitioning also
seems like a very expensive action (I would guess), if it is possible in my
cluster at all.
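
For reference, the workaround usually suggested for this limit is to raise the
partition count so that no single partition, and hence no single shuffle block,
approaches 2 GB. A rough sketch in the Scala API (the thread's job is PySpark;
the target of roughly 200 MB per partition is an assumption, not a measured
value):

    import org.apache.spark.rdd.RDD

    // Spread the data over enough partitions that no block nears the 2 GB cap.
    def keepBlocksSmall[T](input: RDD[T], totalBytes: Long): RDD[T] = {
      val targetBytesPerPartition = 200L * 1024 * 1024
      val numPartitions = math.max(1L, totalBytes / targetBytesPerPartition).toInt
      input.repartition(numPartitions)  // expensive: a full shuffle of the data set
    }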

Could any one give some suggestions to make this job done? Thanks!

platform: spark 1.2.0, cdh5.3.0. 

The error is like,
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0
(TID 34, node007): java.lang.RuntimeException:
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
at
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307)
at
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
at
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
at
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124)
at
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)

at
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:156)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93)
at
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at

Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
(looks like the list didn't like a HTML table on the previous email. My
excuses for any duplicates)

Hi,

We are observing with certain regularity that our Spark  jobs, as Mesos
framework, are hoarding resources and not releasing them, resulting in
resource starvation to all jobs running on the Mesos cluster.

For example:
This is a job that has spark.cores.max = 4 and spark.executor.memory=3g

| ID                  | Framework    | Host                 | CPUs | Mem     |
| …5050-16506-1146497 | FooStreaming | dnode-4.hdfs.private | 7    | 13.4 GB |
| …5050-16506-1146495 | FooStreaming | dnode-0.hdfs.private | 1    | 6.4 GB  |
| …5050-16506-1146491 | FooStreaming | dnode-5.hdfs.private | 7    | 11.9 GB |
| …5050-16506-1146449 | FooStreaming | dnode-3.hdfs.private | 7    | 4.9 GB  |
| …5050-16506-1146247 | FooStreaming | dnode-1.hdfs.private | 0.5  | 5.9 GB  |
| …5050-16506-1146226 | FooStreaming | dnode-2.hdfs.private | 3    | 7.9 GB  |
| …5050-16506-1144069 | FooStreaming | dnode-3.hdfs.private | 1    | 8.7 GB  |
| …5050-16506-1133091 | FooStreaming | dnode-5.hdfs.private | 1    | 1.7 GB  |
| …5050-16506-1133090 | FooStreaming | dnode-2.hdfs.private | 5    | 5.2 GB  |
| …5050-16506-1133089 | FooStreaming | dnode-1.hdfs.private | 6.5  | 6.3 GB  |
| …5050-16506-1133088 | FooStreaming | dnode-4.hdfs.private | 1    | 251 MB  |
| …5050-16506-1133087 | FooStreaming | dnode-0.hdfs.private | 6.4  | 6.8 GB  |

The only way to release the resources is by manually finding the process in
the cluster and killing it. The jobs are often streaming but also batch
jobs show this behavior. We have more streaming jobs than batch, so stats
are biased.
Any ideas of what's up here? Hopefully some very bad ugly bug that has been
fixed already and that will urge us to upgrade our infra?

Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0

-kr, Gerard.


Spark and S3 server side encryption

2015-01-26 Thread curtkohler
We are trying to create a Spark job that writes out a file to S3 that
leverages S3's server-side encryption for sensitive data. Typically this is
accomplished by setting the appropriate header on the put request, but it
isn't clear whether this capability is exposed in the Spark/Hadoop APIs.
Does anyone have any suggestions? 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-S3-server-side-encryption-tp21377.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SaveAsTextFile to S3 bucket

2015-01-26 Thread Chen, Kevin
Does anyone know if I can save a RDD as a text file to a pre-created directory 
in S3 bucket?

I have a directory created in S3 bucket: //nexgen-software/dev

When I tried to save a RDD as text file in this directory:
rdd.saveAsTextFile("s3n://nexgen-software/dev/output");


I got following exception at runtime:

Exception in thread main org.apache.hadoop.fs.s3.S3Exception: 
org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - 
ResponseCode=403, ResponseMessage=Forbidden


I have verified that /dev has write permission. However, if I grant the bucket 
//nexgen-software write permission, I don't get the exception. But the output is 
not created under dev. Rather, a different /dev/output directory is created 
directly in the bucket (//nexgen-software). Is this how saveAsTextFile 
behaves in S3? Is there any way I can have the output created under a pre-defined 
directory?


Thanks in advance.





Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-26 Thread Su She
Hello Sean and Akhil,

I shut down the services on Cloudera Manager. I shut them down in the
appropriate order and then stopped all services of CM. I then shut down my
instances. I then turned my instances back on, but I am getting the same
error.

1) I tried hadoop fs -safemode leave and it said -safemode is an unknown
command, but it does recognize hadoop fs

2) I also noticed I can't ping my instances from my personal laptop and I
can't ping google.com from my instances. However, I can still run my Kafka
Zookeeper/server/console producer/consumer. I know this is the spark
thread, but thought that might be relevant.

Thank you for any suggestions!

Best,

Su



On Thu, Jan 22, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote:

 If you are using CDH, you would be shutting down services with
 Cloudera Manager. I believe you can do it manually using Linux
 'services' if you do the steps correctly across your whole cluster.
 I'm not sure if the stock stop-all.sh script is supposed to work.
 Certainly, if you are using CM, by far the easiest is to start/stop
 all of these things in CM.

 On Wed, Jan 21, 2015 at 6:08 PM, Su She suhsheka...@gmail.com wrote:
  Hello Sean  Akhil,
 
  I tried running the stop-all.sh script on my master and I got this
 message:
 
  localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
  chown: changing ownership of
  `/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/logs':
 Operation
  not permitted
  no org.apache.spark.deploy.master.Master to stop
 
  I am running Spark (Yarn) via Cloudera Manager. I tried stopping it from
  Cloudera Manager first, but it looked like it was only stopping the
 history
  server, so I started Spark again and tried ./stop-all.sh and got the
 above
  message.
 
  Also, what is the command for shutting down storage or can I simply stop
  hdfs in Cloudera Manager?
 
  Thank you for the help!
 
 
 
  On Sat, Jan 17, 2015 at 12:58 PM, Su She suhsheka...@gmail.com wrote:
 
  Thanks Akhil and Sean for the responses.
 
  I will try shutting down spark, then storage and then the instances.
  Initially, when hdfs was in safe mode, I waited for 1 hour and the
 problem
  still persisted. I will try this new method.
 
  Thanks!
 
 
 
  On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen so...@cloudera.com wrote:
 
  You would not want to turn off storage underneath Spark. Shut down
  Spark first, then storage, then shut down the instances. Reverse the
  order when restarting.
 
  HDFS will be in safe mode for a short time after being started before
  it becomes writeable. I would first check that it's not just that.
  Otherwise, find out why the cluster went into safe mode from the logs,
  fix it, and then leave safe mode.
 
  On Sat, Jan 17, 2015 at 9:03 AM, Akhil Das ak...@sigmoidanalytics.com
 
  wrote:
   Safest way would be to first shutdown HDFS and then shutdown Spark
   (call
   stop-all.sh would do) and then shutdown the machines.
  
   You can execute the following command to disable safe mode:
  
   hadoop fs -safemode leave
  
  
  
   Thanks
   Best Regards
  
   On Sat, Jan 17, 2015 at 8:31 AM, Su She suhsheka...@gmail.com
 wrote:
  
   Hello Everyone,
  
   I am encountering trouble running Spark applications when I shut
 down
   my
   EC2 instances. Everything else seems to work except Spark. When I
 try
   running a simple Spark application, like sc.parallelize() I get the
   message
   that hdfs name node is in safemode.
  
   Has anyone else had this issue? Is there a proper protocol I should
 be
   following to turn off my spark nodes?
  
   Thank you!
  
  
  
 
 
 



Re: saving rdd to multiple files named by the key

2015-01-26 Thread Aniket Bhatnagar
This might be helpful:
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
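
In short, the approach given there is to subclass MultipleTextOutputFormat so
that the key chooses the output file name. A minimal sketch (the class name and
output path are illustrative):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Name each output file after its key, and keep only the value in the file.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()          // drop the key from the file contents
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString                // use the key instead of part-NNNNN
    }

    // usage, for an RDD[(String, String)] (in Spark 1.2, also
    // import org.apache.spark.SparkContext._ for the pair-RDD functions):
    // rdd.saveAsHadoopFile("/some/output/path", classOf[String], classOf[String],
    //   classOf[KeyBasedOutput])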

On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport sha...@plaid.com wrote:

 Hi,

 I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k].
 I got them by combining many [k,v] by [k]. I could then save to file by
 partitions, but that still doesn't allow me to choose the name, and leaves
 me stuck with foo/part-...

 Any tips?

 Thanks,
 Sharon



Mathematical functions in spark sql

2015-01-26 Thread 1esha
Hello everyone! 

I try execute select 2/3 and I get 0.. Is there any way
to cast double to int or something similar?

Also it will be cool to get list of functions supported by spark sql. 

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mathematical-functions-in-spark-sql-tp21383.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Mathematical functions in spark sql

2015-01-26 Thread Ted Yu
Have you tried floor() or ceil() functions ?

According to http://spark.apache.org/sql/, Spark SQL is compatible with
Hive SQL.

Cheers

On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote:

 Hello everyone!

 I try execute select 2/3 and I get 0.. Is there any way
 to cast double to int or something similar?

 Also it will be cool to get list of functions supported by spark sql.

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mathematical-functions-in-spark-sql-tp21383.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-01-26 Thread maven
All, 

I recently tried to build Spark 1.2 on my enterprise server (which has Hadoop
2.3 with YARN). Here are the steps I followed for the build: 

$ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package 
$ export SPARK_HOME=/path/to/spark/folder 
$ export HADOOP_CONF_DIR=/etc/hadoop/conf 

However, when I try to work with this installation either locally or on
YARN, I get the following error: 

Exception in thread main java.lang.ExceptionInInitializerError 
at
org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784) 
at
org.apache.spark.storage.BlockManager.init(BlockManager.scala:105) 
at
org.apache.spark.storage.BlockManager.init(BlockManager.scala:180) 
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292) 
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159) 
at org.apache.spark.SparkContext.init(SparkContext.scala:232) 
at water.MyDriver$.main(MyDriver.scala:19) 
at water.MyDriver.main(MyDriver.scala) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:360) 
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
Caused by: org.apache.spark.SparkException: Unable to load YARN support 
at
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199)
 
at
org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:194) 
at
org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) 
... 15 more 
Caused by: java.lang.IllegalArgumentException: Invalid rule: L 
RULE:[2:$1@$0](.*@XXXCOMPANY.COM)s/(.*)@XXXCOMPANY.COM/$1/L 
DEFAULT 
at
org.apache.hadoop.security.authentication.util.KerberosName.parseRules(KerberosName.java:321)
 
at
org.apache.hadoop.security.authentication.util.KerberosName.setRules(KerberosName.java:386)
 
at
org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:75)
 
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:247)
 
at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
 
at
org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:43) 
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.init(YarnSparkHadoopUtil.scala:45)
 
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method) 
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 
at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
at java.lang.Class.newInstance(Class.java:374) 
at
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:196)
 
... 17 more 

I noticed that when I unset HADOOP_CONF_DIR, I'm able to work in the local
mode without any errors. I'm able to work with pre-installed Spark 1.0,
locally and on yarn, without any issues. It looks like I may be missing a
configuration step somewhere. Any thoughts on what may be causing this? 

NR



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-java-lang-IllegalArgumentException-Invalid-rule-tp21382.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Nick Pentreath
Your output folder specifies

 rdd.saveAsTextFile("s3n://nexgen-software/dev/output");

So it will try to write to /dev/output which is as expected. If you create
the directory /dev/output upfront in your bucket, and try to save it to
that (empty) directory, what is the behaviour?

On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin kevin.c...@neustar.biz wrote:

  Does anyone know if I can save a RDD as a text file to a pre-created
 directory in S3 bucket?

  I have a directory created in S3 bucket: //nexgen-software/dev

  When I tried to save a RDD as text file in this directory:
 rdd.saveAsTextFile("s3n://nexgen-software/dev/output");


  I got following exception at runtime:

 Exception in thread main org.apache.hadoop.fs.s3.S3Exception:
 org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
 ResponseCode=403, ResponseMessage=Forbidden


  I have verified /dev has write permission. However, if I grant the
 bucket //nexgen-software write permission, I don't get exception. But the
 output is not created under dev. Rather, a different /dev/output directory
 is created directory in the bucket (//nexgen-software). Is this how
 saveAsTextFile behalves in S3? Is there anyway I can have output created
 under a pre-defied directory.


  Thanks in advance.







saving rdd to multiple files named by the key

2015-01-26 Thread Sharon Rapoport
Hi,

I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k].
I got them by combining many [k,v] by [k]. I could then save to file by
partitions, but that still doesn't allow me to choose the name, and leaves
me stuck with foo/part-...

Any tips?

Thanks,
Sharon


Re: spark 1.2 ec2 launch script hang

2015-01-26 Thread Pete Zybrick
Try using an absolute path to the pem file
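
That is, pass the full path to -i rather than a relative one, along the lines of
(key name and path are placeholders):

    ./spark-ec2 -k mykey -i /home/ec2-user/keys/mykey.pem launch mycluster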



 On Jan 26, 2015, at 8:57 PM, ey-chih chow eyc...@hotmail.com wrote:
 
 Hi,
 
 I used the spark-ec2 script of spark 1.2 to launch a cluster.  I have
 modified the script according to 
 
 https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab9753aae939b3bb33be953e2c13a70
 
 But the script was still hung at the following message:
 
 Waiting for cluster to enter 'ssh-ready'
 state.
 
 Any additional thing I should do to make it succeed?  Thanks.
 
 
 Ey-Chih Chow
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-26 Thread Akhil Das
Command would be:

hadoop dfsadmin -safemode leave

If you are not able to ping your instances, it can be because you are
blocking all ICMP requests. I'm not quite sure why you are not able to
ping google.com from your instances. Make sure the internal IP (ifconfig)
is correct in the file /etc/hosts. And regarding Kafka, I'm assuming that
you are running on a single instance and hence are not having any
issues (most likely it binds to localhost in that case).
 On 27 Jan 2015 07:25, Su She suhsheka...@gmail.com wrote:

 Hello Sean and Akhil,

 I shut down the services on Cloudera Manager. I shut them down in the
 appropriate order and then stopped all services of CM. I then shut down my
 instances. I then turned my instances back on, but I am getting the same
 error.

 1) I tried hadoop fs -safemode leave and it said -safemode is an unknown
 command, but it does recognize hadoop fs

 2) I also noticed I can't ping my instances from my personal laptop and I
 can't ping google.com from my instances. However, I can still run my
 Kafka Zookeeper/server/console producer/consumer. I know this is the spark
 thread, but thought that might be relevant.

 Thank you for any suggestions!

 Best,

 Su



 On Thu, Jan 22, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote:

 If you are using CDH, you would be shutting down services with
 Cloudera Manager. I believe you can do it manually using Linux
 'services' if you do the steps correctly across your whole cluster.
 I'm not sure if the stock stop-all.sh script is supposed to work.
 Certainly, if you are using CM, by far the easiest is to start/stop
 all of these things in CM.

 On Wed, Jan 21, 2015 at 6:08 PM, Su She suhsheka...@gmail.com wrote:
  Hello Sean  Akhil,
 
  I tried running the stop-all.sh script on my master and I got this
 message:
 
  localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
  chown: changing ownership of
  `/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/logs':
 Operation
  not permitted
  no org.apache.spark.deploy.master.Master to stop
 
  I am running Spark (Yarn) via Cloudera Manager. I tried stopping it from
  Cloudera Manager first, but it looked like it was only stopping the
 history
  server, so I started Spark again and tried ./stop-all.sh and got the
 above
  message.
 
  Also, what is the command for shutting down storage or can I simply stop
  hdfs in Cloudera Manager?
 
  Thank you for the help!
 
 
 
  On Sat, Jan 17, 2015 at 12:58 PM, Su She suhsheka...@gmail.com wrote:
 
  Thanks Akhil and Sean for the responses.
 
  I will try shutting down spark, then storage and then the instances.
  Initially, when hdfs was in safe mode, I waited for 1 hour and the
 problem
  still persisted. I will try this new method.
 
  Thanks!
 
 
 
  On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen so...@cloudera.com wrote:
 
  You would not want to turn off storage underneath Spark. Shut down
  Spark first, then storage, then shut down the instances. Reverse the
  order when restarting.
 
  HDFS will be in safe mode for a short time after being started before
  it becomes writeable. I would first check that it's not just that.
  Otherwise, find out why the cluster went into safe mode from the logs,
  fix it, and then leave safe mode.
 
  On Sat, Jan 17, 2015 at 9:03 AM, Akhil Das 
 ak...@sigmoidanalytics.com
  wrote:
   Safest way would be to first shutdown HDFS and then shutdown Spark
   (call
   stop-all.sh would do) and then shutdown the machines.
  
   You can execute the following command to disable safe mode:
  
   hadoop fs -safemode leave
  
  
  
   Thanks
   Best Regards
  
   On Sat, Jan 17, 2015 at 8:31 AM, Su She suhsheka...@gmail.com
 wrote:
  
   Hello Everyone,
  
   I am encountering trouble running Spark applications when I shut
 down
   my
   EC2 instances. Everything else seems to work except Spark. When I
 try
   running a simple Spark application, like sc.parallelize() I get the
   message
   that hdfs name node is in safemode.
  
   Has anyone else had this issue? Is there a proper protocol I
 should be
   following to turn off my spark nodes?
  
   Thank you!
  
  
  
 
 
 





Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Chen, Kevin
When Spark saves an RDD to a text file, the output directory must not exist upfront. It 
will create that directory and write the data to part-n files under it. 
In my use case, I created a directory dev in the bucket //nexgen-software. 
I expected it to create output directly under dev, and part-n files under output. But 
it gave me an exception, as I had only given write permission to dev, not to the bucket. If 
I open up write permission on the bucket, it works, but it does not create the output 
directory under dev; rather, it creates another dev/output directory under the 
bucket. I just want to know if it is possible to have the output directory created 
under the dev directory I created upfront.

From: Nick Pentreath nick.pentre...@gmail.commailto:nick.pentre...@gmail.com
Date: Monday, January 26, 2015 9:15 PM
To: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org
Subject: Re: SaveAsTextFile to S3 bucket

Your output folder specifies

rdd.saveAsTextFile("s3n://nexgen-software/dev/output");

So it will try to write to /dev/output which is as expected. If you create the 
directory /dev/output upfront in your bucket, and try to save it to that 
(empty) directory, what is the behaviour?

On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin 
kevin.c...@neustar.bizmailto:kevin.c...@neustar.biz wrote:
Does anyone know if I can save a RDD as a text file to a pre-created directory 
in S3 bucket?

I have a directory created in S3 bucket: //nexgen-software/dev

When I tried to save a RDD as text file in this directory:
rdd.saveAsTextFile("s3n://nexgen-software/dev/output");


I got following exception at runtime:

Exception in thread main org.apache.hadoop.fs.s3.S3Exception: 
org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - 
ResponseCode=403, ResponseMessage=Forbidden


I have verified /dev has write permission. However, if I grant the bucket 
//nexgen-software write permission, I don't get exception. But the output is 
not created under dev. Rather, a different /dev/output directory is created 
directory in the bucket (//nexgen-software). Is this how saveAsTextFile 
behalves in S3? Is there anyway I can have output created under a pre-defied 
directory.


Thanks in advance.






spark sqlContext udaf

2015-01-26 Thread sunwei
Hi, can anyone show me some examples of using a UDAF with the Spark SQLContext? 

Re: Mathematical functions in spark sql

2015-01-26 Thread Alexey Romanchuk
I have tried select ceil(2/3), but got key not found: floor
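
For reference, one alternative that avoids floor/ceil entirely is an explicit
CAST on the expression. A sketch, assuming a SQLContext has been created as
sqlContext (e.g. via new org.apache.spark.sql.SQLContext(sc)); not verified
against the 1.2 parser:

    // truncate the division result to an integer with an explicit cast
    val result = sqlContext.sql("SELECT CAST(2 / 3 AS INT)")
    result.collect().foreach(println)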

On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu yuzhih...@gmail.com wrote:

 Have you tried floor() or ceil() functions ?

 According to http://spark.apache.org/sql/, Spark SQL is compatible with
 Hive SQL.

 Cheers

 On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote:

 Hello everyone!

 I try execute select 2/3 and I get 0.. Is there any
 way
 to cast double to int or something similar?

 Also it will be cool to get list of functions supported by spark sql.

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mathematical-functions-in-spark-sql-tp21383.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





spark 1.2 ec2 launch script hang

2015-01-26 Thread ey-chih chow
Hi,

I used the spark-ec2 script of spark 1.2 to launch a cluster.  I have
modified the script according to 

https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab9753aae939b3bb33be953e2c13a70

But the script was still hung at the following message:

Waiting for cluster to enter 'ssh-ready'
state.

Any additional thing I should do to make it succeed?  Thanks.


Ey-Chih Chow



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Ashish Rangole
By default, the files will be created under the path provided as the
argument to saveAsTextFile. This argument is treated as a folder in the
bucket, and the actual files are created in it with the naming convention
part-n, where n is the output partition number.
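
Concretely, the call and the resulting layout look roughly like this (bucket and
key names follow the earlier messages; the exact file names are illustrative):

    rdd.saveAsTextFile("s3n://nexgen-software/dev/output")
    // writes, for example:
    //   s3n://nexgen-software/dev/output/_SUCCESS
    //   s3n://nexgen-software/dev/output/part-00000
    //   s3n://nexgen-software/dev/output/part-00001
    //   ...
    // i.e. "output" is treated as a folder that must not already exist, and one
    // part-n file is written per output partition.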

On Mon, Jan 26, 2015 at 9:15 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Your output folder specifies

 rdd.saveAsTextFile("s3n://nexgen-software/dev/output");

 So it will try to write to /dev/output which is as expected. If you create
 the directory /dev/output upfront in your bucket, and try to save it to
 that (empty) directory, what is the behaviour?

 On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin kevin.c...@neustar.biz
 wrote:

  Does anyone know if I can save a RDD as a text file to a pre-created
 directory in S3 bucket?

  I have a directory created in S3 bucket: //nexgen-software/dev

  When I tried to save a RDD as text file in this directory:
 rdd.saveAsTextFile("s3n://nexgen-software/dev/output");


  I got following exception at runtime:

 Exception in thread main org.apache.hadoop.fs.s3.S3Exception:
 org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
 ResponseCode=403, ResponseMessage=Forbidden


  I have verified /dev has write permission. However, if I grant the
 bucket //nexgen-software write permission, I don't get exception. But the
 output is not created under dev. Rather, a different /dev/output directory
 is created directory in the bucket (//nexgen-software). Is this how
 saveAsTextFile behalves in S3? Is there anyway I can have output created
 under a pre-defied directory.


  Thanks in advance.








Re: spark 1.2 - Writing parque fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Manoj Samel
Awesome ! That would be great !!

On Mon, Jan 26, 2015 at 3:18 PM, Michael Armbrust mich...@databricks.com
wrote:

 I'm aiming for 1.3.

 On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Thanks Michael. I am sure there have been many requests for this support.

 Any release targeted for this?

 Thanks,

 On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust 
 mich...@databricks.com wrote:

 Those annotations actually don't work because the timestamp in SQL has
 optional nano-second precision.

 However, there is a PR to add support using parquets INT96 type:
 https://github.com/apache/spark/pull/3820

 On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Looking further at the trace and ParquetTypes.scala, it seems there is
 no support for Timestamp and Date in fromPrimitiveDataType(ctype:
 DataType): Option[ParquetTypeInfo]. Since Parquet supports these type
 with some decoration over Int (
 https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md),
 any reason why Date / Timestamp are not supported right now ?

 Thanks,

 Manoj


 On Fri, Jan 23, 2015 at 11:40 AM, Manoj Samel manojsamelt...@gmail.com
  wrote:

 Using Spark 1.2

 Read a CSV file, apply schema to convert to SchemaRDD and then
 schemaRdd.saveAsParquetFile

 If the schema includes Timestamptype, it gives following trace when
 doing the save

 Exception in thread main java.lang.RuntimeException: Unsupported
 datatype TimestampType

 at scala.sys.package$.error(package.scala:27)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(
 ParquetTypes.scala:343)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(
 ParquetTypes.scala:292)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(
 ParquetTypes.scala:291)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(
 ParquetTypes.scala:363)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(
 ParquetTypes.scala:362)

 at scala.collection.TraversableLike$$anonfun$map$1.apply(
 TraversableLike.scala:244)

 at scala.collection.TraversableLike$$anonfun$map$1.apply(
 TraversableLike.scala:244)

 at scala.collection.mutable.ResizableArray$class.foreach(
 ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at scala.collection.TraversableLike$class.map(
 TraversableLike.scala:244)

 at scala.collection.AbstractTraversable.map(Traversable.scala:105)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(
 ParquetTypes.scala:361)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(
 ParquetTypes.scala:407)

 at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(
 ParquetRelation.scala:166)

 at org.apache.spark.sql.parquet.ParquetRelation$.create(
 ParquetRelation.scala:145)

 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(
 SparkStrategies.scala:204)

 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(
 QueryPlanner.scala:58)

 at
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(
 QueryPlanner.scala:58)

 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

 at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(
 QueryPlanner.scala:59)

 at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(
 SQLContext.scala:418)

 at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(
 SQLContext.scala:416)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(
 SQLContext.scala:422)

 at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(
 SQLContext.scala:422)

 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(
 SQLContext.scala:425)

 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(
 SQLContext.scala:425)

 at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(
 SchemaRDDLike.scala:76)

 at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(
 SchemaRDD.scala:108)

 at bdrt.MyTest$.createParquetWithDate(MyTest.scala:88)

 at bdrt.MyTest$delayedInit$body.apply(MyTest.scala:54)

 at scala.Function0$class.apply$mcV$sp(Function0.scala:40)

 at scala.runtime.AbstractFunction0.apply$mcV$sp(
 AbstractFunction0.scala:12)

 at scala.App$$anonfun$main$1.apply(App.scala:71)

 at scala.App$$anonfun$main$1.apply(App.scala:71)

 at scala.collection.immutable.List.foreach(List.scala:318)

 at scala.collection.generic.TraversableForwarder$class.foreach(
 TraversableForwarder.scala:32)

 at scala.App$class.main(App.scala:71)

 at bdrt.MyTest$.main(MyTest.scala:10)









Re: Spark 1.2 – How to change Default (Random) port ….

2015-01-26 Thread Shailesh Birari
Thanks. But after setting spark.shuffle.blockTransferService to nio, the
application fails with an Akka client disassociation.

15/01/27 13:38:11 ERROR TaskSchedulerImpl: Lost executor 3 on
wynchcs218.wyn.cnw.co.nz: remote Akka client disassociated
15/01/27 13:38:11 INFO TaskSetManager: Re-queueing tasks for 3 from TaskSet
0.0
15/01/27 13:38:11 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7,
wynchcs218.wyn.cnw.co.nz): ExecutorLostFailure (executor lost)
15/01/27 13:38:11 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
15/01/27 13:38:11 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6,
wynchcs218.wyn.cnw.co.nz): ExecutorLostFailure (executor lost)
15/01/27 13:38:11 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
15/01/27 13:38:11 INFO TaskSchedulerImpl: Cancelling stage 0
15/01/27 13:38:11 INFO DAGScheduler: Failed to run count at
RowMatrix.scala:71
Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 7, wynchcs218.wyn.cnw.co.nz):
ExecutorLostFailure (executor lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/01/27 13:38:11 INFO DAGScheduler: Executor lost: 3 (epoch 3)
15/01/27 13:38:11 INFO BlockManagerMasterActor: Trying to remove executor 3
from BlockManagerMaster.
15/01/27 13:38:11 INFO BlockManagerMaster: Removed 3 successfully in
removeExecutor



On Mon, Jan 26, 2015 at 6:34 PM, Aaron Davidson ilike...@gmail.com wrote:

 This was a regression caused by Netty Block Transfer Service. The fix for
 this just barely missed the 1.2 release, and you can see the associated
 JIRA here: https://issues.apache.org/jira/browse/SPARK-4837

 Current master has the fix, and the Spark 1.2.1 release will have it
 included. If you don't want to rebuild from master or wait, then you can
 turn it off by setting spark.shuffle.blockTransferService to nio.

 On Sun, Jan 25, 2015 at 6:28 PM, Shailesh Birari sbirar...@gmail.com
 wrote:

 Can anyone please let me know ?
 I don't want to open all ports on the network. So, I am interested in the
 property by which I can configure this new port.

   Shailesh



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-How-to-change-Default-Random-port-tp21306p21360.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Franc Carter
AMIs are specific to an AWS region, so the AMI ID of the Spark AMI in
us-west will be different, if it exists at all. I can't remember where, but I
recall seeing somewhere that the AMI was only available in us-east.

cheers

On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson haj...@gmail.com wrote:

 Thanks,

 I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and
 1.2 with the same command:

 ./spark-ec2 -k foo -i bar.pem launch mycluster

 By default this launches in us-east-1. I tried changing the the region
 using:

 -r us-west-1 but that had the same result:

 Could not resolve AMI at:
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm

 Looking up
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a
 browser results in the same AMI ID as yours. If I search for ami-7a320f3f
 AMI in AWS, I can't find any such image. I tried searching in all regions I
 could find in the AWS console.

 The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in
 us-west-1.

 Strange. Not sure what to do.

 /Håkan


 On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke charles.fed...@gmail.com
 wrote:

 I definitely have Spark 1.2 running within EC2 using the spark-ec2
 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later.

 What parameters are you using when you execute spark-ec2?


 I am launching in the us-west-1 region (ami-7a320f3f) which may explain
 things.

 On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote:

 Hi,

 When I try to launch a standalone cluster on EC2 using the scripts in the
 ec2 directory for Spark 1.2, I get the following error:

 Could not resolve AMI at:
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm

 It seems there is not yet any AMI available on EC2. Any ideas when there
 will be one?

 This works without problems for version 1.1. Starting up a cluster using
 these scripts is so simple and straightforward, so I am really missing it
 on
 1.2.

 /Håkan





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

*Franc Carter* | Systems Architect | Rozetta Technology

franc.car...@rozettatech.com  franc.car...@rozettatech.com|
www.rozettatechnology.com

Tel: +61 2 8355 2515

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215

AUSTRALIA


Re: SparkSQL tasks spend too much time to finish.

2015-01-26 Thread Yi Tian

Hi, San

You need to provide more information to diagnose this problem, for example:

1. What kind of SQL did you execute?
2. If there is a group operation in this SQL, could you gather some
   statistics on how many unique group keys there are in this case? (See
   the sketch below.)
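
Assuming the rows can be keyed by the GROUP BY column, something like the
following sketch (the function and names are placeholders, not taken from the
original query):

    import org.apache.spark.SparkContext._   // needed for reduceByKey in Spark 1.2
    import org.apache.spark.rdd.RDD

    // Count rows per group key; a few very heavy keys are the usual cause of a
    // handful of tasks taking 100X longer than the rest.
    def groupKeyStats(groupKeys: RDD[String]): Unit = {
      val counts = groupKeys.map((_, 1L)).reduceByKey(_ + _)
      println("distinct group keys: " + counts.count())
      counts.sortBy(_._2, ascending = false).take(20).foreach(println)  // heaviest keys
    }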

On 1/26/15 17:01, luohui20...@sina.com wrote:


Hi there,

   When running a SQL query, I found abnormal time cost of tasks,
as in the attached picture: it runs fast in the first few tasks, but
extremely slow in later tasks, which spend 100X more time than the early
ones. However, they are dealing with the same size of data, at 128 MB.


   standalone pseudo-distributed cluster
   executor memory: 40g

   driver memory: 5g

Any advice will be appreciated.



Thanks & Best regards!
罗辉 San.Luo


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: Eclipse on spark

2015-01-26 Thread Luke Wilson-Mawer
I use this: http://scala-ide.org/

I also use Maven with this archetype:
https://github.com/davidB/scala-archetype-simple. To be frank though, you
should be fine using SBT.

On Sat, Jan 24, 2015 at 6:33 PM, riginos samarasrigi...@gmail.com wrote:

 How to compile a Spark project in Scala IDE for Eclipse? I got many scala
 scripts and i no longer want to load them from scala-shell what can i do?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Eclipse-on-spark-tp21350.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: [GraphX] Integration with TinkerPop3/Gremlin

2015-01-26 Thread Nicolas Colson
 TinkerPop has become an Apache Incubator project and seems to have Spark
in mind in their proposal
https://wiki.apache.org/incubator/TinkerPopProposal.
That's good news!
I hope there will be nice collaborations between the communities.


On Wed, Jan 7, 2015 at 11:31 AM, Nicolas Colson nicolas.col...@gmail.com
wrote:

 Saw it.
 It's a great read, thanks a lot!

 On Wed, Jan 7, 2015 at 11:27 AM, Corey Nolet cjno...@gmail.com wrote:

 Looking a little closer at the first link i posted, it appears the bottom
 of the page does contain links to all the replies.

 On Wed, Jan 7, 2015 at 6:16 AM, Corey Nolet cjno...@gmail.com wrote:

 There was a fairly lengthy conversation about this on the dev list and i
 believe a bunch of work has been logged against it as well. [1] contains a
 link to the initial conversation but I'm having trouble tracking down the
 link to the whole discussion. Also see [2] for code.

 [1] https://www.mail-archive.com/dev@spark.apache.org/msg06231.html
 [2] https://github.com/kellrott/spark-gremlin

 On Wed, Jan 7, 2015 at 6:03 AM, Nicolas Colson nicolas.col...@gmail.com
  wrote:

 Hi Spark/GraphX community,

 I'm wondering if you have TinkerPop3/Gremlin on your radar?
 (github https://github.com/tinkerpop/tinkerpop3, doc
 http://www.tinkerpop.com/docs/3.0.0-SNAPSHOT)

 They've done an amazing work refactoring their stack recently and
 Gremlin is a very nice DSL to work with graphs.
 They even have a scala client
 https://github.com/mpollmeier/gremlin-scala.

 So far, they've used Hadoop for MapReduce tasks and I think GraphX
 could nicely dig in.

 Any view?

 Thanks,

 Nicolas







Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Håkan Jonsson
Thanks,

I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and
1.2 with the same command:

./spark-ec2 -k foo -i bar.pem launch mycluster

By default this launches in us-east-1. I tried changing the region
using:

-r us-west-1 but that had the same result:

Could not resolve AMI at:
https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm

Looking up https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm
in a browser results in the same AMI ID as yours. If I search
for ami-7a320f3f AMI in AWS, I can't find any such image. I tried searching
in all regions I could find in the AWS console.

The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in
us-west-1.

Strange. Not sure what to do.

/Håkan

On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke charles.fed...@gmail.com
wrote:

I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts.
I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later.

What parameters are you using when you execute spark-ec2?


I am launching in the us-west-1 region (ami-7a320f3f) which may explain
things.

On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote:

Hi,

When I try to launch a standalone cluster on EC2 using the scripts in the
ec2 directory for Spark 1.2, I get the following error:

Could not resolve AMI at:
https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm

It seems there is not yet any AMI available on EC2. Any ideas when there
will be one?

This works without problems for version 1.1. Starting up a cluster using
these scripts is so simple and straightforward, so I am really missing it on
1.2.

/Håkan





--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


Worker never used by our Spark applications

2015-01-26 Thread Federico Ragona
Hello,
we are running Spark 1.2.0 standalone on a cluster made up of 4 machines, each 
of them running one Worker and one of them also running the Master; they are 
all connected to the same HDFS instance. 

Until a few days ago, they were all configured with 

SPARK_WORKER_MEMORY = 18G

and the jobs running on our cluster were making use of all of them. 
A few days ago though, we added a new machine to the cluster, set up one Worker 
on that machine, and reconfigured the machines as follows:

| machine   | SPARK_WORKER_MEMORY |
| #1        | 16G                 |
| #2        | 18G                 |
| #3        | 24G                 |
| #4        | 18G                 |
| #5 (new)  | 36G                 |

Ever since we introduced this configuration change, our applications running on 
the cluster are not using the Worker running on machine #1 anymore, even though 
it is regularly registered to the cluster. 

I would be very grateful if anybody could explain how Spark chooses which 
workers to use and why that one is not used anymore.

Regards,
Federico Ragona


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RE: Shuffle to HDFS

2015-01-26 Thread bit1...@163.com
I had also thought that the Hadoop mapper output is saved on HDFS, at least 
if the job only has a Mapper and doesn't have a Reducer.
If there is a reducer, is the map output then saved on local disk?




 
From: Shao, Saisai
Date: 2015-01-26 15:23
To: Larry Liu
CC: u...@spark.incubator.apache.org
Subject: RE: Shuffle to HDFS
Hey Larry,
 
I don't think Hadoop will put shuffle output in HDFS; instead, its behavior is 
the same as Spark's: it stores mapper output (shuffle) data on local disks. 
You might have misunderstood something. :)
 
Thanks
Jerry
 
From: Larry Liu [mailto:larryli...@gmail.com] 
Sent: Monday, January 26, 2015 3:03 PM
To: Shao, Saisai
Cc: u...@spark.incubator.apache.org
Subject: Re: Shuffle to HDFS
 
Hi, Jerry
 
Thanks for your reply.
 
The reason I have this question is that in Hadoop, mapper intermediate output 
(shuffle) will be stored in HDFS. I think the default location for Spark is 
/tmp. 
 
Larry
 
On Sun, Jan 25, 2015 at 9:44 PM, Shao, Saisai saisai.s...@intel.com wrote:
Hi Larry,
 
I don’t think current Spark’s shuffle can support HDFS as a shuffle output. 
Anyway, is there any specific reason to spill shuffle data to HDFS or NFS? This 
will severely increase the shuffle time.
 
Thanks
Jerry
 
From: Larry Liu [mailto:larryli...@gmail.com] 
Sent: Sunday, January 25, 2015 4:45 PM
To: u...@spark.incubator.apache.org
Subject: Shuffle to HDFS
 
How to change shuffle output to HDFS or NFS?
 


Re: RE: Shuffle to HDFS

2015-01-26 Thread Sean Owen
If there is no Reducer, there is no shuffle. The Mapper output goes to
HDFS, yes. But the question here is about shuffle files, right? Those
are written by the Mapper to local disk. Reducers load them from the
Mappers over the network then. Shuffle files do not go to HDFS.
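
For completeness: where those local shuffle files land is controlled by
spark.local.dir, which defaults to /tmp. A small configuration sketch with
illustrative paths (note that on a cluster manager the worker environment,
SPARK_LOCAL_DIRS or the YARN local dirs, takes precedence):

    import org.apache.spark.SparkConf

    // Point shuffle/spill files at dedicated local disks instead of /tmp.
    val conf = new SparkConf()
      .setAppName("shuffle-local-dirs-example")
      .set("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")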

On Mon, Jan 26, 2015 at 10:01 AM, bit1...@163.com bit1...@163.com wrote:
 I had also thought that the Hadoop mapper output is saved on HDFS, at
 least if the job only has a Mapper and doesn't have a Reducer.
 If there is a reducer, is the map output then saved on local disk?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Charles Feduke
I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts.
I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later.

What parameters are you using when you execute spark-ec2?

I am launching in the us-west-1 region (ami-7a320f3f) which may explain
things.

On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote:

 Hi,

 When I try to launch a standalone cluster on EC2 using the scripts in the
 ec2 directory for Spark 1.2, I get the following error:

 Could not resolve AMI at:
 https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm

 It seems there is not yet any AMI available on EC2. Any ideas when there
 will be one?

 This works without problems for version 1.1. Starting up a cluster using
 these scripts is so simple and straightforward, so I am really missing it
 on
 1.2.

 /Håkan



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Eclipse on spark

2015-01-26 Thread vaquar khan
I am using SBT
On 26 Jan 2015 15:54, Luke Wilson-Mawer lukewilsonma...@gmail.com wrote:

 I use this: http://scala-ide.org/

 I also use Maven with this archetype:
 https://github.com/davidB/scala-archetype-simple. To be frank though, you
 should be fine using SBT.

 On Sat, Jan 24, 2015 at 6:33 PM, riginos samarasrigi...@gmail.com wrote:

 How to compile a Spark project in Scala IDE for Eclipse? I got many scala
 scripts and i no longer want to load them from scala-shell what can i do?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Eclipse-on-spark-tp21350.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Pairwise Processing of a List

2015-01-26 Thread Sean Owen
AFAIK ordering is not strictly guaranteed unless the RDD is the
product of a sort. I think that in practice, you'll never find
elements of a file read in some random order, for example (although
see the recent issue about partition ordering potentially depending on
how the local file system lists them).

Likewise I can't imagine you encounter elements from one Kafka
partition out of order. One receiver hears one partition and create
one block per block interval. What I'm not 100% clear on is whether
you get undefined ordering when you have multiple threads listening in
one receiver.

You can always sort RDDs by a timestamp of some sort to be sure,
although that has overheads. I'm also curious about what if anything
is guaranteed here without a sort.
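
A minimal sketch of the sort-then-index pattern (record type and field names are
made up; `sc` is an existing SparkContext):

case class Event(ts: Long, payload: String)   // hypothetical record type

val events = sc.parallelize(Seq(Event(3L, "c"), Event(1L, "a"), Event(2L, "b")))

// sortBy establishes an explicit ordering; zipWithIndex then numbers
// the elements in that sorted order.
val indexed = events
  .sortBy(_.ts)
  .zipWithIndex()

indexed.collect().foreach(println)   // (Event(1,a),0), (Event(2,b),1), (Event(3,c),2)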

On Mon, Jan 26, 2015 at 1:33 AM, Tobias Pfeiffer t...@preferred.jp wrote:
 Sean,

 On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen so...@cloudera.com wrote:

 Note that RDDs don't really guarantee anything about ordering though,
 so this only makes sense if you've already sorted some upstream RDD by
 a timestamp or sequence number.


 Speaking of order, is there some reading on guarantees and non-guarantees
 about order in RDDs? For example, when reading a file and doing
 zipWithIndex, can I assume that the lines are numbered in order? Does this
 hold for receiving data from Kafka, too?

 Tobias


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Jörn Franke
Hi,

What do your jobs do? Ideally post source code, but some description would
already be helpful to support you.

Memory leaks can have several reasons - it may not be Spark at all.

Thank you.

On 26 Jan 2015 22:28, Gerard Maas gerard.m...@gmail.com wrote:

 (looks like the list didn't like an HTML table in the previous email. My
apologies for any duplicates)

 Hi,

 We are observing with certain regularity that our Spark jobs, running as
Mesos frameworks, are hoarding resources and not releasing them, resulting in
resource starvation for all jobs running on the Mesos cluster.

 For example:
 This is a job that has spark.cores.max = 4 and spark.executor.memory=3g

 | ID                  | Framework    | Host                 | CPUs | Mem     |
 | …5050-16506-1146497 | FooStreaming | dnode-4.hdfs.private | 7    | 13.4 GB |
 | …5050-16506-1146495 | FooStreaming | dnode-0.hdfs.private | 1    | 6.4 GB  |
 | …5050-16506-1146491 | FooStreaming | dnode-5.hdfs.private | 7    | 11.9 GB |
 | …5050-16506-1146449 | FooStreaming | dnode-3.hdfs.private | 7    | 4.9 GB  |
 | …5050-16506-1146247 | FooStreaming | dnode-1.hdfs.private | 0.5  | 5.9 GB  |
 | …5050-16506-1146226 | FooStreaming | dnode-2.hdfs.private | 3    | 7.9 GB  |
 | …5050-16506-1144069 | FooStreaming | dnode-3.hdfs.private | 1    | 8.7 GB  |
 | …5050-16506-1133091 | FooStreaming | dnode-5.hdfs.private | 1    | 1.7 GB  |
 | …5050-16506-1133090 | FooStreaming | dnode-2.hdfs.private | 5    | 5.2 GB  |
 | …5050-16506-1133089 | FooStreaming | dnode-1.hdfs.private | 6.5  | 6.3 GB  |
 | …5050-16506-1133088 | FooStreaming | dnode-4.hdfs.private | 1    | 251 MB  |
 | …5050-16506-1133087 | FooStreaming | dnode-0.hdfs.private | 6.4  | 6.8 GB  |

 The only way to release the resources is by manually finding the process
in the cluster and killing it. The jobs are often streaming jobs, but batch
jobs show this behavior too. We have more streaming jobs than batch jobs, so
the stats are biased.
 Any ideas what's going on here? Hopefully it's some very bad, ugly bug that
has already been fixed and that will urge us to upgrade our infra.

 Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0

 -kr, Gerard.


Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Michael Armbrust

 I'm not actually using Hive at the moment - in fact, I'm trying to avoid
 it if I can. I'm just wondering whether Spark has anything similar I can
 leverage?


Let me clarify: you do not need to have Hive installed, and what I'm
suggesting is completely self-contained in Spark SQL.  We support the Hive
Query Language for expressing partitioned tables when you are using a
HiveContext, but the execution will be done using RDDs.  If you don't
manually configure a Hive installation, Spark will just create a local
metastore in the current directory.

In the future we are planning to support non-HiveQL mechanisms for
expressing partitioning.
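
For illustration, a minimal sketch of a partitioned table defined through a
HiveContext (table name, column names, and paths below are made up):

import org.apache.spark.sql.hive.HiveContext

// No Hive installation needed; a local metastore is created automatically.
val hiveContext = new HiveContext(sc)

// Declare a partitioned external table over existing files and register one partition.
hiveContext.sql(
  "CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING) " +
  "PARTITIONED BY (dt STRING) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
  "LOCATION '/data/events'")

hiveContext.sql(
  "ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2015-01-26') " +
  "LOCATION '/data/events/dt=2015-01-26'")

// A filter on the partition column only touches the matching directories.
hiveContext.sql("SELECT COUNT(*) FROM events WHERE dt = '2015-01-26'").collect().foreach(println)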


Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
Hi Jörn,

A memory leak on the job would be contained within the resources reserved
for it, wouldn't it?
And the job holding resources is not always the same. Sometimes it's one of
the Streaming jobs, sometimes it's a heavy batch job that runs every hour.
It looks to me like whatever is causing the issue is participating in the
Mesos resource offer protocol, and my first suspect would be the Mesos
scheduler in Spark. (The table above is from the Offers tab of the Mesos UI.)

Are there any other factors involved in the offer acceptance/rejection
between Mesos and a scheduler?

What do you think?

-kr, Gerard.
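
One angle that may be worth checking (a hypothesis, not a confirmed diagnosis): in
the default fine-grained Mesos mode, CPUs are offered and released per task while
executor memory stays reserved for the lifetime of the application, so long-running
streaming jobs can look like they are hoarding memory. A sketch of pinning a job to
coarse-grained mode with explicit caps (the master URL and values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")   // hypothetical Mesos master URL
  .setAppName("FooStreaming")
  .set("spark.mesos.coarse", "true")    // one long-lived executor per node, sized up front
  .set("spark.cores.max", "4")          // cap on the total cores the framework holds
  .set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)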

On Mon, Jan 26, 2015 at 11:23 PM, Jörn Franke jornfra...@gmail.com wrote:

 Hi,

 What do your jobs do? Ideally post source code, but some description
 would already be helpful to support you.

 Memory leaks can have several reasons - it may not be Spark at all.

 Thank you.

 On 26 Jan 2015 22:28, Gerard Maas gerard.m...@gmail.com wrote:

 
  (looks like the list didn't like an HTML table in the previous email. My
 apologies for any duplicates)
 
  Hi,
 
  We are observing with certain regularity that our Spark jobs, running as
 Mesos frameworks, are hoarding resources and not releasing them, resulting in
 resource starvation for all jobs running on the Mesos cluster.
 
  For example:
  This is a job that has spark.cores.max = 4 and spark.executor.memory=3g
 
  | ID                  | Framework    | Host                 | CPUs | Mem     |
  | …5050-16506-1146497 | FooStreaming | dnode-4.hdfs.private | 7    | 13.4 GB |
  | …5050-16506-1146495 | FooStreaming | dnode-0.hdfs.private | 1    | 6.4 GB  |
  | …5050-16506-1146491 | FooStreaming | dnode-5.hdfs.private | 7    | 11.9 GB |
  | …5050-16506-1146449 | FooStreaming | dnode-3.hdfs.private | 7    | 4.9 GB  |
  | …5050-16506-1146247 | FooStreaming | dnode-1.hdfs.private | 0.5  | 5.9 GB  |
  | …5050-16506-1146226 | FooStreaming | dnode-2.hdfs.private | 3    | 7.9 GB  |
  | …5050-16506-1144069 | FooStreaming | dnode-3.hdfs.private | 1    | 8.7 GB  |
  | …5050-16506-1133091 | FooStreaming | dnode-5.hdfs.private | 1    | 1.7 GB  |
  | …5050-16506-1133090 | FooStreaming | dnode-2.hdfs.private | 5    | 5.2 GB  |
  | …5050-16506-1133089 | FooStreaming | dnode-1.hdfs.private | 6.5  | 6.3 GB  |
  | …5050-16506-1133088 | FooStreaming | dnode-4.hdfs.private | 1    | 251 MB  |
  | …5050-16506-1133087 | FooStreaming | dnode-0.hdfs.private | 6.4  | 6.8 GB  |
 
  The only way to release the resources is by manually finding the process
 in the cluster and killing it. The jobs are often streaming jobs, but batch
 jobs show this behavior too. We have more streaming jobs than batch jobs, so
 the stats are biased.
  Any ideas what's going on here? Hopefully it's some very bad, ugly bug that
 has already been fixed and that will urge us to upgrade our infra.
 
  Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0
 
  -kr, Gerard.




Re: spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Michael Armbrust
I'm aiming for 1.3.
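
Until that lands, one possible workaround sketch (not the eventual built-in support,
just a stopgap; class names and the output path are made up): store the timestamp as
epoch milliseconds before writing and convert back after reading.

import java.sql.Timestamp
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit conversion from an RDD of case classes to SchemaRDD

case class Event(id: Long, ts: Timestamp)      // original shape with a timestamp
case class EventOut(id: Long, tsMillis: Long)  // Parquet-friendly shape

// Hypothetical data; in the original case this would come from the parsed CSV.
val events = sc.parallelize(Seq(Event(1L, new Timestamp(1422230400000L))))

// Convert the timestamp to a long before writing; Parquet handles longs fine.
events.map(e => EventOut(e.id, e.ts.getTime)).saveAsParquetFile("/tmp/events_parquet")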

On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com
wrote:

 Thanks Michael. I am sure there have been many requests for this support.

 Any release targeted for this?

 Thanks,

 On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com
  wrote:

 Those annotations actually don't work because the timestamp in SQL has
 optional nanosecond precision.

 However, there is a PR to add support using Parquet's INT96 type:
 https://github.com/apache/spark/pull/3820

 On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Looking further at the trace and ParquetTypes.scala, it seems there is
 no support for Timestamp and Date in fromPrimitiveDataType(ctype:
 DataType): Option[ParquetTypeInfo]. Since Parquet supports these types
 with some decoration over Int (
 https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md),
 any reason why Date / Timestamp are not supported right now?

 Thanks,

 Manoj


 On Fri, Jan 23, 2015 at 11:40 AM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Using Spark 1.2

 Read a CSV file, apply schema to convert to SchemaRDD and then
 schemaRdd.saveAsParquetFile

 If the schema includes Timestamptype, it gives following trace when
 doing the save

 Exception in thread main java.lang.RuntimeException: Unsupported
 datatype TimestampType

 at scala.sys.package$.error(package.scala:27)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(
 ParquetTypes.scala:343)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(
 ParquetTypes.scala:292)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(
 ParquetTypes.scala:291)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(
 ParquetTypes.scala:363)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(
 ParquetTypes.scala:362)

 at scala.collection.TraversableLike$$anonfun$map$1.apply(
 TraversableLike.scala:244)

 at scala.collection.TraversableLike$$anonfun$map$1.apply(
 TraversableLike.scala:244)

 at scala.collection.mutable.ResizableArray$class.foreach(
 ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244
 )

 at scala.collection.AbstractTraversable.map(Traversable.scala:105)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(
 ParquetTypes.scala:361)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(
 ParquetTypes.scala:407)

 at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(
 ParquetRelation.scala:166)

 at org.apache.spark.sql.parquet.ParquetRelation$.create(
 ParquetRelation.scala:145)

 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(
 SparkStrategies.scala:204)

 at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(
 QueryPlanner.scala:58)

 at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(
 QueryPlanner.scala:58)

 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

 at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(
 QueryPlanner.scala:59)

 at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(
 SQLContext.scala:418)

 at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(
 SQLContext.scala:416)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(
 SQLContext.scala:422)

 at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(
 SQLContext.scala:422)

 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(
 SQLContext.scala:425)

 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(
 SQLContext.scala:425)

 at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(
 SchemaRDDLike.scala:76)

 at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:108
 )

 at bdrt.MyTest$.createParquetWithDate(MyTest.scala:88)

 at bdrt.MyTest$delayedInit$body.apply(MyTest.scala:54)

 at scala.Function0$class.apply$mcV$sp(Function0.scala:40)

 at scala.runtime.AbstractFunction0.apply$mcV$sp(
 AbstractFunction0.scala:12)

 at scala.App$$anonfun$main$1.apply(App.scala:71)

 at scala.App$$anonfun$main$1.apply(App.scala:71)

 at scala.collection.immutable.List.foreach(List.scala:318)

 at scala.collection.generic.TraversableForwarder$class.foreach(
 TraversableForwarder.scala:32)

 at scala.App$class.main(App.scala:71)

 at bdrt.MyTest$.main(MyTest.scala:10)








Re: spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Manoj Samel
Thanks Michael. I am sure there have been many requests for this support.

Any release targeted for this?

Thanks,

On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com
wrote:

 Those annotations actually don't work because the timestamp in SQL has
 optional nanosecond precision.

 However, there is a PR to add support using Parquet's INT96 type:
 https://github.com/apache/spark/pull/3820

 On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Looking further at the trace and ParquetTypes.scala, it seems there is no
 support for Timestamp and Date in fromPrimitiveDataType(ctype: DataType):
 Option[ParquetTypeInfo]. Since Parquet supports these types with some
 decoration over Int (
 https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md),
 any reason why Date / Timestamp are not supported right now?

 Thanks,

 Manoj


 On Fri, Jan 23, 2015 at 11:40 AM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Using Spark 1.2

 Read a CSV file, apply schema to convert to SchemaRDD and then
 schemaRdd.saveAsParquetFile

 If the schema includes Timestamptype, it gives following trace when
 doing the save

 Exception in thread main java.lang.RuntimeException: Unsupported
 datatype TimestampType

 at scala.sys.package$.error(package.scala:27)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(
 ParquetTypes.scala:343)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(
 ParquetTypes.scala:292)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(
 ParquetTypes.scala:291)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(
 ParquetTypes.scala:363)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(
 ParquetTypes.scala:362)

 at scala.collection.TraversableLike$$anonfun$map$1.apply(
 TraversableLike.scala:244)

 at scala.collection.TraversableLike$$anonfun$map$1.apply(
 TraversableLike.scala:244)

 at scala.collection.mutable.ResizableArray$class.foreach(
 ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

 at scala.collection.AbstractTraversable.map(Traversable.scala:105)

 at
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(
 ParquetTypes.scala:361)

 at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(
 ParquetTypes.scala:407)

 at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(
 ParquetRelation.scala:166)

 at org.apache.spark.sql.parquet.ParquetRelation$.create(
 ParquetRelation.scala:145)

 at
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(
 SparkStrategies.scala:204)

 at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(
 QueryPlanner.scala:58)

 at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(
 QueryPlanner.scala:58)

 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

 at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(
 QueryPlanner.scala:59)

 at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(
 SQLContext.scala:418)

 at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(
 SQLContext.scala:416)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(
 SQLContext.scala:422)

 at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(
 SQLContext.scala:422)

 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(
 SQLContext.scala:425)

 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(
 SQLContext.scala:425)

 at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(
 SchemaRDDLike.scala:76)

 at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:108)

 at bdrt.MyTest$.createParquetWithDate(MyTest.scala:88)

 at bdrt.MyTest$delayedInit$body.apply(MyTest.scala:54)

 at scala.Function0$class.apply$mcV$sp(Function0.scala:40)

 at scala.runtime.AbstractFunction0.apply$mcV$sp(
 AbstractFunction0.scala:12)

 at scala.App$$anonfun$main$1.apply(App.scala:71)

 at scala.App$$anonfun$main$1.apply(App.scala:71)

 at scala.collection.immutable.List.foreach(List.scala:318)

 at scala.collection.generic.TraversableForwarder$class.foreach(
 TraversableForwarder.scala:32)

 at scala.App$class.main(App.scala:71)

 at bdrt.MyTest$.main(MyTest.scala:10)







Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Ah, well that is interesting. I'll experiment further tomorrow. Thank you for 
the info!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark webUI - application details page

2015-01-26 Thread ilaxes
Hi,

I don't have any history server running. As SK already pointed out in a
previous post, the history server seems to be required only in Mesos or YARN
mode, not in standalone mode.

https://spark.apache.org/docs/1.1.1/monitoring.html
If Spark is run on Mesos or YARN, it is still possible to reconstruct the
UI of a finished application through Spark’s history server, provided that
the application’s event logs exist.
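
For the standalone master to show a finished application's details page, the usual
requirement (as far as I can tell from the standalone-mode docs, so treat this as a
sketch) is that event logging was enabled while the application ran; the log
directory below is illustrative and must exist up front:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")                                    // hypothetical
  .set("spark.eventLog.enabled", "true")                   // record events as the app runs
  .set("spark.eventLog.dir", "hdfs:///user/spark/events")  // illustrative location
val sc = new SparkContext(conf)
// After sc.stop(), the application's UI can be rebuilt from the event log and
// reached from the standalone master web UI (port 8080 by default).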



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p21379.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org