Re: Sparse Vector ArrayIndexOutOfBoundsException

2015-12-04 Thread Yanbo Liang
Could you also print the length of featureSet? I suspect it is less than 62.
The first argument of Vectors.sparse() is the total size of the sparse vector,
not the number of non-zero elements.
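For example, a minimal sketch (the indices and values below are made up):

import org.apache.spark.mllib.linalg.Vectors

// The first argument is the total size of the vector (63 here), so every
// index in the (index, value) pairs must fall in [0, 62].
val ok = Vectors.sparse(63, Seq((0, 1.0), (5, 3.0), (62, 2.0)))

// If the declared size were only 62, index 62 would be out of range;
// depending on the code path, the failure may only surface later, e.g. in
// the dot product inside predict, as in the stack trace below.
// val bad = Vectors.sparse(62, Seq((62, 2.0)))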

Yanbo

2015-12-03 22:30 GMT+08:00 nabegh :

> I'm trying to run a SVM classifier on unlabeled data. I followed  this
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-sparse-vector-td14273.html
> >
> to build the vectors and checked  this
> <
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L313
> >
>
> Now when I make the call to predict, I receive the following error. Any
> hints?
>
> val v = featureRDD.map(f => Vectors.sparse(featureSet.length, f))
> //length = 63
> val predictions = model.predict(v)
> println(s"predictions length  = ${predictions.collect.length}")
>
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 1 in stage 121.0 failed 4 times, most recent
> failure:
> Lost task 1.3 in stage 121.0 (TID 233, 10.1.1.63):
> java.lang.ArrayIndexOutOfBoundsException: 62
> at
>
> breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$98.apply(SparseVectorOps.scala:297)
> at
>
> breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$98.apply(SparseVectorOps.scala:282)
> at
> breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60)
> at breeze.linalg.VectorOps$$anon$171.apply(Vector.scala:528)
> at breeze.linalg.ImmutableNumericOps$class.dot(NumericOps.scala:98)
> at breeze.linalg.DenseVector.dot(DenseVector.scala:50)
> at
> org.apache.spark.mllib.classification.SVMModel.predictPoint(SVM.scala:81)
> at
>
> org.apache.spark.mllib.regression.GeneralizedLinearModel$$anonfun$predict$1$$anonfun$apply$1.apply(GeneralizedLinearAlgorithm.scala:71)
> at
>
> org.apache.spark.mllib.regression.GeneralizedLinearModel$$anonfun$predict$1$$anonfun$apply$1.apply(GeneralizedLinearAlgorithm.scala:71)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
>
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:909)
> at
>
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:909)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Sparse-Vector-ArrayIndexOutOfBoundsException-tp25556.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Predictive Modeling

2015-12-04 Thread Chintan Bhatt
Hi,
I'm very interested in building a predictive model using crime data (from
2001 to the present; it is a big .csv file of about 1.5 GB) in Spark on Hortonworks.
Can anyone tell me how to start?

-- 
CHINTAN BHATT 
Assistant Professor,
U & P U Patel Department of Computer Engineering,
Chandubhai S. Patel Institute of Technology,
Charotar University of Science And Technology (CHARUSAT),
Changa-388421, Gujarat, INDIA.
http://www.charusat.ac.in
*Personal Website*: https://sites.google.com/a/ecchanga.ac.in/chintan/



-- 


DISCLAIMER: The information transmitted is intended only for the person or 
entity to which it is addressed and may contain confidential and/or 
privileged material which is the intellectual property of Charotar 
University of Science & Technology (CHARUSAT). Any review, retransmission, 
dissemination or other use of, or taking of any action in reliance upon 
this information by persons or entities other than the intended recipient 
is strictly prohibited. If you are not the intended recipient, or the 
employee, or agent responsible for delivering the message to the intended 
recipient and/or if you have received this in error, please contact the 
sender and delete the material from the computer or device. CHARUSAT does 
not take any liability or responsibility for any malicious codes/software 
and/or viruses/Trojan horses that may have been picked up during the 
transmission of this message. By opening and solely relying on the contents 
or part thereof this message, and taking action thereof, the recipient 
relieves the CHARUSAT of all the liabilities including any damages done to 
the recipient's pc/laptop/peripherals and other communication devices due 
to any reason.


Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
Yes, it results in a shuffle.



> On Dec 4, 2015, at 6:04 PM, Stephen Boesch  wrote:
> 
> @Yu Fengdong:  Your approach - specifically the groupBy results in a shuffle 
> does it not?
> 
> 2015-12-04 2:02 GMT-08:00 Fengdong Yu  >:
> There are many ways, one simple is:
> 
> such as: you want to know how many rows for each month:
> 
> sqlContext.read.parquet("..../month=*").select($"month").groupBy($"month").count
> 
> 
> the output looks like:
> 
> month    count
> 201411   100
> 201412   200
> 
> 
> Hope this helps.
> 
> 
> 
> > On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas  > > wrote:
> >
> > Hi there,
> >
> > I have my data stored in HDFS partitioned by month in Parquet format.
> > The directory looks like this:
> >
> > -month=201411
> > -month=201412
> > -month=201501
> > -
> >
> > I want to compute some aggregates for every timestamp.
> > How is it possible to achieve that by taking advantage of the existing 
> > partitioning?
> > One naive way I am thinking is issuing multiple sql queries:
> >
> > SELECT * FROM TABLE WHERE month=201411
> > SELECT * FROM TABLE WHERE month=201412
> > SELECT * FROM TABLE WHERE month=201501
> > .
> >
> > computing the aggregates on the results of each query and combining them in 
> > the end.
> >
> > I think there should be a better way right?
> >
> > Thanks
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Questions about Spark Shuffle and Heap

2015-12-04 Thread Jianneng Li
Hi,

On the Spark Configuration page (
http://spark.apache.org/docs/1.5.2/configuration.html), the documentation
for spark.shuffle.memoryFraction mentions that the fraction is taken from
the Java heap. However, the documentation
for spark.shuffle.io.preferDirectBufs implies that off-heap memory might be
used instead during shuffles. Do these two explanations conflict with each other?

In a related question, with Tungsten enabled, when does Spark use off-heap
memory?

Thanks,

Jianneng




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-Shuffle-and-Heap-tp25564.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Is it possible to pass additional parameters to a python function when used inside RDD.filter method?

2015-12-04 Thread Abhishek Shivkumar

Hi,

 I am using spark with python and I have a filter constraint as follows:

my_rdd.filter(my_func)

where my_func is a method I wrote to filter the RDD items based on my 
own logic. I have defined my_func as follows:


def my_func(my_item):
    ...

Now, I want to pass another separate parameter to my_func, besides the 
item that goes into it. How can I do that? I know my_item will refer to 
one item that comes from my_rdd and how can I pass my own parameter 
(let's say my_param) as an additional parameter to my_func?


Thanks
Abhishek S


--


*NOTICE AND DISCLAIMER*

This email (including attachments) is confidential. If you are not the 
intended recipient, notify the sender immediately, delete this email from 
your system and do not disclose or use for any purpose.


Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United 
Kingdom
Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. United 
Kingdom
Big Data Partnership Limited is a company registered in England & Wales 
with Company No 7904824


Avoid Shuffling on Partitioned Data

2015-12-04 Thread Yiannis Gkoufas
Hi there,

I have my data stored in HDFS partitioned by month in Parquet format.
The directory looks like this:

-month=201411
-month=201412
-month=201501
-

I want to compute some aggregates for every timestamp.
How is it possible to achieve that by taking advantage of the existing
partitioning?
One naive way I am thinking is issuing multiple sql queries:

SELECT * FROM TABLE WHERE month=201411
SELECT * FROM TABLE WHERE month=201412
SELECT * FROM TABLE WHERE month=201501
.

computing the aggregates on the results of each query and combining them in
the end.

I think there should be a better way right?

Thanks


Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Stephen Boesch
@Yu Fengdong:  Your approach - specifically the groupBy - results in a
shuffle, does it not?

2015-12-04 2:02 GMT-08:00 Fengdong Yu :

> There are many ways, one simple is:
>
> such as: you want to know how many rows for each month:
>
>
> sqlContext.read.parquet("..../month=*").select($"month").groupBy($"month").count
>
>
> the output looks like:
>
> month    count
> 201411   100
> 201412   200
> 
> 
> Hope this helps.
>
>
>
> > On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas 
> wrote:
> >
> > Hi there,
> >
> > I have my data stored in HDFS partitioned by month in Parquet format.
> > The directory looks like this:
> >
> > -month=201411
> > -month=201412
> > -month=201501
> > -
> >
> > I want to compute some aggregates for every timestamp.
> > How is it possible to achieve that by taking advantage of the existing
> partitioning?
> > One naive way I am thinking is issuing multiple sql queries:
> >
> > SELECT * FROM TABLE WHERE month=201411
> > SELECT * FROM TABLE WHERE month=201412
> > SELECT * FROM TABLE WHERE month=201501
> > .
> >
> > computing the aggregates on the results of each query and combining them
> in the end.
> >
> > I think there should be a better way right?
> >
> > Thanks
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Getting error when trying to start master node after building spark 1.3

2015-12-04 Thread Mich Talebzadeh
Hi,

 

 

I am trying to make Hive work with Spark.

 

I have been told that I need to use Spark 1.3 and build it from source code
WITHOUT HIVE libraries.

 

I have built it as follows:

 

./make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

 

 

Now the issue I have is that I cannot start the master node.

 

When I try

 

hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin>
./start-master.sh

starting org.apache.spark.deploy.master.Master, logging to
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.
apache.spark.deploy.master.Master-1-rhes564.out

failed to launch org.apache.spark.deploy.master.Master:

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

full log in
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.
apache.spark.deploy.master.Master-1-rhes564.out

 

I get

 

Spark Command: /usr/java/latest/bin/java -cp
:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1
.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home
/hduser/hadoop-2.6.0/etc/hadoop -XX:MaxPermSize=128m
-Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077
--webui-port 8080



 

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

at java.lang.Class.getDeclaredMethods0(Native Method)

at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)

at java.lang.Class.getMethod0(Class.java:2764)

at java.lang.Class.getMethod(Class.java:1653)

at
sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)

at
sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

 

Any advice will be appreciated.

 

Thanks,

 

Mich

 

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 



Re: Turning off DTD Validation using XML Utils package - Spark

2015-12-04 Thread Darin McBeath
ok, a new capability has been added to spark-xml-utils (1.3.0) to address this 
request.

Essentially, the capability to specify 'processor' features has been added 
(through a new getInstance function).  Here is a list of  the features that can 
be set 
(http://www.saxonica.com/html/documentation/javadoc/net/sf/saxon/lib/FeatureKeys.html).
  Since we are leveraging the s9apiProcessor under the covers, features 
relevant to that are the only ones that would make sense to use.

To address your request of completely ignoring the Doctype declaration in the 
xml, you would need to do the following:

import net.sf.saxon.lib.FeatureKeys;
HashMap featureMap = new HashMap();
featureMap.put(FeatureKeys.ENTITY_RESOLVER_CLASS, 
"com.somepackage.IgnoreDoctype");
// The first parameter is the xpath expression
// The second parameter is the hashmap for the namespace mappings (in this case 
there are none)
// The third parameter is the hashmap for the processor features
XPathProcessor proc = XPathProcessor.getInstance("/books/book",null,featureMap);


The following evaluation should now work ok (the element names in the sample XML 
below are only illustrative):

proc.evaluateString("<books><book><title>Some Book</title>"
    + "<author>Some Author</author><year>2005</year>"
    + "<price>29.99</price></book></books>");


You then would define the following class (and make sure it is included in your 
application)

package com.somepackage;
import java.io.ByteArrayInputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class IgnoreDoctype implements EntityResolver {

public InputSource resolveEntity(java.lang.String publicId, java.lang.String 
systemId)
 throws SAXException, java.io.IOException
  {
  // Ignore everything
  return new InputSource(new ByteArrayInputStream("".getBytes()));

  }
 

}


Lastly, you will need to include the saxon9he jar file (open source version).

This would work for XPath, XQuery, and XSLT.

Hope this helps.  When I get a chance, I will update the spark-xml-utils github 
site with details on the new getInstance function and some more information on 
the various features.
Darin.



From: Darin McBeath 
To: "user@spark.apache.org"  
Sent: Tuesday, December 1, 2015 11:51 AM
Subject: Re: Turning off DTD Validation using XML Utils package - Spark




The problem isn't really with DTD validation (by default validation is 
disabled).  The underlying problem is that the DTD can't be found (which is 
indicated in your stack trace below).  The underlying parser will try and 
retrieve the DTD (regardless of  validation) because things such as entities 
could be expressed in the DTD.

I will explore providing access to some of the underlying 'processor' 
configurations.  For example, you could provide your own EntityResolver class 
that could either completely ignore the Doctype declaration (return a 'dummy' 
DTD that is completely empty) or you could have it find 'local' versions (on 
the workers or in S3 and then cache them locally for performance).  

I will post an update when the code has been adjusted.

Darin.

- Original Message -
From: Shivalik 
To: user@spark.apache.org
Sent: Tuesday, December 1, 2015 8:15 AM
Subject: Turning off DTD Validation using XML Utils package - Spark

Hi Team,

I've been using XML Utils library 
(http://spark-packages.org/package/elsevierlabs-os/spark-xml-utils) to parse
XML using XPath in a spark job. One problem I am facing is with the DTDs. My
XML file, has a doctype tag included in it.

I want to turn off DTD validation using this library since I don't have
access to the DTD file. Has someone faced this problem before? Please help.

The exception I am getting it is as below:

stage 0.0 (TID 0, localhost):
com.elsevier.spark_xml_utils.xpath.XPathException: I/O error reported by XML
parser processing null: /filename.dtd (No such file or directory)

at
com.elsevier.spark_xml_utils.xpath.XPathProcessor.evaluate(XPathProcessor.java:301)

at
com.elsevier.spark_xml_utils.xpath.XPathProcessor.evaluateString(XPathProcessor.java:219)

at com.thomsonreuters.xmlutils.XMLParser.lambda$0(XMLParser.java:31)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Turning-off-DTD-Validation-using-XML-Utils-package-Spark-tp25534.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: 

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
There are many ways, one simple is:

such as: you want to know how many rows for each month:

sqlContext.read.parquet("..../month=*").select($"month").groupBy($"month").count


the output looks like:

month    count
201411   100
201412   200


Hope this helps.
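A slightly fuller sketch of the same idea (the path is a placeholder; the $-syntax
needs the sqlContext implicits in scope):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Reading the partitioned layout (.../month=201411, .../month=201412, ...) lets
// Spark discover "month" as a partition column, so a filter on it should be
// answered by partition pruning, and the groupBy only shuffles the partially
// aggregated counts.
val df = sqlContext.read.parquet("hdfs:///path/to/table")

df.groupBy($"month").count().show()

// Restricting to a subset of months only reads the matching directories:
df.filter($"month" >= 201411 && $"month" <= 201412)
  .groupBy($"month")
  .count()
  .show()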



> On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas  wrote:
> 
> Hi there,
> 
> I have my data stored in HDFS partitioned by month in Parquet format.
> The directory looks like this:
> 
> -month=201411
> -month=201412
> -month=201501
> -
> 
> I want to compute some aggregates for every timestamp.
> How is it possible to achieve that by taking advantage of the existing 
> partitioning?
> One naive way I am thinking is issuing multiple sql queries:
> 
> SELECT * FROM TABLE WHERE month=201411
> SELECT * FROM TABLE WHERE month=201412
> SELECT * FROM TABLE WHERE month=201501
> .
> 
> computing the aggregates on the results of each query and combining them in 
> the end.
> 
> I think there should be a better way right?
> 
> Thanks


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Sean Owen
There is no way to upgrade a running cluster here. You can stop a
cluster, and simply start a new cluster in the same way you started
the original cluster. That ought to be simple; the only issue I
suppose is that you have down-time since you have to shut the whole
thing down, but maybe that's acceptable.

If you have data, including HDFS, set up on ephemeral disks though
then yes that is lost. Really that's an 'ephemeral' HDFS cluster. It
has nothing to do with partitions.

You would want to get the data out to S3 first, and then copy it back
in later. Yes it's manual, but works fine.

For more production use cases, on Amazon, you probably want to look
into a distribution or product around Spark rather than manage it
yourself. That could be AWS's own EMR, Databricks cloud, or even CDH
running on AWS. Those would give you much more of a chance of
automatically getting updates and so on, but they're fairly different
products.

On Fri, Dec 4, 2015 at 3:21 AM, Divya Gehlot  wrote:
> Hello,
> Even I have the same queries in mind .
> What all the upgrades where we can use EC2 as compare to normal servers for
> spark and other big data product development .
> Hope to get inputs from the community .
>
> Thanks,
> Divya
>
> On Dec 4, 2015 6:05 AM, "Andy Davidson" 
> wrote:
>>
>> About 2 months ago I used spark-ec2 to set up a small cluster. The cluster
>> runs a spark streaming app 7x24 and stores the data to hdfs. I also need to
>> run some batch analytics on the data.
>>
>> Now that I have a little more experience I wonder if this was a good way
>> to set up the cluster the following issues
>>
>> I have not been able to find explicit directions for upgrading the spark
>> version
>>
>>
>> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2=Re+Upgrading+Spark+in+EC2+clusters
>>
>> I am not sure where the data is physically stored. I think I may
>> accidentally lose all my data
>> spark-ec2 makes it easy to launch a cluster with as many machines as you
>> like how ever Its not clear how I would add slaves to an existing
>> installation
>>
>>
>> Our Java streaming app we call rdd.saveAsTextFile(“hdfs://path”);
>>
>> ephemeral-hdfs/conf/hdfs-site.xml:
>>
>>   
>>
>> dfs.data.dir
>>
>> /mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data
>>
>>   
>>
>>
>> persistent-hdfs/conf/hdfs-site.xml
>>
>>
>> $ mount
>>
>> /dev/xvdb on /mnt type ext3 (rw,nodiratime)
>>
>> /dev/xvdf on /mnt2 type ext3 (rw,nodiratime)
>>
>>
>> http://spark.apache.org/docs/latest/ec2-scripts.html
>>
>>
>> "The spark-ec2 script also supports pausing a cluster. In this case, the
>> VMs are stopped but not terminated, so they lose all data on ephemeral disks
>> but keep the data in their root partitions and their persistent-hdfs.”
>>
>>
>> Initially I though using HDFS was a good idea. spark-ec2 makes HDFS easy
>> to use. I incorrectly thought spark some how knew how HDFS partitioned my
>> data.
>>
>> I think many people are using amazon s3. I do not have an direct
>> experience with S3. My concern would be that the data is not physically
>> stored closed to my slaves. I.e. High communication costs.
>>
>> Any suggestions would be greatly appreciated
>>
>> Andy

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkR in Spark 1.5.2 jsonFile Bug Found

2015-12-04 Thread Yanbo Liang
I have created SPARK-12146 to track this issue.

2015-12-04 9:16 GMT+08:00 Felix Cheung :

> It looks like this has been broken around Spark 1.5.
>
> Please see JIRA SPARK-10185. This has been fixed in pyspark but
> unfortunately SparkR was missed. I have confirmed this is still broken in
> Spark 1.6.
>
> Could you please open a JIRA?
>
>
>
>
>
> On Thu, Dec 3, 2015 at 2:08 PM -0800, "tomasr3" <
> tomas.rodrig...@transvoyant.com> wrote:
>
> Hello,
>
> I believe I have encountered a bug with Spark 1.5.2. I am using RStudio
> and
> SparkR to read in JSON files with jsonFile(sqlContext, "path"). If "path"
> is
> a single path (e.g., "/path/to/dir0"), then it works fine;
>
> but, when "path" is a vector of paths (e.g.
>
> path <- c("/path/to/dir1","/path/to/dir2"), then I get the following error
> message:
>
> > raw.terror<-jsonFile(sqlContext,path)
> 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   java.io.IOException: No input paths specified in job
> at
>
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
> at
>
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
> at
> org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
>
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
>
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2
>
> Note that passing a vector of paths in Spark-1.4.1 works just fine. Any
> help
> is greatly appreciated if this is not a bug and perhaps an environment or
> different issue.
>
> Best,
> T
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-Spark-1-5-2-jsonFile-Bug-Found-tp25560.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark Streaming from S3

2015-12-04 Thread Steve Loughran

On 3 Dec 2015, at 19:31, Michele Freschi 
> wrote:

Hi Steve,

I’m on Hadoop 2.7.1 using the s3n connector

switch to s3a. It's got better performance on big files (including a better 
forward seek that doesn't close connections; a faster close() on reads, + uses 
Amazon's own libraries).

if you still have issues, file under 
https://issues.apache.org/jira/browse/HADOOP-11694 .

FWIW there aren't any explicit tests in the hadoop codebase for working at that 
scale; there is one testing directory deletion that can be configured to scale 
up, but it's not doing the same actions as you —and it doesn't usually get run 
for more than 100+ blobs, just because object store test runs slow down 
builds, can be a bit unreliable, and for cost & credential security, aren't run 
in the ASF jenkins builds.

As s3a is the only one being worked on (we're too scared of breaking s3n), it's 
the one to try —and complain about if it underperforms

-Steve


From: Steve Loughran >
Date: Thursday, December 3, 2015 at 4:12 AM
Cc: SPARK-USERS >
Subject: Re: Spark Streaming from S3


On 3 Dec 2015, at 00:42, Michele Freschi 
> wrote:

Hi all,

I have an app streaming from s3 (textFileStream) and recently I've observed 
increasing delay and long time to list files:

INFO dstream.FileInputDStream: Finding new files took 394160 ms
...
INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020 ms 
(execution: 10.154 s)

At this time I have about 13K files under the key prefix that I'm monitoring - 
hadoop takes about 6 minutes to list all the files while aws cli takes only 
seconds.
My understanding is that this is a current limitation of hadoop but I wanted to 
confirm it in case it's a misconfiguration on my part.

not a known issue.

Usual questions: which Hadoop version and are you using s3n or s3a connectors. 
The latter does use the AWS sdk, but it's only been stable enough to use in 
Hadoop 2.7


Some alternatives I'm considering:
1. copy old files to a different key prefix
2. use one of the available SQS receivers 
(https://github.com/imapi/spark-sqs-receiver
 ?)
3. implement the s3 listing outside of spark and use socketTextStream, but I 
couldn't find if it's reliable or not
4. create a custom s3 receiver using aws sdk (even if doesn't look like it's 
possible to use them from pyspark)

Has anyone experienced the same issue and found a better way to solve it ?

Thanks,
Michele





anyone who can help me out with this error please

2015-12-04 Thread Mich Talebzadeh
 

Hi,

 

 

I am trying to make Hive work with Spark.

 

I have been told that I need to use Spark 1.3 and build it from source code
WITHOUT HIVE libraries.

 

I have built it as follows:

 

./make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

 

 

Now the issue I have is that I cannot start the master node.

 

When I try

 

hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin

> ./start-master.sh

starting org.apache.spark.deploy.master.Master, logging to
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.
apache.spark.deploy.master.Master-1-rhes564.out

failed to launch org.apache.spark.deploy.master.Master:

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

full log in
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.
apache.spark.deploy.master.Master-1-rhes564.out

 

I get

 

Spark Command: /usr/java/latest/bin/java -cp
:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1
.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home
/hduser/hadoop-2.6.0/etc/hadoop -XX:MaxPermSize=128m
-Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077
--webui-port 8080



 

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

at java.lang.Class.getDeclaredMethods0(Native Method)

at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)

at java.lang.Class.getMethod0(Class.java:2764)

at java.lang.Class.getMethod(Class.java:1653)

at
sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)

at
sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

 

Any advice will be appreciated.

 

Thanks,

 

Mich

 

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 



has someone seen this error please?

2015-12-04 Thread Mich Talebzadeh
Hi,

 

 

I am trying to make Hive work with Spark.

 

I have been told that I need to use Spark 1.3 and build it from source code
WITHOUT HIVE libraries.

 

I have built it as follows:

 

./make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

 

 

Now the issue I have is that I cannot start the master node.

 

When I try

 

hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin

> ./start-master.sh

starting org.apache.spark.deploy.master.Master, logging to
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.
apache.spark.deploy.master.Master-1-rhes564.out

failed to launch org.apache.spark.deploy.master.Master:

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

full log in
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.
apache.spark.deploy.master.Master-1-rhes564.out

 

I get

 

Spark Command: /usr/java/latest/bin/java -cp
:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1
.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home
/hduser/hadoop-2.6.0/etc/hadoop -XX:MaxPermSize=128m
-Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077
--webui-port 8080



 

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

at java.lang.Class.getDeclaredMethods0(Native Method)

at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)

at java.lang.Class.getMethod0(Class.java:2764)

at java.lang.Class.getMethod(Class.java:1653)

at
sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)

at
sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

 

Any advice will be appreciated.

 

Thanks,

 

Mich

 

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 



Re: Spark Streaming Specify Kafka Partition

2015-12-04 Thread Cody Koeninger
So createDirectStream will give you a JavaInputDStream of R, where R is the
return type you chose for your message handler.

If you want a JavaPairInputDStream, you may have to call .mapToPair in
order to convert the stream, even if the type you chose for R was already
Tuple2

(note that I try to stay as far away from Java as possible, so this answer
is untested, possibly inaccurate, may throw checked exceptions etc etc)
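For reference, the Scala call looks roughly like this (a minimal sketch; broker,
topic, and offsets are placeholders):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Name the exact partitions you want and the offset to start from in each.
val fromOffsets = Map(
  TopicAndPartition("my-topic", 0) -> 0L,
  TopicAndPartition("my-topic", 1) -> 0L)

// The message handler decides the record type R; returning a Tuple2 gives you
// key/value pairs, which is what .mapToPair would then wrap on the Java side.
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))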

On Thu, Dec 3, 2015 at 5:21 PM, Alan Braithwaite 
wrote:

> One quick newbie question since I got another chance to look at this
> today.  We're using java for our spark applications.  The
> createDirectStream we were using previously [1] returns a
> JavaPairInputDStream, but the createDirectStream with fromOffsets expects
> an argument recordClass to pass into the generic constructor for
> createDirectStream.
>
> In the code for the first function signature (without fromOffsets) it's
> being constructed in Scala as just a tuple (K, V).   How do I pass this
> same class/type information from java as the record class to get a
> JavaPairInputDStream<K, V>?
>
> I understand this might be a question more fit for a scala mailing list
> but google is failing me at the moment for hints on the interoperability of
> scala and java generics.
>
> [1] The original createDirectStream:
> https://github.com/apache/spark/blob/branch-1.5/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala#L395-L423
>
> Thanks,
> - Alan
>
> On Tue, Dec 1, 2015 at 8:12 AM, Cody Koeninger  wrote:
>
>> I actually haven't tried that, since I tend to do the offset lookups if
>> necessary.
>>
>> It's possible that it will work, try it and let me know.
>>
>> Be aware that if you're doing a count() or take() operation directly on
>> the rdd it'll definitely give you the wrong result if you're using -1 for
>> one of the offsets.
>>
>>
>>
>> On Tue, Dec 1, 2015 at 9:58 AM, Alan Braithwaite 
>> wrote:
>>
>>> Neat, thanks.  If I specify something like -1 as the offset, will it
>>> consume from the latest offset or do I have to instrument that manually?
>>>
>>> - Alan
>>>
>>> On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger 
>>> wrote:
>>>
 Yes, there is a version of createDirectStream that lets you specify
 fromOffsets: Map[TopicAndPartition, Long]

 On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite 
 wrote:

> Is there any mechanism in the kafka streaming source to specify the
> exact partition id that we want a streaming job to consume from?
>
> If not, is there a workaround besides writing our a custom receiver?
>
> Thanks,
> - Alan
>


>>>
>>
>


RDD functions

2015-12-04 Thread Sateesh Karuturi
Hello Spark experts...
I am new to Apache Spark. Can anyone send me the proper documentation to
learn RDD functions?
Thanks in advance...


How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread Gokula Krishnan D
Hello All -

In spark-shell, when we press tab after the RDD name followed by a dot, we can
see the possible list of transformations and actions.

But I am unable to see the full list. Is there any other way to get the rest of
the list? I'm mainly looking for sortByKey().

val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")
val sales_map = sales_RDD.map(sales=>{val x=sales.split(","); (x(0),x(1))})

Layout of phone_sales.txt is (Brand, #.of Phones sold)

I am mainly looking for sortByKey(), but when I tab-complete on sales_map or sales_RDD, I
can see only sortBy() and not sortByKey().

By the way, I am using spark 1.3.0 with CDH 5.4




Thanks
Gokul


Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread ayan guha
sortByKey() is a property of pairRDD as it requires key value pairs to work.
I think in scala there are transformations such as .toPairRDD().

On Sat, Dec 5, 2015 at 12:01 AM, Gokula Krishnan D 
wrote:

> Hello All -
>
> In spark-shell when we press tab after . ; we could see the
> possible list of transformations and actions.
>
> But unable to see all the list. is there any other way to get the rest of
> the list. I'm mainly looking for sortByKey()
>
> val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")
> val sales_map = sales_RDD.map(sales=>{val x=sales.split(","); (x(0),x(1))})
>
> Layout of phone_sales.txt is (Brand, #.of Phones sold)
>
> I am mainly looking for SortByKey() but when I do Sales_map or sales_RDD,
> I could see only sortBy() but not SortByKey().
>
> By the way, I am using spark 1.3.0 with CDH 5.4
>
>
>
>
> Thanks
> Gokul
>
>


-- 
Best Regards,
Ayan Guha


Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread Gokula Krishnan D
Thanks Ayan for the updates.

But in my example, "sales_map" is a pair RDD, isn't it?

Thanks & Regards,
Gokula Krishnan* (Gokul)*


On Fri, Dec 4, 2015 at 8:16 AM, ayan guha  wrote:

> sortByKey() is a property of pairRDD as it requires key value pair to
> work. I think in scala their are transformation such as .toPairRDD().
>
> On Sat, Dec 5, 2015 at 12:01 AM, Gokula Krishnan D 
> wrote:
>
>> Hello All -
>>
>> In spark-shell when we press tab after . ; we could see the
>> possible list of transformations and actions.
>>
>> But unable to see all the list. is there any other way to get the rest of
>> the list. I'm mainly looking for sortByKey()
>>
>> val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")
>> val sales_map = sales_RDD.map(sales=>{val x=sales.split(",");
>> (x(0),x(1))})
>>
>> Layout of phone_sales.txt is (Brand, #.of Phones sold)
>>
>> I am mainly looking for SortByKey() but when I do Sales_map or sales_RDD,
>> I could see only sortBy() but not SortByKey().
>>
>> By the way, I am using spark 1.3.0 with CDH 5.4
>>
>>
>>
>>
>> Thanks
>> Gokul
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Is it possible to pass additional parameters to a python function when used inside RDD.filter method?

2015-12-04 Thread Praveen Chundi

Passing a lambda function should work.

my_rrd.filter(lambda x: myfunc(x,newparam))

Best regards,
Praveen Chundi

On 04.12.2015 13:19, Abhishek Shivkumar wrote:


Hi,

 I am using spark with python and I have a filter constraint as follows:

my_rdd.filter(my_func)

where my_func is a method I wrote to filter the rdd items based on my 
own logic. I have defined the my_func as follows:


def my_func(my_item):
    ...

Now, I want to pass another separate parameter to my_func, besides the 
item that goes into it. How can I do that? I know my_item will refer 
to one item that comes from my_rdd and how can I pass my own parameter 
(let's say my_param) as an additional parameter to my_func?


Thanks
Abhishek S


*NOTICE AND DISCLAIMER*

This email (including attachments) is confidential. If you are not the 
intended recipient, notify the sender immediately, delete this email 
from your system and do not disclose or use for any purpose.


Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United 
Kingdom
Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. 
United Kingdom
Big Data Partnership Limited is a company registered in England & 
Wales with Company No 7904824






Re: Python API Documentation Mismatch

2015-12-04 Thread Roberto Pagliari
Hi Yanbo,
You mean pyspark.mllib.recommendation right? That is the one used in the 
official tutorial.

Thank you,

From: Yanbo Liang >
Date: Friday, 4 December 2015 03:17
To: Felix Cheung >
Cc: Roberto Pagliari 
>, 
"user@spark.apache.org" 
>
Subject: Re: Python API Documentation Mismatch

Hi Roberto,

There are two ALS implementations available: ml.recommendation.ALS and
mllib.recommendation.ALS.
They have different usage and methods. I know it's confusing that Spark provides
two versions of the same algorithm. I strongly recommend using the ALS
implementation in the ML package.

Yanbo

2015-12-04 1:31 GMT+08:00 Felix Cheung 
>:
Please open an issue in JIRA, thanks!





On Thu, Dec 3, 2015 at 3:03 AM -0800, "Roberto Pagliari" 
> wrote:

Hello,
I believe there is a mismatch between the API documentation (1.5.2) and the 
software currently available.

Not all functions mentioned here
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation

are, in fact available. For example, the code below from the tutorial works

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

While the alternative shown in the API documentation will not (it will complain
that ALS takes no arguments; also, by inspecting the module with Python
utilities I could not find several methods mentioned in the API docs):

>>> df = sqlContext.createDataFrame(
...     [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
...     ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5)
>>> model = als.fit(df)


Thank you,




[no subject]

2015-12-04 Thread Sateesh Karuturi
user-sc.1449231970.fbaoamghkloiongfhbbg-sateesh.karuturi9=
gmail@spark.apache.org


Re: Is it possible to pass additional parameters to a python function when used inside RDD.filter method?

2015-12-04 Thread Abhishek Shivkumar
Excellent, that did work - thanks.

On 4 December 2015 at 12:35, Praveen Chundi  wrote:

> Passing a lambda function should work.
>
> my_rrd.filter(lambda x: myfunc(x,newparam))
>
> Best regards,
> Praveen Chundi
>
>
> On 04.12.2015 13:19, Abhishek Shivkumar wrote:
>
> Hi,
>
>  I am using spark with python and I have a filter constraint as follows:
>
> my_rdd.filter(my_func)
>
> where my_func is a method I wrote to filter the rdd items based on my own
> logic. I have defined the my_func as follows:
>
> def my_func(my_item):
> {...}
>
> Now, I want to pass another separate parameter to my_func, besides the
> item that goes into it. How can I do that? I know my_item will refer to one
> item that comes from my_rdd and how can I pass my own parameter (let's say
> my_param) as an additional parameter to my_func?
>
> Thanks
> Abhishek S
>
> *NOTICE AND DISCLAIMER*
>
> This email (including attachments) is confidential. If you are not the
> intended recipient, notify the sender immediately, delete this email from
> your system and do not disclose or use for any purpose.
>
> Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United
> Kingdom
> Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. United
> Kingdom
> Big Data Partnership Limited is a company registered in England & Wales
> with Company No 7904824
>
>
>

-- 
 

*NOTICE AND DISCLAIMER*

This email (including attachments) is confidential. If you are not the 
intended recipient, notify the sender immediately, delete this email from 
your system and do not disclose or use for any purpose.

Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United 
Kingdom
Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. United 
Kingdom
Big Data Partnership Limited is a company registered in England & Wales 
with Company No 7904824


Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Michal Klos
If you are running on AWS I would recommend using s3 instead of hdfs as a 
general practice if you are maintaining state or data there. This way you can 
treat your spark clusters as ephemeral compute resources that you can swap out 
easily -- eg if something breaks just spin up a fresh cluster and redirect your 
workload rather than fighting a fire and trying to debug and fix a broken 
cluster. It simplifies operations once you are in prod.
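As a rough illustration of that pattern (bucket and prefix are placeholders, and the 
socket source is just a stand-in for the real one), a streaming job can write each 
batch straight to S3 instead of the cluster's ephemeral HDFS:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("s3-sink"), Seconds(60))
val events = ssc.socketTextStream("localhost", 9999)

// Each batch lands under its own timestamped prefix, so the cluster itself holds
// no state and can be replaced without losing data.
events.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"s3a://my-bucket/streaming-output/batch-${time.milliseconds}")
  }
}

ssc.start()
ssc.awaitTermination()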

M





Sent from my iPhone
> On Dec 4, 2015, at 6:42 AM, Sean Owen  wrote:
> 
> There is no way to upgrade a running cluster here. You can stop a
> cluster, and simply start a new cluster in the same way you started
> the original cluster. That ought to be simple; the only issue I
> suppose is that you have down-time since you have to shut the whole
> thing down, but maybe that's acceptable.
> 
> If you have data, including HDFS, set up on ephemeral disks though
> then yes that is lost. Really that's an 'ephemeral' HDFS cluster. It
> has nothing to do with partitions.
> 
> You would want to get the data out to S3 first, and then copy it back
> in later. Yes it's manual, but works fine.
> 
> For more production use cases, on Amazon, you probably want to look
> into a distribution or product around Spark rather than manage it
> yourself. That could be AWS's own EMR, Databricks cloud, or even CDH
> running on AWS. Those would give you much more of a chance of
> automatically getting updates and so on, but they're fairly different
> products.
> 
>> On Fri, Dec 4, 2015 at 3:21 AM, Divya Gehlot  wrote:
>> Hello,
>> Even I have the same queries in mind .
>> What all the upgrades where we can use EC2 as compare to normal servers for
>> spark and other big data product development .
>> Hope to get inputs from the community .
>> 
>> Thanks,
>> Divya
>> 
>> On Dec 4, 2015 6:05 AM, "Andy Davidson" 
>> wrote:
>>> 
>>> About 2 months ago I used spark-ec2 to set up a small cluster. The cluster
>>> runs a spark streaming app 7x24 and stores the data to hdfs. I also need to
>>> run some batch analytics on the data.
>>> 
>>> Now that I have a little more experience I wonder if this was a good way
>>> to set up the cluster the following issues
>>> 
>>> I have not been able to find explicit directions for upgrading the spark
>>> version
>>> 
>>> 
>>> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2=Re+Upgrading+Spark+in+EC2+clusters
>>> 
>>> I am not sure where the data is physically stored. I think I may
>>> accidentally lose all my data
>>> spark-ec2 makes it easy to launch a cluster with as many machines as you
>>> like how ever Its not clear how I would add slaves to an existing
>>> installation
>>> 
>>> 
>>> Our Java streaming app we call rdd.saveAsTextFile(“hdfs://path”);
>>> 
>>> ephemeral-hdfs/conf/hdfs-site.xml:
>>> 
>>>  
>>> 
>>>dfs.data.dir
>>> 
>>>/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data
>>> 
>>>  
>>> 
>>> 
>>> persistent-hdfs/conf/hdfs-site.xml
>>> 
>>> 
>>> $ mount
>>> 
>>> /dev/xvdb on /mnt type ext3 (rw,nodiratime)
>>> 
>>> /dev/xvdf on /mnt2 type ext3 (rw,nodiratime)
>>> 
>>> 
>>> http://spark.apache.org/docs/latest/ec2-scripts.html
>>> 
>>> 
>>> "The spark-ec2 script also supports pausing a cluster. In this case, the
>>> VMs are stopped but not terminated, so they lose all data on ephemeral disks
>>> but keep the data in their root partitions and their persistent-hdfs.”
>>> 
>>> 
>>> Initially I though using HDFS was a good idea. spark-ec2 makes HDFS easy
>>> to use. I incorrectly thought spark some how knew how HDFS partitioned my
>>> data.
>>> 
>>> I think many people are using amazon s3. I do not have an direct
>>> experience with S3. My concern would be that the data is not physically
>>> stored closed to my slaves. I.e. High communication costs.
>>> 
>>> Any suggestions would be greatly appreciated
>>> 
>>> Andy
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD functions

2015-12-04 Thread Ndjido Ardo BAR
Hi Michal,

I think the following link could interest you. You gonna find there a lot
of examples!

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

cheers,
Ardo

On Fri, Dec 4, 2015 at 2:31 PM, Michal Klos  wrote:

> http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
>
> M
>
> On Dec 4, 2015, at 8:21 AM, Sateesh Karuturi 
> wrote:
>
> Hello Spark experts...
> Iam new to Apache Spark..Can anyone send me the proper Documentation to
> learn RDD functions.
> Thanks in advance...
>
>


Re: RDD functions

2015-12-04 Thread Michal Klos
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations

M

> On Dec 4, 2015, at 8:21 AM, Sateesh Karuturi  
> wrote:
> 
> Hello Spark experts...
> Iam new to Apache Spark..Can anyone send me the proper Documentation to learn 
> RDD functions.
> Thanks in advance...


Spark UI - Streaming Tab

2015-12-04 Thread patcharee

Hi,

We tried to get the streaming tab interface on Spark UI - 
https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html


Tested on versions 1.5.1 and 1.6.0-snapshot, but there is no such interface for 
streaming applications at all. Any suggestions? Do we need to configure 
the history UI somehow to get such an interface?


Thanks,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark applications metrics

2015-12-04 Thread patcharee

Hi

How can I see the summary of data read / write, shuffle read / write, 
etc. for an application as a whole, not per stage?


Thanks,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark UI - Streaming Tab

2015-12-04 Thread PhuDuc Nguyen
I believe the "Streaming" tab is dynamic - it appears once you have a
streaming job running, not when the cluster is simply up. It does not
depend on 1.6 and has been in there since at least 1.0.

HTH,
Duc

On Fri, Dec 4, 2015 at 7:28 AM, patcharee  wrote:

> Hi,
>
> We tried to get the streaming tab interface on Spark UI -
> https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html
>
> Tested on version 1.5.1, 1.6.0-snapshot, but no such interface for
> streaming applications at all. Any suggestions? Do we need to configure the
> history UI somehow to get such interface?
>
> Thanks,
> Patcharee
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark Streaming Shuffle to Disk

2015-12-04 Thread spearson23
I'm running a Spark Streaming job on 1.3.1 which contains an
updateStateByKey.  The job works perfectly fine, but at some point (after a
few runs), it starts shuffling to disk no matter how much memory I give the
executors.

I have tried changing --executor-memory on spark-submit,
spark.shuffle.memoryFraction, spark.storage.memoryFraction, and
spark.storage.unrollFraction.  But no matter how I configure these, it
always spills to disk around 2.5GB.  

What is the best way to avoid spilling shuffle to disk?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Shuffle-to-Disk-tp25567.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread Ted Yu
Did a quick test:

rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[1] at map at
<console>:29

I think sales_map is a MapPartitionsRDD

FYI
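One note: in Scala, sortByKey() lives on PairRDDFunctions and only becomes available
through an implicit conversion on RDDs of key/value pairs, which is why tab completion
on the plain RDD may not list it even though the call compiles. A minimal sketch against
the same file layout (assuming the second field parses as an Int):

// No extra import is needed on Spark 1.3+; earlier versions required
// import org.apache.spark.SparkContext._ for the pair-RDD implicits.
val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")
val sales_map = sales_RDD.map { line =>
  val x = line.split(",")
  (x(0), x(1).toInt)        // (Brand, number of phones sold)
}

val sortedByBrand = sales_map.sortByKey()                       // ascending by brand
val sortedBySales = sales_map.map(_.swap).sortByKey(ascending = false)

sortedByBrand.collect().foreach(println)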

On Fri, Dec 4, 2015 at 6:18 AM, Gokula Krishnan D 
wrote:

> Thanks Ayan for the updates.
>
> But in my example, I hope "sales_map" is a pair_RDD , isn't it?.
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
>
> On Fri, Dec 4, 2015 at 8:16 AM, ayan guha  wrote:
>
>> sortByKey() is a property of pairRDD as it requires key value pair to
>> work. I think in scala their are transformation such as .toPairRDD().
>>
>> On Sat, Dec 5, 2015 at 12:01 AM, Gokula Krishnan D 
>> wrote:
>>
>>> Hello All -
>>>
>>> In spark-shell when we press tab after . ; we could see the
>>> possible list of transformations and actions.
>>>
>>> But unable to see all the list. is there any other way to get the rest
>>> of the list. I'm mainly looking for sortByKey()
>>>
>>> val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")
>>> val sales_map = sales_RDD.map(sales=>{val x=sales.split(",");
>>> (x(0),x(1))})
>>>
>>> Layout of phone_sales.txt is (Brand, #.of Phones sold)
>>>
>>> I am mainly looking for SortByKey() but when I do Sales_map or
>>> sales_RDD, I could see only sortBy() but not SortByKey().
>>>
>>> By the way, I am using spark 1.3.0 with CDH 5.4
>>>
>>>
>>>
>>>
>>> Thanks
>>> Gokul
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


ROW_TIMESTAMP support with UNSIGNED_LONG

2015-12-04 Thread pierre lacave
Hi, I am trying to use the ROW_TIMESTAMP mapping featured in 4.6 as
described in https://phoenix.apache.org/rowtimestamp.html

However, when inserting a timestamp in nanoseconds I get the following
exception saying the value cannot be less than zero.

Inserting micros or seconds gives the same result.

Any idea what's happening?

Thanks


0: jdbc:phoenix:hadoop1-dc:2181:/hbase> CREATE TABLE TEST (t UNSIGNED_LONG
NOT NULL CONSTRAINT pk PRIMARY KEY (t ROW_TIMESTAMP) );
No rows affected (1.654 seconds)
0: jdbc:phoenix:hadoop1-dc:2181:/hbase> UPSERT INTO TEST (t) VALUES
(14491610811);
Error: ERROR 201 (22000): Illegal data. Value of a column designated as
ROW_TIMESTAMP cannot be less than zero (state=22000,code=201)
java.sql.SQLException: ERROR 201 (22000): Illegal data. Value of a column
designated as ROW_TIMESTAMP cannot be less than zero
at
org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:396)
at
org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
at
org.apache.phoenix.schema.IllegalDataException.<init>(IllegalDataException.java:38)
at
org.apache.phoenix.compile.UpsertCompiler.setValues(UpsertCompiler.java:135)
at
org.apache.phoenix.compile.UpsertCompiler.access$400(UpsertCompiler.java:114)
at
org.apache.phoenix.compile.UpsertCompiler$3.execute(UpsertCompiler.java:882)
at
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:322)
at
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:314)
at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
at
org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:312)
at
org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:1435)
at sqlline.Commands.execute(Commands.java:822)
at sqlline.Commands.sql(Commands.java:732)
at sqlline.SqlLine.dispatch(SqlLine.java:808)
at sqlline.SqlLine.begin(SqlLine.java:681)
at sqlline.SqlLine.start(SqlLine.java:398)
at sqlline.SqlLine.main(SqlLine.java:292)

*Pierre Lacave*
171 Skellig House, Custom House, Lower Mayor street, Dublin 1, Ireland
Phone :   +353879128708


Fwd: Can't run Spark Streaming Kinesis example

2015-12-04 Thread Brian London
On my local system (8 core MBP) the Kinesis ASL example isn't working out
of the box on a fresh build (Spark 1.5.2).  I can see records going into
the kinesis stream but the receiver is returning empty DStreams.  The
behavior is similar to an issue that's been discussed previously:

http://stackoverflow.com/questions/26941844/apache-spark-kinesis-sample-not-working

http://apache-spark-user-list.1001560.n3.nabble.com/Having-problem-with-Spark-streaming-with-Kinesis-td19863.html#a19929

In those the feedback was that it is an issue related to there being
sufficient cores allocated to both receive and process the incoming data.
However, in my case I have attempted running with the default (local[*]) as
well as local[2], local[4], and local[8] all with the same results.  Is it
possible that the actual number of worker threads is different from what's
requested?  Is there a way to check how many threads were actually
allocated?
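
One quick way to sanity-check the parallelism actually in effect (a sketch assuming an existing SparkContext sc) is to look at the master string and the default parallelism, which in local[N] mode reflects the N worker threads:

    println(sc.master)              // e.g. "local[*]" or "local[4]"
    println(sc.defaultParallelism)  // in local[N] mode this equals the number of threads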


How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-04 Thread Abhishek Shivkumar
Hi,

 I have RDD1 that is broadcasted.

I have a user defined method for the filter functionality of RDD2, written
as follows:

RDD2.filter(my_func)


I want to access the values of RDD1 inside my_func. Is that possible?
Should I pass RDD1 as a parameter into my_func?

Thanks
Abhishek S
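
An RDD cannot be referenced from inside a transformation of another RDD, so passing RDD1 into my_func directly will not work. A minimal sketch (element types and names assumed) of the usual pattern: collect the small RDD on the driver, broadcast the collected value, and reference the broadcast variable inside the filter:

    // assumes rdd1: RDD[String] is small enough to collect, and rdd2: RDD[String]
    val lookup = sc.broadcast(rdd1.collect().toSet)

    def myFunc(x: String): Boolean = lookup.value.contains(x)

    val filtered = rdd2.filter(myFunc)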

-- 
 

*NOTICE AND DISCLAIMER*

This email (including attachments) is confidential. If you are not the 
intended recipient, notify the sender immediately, delete this email from 
your system and do not disclose or use for any purpose.

Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United 
Kingdom
Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. United 
Kingdom
Big Data Partnership Limited is a company registered in England & Wales 
with Company No 7904824


Regarding Join between two graphs

2015-12-04 Thread hastimal
Hello, I have two graph RDDs: one is a property graph and the other is a
connected-components graph, like:
* /var propGraph = Graph(vertexArray,edgeArray).cache()/*

with triplets:
/((0,),(14,null),)
((1,null),(11,“Kviswanath”.),)
((13,“ManiRatnam”.),(12,null),)/

 and another one is connected components with graphRDD:

*/var cc = propGraph.connectedComponents().cache()/*

with triplets:
/((0,0),(14,0),)
((1,1),(11,1),)
((13,12),(12,12),)/

Now my question: after a join I need triplets that combine the propGraph
triplets with the cc graph ids, in the following way:
/((0,),(14,null),),
*ccID1*
((1,null),(11,“Kviswanath”.),)
*ccID2*
((13,“ManiRatnam”.),(12,null),)
*ccID2*/

In this graph ccIDs are 0,1,12 which I got from connectedComponents().


So in this case I am doing the following, but it is not working:
/val triplets = propGraph.joinVertices(cc.vertices)/

How do I do this kind of join, or use joinVertices()?


FYI: My algorithm says:
var cc = propGraph.connectedComponents()
var triplets = propGraph.join(cc).triplets()   // join the original graph so
that each vertex knows its connected-component ID, and then extract the
triplets
var rdfGraphs=
triplets.mapPartition(func:genRDFTriplets).reduceByKey(func:concat)
//Store the propery graph in RDD where each row has an ID and RDF graph (in
the N-triplets format)


FYI: genRDFTriplets function is
/def
genRDFTriplets(iter:Iterator[((Int,String),(Int,String),String,Int)]):Iterator[(Int,String)]={
  var result = List[(Int,String)]()
 while (iter.hasNext) {
val temp = iter.next()
//println(s"tempRDF is ${temp}")
result = result .:: (temp._4,(temp._1._2+" "+temp._3+" "+temp._2._2
+".\n").toString)
  }
   // println(s"resultRDF is ${result}")
  result.iterator
}/

Please help me, I am a newbie in Spark. I love big data technologies.
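
For what it is worth, a minimal sketch (vertex attribute types assumed) of tagging each vertex of propGraph with its connected-component id. outerJoinVertices is used rather than joinVertices because it allows the vertex attribute type to change:

    // assumes propGraph: Graph[String, String] and cc = propGraph.connectedComponents()
    val tagged = propGraph.outerJoinVertices(cc.vertices) {
      (vid, attr, ccIdOpt) => (attr, ccIdOpt.getOrElse(vid))
    }
    // each triplet now carries (originalAttr, ccId) on both source and destination
    tagged.triplets.collect().foreach(println)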





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Regarding-Join-between-two-graphs-tp25566.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark UI - Streaming Tab

2015-12-04 Thread patcharee

I ran streaming jobs, but no streaming tab appeared for those jobs.

Patcharee


On 04. des. 2015 18:12, PhuDuc Nguyen wrote:
I believe the "Streaming" tab is dynamic - it appears once you have a 
streaming job running, not when the cluster is simply up. It does not 
depend on 1.6 and has been in there since at least 1.0.


HTH,
Duc

On Fri, Dec 4, 2015 at 7:28 AM, patcharee > wrote:


Hi,

We tried to get the streaming tab interface on Spark UI -

https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html

Tested on version 1.5.1, 1.6.0-snapshot, but no such interface for
streaming applications at all. Any suggestions? Do we need to
configure the history UI somehow to get such interface?

Thanks,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org







is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread prateek arora
Hi

I want to create multiple sparkContext in my application.
i read so many articles they suggest " usage of multiple contexts is
discouraged, since SPARK-2243 is still not resolved."
i want to know that Is spark 1.5.0 supported to create multiple contexts
without error ?
and if supported then are we need to set
"spark.driver.allowMultipleContexts" configuration parameter ?

Regards
Prateek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/is-Multiple-Spark-Contexts-is-supported-in-spark-1-5-0-tp25568.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Ted Yu
See Josh's response in this thread:

http://search-hadoop.com/m/q3RTt1z1hUw4TiG1=Re+Question+about+yarn+cluster+mode+and+spark+driver+allowMultipleContexts

Cheers

On Fri, Dec 4, 2015 at 9:46 AM, prateek arora 
wrote:

> Hi
>
> I want to create multiple sparkContext in my application.
> i read so many articles they suggest " usage of multiple contexts is
> discouraged, since SPARK-2243 is still not resolved."
> i want to know that Is spark 1.5.0 supported to create multiple contexts
> without error ?
> and if supported then are we need to set
> "spark.driver.allowMultipleContexts" configuration parameter ?
>
> Regards
> Prateek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/is-Multiple-Spark-Contexts-is-supported-in-spark-1-5-0-tp25568.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Sabarish Sasidharan
#2: if using hdfs it's on the disks. You can use the HDFS command line to
browse your data. And then use s3distcp or simply distcp to copy data from
hdfs to S3. Or even use hdfs get commands to copy to local disk and then
use S3 cli to copy to s3

#3. Cost of accessing data in S3 from  Ec2 nodes, though not as fast as
local disks, is still fast enough. You can use hdfs for intermediate steps
and use S3 for final storage. Make sure your s3 bucket is in the same
region as your Ec2 cluster.
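
As an illustration of the last point, a hedged sketch (bucket name, paths, and keys are placeholders) of writing final results from Spark directly to S3 with the s3n filesystem used in this Spark 1.x setup:

    // placeholder credentials and bucket; prefer IAM roles where possible
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    rdd.saveAsTextFile("s3n://my-bucket/output/run-2015-12-04")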

Regards
Sab
On 04-Dec-2015 3:35 am, "Andy Davidson" 
wrote:

> About 2 months ago I used spark-ec2 to set up a small cluster. The cluster
> runs a spark streaming app 7x24 and stores the data to hdfs. I also need to
> run some batch analytics on the data.
>
> Now that I have a little more experience I wonder if this was a good way
> to set up the cluster the following issues
>
>1. I have not been able to find explicit directions for upgrading the
>spark version
>   1.
>   
> http://search-hadoop.com/m/q3RTt7E0f92v0tKh2=Re+Upgrading+Spark+in+EC2+clusters
>2. I am not sure where the data is physically be stored. I think I may
>accidentally loose all my data
>3. spark-ec2 makes it easy to launch a cluster with as many machines
>as you like how ever Its not clear how I would add slaves to an existing
>installation
>
>
> Our Java streaming app we call rdd.saveAsTextFile(“hdfs://path”);
>
> ephemeral-hdfs/conf/hdfs-site.xml:
>
>   
>
> dfs.data.dir
>
> /mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data
>
>   
>
>
> persistent-hdfs/conf/hdfs-site.xml
>
>
> $ mount
>
> /dev/xvdb on /mnt type ext3 (rw,nodiratime)
>
> /dev/xvdf on /mnt2 type ext3 (rw,nodiratime)
>
>
> http://spark.apache.org/docs/latest/ec2-scripts.html
>
> *"*The spark-ec2 script also supports pausing a cluster. In this case,
> the VMs are stopped but not terminated, so they *lose all data on
> ephemeral disks* but keep the data in their root partitions and their
> persistent-pdfs.”
>
>
> Initially I though using HDFS was a good idea. spark-ec2 makes HDFS easy
> to use. I incorrectly thought spark some how knew how HDFS partitioned my
> data.
>
> I think many people are using amazon s3. I do not have an direct
> experience with S3. My concern would be that the data is not physically
> stored closed to my slaves. I.e. High communication costs.
>
> Any suggestions would be greatly appreciated
>
> Andy
>


Re: understanding and disambiguating CPU-core related properties

2015-12-04 Thread Leonidas Patouchas
Regarding your 2nd question, there is a great article from Cloudera regarding
this:

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2.
They focus on a YARN setup, but the big picture applies everywhere.

In general, I believe you have to know your data in order to configure those
parameters a priori. From my experience, the more CPUs the merrier. I have
noticed that, for example, if I double the CPUs the job finishes in half the
time. This scaling (double the CPUs, half the time) stops holding once I reach
a certain number of CPUs, always depending on the data and the job's actions.
So it takes a lot of trial and observation.

In addition, there is a tight connection between CPUs and partitions.
Cloudera's article covers this.
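
As a concrete illustration (the values are arbitrary), the knobs discussed in this thread can be set together on a SparkConf: spark.cores.max caps the total cores the application claims, spark.executor.cores sets cores per executor JVM, and spark.task.cpus sets cores reserved per task:

    import org.apache.spark.{SparkConf, SparkContext}

    // arbitrary example values; tune against your data as discussed above
    val conf = new SparkConf()
      .setAppName("core-tuning-sketch")
      .set("spark.cores.max", "16")      // total cores the application claims
      .set("spark.executor.cores", "4")  // cores per executor JVM
      .set("spark.task.cpus", "1")       // cores reserved per task
    val sc = new SparkContext(conf)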

Regards,
Leonidas

On Thu, Dec 3, 2015 at 5:44 PM, Manolis Sifalakis1 
wrote:

> I have found the documentation rather poor in helping me understand the
> interplay among the following properties in spark, even more how to set
> them. So this post is sent in hope for some discussion and "enlightenment"
> on the topic
>
> Let me start by asking if I have understood well the following:
>
> - spark.driver.cores:   how many cores the driver program should occupy
> - spark.cores.max:   how many cores my app will claim for computations
> - spark.executor.cores and spark.task.cpus:   how spark.cores.max are
> allocated per JVM (executor) and per task (java thread?)
>   I.e. + spark.executor.cores:   each JVM instance (executor) should use
> that many cores
> + spark.task.cpus: each task shoudl occupy max this # or cores
>
> If so far good, then...
>
> q1: Is spark.cores.max inclusive or not of spark.driver.cores ?
>
> q1: How should one decide statically a-priori how to distribute the
> spark.cores.max to JVMs and task ?
>
> q3: Since the set-up of cores-per-worker restricts how many cores can be
> max avail per executor and since an executor cannot spawn across workers,
> what is the rationale behind an application claiming cores
> (spark.cores.max) as opposed to merely executors ? (This will make an app
> never fail to be admitted)
>
> TIA for any clarifications/intuitions/experiences on the topic
>
> best
>
> Manolis.
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Michael Armbrust
On Fri, Dec 4, 2015 at 11:24 AM, Anfernee Xu  wrote:

> If multiple users are looking at the same data set, then it's good choice
> to share the SparkContext.
>
> But my usercases are different, users are looking at different data(I use
> custom Hadoop InputFormat to load data from my data source based on the
> user input), the data might not have any overlap. For now I'm taking below
> approach
>

Still, if you want fine-grained sharing of compute resources as well, you
want to use a single SparkContext.


Higher Processing times in Spark Streaming with kafka Direct

2015-12-04 Thread SRK
Hi,

Our processing times in  Spark Streaming with kafka Direct approach seems to
have increased considerably with increase in the Site traffic. Would
increasing the number of kafka partitions decrease  the processing times?
Any suggestions on tuning to reduce the processing times would be of great
help.

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Higher-Processing-times-in-Spark-Streaming-with-kafka-Direct-tp25571.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JMXSink for YARN deployment

2015-12-04 Thread spearson23
We use a metrics.property file on YARN by submitting applications like this:

spark-submit --conf spark.metrics.conf=metrics.properties --class CLASS_NAME
--master yarn-cluster --files /PATH/TO/metrics.properties /PATH/TO/CODE.JAR
/PATH/TO/CONFIG.FILE APP_NAME




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/JMXSink-for-YARN-deployment-tp13958p25570.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL IN Clause

2015-12-04 Thread Xiao Li
https://github.com/apache/spark/pull/9055

This JIRA explains how to convert IN to Joins.
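
A minimal sketch (table and column names assumed) of the manual conversion Michael describes below: build a DataFrame from the list of numbers and join it with the other table instead of generating a huge IN string:

    import sqlContext.implicits._   // assumes an existing SQLContext and a DataFrame bigTable with an "id" column

    val ids: Seq[Long] = 1L to 1000000L                          // the list of numbers
    val idsDF = sc.parallelize(ids.map(Tuple1.apply)).toDF("id")
    val filtered = bigTable.join(idsDF, bigTable("id") === idsDF("id"))   // same semantics as WHERE id IN (...)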

Thanks,

Xiao Li



2015-12-04 11:27 GMT-08:00 Michael Armbrust :

> The best way to run this today is probably to manually convert the query
> into a join.  I.e. create a dataframe that has all the numbers in it, and
> join/outer join it with the other table.  This way you avoid parsing a
> gigantic string.
>
> On Fri, Dec 4, 2015 at 10:36 AM, Ted Yu  wrote:
>
>> Have you seen this JIRA ?
>>
>> [SPARK-8077] [SQL] Optimization for TreeNodes with large numbers of
>> children
>>
>> From the numbers Michael published, 1 million numbers would still need
>> 250 seconds to parse.
>>
>> On Fri, Dec 4, 2015 at 10:14 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How to use/best practices "IN" clause in Spark SQL.
>>>
>>> Use Case :-  Read the table based on number. I have a List of numbers.
>>> For example, 1million.
>>>
>>> Regards,
>>> Rajesh
>>>
>>
>>
>


Re: JMXSink for YARN deployment

2015-12-04 Thread spearson23
Run "spark-submit --help" to see all available options.

To get JMX to work you need to:

spark-submit --driver-java-options "-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=JMX_PORT" --conf
spark.metrics.conf=metrics.properties --class 'CLASS_NAME' --master
yarn-cluster --files /PATH/TO/metrics.properties /PATH/TO/JAR.FILE


This will run JMX on the driver node on "JMX_PORT". Note that the driver
node and the YARN master node are not the same; you'll have to look up where
Spark placed the driver and then connect there.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/JMXSink-for-YARN-deployment-tp13958p25572.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Exception in thread "main" java.lang.IncompatibleClassChangeError:

2015-12-04 Thread Prem Sure
Getting the below exception while executing the program below in Eclipse.
Any clue on what's wrong here would be helpful.

public class WordCount {

  private static final FlatMapFunction<String, String> WORDS_EXTRACTOR =
      new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String s) throws Exception {
          return Arrays.asList(s.split(" "));
        }
      };

  private static final PairFunction<String, String, Integer> WORDS_MAPPER =
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) throws Exception {
          return new Tuple2<String, Integer>(s, 1);
        }
      };

  private static final Function2<Integer, Integer, Integer> WORDS_REDUCER =
      new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer a, Integer b) throws Exception {
          return a + b;
        }
      };

  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark.WordCount").setMaster("local");
    JavaSparkContext context = new JavaSparkContext(conf);

    JavaRDD<String> file = context.textFile("Input/SampleTextFile.txt");
    file.saveAsTextFile("file:///Output/WordCount.txt");

    JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);
    JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
    JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);

    counter.foreach(System.out::println);
    counter.saveAsTextFile("file:///Output/WordCount.txt");
  }
}

Exception in thread "main" java.lang.IncompatibleClassChangeError:
Implementing class

at java.lang.ClassLoader.defineClass1(*Native Method*)

at java.lang.ClassLoader.defineClass(Unknown Source)

at java.security.SecureClassLoader.defineClass(Unknown Source)

at java.net.URLClassLoader.defineClass(Unknown Source)

at java.net.URLClassLoader.access$100(Unknown Source)

at java.net.URLClassLoader$1.run(Unknown Source)

at java.net.URLClassLoader$1.run(Unknown Source)

at java.security.AccessController.doPrivileged(*Native Method*)

at java.net.URLClassLoader.findClass(Unknown Source)

at java.lang.ClassLoader.loadClass(Unknown Source)

at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)

at java.lang.ClassLoader.loadClass(Unknown Source)

at java.lang.Class.forName0(*Native Method*)

at java.lang.Class.forName(Unknown Source)

at org.apache.spark.mapred.SparkHadoopMapRedUtil$class.firstAvailableClass(
*SparkHadoopMapRedUtil.scala:61*)

at org.apache.spark.mapred.SparkHadoopMapRedUtil$class.newJobContext(
*SparkHadoopMapRedUtil.scala:27*)

at org.apache.spark.SparkHadoopWriter.newJobContext(
*SparkHadoopWriter.scala:39*)

at org.apache.spark.SparkHadoopWriter.getJobContext(
*SparkHadoopWriter.scala:149*)

at org.apache.spark.SparkHadoopWriter.preSetup(*SparkHadoopWriter.scala:63*)

at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(
*PairRDDFunctions.scala:1045*)

at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(
*PairRDDFunctions.scala:940*)

at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(
*PairRDDFunctions.scala:849*)

at org.apache.spark.rdd.RDD.saveAsTextFile(*RDD.scala:1164*)

at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(
*JavaRDDLike.scala:443*)

at org.apache.spark.api.java.JavaRDD.saveAsTextFile(*JavaRDD.scala:32*)

at spark.WordCount.main(*WordCount.java:44*)


Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
Thanks all for your reply!

I tested both approaches: registering the temp table then executing SQL vs.
saving to HDFS filepath directly. The problem with the second approach is
that I am inserting data into a Hive table, so if I create a new partition
with this method, Hive metadata is not updated.

So I will be going with first approach.
Follow up question in this case: what is the cost of registering a temp
table? Is there a limit to the number of temp tables that can be registered
by Spark context?


Thanks again for your input.

Isabelle
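
For reference, a minimal sketch (table, partition, and column names assumed) of the first approach: register the DataFrame as a temp table and insert into a static Hive partition through HiveContext SQL, so the Hive metadata stays up to date:

    // assumes an existing HiveContext hiveContext and a DataFrame df whose columns
    // match the non-partition columns of my_table
    df.registerTempTable("df_tmp")
    hiveContext.sql(
      "INSERT OVERWRITE TABLE my_table PARTITION (date='2015-12-04') SELECT * FROM df_tmp")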



On Wed, Dec 2, 2015 at 10:30 AM, Michael Armbrust 
wrote:

> you might also coalesce to 1 (or some small number) before writing to
> avoid creating a lot of files in that partition if you know that there is
> not a ton of data.
>
> On Wed, Dec 2, 2015 at 12:59 AM, Rishi Mishra 
> wrote:
>
>> As long as all your data is being inserted by Spark , hence using the
>> same hash partitioner,  what Fengdong mentioned should work.
>>
>> On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu 
>> wrote:
>>
>>> Hi
>>> you can try:
>>>
>>> if your table under location “/test/table/“ on HDFS
>>> and has partitions:
>>>
>>>  “/test/table/dt=2012”
>>>  “/test/table/dt=2013”
>>>
>>> df.write.mode(SaveMode.Append).partitionBy("date”).save(“/test/table")
>>>
>>>
>>>
>>> On Dec 2, 2015, at 10:50 AM, Isabelle Phan  wrote:
>>>
>>> df.write.partitionBy("date").insertInto("my_table")
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Rishitesh Mishra,
>> SnappyData . (http://www.snappydata.io/)
>>
>> https://in.linkedin.com/in/rishiteshmishra
>>
>
>


RE: Broadcasting a parquet file using spark and python

2015-12-04 Thread Shuai Zheng
Hi all,

 

Sorry to re-open this thread.

 

I have a similar issue: one big parquet file left-outer-joined with quite a few
smaller parquet files. But the run is extremely slow and sometimes even OOMs
(with 300M). I have two questions here:

 

1. If I use an outer join, will Spark SQL automatically use a broadcast hash join?

2. If not, per the latest documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html

 


Property: spark.sql.autoBroadcastJoinThreshold
Default: 10485760 (10 MB)
Meaning: Configures the maximum size in bytes for a table that will be broadcast
to all worker nodes when performing a join. By setting this value to -1
broadcasting can be disabled. Note that currently statistics are only supported
for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE
STATISTICS noscan has been run.

 

How can I do this (run the ANALYZE TABLE command) in Java? I know I can code it
myself (create a broadcast val and implement the lookup by hand), but it would
make the code very ugly.

I hope we can have either an API or a hint to enforce the broadcast hash join
(instead of relying on the autoBroadcastJoinThreshold parameter). Do we have any
ticket or roadmap for this feature?
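
A minimal sketch (table name assumed) of issuing the statistics command programmatically; the same HiveContext.sql call is available from Java as well:

    // assumes the smaller table is registered in the Hive metastore
    hiveContext.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")
    // afterwards Spark SQL can see the table size and pick a broadcast join
    // when it falls below spark.sql.autoBroadcastJoinThreshold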

 

Regards,

 

Shuai

 

From: Michael Armbrust [mailto:mich...@databricks.com] 
Sent: Wednesday, April 01, 2015 2:01 PM
To: Jitesh chandra Mishra
Cc: user
Subject: Re: Broadcasting a parquet file using spark and python

 

You will need to create a hive parquet table that points to the data and run 
"ANALYZE TABLE tableName noscan" so that we have statistics on the size.

 

On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra  
wrote:

Hi Michael,

 

Thanks for your response. I am running 1.2.1. 

 

Is there any workaround to achieve the same with 1.2.1?

 

Thanks,

Jitesh

 

On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust  
wrote:

In Spark 1.3 I would expect this to happen automatically when the parquet table 
is small (< 10mb, configurable with spark.sql.autoBroadcastJoinThreshold).  If 
you are running 1.3 and not seeing this, can you show the code you are using to 
create the table?

 

On Tue, Mar 31, 2015 at 3:25 AM, jitesh129  wrote:

How can we implement a BroadcastHashJoin for spark with python?

My SparkSQL inner joins are taking a lot of time since it is performing
ShuffledHashJoin.

Tables on which join is performed are stored as parquet files.

Please help.

Thanks and regards,
Jitesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-a-parquet-file-using-spark-and-python-tp22315.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

 

 

 



Re: Spark ML Random Forest output.

2015-12-04 Thread Vishnu Viswanath
Hi,

As per my understanding, the probability vector gives the probability that the
item belongs to each class, so the one with the highest probability is your
predicted class.

Since you have converted your label to an indexed label, according to the model
the classes are 0.0 to 9.0, and I see you are getting a prediction that is one
of [0.0, 1.0, ..., 9.0] - which is correct.

So what you want is a reverse map that converts your predicted class back to
the original String. I don't know if StringIndexer has such an option; maybe
you can create your own map and reverse map of (label to index) and (index to
label) and use that to get back your original label.

Maybe there is a better way to do this.
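
For illustration, a minimal sketch of such a reverse map, using the label order from the indexedLabel metadata quoted below (the variable names are assumed):

    // index -> original label, in the order reported by the StringIndexer metadata
    val labels = Array("1.0", "7.0", "3.0", "9.0", "2.0", "6.0", "0.0", "4.0", "8.0", "5.0")
    val predictedLabel = labels(prediction.toInt)   // prediction is the index emitted by the model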

Regards,
Vishnu

On Fri, Dec 4, 2015 at 4:56 PM, Eugene Morozov 
wrote:

> Hello,
>
> I've got an input dataset of handwritten digits and working java code that
> uses random forest classification algorithm to determine the numbers. My
> test set is just some lines from the same input dataset - just to be sure
> I'm doing the right thing. My understanding is that having correct
> classifier in this case would give me the correct prediction.
> At the moment overfitting is not an issue.
>
> After applying StringIndexer to my input DataFrame I've applied an ugly
> trick and got "indexedLabel" metadata:
>
> {"ml_attr":{"vals":["1.0","7.0","3.0","9.0","2.0","6.0","0.0","4.0","8.0","5.0"],"type":"nominal","name":"indexedLabel"}}
>
>
> So, my algorithm gives me the following result. The question is I'm not
> sure I understand the meaning of the "prediction" here in the output. It
> looks like it's just an index of the highest probability value in the
> "prob" array. Shouldn't "prediction" be the actual class, i.e. one of the
> "0.0", "1.0", ..., "9.0"? If the prediction is just an ordinal number, then
> I have to manually correspond it to my classes, but for that I have to
> either specify classes manually to know their order or somehow be able to
> get them out of the classifier. Both of these way seem to be are not
> accessible.
>
> (4.0 -> prediction=7.0,
> prob=[0.004708283878223195,0.08478236104777455,0.03594642191080532,0.19286506771018885,0.038304389235523435,0.02841307797386,0.003334431932056404,0.5685242322346109,0.018564705500837587,0.024557028569980155]
> (9.0 -> prediction=3.0,
> prob=[0.018432404716456248,0.16837195846781422,0.05995559403934031,0.32282148259583565,0.018374168600855455,0.04792285114398864,0.018226352623526704,0.1611650363085499,0.11703073969440755,0.06769941180922535]
> (2.0 -> prediction=4.0,
> prob=[0.017918245251872154,0.029243677407669404,0.06228050320552064,0.03633295481094746,0.45707974962418885,0.09675606366289394,0.03921437851648226,0.043917057390743426,0.14132883075087405,0.0759285393788078]
>
> So, what is the prediction here? How can I specify classes manually or get
> the valid access to them?
> --
> Be well!
> Jean Morozov
>


Spark ML Random Forest output.

2015-12-04 Thread Eugene Morozov
Hello,

I've got an input dataset of handwritten digits and working java code that
uses random forest classification algorithm to determine the numbers. My
test set is just some lines from the same input dataset - just to be sure
I'm doing the right thing. My understanding is that having correct
classifier in this case would give me the correct prediction.
At the moment overfitting is not an issue.

After applying StringIndexer to my input DataFrame I've applied an ugly
trick and got "indexedLabel" metadata:
{"ml_attr":{"vals":["1.0","7.0","3.0","9.0","2.0","6.0","0.0","4.0","8.0","5.0"],"type":"nominal","name":"indexedLabel"}}


So, my algorithm gives me the following result. The question is I'm not
sure I understand the meaning of the "prediction" here in the output. It
looks like it's just an index of the highest probability value in the
"prob" array. Shouldn't "prediction" be the actual class, i.e. one of the
"0.0", "1.0", ..., "9.0"? If the prediction is just an ordinal number, then
I have to manually correspond it to my classes, but for that I have to
either specify classes manually to know their order or somehow be able to
get them out of the classifier. Neither of these seems to be accessible.

(4.0 -> prediction=7.0,
prob=[0.004708283878223195,0.08478236104777455,0.03594642191080532,0.19286506771018885,0.038304389235523435,0.02841307797386,0.003334431932056404,0.5685242322346109,0.018564705500837587,0.024557028569980155]
(9.0 -> prediction=3.0,
prob=[0.018432404716456248,0.16837195846781422,0.05995559403934031,0.32282148259583565,0.018374168600855455,0.04792285114398864,0.018226352623526704,0.1611650363085499,0.11703073969440755,0.06769941180922535]
(2.0 -> prediction=4.0,
prob=[0.017918245251872154,0.029243677407669404,0.06228050320552064,0.03633295481094746,0.45707974962418885,0.09675606366289394,0.03921437851648226,0.043917057390743426,0.14132883075087405,0.0759285393788078]

So, what is the prediction here? How can I specify classes manually or get
the valid access to them?
--
Be well!
Jean Morozov


Re: Higher Processing times in Spark Streaming with kafka Direct

2015-12-04 Thread u...@moosheimer.com
Hi,

processing time depends on what you are doing with the events.
Increasing the number of partitions could be an idea if you write more messages 
to the topic than you read currently via Spark.

Can you write more details?
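
If the bottleneck is parallelism rather than Kafka itself, one thing to try (a sketch with broker and topic names assumed) is repartitioning the direct stream, since the direct approach creates exactly one Spark partition per Kafka partition:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // assumed parameters; ssc is an existing StreamingContext
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("events")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // spread the per-batch work over more cores than there are Kafka partitions
    val repartitioned = stream.repartition(32)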

Kind regards / best regards
Kay-Uwe Moosheimer

> Am 04.12.2015 um 22:21 schrieb SRK :
> 
> Hi,
> 
> Our processing times in  Spark Streaming with kafka Direct approach seems to
> have increased considerably with increase in the Site traffic. Would
> increasing the number of kafka partitions decrease  the processing times?
> Any suggestions on tuning to reduce the processing times would be of great
> help.
> 
> Thanks,
> Swetha
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Higher-Processing-times-in-Spark-Streaming-with-kafka-Direct-tp25571.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL IN Clause

2015-12-04 Thread Ted Yu
Thanks for the pointer, Xiao.

I found that leftanti join type is no longer
in 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala

FYI

On Fri, Dec 4, 2015 at 12:04 PM, Xiao Li  wrote:

> https://github.com/apache/spark/pull/9055
>
> This JIRA explains how to convert IN to Joins.
>
> Thanks,
>
> Xiao Li
>
>
>
> 2015-12-04 11:27 GMT-08:00 Michael Armbrust :
>
>> The best way to run this today is probably to manually convert the query
>> into a join.  I.e. create a dataframe that has all the numbers in it, and
>> join/outer join it with the other table.  This way you avoid parsing a
>> gigantic string.
>>
>> On Fri, Dec 4, 2015 at 10:36 AM, Ted Yu  wrote:
>>
>>> Have you seen this JIRA ?
>>>
>>> [SPARK-8077] [SQL] Optimization for TreeNodes with large numbers of
>>> children
>>>
>>> From the numbers Michael published, 1 million numbers would still need
>>> 250 seconds to parse.
>>>
>>> On Fri, Dec 4, 2015 at 10:14 AM, Madabhattula Rajesh Kumar <
>>> mrajaf...@gmail.com> wrote:
>>>
 Hi,

 How to use/best practices "IN" clause in Spark SQL.

 Use Case :-  Read the table based on number. I have a List of numbers.
 For example, 1million.

 Regards,
 Rajesh

>>>
>>>
>>
>


Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Kyohey Hamaguchi
Andy,

Thank you for replying.

I am specifying exactly that for --master. I just missed it when writing
that email.
2015年12月5日(土) 9:27 Andy Davidson :

> Hi Kyohey
>
> I think you need to pass the argument --master $MASTER_URL \
>
>
> master_URL is something like spark://
> ec2-54-215-112-121.us-west-1.compute.amazonaws.com:7077
>
> Its the public url to your master
>
>
> Andy
>
> From: Kyohey Hamaguchi 
> Date: Friday, December 4, 2015 at 11:28 AM
> To: "user @spark" 
> Subject: Not all workers seem to run in a standalone cluster setup by
> spark-ec2 script
>
> Hi,
>
> I have setup a Spark standalone-cluster, which involves 5 workers,
> using spark-ec2 script.
>
> After submitting my Spark application, I had noticed that just one
> worker seemed to run the application and other 4 workers were doing
> nothing. I had confirmed this by checking CPU and memory usage on the
> Spark Web UI (CPU usage indicates zero and memory is almost fully
> availabile.)
>
> This is the command used to launch:
>
> $ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
> /path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
> --zone=ap-northeast-1a --slaves 5 --instance-type m1.large
> --hadoop-major-version yarn launch awesome-spark-cluster
>
> And the command to run application:
>
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "mkdir ~/awesome"
> $ scp -i ~/path/to/awesome-private-key.pem spark.jar
> root@ec2-master-host-name:~/awesome && ssh -i
> ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark-ec2/copy-dir ~/awesome"
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
> --executor-memory 5G --total-executor-cores 10 --driver-cores 2
> --driver-memory 5G --class com.example.SparkIsAwesome
> awesome/spark.jar"
>
> How do I let the all of the workers execute the app?
>
> Or do I have wrong understanding on what workers, slaves and executors are?
>
> My understanding is: Spark driver(or maybe master?) sends a part of
> jobs to each worker (== executor == slave), so a Spark cluster
> automatically exploits all resources available in the cluster. Is this
> some sort of misconception?
>
> Thanks,
>
> --
> Kyohey Hamaguchi
> TEL:  080-6918-1708
> Mail: tnzk.ma...@gmail.com
> Blog: http://blog.tnzk.org/
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Andy Davidson
Hi Kyohey

I think you need to pass the argument --master $MASTER_URL \


master_URL is something like
spark://ec2-54-215-112-121.us-west-1.compute.amazonaws.com:7077

Its the public url to your master


Andy

From:  Kyohey Hamaguchi 
Date:  Friday, December 4, 2015 at 11:28 AM
To:  "user @spark" 
Subject:  Not all workers seem to run in a standalone cluster setup by
spark-ec2 script

> Hi,
> 
> I have setup a Spark standalone-cluster, which involves 5 workers,
> using spark-ec2 script.
> 
> After submitting my Spark application, I had noticed that just one
> worker seemed to run the application and other 4 workers were doing
> nothing. I had confirmed this by checking CPU and memory usage on the
> Spark Web UI (CPU usage indicates zero and memory is almost fully
> availabile.)
> 
> This is the command used to launch:
> 
> $ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
> /path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
> --zone=ap-northeast-1a --slaves 5 --instance-type m1.large
> --hadoop-major-version yarn launch awesome-spark-cluster
> 
> And the command to run application:
> 
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "mkdir ~/awesome"
> $ scp -i ~/path/to/awesome-private-key.pem spark.jar
> root@ec2-master-host-name:~/awesome && ssh -i
> ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark-ec2/copy-dir ~/awesome"
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
> --executor-memory 5G --total-executor-cores 10 --driver-cores 2
> --driver-memory 5G --class com.example.SparkIsAwesome
> awesome/spark.jar"
> 
> How do I let the all of the workers execute the app?
> 
> Or do I have wrong understanding on what workers, slaves and executors are?
> 
> My understanding is: Spark driver(or maybe master?) sends a part of
> jobs to each worker (== executor == slave), so a Spark cluster
> automatically exploits all resources available in the cluster. Is this
> some sort of misconception?
> 
> Thanks,
> 
> --
> Kyohey Hamaguchi
> TEL:  080-6918-1708
> Mail: tnzk.ma...@gmail.com
> Blog: http://blog.tnzk.org/
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 




spark.authenticate=true YARN mode doesn't work

2015-12-04 Thread prasadreddy
Hi All, 

I am running Spark YARN and trying to enable authentication by setting
spark.authenticate=true. After enable authentication I am not able to Run
Spark word count or any other programs. 

Any help will be appreciated. 

Thanks 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-authenticate-true-YARN-mode-doesn-t-work-tp25573.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.authenticate=true YARN mode doesn't work

2015-12-04 Thread Ted Yu
Which release are you using ?

Please take a look at
https://spark.apache.org/docs/latest/running-on-yarn.html
There're several config parameters related to security:
spark.yarn.keytab
spark.yarn.principal
...

FYI

On Fri, Dec 4, 2015 at 5:47 PM, prasadreddy  wrote:

> Hi All,
>
> I am running Spark YARN and trying to enable authentication by setting
> spark.authenticate=true. After enable authentication I am not able to Run
> Spark word count or any other programs.
>
> Any help will be appreciated.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-authenticate-true-YARN-mode-doesn-t-work-tp25573.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark-ec2 vs. EMR

2015-12-04 Thread Jonathan Kelly
Sending this to the list again because I'm pretty sure it didn't work the
first time. A colleague just realized he was having the same problem with
the list not accepting his posts, but unsubscribing and re-subscribing
seemed to fix the issue for him. I've just unsubscribed and re-subscribed
too, so hopefully this works...

On Wednesday, December 2, 2015, Jonathan Kelly 
wrote:

> EMR is currently running a private preview of an upcoming feature allowing
> EMR clusters to be launched in VPC private subnets. This will allow you to
> launch a cluster in a subnet without an Internet Gateway attached. Please
> contact jonfr...@amazon.com
>  if you would like
> more information.
>
> ~ Jonathan
>
> Note: jonfr...@amazon.com
>  is not me. I'm a
> different Jonathan. :)
>
> On Wed, Dec 2, 2015 at 10:21 AM, Jerry Lam  > wrote:
>
>> Hi Dana,
>>
>> Yes, we get VPC + EMR working but I'm not the person who deploys it. It
>> is related to subnet as Alex points out.
>>
>> Just to want to add another point, spark-ec2 is nice to keep and improve
>> because it allows users to any version of spark (nightly-build for
>> example). EMR does not allow you to do that without manual process.
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Wed, Dec 2, 2015 at 1:02 PM, Alexander Pivovarov > > wrote:
>>
>>> Do you think it's a security issue if EMR started in VPC with a subnet
>>> having Auto-assign Public IP: Yes
>>>
>>> you can remove all Inbound rules having 0.0.0.0/0 Source in master and
>>> slave Security Group
>>> So, master and slave boxes will be accessible only for users who are on
>>> VPN
>>>
>>>
>>>
>>>
>>> On Wed, Dec 2, 2015 at 9:44 AM, Dana Powers >> > wrote:
>>>
 EMR was a pain to configure on a private VPC last I tried. Has anyone
 had success with that? I found spark-ec2 easier to use w private
 networking, but also agree that I would use for prod.

 -Dana
 On Dec 1, 2015 12:29 PM, "Alexander Pivovarov" > wrote:

> 1. Emr 4.2.0 has Zeppelin as an alternative to DataBricks Notebooks
>
> 2. Emr has Ganglia 3.6.0
>
> 3. Emr has hadoop fs settings to make s3 work fast
> (direct.EmrFileSystem)
>
> 4. EMR has s3 keys in hadoop configs
>
> 5. EMR allows to resize cluster on fly.
>
> 6. EMR has aws sdk in spark classpath. Helps to reduce app assembly
> jar size
>
> 7. ec2 script installs all in /root, EMR has dedicated users: hadoop,
> zeppelin, etc. EMR is similar to Cloudera or Hortonworks
>
> 8. There are at least 3 spark-ec2 projects. (in apache/spark, in
> mesos, in amplab). Master branch in spark has outdated ec2 script. Other
> projects have broken links in readme. WHAT A MESS!
>
> 9. ec2 script has bad documentation and non informative error
> messages. e.g. readme does not say anything about --private-ips option. If
> you did not add the flag it will connect to empty string host (localhost)
> instead of master. Fixed only last week. Not sure if fixed in all branches
>
> 10. I think Amazon will include spark-jobserver to EMR soon.
>
> 11. You do not need to be aws expert to start EMR cluster. Users can
> use EMR web ui to start cluster to run some jobs or work in Zeppelun 
> during
> the day
>
> 12. EMR cluster starts in abour 8 min. Ec2 script works longer and you
> need to be online.
> On Dec 1, 2015 9:22 AM, "Jerry Lam"  > wrote:
>
>> Simply put:
>>
>> EMR = Hadoop Ecosystem (Yarn, HDFS, etc) + Spark + EMRFS + Amazon EMR
>> API + Selected Instance Types + Amazon EC2 Friendly (bootstrapping)
>> spark-ec2 = HDFS + Yarn (Optional) + Spark (Standalone Default) + Any
>> Instance Type
>>
>> I use spark-ec2 for prototyping and I have never use it for
>> production.
>>
>> just my $0.02
>>
>>
>>
>> On Dec 1, 2015, at 11:15 AM, Nick Chammas > > wrote:
>>
>> Pinging this thread in case anyone has thoughts on the matter they
>> want to share.
>>
>> On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas <[hidden email]>
>> wrote:
>>
>>> Spark has come bundled with spark-ec2
>>>  for many
>>> years. At the same time, EMR has been capable of running Spark for a 
>>> while,
>>> and earlier this year it 

Re: Improve saveAsTextFile performance

2015-12-04 Thread Ram VISWANADHA
That didn't work :(
Any help? I have documented some steps here:
http://stackoverflow.com/questions/34048340/spark-saveastextfile-last-stage-almost-never-finishes

Best Regards,
Ram

From: Sahil Sareen >
Date: Wednesday, December 2, 2015 at 10:18 PM
To: Ram VISWANADHA 
>
Cc: Ted Yu >, user 
>
Subject: Re: Improve saveAsTextFile performance

http://stackoverflow.com/questions/29213404/how-to-split-an-rdd-into-multiple-smaller-rdds-given-a-max-number-of-rows-per


Re: spark.authenticate=true YARN mode doesn't work

2015-12-04 Thread Ted Yu
Did you try setting "spark.authenticate.secret" ?

Cheers

On Fri, Dec 4, 2015 at 7:07 PM, Prasad Reddy  wrote:

> Hi Ted,
>
> Thank you for the reply.
>
> I am using 1.5.2.
>
> I am implementing SASL encryption. Authentication is required to implement
> SASL Encryption.
>
> I have configured like below in Spark-default.conf
>
> spark.authenticate true
>
> spark.authenticate.enableSaslEncryption true
>
> spark.network.sasl.serverAlwaysEncrypt true
>
>
> Any help will be appreciated.
>
>
>
> Thanks
>
> Prasad
>
>
>
> On Fri, Dec 4, 2015 at 5:55 PM, Ted Yu  wrote:
>
>> Which release are you using ?
>>
>> Please take a look at
>> https://spark.apache.org/docs/latest/running-on-yarn.html
>> There're several config parameters related to security:
>> spark.yarn.keytab
>> spark.yarn.principal
>> ...
>>
>> FYI
>>
>> On Fri, Dec 4, 2015 at 5:47 PM, prasadreddy  wrote:
>>
>>> Hi All,
>>>
>>> I am running Spark YARN and trying to enable authentication by setting
>>> spark.authenticate=true. After enable authentication I am not able to Run
>>> Spark word count or any other programs.
>>>
>>> Any help will be appreciated.
>>>
>>> Thanks
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-authenticate-true-YARN-mode-doesn-t-work-tp25573.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


the way to compare any two adjacent elements in one rdd

2015-12-04 Thread Zhiliang Zhu
Hi All,
I would like to compare any two adjacent elements in one given rdd, just as in
the single-machine code:

int a[N] = {...};
for (int i=0; i < N - 1; ++i) {
   compareFun(a[i], a[i+1]);
}
...

mapPartitions may work for some situations, however, it cannot compare elements
in different partitions. foreach also does not seem to work.

Thanks,
Zhiliang
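
One possible approach (a sketch assuming an RDD[Int] named rdd and the compareFun above; it uses zipWithIndex plus a shifted self-join rather than explicit partition-boundary handling):

    val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }
    val successors = indexed.map { case (i, v) => (i - 1, v) }   // element i+1 keyed by i
    indexed.join(successors).foreach { case (_, (current, next)) =>
      compareFun(current, next)
    }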



Re: the way to compare any two adjacent elements in one rdd

2015-12-04 Thread Zhiliang Zhu
Hi DB Tsai,
Thanks very much for your kind reply!
Sorry, one more issue: as tested, it seems that filter can only return a
JavaRDD of the same element type, not any other JavaRDD, is that right? Then it
is not very convenient to do a general filter on an RDD. mapPartitions could
work to some extent, but if some partition is left with no elements after
filtering via mapPartitions, there will be a problem.

Best Wishes!
Zhiliang
 


On Saturday, December 5, 2015 3:00 PM, DB Tsai  wrote:
 

 This is tricky. You need to shuffle the ending and beginning elements
using mapPartitionWithIndex.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu  wrote:
> Hi All,
>
> I would like to compare any two adjacent elements in one given rdd, just as
> the single machine code part:
>
> int a[N] = {...};
> for (int i=0; i < N - 1; ++i) {
>    compareFun(a[i], a[i+1]);
> }
> ...
>
> mapPartitions may work for some situations, however, it could not compare
> elements in different  partitions.
> foreach also seems not work.
>
> Thanks,
> Zhiliang
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



  

Re: the way to compare any two adjacent elements in one rdd

2015-12-04 Thread DB Tsai
This is tricky. You need to shuffle the ending and beginning elements
using mapPartitionWithIndex.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu  wrote:
> Hi All,
>
> I would like to compare any two adjacent elements in one given rdd, just as
> the single machine code part:
>
> int a[N] = {...};
> for (int i=0; i < N - 1; ++i) {
>compareFun(a[i], a[i+1]);
> }
> ...
>
> mapPartitions may work for some situations, however, it could not compare
> elements in different  partitions.
> foreach also seems not work.
>
> Thanks,
> Zhiliang
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MLlib training time question

2015-12-04 Thread Haoyue Wang
Hi all,
I'm doing some experiment with Spark MLlib (version 1.5.0). I train
LogisticRegressionModel on a 2.06GB dataset (# of data: 2396130, # of
features: 3231961, # of classes: 2, format: LibSVM). I deployed Spark to a
4 nodes cluster, each node's spec: CPU: Intel(R) Xeon(R) CPU E5-2650 0 @
2.00GHz, 2 #CPUs * 8 #cores *2 #threads; Network: 40Gbps infiniband; RAM:
256GB (spark configuration: driver 100GB, spark executor 100GB).

I'm doing two experiments:
1) Load data into Hive, and use HiveContext in the Spark program to load the
data from Hive into a DataFrame, parse the DataFrame into an RDD of
LabeledPoint, then train a LogisticRegressionModel on this RDD.
The training time is 389218 milliseconds.
2) Load data from a socket server into an RDD, doing some feature
transformation that adds 5 features to each datum, so the # of features is
3231966. Then repartition this RDD into 16 partitions, parse the RDD into an
RDD of LabeledPoint, and finally train a LogisticRegressionModel on this RDD.
The training time is 838470 milliseconds.

The training time mentioned above is only the time of: final
LogisticRegressionModel model = new
LogisticRegressionWithLBFGS().setNumClasses(2).run(training.rdd());
it does not include the loading and parsing time.

So here is the question: why do these two experiments' training times differ so
much? I expected them to be similar, but the second is roughly 2x the first. I
even tried repartitioning the RDD into 4/32/64/128 partitions and caching them
before training in experiment 2, but it made no difference.

Is there any inner difference between the RDDs used for training in the
2 experiments that causes the difference in training time?
I would appreciate any guidance you can give me.

Best,
Haoyue


Re: spark.authenticate=true YARN mode doesn't work

2015-12-04 Thread Prasad Reddy
I did try. Same problem.

As you said earlier,

spark.yarn.keytab
spark.yarn.principal

are required.

On Fri, Dec 4, 2015 at 7:25 PM, Ted Yu  wrote:

> Did you try setting "spark.authenticate.secret" ?
>
> Cheers
>
> On Fri, Dec 4, 2015 at 7:07 PM, Prasad Reddy  wrote:
>
>> Hi Ted,
>>
>> Thank you for the reply.
>>
>> I am using 1.5.2.
>>
>> I am implementing SASL encryption. Authentication is required to
>> implement SASL Encryption.
>>
>> I have configured like below in Spark-default.conf
>>
>> spark.authenticate true
>>
>> spark.authenticate.enableSaslEncryption true
>>
>> spark.network.sasl.serverAlwaysEncrypt true
>>
>>
>> Any help will be appreciated.
>>
>>
>>
>> Thanks
>>
>> Prasad
>>
>>
>>
>> On Fri, Dec 4, 2015 at 5:55 PM, Ted Yu  wrote:
>>
>>> Which release are you using ?
>>>
>>> Please take a look at
>>> https://spark.apache.org/docs/latest/running-on-yarn.html
>>> There're several config parameters related to security:
>>> spark.yarn.keytab
>>> spark.yarn.principal
>>> ...
>>>
>>> FYI
>>>
>>> On Fri, Dec 4, 2015 at 5:47 PM, prasadreddy 
>>> wrote:
>>>
 Hi All,

 I am running Spark YARN and trying to enable authentication by setting
 spark.authenticate=true. After enable authentication I am not able to
 Run
 Spark word count or any other programs.

 Any help will be appreciated.

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-authenticate-true-YARN-mode-doesn-t-work-tp25573.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Any role for volunteering

2015-12-04 Thread Deepak Sharma
Hi All
Sorry for spamming your inbox.
I am really keen to work on a big data project full time (preferably remotely
from India); if not, I am open to volunteering as well.
Please do let me know if there is any such opportunity available.

-- 
Thanks
Deepak


Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread prateek arora
Hi Ted
Thanks for the information .
Is there any way for two different Spark applications to share their data?

Regards
Prateek

On Fri, Dec 4, 2015 at 9:54 AM, Ted Yu  wrote:

> See Josh's response in this thread:
>
>
> http://search-hadoop.com/m/q3RTt1z1hUw4TiG1=Re+Question+about+yarn+cluster+mode+and+spark+driver+allowMultipleContexts
>
> Cheers
>
> On Fri, Dec 4, 2015 at 9:46 AM, prateek arora 
> wrote:
>
>> Hi
>>
>> I want to create multiple sparkContext in my application.
>> i read so many articles they suggest " usage of multiple contexts is
>> discouraged, since SPARK-2243 is still not resolved."
>> i want to know that Is spark 1.5.0 supported to create multiple contexts
>> without error ?
>> and if supported then are we need to set
>> "spark.driver.allowMultipleContexts" configuration parameter ?
>>
>> Regards
>> Prateek
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/is-Multiple-Spark-Contexts-is-supported-in-spark-1-5-0-tp25568.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark?

2015-12-04 Thread Lin, Hao
Hi,

Does anyone know whether Spark running in AWS supports temporary access
credentials (AccessKeyId, SecretAccessKey + SecurityToken) for accessing S3? I
only see references to specifying fs.s3.awsAccessKeyId and
fs.s3.awsSecretAccessKey, without any mention of a security token. Apparently
this only works for static credentials.

Many thanks

Confidentiality Notice::  This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information.  If 
you are not an intended recipient or an authorized agent of an intended 
recipient, you are hereby notified that any dissemination, distribution or 
copying of the information contained in or transmitted with this e-mail is 
unauthorized and strictly prohibited.  If you have received this email in 
error, please notify the sender by replying to this message and permanently 
delete this e-mail, its attachments, and any copies of it immediately.  You 
should not retain, copy or use this e-mail or any attachment for any purpose, 
nor disclose all or any part of the contents to any other person. Thank you.


Oozie SparkAction not able to use spark conf values

2015-12-04 Thread Rajadayalan Perumalsamy
Hi

We are trying to change our existing oozie workflows to use SparkAction
instead of ShellAction.
We are passing the Spark configuration in spark-opts with --conf, but these
values are not accessible in Spark and it throws an error.

Please note we are able to use SparkAction successfully in yarn-cluster
mode if we are not using the spark configurations. I have attached oozie
workflow.xml, oozie log and yarncontainer-spark log files.

Thanks
Raja
Log Type: stderr
Log Upload Time: Fri Dec 04 10:26:17 -0800 2015
Log Length: 6035
15/12/04 10:26:04 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
15/12/04 10:26:05 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1447700095990_2804_02
15/12/04 10:26:06 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/12/04 10:26:06 INFO spark.SecurityManager: Changing view acls to: dev
15/12/04 10:26:06 INFO spark.SecurityManager: Changing modify acls to: dev
15/12/04 10:26:06 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(dev); users with 
modify permissions: Set(dev)
15/12/04 10:26:06 INFO yarn.ApplicationMaster: Starting the user application in 
a separate Thread
15/12/04 10:26:06 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
15/12/04 10:26:06 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 
15/12/04 10:26:06 ERROR yarn.ApplicationMaster: User class threw exception: null
java.lang.ExceptionInInitializerError
at com.toyota.genericmodule.info.Info$.(Info.scala:20)
at com.toyota.genericmodule.info.Info$.(Info.scala)
at com.toyota.genericmodule.info.Info.main(Info.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
Caused by: com.typesafe.config.ConfigException$UnresolvedSubstitution: 
application.conf: 53: Could not resolve substitution to a value: ${inputdb}
at 
com.typesafe.config.impl.ConfigReference.resolveSubstitutions(ConfigReference.java:84)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ConfigConcatenation.resolveSubstitutions(ConfigConcatenation.java:178)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ConfigDelayedMerge.resolveSubstitutions(ConfigDelayedMerge.java:96)
at 
com.typesafe.config.impl.ConfigDelayedMerge.resolveSubstitutions(ConfigDelayedMerge.java:59)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ResolveSource.lookupSubst(ResolveSource.java:62)
at 
com.typesafe.config.impl.ConfigReference.resolveSubstitutions(ConfigReference.java:73)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ConfigConcatenation.resolveSubstitutions(ConfigConcatenation.java:178)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ResolveSource.lookupSubst(ResolveSource.java:62)
at 
com.typesafe.config.impl.ConfigReference.resolveSubstitutions(ConfigReference.java:73)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.SimpleConfigObject$1.modifyChildMayThrow(SimpleConfigObject.java:340)
at 
com.typesafe.config.impl.SimpleConfigObject.modifyMayThrow(SimpleConfigObject.java:279)
at 
com.typesafe.config.impl.SimpleConfigObject.resolveSubstitutions(SimpleConfigObject.java:320)
at 
com.typesafe.config.impl.SimpleConfigObject.resolveSubstitutions(SimpleConfigObject.java:24)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
   

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread prateek arora
Thanks ...

Is there any way for my second application to run in parallel and wait to
fetch data from HBase or any other data storage system?

Regards
Prateek

On Fri, Dec 4, 2015 at 10:24 AM, Ted Yu  wrote:

> How about using NoSQL data store such as HBase :-)
>
> On Fri, Dec 4, 2015 at 10:17 AM, prateek arora  > wrote:
>
>> Hi Ted
>> Thanks for the information .
>> is there any way that two different spark application share there data ?
>>
>> Regards
>> Prateek
>>
>> On Fri, Dec 4, 2015 at 9:54 AM, Ted Yu  wrote:
>>
>>> See Josh's response in this thread:
>>>
>>>
>>> http://search-hadoop.com/m/q3RTt1z1hUw4TiG1=Re+Question+about+yarn+cluster+mode+and+spark+driver+allowMultipleContexts
>>>
>>> Cheers
>>>
>>> On Fri, Dec 4, 2015 at 9:46 AM, prateek arora <
>>> prateek.arora...@gmail.com> wrote:
>>>
 Hi

 I want to create multiple sparkContext in my application.
 i read so many articles they suggest " usage of multiple contexts is
 discouraged, since SPARK-2243 is still not resolved."
 i want to know that Is spark 1.5.0 supported to create multiple contexts
 without error ?
 and if supported then are we need to set
 "spark.driver.allowMultipleContexts" configuration parameter ?

 Regards
 Prateek



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/is-Multiple-Spark-Contexts-is-supported-in-spark-1-5-0-tp25568.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Spark SQL IN Clause

2015-12-04 Thread Madabhattula Rajesh Kumar
Hi,

What are the best practices for using an "IN" clause in Spark SQL?

Use case: read rows from a table filtered on a number column, where the filter
list is large (for example, 1 million numbers).

Regards,
Rajesh


Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Ted Yu
How about using NoSQL data store such as HBase :-)
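
If the two applications go through a shared store such as HBase, each of them
simply reads and writes that store with its own SparkContext. A hedged sketch
of the reading side (the table name is a placeholder and sc is the
application's SparkContext):

// Reading an HBase table into an RDD via the standard TableInputFormat.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "shared_table")   // placeholder table

val rows = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
println(rows.count())

The other application writes to the same table (for example with
TableOutputFormat or the plain HBase client API), so no SparkContext has to be
shared between the two.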

On Fri, Dec 4, 2015 at 10:17 AM, prateek arora 
wrote:

> Hi Ted
> Thanks for the information .
> is there any way that two different spark application share there data ?
>
> Regards
> Prateek
>
> On Fri, Dec 4, 2015 at 9:54 AM, Ted Yu  wrote:
>
>> See Josh's response in this thread:
>>
>>
>> http://search-hadoop.com/m/q3RTt1z1hUw4TiG1=Re+Question+about+yarn+cluster+mode+and+spark+driver+allowMultipleContexts
>>
>> Cheers
>>
>> On Fri, Dec 4, 2015 at 9:46 AM, prateek arora > > wrote:
>>
>>> Hi
>>>
>>> I want to create multiple sparkContext in my application.
>>> i read so many articles they suggest " usage of multiple contexts is
>>> discouraged, since SPARK-2243 is still not resolved."
>>> i want to know that Is spark 1.5.0 supported to create multiple contexts
>>> without error ?
>>> and if supported then are we need to set
>>> "spark.driver.allowMultipleContexts" configuration parameter ?
>>>
>>> Regards
>>> Prateek
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/is-Multiple-Spark-Contexts-is-supported-in-spark-1-5-0-tp25568.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: ROW_TIMESTAMP support with UNSIGNED_LONG

2015-12-04 Thread pierre lacave
Sorry wrong list, please ignore
On 4 Dec 2015 5:51 p.m., "pierre lacave"  wrote:

>
> Hi I am trying to use the ROW_TIMESTAMP mapping featured in 4.6 as
> described in https://phoenix.apache.org/rowtimestamp.html
>
> However when inserting a timestamp in nanosecond I get the following
> exception saying the value cannot be less than zero?
>
> Inserting micros,micros or sec result in same result?
>
> Any idea what's happening?
>
> Thanks
>
>
> 0: jdbc:phoenix:hadoop1-dc:2181:/hbase> CREATE TABLE TEST (t UNSIGNED_LONG
> NOT NULL CONSTRAINT pk PRIMARY KEY (t ROW_TIMESTAMP) );
> No rows affected (1.654 seconds)
> 0: jdbc:phoenix:hadoop1-dc:2181:/hbase> UPSERT INTO TEST (t) VALUES
> (14491610811);
> Error: ERROR 201 (22000): Illegal data. Value of a column designated as
> ROW_TIMESTAMP cannot be less than zero (state=22000,code=201)
> java.sql.SQLException: ERROR 201 (22000): Illegal data. Value of a column
> designated as ROW_TIMESTAMP cannot be less than zero
> at
> org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:396)
> at
> org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
> at
> org.apache.phoenix.schema.IllegalDataException.(IllegalDataException.java:38)
> at
> org.apache.phoenix.compile.UpsertCompiler.setValues(UpsertCompiler.java:135)
> at
> org.apache.phoenix.compile.UpsertCompiler.access$400(UpsertCompiler.java:114)
> at
> org.apache.phoenix.compile.UpsertCompiler$3.execute(UpsertCompiler.java:882)
> at
> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:322)
> at
> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:314)
> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
> at
> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:312)
> at
> org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:1435)
> at sqlline.Commands.execute(Commands.java:822)
> at sqlline.Commands.sql(Commands.java:732)
> at sqlline.SqlLine.dispatch(SqlLine.java:808)
> at sqlline.SqlLine.begin(SqlLine.java:681)
> at sqlline.SqlLine.start(SqlLine.java:398)
> at sqlline.SqlLine.main(SqlLine.java:292)
>
> *Pierre Lacave*
> 171 Skellig House, Custom House, Lower Mayor street, Dublin 1, Ireland
> Phone :   +353879128708
>


Re: Spark SQL IN Clause

2015-12-04 Thread Ted Yu
Have you seen this JIRA ?

[SPARK-8077] [SQL] Optimization for TreeNodes with large numbers of children

From the numbers Michael published, 1 million numbers would still need 250
seconds to parse.

On Fri, Dec 4, 2015 at 10:14 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> How to use/best practices "IN" clause in Spark SQL.
>
> Use Case :-  Read the table based on number. I have a List of numbers. For
> example, 1million.
>
> Regards,
> Rajesh
>


Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark?

2015-12-04 Thread Michal Klos
We were looking into this as well --- the answer looks like "no"

Here's the ticket:
https://issues.apache.org/jira/browse/HADOOP-9680

m
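
For reference, the static credentials that the question mentions can be set on
the SparkContext's Hadoop configuration; there is currently no corresponding
property for an STS SecurityToken in the s3/s3n connectors, which is what
HADOOP-9680 tracks. A hedged sketch with placeholder values (fs.s3n.* shown
here; the fs.s3.* keys work the same way for s3:// URIs):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-static-creds"))
// Static AWS credentials only; a temporary session token cannot be supplied.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AKIA...")          // placeholder
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "...secret...") // placeholder
val lines = sc.textFile("s3n://my-bucket/some/path/*")                  // placeholder bucket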


On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao  wrote:

> Hi,
>
>
>
> Does anyone knows if Spark run in AWS is supported by temporary access
> credential (AccessKeyId, SecretAccessKey + SecurityToken) to access S3?  I
> only see references to specify fs.s3.awsAccessKeyId and
> fs.s3.awsSecretAccessKey, without any mention of security token. Apparently
> this is only for static credential.
>
>
>
> Many thanks
> Confidentiality Notice:: This email, including attachments, may include
> non-public, proprietary, confidential or legally privileged information. If
> you are not an intended recipient or an authorized agent of an intended
> recipient, you are hereby notified that any dissemination, distribution or
> copying of the information contained in or transmitted with this e-mail is
> unauthorized and strictly prohibited. If you have received this email in
> error, please notify the sender by replying to this message and permanently
> delete this e-mail, its attachments, and any copies of it immediately. You
> should not retain, copy or use this e-mail or any attachment for any
> purpose, nor disclose all or any part of the contents to any other person.
> Thank you.
>


RE: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark?

2015-12-04 Thread Lin, Hao
Thanks, I will keep an eye on it.

From: Michal Klos [mailto:michal.klo...@gmail.com]
Sent: Friday, December 04, 2015 1:50 PM
To: Lin, Hao
Cc: user
Subject: Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + 
SecurityToken) support by Spark?

We were looking into this as well --- the answer looks like "no"

Here's the ticket:
https://issues.apache.org/jira/browse/HADOOP-9680

m


On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao 
> wrote:
Hi,

Does anyone know whether Spark running in AWS supports temporary access 
credentials (AccessKeyId, SecretAccessKey + SecurityToken) for accessing S3?  I only 
see references to specifying fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, 
without any mention of a security token. Apparently these are only for static 
credentials.

Many thanks
Confidentiality Notice:: This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information. If you 
are not an intended recipient or an authorized agent of an intended recipient, 
you are hereby notified that any dissemination, distribution or copying of the 
information contained in or transmitted with this e-mail is unauthorized and 
strictly prohibited. If you have received this email in error, please notify 
the sender by replying to this message and permanently delete this e-mail, its 
attachments, and any copies of it immediately. You should not retain, copy or 
use this e-mail or any attachment for any purpose, nor disclose all or any part 
of the contents to any other person. Thank you.




Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Michael Armbrust
To be clear, I don't think there is ever a compelling reason to create more
than one SparkContext in a single application.  The context is threadsafe
and can launch many jobs in parallel from multiple threads.  Even if there
wasn't global state that made it unsafe to do so, creating more than one
context creates an artificial barrier that prevents sharing of RDDs between
the two.
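
As a minimal sketch of that point (not from the original thread): a single
SparkContext can be driven from several threads, each submitting its own jobs
concurrently, and the scheduler interleaves them.

import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val sc = new SparkContext(new SparkConf().setAppName("shared-context"))

// Four independent jobs submitted from four threads against the same context.
val jobs = (1 to 4).map { i =>
  Future { sc.parallelize(1 to 1000000).map(_ * i).sum() }
}
val results = Await.result(Future.sequence(jobs), 10.minutes)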

On Fri, Dec 4, 2015 at 10:47 AM, prateek arora 
wrote:

> Thanks ...
>
> Is there any way my second application run in parallel and wait for
> fetching data from hbase or any other data storeage system ?
>
> Regards
> Prateek
>
> On Fri, Dec 4, 2015 at 10:24 AM, Ted Yu  wrote:
>
>> How about using NoSQL data store such as HBase :-)
>>
>> On Fri, Dec 4, 2015 at 10:17 AM, prateek arora <
>> prateek.arora...@gmail.com> wrote:
>>
>>> Hi Ted
>>> Thanks for the information .
>>> is there any way that two different spark application share there data ?
>>>
>>> Regards
>>> Prateek
>>>
>>> On Fri, Dec 4, 2015 at 9:54 AM, Ted Yu  wrote:
>>>
 See Josh's response in this thread:


 http://search-hadoop.com/m/q3RTt1z1hUw4TiG1=Re+Question+about+yarn+cluster+mode+and+spark+driver+allowMultipleContexts

 Cheers

 On Fri, Dec 4, 2015 at 9:46 AM, prateek arora <
 prateek.arora...@gmail.com> wrote:

> Hi
>
> I want to create multiple sparkContext in my application.
> i read so many articles they suggest " usage of multiple contexts is
> discouraged, since SPARK-2243 is still not resolved."
> i want to know that Is spark 1.5.0 supported to create multiple
> contexts
> without error ?
> and if supported then are we need to set
> "spark.driver.allowMultipleContexts" configuration parameter ?
>
> Regards
> Prateek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/is-Multiple-Spark-Contexts-is-supported-in-spark-1-5-0-tp25568.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Re: Spark UI - Streaming Tab

2015-12-04 Thread Josh Rosen
The Streaming tab is only supported in the live UI, not in the History
Server.

On Fri, Dec 4, 2015 at 9:31 AM, patcharee  wrote:

> I ran streaming jobs, but no streaming tab appeared for those jobs.
>
> Patcharee
>
>
>
> On 04. des. 2015 18:12, PhuDuc Nguyen wrote:
>
> I believe the "Streaming" tab is dynamic - it appears once you have a
> streaming job running, not when the cluster is simply up. It does not
> depend on 1.6 and has been in there since at least 1.0.
>
> HTH,
> Duc
>
> On Fri, Dec 4, 2015 at 7:28 AM, patcharee 
> wrote:
>
>> Hi,
>>
>> We tried to get the streaming tab interface on Spark UI -
>> 
>> https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html
>>
>> Tested on version 1.5.1, 1.6.0-snapshot, but no such interface for
>> streaming applications at all. Any suggestions? Do we need to configure the
>> history UI somehow to get such interface?
>>
>> Thanks,
>> Patcharee
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Spark SQL IN Clause

2015-12-04 Thread Michael Armbrust
The best way to run this today is probably to manually convert the query
into a join.  I.e. create a dataframe that has all the numbers in it, and
join/outer join it with the other table.  This way you avoid parsing a
gigantic string.
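
A minimal sketch of that approach (sc and sqlContext are assumed to be in
scope, and the table/column names are placeholders):

// Put the keys into their own DataFrame and join, instead of building a huge
// "... WHERE id IN (1, 2, ...)" string.
case class Key(id: Long)

val numbers: Seq[Long] = (1L to 1000000L)                  // stand-in for the real list
val keys = sqlContext.createDataFrame(sc.parallelize(numbers.map(Key)))
val bigTable = sqlContext.table("my_table")                // table with an "id" column
val matched = bigTable.join(keys, "id")                    // inner join keeps only listed ids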

On Fri, Dec 4, 2015 at 10:36 AM, Ted Yu  wrote:

> Have you seen this JIRA ?
>
> [SPARK-8077] [SQL] Optimization for TreeNodes with large numbers of
> children
>
> From the numbers Michael published, 1 million numbers would still need 250
> seconds to parse.
>
> On Fri, Dec 4, 2015 at 10:14 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> How to use/best practices "IN" clause in Spark SQL.
>>
>> Use Case :-  Read the table based on number. I have a List of numbers.
>> For example, 1million.
>>
>> Regards,
>> Rajesh
>>
>
>


Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Nicholas Chammas
Quick question: Are you processing gzipped files by any chance? It's a
common stumbling block people hit.

See: http://stackoverflow.com/q/27531816/877069

Nick
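
If that is the case here (a guess, since the input format is not shown in the
thread): a single .gz file is not splittable, so sc.textFile() yields one
partition and one task, which would leave a single worker busy. A hedged
sketch of checking for and working around it (the path is a placeholder):

val raw = sc.textFile("s3n://some-bucket/big-input.gz")
println(raw.partitions.length)      // likely 1 for a single gzipped file

// Repartition once up front so downstream stages use all executors.
val spread = raw.repartition(40)    // 40 is arbitrary; roughly 2-3x total cores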

On Fri, Dec 4, 2015 at 2:28 PM Kyohey Hamaguchi 
wrote:

> Hi,
>
> I have setup a Spark standalone-cluster, which involves 5 workers,
> using spark-ec2 script.
>
> After submitting my Spark application, I had noticed that just one
> worker seemed to run the application and other 4 workers were doing
> nothing. I had confirmed this by checking CPU and memory usage on the
> Spark Web UI (CPU usage indicates zero and memory is almost fully
> availabile.)
>
> This is the command used to launch:
>
> $ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
> /path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
> --zone=ap-northeast-1a --slaves 5 --instance-type m1.large
> --hadoop-major-version yarn launch awesome-spark-cluster
>
> And the command to run application:
>
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "mkdir ~/awesome"
> $ scp -i ~/path/to/awesome-private-key.pem spark.jar
> root@ec2-master-host-name:~/awesome && ssh -i
> ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark-ec2/copy-dir ~/awesome"
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
> --executor-memory 5G --total-executor-cores 10 --driver-cores 2
> --driver-memory 5G --class com.example.SparkIsAwesome
> awesome/spark.jar"
>
> How do I let the all of the workers execute the app?
>
> Or do I have wrong understanding on what workers, slaves and executors are?
>
> My understanding is: Spark driver(or maybe master?) sends a part of
> jobs to each worker (== executor == slave), so a Spark cluster
> automatically exploits all resources available in the cluster. Is this
> some sort of misconception?
>
> Thanks,
>
> --
> Kyohey Hamaguchi
> TEL:  080-6918-1708
> Mail: tnzk.ma...@gmail.com
> Blog: http://blog.tnzk.org/
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Mark Hamstra
Where it could start to make some sense is if you wanted a single
application to be able to work with more than one Spark cluster -- but
that's a pretty weird or unusual thing to do, and I'm pretty sure it
wouldn't work correctly at present.

On Fri, Dec 4, 2015 at 11:10 AM, Michael Armbrust 
wrote:

> To be clear, I don't think there is ever a compelling reason to create
> more than one SparkContext in a single application.  The context is
> threadsafe and can launch many jobs in parallel from multiple threads.
> Even if there wasn't global state that made it unsafe to do so, creating
> more than one context creates an artificial barrier that prevents sharing
> of RDDs between the two.
>
> On Fri, Dec 4, 2015 at 10:47 AM, prateek arora  > wrote:
>
>> Thanks ...
>>
>> Is there any way my second application run in parallel and wait for
>> fetching data from hbase or any other data storeage system ?
>>
>> Regards
>> Prateek
>>
>> On Fri, Dec 4, 2015 at 10:24 AM, Ted Yu  wrote:
>>
>>> How about using NoSQL data store such as HBase :-)
>>>
>>> On Fri, Dec 4, 2015 at 10:17 AM, prateek arora <
>>> prateek.arora...@gmail.com> wrote:
>>>
 Hi Ted
 Thanks for the information .
 is there any way that two different spark application share there data ?

 Regards
 Prateek

 On Fri, Dec 4, 2015 at 9:54 AM, Ted Yu  wrote:

> See Josh's response in this thread:
>
>
> http://search-hadoop.com/m/q3RTt1z1hUw4TiG1=Re+Question+about+yarn+cluster+mode+and+spark+driver+allowMultipleContexts
>
> Cheers
>
> On Fri, Dec 4, 2015 at 9:46 AM, prateek arora <
> prateek.arora...@gmail.com> wrote:
>
>> Hi
>>
>> I want to create multiple sparkContext in my application.
>> i read so many articles they suggest " usage of multiple contexts is
>> discouraged, since SPARK-2243 is still not resolved."
>> i want to know that Is spark 1.5.0 supported to create multiple
>> contexts
>> without error ?
>> and if supported then are we need to set
>> "spark.driver.allowMultipleContexts" configuration parameter ?
>>
>> Regards
>> Prateek
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/is-Multiple-Spark-Contexts-is-supported-in-spark-1-5-0-tp25568.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Anfernee Xu
If multiple users are looking at the same data set, then sharing the
SparkContext is a good choice.

But my use cases are different: users are looking at different data (I use a
custom Hadoop InputFormat to load data from my data source based on the
user input), and the data might not have any overlap. For now I'm taking the
approach below:

 1. Once my webapp receives a user request, it submits a custom
YARN application to my YARN cluster to allocate a container where a
SparkContext is created and my driver runs.
 2. I have to design a coordination/callback protocol so that the user
session in my webapp is notified when the Spark job finishes and the
result is pushed back.

Let me know if you have a better solution.

Thanks

On Fri, Dec 4, 2015 at 11:10 AM, Michael Armbrust 
wrote:

> To be clear, I don't think there is ever a compelling reason to create
> more than one SparkContext in a single application.  The context is
> threadsafe and can launch many jobs in parallel from multiple threads.
> Even if there wasn't global state that made it unsafe to do so, creating
> more than one context creates an artificial barrier that prevents sharing
> of RDDs between the two.
>
> On Fri, Dec 4, 2015 at 10:47 AM, prateek arora  > wrote:
>
>> Thanks ...
>>
>> Is there any way my second application run in parallel and wait for
>> fetching data from hbase or any other data storeage system ?
>>
>> Regards
>> Prateek
>>
>> On Fri, Dec 4, 2015 at 10:24 AM, Ted Yu  wrote:
>>
>>> How about using NoSQL data store such as HBase :-)
>>>
>>> On Fri, Dec 4, 2015 at 10:17 AM, prateek arora <
>>> prateek.arora...@gmail.com> wrote:
>>>
 Hi Ted
 Thanks for the information .
 is there any way that two different spark application share there data ?

 Regards
 Prateek

 On Fri, Dec 4, 2015 at 9:54 AM, Ted Yu  wrote:

> See Josh's response in this thread:
>
>
> http://search-hadoop.com/m/q3RTt1z1hUw4TiG1=Re+Question+about+yarn+cluster+mode+and+spark+driver+allowMultipleContexts
>
> Cheers
>
> On Fri, Dec 4, 2015 at 9:46 AM, prateek arora <
> prateek.arora...@gmail.com> wrote:
>
>> Hi
>>
>> I want to create multiple sparkContext in my application.
>> i read so many articles they suggest " usage of multiple contexts is
>> discouraged, since SPARK-2243 is still not resolved."
>> i want to know that Is spark 1.5.0 supported to create multiple
>> contexts
>> without error ?
>> and if supported then are we need to set
>> "spark.driver.allowMultipleContexts" configuration parameter ?
>>
>> Regards
>> Prateek
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/is-Multiple-Spark-Contexts-is-supported-in-spark-1-5-0-tp25568.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


-- 
--Anfernee


Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Kyohey Hamaguchi
Hi,

I have setup a Spark standalone-cluster, which involves 5 workers,
using spark-ec2 script.

After submitting my Spark application, I noticed that just one
worker seemed to run the application while the other 4 workers were doing
nothing. I confirmed this by checking CPU and memory usage on the
Spark Web UI (CPU usage shows zero and memory is almost fully
available).

This is the command used to launch:

$ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
/path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
--zone=ap-northeast-1a --slaves 5 --instance-type m1.large
--hadoop-major-version yarn launch awesome-spark-cluster

And the command to run application:

$ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
"mkdir ~/awesome"
$ scp -i ~/path/to/awesome-private-key.pem spark.jar
root@ec2-master-host-name:~/awesome && ssh -i
~/path/to/awesome-private-key.pem root@ec2-master-host-name
"~/spark-ec2/copy-dir ~/awesome"
$ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
"~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
--executor-memory 5G --total-executor-cores 10 --driver-cores 2
--driver-memory 5G --class com.example.SparkIsAwesome
awesome/spark.jar"

How do I get all of the workers to execute the app?

Or do I have a wrong understanding of what workers, slaves and executors are?

My understanding is: the Spark driver (or maybe the master?) sends a part of
the job to each worker (== executor == slave), so a Spark cluster
automatically uses all resources available in the cluster. Is this
some sort of misconception?

Thanks,

--
Kyohey Hamaguchi
TEL:  080-6918-1708
Mail: tnzk.ma...@gmail.com
Blog: http://blog.tnzk.org/

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to modularize Spark Streaming Jobs?

2015-12-04 Thread SRK
Hi,

What is the way to modularize Spark Streaming jobs, something along the lines
of what Spring XD does?

Thanks,
Swetha
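
One common pattern (a sketch, not an answer given on the list): keep the
transformations as plain functions from DStream to DStream, so each piece can
be composed, reused across jobs and unit-tested independently of how the
StreamingContext and the input sources are wired up.

import org.apache.spark.streaming.dstream.DStream

// Each module exposes pure DStream-to-DStream functions.
object Parsing {
  def words(lines: DStream[String]): DStream[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map(w => (w, 1))
}

object Aggregation {
  def counts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
    pairs.reduceByKey(_ + _)
}

// Wiring happens once in the job's main():
//   val counts = Aggregation.counts(Parsing.words(ssc.socketTextStream("host", 9999)))
//   counts.print()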



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-modularize-Spark-Streaming-Jobs-tp25569.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org