How to make sure that Spark Kafka Direct Streaming job maintains the state upon code deployment?

2017-06-27 Thread SRK
Hi,

We use updateStateByKey, reduceByKeyAndWindow and checkpoint the data. We
store the offsets in ZooKeeper. How do we make sure that the state of the job
is maintained when we redeploy the code?
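
For reference, a minimal sketch (not our actual job) of the usual checkpoint-recovery
pattern, StreamingContext.getOrCreate, which rebuilds the context and its
updateStateByKey / reduceByKeyAndWindow state from the checkpoint directory when one
exists. The checkpoint path is a placeholder, an existing SparkContext `sc` is assumed,
and note that checkpoint recovery generally does not survive a changed application jar,
which is why offsets are often kept externally (e.g. in ZooKeeper) as we already do:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))        // assumes an existing SparkContext sc
  ssc.checkpoint(checkpointDir)
  // build the Kafka direct stream here from the offsets read back from ZooKeeper,
  // then wire up updateStateByKey / reduceByKeyAndWindow as usual
  ssc
}

// Recreates the context (and its state) from the checkpoint if present,
// otherwise calls createContext() to build a fresh one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()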

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-sure-that-Spark-Kafka-Direct-Streaming-job-maintains-the-state-upon-code-deployment-tp28799.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to reduce the amount of data that is getting written to the checkpoint from Spark Streaming

2017-06-27 Thread SRK
Hi,

I have checkpoints enabled in Spark Streaming and I use updateStateByKey and
reduceByKeyAndWindow with inverse functions. How do I reduce the amount of
data that I am writing to the checkpoint, or clear out the data that I don't
care about?
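
For what it's worth, here is a minimal sketch (not my actual pipeline, and assuming a
pair DStream[(String, Int)] called counts on a 10-second batch interval) of two knobs
that usually shrink what the checkpoint has to carry: checkpointing the stateful stream
less often than every batch, and switching from updateStateByKey to mapWithState with a
state timeout so stale keys are dropped instead of being checkpointed forever:

import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

def trimCheckpointLoad(counts: DStream[(String, Int)]): Unit = {
  // Keep only keys that have been updated in the last 30 minutes.
  val spec = StateSpec.function(
    (key: String, value: Option[Int], state: State[Int]) => {
      val total = state.getOption.getOrElse(0) + value.getOrElse(0)
      if (!state.isTimingOut()) state.update(total)   // a timing-out key cannot be updated
      (key, total)
    }).timeout(Minutes(30))

  val stateful = counts.mapWithState(spec)

  // Checkpoint the stateful stream every 50s instead of every 10s batch
  // (the usual guideline is roughly 5-10x the batch interval).
  stateful.checkpoint(Seconds(50))
  stateful.print()
}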

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-reduce-the-amount-of-data-that-is-getting-written-to-the-checkpoint-from-Spark-Streaming-tp28798.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark standalone , client mode. How do I monitor?

2017-06-27 Thread anna stax
Hi all,

I have a spark standalone cluster. I am running a spark streaming
application on it and the deploy mode is client. I am looking for the best
way to monitor the cluster and application so that I will know when the
application/cluster is down. I cannot move to cluster deploy mode now.
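
One low-tech option, sketched below under a few assumptions (default ports, placeholder
host names, Spark's standard REST monitoring endpoint /api/v1/applications on the driver
UI and the standalone master's /json status view): poll both URLs from cron and alert
when either stops answering. This is only a health probe, not a full monitoring setup:

import scala.io.Source
import scala.util.{Failure, Success, Try}

object StreamingHealthCheck {
  def main(args: Array[String]): Unit = {
    val driverUi = "http://driver-host:4040/api/v1/applications"  // placeholder host, default UI port
    val masterUi = "http://master-host:8080/json"                 // standalone master status as JSON

    Seq(driverUi, masterUi).foreach { url =>
      Try(Source.fromURL(url).mkString) match {
        case Success(body) => println(s"OK   $url (${body.length} bytes)")
        case Failure(e)    => println(s"DOWN $url: ${e.getMessage}")  // hook alerting/paging here
      }
    }
  }
}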

I appreciate your thoughts.

Thanks
-Anna


Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-27 Thread Aaron Perrin
I'm assuming some things here, but hopefully I understand. So, basically
you have a big table of data distributed across a bunch of executors. And,
you want an efficient way to call a native method for each row.

It sounds similar to a dataframe writer to me. Except, instead of writing
to disk or network, you're 'writing' to a native function. Would a custom
dataframe writer work? That's what I'd try first.

If that doesn't work for your case, you could also try adding a column
where the column function does the native call. However, if doing it that
way, you'd have to ensure that the column function actually gets called for
all rows. (An interesting side effect of that is that you could catch
JNI/WinAPI errors there and set the column value to the result.)

There are other ways, too, if those options don't work...
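
For completeness, Dataset.mapPartitions is also available from the SQL/Dataset API in
Spark 2.x, so the JNI call can still be made once per partition. A minimal sketch under
assumptions (a hypothetical NativeLib.process(Seq[Double]): Seq[Double] JNI wrapper, a
placeholder input path, and partitions small enough to buffer in memory):

import org.apache.spark.sql.SparkSession

case class Record(id: Long, value: Double)
case class Scored(id: Long, score: Double)

val spark = SparkSession.builder().appName("mapPartitionsSketch").getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/path/to/input").as[Record]    // placeholder input

val scored = ds.mapPartitions { rows =>
  val batch = rows.toSeq                                    // buffer the partition
  val results = NativeLib.process(batch.map(_.value))       // hypothetical: one JNI call per partition
  batch.iterator.zip(results.iterator).map { case (r, s) => Scored(r.id, s) }
}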

On Sun, Jun 25, 2017 at 8:07 PM jeff saremi  wrote:

> My specific and immediate need is this: We have a native function wrapped
> in JNI. To increase performance we'd like to avoid calling it record by
> record. mapPartitions() give us the ability to invoke this in bulk. We're
> looking for a similar approach in SQL.
>
>
> --
> *From:* Ryan 
> *Sent:* Sunday, June 25, 2017 7:18:32 PM
> *To:* jeff saremi
> *Cc:* user@spark.apache.org
> *Subject:* Re: What is the equivalent of mapPartitions in SparkSQL?
>
> Why would you like to do so? I think there's no need for us to explicitly
> ask for a forEachPartition in spark sql because tungsten is smart enough to
> figure out whether a sql operation could be applied on each partition or
> there has to be a shuffle.
>
> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:
>
>> You can do a map() using a select and functions/UDFs. But how do you
>> process a partition using SQL?
>>
>>
>>
>


Re: IDE for python

2017-06-27 Thread ayan guha
Depends on the need. For data exploration, I use notebooks whenever I can.
For development, any good text editor should work; I use Sublime. If you
want auto-completion and all, you can use Eclipse or PyCharm; I do not :)

On Wed, 28 Jun 2017 at 7:17 am, Xiaomeng Wan  wrote:

> Hi,
> I recently switched from scala to python, and wondered which IDE people
> are using for python. I heard about pycharm, spyder etc. How do they
> compare with each other?
>
> Thanks,
> Shawn
>
-- 
Best Regards,
Ayan Guha


IDE for python

2017-06-27 Thread Xiaomeng Wan
Hi,
I recently switched from scala to python, and wondered which IDE people are
using for python. I heard about pycharm, spyder etc. How do they compare
with each other?

Thanks,
Shawn


Spark Encoder with mysql Enum and data truncated Error

2017-06-27 Thread mckunkel
I am using Spark via Java for a MySQL/ML (machine learning) project.

In the mysql database, I have a column "status_change_type" of type enum =
{broke, fixed} in a table called "status_change" in a DB called "test".

I have an object StatusChangeDB that constructs the needed structure for the
table; however, for "status_change_type" I constructed it as a String. I
know the bytes of a MySQL enum and a Java String are much different, but I am
using Spark, so the encoder does not recognize enums properly. However, when
I try to set the value of the enum via a Java String, I receive the "data
truncated" error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task
0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage
4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: Data
truncated for column 'status_change_type' at row 1 at
com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055)


I have tried to use an enum for "status_change_type"; however, it fails with a
stack trace of

 Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException at
org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
at
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at
org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
at
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
at
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at
org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
at
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
at
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ...


I have tried the JDBC setting "jdbcCompliantTruncation=false", but this does
nothing, as I get the same "data truncated" error as first stated. Here is my
JDBC options map, in case I am using "jdbcCompliantTruncation=false"
incorrectly.

public static Map<String, String> jdbcOptions() {
    Map<String, String> jdbcOptions = new HashMap<String, String>();
    jdbcOptions.put("url",
            "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false");
    jdbcOptions.put("driver", "com.mysql.jdbc.Driver");
    jdbcOptions.put("dbtable", "status_change");
    jdbcOptions.put("user", "root");
    jdbcOptions.put("password", "");
    return jdbcOptions;
}

Here is the Spark method for inserting into the mysql DB

private void insertMYSQLQuery(Dataset<Row> changeDF) {
    try {
        changeDF.write().mode(SaveMode.Append).jdbc(SparkManager.jdbcAppendOptions(),
                "status_change",
                new java.util.Properties());
    } catch (Exception e) {
        System.out.println(e);
    }
}

where jdbcAppendOptions uses the jdbcOptions methods as:

public static String jdbcAppendOptions() {
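    // Note: as written, this returns "<url>=<user>=<password>", which is not a
    // valid JDBC URL. User and password are normally passed via the Properties
    // argument of jdbc(...) or appended as &user=...&password=... query parameters.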

return SparkManager.jdbcOptions().get("url") + "=" +
SparkManager.jdbcOptions().get("user") + "="
+ SparkManager.jdbcOptions().get("password");

}

How do I get values of type enum into the MySQL DB using Spark, or avoid this
"data truncated" error?

My only other thought would be to change the DB itself to use VARCHAR, but
the project leader is not too happy with that idea.

(Spark-ml) java.util.NoSuchElementException: key not found exception on doing prediction and computing test error.

2017-06-27 Thread neha nihal
Hi,

I am using Apache Spark 2.0.2 RandomForest ML (standalone mode) for text
classification. A TF-IDF feature extractor is also used. The training part
runs without any issues and returns 100% accuracy. But when I try to do
prediction using the trained model and compute the test error, it fails with a
java.util.NoSuchElementException: key not found exception.
Any help will be much appreciated.

Thanks & Regards


How do I find the time taken by each step in a stage in a Spark Job

2017-06-27 Thread SRK
Hi,

How do I find the time taken by each step in a stage in a Spark job? Also, how
do I find the bottleneck in each step, and whether a stage is skipped because
of the RDDs being persisted in streaming?

I am trying to identify which step in a job is taking time in my Streaming
job.
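
The Stages tab of the Spark UI already shows per-stage and per-task timings (and marks
skipped stages), but if you want the numbers programmatically, a minimal SparkListener
sketch (Spark 2.x; attach it with sc.addSparkListener) that prints each stage's
wall-clock time might look like this:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

class StageTimingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    // submissionTime/completionTime are Options; both are set once the stage finishes
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms, ${info.numTasks} tasks")
    }
  }
}

// sc.addSparkListener(new StageTimingListener)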

Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-find-the-time-taken-by-each-step-in-a-stage-in-a-Spark-Job-tp28796.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ZeroMQ Streaming in Spark2.x

2017-06-27 Thread Aashish Chaudhary
thanks, I was able to get it up and running.

One thing I am not entirely sure about is whether Bahir provides Python
bindings for ZeroMQ. Looking at the code it does not seem like it, but I might
be wrong.

thanks,


On Mon, Jun 26, 2017 at 5:13 PM Aashish Chaudhary <
aashish.chaudh...@kitware.com> wrote:

> Thanks. I saw it earlier but did not know whether this is the official way of
> doing Spark with ZeroMQ. Thanks, I will have a look.
> - Aashish
>
>
> On Mon, Jun 26, 2017 at 3:01 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> It's moved to http://bahir.apache.org/
>>
>> You can find document there.
>>
>> On Mon, Jun 26, 2017 at 11:58 AM, Aashish Chaudhary <
>> aashish.chaudh...@kitware.com> wrote:
>>
>>> Hi there,
>>>
>>> I am a beginner when it comes to Spark streaming. I was looking for some
>>> examples related to ZeroMQ and Spark and realized that ZeroMQUtils is no
>>> longer present in Spark 2.x.
>>>
>>> I would appreciate if someone can shed some light on the history and
>>> what I could do to use ZeroMQ with Spark Streaming in the current version.
>>>
>>> Thanks,
>>>
>>>
>>


Re: Question about Parallel Stages in Spark

2017-06-27 Thread satish lalam
Thanks Bryan. This is one Spark application with one job. This job has 3
stages. The first 2 are basic reads from cassandra tables and the 3rd is a
join between the two. I was expecting the first 2 stages to run in
parallel, however they run serially. Job has enough resources.
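
(If you ever do want the two reads to run as separate, concurrent jobs, e.g. by
materializing each table with its own action before the join, one option is to trigger
those actions from separate threads. A rough sketch only, where readTableA/readTableB
are placeholders for your Cassandra reads:)

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each cache()+count() is its own job; running them in Futures lets the
// scheduler see both at the same time.
val jobA = Future { readTableA().cache().count() }
val jobB = Future { readTableB().cache().count() }

Await.result(jobA, Duration.Inf)
Await.result(jobB, Duration.Inf)
// ...then run the join on the now-cached tables...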

On Tue, Jun 27, 2017 at 4:03 AM, Bryan Jeffrey 
wrote:

> Satish,
>
> Is this two separate applications submitted to the Yarn scheduler? If so
> then you would expect that you would see the original case run in parallel.
>
> However, if this is one application your submission to Yarn guarantees
> that this application will fairly  contend with resources requested by
> other applications. However, the internal output operations within your
> application (jobs) will be scheduled by the driver (running on a single
> AM). This means that whatever driver options and code you've set will
> impact the application, but the Yarn scheduler will not impact (beyond
> allocating cores, memory, etc. between applications.)
>
>
>
> Get Outlook for Android 
>
>
>
>
> On Tue, Jun 27, 2017 at 2:33 AM -0400, "satish lalam" <
> satish.la...@gmail.com> wrote:
>
> Thanks All. To reiterate - stages inside a job can be run parallely as
>> long as - (a) there is no sequential dependency (b) the job has sufficient
>> resources.
>> however, my code was launching 2 jobs and they are sequential as you
>> rightly pointed out.
>> The issue which I was trying to highlight with that piece of pseudocode
>> however was that - I am observing a job with 2 stages which dont depend on
>> each other (they both are reading data from 2 seperate tables in db), they
>> both are scheduled and both stages get resources - but the 2nd stage really
>> does not pick up until the 1st stage is complete. It might be due to the db
>> driver - I will post it to the right forum. Thanks.
>>
>> On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar 
>> wrote:
>>
>>> i think my words also misunderstood. My point is they will not submit
>>> together since they are the part of one thread.
>>>
>>> val spark =  SparkSession.builder()
>>>   .appName("practice")
>>>   .config("spark.scheduler.mode","FAIR")
>>>   .enableHiveSupport().getOrCreate()
>>> val sc = spark.sparkContext
>>> sc.parallelize(List(1.to(1000))).map(s=>Thread.sleep(1)).collect()
>>> sc.parallelize(List(1.to(1000))).map(s=>Thread.sleep(1)).collect()
>>> Thread.sleep(1000)
>>>
>>>
>>> I ran this and both spark submit time are different for both the jobs .
>>>
>>> Please let me if I am wrong
>>>
>>> On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>>>
 My words cause misunderstanding.
 Step 1:A is submited to spark.
 Step 2:B is submitted to spark.

 Spark gets two independent jobs.The FAIR  is used to schedule A and B.

 Jeffrey' code did not cause two submit.



 ---Original---
 *From:* "Pralabh Kumar"
 *Date:* 2017/6/27 12:09:27
 *To:* "萝卜丝炒饭"<1427357...@qq.com>;
 *Cc:* 
 "user";"satishl";"Bryan
 Jeffrey";
 *Subject:* Re: Question about Parallel Stages in Spark

 Hi

 I don't think so spark submit ,will receive two submits .  Its will
 execute one submit and then to next one .  If the application is
 multithreaded ,and two threads are calling spark submit and one time , then
 they will run parallel provided the scheduler is FAIR and task slots are
 available .

 But in one thread ,one submit will complete and then the another one
 will start . If there are independent stages in one job, then those will
 run parallel.

 I agree with Bryan Jeffrey .


 Regards
 Pralabh Kumar

 On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> I think the spark cluster receives two submits, A and B.
> The FAIR  is used to schedule A and B.
> I am not sure about this.
>
> ---Original---
> *From:* "Bryan Jeffrey"
> *Date:* 2017/6/27 08:55:42
> *To:* "satishl";
> *Cc:* "user";
> *Subject:* Re: Question about Parallel Stages in Spark
>
> Hello.
>
> The driver is running the individual operations in series, but each
> operation is parallelized internally.  If you want them run in parallel 
> you
> need to provide the driver a mechanism to thread the job scheduling out:
>
> val rdd1 = sc.parallelize(1 to 10)
> val rdd2 = sc.parallelize(1 to 20)
>
> var thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, 
> rdd2).zipWithIndex.par
>
> thingsToDo.foreach { case(rdd, index) =>
>   for(i <- (1 to 1))
> logger.info(s"Index ${index} - ${rdd.sum()}")
> }
>
>
> This will run both operations in parallel.

The functions countByValueAndWindow and foreachRDD on DStream: could you please help me understand them?

2017-06-27 Thread ??????????
Hi all,


I have code like below:
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

Logger.getLogger("org.apache.spark").setLevel( Level.ERROR)
//Logger.getLogger("org.apache.spark.streaming.dstream").setLevel( Level.DEBUG)
val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
//val sc = SparkContext.getOrCreate( conf)
val ssc = new StreamingContext(conf, Seconds(1))

ssc.checkpoint( "E:\\spark\\tmp\\cp")
val lines = ssc.socketTextStream("127.0.0.1", )  // port number elided in the original
lines.foreachRDD( r => {
  println("RDD" + r.id + "begin" + "   " + new SimpleDateFormat("yyyy-mm-dd  HH:MM:SS").format( new Date()))
  r.foreach( ele => println(":::" + ele))
  println("RDD" + r.id + "end")
})
lines.countByValueAndWindow( Seconds(4), Seconds(1)).foreachRDD( s => {  // here is key code
  println( "countByValueAndWindow RDD ID IS : " + s.id + "begin")
  println("time is " + new SimpleDateFormat("yyyy-mm-dd  HH:MM:SS").format( new Date()))
  s.foreach( e => println("data is " + e._1 + " :" + e._2))
  println("countByValueAndWindow RDD ID IS : " + s.id + "end")
})

ssc.start() // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
I run the code and use "nc" to send the messages manually. The speed at which I
input messages is about one letter per second. I know the times in the log do
not exactly equal the window duration, but I think they are very near. The
output and my comments are:
---
RDD1begin   2017-41-27  22:06:16
RDD1end
countByValueAndWindow RDD ID IS : 7  begin
time is 2017-41-27  22:06:16
countByValueAndWindow RDD ID IS : 7  end
RDD8begin   2017-41-27  22:06:17
RDD8end
countByValueAndWindow RDD ID IS : 13  begin
time is 2017-41-27  22:06:17
countByValueAndWindow RDD ID IS : 13  end
RDD14begin   2017-41-27  22:06:18
:::1
RDD14end
countByValueAndWindow RDD ID IS : 19  begin
time is 2017-41-27  22:06:18   <== data from 22:06:15 -- 22:06:18 is in RDD 14.
data is 1 :1
countByValueAndWindow RDD ID IS : 19  end
RDD20begin   2017-41-27  22:06:19
:::2
RDD20end
countByValueAndWindow RDD ID IS : 25  begin
time is 2017-41-27  22:06:19   <== data from 22:06:16 -- 22:06:19 is in RDD 14, 20.
data is 1 :1
data is 2 :1
countByValueAndWindow RDD ID IS : 25  end
RDD26begin   2017-41-27  22:06:20
:::3
RDD26end
countByValueAndWindow RDD ID IS : 31  begin
time is 2017-41-27  22:06:20   <== data from 22:06:17 -- 22:06:20 is in RDD 14, 20, 26
data is 2 :1
data is 1 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 31  end
RDD32begin   2017-41-27  22:06:21
:::4
RDD32end
countByValueAndWindow RDD ID IS : 37  begin
time is 2017-41-27  22:06:21   <== data from 22:06:18 -- 22:06:21 is in RDD 14, 20, 26, 32
data is 2 :1
data is 1 :1
data is 4 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 37  end
RDD38begin   2017-41-27  22:06:22
:::5
:::6
RDD38end
countByValueAndWindow RDD ID IS : 43  begin
time is 2017-41-27  22:06:22   <== data from 22:06:19 -- 22:06:22 is in RDD 20, 26, 32, 38. Here 14 is out of the window.
data is 4 :1
data is 5 :1
data is 6 :1
data is 2 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 43  end
RDD44begin   2017-41-27  22:06:23
:::7
RDD44end
countByValueAndWindow RDD ID IS : 49  begin
time is 2017-41-27  22:06:23   <== data from 22:06:20 -- 22:06:23 is in RDD 26, 32, 38, 44. Here 20 is out of the window.
data is 5 :1
data is 4 :1
data is 6 :1
data is 7 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 49  end
---
I think the foreachRDD function outputs the last RDD calculated by
countByValueAndWindow, and the above log validates my idea. Now I change the
line marked "here is key code" to

lines.countByValueAndWindow( Seconds(4), Seconds(6)).foreachRDD( s => {  // here is key code

so the slide duration is 6 seconds. The log and my comments are below:
---
RDD1begin   2017-59-27  10:59:12
RDD1end
RDD2begin   2017-59-27  10:59:13
:::1
:::2
RDD2end
RDD3begin   2017-59-27  10:59:14
:::3
RDD3end
RDD4begin   2017-59-27  10:59:15
:::4
RDD4end
RDD5begin   2017-59-27  10:59:16
:::5
RDD5end
RDD6begin   2017-59-27  10:59:17
RDD6end
countByValueAndWindow RDD ID IS : 22  begin
time is 2017-59-27  10:59:17   <== I think here it is OK, even RDD2 is calculated.
data is 4 :1
data is 5 :1
data is 1 :1
data is 2 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 22  end
RDD23begin   2017-59-27  10:59:18
:::6
RDD23end
RDD24begin   2017-59-27  10:59:19
:::8
:::7
RDD24end
RDD25begin   2017-59-27  10:59:20
:::9
RDD25end
RDD26begin   2017-59-27  10:59:21
:::0
RDD26end
RDD27begin   2017-59-27  10:59:22
:::-
RDD27end
RDD28begin   2017-59-27  10:59:23
:::p
RDD28end
countByValueAndWindow RDD ID IS : 43  begin
time is 2017-59-27  10:59:23   <== the data between 10:59:20 -- 10:59:23 should be RDD 25, 26, 27, 28, but the data is wrong.
data is 6 :1
data is 2 :1
data is 9 :1
data is - :1
data is 1 :1
data is 8 :1
data is p :1

[ML] Stop conditions for RandomForest

2017-06-27 Thread OBones

Hello,

Reading around on the theory behind tree-based regression, I concluded
that there are various reasons to stop exploring the tree when a given
node has been reached. Among these are the following two:


1. When starting to process a node, if its size (row count) is less than 
X then consider it a leaf
2. When a split for a node is considered, if any side of the split has 
its size less than Y, then ignore it when selecting the best split


As an example, let's consider a node with 45 rows, that for a given 
split creates two children, containing 5 and 35 rows respectively.


If I set X to 50, then the node is a leaf and no split is attempted.
If I set X to 10 and Y to 15, then the splits are computed, but because
one of them has fewer than 15 rows, that split is ignored.


I'm using DecisionTreeRegressor and RandomForestRegressor on our data 
and because the former is implemented using the latter, they both share 
the same parameters.
Going through those parameters, I found minInstancesPerNode which to me 
is the Y value, but I could not find any parameter for the X value.
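
For reference, a minimal sketch of the stop conditions that do seem to be exposed on
RandomForestRegressor in the Spark 2.x ML API (column names are placeholders).
minInstancesPerNode behaves like Y, i.e. a split is discarded if either child would get
fewer rows than it, while maxDepth and minInfoGain are the other available stops:

import org.apache.spark.ml.regression.RandomForestRegressor

val rf = new RandomForestRegressor()
  .setFeaturesCol("features")        // placeholder column names
  .setLabelCol("label")
  .setNumTrees(100)
  .setMaxDepth(10)                   // hard depth limit
  .setMinInstancesPerNode(15)        // the "Y" in the example above
  .setMinInfoGain(0.0)               // minimum gain required to keep a split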

Have I missed something?
If not, would there be a way to implement this?

Regards



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



proxy on spark UI

2017-06-27 Thread Soheila S.
Hi all,
I am using Hadoop 2.6.5 and Spark 2.1.0 and run a job using spark-submit
with the master set to "yarn". When Spark starts, I can load the Spark UI page
on port 4040, but no job is shown in the page. After the following logs
(registering the application master on YARN), the Spark UI is not accessible
anymore, even from the tracking UI (ApplicationMaster) link in the cluster UI.

The URL (http://z401:4040) is redirected to a new one (
http://z401:8088/proxy/application_1498135277395_0009) and cannot be
reached.

Any idea?

Thanks a lot in advance.

17/06/23 12:35:45 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster registered as NettyRpcEndpointRef(null)

17/06/23 12:35:45 INFO cluster.YarnClientSchedulerBackend: Add WebUI
Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter,
Map(PROXY_HOSTS -> z401, PROXY_URI_BASES ->
http://z401:8088/proxy/application_1498135277395_0009),
/proxy/application_1498135277395_0009

17/06/23 12:35:45 INFO ui.JettyUtils: Adding filter:
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter

17/06/23 12:35:45 INFO yarn.Client: Application report for
application_1498135277395_0009 (state: RUNNING)


What is the purpose of having RDD.context and RDD.sparkContext at the same time?

2017-06-27 Thread Sergey Zhemzhitsky
Hello spark gurus,

Could you please shed some light on what is the purpose of having two
identical functions in RDD,
RDD.context [1] and RDD.sparkContext [2].

RDD.context seems to be used more frequently across the source code.

[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1693
[2]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L146
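
For what it's worth, the two definitions at the links above are (paraphrased here; exact
line numbers drift between versions) just aliases of the same private field, so they
behave identically:

// inside RDD.scala
def sparkContext: SparkContext = sc
def context: SparkContext = sc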

Kind Regards,
Sergey


PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-27 Thread John Omernik
Hello all, I am running PySpark 2.1.1 as a user, jomernik. I am working
through some documentation here:

https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests

I was working on the Random Forest Classification example, and found it to be
working!  That said, when I try to save the model to my HDFS (MapR-FS in my
case), I get a weird error:

I tried to save here:

model.save(sc,
"maprfs:///user/jomernik/tmp/myRandomForestClassificationModel")

/user/jomernik is my user directory and I have full access to the
directory.



All the directories down to

/user/jomernik/tmp/myRandomForestClassificationModel/metadata/_temporary/0
are owned by me with full permissions, but when I get to this directory,
here is the ls

$ ls -ls

total 1

1 drwxr-xr-x 2 root root 1 Jun 27 07:38 task_20170627123834_0019_m_00

0 drwxr-xr-x 2 root root 0 Jun 27 07:38 _temporary

Am I doing something wrong here? Why is the temp stuff owned by root? Is
there a bug in saving things due to this ownership?

John






Exception:
Py4JJavaError: An error occurred while calling o338.save.
: org.apache.hadoop.security.AccessControlException: User jomernik(user id
101) does has been denied access to rename
 
/user/jomernik/tmp/myRandomForestClassificationModel/metadata/_temporary/0/task_20170627123834_0019_m_00/part-0
to /user/jomernik/tmp/myRandomForestClassificationModel/metadata/part-0
at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:1112)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:461)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:475)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:392)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:364)
at
org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:111)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1227)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1168)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1071)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1037)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1037)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1037)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:963)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:963)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:963)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:962)
at
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1489)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1468)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1468)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1468)
at
org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$.save(treeEnsembleModels.scala:440)
at
org.apache.spark.mllib.tree.model.RandomForestModel.save(treeEnsembleModels.scala:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at 

Re: Question about Parallel Stages in Spark

2017-06-27 Thread Bryan Jeffrey
Satish,

Is this two separate applications submitted to the Yarn scheduler? If so then
you would expect that you would see the original case run in parallel.

However, if this is one application your submission to Yarn guarantees that
this application will fairly contend with resources requested by other
applications. However, the internal output operations within your application
(jobs) will be scheduled by the driver (running on a single AM). This means
that whatever driver options and code you've set will impact the application,
but the Yarn scheduler will not impact (beyond allocating cores, memory, etc.
between applications.)

Get Outlook for Android

On Tue, Jun 27, 2017 at 2:33 AM -0400, "satish lalam"  wrote:

Thanks All. To reiterate - stages inside a job can be run parallely as long as 
- (a) there is no sequential dependency (b) the job has sufficient resources. 
however, my code was launching 2 jobs and they are sequential as you rightly 
pointed out.The issue which I was trying to highlight with that piece of 
pseudocode however was that - I am observing a job with 2 stages which dont 
depend on each other (they both are reading data from 2 seperate tables in db), 
they both are scheduled and both stages get resources - but the 2nd stage 
really does not pick up until the 1st stage is complete. It might be due to the 
db driver - I will post it to the right forum. Thanks.
On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar  wrote:
i think my words also misunderstood. My point is they will not submit together 
since they are the part of one thread.  
val spark =  SparkSession.builder()
  .appName("practice")
  .config("spark.scheduler.mode","FAIR")
  .enableHiveSupport().getOrCreate()
val sc = spark.sparkContext
sc.parallelize(List(1.to(1000))).map(s=>Thread.sleep(1)).collect()
sc.parallelize(List(1.to(1000))).map(s=>Thread.sleep(1)).collect()
Thread.sleep(1000)
I ran this and both spark submit time are different for both the jobs .
Please let me if I am wrong
On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
My words cause misunderstanding.
Step 1: A is submited to spark.
Step 2: B is submitted to spark.

Spark gets two independent jobs. The FAIR is used to schedule A and B.

Jeffrey' code did not cause two submit.


---Original---
From: "Pralabh Kumar"
Date: 2017/6/27 12:09:27
To: "萝卜丝炒饭"<1427357...@qq.com>;
Cc: "user";"satishl";"Bryan Jeffrey";
Subject: Re: Question about Parallel Stages in Spark

Hi 
I don't think so spark submit ,will receive two submits .  Its will execute one 
submit and then to next one .  If the application is multithreaded ,and two 
threads are calling spark submit and one time , then they will run parallel 
provided the scheduler is FAIR and task slots are available . 
But in one thread ,one submit will complete and then the another one will start 
. If there are independent stages in one job, then those will run parallel.
I agree with Bryan Jeffrey .

RegardsPralabh Kumar
On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
I think the spark cluster receives two submits, A and B.
The FAIR is used to schedule A and B.
I am not sure about this.

---Original---
From: "Bryan Jeffrey"
Date: 2017/6/27 08:55:42
To: "satishl";
Cc: "user";
Subject: Re: Question about Parallel Stages in Spark

Hello.
The driver is running the individual operations in series, but each operation 
is parallelized internally.  If you want them run in parallel you need to 
provide the driver a mechanism to thread the job scheduling out:
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(1 to 20)

var thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par

thingsToDo.foreach { case(rdd, index) =>
  for(i <- (1 to 1))
logger.info(s"Index ${index} - ${rdd.sum()}")
}
This will run both operations in parallel.

On Mon, Jun 26, 2017 at 8:10 PM, satishl  wrote:
For the below code, since rdd1 and rdd2 dont depend on each other - i was
expecting that both first and second printlns would be interwoven. However -
the spark job runs all "first " statements first and then all "seocnd"
statements next in serial fashion. I have set spark.scheduler.mode = FAIR.
obviously my understanding of parallel stages is wrong. What am I missing?

    val rdd1 = sc.parallelize(1 to 100)
    val rdd2 = sc.parallelize(1 to 100)

    for (i <- (1 to 100))
      println("first: " + rdd1.sum())
    for (i <- (1 to 100))
      println("second" + rdd2.sum())



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Parallel-Stages-in-Spark-tp28793.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Question about Parallel Stages in Spark

2017-06-27 Thread satish lalam
Thanks All. To reiterate - stages inside a job can be run parallely as long
as - (a) there is no sequential dependency (b) the job has sufficient
resources.
however, my code was launching 2 jobs and they are sequential as you
rightly pointed out.
The issue which I was trying to highlight with that piece of pseudocode
however was that - I am observing a job with 2 stages which dont depend on
each other (they both are reading data from 2 seperate tables in db), they
both are scheduled and both stages get resources - but the 2nd stage really
does not pick up until the 1st stage is complete. It might be due to the db
driver - I will post it to the right forum. Thanks.

On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar 
wrote:

> i think my words also misunderstood. My point is they will not submit
> together since they are the part of one thread.
>
> val spark =  SparkSession.builder()
>   .appName("practice")
>   .config("spark.scheduler.mode","FAIR")
>   .enableHiveSupport().getOrCreate()
> val sc = spark.sparkContext
> sc.parallelize(List(1.to(1000))).map(s=>Thread.sleep(1)).collect()
> sc.parallelize(List(1.to(1000))).map(s=>Thread.sleep(1)).collect()
> Thread.sleep(1000)
>
>
> I ran this and both spark submit time are different for both the jobs .
>
> Please let me if I am wrong
>
> On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>
>> My words cause misunderstanding.
>> Step 1:A is submited to spark.
>> Step 2:B is submitted to spark.
>>
>> Spark gets two independent jobs.The FAIR  is used to schedule A and B.
>>
>> Jeffrey' code did not cause two submit.
>>
>>
>>
>> ---Original---
>> *From:* "Pralabh Kumar"
>> *Date:* 2017/6/27 12:09:27
>> *To:* "萝卜丝炒饭"<1427357...@qq.com>;
>> *Cc:* "user";"satishl";"Bryan
>> Jeffrey";
>> *Subject:* Re: Question about Parallel Stages in Spark
>>
>> Hi
>>
>> I don't think so spark submit ,will receive two submits .  Its will
>> execute one submit and then to next one .  If the application is
>> multithreaded ,and two threads are calling spark submit and one time , then
>> they will run parallel provided the scheduler is FAIR and task slots are
>> available .
>>
>> But in one thread ,one submit will complete and then the another one will
>> start . If there are independent stages in one job, then those will run
>> parallel.
>>
>> I agree with Bryan Jeffrey .
>>
>>
>> Regards
>> Pralabh Kumar
>>
>> On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>>
>>> I think the spark cluster receives two submits, A and B.
>>> The FAIR  is used to schedule A and B.
>>> I am not sure about this.
>>>
>>> ---Original---
>>> *From:* "Bryan Jeffrey"
>>> *Date:* 2017/6/27 08:55:42
>>> *To:* "satishl";
>>> *Cc:* "user";
>>> *Subject:* Re: Question about Parallel Stages in Spark
>>>
>>> Hello.
>>>
>>> The driver is running the individual operations in series, but each
>>> operation is parallelized internally.  If you want them run in parallel you
>>> need to provide the driver a mechanism to thread the job scheduling out:
>>>
>>> val rdd1 = sc.parallelize(1 to 10)
>>> val rdd2 = sc.parallelize(1 to 20)
>>>
>>> var thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, 
>>> rdd2).zipWithIndex.par
>>>
>>> thingsToDo.foreach { case(rdd, index) =>
>>>   for(i <- (1 to 1))
>>> logger.info(s"Index ${index} - ${rdd.sum()}")
>>> }
>>>
>>>
>>> This will run both operations in parallel.
>>>
>>>
>>> On Mon, Jun 26, 2017 at 8:10 PM, satishl  wrote:
>>>
 For the below code, since rdd1 and rdd2 dont depend on each other - i
 was
 expecting that both first and second printlns would be interwoven.
 However -
 the spark job runs all "first " statements first and then all "seocnd"
 statements next in serial fashion. I have set spark.scheduler.mode =
 FAIR.
 obviously my understanding of parallel stages is wrong. What am I
 missing?

 val rdd1 = sc.parallelize(1 to 100)
 val rdd2 = sc.parallelize(1 to 100)

 for (i <- (1 to 100))
   println("first: " + rdd1.sum())
 for (i <- (1 to 100))
   println("second" + rdd2.sum())



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Question-about-Parallel-Stages-in-Spar
 k-tp28793.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>


Re: gfortran runtime library for Spark

2017-06-27 Thread Saroj C
Thanks a lot.

Thanks & Regards
Saroj Kumar Choudhury
Tata Consultancy Services
(UNIT-I)- KALINGA PARK
IT/ITES  SPECIAL ECONOMIC ZONE (SEZ),PLOT NO. 35,
CHANDAKA INDUSTRIAL ESTATE, PATIA,
Bhubaneswar - 751 024,Orissa
India
Ph:- +91 674 664 5154
Mailto: saro...@tcs.com
Website: http://www.tcs.com

Experience certainty.   IT Services
Business Solutions
Consulting





From:   Yanbo Liang 
To: Saroj C 
Cc: "user@spark.apache.org" 
Date:   06/24/2017 07:38 AM
Subject:Re: gfortran runtime library for Spark



gfortran runtime library is still required for Spark 2.1 for better 
performance.
If it's not present on your nodes, you will see a warning message and a 
pure JVM implementation will be used instead, but you will not get the 
best performance.

Thanks
Yanbo

On Wed, Jun 21, 2017 at 5:30 PM, Saroj C  wrote:
Dear All, 
 Can you please let me know if the gfortran runtime library is still required
for Spark 2.1 for better performance? Note, I am using the Java APIs for
Spark.

Thanks & Regards
Saroj 
=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you