Re: Fwd: Model weights of linear regression become abnormal values

2015-05-29 Thread Petar Zecevic


You probably need to scale the values in the data set so that they all have 
comparable ranges, and translate them so that their means are shifted to 0.


You can use the pyspark.mllib.feature.StandardScaler(True, True) object for 
that.
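
For reference, a minimal sketch of the same idea in the Scala API (the PySpark call is analogous); `data` is an assumed RDD[LabeledPoint] parsed from the attached data set, and the numIterations/stepSize values are arbitrary:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

// Standardize features to zero mean and unit variance before running SGD.
// Note: withMean = true requires dense feature vectors.
def trainScaled(data: RDD[LabeledPoint]) = {
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
  val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()
  LinearRegressionWithSGD.train(scaled, 100, 0.01) // numIterations, stepSize: tune for your data
}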


On 28.5.2015. 6:08, Maheshakya Wijewardena wrote:


Hi,

I'm trying to use Spark's *LinearRegressionWithSGD* in PySpark with 
the attached dataset. The code is attached. When I check the model 
weights vector after training, it contains `nan` values.

[nan,nan,nan,nan,nan,nan,nan,nan]
But for some data sets, this problem does not occur. What might be the reason 
for this?
Is this an issue with the data I'm using or a bug?
Best regards.
--
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 Lanka (Pvt) Ltd
Email: mahesha...@wso2.com 
Mobile: +94711228855




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark streaming with kafka

2015-05-29 Thread Akhil Das
Just after receiving the data from Kafka, you can do a
dstream.count().print() to verify that Spark and Kafka are not the problem.
The next step would be to identify where the problem is: do the same count
and print on each of the DStreams that you create (by transforming), and
finally check the Elasticsearch indexing piece, where you need to make sure
the connection is not the problem. Maybe you can create a simple RDD of
Map("id" -> "1", "Something" -> "Something") and call .saveToEs on it to
make sure it is getting written into ES.
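
A rough Scala sketch of those checks (the stream names and the "spark/docs" index/type are placeholders; it assumes the elasticsearch-hadoop artifact is on the classpath and es.nodes is set in the SparkConf):

import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream
import org.elasticsearch.spark._

def debugPipeline(sc: SparkContext, kafkaLines: DStream[String], transformed: DStream[String]): Unit = {
  // 1. Non-zero counts here mean Spark and Kafka are not the problem.
  kafkaLines.count().print()
  // 2. Repeat the same count/print after each transformation you apply.
  transformed.count().print()
  // 3. Check the Elasticsearch side in isolation with a trivial RDD.
  val probe = sc.makeRDD(Seq(Map("id" -> "1", "Something" -> "Something")))
  probe.saveToEs("spark/docs")
}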

Thanks
Best Regards

On Thu, May 28, 2015 at 7:03 PM, boci  wrote:

> Hi guys,
>
> I'm using Spark Streaming with Kafka... On my local machine (started as a java
> application without using spark-submit) it works: it connects to Kafka and does
> the job (*). I tried to put it into a Spark Docker container (hadoop 2.6, spark
> 1.3.1, tried spark-submit with local[5] and yarn-client too) but without
> success...
>
> No errors on the console (the application started). I see something
> received from Kafka, but the result is not written out to Elasticsearch...
>
> Where can I start debugging? I see two jobs in the Spark console, both
> 0/1...
>
> Thanks
>
> --
> Skype: boci13, Hangout: boci.b...@gmail.com
>


FW: Websphere MQ as a data source for Apache Spark Streaming

2015-05-29 Thread Chaudhary, Umesh
Hi,
I have written a manual receiver for Websphere MQ and it's working fine.
If I do JavaDStream.saveAsTextFile(“/home/user/out.txt”), then it generates a 
directory named out.txt with a timestamp appended.
In this directory only a _SUCCESS file is present. I can see data on the console 
while running in local mode, but I am not able to save it as a text file.
Is there any other way to save streaming data?

From: Chaudhary, Umesh
Sent: Tuesday, May 26, 2015 2:39 AM
To: 'Arush Kharbanda'; user@spark.apache.org
Subject: RE: Websphere MQ as a data source for Apache Spark Streaming

Thanks for the suggestion, I will try and post the outcome.

From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 12:24 PM
To: Chaudhary, Umesh; user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can write a custom receiver for Websphere MQ, using the API for Websphere 
MQ.

https://spark.apache.org/docs/latest/streaming-custom-receivers.html
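
For illustration, a rough sketch of such a receiver. The Spark side (Receiver, store, restart) follows the guide linked above; the Websphere MQ side uses IBM's "MQ classes for Java" only schematically, so the class names, open/get options and the message decoding below are assumptions that depend on your MQ version and setup:

import com.ibm.mq.{MQEnvironment, MQGetMessageOptions, MQMessage, MQQueueManager}
import com.ibm.mq.constants.CMQC
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class WebsphereMQReceiver(host: String, port: Int, channel: String,
                          queueManager: String, queueName: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread, as recommended by the custom receiver guide.
    new Thread("Websphere MQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = { /* the receive() loop exits once isStopped() is true */ }

  private def receive(): Unit = {
    var qMgr: MQQueueManager = null
    try {
      MQEnvironment.hostname = host
      MQEnvironment.port = port
      MQEnvironment.channel = channel
      qMgr = new MQQueueManager(queueManager)
      val queue = qMgr.accessQueue(queueName, CMQC.MQOO_INPUT_AS_Q_DEF)
      val gmo = new MQGetMessageOptions()
      gmo.options = CMQC.MQGMO_WAIT
      gmo.waitInterval = 5000 // wait up to 5s for a message; no message raises an MQException
      while (!isStopped()) {
        val msg = new MQMessage()
        queue.get(msg, gmo)
        val body = new Array[Byte](msg.getMessageLength)
        msg.readFully(body)
        store(new String(body, "UTF-8")) // hand the payload to Spark
      }
      queue.close()
      qMgr.disconnect()
    } catch {
      case e: Exception => restart("Error receiving from Websphere MQ, restarting", e)
    }
  }
}

// Usage: val mqStream = ssc.receiverStream(new WebsphereMQReceiver(host, port, channel, qmgr, queue))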

Thanks
Arush

On Mon, May 25, 2015 at 8:04 PM, Chaudhary, Umesh <umesh.chaudh...@searshc.com> wrote:
I have seen it, but it has a different configuration for connecting to MQ.
I mean, for Websphere MQ we need the host, queue manager, channel and queue name, 
but here, according to the MQTT protocol,

client = new MqttClient(brokerUrl, MqttClient.generateClientId(), persistence)

It only expects a broker URL, which is not appropriate for establishing a 
connection with Websphere MQ.

Please suggest!


From: Arush Kharbanda 
[mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 6:29 AM
To: Chaudhary, Umesh
Cc: user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can connect to Spark Streaming with MQTT; refer to the example.

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala
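
Condensed from that example, the MQTT route looks roughly like this (broker URL and topic are placeholders, `sc` is an existing SparkContext, and it requires the spark-streaming-mqtt artifact; it only applies if your MQ installation actually exposes an MQTT endpoint):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

val ssc = new StreamingContext(sc, Seconds(10))
// Subscribe to a topic on an MQTT broker and print the received lines.
val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "my/topic", StorageLevel.MEMORY_AND_DISK_SER_2)
lines.print()
ssc.start()
ssc.awaitTermination()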



Thanks
Arush



On Mon, May 25, 2015 at 3:43 PM, umesh9794 <umesh.chaudh...@searshc.com> wrote:
I was digging into the possibilities for Websphere MQ as a data source for
spark-streaming because it is needed in one of our use cases. I got to know
that MQTT is the protocol that supports communication with MQ data
structures, but since I am a newbie to Spark Streaming I need some working
examples. Did anyone try to connect MQ with Spark Streaming? Please advise
on the best way of doing so.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Websphere-MQ-as-a-data-source-for-Apache-Spark-Streaming-tp23013.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



--


Arush Kharbanda || Technical Teamlead

ar...@sigmoidanalytics.com || 
www.sigmoidanalytics.com
This message, including any attachments, is the property of Sears Holdings 
Corporation and/or one of its subsidiaries. It is confidential and may contain 
proprietary or legally privileged information. If you are not the intended 
recipient, please delete it without reading the contents. Thank you.



--


Arush Kharbanda || Technical Teamlead

ar...@sigmoidanalytics.com || 
www.sigmoidanalytics.com

This message, including any attachments, is the property of Sears Holdings 
Corporation and/or one of its subsidiaries. It is confidential and may contain 
proprietary or legally privileged information. If you are not the intended 
recipient, please delete it without reading the contents. Thank you.


Re: FW: Websphere MQ as a data source for Apache Spark Streaming

2015-05-29 Thread Tathagata Das
Are you sure that the data can be saved as strings?
Another, more controlled approach is to use DStream.foreachRDD, which takes a
Function2 parameter (an RDD and a Time). There you can explicitly do stuff with
the RDD: save it to separate files (separated by time), or whatever. That might
help you debug what is going on.
It might also help if you shared the streaming program in a pastebin.
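
A minimal Scala sketch of that approach (the Java Function2<JavaRDD<String>, Time, Void> version is analogous; the output path prefix is a placeholder):

import org.apache.spark.streaming.dstream.DStream

def saveByBatch(lines: DStream[String]): Unit = {
  lines.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      // One output directory per batch, e.g. /home/user/out-1432888800000
      rdd.saveAsTextFile(s"/home/user/out-${time.milliseconds}")
    }
  }
}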

TD


On Fri, May 29, 2015 at 12:55 AM, Chaudhary, Umesh <
umesh.chaudh...@searshc.com> wrote:

>  Hi,
>
> I have written  manual receiver for Websphere MQ and its working fine.
>
> If I am doing *JavaDStream.SaveAsTextFile(“/home/user/out.txt”) * then
> its generating a directory naming *out.txt* appending its timestamp.
>
> In this directory only *_SUCCESS* file is present. I can see data on
> console while running in local mode but not able to save it as text file.
>
> Is there any other way for saving streaming data?
>
>
>
> *From:* Chaudhary, Umesh
> *Sent:* Tuesday, May 26, 2015 2:39 AM
> *To:* 'Arush Kharbanda'; user@spark.apache.org
> *Subject:* RE: Websphere MQ as a data source for Apache Spark Streaming
>
>
>
> Thanks for the suggestion, I will try and post the outcome.
>
>
>
> *From:* Arush Kharbanda [mailto:ar...@sigmoidanalytics.com
> ]
> *Sent:* Monday, May 25, 2015 12:24 PM
> *To:* Chaudhary, Umesh; user@spark.apache.org
> *Subject:* Re: Websphere MQ as a data source for Apache Spark Streaming
>
>
>
> Hi Umesh,
>
>
>
> You can write a customer receiver for Websphere MQ, using the API for
> websphere MQ.
>
>
>
> https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>
>
>
> Thanks
>
> Arush
>
>
>
> On Mon, May 25, 2015 at 8:04 PM, Chaudhary, Umesh <
> umesh.chaudh...@searshc.com> wrote:
>
> I have seen it but it has different configuration for connecting the MQ.
>
> I mean for Websphere MQ we need Host, Queue Manager, Channel And Queue
> Name but here according to MQTT protocol
>
>
>
> client = new MqttClient(brokerUrl, MqttClient.generateClientId(),
> persistence)
>
>
>
> It only expects Broker URL which is in appropriate for establishing
> connection with Websphere MQ.
>
>
>
> Please Suggest !
>
>
>
>
>
> *From:* Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
> *Sent:* Monday, May 25, 2015 6:29 AM
> *To:* Chaudhary, Umesh
> *Cc:* user@spark.apache.org
> *Subject:* Re: Websphere MQ as a data source for Apache Spark Streaming
>
>
>
> Hi Umesh,
>
>
>
> You can connect to Spark Streaming with MQTT  refer to the example.
>
>
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala
>
>
>
>
>
>
>
> Thanks
>
> Arush
>
>
>
>
>
>
>
> On Mon, May 25, 2015 at 3:43 PM, umesh9794 
> wrote:
>
> I was digging into the possibilities for Websphere MQ as a data source for
> spark-streaming becuase it is needed in one of our use case. I got to know
> that  MQTT    is the protocol that supports the
> communication from MQ data structures but since I am a newbie to spark
> streaming I need some working examples for the same. Did anyone try to
> connect the MQ with spark streaming. Please devise the best way for doing
> so.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Websphere-MQ-as-a-data-source-for-Apache-Spark-Streaming-tp23013.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
> --
>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>
> This message, including any attachments, is the property of Sears Holdings
> Corporation and/or one of its subsidiaries. It is confidential and may
> contain proprietary or legally privileged information. If you are not the
> intended recipient, please delete it without reading the contents. Thank
> you.
>
>
>
>
>
> --
>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>This message, including any attachments, is the property of Sears
> Holdings Corporation and/or one of its subsidiaries. It is confidential and
> may contain proprietary or legally privileged information. If you are not
> the intended recipient, please delete it without reading the contents.
> Thank you.
>


Re: Recommended Scala version

2015-05-29 Thread Alex Nakos
Hi-

I’ve just built the latest spark RC from source (1.4.0 RC3) and can confirm
that the spark shell is still NOT working properly on 2.11. No classes in
the jar I've specified with the --jars argument on the command line are
available in the REPL.


Cheers
Alex

On Thu, May 28, 2015 at 8:38 AM, Tathagata Das  wrote:

> Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming
> out soon) with Scala 2.11 and report issues.
>
> TD
>
> On Tue, May 26, 2015 at 9:15 AM, Koert Kuipers  wrote:
>
>> we are still running into issues with spark-shell not working on 2.11,
>> but we are running on somewhat older master so maybe that has been resolved
>> already.
>>
>> On Tue, May 26, 2015 at 11:48 AM, Dean Wampler 
>> wrote:
>>
>>> Most of the 2.11 issues are being resolved in Spark 1.4. For a while,
>>> the Spark project has published maven artifacts that are compiled with 2.11
>>> and 2.10, although the downloads at
>>> http://spark.apache.org/downloads.html are still all for 2.10.
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>>  (O'Reilly)
>>> Typesafe 
>>> @deanwampler 
>>> http://polyglotprogramming.com
>>>
>>> On Tue, May 26, 2015 at 10:33 AM, Ritesh Kumar Singh <
>>> riteshoneinamill...@gmail.com> wrote:
>>>
 Yes, the recommended version is 2.10, as not all features are supported by
 the 2.11 version. The Kafka libraries and JDBC components are yet to be
 ported to 2.11. So if your project doesn't depend on these components,
 you can give 2.11 a try.

 Here's a link for building with the 2.11 version.

 You won't run into any issues if you use 2.10 as of now. But then again,
 future releases will have to shift to 2.11 once support for 2.10 ends in
 the long run.


 On Tue, May 26, 2015 at 8:21 PM, Punyashloka Biswal <
 punya.bis...@gmail.com> wrote:

> Dear Spark developers and users,
>
> Am I correct in believing that the recommended version of Scala to use
> with Spark is currently 2.10? Is there any plan to switch to 2.11 in
> future? Are there any advantages to using 2.11 today?
>
> Regards,
> Punya



>>>
>>
>


Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-29 Thread SparknewUser
I've tried several different pairs of parameters for my
LogisticRegressionWithSGD and here are my results.
My numIterations varies from 100 to 500 in steps of 50 and my stepSize varies
from 0.1 to 1 in steps of 0.1.
The last row shows the maximum of each column and the last column the maximum
of each row, and we see both growth and diminution. What is the logic?

My maximum is for the pair (numIterations, stepSize) = (200, 0.4).

numIter \ stepSize   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0   | row max
100                  0.67   0.69   0.50   0.48   0.50   0.69   0.70   0.50   0.66   0.55  | 0.70
150                  0.50   0.51   0.50   0.50   0.50   0.50   0.53   0.50   0.53   0.68  | 0.68
200                  0.67   0.71   0.64   0.74   0.50   0.70   0.71   0.71   0.50   0.50  | 0.74
250                  0.50   0.50   0.55   0.50   0.50   0.50   0.73   0.55   0.50   0.50  | 0.73
300                  0.67   0.50   0.50   0.67   0.50   0.67   0.72   0.48   0.66   0.67  | 0.72
350                  0.71   0.60   0.66   0.50   0.51   0.50   0.66   0.62   0.66   0.71  | 0.71
400                  0.51   0.54   0.71   0.67   0.62   0.50   0.50   0.50   0.51   0.50  | 0.71
450                  0.51   0.50   0.50   0.51   0.50   0.50   0.66   0.51   0.50   0.50  | 0.66
500                  0.51   0.64   0.50   0.50   0.51   0.49   0.66   0.67   0.54   0.51  | 0.67
column max           0.71   0.71   0.71   0.74   0.62   0.70   0.73   0.71   0.66   0.71  |
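
For context, a Scala sketch of the kind of grid search that produces such a table. It assumes the metric is AUC on a held-out test set computed with BinaryClassificationMetrics; the train/test split and the miniBatchFraction of 1.0 are assumptions, not details from the original post:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def gridSearch(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Seq[((Int, Double), Double)] = {
  val numIterGrid = 100 to 500 by 50
  val stepSizeGrid = (1 to 10).map(_ / 10.0)
  for (numIter <- numIterGrid; step <- stepSizeGrid) yield {
    val model = LogisticRegressionWithSGD.train(training, numIter, step, 1.0)
    model.clearThreshold() // return raw scores so the AUC can be computed
    val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
    val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
    ((numIter, step), auc)
  }
}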



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-best-performance-with-LogisticRegressionWithSGD-tp23053p23082.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Exception writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-05-29 Thread Antonio Giambanco
Hi all,
I have Spark 1.3.1 and Cassandra 2.0.14 installed on a single server.
I'm coding a simple Java class for Spark Streaming as follows:

   - reading header events from the Flume sink
   - based on the header, I write the event body to the navigation or transaction
   table (Cassandra)

Unfortunately I get a NoHostAvailableException; if I comment out the code for
saving to one of the two tables, everything works.


*here the code*

 public static void main(String[] args) {

// Create a local StreamingContext with two working thread and
batch interval of 1 second
 SparkConf conf = new
SparkConf().setMaster("local[2]").setAppName("DWXNavigationApp");

 conf.set("spark.cassandra.connection.host", "127.0.0.1");
 conf.set("spark.cassandra.connection.native.port","9042");
 conf.set("spark.cassandra.output.batch.size.rows", "1");
 conf.set("spark.cassandra.output.concurrent.writes", "1");


 final JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(1));

 JavaReceiverInputDStream flumeStreamNavig =
FlumeUtils.createPollingStream(jssc, "127.0.0.1", );


 JavaDStream logRowsNavig = flumeStreamNavig.map(
 new Function(){

@Override
public String call(SparkFlumeEvent arg0) throws
Exception {
// TODO Auto-generated method stub0.

Map headers =
arg0.event().getHeaders();

ByteBuffer bytePayload = arg0.event().getBody();
String s = headers.get("source_log").toString() +
"#" + new String(bytePayload.array());
System.out.println("RIGA: " + s);
return s;
}
 });


 logRowsNavig.foreachRDD(
 new Function<JavaRDD<String>, Void>(){
@Override
public Void call(JavaRDD rows) throws Exception
{

if(!rows.isEmpty()){

 //String header =
getHeaderFronRow(rows.collect());

 List listNavigation = new
ArrayList();
 List listTransaction = new
ArrayList();

 for(String row : rows.collect()){

 String header = row.substring(0,
row.indexOf("#"));

 if(header.contains("controller_log")){

 listNavigation.add(createNavigation(row));
 System.out.println("Added Element in
Navigation List");

 }else if(header.contains("business_log")){

 listTransaction.add(createTransaction(row));
 System.out.println("Added Element in
Transaction List");
 }

 }


 if(!listNavigation.isEmpty()){
 JavaRDD navigationRows=
jssc.sparkContext().parallelize(listNavigation);


 javaFunctions(navigationRows).writerBuilder("cassandrasink", "navigation",
mapToRow(Navigation.class)).saveToCassandra();
 }


 if(!listTransaction.isEmpty()){
 JavaRDD transactionRows=
jssc.sparkContext().parallelize(listTransaction);


 javaFunctions(transactionRows).writerBuilder("cassandrasink",
"transaction", mapToRow(Transaction.class)).saveToCassandra();

 }

}
return null;

}
 });

jssc.start();  // Start the computation
jssc.awaitTermination();   // Wait for the computation to terminate
 }


*here the exception*


15/05/29 11:19:29 ERROR QueryExecutor: Failed to execute:
com.datastax.spark.connector.writer.RichBatchStatement@ab76b83
com.datastax.driver.core.exceptions.NoHostAvailableException: All
host(s) tried for query failed (no host was tried)
 at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:107)
 at com.datastax.driver.core.SessionManager.execute(SessionManager.java:538)
 at com.datastax.driver.core.SessionManager.executeQuery(SessionManager.java:577)
 at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:119)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
 at $Proxy17.executeAsync(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.i

Re: Adding an indexed column

2015-05-29 Thread Wesley Miao
One way I can see is to:

1. get the RDD from your DataFrame
2. call rdd.zipWithIndex to get a new RDD of (row, index) pairs
3. turn the new RDD back into a DataFrame (see the sketch below)
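
A minimal sketch of those three steps, assuming `df` and `sqlContext` already exist. Note that zipWithIndex numbers rows in partition order, so the index is only meaningful if the DataFrame has a well-defined ordering:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// 1-2. Zip each Row with its index and append the index to the row values.
val rddWithIndex = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
// 3. Rebuild a DataFrame with the extended schema.
val schemaWithIndex = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
val dfWithIndex = sqlContext.createDataFrame(rddWithIndex, schemaWithIndex)
dfWithIndex.show()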

On Fri, May 29, 2015 at 5:43 AM, Cesar Flores  wrote:

>
> Assuming that I have the next data frame:
>
> flag | price
> --
> 1|47.808764653746
> 1|47.808764653746
> 1|31.9869279512204
> 1|47.7907893713564
> 1|16.7599200038239
> 1|16.7599200038239
> 1|20.3916014172137
>
> How can I create a data frame with an extra indexed column as the next one:
>
> flag | price  | index
> --|---
> 1|47.808764653746 | 0
> 1|47.808764653746 | 1
> 1|31.9869279512204| 2
> 1|47.7907893713564| 3
> 1|16.7599200038239| 4
> 1|16.7599200038239| 5
> 1|20.3916014172137| 6
>
> --
> Cesar Flores
>


Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-29 Thread mélanie gallois
When will Spark 1.4 be available exactly?
Regarding "Model selection can be achieved through high
lambda resulting lots of zero in the coefficients": do you mean that
putting a high lambda as a parameter of the logistic regression keeps only
a few significant variables and "deletes" the others by giving them zero
coefficients? What is a high lambda for you?
Is lambda a parameter available only in Spark 1.4, or can I already use it in
Spark 1.3?
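
For illustration, a sketch of what "high lambda" means in practice with the MLlib API that already exists in Spark 1.3: the optimizer's regParam is the lambda, and combining it with an L1Updater drives more coefficients to exactly zero as lambda grows (the variable-selection effect discussed below). `training` is an assumed RDD[LabeledPoint] and the numbers are arbitrary:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainWithL1(training: RDD[LabeledPoint], lambda: Double) = {
  val lr = new LogisticRegressionWithSGD()
  lr.optimizer
    .setNumIterations(200)
    .setRegParam(lambda)        // the regularization strength ("lambda")
    .setUpdater(new L1Updater)  // L1 penalty => sparse weights
  val model = lr.run(training)
  println(model.weights)        // inspect which coefficients were driven to zero
  model
}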

2015-05-23 0:04 GMT+02:00 Joseph Bradley :

> If you want to select specific variable combinations by hand, then you
> will need to modify the dataset before passing it to the ML algorithm.  The
> DataFrame API should make that easy to do.
>
> If you want to have an ML algorithm select variables automatically, then I
> would recommend using L1 regularization for now and possibly elastic net
> after 1.4 is release, per DB's suggestion.
>
> If you want detailed model statistics similar to what R provides, I've
> created a JIRA for discussing how we should add that functionality to
> MLlib.  Those types of stats will be added incrementally, but feedback
> would be great for prioritization:
> https://issues.apache.org/jira/browse/SPARK-7674
>
> To answer your question: "How are the weights calculated: is there a
> correlation calculation with the variable of interest?"
> --> Weights are calculated as with all logistic regression algorithms, by
> using convex optimization to minimize a regularized log loss.
>
> Good luck!
> Joseph
>
> On Fri, May 22, 2015 at 1:07 PM, DB Tsai  wrote:
>
>> In Spark 1.4, Logistic Regression with elasticNet is implemented in ML
>> pipeline framework. Model selection can be achieved through high
>> lambda resulting lots of zero in the coefficients.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> Blog: https://www.dbtsai.com
>>
>>
>> On Fri, May 22, 2015 at 1:19 AM, SparknewUser
>>  wrote:
>> > I am new in MLlib and in Spark.(I use Scala)
>> >
>> > I'm trying to understand how LogisticRegressionWithLBFGS and
>> > LogisticRegressionWithSGD work.
>> > I usually use R to do logistic regressions but now I do it on Spark
>> > to be able to analyze Big Data.
>> >
>> > The model only returns weights and intercept. My problem is that I have
>> no
>> > information about which variable is significant and which variable I had
>> > better
>> > to delete to improve my model. I only have the confusion matrix and the
>> AUC
>> > to evaluate the performance.
>> >
>> > Is there any way to have information about the variables I put in my
>> model?
>> > How can I try different variable combinations, do I have to modify the
>> > dataset
>> > of origin (e.g. delete one or several columns?)
>> > How are the weights calculated: is there a correlation calculation with
>> the
>> > variable
>> > of interest?
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-how-to-get-the-best-model-with-only-the-most-significant-explanatory-variables-in-LogisticRegr-tp22993.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
*Mélanie*


Re: Spark Streaming and Drools

2015-05-29 Thread Antonio Giambanco
Hi all,
I wrote a simple rule (Drools) and I'm trying to fire it. When I call
fireAllRules nothing happens, and there are no exceptions... Do I need to set up
any configuration?

Thanks

A G
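
For reference, a minimal sketch of invoking Drools from Spark tasks along the lines suggested further down in this thread: one stateless session per partition, built from rules on the executor classpath. The session name "ksession-rules" is hypothetical and must match your kmodule.xml, and Event is an assumed fact class; if nothing fires, a common cause is that the inserted fact's type does not match what the .drl rules declare.

import org.apache.spark.streaming.dstream.DStream
import org.kie.api.KieServices

case class Event(user: String, action: String, amount: Double)

def applyRules(events: DStream[Event]): DStream[Event] = {
  events.mapPartitions { iter =>
    // Build a stateless session from the KIE resources on this executor's classpath.
    val kieServices = KieServices.Factory.get()
    val kContainer  = kieServices.getKieClasspathContainer
    val kSession    = kContainer.newStatelessKieSession("ksession-rules")
    iter.map { event =>
      kSession.execute(event) // fires all rules that match this fact
      event
    }
  }
}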

2015-05-22 12:22 GMT+02:00 Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com>:

> Hi,
>
> Sometime back I played with Distributed Rule processing by integrating
> Drool with HBase Co-Processors ..and invoke Rules on any incoming data ..
>
> https://github.com/dibbhatt/hbase-rule-engine
>
> You can get some idea how to use Drools rules if you see this
> RegionObserverCoprocessor ..
>
>
> https://github.com/dibbhatt/hbase-rule-engine/blob/master/src/main/java/hbase/rule/HBaseDroolObserver.java
>
>
> Idea is basically to create a stateless Ruleengine from the "drl" file and
> fire the rule on incoming data ..
>
> Even though the code is for invoking rules on HBase PUT object , but you
> can get an idea ..and modify it for Spark..
>
> Dibyendu
>
>
>
> On Fri, May 22, 2015 at 3:49 PM, Evo Eftimov 
> wrote:
>
>> I am not aware of existing examples but you can always “ask” Google
>>
>>
>>
>> Basically from Spark Streaming perspective, Drools is a third-party
>> Software Library, you would invoke it in the same way as any other
>> third-party software library from the Tasks (maps, filters etc) within your
>> DAG job
>>
>>
>>
>> *From:* Antonio Giambanco [mailto:antogia...@gmail.com]
>> *Sent:* Friday, May 22, 2015 11:07 AM
>> *To:* Evo Eftimov
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark Streaming and Drools
>>
>>
>>
>> Thanks a lot Evo,
>>
>> do you know where I can find some examples?
>>
>> Have a great one
>>
>>
>> A G
>>
>>
>>
>> 2015-05-22 12:00 GMT+02:00 Evo Eftimov :
>>
>> You can deploy and invoke Drools as a Singleton on every Spark Worker
>> Node / Executor / Worker JVM
>>
>>
>>
>> You can invoke it from e.g. map, filter etc and use the result from the
>> Rule to make decision how to transform/filter an event/message
>>
>>
>>
>> *From:* Antonio Giambanco [mailto:antogia...@gmail.com]
>> *Sent:* Friday, May 22, 2015 9:43 AM
>> *To:* user@spark.apache.org
>> *Subject:* Spark Streaming and Drools
>>
>>
>>
>> Hi All,
>>
>> I'm deploying and architecture that uses flume for sending log
>> information in a sink.
>>
>> Spark streaming read from this sink (pull strategy) e process al this
>> information, during this process I would like to make some event
>> processing. . . for example:
>>
>> Log appender writes information about all transactions in my trading
>> platforms,
>>
>> if a platform user sells more than buy during a week I need to receive an
>> alert on an event dashboard.
>>
>> How can I realize it? Is it possible with drools?
>>
>> Thanks so much
>>
>>
>>
>
>


Re: Batch aggregation by sliding window + join

2015-05-29 Thread Igor Berman
Hi Ayan,
thanks for the response.
I'm using 1.3.1. I'll check window queries (I don't use Spark SQL, only core;
maybe I should?).
What do you mean by materialized? I can repartitionAndSort the daily
aggregation by key; however, I don't quite understand how that will help with
yesterday's block, which needs to be loaded from a file and has no
connection to this repartitioning of the daily block.
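
To make the co-partitioning idea concrete, a sketch under simplifying assumptions (the counters are reduced to a single Long per key purely for illustration): repartition the reloaded base block once with an explicit partitioner and persist it, so the joins below shuffle only the small daily RDDs. The caveat, as noted in SPARK-1061, is that the partitioner is lost when the result is written to files and reloaded the next day, so this only avoids reshuffling within a single run.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def slideBlock(baseBlock: RDD[(Long, Long)],
               lastDay: RDD[(Long, Long)],
               newDay: RDD[(Long, Long)],
               numPartitions: Int): RDD[(Long, Long)] = {
  val partitioner = new HashPartitioner(numPartitions)
  // Partition the big block once and keep it around; the joins below will not reshuffle it.
  val base = baseBlock.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)
  base
    .leftOuterJoin(lastDay.partitionBy(partitioner))  // drop the oldest day from the block
    .mapValues { case (block, old) => block - old.getOrElse(0L) }
    .fullOuterJoin(newDay.partitionBy(partitioner))   // add the newest day
    .mapValues { case (block, fresh) => block.getOrElse(0L) + fresh.getOrElse(0L) }
}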


On 29 May 2015 at 01:51, ayan guha  wrote:

> Which version of spark? In 1.4 window queries will show up for these kind
> of scenarios.
>
> 1 thing I can suggest is keep daily aggregates materialised and partioned
> by key and sorted by key-day combination using repartitionandsort method.
> It allows you to use custom partitioner and custom sorter.
>
> Best
> Ayan
> On 29 May 2015 03:31, "igor.berman"  wrote:
>
>> Hi,
>> I have a batch daily job that computes daily aggregate of several counters
>> represented by some object.
>> After daily aggregation is done, I want to compute block of 3 days
>> aggregation(3,7,30 etc)
>> To do so I need to add new daily aggregation to the current block and then
>> subtract from current block the daily aggregation of the last day within
>> the
>> current block(sliding window...)
>> I've implemented it with something like:
>>
>> baseBlockRdd.leftjoin(lastDayRdd).map(subtraction).fullOuterJoin(newDayRdd).map(addition)
>> All rdds are keyed by unique id(long). Each rdd is saved in avro files
>> after
>> the job finishes and loaded when job starts(on next day). baseBlockRdd is
>> much larger than lastDay and newDay rdds(depends on the size of the block)
>>
>> Unfortunately the performance is not satisfactory due to many shuffles(I
>> have parallelism etc) I was looking for the way to improve performance
>> somehow, to make sure that one task "joins" same local keys without
>> reshuffling baseBlockRdd(which is big) each time the job starts(see
>> https://spark-project.atlassian.net/browse/SPARK-1061 as related issue)
>> so bottom line - how to join big rdd with smaller rdd without reshuffling
>> big rdd over and over again?
>> As soon as I've saved this big rdd and reloaded it from disk I want that
>> every other rdd will be partitioned and collocated by the same
>> "partitioner"(which is absent for hadooprdd) ... somehow, so that only
>> small
>> rdds will be sent over network.
>>
>> Another idea I had  - somehow split baseBlock into 2 parts with filter by
>> keys of small rdds and then join, however I'm not sure it's possible to
>> implement this filter without join.
>>
>> any ideas would be appreciated,
>> thanks in advance
>> Igor
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Batch-aggregation-by-sliding-window-join-tp23074.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Spark Executor Memory Usage

2015-05-29 Thread Valerii Moisieienko
Hello!
My name is Valerii. I have noticed strange memory behaviour of Spark's
executors on my cluster. The cluster works in standalone mode with 3 workers.
The application runs in cluster mode.
From the topology configuration:
spark.executor.memory  1536m
I checked heap usage via JVisualVM:
http://joxi.ru/Q2KqBMdSvYpDrj
and via htop:
http://joxi.ru/Vm63RWeCvG6L2Z

I have 2 questions regarding Spark's executor memory usage:
1. Why does the Max Heap Size change while the executor is working?
2. Why is the memory usage reported by htop greater than the executor's heap size?

Thank you!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Executor-Memory-Usage-tp23083.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Exception writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-05-29 Thread Yana Kadiyska
are you able to connect to your cassandra installation via

cassandra_home$ ./bin/cqlsh

This exception generally means that your cassandra instance is not
reachable/accessible

On Fri, May 29, 2015 at 6:11 AM, Antonio Giambanco 
wrote:

> Hi all,
> I have in a single server installed spark 1.3.1 and cassandra 2.0.14
> I'm coding a simple java class for Spark Streaming as follow:
>
>- reading header events from flume sink
>- based on header I write the event body on navigation or transaction
>table (cassandra)
>
> unfortunatly I get NoHostAvailableException, if I comment the code for
> saving one of the two tables everything works
>
>
> *here the code*
>
>  public static void main(String[] args) {
>
> // Create a local StreamingContext with two working thread and
> batch interval of 1 second
>  SparkConf conf = new
> SparkConf().setMaster("local[2]").setAppName("DWXNavigationApp");
>
>  conf.set("spark.cassandra.connection.host", "127.0.0.1");
>  conf.set("spark.cassandra.connection.native.port","9042");
>  conf.set("spark.cassandra.output.batch.size.rows", "1");
>  conf.set("spark.cassandra.output.concurrent.writes", "1");
>
>
>  final JavaStreamingContext jssc = new JavaStreamingContext(conf,
> Durations.seconds(1));
>
>  JavaReceiverInputDStream flumeStreamNavig =
> FlumeUtils.createPollingStream(jssc, "127.0.0.1", );
>
>
>  JavaDStream logRowsNavig = flumeStreamNavig.map(
>  new Function(){
>
> @Override
> public String call(SparkFlumeEvent arg0) throws
> Exception {
> // TODO Auto-generated method stub0.
>
> Map headers =
> arg0.event().getHeaders();
>
> ByteBuffer bytePayload = arg0.event().getBody();
> String s = headers.get("source_log").toString() +
> "#" + new String(bytePayload.array());
> System.out.println("RIGA: " + s);
> return s;
> }
>  });
>
>
>  logRowsNavig.foreachRDD(
 new Function<JavaRDD<String>, Void>(){
> @Override
> public Void call(JavaRDD rows) throws
> Exception {
>
> if(!rows.isEmpty()){
>
>  //String header =
> getHeaderFronRow(rows.collect());
>
>  List listNavigation = new
> ArrayList();
>  List listTransaction = new
> ArrayList();
>
>  for(String row : rows.collect()){
>
>  String header = row.substring(0,
> row.indexOf("#"));
>
>  if(header.contains("controller_log")){
>
>  listNavigation.add(createNavigation(row));
>  System.out.println("Added Element in
> Navigation List");
>
>  }else if(header.contains("business_log")){
>
>  listTransaction.add(createTransaction(row));
>  System.out.println("Added Element in
> Transaction List");
>  }
>
>  }
>
>
>  if(!listNavigation.isEmpty()){
>  JavaRDD navigationRows=
> jssc.sparkContext().parallelize(listNavigation);
>
>
>  javaFunctions(navigationRows).writerBuilder("cassandrasink", "navigation",
> mapToRow(Navigation.class)).saveToCassandra();
>  }
>
>
>  if(!listTransaction.isEmpty()){
>  JavaRDD transactionRows=
> jssc.sparkContext().parallelize(listTransaction);
>
>
>  javaFunctions(transactionRows).writerBuilder("cassandrasink",
> "transaction", mapToRow(Transaction.class)).saveToCassandra();
>
>  }
>
> }
> return null;
>
> }
>  });
>
> jssc.start();  // Start the computation
> jssc.awaitTermination();   // Wait for the computation to
> terminate
>  }
>
>
> *here the exception*
>
>
> 15/05/29 11:19:29 ERROR QueryExecutor: Failed to execute:
>
> com.datastax.spark.connector.writer.RichBatchStatement@ab76b83
>
> com.datastax.driver.core.exceptions.NoHostAvailableException: All
>
> host(s) tried for query failed (no host was tried)
>
>  at
>
>
> com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:107)
>
>  at
>
> com.datastax.driver.core.SessionManager.execute(SessionManager.java:538)
>
>  at
>
>
> com.datastax.driver.core.SessionManager.executeQuery(SessionManager.java:577)
>
>  at
>
>
> com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:119)
>
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
>  at
>
>
> sun.ref

Re: Exception writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-05-29 Thread Antonio Giambanco
sure I can, everything is on localhost . . . . it only happens when i want
to write in two or more tables in the same schema

A G

2015-05-29 16:10 GMT+02:00 Yana Kadiyska :

> are you able to connect to your cassandra installation via
>
> cassandra_home$ ./bin/cqlsh
>
> This exception generally means that your cassandra instance is not
> reachable/accessible
>
> On Fri, May 29, 2015 at 6:11 AM, Antonio Giambanco 
> wrote:
>
>> Hi all,
>> I have in a single server installed spark 1.3.1 and cassandra 2.0.14
>> I'm coding a simple java class for Spark Streaming as follow:
>>
>>- reading header events from flume sink
>>- based on header I write the event body on navigation or transaction
>>table (cassandra)
>>
>> unfortunatly I get NoHostAvailableException, if I comment the code for
>> saving one of the two tables everything works
>>
>>
>> *here the code*
>>
>>  public static void main(String[] args) {
>>
>> // Create a local StreamingContext with two working thread and
>> batch interval of 1 second
>>  SparkConf conf = new
>> SparkConf().setMaster("local[2]").setAppName("DWXNavigationApp");
>>
>>  conf.set("spark.cassandra.connection.host", "127.0.0.1");
>>  conf.set("spark.cassandra.connection.native.port","9042");
>>  conf.set("spark.cassandra.output.batch.size.rows", "1");
>>  conf.set("spark.cassandra.output.concurrent.writes", "1");
>>
>>
>>  final JavaStreamingContext jssc = new JavaStreamingContext(conf,
>> Durations.seconds(1));
>>
>>  JavaReceiverInputDStream flumeStreamNavig =
>> FlumeUtils.createPollingStream(jssc, "127.0.0.1", );
>>
>>
>>  JavaDStream logRowsNavig = flumeStreamNavig.map(
>>  new Function(){
>>
>> @Override
>> public String call(SparkFlumeEvent arg0) throws
>> Exception {
>> // TODO Auto-generated method stub0.
>>
>> Map headers =
>> arg0.event().getHeaders();
>>
>> ByteBuffer bytePayload = arg0.event().getBody();
>> String s = headers.get("source_log").toString() +
>> "#" + new String(bytePayload.array());
>> System.out.println("RIGA: " + s);
>> return s;
>> }
>>  });
>>
>>
>>  logRowsNavig.foreachRDD(
>>  new Function<JavaRDD<String>, Void>(){
>> @Override
>> public Void call(JavaRDD rows) throws
>> Exception {
>>
>> if(!rows.isEmpty()){
>>
>>  //String header =
>> getHeaderFronRow(rows.collect());
>>
>>  List listNavigation = new
>> ArrayList();
>>  List listTransaction = new
>> ArrayList();
>>
>>  for(String row : rows.collect()){
>>
>>  String header = row.substring(0,
>> row.indexOf("#"));
>>
>>  if(header.contains("controller_log")){
>>
>>  listNavigation.add(createNavigation(row));
>>  System.out.println("Added Element in
>> Navigation List");
>>
>>  }else
>> if(header.contains("business_log")){
>>
>>  listTransaction.add(createTransaction(row));
>>  System.out.println("Added Element in
>> Transaction List");
>>  }
>>
>>  }
>>
>>
>>  if(!listNavigation.isEmpty()){
>>  JavaRDD navigationRows=
>> jssc.sparkContext().parallelize(listNavigation);
>>
>>
>>  javaFunctions(navigationRows).writerBuilder("cassandrasink", "navigation",
>> mapToRow(Navigation.class)).saveToCassandra();
>>  }
>>
>>
>>  if(!listTransaction.isEmpty()){
>>  JavaRDD transactionRows=
>> jssc.sparkContext().parallelize(listTransaction);
>>
>>
>>  javaFunctions(transactionRows).writerBuilder("cassandrasink",
>> "transaction", mapToRow(Transaction.class)).saveToCassandra();
>>
>>  }
>>
>> }
>> return null;
>>
>> }
>>  });
>>
>> jssc.start();  // Start the computation
>> jssc.awaitTermination();   // Wait for the computation to
>> terminate
>>  }
>>
>>
>> *here the exception*
>>
>>
>> 15/05/29 11:19:29 ERROR QueryExecutor: Failed to execute:
>>
>> com.datastax.spark.connector.writer.RichBatchStatement@ab76b83
>>
>> com.datastax.driver.core.exceptions.NoHostAvailableException: All
>>
>> host(s) tried for query failed (no host was tried)
>>
>>  at
>>
>>
>> com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:107)
>>
>>  at
>>
>> com.datastax.driver.core.SessionManager.ex

Re: Spark Executor Memory Usage

2015-05-29 Thread Ted Yu
For #2, see
http://unix.stackexchange.com/questions/65835/htop-reporting-much-higher-memory-usage-than-free-or-top

Cheers

On Fri, May 29, 2015 at 6:56 AM, Valerii Moisieienko <
valeramoisee...@gmail.com> wrote:

> Hello!
> My name is Valerii. I have noticed strange memory behaivour of Spark's
> executor on my cluster. Cluster works in standalone mode with 3 workers.
> Application runs in cluster mode.
> From topology configuration
> spark.executor.memory  1536m
> I checked heap usage via JVisualVM:
> http://joxi.ru/Q2KqBMdSvYpDrj
> and via htop:
> http://joxi.ru/Vm63RWeCvG6L2Z
>
> I have 2 questions regarding Spark's executors memory usage:
> 1. Why does Max Heap Size change during executor work?
> 2. Why does Memory usage via htop greater than executor's heap size?
>
> Thank you!
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Executor-Memory-Usage-tp23083.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkR Jobs Hanging in collectPartitions

2015-05-29 Thread Eskilson,Aleksander
Sure. Looking more closely at the code, I thought I might have had an error in 
the flow of data structures in the R code. The line that extracts the words 
from the corpus is now:
words <- distinct(SparkR:::flatMap(corpus, function(line) {
strsplit(
gsub("^\\s+|[[:punct:]]", "", tolower(line)),
"\\s")[[1]]
}))
(just removes leading whitespace and all punctuation after having made the 
whole line lowercase, then splits to a vector of words, ultimately flattening 
the whole collection)

Counts works on the resultant words list, returning the value expected, so the 
hang most likely occurs during the subtract. I should mention, the size of the 
corpus is very small, just kb in size. The dictionary I subtract against is 
also quite modest by Spark standards, just 4.8MB, and I’ve got 2G memory for 
the Worker, which ought to be sufficient for such a small job.

The Scala analog runs quite fast, even with the subtract. If we look at the DAG 
for the SparkR job and compare that against the event timeline for Stage 3, it 
seems the job is stuck in Scheduler Delay (in 0/2 tasks completed) and never 
begins the rest of the stage. Unfortunately, the executor log hangs up as well, 
and doesn’t give much info.
[Screenshot of the Spark UI DAG and Stage 3 event timeline omitted]

Could you describe in a little more detail at what points data is actually held 
in R’s internal process memory? I was under the impression that 
SparkR:::textFile created an RDD object that would only be realized when a DAG 
requiring it was executed, and would therefore be part of the memory managed by 
Spark, and that memory would only be moved to R as an R object following a 
collect(), take(), etc.

Thanks,
Alek Eskilson
From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
Reply-To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
Date: Wednesday, May 27, 2015 at 8:26 PM
To: Aleksander Eskilson <alek.eskil...@cerner.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SparkR Jobs Hanging in collectPartitions

Could you try to see which phase is causing the hang ? i.e. If you do a count() 
after flatMap does that work correctly ? My guess is that the hang is somehow 
related to data not fitting in the R process memory but its hard to say without 
more diagnostic information.

Thanks
Shivaram

On Tue, May 26, 2015 at 7:28 AM, Eskilson, Aleksander <alek.eskil...@cerner.com> wrote:
I’ve been attempting to run a SparkR translation of a similar Scala job that 
identifies words from a corpus not existing in a newline delimited dictionary. 
The R code is:

dict <- SparkR:::textFile(sc, src1)
corpus <- SparkR:::textFile(sc, src2)
words <- distinct(SparkR:::flatMap(corpus, function(line) { gsub("[[:punct:]]", 
"", tolower(strsplit(line, " |,|-")[[1]]))}))
found <- subtract(words, dict)

(where src1, src2 are locations on HDFS)

Then attempting something like take(found, 10) or saveAsTextFile(found, dest) 
should realize the collection, but that stage of the DAG hangs in Scheduler 
Delay during the collectPartitions phase.

Synonymous Scala code however,
val corpus = sc.textFile(src1).flatMap(_.split(" |,|-"))
val dict = sc.textFile(src2)
val words = corpus.map(word => word.filter(Character.isLetter(_))).distinct()
val found = words.subtract(dict)

performs as expected. Any thoughts?

Thanks,
Alek Eskilson
CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024.



Re: Batch aggregation by sliding window + join

2015-05-29 Thread ayan guha
My point is that if you keep the daily aggregates already computed, then you do
not reprocess the raw data. But yes, you may decide to recompute the last 3 days
every day.
On 29 May 2015 23:52, "Igor Berman"  wrote:

> Hi Ayan,
> thanks for the response
> I'm using 1.3.1. I'll check window queries(I dont use spark-sql...only
> core, might be I should?)
> What do you mean by materialized? I can repartitionAndSort by key
> daily-aggregation, however I'm not quite understand how it will help with
> yesterdays block which needs to be loaded from file and it has no
> connection to this repartition of daily block.
>
>
> On 29 May 2015 at 01:51, ayan guha  wrote:
>
>> Which version of spark? In 1.4 window queries will show up for these kind
>> of scenarios.
>>
>> 1 thing I can suggest is keep daily aggregates materialised and partioned
>> by key and sorted by key-day combination using repartitionandsort method.
>> It allows you to use custom partitioner and custom sorter.
>>
>> Best
>> Ayan
>> On 29 May 2015 03:31, "igor.berman"  wrote:
>>
>>> Hi,
>>> I have a batch daily job that computes daily aggregate of several
>>> counters
>>> represented by some object.
>>> After daily aggregation is done, I want to compute block of 3 days
>>> aggregation(3,7,30 etc)
>>> To do so I need to add new daily aggregation to the current block and
>>> then
>>> subtract from current block the daily aggregation of the last day within
>>> the
>>> current block(sliding window...)
>>> I've implemented it with something like:
>>>
>>> baseBlockRdd.leftjoin(lastDayRdd).map(subtraction).fullOuterJoin(newDayRdd).map(addition)
>>> All rdds are keyed by unique id(long). Each rdd is saved in avro files
>>> after
>>> the job finishes and loaded when job starts(on next day). baseBlockRdd is
>>> much larger than lastDay and newDay rdds(depends on the size of the
>>> block)
>>>
>>> Unfortunately the performance is not satisfactory due to many shuffles(I
>>> have parallelism etc) I was looking for the way to improve performance
>>> somehow, to make sure that one task "joins" same local keys without
>>> reshuffling baseBlockRdd(which is big) each time the job starts(see
>>> https://spark-project.atlassian.net/browse/SPARK-1061 as related issue)
>>> so bottom line - how to join big rdd with smaller rdd without reshuffling
>>> big rdd over and over again?
>>> As soon as I've saved this big rdd and reloaded it from disk I want that
>>> every other rdd will be partitioned and collocated by the same
>>> "partitioner"(which is absent for hadooprdd) ... somehow, so that only
>>> small
>>> rdds will be sent over network.
>>>
>>> Another idea I had  - somehow split baseBlock into 2 parts with filter by
>>> keys of small rdds and then join, however I'm not sure it's possible to
>>> implement this filter without join.
>>>
>>> any ideas would be appreciated,
>>> thanks in advance
>>> Igor
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Batch-aggregation-by-sliding-window-join-tp23074.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


dataframe cumulative sum

2015-05-29 Thread Cesar Flores
What would be the most appropriate method to add a cumulative sum column to
a data frame? For example, assuming that I have the following data frame:

flag | price
--
1|47.808764653746
1|47.808764653746
1|31.9869279512204


How can I create a data frame with an extra cumsum column as the next one:

flag | price  | cumsum_price
--|---
1|47.808764653746 | 47.808764653746
1|47.808764653746 | 95.6175293075
1|31.9869279512204| 127.604457259


Thanks
-- 
Cesar Flores


Re: [Streaming] Configure executor logging on Mesos

2015-05-29 Thread Gerard Maas
Hi Tim,

Thanks for the info.   We (Andy Petrella and myself) have been diving a bit
deeper into this log config:

The log line I was referring to is this one (sorry, I provided the others
just for context)

*Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties*

That line comes from Logging.scala [1] where a default config is loaded if
none is found in the classpath upon the startup of the Spark Mesos executor
in the Mesos sandbox. At that point in time, none of the
application-specific resources have been shipped yet as the executor JVM is
just starting up.   To load a custom configuration file we should have it
already on the sandbox before the executor JVM starts and add it to the
classpath on the startup command. Is that correct?

For the classpath customization, It looks like it should be possible to
pass a -Dlog4j.configuration  property by using the
'spark.executor.extraClassPath' that will be picked up at [2] and that
should be added to the command that starts the executor JVM, but the
resource must be already on the host before we can do that. Therefore we
also need some means of 'shipping' the log4j.configuration file to the
allocated executor.

This all boils down to your statement on the need of shipping extra files
to the sandbox. Bottom line: It's currently not possible to specify a
config file for your mesos executor. (ours grows several GB/day).

The only workaround I have found so far is to open up the Spark assembly,
replace the log4j-defaults.properties, and pack it up again. That would
work, although it is kind of rudimentary, as we use the same assembly for many
jobs. Accessing the log4j API programmatically should probably also work
(I didn't try that yet).
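
A sketch of that programmatic approach (untested in this thread): adjust log4j from inside the executor JVM, e.g. via a lazily-initialized singleton touched at the start of each partition, instead of shipping a log4j.properties file to the Mesos sandbox. The properties path in the comment is hypothetical.

import org.apache.log4j.{Level, LogManager, PropertyConfigurator}

object ExecutorLogSetup extends Serializable {
  lazy val configured: Boolean = {
    // Either point log4j at a properties file already present on the executor host...
    // PropertyConfigurator.configure("/etc/spark/log4j-executor.properties")
    // ...or simply tighten the root logger level directly:
    LogManager.getRootLogger.setLevel(Level.WARN)
    true
  }
}

// Usage inside a job, so the lazy val is evaluated once per executor JVM:
// rdd.mapPartitions { iter => ExecutorLogSetup.configured; iter }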

Should we open a JIRA for this functionality?

-kr, Gerard.




[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Logging.scala#L128
[2]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L77

On Thu, May 28, 2015 at 7:50 PM, Tim Chen  wrote:

>
> -- Forwarded message --
> From: Tim Chen 
> Date: Thu, May 28, 2015 at 10:49 AM
> Subject: Re: [Streaming] Configure executor logging on Mesos
> To: Gerard Maas 
>
>
> Hi Gerard,
>
> The log line you referred to is not Spark logging but Mesos own logging,
> which is using glog.
>
> Our own executor logs should only contain very few lines though.
>
> Most of the log lines you'll see is from Spark, and it can be controled by
> specifiying a log4j.properties to be downloaded with your Mesos task.
> Alternatively if you are downloading Spark executor via spark.executor.uri,
> you can include log4j.properties in that tar ball.
>
> I think we probably need some more configurations for Spark scheduler to
> pick up extra files to be downloaded into the sandbox.
>
> Tim
>
>
>
>
>
> On Thu, May 28, 2015 at 6:46 AM, Gerard Maas 
> wrote:
>
>> Hi,
>>
>> I'm trying to control the verbosity of the logs on the Mesos executors
>> with no luck so far. The default behaviour is INFO on stderr dump with an
>> unbounded growth that gets too big at some point.
>>
>> I noticed that when the executor is instantiated, it locates a default
>> log configuration in the spark assembly:
>>
>> I0528 13:36:22.958067 26890 exec.cpp:206] Executor registered on slave
>> 20150528-063307-780930314-5050-8152-S5
>> Spark assembly has been built with Hive, including Datanucleus jars on
>> classpath
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>>
>> So, no matter what I provide in my job jar files (or also tried with
>> (spark.executor.extraClassPath=log4j.properties) takes effect in the
>> executor's configuration.
>>
>> How should I configure the log on the executors?
>>
>> thanks, Gerard.
>>
>
>
>


Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-29 Thread Chen Song
Regarding the build itself, hadoop-2.6 is not even a valid profile.

I got the following WARNING for my build.

[WARNING] The requested profile "hadoop-2.6" could not be activated because
it does not exist.

Chen

On Fri, May 29, 2015 at 2:38 AM, trackissue121 
wrote:

> I had already tested query in Hive CLI and it works fine. Same query shows
> error in Spark SQL.
> On May 29, 2015 4:14 AM, ayan guha  wrote:
>
> Probably a naive question: can you try the same in hive CLI and see if
> your SQL is working? Looks like hive thing to me as spark is faithfully
> delegating the query to hive.
> On 29 May 2015 03:22, "Abhishek Tripathi"  wrote:
>
>> Hi ,
>> I'm using CDH5.4.0  quick start VM and tried to build Spark with Hive
>> compatibility so that I can run Spark sql and access temp table remotely.
>>
>> I used below command to build  Spark, it was build successful but when I
>> tried to access Hive data from Spark sql, I get error.
>>
>> Thanks,
>> Abhi
>>
>> ---
>> *mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.0 -Phive
>> -Phive-thriftserver -DskipTests clean package*
>>
>> [cloudera@quickstart spark-1.3.1]$export HADOOP_CONF_DIR=/etc/hive/conf/
>> [cloudera@quickstart spark-1.3.1]$ ./bin/spark-sql
>> SET spark.sql.hive.version=0.13.1
>> spark-sql> show tables;
>> sample_07 false
>> t1 false
>> Time taken: 3.901 seconds, Fetched 2 row(s)
>> spark-sql> select * from t1;
>> 15/05/19 23:48:46 ERROR SparkSQLDriver: Failed in [select * from t1]
>> java.lang.VerifyError: class
>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto
>> overrides final method *getUnknownFields*
>> .()Lcom/google/protobuf/UnknownFieldSet;
>> at java.lang.ClassLoader.defineClass1(Native Method)
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>>
>>
>>
>>


-- 
Chen Song


Python implementation of RDD interface

2015-05-29 Thread Sven Kreiss
I wanted to share a Python implementation of RDDs: pysparkling.

http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the

The benefit is that you can apply the same code that you use in PySpark on
large datasets in pysparkling on small datasets or single documents. When
running with pysparkling, there is no dependency on the Java Virtual
Machine or Hadoop.

Sven


Re: spark java.io.FileNotFoundException: /user/spark/applicationHistory/application

2015-05-29 Thread igor.berman
In YARN your executors might run on any node in your cluster, so you need
to configure the Spark history/event-log directory to be on HDFS (so it will be
accessible to every executor).
You've probably switched from local to yarn mode when submitting.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-java-io-FileNotFoundException-user-spark-applicationHistory-application-tp23077p23084.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL v MemSQL/Voltdb

2015-05-29 Thread Conor Doherty
Hi Ashish,

Transactions are a big difference between Spark SQL and MemSQL/VoltDB, but
there are other differences as well. I'm not an expert on Volt, but another
difference between Spark SQL and MemSQL is that DataFrames do not support
indexes and MemSQL tables do. This will have implications for scanning and
query execution performance. Recently released MemSQL 4 also contains
improvements to the distributed optimizer.

For large, infrequently changing data sets, you could use the MemSQL column
store and only need a single system for storage and query (Spark does not
include storage natively, so you would need to use an external data store).
You can also use Spark in combination with MemSQL, either row store or
column store, using the MemSQL Spark Connector.

Thanks,
Conor



On Thu, May 28, 2015 at 10:36 PM, Ashish Mukherjee <
ashish.mukher...@gmail.com> wrote:

> Hi Mohit,
>
> Thanks for your reply.
>
> If my use case is purely querying read-only data (no transaction
> scenarios), at what scale is one of them a better option than the other? I
> am aware that for scale which can be supported on a single node, VoltDB is
> a better choice. However, when the scale grows to a clustered scenario,
> which is the right engine at various degrees of scale?
>
> Regards,
> Ashish
>
> On Fri, May 29, 2015 at 6:57 AM, Mohit Jaggi  wrote:
>
>> I have used VoltDB and Spark. The use cases for the two are quite
>> different. VoltDB is intended for transactions and also supports queries on
>> the same(custom to voltdb) store. Spark(SQL) is NOT suitable for
>> transactions; it is designed for querying immutable data (which may exist
>> in several different forms of stores).
>>
>> > On May 28, 2015, at 7:48 AM, Ashish Mukherjee <
>> ashish.mukher...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > I was wondering if there is any documented comparison of SparkSQL with
>> MemSQL/VoltDB kind of in-memory SQL databases. MemSQL etc. too allow
>> queries to be run in a clustered environment. What is  the major
>> differentiation?
>> >
>> > Regards,
>> > Ashish
>>
>>
>


Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-29 Thread Alex Robbins
I've gotten that error when something is trying to use a different version
of protobuf than you want. Maybe check out a `mvn dependency:tree` to see
if someone is trying to use something other than libproto 2.5.0. (At least,
2.5.0 was current when I was having the problem)

On Fri, May 29, 2015 at 10:23 AM, Chen Song  wrote:

> Regarding the build itself, hadoop-2.6 is not even a valid profile.
>
> I got the following WARNING for my build.
>
> [WARNING] The requested profile "hadoop-2.6" could not be activated
> because it does not exist.
>
> Chen
>
> On Fri, May 29, 2015 at 2:38 AM, trackissue121 
> wrote:
>
>> I had already tested query in Hive CLI and it works fine. Same query
>> shows error in Spark SQL.
>> On May 29, 2015 4:14 AM, ayan guha  wrote:
>>
>> Probably a naive question: can you try the same in hive CLI and see if
>> your SQL is working? Looks like a Hive thing to me, as Spark is faithfully
>> delegating the query to Hive.
>> On 29 May 2015 03:22, "Abhishek Tripathi" 
>> wrote:
>>
>>> Hi ,
>>> I'm using the CDH 5.4.0 QuickStart VM and tried to build Spark with Hive
>>> compatibility so that I can run Spark SQL and access temp tables remotely.
>>>
>>> I used the command below to build Spark. The build was successful, but when I
>>> tried to access Hive data from Spark SQL, I got an error.
>>>
>>> Thanks,
>>> Abhi
>>>
>>> ---
>>> *mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.0 -Phive
>>> -Phive-thriftserver -DskipTests clean package*
>>>
>>> [cloudera@quickstart spark-1.3.1]$export HADOOP_CONF_DIR=/etc/hive/conf/
>>> [cloudera@quickstart spark-1.3.1]$ ./bin/spark-sql
>>> SET spark.sql.hive.version=0.13.1
>>> spark-sql> show tables;
>>> sample_07 false
>>> t1 false
>>> Time taken: 3.901 seconds, Fetched 2 row(s)
>>> spark-sql> select * from t1;
>>> 15/05/19 23:48:46 ERROR SparkSQLDriver: Failed in [select * from t1]
>>> java.lang.VerifyError: class
>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto
>>> overrides final method *getUnknownFields*
>>> .()Lcom/google/protobuf/UnknownFieldSet;
>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>>>
>>>
>>>
>>>
>
>
> --
> Chen Song
>
>


Re: Python implementation of RDD interface

2015-05-29 Thread Davies Liu
There is another implementation of the RDD interface in Python, called
DPark [1]. Could you say a few words comparing the two?

[1] https://github.com/douban/dpark/

On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss  wrote:
> I wanted to share a Python implementation of RDDs: pysparkling.
>
> http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the
>
> The benefit is that the same code you use in PySpark on large datasets can be
> run with pysparkling on small datasets or single documents. When
> running with pysparkling, there is no dependency on the Java Virtual Machine
> or Hadoop.
>
> Sven

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-05-29 Thread Mohammed Guller
Hi -

We have successfully integrated Spark SQL with Cassandra. We have a backend 
that provides a REST API that allows users to execute SQL queries on data in 
C*. Now we would like to also support JDBC/ODBC connectivity, so that users can 
use tools like Tableau to query data in C* through the Spark SQL JDBC server.

However, I have been unable to find a driver that would allow the Spark SQL 
Thrift/JDBC server to connect with Cassandra. DataStax provides a closed-source 
driver that comes only with the DSE version of Cassandra.

I would like to find out how many people are using the Spark SQL JDBC server + 
DSE Cassandra combination. If you do use the Spark SQL JDBC server + DSE, I would 
appreciate it if you could share your experience. For example, what kind of issues 
have you run into? How is the performance? What reporting tools are you using?

Thank  you.

Mohammed



Re: dataframe cumulative sum

2015-05-29 Thread Yin Huai
Hi Cesar,

We just added it in Spark 1.4.

In Spark 1.4, You can use window function in HiveContext to do it. Assuming
you want to calculate the cumulative sum for every flag,

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df.select(
  $"flag",
  $"price",
  sum($"price").over(
    Window.partitionBy("flag").orderBy("price").rowsBetween(Long.MinValue, 0)))


In the code, over lets Spark SQL know that you want to use the window function
sum. partitionBy("flag") will partition the table by the value of flag, and
the sum's scope is a single partition. orderBy("price") will sort the rows in a
partition based on the value of price (this probably does not really matter
for your case, but using orderBy makes the result deterministic).
Finally, rowsBetween(Long.MinValue, 0) means that the sum value for every
row is calculated from the price values of the first row in the partition to
the current row (so you get the cumulative sum).
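
If you also want the new column to carry the name from your example, a small
variation of the same snippet (a sketch, assuming the same imports and a
DataFrame named df) could be:

val cumsum = df.select(
  $"flag",
  $"price",
  sum($"price")
    .over(Window.partitionBy("flag").orderBy("price").rowsBetween(Long.MinValue, 0))
    .as("cumsum_price"))

cumsum.show()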

Thanks,

Yin

On Fri, May 29, 2015 at 8:09 AM, Cesar Flores  wrote:

> What will be the more appropriate method to add a cumulative sum column to
> a data frame. For example, assuming that I have the next data frame:
>
> flag | price
> -----|-----------------
> 1    | 47.808764653746
> 1    | 47.808764653746
> 1    | 31.9869279512204
>
>
> How can I create a data frame with an extra cumsum column as the next one:
>
> flag | price            | cumsum_price
> -----|------------------|----------------
> 1    | 47.808764653746  | 47.808764653746
> 1    | 47.808764653746  | 95.6175293075
> 1    | 31.9869279512204 | 127.604457259
>
>
> Thanks
> --
> Cesar Flores
>


Re: Python implementation of RDD interface

2015-05-29 Thread Sven Kreiss
I have to admit that I never ran DPark. I think the goals are very
different. The purpose of pysparkling is not to reproduce Spark on a
cluster, but to have a lightweight implementation with the same interface
to run locally or on an API server. I still run PySpark on a cluster to
preprocess a large number of documents to train a scikit-learn classifier,
but use pysparkling to preprocess single documents before applying that
classifier in API calls. The only dependencies of pysparkling are "boto"
and "requests" to access files via "s3://" or "http://"; whereas DPark needs
a Mesos cluster.

On Fri, May 29, 2015 at 2:46 PM Davies Liu  wrote:

> There is another implementation of RDD interface in Python, called
> DPark [1], Could you have a few words to compare these two?
>
> [1] https://github.com/douban/dpark/
>
> On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss  wrote:
> > I wanted to share a Python implementation of RDDs: pysparkling.
> >
> >
> http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the
> >
> > The benefit is that you can apply the same code that you use in PySpark
> on
> > large datasets in pysparkling on small datasets or single documents. When
> > running with pysparkling, there is no dependency on the Java Virtual
> Machine
> > or Hadoop.
> >
> > Sven
>


Re: SparkR Jobs Hanging in collectPartitions

2015-05-29 Thread Shivaram Venkataraman
For jobs with R UDFs (i.e. when we use the RDD API from SparkR) we use R on
both the driver side and on the worker side. So in this case when the
`flatMap` operation is run, the data is sent from the JVM to an R process
on the worker which in turn executes the `gsub` function.

Could you turn on INFO logging and send a pointer to the log file? It's
pretty clear that the problem is happening in the call to `subtract`, which
in turn is doing a shuffle operation, but I am not sure why this should
happen.

Thanks
Shivaram

On Fri, May 29, 2015 at 7:56 AM, Eskilson,Aleksander <
alek.eskil...@cerner.com> wrote:

>  Sure. Looking more closely at the code, I thought I might have had an
> error in the flow of data structures in the R code; the line that extracts
> the words from the corpus is now:
> words <- distinct(SparkR:::flatMap(corpus, function(line) {
>   strsplit(
>     gsub("^\\s+|[[:punct:]]", "", tolower(line)),
>     "\\s")[[1]]
> }))
> (just removes leading whitespace and all punctuation after having made the
> whole line lowercase, then splits to a vector of words, ultimately
> flattening the whole collection)
>
>  count() works on the resulting words list, returning the expected value,
> so the hang most likely occurs during the subtract. I should mention that the
> size of the corpus is very small, just kb in size. The dictionary I
> subtract against is also quite modest by Spark standards, just 4.8MB, and
> I’ve got 2G memory for the Worker, which ought to be sufficient for such a
> small job.
>
>  The Scala analog runs quite fast, even with the subtract. If we look at
> the DAG for the SparkR job and compare that against the event timeline for
> Stage 3, it seems the job is stuck in Scheduler Delay (in 0/2 tasks
> completed) and never begins the rest of the stage. Unfortunately, the
> executor log hangs up as well, and doesn’t give much info.
>
>  Could you describe in a little more detail at what points data is
> actually held in R’s internal process memory? I was under the impression
> that SparkR:::textFile created an RDD object that would only be realized
> when a DAG requiring it was executed, and would therefore be part of the
> memory managed by Spark, and that memory would only be moved to R as an R
> object following a collect(), take(), etc.
>
>  Thanks,
> Alek Eskilson
>  From: Shivaram Venkataraman 
> Reply-To: "shiva...@eecs.berkeley.edu" 
> Date: Wednesday, May 27, 2015 at 8:26 PM
> To: Aleksander Eskilson 
> Cc: "user@spark.apache.org" 
> Subject: Re: SparkR Jobs Hanging in collectPartitions
>
>   Could you try to see which phase is causing the hang? I.e., if you do a
> count() after flatMap, does that work correctly? My guess is that the hang
> is somehow related to data not fitting in the R process memory, but it's hard
> to say without more diagnostic information.
>
>  Thanks
> Shivaram
>
> On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander <
> alek.eskil...@cerner.com> wrote:
>
>>  I’ve been attempting to run a SparkR translation of a similar Scala job
>> that identifies words from a corpus not existing in a newline delimited
>> dictionary. The R code is:
>>
>>  dict <- SparkR:::textFile(sc, src1)
>> corpus <- SparkR:::textFile(sc, src2)
>> words <- distinct(SparkR:::flatMap(corpus, function(line) {
>> gsub("[[:punct:]]", "", tolower(strsplit(line, " |,|-")[[1]]))}))
>> found <- subtract(words, dict)
>>
>>  (where src1, src2 are locations on HDFS)
>>
>>  Then attempting something like take(found, 10) or saveAsTextFile(found,
>> dest) should realize the collection, but that stage of the DAG hangs in
>> Scheduler Delay during the collectPartitions phase.
>>
>>  The analogous Scala code, however,
>> val corpus = sc.textFile(src1).flatMap(_.split(" |,|-"))
>> val dict = sc.textFile(src2)
>> val words = corpus.map(word =>
>> word.filter(Character.isLetter(_))).distinct()
>> val found = words.subtract(dict)
>>
>>  performs as expected. Any thoughts?
>>
>>  Thanks,
>> Alek Eskilson
>>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Python implementation of RDD interface

2015-05-29 Thread Davies Liu
DPark can also work on localhost without a Mesos cluster (single thread
or multiple processes).

I also think that running PySpark without the JVM in local mode helps
development, so pysparkling and DPark are both useful.

On Fri, May 29, 2015 at 1:36 PM, Sven Kreiss  wrote:
> I have to admit that I never ran DPark. I think the goals are very
> different. The purpose of pysparkling is not to reproduce Spark on a
> cluster, but to have a lightweight implementation with the same interface to
> run locally or on an API server. I still run PySpark on a cluster to
> preprocess a large number of documents to train a scikit-learn classifier,
> but use pysparkling to preprocess single documents before applying that
> classifier in API calls. The only dependencies of pysparkling are "boto" and
> "requests" to access files via "s3://" or "http://"; whereas DPark needs a
> Mesos cluster.
>
> On Fri, May 29, 2015 at 2:46 PM Davies Liu  wrote:
>>
>> There is another implementation of RDD interface in Python, called
>> DPark [1], Could you have a few words to compare these two?
>>
>> [1] https://github.com/douban/dpark/
>>
>> On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss  wrote:
>> > I wanted to share a Python implementation of RDDs: pysparkling.
>> >
>> >
>> > http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the
>> >
>> > The benefit is that you can apply the same code that you use in PySpark
>> > on
>> > large datasets in pysparkling on small datasets or single documents.
>> > When
>> > running with pysparkling, there is no dependency on the Java Virtual
>> > Machine
>> > or Hadoop.
>> >
>> > Sven

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark-sql errors

2015-05-29 Thread Sanjay Subramanian
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/6SqGuYemnbc
 


Format RDD/SchemaRDD contents to screen?

2015-05-29 Thread Minnow Noir
I"m trying to debug query results inside spark-shell, but finding it
cumbersome to save to file and then use file system utils to explore the
results, and .foreach(print) tends to interleave the results among the
myriad log messages.  Take() and collect() truncate.

Is there a simple way to present the contents of an RDD/SchemaRDD on the
screen in a formatted way?   For example, say I want to take() the first 30
lines/rows in an *RDD and present them in a readable way on the screen so
that I can see what's missing or invalid.  Obviously, I'm just trying to
sample the results in a readable way, not download everything to the driver.


Thank you


Re: Is anyone using Amazon EC2? (second attempt!)

2015-05-29 Thread roni
Hi,
Any update on this?
I am not sure if the issue I am seeing is related.
I have 8 slaves, and when I created the cluster I specified an EBS volume of
100 GB.
On EC2 I see 8 volumes created, each attached to the corresponding slave.
But when I try to copy data onto it, it complains:

/root/ephemeral-hdfs/bin/hadoop fs -cp /intersection hdfs://
ec2-54-149-112-136.us-west-2.compute.amazonaws.com:9010/

2015-05-28 23:40:35,447 WARN  hdfs.DFSClient
(DFSOutputStream.java:run(562)) - DataStreamer Exception

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/intersection/kmer150/commonGoodKmers/_temporary/_attempt_201504010056_0004_m_000428_3948/part-00428._COPYING_
could only be replicated to 0 nodes instead of minReplication (=1).  There
are 1 datanode(s) running and no node(s) are excluded in this operation.


It shows only 1 datanode, but for ephemeral-hdfs it shows 8 datanodes.

Any thoughts?

Thanks

_R

On Sat, May 23, 2015 at 7:24 AM, Joe Wass  wrote:

> I used Spark on EC2 a while ago, but recent revisions seem to have broken
> the functionality.
>
> Is anyone actually using Spark on EC2 at the moment?
>
> The bug in question is:
>
> https://issues.apache.org/jira/browse/SPARK-5008
>
> It makes it impossible to use persistent HDFS without a workround on each
> slave node.
>
> No-one seems to be interested in the bug, so I wonder if other people
> aren't actually having this problem. If this is the case, any suggestions?
>
> Joe
>


Re: Is anyone using Amazon EC2? (second attempt!)

2015-05-29 Thread Sanjay Subramanian
I use Spark on EC2, but it's a CDH 5.3.3 distribution (starving-developer 
version) installed through Cloudera Manager. Spark is configured to run on YARN.

Regards
Sanjay

Sent from my iPhone

> On May 29, 2015, at 6:16 PM, roni  wrote:
> 
> Hi ,
> Any update on this? 
> I am not sure if the issue I am seeing is related ..
> I have 8 slaves and when I created the cluster I specified ebs volume with 
> 100G.
> I see on Ec2 8 volumes created and each attached to the corresponding slave.
> But when I try to copy data on it , it complains that 
> /root/ephemeral-hdfs/bin/hadoop fs -cp /intersection 
> hdfs://ec2-54-149-112-136.us-west-2.compute.amazonaws.com:9010/
> 
> 2015-05-28 23:40:35,447 WARN  hdfs.DFSClient (DFSOutputStream.java:run(562)) 
> - DataStreamer Exception
> 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /intersection/kmer150/commonGoodKmers/_temporary/_attempt_201504010056_0004_m_000428_3948/part-00428._COPYING_
>  could only be replicated to 0 nodes instead of minReplication (=1).  There 
> are 1 datanode(s) running and no node(s) are excluded in this operation.
> 
> 
> 
> It shows only 1 datanode , but for ephermal-hdfs it shows 8 datanodes.
> 
> Any thoughts?
> 
> Thanks
> 
> _R
> 
> 
>> On Sat, May 23, 2015 at 7:24 AM, Joe Wass  wrote:
>> I used Spark on EC2 a while ago, but recent revisions seem to have broken 
>> the functionality.
>> 
>> Is anyone actually using Spark on EC2 at the moment?
>> 
>> The bug in question is:
>> 
>> https://issues.apache.org/jira/browse/SPARK-5008
>> 
>> It makes it impossible to use persistent HDFS without a workround on each 
>> slave node.
>> 
>> No-one seems to be interested in the bug, so I wonder if other people aren't 
>> actually having this problem. If this is the case, any suggestions? 
>> 
>> Joe
> 


RE: Official Docker container for Spark

2015-05-29 Thread Tridib Samanta
Thanks all for your replies. I was evaluating which one fits best for me. I 
picked epahomov/docker-spark from the Docker registry and it suffices my needs.
 
Thanks
Tridib
 
Date: Fri, 22 May 2015 14:15:42 +0530
Subject: Re: Official Docker container for Spark
From: riteshoneinamill...@gmail.com
To: 917361...@qq.com
CC: tridib.sama...@live.com; user@spark.apache.org

Use this:
sequenceiq/docker

Here's a link to their github repo:
docker-spark


They have repos for other big data tools too, which are again really nice. It's 
being maintained properly by their devs and 

Re: Format RDD/SchemaRDD contents to screen?

2015-05-29 Thread ayan guha
Depending on your Spark version, you can convert the SchemaRDD to a DataFrame
and then use .show().
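
For example, a minimal sketch in Scala (assuming Spark 1.3+, where SchemaRDD
became DataFrame, and a SQLContext named sqlContext with a hypothetical
registered table "some_table"):

// Results of sqlContext.sql() are DataFrames in 1.3+, so show() prints a
// formatted, aligned table of the first rows.
val df = sqlContext.sql("SELECT * FROM some_table")
df.show(30)

// For a plain RDD, take() plus println keeps just a small sample on the driver:
val sample = sc.parallelize(1 to 100)
sample.take(30).foreach(println)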
On 30 May 2015 10:33, "Minnow Noir"  wrote:

> I"m trying to debug query results inside spark-shell, but finding it
> cumbersome to save to file and then use file system utils to explore the
> results, and .foreach(print) tends to interleave the results among the
> myriad log messages.  Take() and collect() truncate.
>
> Is there a simple way to present the contents of an RDD/SchemaRDD on the
> screen in a formatted way?   For example, say I want to take() the first 30
> lines/rows in an *RDD and present them in a readable way on the screen so
> that I can see what's missing or invalid.  Obviously, I'm just trying to
> sample the results in a readable way, not download everything to the driver.
>
>
> Thank you
>


Security,authorization and governance

2015-05-29 Thread Phani Yadavilli -X (pyadavil)
Hi Team,

Is there any open-source framework/tool for providing security, authorization, and 
data governance for Spark?

Regards
Phani Kumar


Re: How Broadcast variable works

2015-05-29 Thread bit1...@163.com

Can someone help take a look at my questions? Thanks.



bit1...@163.com
 
From: bit1...@163.com
Date: 2015-05-29 18:57
To: user
Subject: How Broadcast variable works
Hi, 
I have a Spark Streaming application. SparkContext uses broadcast variables to 
broadcast configuration information that each task will use.

I have the following two questions:
1. Will the broadcast variable be broadcast every time the driver sends 
tasks to workers in each batch interval?
2. If the answer to the above is yes, and the broadcast variable is modified 
between batch intervals (the configuration information is updated over time) 
and SparkContext broadcasts it again, will tasks see the updated variable?

Thanks. 





bit1...@163.com


Re: How Broadcast variable works

2015-05-29 Thread ayan guha
1. No. That's the purpose of a broadcast variable, i.e. not to be shipped with
every task. From the documentation:

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks. They
can be used, for example, to give every node a copy of a large input
dataset in an efficient manner. Spark also attempts to distribute broadcast
variables using efficient broadcast algorithms to reduce communication cost.


After the broadcast variable is created, it should be used instead of the
value v in any functions run on the cluster so that v is not shipped to the
nodes more than once. In addition, the object v should not be modified
after it is broadcast in order to ensure that all nodes get the same value
of the broadcast variable (e.g. if the variable is shipped to a new node
later).


2. See above :)

If you need the behavior you are asking for (a value re-shipped with the
tasks), you are looking for closures.
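
For reference, a minimal sketch of creating and reading a broadcast variable
inside tasks (assuming a SparkContext named sc; the map contents are purely
illustrative):

// A hypothetical, read-only configuration map broadcast to the executors.
val config = Map("endpoint" -> "http://example.com", "retries" -> "3")
val configBc = sc.broadcast(config)

val numbers = sc.parallelize(1 to 100)
// Tasks read configBc.value; the value is cached on each executor rather
// than shipped with every task.
val tagged = numbers.map(n => (n, configBc.value("retries")))
tagged.take(5).foreach(println)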


Best

Ayan

On Sat, May 30, 2015 at 4:11 PM, bit1...@163.com  wrote:

>
> Can someone help take a look at my questions? Thanks.
>
> --
> bit1...@163.com
>
>
> *From:* bit1...@163.com
> *Date:* 2015-05-29 18:57
> *To:* user 
> *Subject:* How Broadcast variable works
> Hi,
> I have a spark streaming application.  SparkContext uses broadcast
> vriables to broadcast Configuration information that each task will be used
>
> I have following two questions:
> 1. Will the broadcast variable be broadcasted every time when driver sends
> tasks to workers in each batch interval?
> 2. If the above question is true, then if the broadcast variable is
> modified between the batch interval(The configuration information is
> updated over time) and Spark Context broadcasts it again,  will tasks see
> the updated variable?
>
> Thanks.
>
>
>
> --
> bit1...@163.com
>
>


-- 
Best Regards,
Ayan Guha