Re: [Spark 1.5.2] Log4j Configuration for executors

2016-04-18 Thread Prashant Sharma
Maybe you can try creating it before running the app.


Re: spark sql on hive

2016-04-18 Thread Sea
It's a bug in Hive. Please use the Hive metastore service instead of connecting to MySQL 
directly:
set hive.metastore.uris in hive-site.xml
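
For reference, a minimal hive-site.xml fragment pointing Spark at a remote
metastore service might look like the following (host and port are
placeholders for your own metastore):

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://your-metastore-host:9083</value>
  </property>

With this file on Spark's classpath (e.g. in conf/), HiveContext goes through
the Thrift metastore service instead of opening a DataNucleus/JDBC connection
to MySQL itself.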






------------------ Original Message ------------------
From: "Jieliang Li"
Date: 2016-04-19 12:55
To: "user"
Subject: spark sql on hive



Hi everyone. I use Spark SQL, but it throws an exception:
Retrying creating default database after error: Error creating transactional 
connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection 
factory
at 
org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at 
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
at 
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
at 
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:56)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:65)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:579)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:557)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:606)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:448)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5601)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:193)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1486)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:64)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:74)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2841)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2860)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:453)
at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:229)
at 
org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:225)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.(HiveContext.scala:373)
at 
org.apache.spark.sql.hive.HiveContext.executePlan(HiveContext.scala:80)
at 
org.apache.spark.sql.hive.HiveContext.executePlan(HiveContext.scala:49)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.l

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-18 Thread Prashant Sharma
Hello Deepak,

It is not clear what you want to do. Are you talking about Spark Streaming?
It is possible to process historical data in Spark batch mode too. You
can add a timestamp field to the xml/json. Spark documentation is at
spark.apache.org. Spark has good built-in features to process json and
xml [1] messages.

Thanks,
Prashant Sharma

1. https://github.com/databricks/spark-xml
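
As a rough illustration of [1], a Scala sketch of reading XML with spark-xml
(the package coordinates and the "page" row tag are assumptions for a
Wikipedia dump; adjust them to your build and schema):

  // e.g. spark-submit --packages com.databricks:spark-xml_2.10:0.3.3 ...
  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  val pages = sqlContext.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "page")              // one record per <page> element (assumed)
    .load("hdfs:///data/enwiki-dump.xml")

  pages.printSchema()
  pages.select("title").show(5)            // assumes the dump exposes a <title> field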

On Tue, Apr 19, 2016 at 10:31 AM, Deepak Sharma 
wrote:

> Hi all,
> I am looking for an architecture to ingest 10 million messages in
> micro-batches of seconds.
> If anyone has worked on a similar kind of architecture, can you please
> point me to any documentation around it, like what the
> architecture should be, and which components/big data ecosystem tools I should
> consider, etc.
> The messages will be in xml/json format, with a preprocessor engine or
> message enhancer and then finally a processor.
> I also thought about using a data cache for serving the data.
> The data cache should have the capability to serve historical data in
> milliseconds (maybe up to 30 days of data).
> --
> Thanks
> Deepak
> www.bigdatabig.com
>
>


Re: Read Parquet in Java Spark

2016-04-18 Thread Zhan Zhang
You can try something like below, if you only have one column.

val rdd = parquetFile.javaRDD().map(row => row.getAs[String](0))

Thanks.

Zhan Zhang
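
A slightly fuller sketch of the same idea in Scala, assuming the Parquet file
has a single string column holding a JSON document per row (paths and names
are illustrative):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val parquetFile = sqlContext.read.parquet("hdfs:///logs/events.parquet")

  // One string column per row: pull it out as an RDD of plain strings.
  val jsonRdd = parquetFile.map(row => row.getString(0))

  // Or keep every column, joined with a separator.
  val lines = parquetFile.map(row => row.mkString(","))

  jsonRdd.take(5).foreach(println)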

On Apr 18, 2016, at 3:44 AM, Ramkumar V 
mailto:ramkumar.c...@gmail.com>> wrote:

HI,

Any idea on this ?

Thanks,


On Mon, Apr 4, 2016 at 2:47 PM, Akhil Das 
mailto:ak...@sigmoidanalytics.com>> wrote:
I didn't know you have a parquet file containing JSON data.

Thanks
Best Regards

On Mon, Apr 4, 2016 at 2:44 PM, Ramkumar V 
mailto:ramkumar.c...@gmail.com>> wrote:
Hi Akhil,

Thanks for your help. Why do you use "," as the separator?

I have a parquet file which contains only json in each line.

Thanks,


On Mon, Apr 4, 2016 at 2:34 PM, Akhil Das 
mailto:ak...@sigmoidanalytics.com>> wrote:
Something like this (in scala):

val rdd = parquetFile.javaRDD().map(row => row.mkString(","))

You can create a map operation over your javaRDD to convert the 
org.apache.spark.sql.Row to String (the Row.mkString() operation).

Thanks
Best Regards

On Mon, Apr 4, 2016 at 12:02 PM, Ramkumar V 
mailto:ramkumar.c...@gmail.com>> wrote:
Any idea on this ? How to convert parquet file into JavaRDD ?

Thanks,


On Thu, Mar 31, 2016 at 4:30 PM, Ramkumar V 
mailto:ramkumar.c...@gmail.com>> wrote:
Hi,

Thanks for the reply. I tried this. It's returning JavaRDD<Row> instead of 
JavaRDD<String>. How do I get a JavaRDD<String>?

Error :
incompatible types: org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> 
cannot be converted to org.apache.spark.api.java.JavaRDD<String>





Thanks,


On Thu, Mar 31, 2016 at 2:57 PM, UMESH CHAUDHARY 
mailto:umesh9...@gmail.com>> wrote:
From the Spark documentation:


DataFrame parquetFile = sqlContext.read().parquet("people.parquet");

JavaRDD<Row> jRDD = parquetFile.javaRDD();

The javaRDD() method will convert the DataFrame to an RDD.

On Thu, Mar 31, 2016 at 2:51 PM, Ramkumar V 
mailto:ramkumar.c...@gmail.com>> wrote:
Hi,

I'm trying to read Parquet log files in Java Spark. The Parquet log files are 
stored in HDFS. I want to read and convert that Parquet file into a JavaRDD. I 
was only able to find the SQLContext DataFrame API. How can I read it with the 
SparkContext and RDDs? What is the best way to read it?

Thanks,











Re: Why Spark having OutOfMemory Exception?

2016-04-18 Thread Zhan Zhang
What kind of OOM? Driver or executor side? You can use a core dump to find what 
caused the OOM.

Thanks.

Zhan Zhang

On Apr 18, 2016, at 9:44 PM, 李明伟 
mailto:kramer2...@126.com>> wrote:

Hi Samaga

Thanks very much for your reply, and sorry for the delayed reply.

Cassandra or Hive is a good suggestion.
However in my situation I am not sure if it will make sense.

My requirement is to use the most recent 24 hours of data to generate a report. The 
frequency is 5 minutes.
So if I use Cassandra or Hive, Spark will have to read 24 hours of data 
every 5 minutes, and among that data a big part (23 hours or more) will 
be read repeatedly.

The window in Spark is for stream computing. I did not use it, but I will 
consider it.


Thanks again

Regards
Mingwei





At 2016-04-11 19:09:48, "Lohith Samaga M" 
mailto:lohith.sam...@mphasis.com>> wrote:
>Hi Kramer,
>   Some options:
>   1. Store in Cassandra with TTL = 24 hours. When you read the full 
> table, you get the latest 24 hours data.
>   2. Store in Hive as ORC file and use timestamp field to filter out the 
> old data.
>   3. Try windowing in spark or flink (have not used either).
>
>
>Best regards / Mit freundlichen Grüßen / Sincères salutations
>M. Lohith Samaga
>
>
>-Original Message-
>From: kramer2...@126.com [mailto:kramer2...@126.com]
>Sent: Monday, April 11, 2016 16.18
>To: user@spark.apache.org
>Subject: Why Spark having OutOfMemory Exception?
>
>I use Spark to do some very simple calculations. The description is like the one below 
>(pseudo code):
>
>
>While timestamp == 5 minutes
>
>df = read_hdf() # Read hdfs to get a dataframe every 5 minutes
>
>my_dict[timestamp] = df # Put the data frame into a dict
>
>delete_old_dataframe( my_dict ) # Delete old dataframe (timestamp is one
>24 hour before)
>
>big_df = merge(my_dict) # Merge the recent 24 hours data frame
>
>To explain..
>
>I have new files coming in every 5 minutes, but I need to generate a report on 
>the most recent 24 hours of data.
>The concept of 24 hours means I need to delete the oldest data frame every 
>time I put a new one into the dict.
>So I maintain a dict (my_dict in the above code); the dict contains mappings like
>timestamp: dataframe. Every time I put a dataframe into the dict, I go 
>through the dict to delete the old data frames whose timestamp is more than 24 hours ago.
>After deleting and inserting, I merge the data frames in the dict into a big one and 
>run SQL on it to get my report.
>
>*
>I want to know if anything is wrong with this model? Because it becomes very slow 
>after running for a while and hits OutOfMemory. I know that my memory is 
>enough. Also the size of the files is very small for test purposes, so there should not be a 
>memory problem.
>
>I am wondering if there is a lineage issue, but I am not sure.
>
>*
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
>Sent from the Apache Spark User List mailing list archive at 
>Nabble.com.
>
>-
>To unsubscribe, e-mail: 
>user-unsubscr...@spark.apache.org 
>For additional commands, e-mail: 
>user-h...@spark.apache.org
>
>Information transmitted by this e-mail is proprietary to Mphasis, its 
>associated companies and/ or its customers and is intended
>for use only by the individual or entity to which it is addressed, and may 
>contain information that is privileged, confidential or
>exempt from disclosure under applicable law. If you are not the intended 
>recipient or it appears that this mail has been forwarded
>to you without proper authority, you are notified that any use or 
>dissemination of this information in any manner is strictly
>prohibited. In such cases, please notify us immediately at 
>mailmas...@mphasis.com and delete this mail 
>from your records.
>







Processing millions of messages in milliseconds -- Architecture guide required

2016-04-18 Thread Deepak Sharma
Hi all,
I am looking for an architecture to ingest 10 million messages in
micro-batches of seconds.
If anyone has worked on a similar kind of architecture, can you please
point me to any documentation around it, like what the
architecture should be, and which components/big data ecosystem tools I should
consider, etc.
The messages will be in xml/json format, with a preprocessor engine or
message enhancer and then finally a processor.
I also thought about using a data cache for serving the data.
The data cache should have the capability to serve historical data in
milliseconds (maybe up to 30 days of data).
-- 
Thanks
Deepak
www.bigdatabig.com


spark sql on hive

2016-04-18 Thread Jieliang Li
Hi everyone. I use Spark SQL, but it throws an exception:
Retrying creating default database after error: Error creating transactional 
connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection 
factory
at 
org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at 
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:56)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:65)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:579)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:557)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:606)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:448)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5601)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:193)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1486)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:64)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:74)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2841)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2860)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:453)
at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:229)
at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:225)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.(HiveContext.scala:373)
at org.apache.spark.sql.hive.HiveContext.executePlan(HiveContext.scala:80)
at org.apache.spark.sql.hive.HiveContext.executePlan(HiveContext.scala:49)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:680)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
at 
com.chinamobile.cmss.pdm.algo.ml.sparkbased.statistics.Correlation$$anonfun$2$$anonfun$apply$2$$anonfun$apply$3$$anonfun$apply$4.apply(CorrelationPipe.scala:46)
at 
com.chinamobile.cmss.pdm.algo.ml.sparkbased.statistics.Correlation$$anonfun$2$$anonfun$apply$2$$anonfun$apply$3$$anonfun$apply$4.apply(CorrelationPipe.scala:38)
at com.chinamobile.cmss.pdm.common.yavf.Lazy.expr$lzycompute(Lazy.scala:12)
at com.chinamobile.cmss.pdm.common.yavf.Lazy.expr(Lazy.scala:12)
at com.chinamobile.cmss.pdm.common.yavf.Lazy.unary_$bang(Lazy.scala:14)
at 
com.chinamobile.cmss.pdm.algo.AlgorithmRunner$class.execute(Algor

Is it possible that spark worker require more resource than the cluster?

2016-04-18 Thread kramer2...@126.com
I have a standalone cluster running on one node.

The ps command shows that the Worker has 1 GB of memory and the Driver has
256m.

root 23182 1  0 Apr01 ?00:19:30 java -cp
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
--webui-port 8081 spark://ES01:7077

root 23053 1  0 Apr01 ?00:25:00 java -cp
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
--ip ES01 --port 7077 --webui-port 8080


But when I submit my application to the cluster, I can specify that the driver
uses 10G of memory and the worker uses 10G of memory as well.

So does it make sense that I can assign more memory to the application than the
cluster itself has?
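
For what it's worth, in standalone mode the memory a worker offers is
configured separately from the worker daemon's own heap; a conf/spark-env.sh
sketch (the values are placeholders, not a recommendation):

  # memory this node's worker can hand out to executors
  SPARK_WORKER_MEMORY=10g
  # heap of the worker/master daemons themselves (the -Xms1g/-Xmx1g seen in ps)
  SPARK_DAEMON_MEMORY=1g

The 1 GB in the ps output is only the daemon's heap; executors run as separate
JVMs, so an application asking for more executor memory than
SPARK_WORKER_MEMORY typically just waits for resources rather than starting.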



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-that-spark-worker-require-more-resource-than-the-cluster-tp26799.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re:RE: Why Spark having OutOfMemory Exception?

2016-04-18 Thread 李明伟
Hi Samaga


Thanks very much for your reply, and sorry for the delayed reply.


Cassandra or Hive is a good suggestion. 
However in my situation I am not sure if it will make sense.


My requirement is to use the most recent 24 hours of data to generate a report. The 
frequency is 5 minutes.
So if I use Cassandra or Hive, Spark will have to read 24 hours of data 
every 5 minutes, and among that data a big part (23 hours or more) will 
be read repeatedly.


The window in Spark is for stream computing. I did not use it, but I will 
consider it.




Thanks again


Regards
Mingwei







At 2016-04-11 19:09:48, "Lohith Samaga M"  wrote:
>Hi Kramer,
>   Some options:
>   1. Store in Cassandra with TTL = 24 hours. When you read the full 
> table, you get the latest 24 hours data.
>   2. Store in Hive as ORC file and use timestamp field to filter out the 
> old data.
>   3. Try windowing in spark or flink (have not used either).
>
>
>Best regards / Mit freundlichen Grüßen / Sincères salutations
>M. Lohith Samaga
>
>
>-Original Message-
>From: kramer2...@126.com [mailto:kramer2...@126.com] 
>Sent: Monday, April 11, 2016 16.18
>To: user@spark.apache.org
>Subject: Why Spark having OutOfMemory Exception?
>
>I use Spark to do some very simple calculations. The description is like the one below 
>(pseudo code):
>
>
>While timestamp == 5 minutes
>
>df = read_hdf() # Read hdfs to get a dataframe every 5 minutes
>
>my_dict[timestamp] = df # Put the data frame into a dict
>
>delete_old_dataframe( my_dict ) # Delete old dataframe (timestamp is one
>24 hour before)
>
>big_df = merge(my_dict) # Merge the recent 24 hours data frame
>
>To explain..
>
>I have new files coming in every 5 minutes, but I need to generate a report on 
>the most recent 24 hours of data.
>The concept of 24 hours means I need to delete the oldest data frame every 
>time I put a new one into the dict.
>So I maintain a dict (my_dict in the above code); the dict contains mappings like
>timestamp: dataframe. Every time I put a dataframe into the dict, I go 
>through the dict to delete the old data frames whose timestamp is more than 24 hours ago.
>After deleting and inserting, I merge the data frames in the dict into a big one and 
>run SQL on it to get my report.
>
>*
>I want to know if anything is wrong with this model? Because it becomes very slow 
>after running for a while and hits OutOfMemory. I know that my memory is 
>enough. Also the size of the files is very small for test purposes, so there should not be a 
>memory problem.
>
>I am wondering if there is a lineage issue, but I am not sure.
>
>*
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
>commands, e-mail: user-h...@spark.apache.org
>
>Information transmitted by this e-mail is proprietary to Mphasis, its 
>associated companies and/ or its customers and is intended 
>for use only by the individual or entity to which it is addressed, and may 
>contain information that is privileged, confidential or 
>exempt from disclosure under applicable law. If you are not the intended 
>recipient or it appears that this mail has been forwarded 
>to you without proper authority, you are notified that any use or 
>dissemination of this information in any manner is strictly 
>prohibited. In such cases, please notify us immediately at 
>mailmas...@mphasis.com and delete this mail from your records.
>
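
A minimal Scala sketch of option 2 above, i.e. keeping the data in a
timestamped table (Hive/ORC or Parquet) and filtering on read instead of
holding 24 hours of DataFrames in the driver (paths and column names are
illustrative):

  // run every 5 minutes, e.g. from a scheduler
  val cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000

  val events  = sqlContext.read.parquet("hdfs:///events")        // or a Hive/ORC table
  val last24h = events.filter(events("event_ts") >= cutoff)      // only recent rows

  last24h.registerTempTable("last24h")
  sqlContext.sql("SELECT key, count(*) AS cnt FROM last24h GROUP BY key").show()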


[Spark 1.5.2] Log4j Configuration for executors

2016-04-18 Thread Divya Gehlot
Hi,
I tried configuring logs to write it to file  for Spark Driver and
Executors .
I have two separate log4j properties files for Spark driver and executor
respectively.
It's writing logs for the Spark driver, but for the executor logs I am getting the
error below:

java.io.FileNotFoundException: /home/hdfs/spark_executor.log (Permission
> denied)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:221)



Why is it giving permission denied for the executor log, whereas it is writing
the driver logs fine?

Am I missing any settings?


Would really appreciate the help.



Thanks,

Divya
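
One commonly suggested setup for this (a sketch with assumed file names; the
key point is that executors usually run as a different OS user, so they need a
log path that user can write to, e.g. the container/work directory or /tmp):

  spark-submit \
    --files /local/path/log4j_driver.properties,/local/path/log4j_executor.properties \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j_driver.properties" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j_executor.properties" \
    ...

  # in log4j_executor.properties (assuming a FileAppender named "file")
  log4j.appender.file.File=/tmp/spark_executor.log

The /home/hdfs/spark_executor.log path in the stack trace is likely writable
only by the hdfs user, which would explain the permission error.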


spark-submit not adding application jar to classpath

2016-04-18 Thread Saif.A.Ellafi
Hi,

I am submitting a jar file with spark-submit, which has some content inside 
src/main/resources.
I am unable to access such resources, since the application jar is not being 
added to the classpath.

This works fine if I also include the application jar in the --driver-class-path 
entry.

Is this healthy? Any thoughts?
Saif
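
For comparison, the workaround described above looks something like this
(paths and class name are placeholders):

  spark-submit \
    --class com.example.MyApp \
    --driver-class-path /path/to/myapp.jar \
    ... \
    /path/to/myapp.jar

Whether it is needed can also depend on how the resource is looked up;
resolving it through the classloader of one of the application's own classes,
e.g. getClass.getResourceAsStream("/my-config.properties"), sometimes behaves
differently from going through the system classloader.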



Re: Parse XML using java spark

2016-04-18 Thread Mail.com

You might look at using JAXB or StAX. If it is simple enough, use DataFrames' 
auto-generated schema.

Pradeep

> On Apr 18, 2016, at 6:37 PM, Jinan Alhajjaj  wrote:
> 
> Thank you for your help.
> I would like to parse the XML file using Java, not Scala. Can you please 
> provide me with an example of how to parse XML via Java using Spark? My XML 
> file is a Wikipedia dump file.
> Thank you


Re: How to decide the number of tasks in Spark?

2016-04-18 Thread Mich Talebzadeh
Try to have a look at this doc

http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 18 April 2016 at 20:43, Dogtail L  wrote:

> Hi,
>
> When launching a job in Spark, I have great trouble deciding the number of
> tasks. Someone says it is better to create a task per HDFS block size,
> i.e., make sure one task processes 128 MB of input data; others suggest that
> the number of tasks should be twice the total cores available to the
> job. Also, I found that someone suggests launching small tasks in Spark,
> i.e., making sure each task lasts around 100 ms.
>
> I am quite confused about all these suggestions. Is there any general rule
> for deciding the number of tasks in Spark? Great thanks!
>
> Best
>


Parse XML using java spark

2016-04-18 Thread Jinan Alhajjaj
Thank you for your help. I would like to parse the XML file using Java, not Scala.
Can you please provide me with an example of how to parse XML via Java using
Spark. My XML file is a Wikipedia dump file. Thank you.
 

Re: spark-ec2 hitting yum install issues

2016-04-18 Thread Anusha S
Yes, it does not work manually. I am not really able to do 'yum search' to
find exact package names to try others, but I tried python-pip and it gave the
same error.

I will post this in the link you pointed out. Thanks!

On Thu, Apr 14, 2016 at 6:11 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> If you log into the cluster and manually try that step does it still fail?
> Can you yum install anything else?
>
> You might want to report this issue directly on the spark-ec2 repo, btw:
> https://github.com/amplab/spark-ec2
>
> Nick
>
> On Thu, Apr 14, 2016 at 9:08 PM sanusha 
> wrote:
>
>>
>> I am using spark-1.6.1-prebuilt-with-hadoop-2.6 on mac. I am using the
>> spark-ec2 to launch a cluster in
>> Amazon VPC. The setup.sh script [run first thing on master after launch]
>> uses pssh and tries to install it
>> via 'yum install -y pssh'. This step always fails on the master AMI that
>> the
>> script uses by default as it is
>> not able to find it in the repo mirrors - hits 403.
>>
>> Has anyone faced this and know what's causing it? For now, I have changed
>> the script to not use pssh
>> as a workaround. But would like to fix the root cause.
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-hitting-yum-install-issues-tp26786.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


-- 
Anusha


Parse XML using Java Spark

2016-04-18 Thread Jinan Alhajjaj
I would like to parse an XML file using Java, not Scala. The XML files are
Wikipedia dump files. If there is any example of how to parse XML using Java
(with Apache Spark, of course), please share it with me. Thank you.

  

Re: inter spark application communication

2016-04-18 Thread Michael Segel
Hi, 

Putting this back in User email list… 

Ok, 

So it's not inter-job communication, but taking the output of job 1 and using it 
as input to job 2. 
(Chaining of jobs.) 

So you don’t need anything fancy… 


> On Apr 18, 2016, at 1:47 PM, Soumitra Johri  
> wrote:
> 
> Yes , I want to chain spark applications. 
> On Mon, Apr 18, 2016 at 4:46 PM Michael Segel  > wrote:
> Yes, but I’m confused. Are you chaining your spark jobs? So you run job one 
> and its output is the input to job 2? 
> 
>> On Apr 18, 2016, at 1:21 PM, Soumitra Johri > > wrote:
>> 
>> By Akka you mean the Akka Actors ?
>> 
>> I am basically sending counts which are further aggregated in the 
>> second application.
>> 
>> 
>> Warm Regards
>> Soumitra
>> 
>> On Mon, Apr 18, 2016 at 3:54 PM, Michael Segel > > wrote:
>> have you thought about Akka?
>> 
>> What are you trying to send? Why do you want them to talk to one another?
>> 
>> > On Apr 18, 2016, at 12:04 PM, Soumitra Johri > > > wrote:
>> >
>> > Hi,
>> >
>> > I have two applications : App1 and App2.
>> > On a single cluster I have to spawn 5 instances of App1 and 1 instance of 
>> > App2.
>> >
>> > What would be the best way to send data from the 5 App1 instances to the 
>> > single App2 instance ?
>> >
>> > Right now I am using Kafka to send data from one spark application to the 
>> > spark application  but the setup doesn't seem right and I hope there is a 
>> > better way to do this.
>> >
>> > Warm Regards
>> > Soumitra
>> 
>> 
> 



Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Michael Segel
Perhaps this is a silly question on my part…. 

Why do you want to start up HDFS on a single node?

You only mention one Windows machine in your description of your cluster. 
If this is a learning experience, why not run Hadoop in a VM? (MapR and, I think, 
the other vendors make Linux images that can run in a VM.)

HTH

-Mike

> On Apr 18, 2016, at 10:27 AM, Jörn Franke  wrote:
> 
> I think the easiest would be to use a Hadoop Windows distribution, such as 
> Hortonworks. However, the Linux version of Hortonworks is a little bit more 
> advanced.
> 
> On 18 Apr 2016, at 14:13, My List  > wrote:
> 
>> Deepak,
>> 
>> The following could be a very dumb question, so pardon me for the same.
>> 1) When I download the binary for Spark with a version of Hadoop (Hadoop 2.6), 
>> does it not come in the zip or tar file?
>> 2) If it does not come along, is there an Apache Hadoop for Windows? Is it in 
>> binary format or will I have to build it?
>> 3) Is there a basic tutorial for Hadoop on Windows for the basic needs of 
>> Spark.
>> 
>> Thanks in Advance !
>> 
>> On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma > > wrote:
>> Once you download Hadoop and format the namenode, you can use start-dfs.sh 
>> to start HDFS.
>> Then use 'jps' to see if the datanode/namenode services are up and running.
>> 
>> Thanks
>> Deepak
>> 
>> On Mon, Apr 18, 2016 at 5:18 PM, My List > > wrote:
>> Hi ,
>> 
>> I am a newbie on Spark. I wanted to know how to start HDFS and verify that it 
>> has started on Spark standalone.
>> 
>> Env - 
>> Windows 7 - 64 bit
>> Spark 1.4.1 With Hadoop 2.6
>> 
>> Using Scala Shell - spark-shell
>> 
>> 
>> -- 
>> Thanks,
>> Harry
>> 
>> 
>> 
>> -- 
>> Thanks
>> Deepak
>> www.bigdatabig.com 
>> www.keosha.net 
>> 
>> 
>> -- 
>> Thanks,
>> Harmeet



Re: inter spark application communication

2016-04-18 Thread Michael Segel
have you thought about Akka? 

What are you trying to send? Why do you want them to talk to one another? 

> On Apr 18, 2016, at 12:04 PM, Soumitra Johri  
> wrote:
> 
> Hi,
> 
> I have two applications : App1 and App2. 
> On a single cluster I have to spawn 5 instances of App1 and 1 instance of 
> App2.
> 
> What would be the best way to send data from the 5 App1 instances to the 
> single App2 instance ?
> 
> Right now I am using Kafka to send data from one spark application to the 
> spark application  but the setup doesn't seem right and I hope there is a 
> better way to do this.
> 
> Warm Regards
> Soumitra


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to decide the number of tasks in Spark?

2016-04-18 Thread Dogtail L
Hi,

When launching a job in Spark, I have great trouble deciding the number of
tasks. Someone says it is better to create a task per HDFS block size,
i.e., make sure one task processes 128 MB of input data; others suggest that
the number of tasks should be twice the total cores available to the
job. Also, I found that someone suggests launching small tasks in Spark,
i.e., making sure each task lasts around 100 ms.

I am quite confused about all these suggestions. Is there any general rule
for deciding the number of tasks in Spark? Great thanks!

Best
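
The knobs behind those suggestions are mostly the input partition count and
the shuffle parallelism; a small Scala sketch of where they are set (the
numbers are placeholders to experiment with, not recommendations):

  // default parallelism is normally set before the context is created
  val conf = new org.apache.spark.SparkConf().set("spark.default.parallelism", "200")
  val sc   = new org.apache.spark.SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  // roughly one task per input partition
  val lines  = sc.textFile("hdfs:///data/input", 200)   // minPartitions hint
  val evened = lines.repartition(200)                   // or rebalance explicitly

  // shuffle parallelism for DataFrame/SQL operations
  sqlContext.setConf("spark.sql.shuffle.partitions", "200")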


Re: Spark support for Complex Event Processing (CEP)

2016-04-18 Thread Mich Talebzadeh
great stuff Mario. Much appreciated.

Mich

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 18 April 2016 at 20:08, Mario Ds Briggs  wrote:

> Hey Mich, Luciano
>
> Will provide links with docs by tomorrow
>
> thanks
> Mario
>
> - Message from Mich Talebzadeh  on Sun, 17
> Apr 2016 19:17:38 +0100 -
>
> *To:*
> Luciano Resende 
>
> *cc:*
> "user @spark" 
>
> *Subject:*
> Re: Spark support for Complex Event Processing (CEP)Thanks Luciano.
> Appreciated.
>
> Regards
>
> Dr Mich Talebzadeh
>
> LinkedIn
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
> 
>
> *http://talebzadehmich.wordpress.com*
> 
>
>
>
> On 17 April 2016 at 17:32, Luciano Resende <*luckbr1...@gmail.com*
> > wrote:
>
>Hi Mitch,
>
>I know some folks that were investigating/prototyping on this area,
>let me see if I can get them to reply here with more details.
>
>On Sun, Apr 17, 2016 at 1:54 AM, Mich Talebzadeh <
>*mich.talebza...@gmail.com* > wrote:
>Hi,
>
>Has Spark got libraries for CEP using Spark Streaming with Kafka by
>any chance?
>
>I am looking at Flink, which is supposed to have these libraries for CEP,
>but I find Flink itself very much a work in progress.
>
>Thanks
>
>Dr Mich Talebzadeh
>
>LinkedIn
>
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
>
> 
>
>*http://talebzadehmich.wordpress.com*
>
>
>
>
>
>
>--
>Luciano Resende
> *http://twitter.com/lresende1975* 
> *http://lresende.blogspot.com/* 
>
>
>
>
>


Re: Spark support for Complex Event Processing (CEP)

2016-04-18 Thread Mario Ds Briggs

Hey Mich, Luciano

 Will provide links with docs by tomorrow

thanks
Mario

- Message from Mich Talebzadeh on Sun, 17 Apr 2016 19:17:38 +0100 -

  To: Luciano Resende
  cc: "user @spark"
  Subject: Re: Spark support for Complex Event Processing (CEP)


Thanks Luciano. Appreciated.

Regards

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com




On 17 April 2016 at 17:32, Luciano Resende  wrote:
  Hi Mitch,

  I know some folks that were investigating/prototyping on this area, let
  me see if I can get them to reply here with more details.

  On Sun, Apr 17, 2016 at 1:54 AM, Mich Talebzadeh <
  mich.talebza...@gmail.com> wrote:
   Hi,

   Has Spark got libraries for CEP using Spark Streaming with Kafka by any
   chance?

   I am looking at Flink, which is supposed to have these libraries for CEP, but
    I find Flink itself very much a work in progress.

   Thanks

   Dr Mich Talebzadeh

   LinkedIn
   
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

   http://talebzadehmich.wordpress.com






  --
  Luciano Resende
  http://twitter.com/lresende1975
  http://lresende.blogspot.com/




inter spark application communication

2016-04-18 Thread Soumitra Johri
Hi,

I have two applications : App1 and App2.
On a single cluster I have to spawn 5 instances of App1 and 1 instance of
App2.

What would be the best way to send data from the 5 App1 instances to the
single App2 instance ?

Right now I am using Kafka to send data from one spark application to the
spark application  but the setup doesn't seem right and I hope there is a
better way to do this.

Warm Regards
Soumitra


large scheduler delay

2016-04-18 Thread Darshan Singh
Hi,

I have an application which uses 3 Parquet files, 2 of which are large and
the other one small. These files are in HDFS and are partitioned by column
"col1". I create 3 data frames, one for each Parquet file, but I pass the
col1 value so that each reads only the relevant data. I always read from HDFS
as the files are too big, and almost all queries use col1 and thus read
very little data, around 10-20 MB.

Then I join these data frames. The data frames from the large tables are
joined using a sort-merge join, and then joined with the 3rd one using a
broadcast hash join. I have 3 executors, each with 10 cores. The only
application running on the cluster is mine. The overall performance is quite
good, but there is a large scheduler delay, especially when reading the
smaller data frame from HDFS and in the final step when using the broadcast
hash join.

Usually compute time is just 50% of the scheduler delay time. The size of the
3rd, smaller data frame is less than 1 MB. I am not sure why there is so much
scheduler delay.

Is there a diagnostic tool which can be used to check why there is
scheduler delay, or to point to possible causes of the scheduler delay?


Thanks


Re: Calling Python code from Scala

2016-04-18 Thread Didier Marin
Hi,

Thank you for the quick answers !

Ndjido Ardo: actually the Python part is not legacy code. I'm not familiar
with UDFs in Spark; do you have some examples of Python UDFs and how to use
them in Scala code?

Holden Karau: the pipe interface seems like a good solution, but I'm a bit
concerned about performance. Will that work well if I need to pipe a lot of
data?

Cheers,
Didier

2016-04-18 19:29 GMT+02:00 Holden Karau :

> So if there is just a few python functions your interested in accessing
> you can also use the pipe interface (you'll have to manually serialize your
> data on both ends in ways that Python and Scala can respectively parse) -
> but its a very generic approach and can work with many different languages.
>
> On Mon, Apr 18, 2016 at 10:23 AM, Ndjido Ardo BAR 
> wrote:
>
>> Hi Didier,
>>
>> I think with PySpark you can wrap your legacy Python functions into UDFs
>> and use it in your DataFrames. But you have to use DataFrames instead of
>> RDD.
>>
>> cheers,
>> Ardo
>>
>> On Mon, Apr 18, 2016 at 7:13 PM, didmar  wrote:
>>
>>> Hi,
>>>
>>> I have a Spark project in Scala and I would like to call some Python
>>> functions from within the program.
>>> Both parts are quite big, so re-coding everything in one language is not
>>> really an option.
>>>
>>> The workflow would be:
>>> - Creating a RDD with Scala code
>>> - Mapping a Python function over this RDD
>>> - Using the result directly in Scala
>>>
>>> I've read about PySpark internals, but that didn't help much.
>>> Is it possible to do so, and preferably in an efficent manner ?
>>>
>>> Cheers,
>>> Didier
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Python-code-from-Scala-tp26798.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: Calling Python code from Scala

2016-04-18 Thread Mohit Jaggi
When faced with this issue, I followed the approach taken by PySpark and used 
Py4J. You have to:
- ensure your code is Java-compatible
- use Py4J to call the Java (Scala) code from Python


> On Apr 18, 2016, at 10:29 AM, Holden Karau  wrote:
> 
> So if there is just a few python functions your interested in accessing you 
> can also use the pipe interface (you'll have to manually serialize your data 
> on both ends in ways that Python and Scala can respectively parse) - but its 
> a very generic approach and can work with many different languages.
> 
> On Mon, Apr 18, 2016 at 10:23 AM, Ndjido Ardo BAR  > wrote:
> Hi Didier,
> 
> I think with PySpark you can wrap your legacy Python functions into UDFs and 
> use it in your DataFrames. But you have to use DataFrames instead of RDD. 
> 
> cheers,
> Ardo
> 
> On Mon, Apr 18, 2016 at 7:13 PM, didmar  > wrote:
> Hi,
> 
> I have a Spark project in Scala and I would like to call some Python
> functions from within the program.
> Both parts are quite big, so re-coding everything in one language is not
> really an option.
> 
> The workflow would be:
> - Creating a RDD with Scala code
> - Mapping a Python function over this RDD
> - Using the result directly in Scala
> 
> I've read about PySpark internals, but that didn't help much.
> Is it possible to do so, and preferably in an efficent manner ?
> 
> Cheers,
> Didier
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Python-code-from-Scala-tp26798.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 
> 
> 
> 
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau 


Re: Read Parquet in Java Spark

2016-04-18 Thread Jan Rock
Hi,

The Parquet section in the documentation (it's a very nice document):
http://spark.apache.org/docs/latest/sql-programming-guide.html

Be sure that the parquet file is created properly:
http://www.do-hadoop.com/scala/csv-parquet-scala-spark-snippet/

Kind Regards,
Jan


> On 18 Apr 2016, at 11:44, Ramkumar V  wrote:
> 
> HI, 
> 
> Any idea on this ?
> 
> Thanks,
> 
>   
> 
> 
> On Mon, Apr 4, 2016 at 2:47 PM, Akhil Das  > wrote:
> I didn't know you have a parquet file containing JSON data.
> 
> Thanks
> Best Regards
> 
> On Mon, Apr 4, 2016 at 2:44 PM, Ramkumar V  > wrote:
> Hi Akhil,
> 
> Thanks for your help. Why do you use "," as the separator?
> 
> I have a parquet file which contains only json in each line.
> 
> Thanks,
> 
>   
> 
> 
> On Mon, Apr 4, 2016 at 2:34 PM, Akhil Das  > wrote:
> Something like this (in scala):
> 
> val rdd = parquetFile.javaRDD().map(row => row.mkString(","))
> 
> You can create a map operation over your javaRDD to convert the 
> org.apache.spark.sql.Row 
> to String (the Row.mkString() operation)
> 
> Thanks
> Best Regards
> 
> On Mon, Apr 4, 2016 at 12:02 PM, Ramkumar V  > wrote:
> Any idea on this ? How to convert parquet file into JavaRDD ?
> 
> Thanks,
> 
>   
> 
> 
> On Thu, Mar 31, 2016 at 4:30 PM, Ramkumar V  > wrote:
> Hi,
> 
> Thanks for the reply. I tried this. It's returning JavaRDD<Row> instead of 
> JavaRDD<String>. How do I get a JavaRDD<String>?
> 
> Error :
> incompatible types: org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> 
> cannot be converted to org.apache.spark.api.java.JavaRDD<String>
> 
> 
> 
> 
> 
> 
> 
> Thanks,
> 
>   
> 
> 
> On Thu, Mar 31, 2016 at 2:57 PM, UMESH CHAUDHARY  > wrote:
> From Spark Documentation:
> 
> DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
> JavaRDD<Row> jRDD = parquetFile.javaRDD();
> 
> The javaRDD() method will convert the DataFrame to an RDD.
> 
> On Thu, Mar 31, 2016 at 2:51 PM, Ramkumar V  > wrote:
> Hi,
> 
> I'm trying to read Parquet log files in Java Spark. The Parquet log files are 
> stored in HDFS. I want to read and convert that Parquet file into a JavaRDD. I 
> was only able to find the SQLContext DataFrame API. How can I read it with the 
> SparkContext and RDDs? What is the best way to read it?
> 
> Thanks,
> 
>   
> 
> 
> 
> 
> 
> 
> 
> 



Re: Calling Python code from Scala

2016-04-18 Thread Holden Karau
So if there are just a few Python functions you're interested in accessing, you
can also use the pipe interface (you'll have to manually serialize your
data on both ends in ways that Python and Scala can respectively parse) -
but it's a very generic approach and can work with many different languages.
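
A minimal sketch of the pipe approach from the Scala side (the script name
and line format are assumptions; the external Python process just reads lines
on stdin and writes lines to stdout):

  // each element is serialized to one text line for the external process
  val input = sc.parallelize(Seq("1,2", "3,4", "5,6"))

  // transform.py must be runnable on every worker (ship it with --files)
  val piped = input.pipe("./transform.py")

  // parse whatever line format the script printed back
  val result = piped.map(_.trim)
  result.collect().foreach(println)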

On Mon, Apr 18, 2016 at 10:23 AM, Ndjido Ardo BAR  wrote:

> Hi Didier,
>
> I think with PySpark you can wrap your legacy Python functions into UDFs
> and use it in your DataFrames. But you have to use DataFrames instead of
> RDD.
>
> cheers,
> Ardo
>
> On Mon, Apr 18, 2016 at 7:13 PM, didmar  wrote:
>
>> Hi,
>>
>> I have a Spark project in Scala and I would like to call some Python
>> functions from within the program.
>> Both parts are quite big, so re-coding everything in one language is not
>> really an option.
>>
>> The workflow would be:
>> - Creating a RDD with Scala code
>> - Mapping a Python function over this RDD
>> - Using the result directly in Scala
>>
>> I've read about PySpark internals, but that didn't help much.
>> Is it possible to do so, and preferably in an efficent manner ?
>>
>> Cheers,
>> Didier
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Python-code-from-Scala-tp26798.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Jörn Franke
I think the easiest would be to use a Hadoop Windows distribution, such as 
Hortonworks. However, the Linux version of Hortonworks is a little bit more 
advanced.

> On 18 Apr 2016, at 14:13, My List  wrote:
> 
> Deepak,
> 
> The following could be a very dumb question, so pardon me for the same.
> 1) When I download the binary for Spark with a version of Hadoop (Hadoop 2.6), 
> does it not come in the zip or tar file?
> 2) If it does not come along, is there an Apache Hadoop for Windows? Is it in 
> binary format or will I have to build it?
> 3) Is there a basic tutorial for Hadoop on Windows for the basic needs of 
> Spark.
> 
> Thanks in Advance !
> 
>> On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma  wrote:
>> Once you download Hadoop and format the namenode, you can use start-dfs.sh 
>> to start HDFS.
>> Then use 'jps' to see if the datanode/namenode services are up and running.
>> 
>> Thanks
>> Deepak
>> 
>>> On Mon, Apr 18, 2016 at 5:18 PM, My List  wrote:
>>> Hi ,
>>> 
>>> I am a newbie on Spark. I wanted to know how to start HDFS and verify that it 
>>> has started on Spark standalone.
>>> 
>>> Env - 
>>> Windows 7 - 64 bit
>>> Spark 1.4.1 With Hadoop 2.6
>>> 
>>> Using Scala Shell - spark-shell
>>> 
>>> 
>>> -- 
>>> Thanks,
>>> Harry
>> 
>> 
>> 
>> -- 
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
> 
> 
> 
> -- 
> Thanks,
> Harmeet


Re: Calling Python code from Scala

2016-04-18 Thread Ndjido Ardo BAR
Hi Didier,

I think with PySpark you can wrap your legacy Python functions into UDFs
and use them in your DataFrames. But you have to use DataFrames instead of
RDDs.

cheers,
Ardo

On Mon, Apr 18, 2016 at 7:13 PM, didmar  wrote:

> Hi,
>
> I have a Spark project in Scala and I would like to call some Python
> functions from within the program.
> Both parts are quite big, so re-coding everything in one language is not
> really an option.
>
> The workflow would be:
> - Creating a RDD with Scala code
> - Mapping a Python function over this RDD
> - Using the result directly in Scala
>
> I've read about PySpark internals, but that didn't help much.
> Is it possible to do so, and preferably in an efficent manner ?
>
> Cheers,
> Didier
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Python-code-from-Scala-tp26798.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Calling Python code from Scala

2016-04-18 Thread didmar
Hi,

I have a Spark project in Scala and I would like to call some Python
functions from within the program.
Both parts are quite big, so re-coding everything in one language is not
really an option.

The workflow would be:
- Creating a RDD with Scala code
- Mapping a Python function over this RDD
- Using the result directly in Scala

I've read about PySpark internals, but that didn't help much.
Is it possible to do so, and preferably in an efficient manner?

Cheers,
Didier



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Python-code-from-Scala-tp26798.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Jdbc connection from sc.addJar failing

2016-04-18 Thread Mich Talebzadeh
You can either add the jar to spark-submit as pointed out, like

${SPARK_HOME}/bin/spark-submit \
--packages com.databricks:spark-csv_2.11:1.3.0 \
--jars
/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \

OR

Create/package a fat jar file that includes the JDBC driver for MS SQL Server
in the jar file itself:

${SPARK_HOME}/bin/spark-submit \
--class "${FILE_NAME}" \
--master spark://50.140.197.217:7077 \
--executor-memory=12G \
--executor-cores=2 \
--num-executors=2 \
--files ${SPARK_HOME}/conf/log4j.properties \
${JAR_FILE}

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 18 April 2016 at 17:29, chaz2505  wrote:

> Hi all,
>
> I'm a bit stuck with a problem that I thought was solved in SPARK-6913 but
> can't seem to get it to work.
>
> I'm programatically adding a jar (sc.addJar(pathToJar)) after the
> SparkContext is created then using the driver from the jar to load a table
> through sqlContext.read.jdbc(). When I try this I'm getting
> java.lang.ClassNotFoundException:
> com.microsoft.sqlserver.jdbc.SQLServerDriver (in my case I'm connecting to
> a
> MS SQLServer).
>
> I've tried this on both the shell and in an application submitted through
> spark-submit. When I add the jar with --jars it works fine but my
> application is meant to be a long running app that should not require the
> jar to be added on application start.
>
> Running Class.forName(driver).newInstance does not work on the driver but
> in
> a map function of an RDD it does work so the jar is being added only to the
> executors - shouldn't this be enough for sqlContext.read.jdbc() to work?
>
> I'm using Spark 1.6.1.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Jdbc-connection-from-sc-addJar-failing-tp26797.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Jason Nerothin
Hi Erwan,

You might consider InsightEdge: http://insightedge.io  
. It has the capability of doing WAN between data grids and would save you the 
work of having to re-invent the wheel. Additionally, RDDs can be shared between 
developers in the same DC.

Thanks,
Jason

> On Apr 18, 2016, at 11:18 AM, Erwan ALLAIN  wrote:
> 
> Hello,
> 
> I'm currently designing a solution where 2 distinct clusters Spark (2 
> datacenters) share the same Kafka (Kafka rack aware or manual broker 
> repartition). 
> The aims are
> - preventing DC crash: using kafka resiliency and consumer group mechanism 
> (or else ?)
> - keeping consistent offset among replica (vs mirror maker,which does not 
> keep offset)
> 
> I have several questions 
> 
> 1) Dynamic repartition (one or 2 DC)
> 
> I'm using KafkaDirectStream which map one partition kafka with one spark. Is 
> it possible to handle new or removed partition ? 
> In the compute method, it looks like we are always using the currentOffset 
> map to query the next batch and therefore it's always the same number of 
> partition ? Can we request metadata at each batch ?
> 
> 2) Multi DC Spark
> 
> Using Direct approach, a way to achieve this would be 
> - to "assign" (kafka 0.9 term) all topics to the 2 sparks
> - only one is reading the partition (Check every x interval, "lock" stored in 
> cassandra for instance)
> 
> => not sure if it works just an idea
> 
> Using Consumer Group
> - CommitOffset manually at the end of the batch
> 
> => Does spark handle partition rebalancing ?
> 
> I'd appreciate any ideas ! Let me know if it's not clear.
> 
> Erwan
> 
> 



Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Cody Koeninger
The current direct stream only handles exactly the partitions
specified at startup.  You'd have to restart the job if you changed
partitions.
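
For reference, both 1.x direct-stream variants fix their partitions when the
stream is created; a Scala sketch (brokers, topic and offsets are
placeholders, and ssc is an existing StreamingContext):

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

  // variant 1: whole topics; partitions are resolved once, at startup
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  // variant 2: explicit starting offsets per TopicAndPartition
  val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)
  val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
    (String, String)](ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))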

https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
towards using the Kafka 0.10 consumer, which would allow for dynamic
topic partitions.

Regarding your multi-DC questions, I'm not really clear on what you're saying.

On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN  wrote:
> Hello,
>
> I'm currently designing a solution where 2 distinct clusters Spark (2
> datacenters) share the same Kafka (Kafka rack aware or manual broker
> repartition).
> The aims are
> - preventing DC crash: using kafka resiliency and consumer group mechanism
> (or else ?)
> - keeping consistent offset among replica (vs mirror maker,which does not
> keep offset)
>
> I have several questions
>
> 1) Dynamic repartition (one or 2 DC)
>
> I'm using KafkaDirectStream which map one partition kafka with one spark. Is
> it possible to handle new or removed partition ?
> In the compute method, it looks like we are always using the currentOffset
> map to query the next batch and therefore it's always the same number of
> partition ? Can we request metadata at each batch ?
>
> 2) Multi DC Spark
>
> Using Direct approach, a way to achieve this would be
> - to "assign" (kafka 0.9 term) all topics to the 2 sparks
> - only one is reading the partition (Check every x interval, "lock" stored
> in cassandra for instance)
>
> => not sure if it works just an idea
>
> Using Consumer Group
> - CommitOffset manually at the end of the batch
>
> => Does spark handle partition rebalancing ?
>
> I'd appreciate any ideas ! Let me know if it's not clear.
>
> Erwan
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Jdbc connection from sc.addJar failing

2016-04-18 Thread chaz2505
Hi all, 

I'm a bit stuck with a problem that I thought was solved in SPARK-6913 but
can't seem to get it to work. 

I'm programatically adding a jar (sc.addJar(pathToJar)) after the
SparkContext is created then using the driver from the jar to load a table
through sqlContext.read.jdbc(). When I try this I'm getting
java.lang.ClassNotFoundException:
com.microsoft.sqlserver.jdbc.SQLServerDriver (in my case I'm connecting to a
MS SQLServer). 

I've tried this on both the shell and in an application submitted through
spark-submit. When I add the jar with --jars it works fine but my
application is meant to be a long running app that should not require the
jar to be added on application start. 

Running Class.forName(driver).newInstance does not work on the driver but in
a map function of an RDD it does work so the jar is being added only to the
executors - shouldn't this be enough for sqlContext.read.jdbc() to work? 

I'm using Spark 1.6.1. 

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Jdbc-connection-from-sc-addJar-failing-tp26797.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Erwan ALLAIN
Hello,

I'm currently designing a solution where 2 distinct Spark clusters (2
datacenters) share the same Kafka (Kafka rack awareness or manual broker
repartitioning).
The aims are
- preventing DC crash: using kafka resiliency and consumer group mechanism
(or else ?)
- keeping consistent offsets among replicas (vs MirrorMaker, which does not
keep offsets)

I have several questions

1) Dynamic repartition (one or 2 DC)

I'm using KafkaDirectStream, which maps one Kafka partition to one Spark partition.
Is it possible to handle new or removed partitions?
In the compute method, it looks like we are always using the currentOffset
map to query the next batch, and therefore it's always the same number of
partitions? Can we request metadata at each batch?

2) Multi DC Spark

*Using Direct approach,* a way to achieve this would be
- to "assign" (kafka 0.9 term) all topics to the 2 sparks
- only one is reading the partition (Check every x interval, "lock" stored
in cassandra for instance)

=> not sure if it works just an idea

*Using Consumer Group*
- CommitOffset manually at the end of the batch

=> Does spark handle partition rebalancing ?

I'd appreciate any ideas ! Let me know if it's not clear.

Erwan
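
For reference, a sketch of reading the per-batch offsets from the 0.8 direct stream so they can be committed manually at the end of the batch; the external store (e.g. Cassandra, as mentioned above) is left as a placeholder comment, and `stream` is assumed to be the direct stream:

  import org.apache.spark.streaming.kafka.HasOffsetRanges

  stream.foreachRDD { rdd =>
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    ranges.foreach { o =>
      // persist (o.topic, o.partition, o.fromOffset, o.untilOffset) at the end of the batch,
      // e.g. in the external store the clusters use for coordination
    }
  }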


Re: Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Joao Azevedo

On 18/04/16 11:54, Joao Azevedo wrote:

I'm trying to submit Spark applications to Mesos using the 'cluster'
deploy mode. I'm using Marathon as the container orchestration platform
and launching the Mesos Cluster Dispatcher through it. I'm using Spark
1.6.1 with Scala 2.11.


Using the current master branch makes things work. I believe this fix: 
https://github.com/apache/spark/commit/c9b89a0a0921ce3d52864afd4feb7f37b90f7b46 
is required for things to work properly in cluster mode on Mesos (and 
this commit is not on release 1.6.1).


Thanks!


Joao

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Steve Loughran


1. You don't need to start HDFS or anything like that; just set up Spark so 
that it can use the Hadoop APIs for some things. On Windows this depends on 
some native libs. This means you don't need to worry about learning it. Focus 
on the Spark APIs, Python and/or Scala.

2. You should be able to find the native windows binaries here: 
https://github.com/steveloughran/winutils

These are either versions I lifted out of HDP for Windows or, more recently, 
builds made by checking out and building on Windows the same git commit as was 
used/voted on for the ASF releases.

3. For performance, you also need the native libs for compression codecs: 
snappy, LZO &c. I see I've put them in the windows 2.7.1 release, but not the 
others (ASF mvn package doesn't add them): 
https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin ...

If the version of the hadoop libs aren't working with one of those versions 
I've put up, ping me and I'll build up the relevant binaries out of the ASF
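
A minimal sketch of pointing a local Spark job at those binaries from application code; the path is an assumption and must be the directory that contains bin\winutils.exe (setting the HADOOP_HOME environment variable to the same directory works equally well):

  import org.apache.spark.{SparkConf, SparkContext}

  // tell Hadoop's native-helper lookup where winutils.exe lives, before the context starts
  System.setProperty("hadoop.home.dir", "C:\\hadoop-2.7.1")
  val sc = new SparkContext(new SparkConf().setAppName("winutils-sketch").setMaster("local[*]"))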




From: My List <mylistt...@gmail.com>
Date: Monday, 18 April 2016 at 13:13
To: Deepak Sharma <deepakmc...@gmail.com>
Cc: SparkUser <user@spark.apache.org>
Subject: Re: How to start HDFS on Spark Standalone

Deepak,

The following could be a very dumb questions so pardon me for the same.
1) When I download the binary for Spark with a version of Hadoop(Hadoop 2.6) 
does it not come in the zip or tar file?
2) If it does not come along,Is there a Apache Hadoop for windows, is it in 
binary format or will have to build it?
3) Is there a basic tutorial for Hadoop on windows for the basic needs of Spark.

Thanks in Advance !

On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
Once you download hadoop and format the namenode , you can use start-dfs.sh to 
start hdfs.
Then use 'jps' to sss if datanode/namenode services are up and running.

Thanks
Deepak

On Mon, Apr 18, 2016 at 5:18 PM, My List <mylistt...@gmail.com> wrote:
Hi ,

I am a newbie on Spark.I wanted to know how to start and verify if HDFS has 
started on Spark stand alone.

Env -
Windows 7 - 64 bit
Spark 1.4.1 With Hadoop 2.6

Using Scala Shell - spark-shell


--
Thanks,
Harry



--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net



--
Thanks,
Harmeet


Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
Could you post some psuedo-code? 

val data = rdd.whatever(…)
val facts: Array[DroolsCompatibleType] = convert(data)

facts.map{ f => ksession.insert( f ) }
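
A slightly fuller sketch of that shape, assuming the Drools 6 KIE API and a DStream named dstream. It only shows building a stateless session on the executors, once per partition, instead of serializing one from the driver; actually picking up changed rules would additionally depend on how the KieContainer is refreshed (for example via a KieScanner), which is not shown:

  import org.kie.api.KieServices

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { facts =>
      // build the session inside the closure rather than shipping it from the driver
      val kieServices = KieServices.Factory.get()
      val kContainer  = kieServices.getKieClasspathContainer
      val session     = kContainer.newStatelessKieSession()
      facts.foreach(f => session.execute(f))  // fire the rules against each fact
    }
  }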


> On Apr 18, 2016, at 9:20 AM, yaoxiaohua  wrote:
> 
> Thanks for your reply , Jason,
> I can use stateless session in spark streaming job.
> But now my question is when the rule update, how to pass it to RDD?
> We generate a ruleExecutor(stateless session) in main method,
> Then pass the ruleExectutor in Rdd.
>  
> I am new in drools, I am trying to read the drools doc now.
> Best Regards,
> Evan
> From: Jason Nerothin [mailto:jasonnerot...@gmail.com]
> Sent: 18 April 2016 21:42
> To: yaoxiaohua
> Cc: user@spark.apache.org 
> Subject: Re: drools on spark, how to reload rule file?
>  
> The limitation is in the drools implementation.
>  
> Changing a rule in a stateful KB is not possible, particularly if it leads to 
> logical contradictions with the previous version or any other rule in the KB.
>  
> When we ran into this, we worked around (part of) it by salting the rule name 
> with a unique id. To get the existing rules to be evaluated when we wanted, 
> we kept a property on each fact that we mutated each time. 
>  
> Hackery, but it worked.
>  
> I recommend you try hard to use a stateless KB, if it is possible.
> 
> Thank you.
>  
> Jason
>  
> // brevity and poor typing by iPhone
> 
> On Apr 18, 2016, at 04:43, yaoxiaohua  > wrote:
> 
>> Hi bros,
>> I am trying using drools on spark to parse log and do some 
>> rule match and derived some fields.
>> Now I refer one blog on cloudera, 
>> http://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/
>>  
>> 
>> 
>> now I want to know whether it possible to reload the rule on 
>> the fly?
>> Thanks in advance.
>>  
>> Best Regards,
>> Evan



Re: Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Joao Azevedo

On 18/04/16 13:30, Jacek Laskowski wrote:

Not that I might help much with deployment to Mesos, but can you
describe your Mesos/Marathon setup?


The setup is a single master with multiple slaves cluster. The master is 
also running Zookeeper besides 'mesos-master'. There's a DNS record 
pointing to the master node. The slaves run Marathon besides 
'mesos-slave' and use the Zookeeper DNS record to find Mesos' master. 
We're running Mesos v0.27.1 and Marathon v0.15.2. The slaves are all 
behind a load balancer which has a DNS record pointing to it. Everything 
is on Amazon EC2.



What's Mesos cluster dispatcher?


Mesos Cluster Dispatcher is the required application to use Spark's 
cluster mode in Mesos 
(http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode). It 
is also managed through Marathon.



Joao

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: drools on spark, how to reload rule file?

2016-04-18 Thread yaoxiaohua
Thanks for your reply , Jason,

I can use stateless session in spark streaming job.

But now my question is: when the rules are updated, how do we pass them to the RDD?

We generate a ruleExecutor (stateless session) in the main method,

then pass the ruleExecutor into the RDD.

 

I am new in drools, I am trying to read the drools doc now.

Best Regards,

Evan

From: Jason Nerothin [mailto:jasonnerot...@gmail.com] 
Sent: 18 April 2016 21:42
To: yaoxiaohua
Cc: user@spark.apache.org
Subject: Re: drools on spark, how to reload rule file?

 

The limitation is in the drools implementation.

 

Changing a rule in a stateful KB is not possible, particularly if it leads to 
logical contradictions with the previous version or any other rule in the KB.

 

When we ran into this, we worked around (part of) it by salting the rule name 
with a unique id. To get the existing rules to be evaluated when we wanted, we 
kept a property on each fact that we mutated each time. 

 

Hackery, but it worked.

 

I recommend you try hard to use a stateless KB, if it is possible.

Thank you.

 

Jason

 

// brevity and poor typing by iPhone


On Apr 18, 2016, at 04:43, yaoxiaohua  wrote:

Hi bros,

I am trying using drools on spark to parse log and do some rule 
match and derived some fields.

Now I refer one blog on cloudera, 

http://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/



now I want to know whether it possible to reload the rule on 
the fly?

Thanks in advance.

 

Best Regards,

Evan



Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
No. We specify it as a configuration option to the spark-submit. Does that
make a difference?

Regards,
Raghava.


On Mon, Apr 18, 2016 at 9:56 AM, Sonal Goyal  wrote:

> Are you specifying your spark master in the scala program?
>
> Best Regards,
> Sonal
> Founder, Nube Technologies 
> Reifier at Strata Hadoop World
> 
> Reifier at Spark Summit 2015
> 
>
> 
>
>
>
> On Mon, Apr 18, 2016 at 6:26 PM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Mike,
>>
>> We tried that. This map task is actually part of a larger set of
>> operations. I pointed out this map task since it involves partitionBy() and
>> we always use partitionBy() whenever partition-unaware shuffle operations
>> are performed (such as distinct). We in fact do not notice a change in the
>> distribution after several unrelated stages are executed and a significant
>> time has passed (nearly 10-15 minutes).
>>
>> I agree. We are not looking for partitions to go to specific nodes and
>> nor do we expect a uniform distribution of keys across the cluster. There
>> will be a skew. But it cannot be that all the data is on one node and
>> nothing on the other and no, the keys are not the same. They vary from 1 to
>> around 55000 (integers). What makes this strange is that it seems to work
>> fine on the spark shell (REPL).
>>
>> Regards,
>> Raghava.
>>
>>
>> On Mon, Apr 18, 2016 at 1:14 AM, Mike Hynes <91m...@gmail.com> wrote:
>>
>>> A HashPartitioner will indeed partition based on the key, but you
>>> cannot know on *which* node that key will appear. Again, the RDD
>>> partitions will not necessarily be distributed evenly across your
>>> nodes because of the greedy scheduling of the first wave of tasks,
>>> particularly if those tasks have durations less than the initial
>>> executor delay. I recommend you look at your logs to verify if this is
>>> happening to you.
>>>
>>> Mike
>>>
>>> On 4/18/16, Anuj Kumar  wrote:
>>> > Good point Mike +1
>>> >
>>> > On Mon, Apr 18, 2016 at 9:47 AM, Mike Hynes <91m...@gmail.com> wrote:
>>> >
>>> >> When submitting a job with spark-submit, I've observed delays (up to
>>> >> 1--2 seconds) for the executors to respond to the driver in order to
>>> >> receive tasks in the first stage. The delay does not persist once the
>>> >> executors have been synchronized.
>>> >>
>>> >> When the tasks are very short, as may be your case (relatively small
>>> >> data and a simple map task like you have described), the 8 tasks in
>>> >> your stage may be allocated to only 1 executor in 2 waves of 4, since
>>> >> the second executor won't have responded to the master before the
>>> >> first 4 tasks on the first executor have completed.
>>> >>
>>> >> To see if this is the cause in your particular case, you could try the
>>> >> following to confirm:
>>> >> 1. Examine the starting times of the tasks alongside their
>>> >> executor
>>> >> 2. Make a "dummy" stage execute before your real stages to
>>> >> synchronize the executors by creating and materializing any random RDD
>>> >> 3. Make the tasks longer, i.e. with some silly computational
>>> >> work.
>>> >>
>>> >> Mike
>>> >>
>>> >>
>>> >> On 4/17/16, Raghava Mutharaju  wrote:
>>> >> > Yes its the same data.
>>> >> >
>>> >> > 1) The number of partitions are the same (8, which is an argument to
>>> >> > the
>>> >> > HashPartitioner). In the first case, these partitions are spread
>>> across
>>> >> > both the worker nodes. In the second case, all the partitions are on
>>> >> > the
>>> >> > same node.
>>> >> > 2) What resources would be of interest here? Scala shell takes the
>>> >> default
>>> >> > parameters since we use "bin/spark-shell --master " to
>>> run
>>> >> the
>>> >> > scala-shell. For the scala program, we do set some configuration
>>> >> > options
>>> >> > such as driver memory (12GB), parallelism is set to 8 and we use
>>> Kryo
>>> >> > serializer.
>>> >> >
>>> >> > We are running this on Azure D3-v2 machines which have 4 cores and
>>> 14GB
>>> >> > RAM.1 executor runs on each worker node. Following configuration
>>> >> > options
>>> >> > are set for the scala program -- perhaps we should move it to the
>>> spark
>>> >> > config file.
>>> >> >
>>> >> > Driver memory and executor memory are set to 12GB
>>> >> > parallelism is set to 8
>>> >> > Kryo serializer is used
>>> >> > Number of retainedJobs and retainedStages has been increased to
>>> check
>>> >> them
>>> >> > in the UI.
>>> >> >
>>> >> > What information regarding Spark Context would be of interest here?
>>> >> >
>>> >> > Regards,
>>> >> > Raghava.
>>> >> >
>>> >> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar 
>>> >> > wrote:
>>> >> >
>>> >> >> If the data file is same then it should have similar distribution
>>> of
>>> >> >> keys.
>>> >> >> Few queries-
>>> >> >>
>>> >> >> 1. Di

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Sonal Goyal
Are you specifying your spark master in the scala program?

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015






On Mon, Apr 18, 2016 at 6:26 PM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:

> Mike,
>
> We tried that. This map task is actually part of a larger set of
> operations. I pointed out this map task since it involves partitionBy() and
> we always use partitionBy() whenever partition-unaware shuffle operations
> are performed (such as distinct). We in fact do not notice a change in the
> distribution after several unrelated stages are executed and a significant
> time has passed (nearly 10-15 minutes).
>
> I agree. We are not looking for partitions to go to specific nodes and nor
> do we expect a uniform distribution of keys across the cluster. There will
> be a skew. But it cannot be that all the data is on one node and nothing on
> the other and no, the keys are not the same. They vary from 1 to around
> 55000 (integers). What makes this strange is that it seems to work fine on
> the spark shell (REPL).
>
> Regards,
> Raghava.
>
>
> On Mon, Apr 18, 2016 at 1:14 AM, Mike Hynes <91m...@gmail.com> wrote:
>
>> A HashPartitioner will indeed partition based on the key, but you
>> cannot know on *which* node that key will appear. Again, the RDD
>> partitions will not necessarily be distributed evenly across your
>> nodes because of the greedy scheduling of the first wave of tasks,
>> particularly if those tasks have durations less than the initial
>> executor delay. I recommend you look at your logs to verify if this is
>> happening to you.
>>
>> Mike
>>
>> On 4/18/16, Anuj Kumar  wrote:
>> > Good point Mike +1
>> >
>> > On Mon, Apr 18, 2016 at 9:47 AM, Mike Hynes <91m...@gmail.com> wrote:
>> >
>> >> When submitting a job with spark-submit, I've observed delays (up to
>> >> 1--2 seconds) for the executors to respond to the driver in order to
>> >> receive tasks in the first stage. The delay does not persist once the
>> >> executors have been synchronized.
>> >>
>> >> When the tasks are very short, as may be your case (relatively small
>> >> data and a simple map task like you have described), the 8 tasks in
>> >> your stage may be allocated to only 1 executor in 2 waves of 4, since
>> >> the second executor won't have responded to the master before the
>> >> first 4 tasks on the first executor have completed.
>> >>
>> >> To see if this is the cause in your particular case, you could try the
>> >> following to confirm:
>> >> 1. Examine the starting times of the tasks alongside their
>> >> executor
>> >> 2. Make a "dummy" stage execute before your real stages to
>> >> synchronize the executors by creating and materializing any random RDD
>> >> 3. Make the tasks longer, i.e. with some silly computational
>> >> work.
>> >>
>> >> Mike
>> >>
>> >>
>> >> On 4/17/16, Raghava Mutharaju  wrote:
>> >> > Yes its the same data.
>> >> >
>> >> > 1) The number of partitions are the same (8, which is an argument to
>> >> > the
>> >> > HashPartitioner). In the first case, these partitions are spread
>> across
>> >> > both the worker nodes. In the second case, all the partitions are on
>> >> > the
>> >> > same node.
>> >> > 2) What resources would be of interest here? Scala shell takes the
>> >> default
>> >> > parameters since we use "bin/spark-shell --master " to
>> run
>> >> the
>> >> > scala-shell. For the scala program, we do set some configuration
>> >> > options
>> >> > such as driver memory (12GB), parallelism is set to 8 and we use Kryo
>> >> > serializer.
>> >> >
>> >> > We are running this on Azure D3-v2 machines which have 4 cores and
>> 14GB
>> >> > RAM.1 executor runs on each worker node. Following configuration
>> >> > options
>> >> > are set for the scala program -- perhaps we should move it to the
>> spark
>> >> > config file.
>> >> >
>> >> > Driver memory and executor memory are set to 12GB
>> >> > parallelism is set to 8
>> >> > Kryo serializer is used
>> >> > Number of retainedJobs and retainedStages has been increased to check
>> >> them
>> >> > in the UI.
>> >> >
>> >> > What information regarding Spark Context would be of interest here?
>> >> >
>> >> > Regards,
>> >> > Raghava.
>> >> >
>> >> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar 
>> >> > wrote:
>> >> >
>> >> >> If the data file is same then it should have similar distribution of
>> >> >> keys.
>> >> >> Few queries-
>> >> >>
>> >> >> 1. Did you compare the number of partitions in both the cases?
>> >> >> 2. Did you compare the resource allocation for Spark Shell vs Scala
>> >> >> Program being submitted?
>> >> >>
>> >> >> Also, can you please share the details of Spark Context, Environment
>> >> >> and
>> >> >> Executors when you run via Scala progra

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
The limitation is in the drools implementation.

Changing a rule in a stateful KB is not possible, particularly if it leads to 
logical contradictions with the previous version or any other rule in the KB.

When we ran into this, we worked around (part of) it by salting the rule name 
with a unique id. To get the existing rules to be evaluated when we wanted, we 
kept a property on each fact that we mutated each time. 

Hackery, but it worked.

I recommend you try hard to use a stateless KB, if it is possible.

Thank you.

Jason

// brevity and poor typing by iPhone

> On Apr 18, 2016, at 04:43, yaoxiaohua  wrote:
> 
> Hi bros,
> I am trying using drools on spark to parse log and do some 
> rule match and derived some fields.
> Now I refer one blog on cloudera,
> http://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/
>
> now I want to know whether it possible to reload the rule on 
> the fly?
> Thanks in advance.
>  
> Best Regards,
> Evan


RE: Logging in executors

2016-04-18 Thread Ashic Mahtab
I spent ages on this recently, and here's what I found:
--conf  
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/file/on.executor.properties"
 
works. Alternatively, you can also do:
--conf  
"spark.executor.extraJavaOptions=-Dlog4j.configuration=filename.properties"  
--files="path/to/filename.properties"
log4j.properties files packaged with the application don't seem to have any 
effect. This is likely because log4j gets initialised before your app stuff is 
loaded. You can also reinitialise log4j logging as part of your application 
code. That also worked for us, but we went the extraJavaOptions route as it was 
less invasive on the application side.
-Ashic.
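
For reference, a minimal sketch of the reinitialisation route mentioned above, assuming the properties file was shipped with --files (the file name is a placeholder matching the example command):

  import org.apache.log4j.PropertyConfigurator
  import org.apache.spark.SparkFiles

  rdd.foreachPartition { _ =>
    // reconfigure log4j on the executor from the shipped properties file
    PropertyConfigurator.configure(SparkFiles.get("filename.properties"))
  }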

Date: Mon, 18 Apr 2016 10:32:03 -0300
Subject: Re: Logging in executors
From: cma...@despegar.com
To: yuzhih...@gmail.com
CC: user@spark.apache.org

Thanks Ted, already checked it but is not the same. I'm working with StandAlone 
spark, the examples refers to HDFS paths, therefore I assume Hadoop 2 Resource 
Manager is used. I've tried all possible flavours. The only one that worked was 
changing the spark-defaults.conf in every machine. I'll go with this by now, 
but the extra java opts for the executor are definitely not working, at least 
for logging configuration.

Thanks,-carlos.
On Fri, Apr 15, 2016 at 3:28 PM, Ted Yu  wrote:
See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1
On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas  wrote:
Hi guys,
any clue on this? Clearly the 
spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the 
executors.
Thanks,-carlos.
On Wed, Apr 13, 2016 at 2:48 PM, Carlos Rojas Matas  wrote:
Hi Yong,
thanks for your response. As I said in my first email, I've tried both the 
reference to the classpath resource (env/dev/log4j-executor.properties) as the 
file:// protocol. Also, the driver logging is working fine and I'm using the 
same kind of reference.
Below the content of my classpath:


Plus this is the content of the exploded fat jar assembled with sbt assembly 
plugin:



This folder is at the root level of the classpath.
Thanks,-carlos.
On Wed, Apr 13, 2016 at 2:35 PM, Yong Zhang  wrote:



Is the env/dev/log4j-executor.properties file within your jar file? Is the path 
matching with what you specified as env/dev/log4j-executor.properties?
If you read the log4j document here: 
https://logging.apache.org/log4j/1.2/manual.html
When you specify the log4j.configuration=my_custom.properties, you have 2 
option:
1) the my_custom.properties has to be in the jar (or in the classpath). In your 
case, since you specify the package path, you need to make sure they are 
matched in your jar file2) use like 
log4j.configuration=file:///tmp/my_custom.properties. In this way, you need to 
make sure file my_custom.properties exists in /tmp folder on ALL of your worker 
nodes.
Yong

Date: Wed, 13 Apr 2016 14:18:24 -0300
Subject: Re: Logging in executors
From: cma...@despegar.com
To: yuzhih...@gmail.com
CC: user@spark.apache.org

Thanks for your response Ted. You're right, there was a typo. I changed it, now 
I'm executing:
bin/spark-submit --master spark://localhost:7077 --conf 
"spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
 --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-executor.properties"
 --class
The content of this file is:
# Set everything to be logged to the console
log4j.rootCategory=INFO, FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/executor.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=100MB
log4j.appender.FILE.MaxBackupIndex=5
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.com.despegar.p13n=DEBUG

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

Finally, the code on which I'm using logging in the executor is:
def groupAndCount(keys: DStream[(String, List[String])])(handler: 
ResultHandler) = {
  
  val result = keys.reduceByKey((prior, current) => {
(prior ::: 

Re: Logging in executors

2016-04-18 Thread Ted Yu
Looking through this thread, I don't see Spark version you use.

Can you tell us the Spark release ?

Thanks

On Mon, Apr 18, 2016 at 6:32 AM, Carlos Rojas Matas 
wrote:

> Thanks Ted, already checked it but is not the same. I'm working with
> StandAlone spark, the examples refers to HDFS paths, therefore I assume
> Hadoop 2 Resource Manager is used. I've tried all possible flavours. The
> only one that worked was changing the spark-defaults.conf in every machine.
> I'll go with this by now, but the extra java opts for the executor are
> definitely not working, at least for logging configuration.
>
> Thanks,
> -carlos.
>
> On Fri, Apr 15, 2016 at 3:28 PM, Ted Yu  wrote:
>
>> See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1
>>
>> On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas 
>> wrote:
>>
>>> Hi guys,
>>>
>>> any clue on this? Clearly the
>>> spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the
>>> executors.
>>>
>>> Thanks,
>>> -carlos.
>>>
>>> On Wed, Apr 13, 2016 at 2:48 PM, Carlos Rojas Matas >> > wrote:
>>>
 Hi Yong,

 thanks for your response. As I said in my first email, I've tried both
 the reference to the classpath resource (env/dev/log4j-executor.properties)
 as the file:// protocol. Also, the driver logging is working fine and I'm
 using the same kind of reference.

 Below the content of my classpath:

 [image: Inline image 1]

 Plus this is the content of the exploded fat jar assembled with sbt
 assembly plugin:

 [image: Inline image 2]


 This folder is at the root level of the classpath.

 Thanks,
 -carlos.

 On Wed, Apr 13, 2016 at 2:35 PM, Yong Zhang 
 wrote:

> Is the env/dev/log4j-executor.properties file within your jar file? Is
> the path matching with what you specified as
> env/dev/log4j-executor.properties?
>
> If you read the log4j document here:
> https://logging.apache.org/log4j/1.2/manual.html
>
> When you specify the log4j.configuration=my_custom.properties, you
> have 2 option:
>
> 1) the my_custom.properties has to be in the jar (or in the
> classpath). In your case, since you specify the package path, you need to
> make sure they are matched in your jar file
> 2) use like log4j.configuration=file:///tmp/my_custom.properties. In
> this way, you need to make sure file my_custom.properties exists in /tmp
> folder on ALL of your worker nodes.
>
> Yong
>
> --
> Date: Wed, 13 Apr 2016 14:18:24 -0300
> Subject: Re: Logging in executors
> From: cma...@despegar.com
> To: yuzhih...@gmail.com
> CC: user@spark.apache.org
>
>
> Thanks for your response Ted. You're right, there was a typo. I
> changed it, now I'm executing:
>
> bin/spark-submit --master spark://localhost:7077 --conf
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
> --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-executor.properties"
> --class
>
> The content of this file is:
>
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, FILE
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss}
> %p %c{1}: %m%n
>
> log4j.appender.FILE=org.apache.log4j.RollingFileAppender
> log4j.appender.FILE.File=/tmp/executor.log
> log4j.appender.FILE.ImmediateFlush=true
> log4j.appender.FILE.Threshold=debug
> log4j.appender.FILE.Append=true
> log4j.appender.FILE.MaxFileSize=100MB
> log4j.appender.FILE.MaxBackupIndex=5
> log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
> log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
> %c{1}: %m%n
>
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark-project.jetty=WARN
>
> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
>
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> log4j.logger.com.despegar.p13n=DEBUG
>
> # SPARK-9183: Settings to avoid annoying messages when looking up
> nonexistent UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
>
>
> Finally, the code on which I'm using logging in the executor is:
>
> def groupAndCount(keys: DStream[(String, List[String])])(handler: 
> ResultHandler) = {

Re: Logging in executors

2016-04-18 Thread Carlos Rojas Matas
Thanks Ted, already checked it but is not the same. I'm working with
StandAlone spark, the examples refers to HDFS paths, therefore I assume
Hadoop 2 Resource Manager is used. I've tried all possible flavours. The
only one that worked was changing the spark-defaults.conf in every machine.
I'll go with this by now, but the extra java opts for the executor are
definitely not working, at least for logging configuration.

Thanks,
-carlos.

On Fri, Apr 15, 2016 at 3:28 PM, Ted Yu  wrote:

> See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1
>
> On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas 
> wrote:
>
>> Hi guys,
>>
>> any clue on this? Clearly the
>> spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the
>> executors.
>>
>> Thanks,
>> -carlos.
>>
>> On Wed, Apr 13, 2016 at 2:48 PM, Carlos Rojas Matas 
>> wrote:
>>
>>> Hi Yong,
>>>
>>> thanks for your response. As I said in my first email, I've tried both
>>> the reference to the classpath resource (env/dev/log4j-executor.properties)
>>> as the file:// protocol. Also, the driver logging is working fine and I'm
>>> using the same kind of reference.
>>>
>>> Below the content of my classpath:
>>>
>>> [image: Inline image 1]
>>>
>>> Plus this is the content of the exploded fat jar assembled with sbt
>>> assembly plugin:
>>>
>>> [image: Inline image 2]
>>>
>>>
>>> This folder is at the root level of the classpath.
>>>
>>> Thanks,
>>> -carlos.
>>>
>>> On Wed, Apr 13, 2016 at 2:35 PM, Yong Zhang 
>>> wrote:
>>>
 Is the env/dev/log4j-executor.properties file within your jar file? Is
 the path matching with what you specified as
 env/dev/log4j-executor.properties?

 If you read the log4j document here:
 https://logging.apache.org/log4j/1.2/manual.html

 When you specify the log4j.configuration=my_custom.properties, you have
 2 option:

 1) the my_custom.properties has to be in the jar (or in the classpath).
 In your case, since you specify the package path, you need to make sure
 they are matched in your jar file
 2) use like log4j.configuration=file:///tmp/my_custom.properties. In
 this way, you need to make sure file my_custom.properties exists in /tmp
 folder on ALL of your worker nodes.

 Yong

 --
 Date: Wed, 13 Apr 2016 14:18:24 -0300
 Subject: Re: Logging in executors
 From: cma...@despegar.com
 To: yuzhih...@gmail.com
 CC: user@spark.apache.org


 Thanks for your response Ted. You're right, there was a typo. I changed
 it, now I'm executing:

 bin/spark-submit --master spark://localhost:7077 --conf
 "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
 --conf
 "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-executor.properties"
 --class

 The content of this file is:

 # Set everything to be logged to the console
 log4j.rootCategory=INFO, FILE
 log4j.appender.console=org.apache.log4j.ConsoleAppender
 log4j.appender.console.target=System.err
 log4j.appender.console.layout=org.apache.log4j.PatternLayout
 log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss}
 %p %c{1}: %m%n

 log4j.appender.FILE=org.apache.log4j.RollingFileAppender
 log4j.appender.FILE.File=/tmp/executor.log
 log4j.appender.FILE.ImmediateFlush=true
 log4j.appender.FILE.Threshold=debug
 log4j.appender.FILE.Append=true
 log4j.appender.FILE.MaxFileSize=100MB
 log4j.appender.FILE.MaxBackupIndex=5
 log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
 log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
 %c{1}: %m%n

 # Settings to quiet third party logs that are too verbose
 log4j.logger.org.spark-project.jetty=WARN

 log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
 log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
 log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
 log4j.logger.org.apache.parquet=ERROR
 log4j.logger.parquet=ERROR
 log4j.logger.com.despegar.p13n=DEBUG

 # SPARK-9183: Settings to avoid annoying messages when looking up
 nonexistent UDFs in SparkSQL with Hive support
 log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
 log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR


 Finally, the code on which I'm using logging in the executor is:

 def groupAndCount(keys: DStream[(String, List[String])])(handler: 
 ResultHandler) = {

   val result = keys.reduceByKey((prior, current) => {
 (prior ::: current)
   }).flatMap {
 case (date, keys) =>
   val rs = keys.groupBy(x => x).map(
   obs =>{
 val (d,t) = date.split("@") match {
   case Array(d,t) => (d,t)
   

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
Mike,

We tried that. This map task is actually part of a larger set of
operations. I pointed out this map task since it involves partitionBy() and
we always use partitionBy() whenever partition-unaware shuffle operations
are performed (such as distinct). We in fact do not notice a change in the
distribution after several unrelated stages are executed and a significant
time has passed (nearly 10-15 minutes).

I agree. We are not looking for partitions to go to specific nodes and nor
do we expect a uniform distribution of keys across the cluster. There will
be a skew. But it cannot be that all the data is on one node and nothing on
the other and no, the keys are not the same. They vary from 1 to around
55000 (integers). What makes this strange is that it seems to work fine on
the spark shell (REPL).

Regards,
Raghava.


On Mon, Apr 18, 2016 at 1:14 AM, Mike Hynes <91m...@gmail.com> wrote:

> A HashPartitioner will indeed partition based on the key, but you
> cannot know on *which* node that key will appear. Again, the RDD
> partitions will not necessarily be distributed evenly across your
> nodes because of the greedy scheduling of the first wave of tasks,
> particularly if those tasks have durations less than the initial
> executor delay. I recommend you look at your logs to verify if this is
> happening to you.
>
> Mike
>
> On 4/18/16, Anuj Kumar  wrote:
> > Good point Mike +1
> >
> > On Mon, Apr 18, 2016 at 9:47 AM, Mike Hynes <91m...@gmail.com> wrote:
> >
> >> When submitting a job with spark-submit, I've observed delays (up to
> >> 1--2 seconds) for the executors to respond to the driver in order to
> >> receive tasks in the first stage. The delay does not persist once the
> >> executors have been synchronized.
> >>
> >> When the tasks are very short, as may be your case (relatively small
> >> data and a simple map task like you have described), the 8 tasks in
> >> your stage may be allocated to only 1 executor in 2 waves of 4, since
> >> the second executor won't have responded to the master before the
> >> first 4 tasks on the first executor have completed.
> >>
> >> To see if this is the cause in your particular case, you could try the
> >> following to confirm:
> >> 1. Examine the starting times of the tasks alongside their
> >> executor
> >> 2. Make a "dummy" stage execute before your real stages to
> >> synchronize the executors by creating and materializing any random RDD
> >> 3. Make the tasks longer, i.e. with some silly computational
> >> work.
> >>
> >> Mike
> >>
> >>
> >> On 4/17/16, Raghava Mutharaju  wrote:
> >> > Yes its the same data.
> >> >
> >> > 1) The number of partitions are the same (8, which is an argument to
> >> > the
> >> > HashPartitioner). In the first case, these partitions are spread
> across
> >> > both the worker nodes. In the second case, all the partitions are on
> >> > the
> >> > same node.
> >> > 2) What resources would be of interest here? Scala shell takes the
> >> default
> >> > parameters since we use "bin/spark-shell --master " to run
> >> the
> >> > scala-shell. For the scala program, we do set some configuration
> >> > options
> >> > such as driver memory (12GB), parallelism is set to 8 and we use Kryo
> >> > serializer.
> >> >
> >> > We are running this on Azure D3-v2 machines which have 4 cores and
> 14GB
> >> > RAM.1 executor runs on each worker node. Following configuration
> >> > options
> >> > are set for the scala program -- perhaps we should move it to the
> spark
> >> > config file.
> >> >
> >> > Driver memory and executor memory are set to 12GB
> >> > parallelism is set to 8
> >> > Kryo serializer is used
> >> > Number of retainedJobs and retainedStages has been increased to check
> >> them
> >> > in the UI.
> >> >
> >> > What information regarding Spark Context would be of interest here?
> >> >
> >> > Regards,
> >> > Raghava.
> >> >
> >> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar 
> >> > wrote:
> >> >
> >> >> If the data file is same then it should have similar distribution of
> >> >> keys.
> >> >> Few queries-
> >> >>
> >> >> 1. Did you compare the number of partitions in both the cases?
> >> >> 2. Did you compare the resource allocation for Spark Shell vs Scala
> >> >> Program being submitted?
> >> >>
> >> >> Also, can you please share the details of Spark Context, Environment
> >> >> and
> >> >> Executors when you run via Scala program?
> >> >>
> >> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
> >> >> m.vijayaragh...@gmail.com> wrote:
> >> >>
> >> >>> Hello All,
> >> >>>
> >> >>> We are using HashPartitioner in the following way on a 3 node
> cluster
> >> (1
> >> >>> master and 2 worker nodes).
> >> >>>
> >> >>> val u =
> >> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
> >> >>> Int)](line => { line.split("\\|") match { case Array(x, y) =>
> >> >>> (y.toInt,
> >> >>> x.toInt) } }).partitionBy(new
> >> HashPartitioner(8)).setName("u").persist()
> >> >>>
> >> >>> u.count()
> >> >>>
> >>
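
The "dummy stage" suggestion quoted above can be sketched in one line, assuming a SparkContext named sc; materializing a throwaway RDD forces every executor to register with the driver before the real stages are scheduled:

  sc.parallelize(1 to 100000, 64).map(_ * 2).count()  // throwaway work, result discarded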

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
I prefer using CentOS/SLES/Ubuntu personally.

Thanks
Deepak

On Mon, Apr 18, 2016 at 5:57 PM, My List  wrote:

> Deepak,
>
> I love the unix flavors have been a programmed on them. Just have a
> windows laptop and pc hence haven't moved to unix flavors. Was trying to
> run big data stuff on windows. Have run in so much of issues that I could
> just throw the laptop with windows out.
>
> Your view - Redhat, Ubuntu or Centos.
> Does Redhat give a one year licence on purchase etc?
>
> Thanks
>
> On Mon, Apr 18, 2016 at 5:52 PM, Deepak Sharma 
> wrote:
>
>> It works well with all flavors of Linux.
>> It all depends on your ex with these flavors.
>>
>> Thanks
>> Deepak
>>
>> On Mon, Apr 18, 2016 at 5:51 PM, My List  wrote:
>>
>>> Deepak,
>>>
>>> Would you advice that I use Ubuntu? or Redhat. Cause Windows support etc
>>> and issues are Galore on Spark.
>>> Since I am starting afresh, what would you advice?
>>>
>>> On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma 
>>> wrote:
>>>
 Binary for Spark means ts spark built against hadoop 2.6
 It will not have any hadoop executables.
 You'll have to setup hadoop separately.
 I have not used windows version yet but there are some.

 Thanks
 Deepak

 On Mon, Apr 18, 2016 at 5:43 PM, My List  wrote:

> Deepak,
>
> The following could be a very dumb questions so pardon me for the same.
> 1) When I download the binary for Spark with a version of
> Hadoop(Hadoop 2.6) does it not come in the zip or tar file?
> 2) If it does not come along,Is there a Apache Hadoop for windows, is
> it in binary format or will have to build it?
> 3) Is there a basic tutorial for Hadoop on windows for the basic needs
> of Spark.
>
> Thanks in Advance !
>
> On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma 
> wrote:
>
>> Once you download hadoop and format the namenode , you can use
>> start-dfs.sh to start hdfs.
>> Then use 'jps' to sss if datanode/namenode services are up and
>> running.
>>
>> Thanks
>> Deepak
>>
>> On Mon, Apr 18, 2016 at 5:18 PM, My List 
>> wrote:
>>
>>> Hi ,
>>>
>>> I am a newbie on Spark.I wanted to know how to start and verify if
>>> HDFS has started on Spark stand alone.
>>>
>>> Env -
>>> Windows 7 - 64 bit
>>> Spark 1.4.1 With Hadoop 2.6
>>>
>>> Using Scala Shell - spark-shell
>>>
>>>
>>> --
>>> Thanks,
>>> Harry
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Thanks,
> Harmeet
>



 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Harmeet
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Thanks,
> Harmeet
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Jacek Laskowski
Hi,

Not that I might help much with deployment to Mesos, but can you describe
your Mesos/Marathon setup? What's Mesos cluster dispatcher?

Jacek
18.04.2016 12:54 PM "Joao Azevedo"  wrote:

> Hi!
>
> I'm trying to submit Spark applications to Mesos using the 'cluster'
> deploy mode. I'm using Marathon as the container orchestration platform and
> launching the Mesos Cluster Dispatcher through it. I'm using Spark 1.6.1
> with Scala 2.11.
>
> I'm able to successfully communicate with the cluster dispatcher, but
> executors that run on a node different from the one where the driver is
> started fail to run. I believe that they're assuming the same (local)
> classpath as the driver, and thus failing to run. Here's what I see from
> the failed tasks on the executor nodes:
>
> sh -c '
> "/tmp/mesos/slaves/3b945207-1aac-497d-b2fb-9671d3d0646b-S1/frameworks/b925b271-ce42-48c9-8474-9e19304c2d20-0002/executors/driver-20160418101719-0001/runs/39272646-d305-40a6-9431-5efcacc67b80/spark-1.6.1-bin-2.4.0/bin/spark-class"
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@:35600 --executor-id
> 00706e76-f790-465d-b4cd-edd3934ddb6a-S1 --hostname 
> --cores 5 --app-id 00706e76-f790-465d-b4cd-edd3934ddb6a-0048'
> Command exited with status 127 (pid: 4118)
>
> When the executors are started in the same node where the driver is, the
> application runs ok.
>
> Here's how I'm submitting the applications:
>
> bin/spark-submit --master mesos://:7077
> --deploy-mode cluster --conf "spark.executor.uri="
> --executor-memory 5G --total-executor-cores 5 --supervise --class
> org.apache.spark.examples.SparkPi  10
>
> Does anyone have any ideas on what I might be doing wrong?
>
>
> Joao
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to start HDFS on Spark Standalone

2016-04-18 Thread My List
Deepak,

I love the Unix flavors and have programmed on them. I just have a Windows
laptop and PC, hence I haven't moved to a Unix flavor. I was trying to run big
data stuff on Windows and have run into so many issues that I could just
throw the Windows laptop out.

Your view - Redhat, Ubuntu or Centos.
Does Redhat give a one year licence on purchase etc?

Thanks

On Mon, Apr 18, 2016 at 5:52 PM, Deepak Sharma 
wrote:

> It works well with all flavors of Linux.
> It all depends on your ex with these flavors.
>
> Thanks
> Deepak
>
> On Mon, Apr 18, 2016 at 5:51 PM, My List  wrote:
>
>> Deepak,
>>
>> Would you advice that I use Ubuntu? or Redhat. Cause Windows support etc
>> and issues are Galore on Spark.
>> Since I am starting afresh, what would you advice?
>>
>> On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma 
>> wrote:
>>
>>> Binary for Spark means ts spark built against hadoop 2.6
>>> It will not have any hadoop executables.
>>> You'll have to setup hadoop separately.
>>> I have not used windows version yet but there are some.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Mon, Apr 18, 2016 at 5:43 PM, My List  wrote:
>>>
 Deepak,

 The following could be a very dumb questions so pardon me for the same.
 1) When I download the binary for Spark with a version of Hadoop(Hadoop
 2.6) does it not come in the zip or tar file?
 2) If it does not come along,Is there a Apache Hadoop for windows, is
 it in binary format or will have to build it?
 3) Is there a basic tutorial for Hadoop on windows for the basic needs
 of Spark.

 Thanks in Advance !

 On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma 
 wrote:

> Once you download hadoop and format the namenode , you can use
> start-dfs.sh to start hdfs.
> Then use 'jps' to sss if datanode/namenode services are up and running.
>
> Thanks
> Deepak
>
> On Mon, Apr 18, 2016 at 5:18 PM, My List  wrote:
>
>> Hi ,
>>
>> I am a newbie on Spark.I wanted to know how to start and verify if
>> HDFS has started on Spark stand alone.
>>
>> Env -
>> Windows 7 - 64 bit
>> Spark 1.4.1 With Hadoop 2.6
>>
>> Using Scala Shell - spark-shell
>>
>>
>> --
>> Thanks,
>> Harry
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



 --
 Thanks,
 Harmeet

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Thanks,
>> Harmeet
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



-- 
Thanks,
Harmeet


Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
It works well with all flavors of Linux.
It all depends on your ex with these flavors.

Thanks
Deepak

On Mon, Apr 18, 2016 at 5:51 PM, My List  wrote:

> Deepak,
>
> Would you advice that I use Ubuntu? or Redhat. Cause Windows support etc
> and issues are Galore on Spark.
> Since I am starting afresh, what would you advice?
>
> On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma 
> wrote:
>
>> Binary for Spark means ts spark built against hadoop 2.6
>> It will not have any hadoop executables.
>> You'll have to setup hadoop separately.
>> I have not used windows version yet but there are some.
>>
>> Thanks
>> Deepak
>>
>> On Mon, Apr 18, 2016 at 5:43 PM, My List  wrote:
>>
>>> Deepak,
>>>
>>> The following could be a very dumb questions so pardon me for the same.
>>> 1) When I download the binary for Spark with a version of Hadoop(Hadoop
>>> 2.6) does it not come in the zip or tar file?
>>> 2) If it does not come along,Is there a Apache Hadoop for windows, is it
>>> in binary format or will have to build it?
>>> 3) Is there a basic tutorial for Hadoop on windows for the basic needs
>>> of Spark.
>>>
>>> Thanks in Advance !
>>>
>>> On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma 
>>> wrote:
>>>
 Once you download hadoop and format the namenode , you can use
 start-dfs.sh to start hdfs.
 Then use 'jps' to sss if datanode/namenode services are up and running.

 Thanks
 Deepak

 On Mon, Apr 18, 2016 at 5:18 PM, My List  wrote:

> Hi ,
>
> I am a newbie on Spark.I wanted to know how to start and verify if
> HDFS has started on Spark stand alone.
>
> Env -
> Windows 7 - 64 bit
> Spark 1.4.1 With Hadoop 2.6
>
> Using Scala Shell - spark-shell
>
>
> --
> Thanks,
> Harry
>



 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Harmeet
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Thanks,
> Harmeet
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: How to start HDFS on Spark Standalone

2016-04-18 Thread My List
Deepak,

Would you advice that I use Ubuntu? or Redhat. Cause Windows support etc
and issues are Galore on Spark.
Since I am starting afresh, what would you advice?

On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma 
wrote:

> Binary for Spark means ts spark built against hadoop 2.6
> It will not have any hadoop executables.
> You'll have to setup hadoop separately.
> I have not used windows version yet but there are some.
>
> Thanks
> Deepak
>
> On Mon, Apr 18, 2016 at 5:43 PM, My List  wrote:
>
>> Deepak,
>>
>> The following could be a very dumb questions so pardon me for the same.
>> 1) When I download the binary for Spark with a version of Hadoop(Hadoop
>> 2.6) does it not come in the zip or tar file?
>> 2) If it does not come along,Is there a Apache Hadoop for windows, is it
>> in binary format or will have to build it?
>> 3) Is there a basic tutorial for Hadoop on windows for the basic needs of
>> Spark.
>>
>> Thanks in Advance !
>>
>> On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma 
>> wrote:
>>
>>> Once you download hadoop and format the namenode , you can use
>>> start-dfs.sh to start hdfs.
>>> Then use 'jps' to sss if datanode/namenode services are up and running.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Mon, Apr 18, 2016 at 5:18 PM, My List  wrote:
>>>
 Hi ,

 I am a newbie on Spark.I wanted to know how to start and verify if HDFS
 has started on Spark stand alone.

 Env -
 Windows 7 - 64 bit
 Spark 1.4.1 With Hadoop 2.6

 Using Scala Shell - spark-shell


 --
 Thanks,
 Harry

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Thanks,
>> Harmeet
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



-- 
Thanks,
Harmeet


Standard deviation on multiple columns

2016-04-18 Thread Naveen Kumar Pokala
Hi,

I am using spark 1.6.0

I want to find standard deviation of columns that will come dynamically.

  val stdDevOnAll = columnNames.map { x => stddev(x) }

causalDf.groupBy(causalDf("A"),causalDf("B"),causalDf("C"))
.agg(stdDevOnAll:_*) //error line


I am trying to do as above.

But it is giving me compilation error as below.

overloaded method value agg with alternatives:
  (expr: org.apache.spark.sql.Column, exprs: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
  (exprs: java.util.Map[String,String])org.apache.spark.sql.DataFrame
  (exprs: scala.collection.immutable.Map[String,String])org.apache.spark.sql.DataFrame
  (aggExpr: (String, String), aggExprs: (String, String)*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.Column)


Naveen
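
For reference, the (Column, Column*) overload of agg can be satisfied by splitting the dynamic list into a head and a tail; a sketch reusing the names above:

  import org.apache.spark.sql.functions.stddev

  val stdDevOnAll = columnNames.map(x => stddev(x))
  causalDf.groupBy(causalDf("A"), causalDf("B"), causalDf("C"))
    .agg(stdDevOnAll.head, stdDevOnAll.tail: _*)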






Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
Binary for Spark means it's Spark built against Hadoop 2.6.
It will not have any Hadoop executables.
You'll have to set up Hadoop separately.
I have not used the Windows version yet, but there are some.

Thanks
Deepak

On Mon, Apr 18, 2016 at 5:43 PM, My List  wrote:

> Deepak,
>
> The following could be a very dumb questions so pardon me for the same.
> 1) When I download the binary for Spark with a version of Hadoop(Hadoop
> 2.6) does it not come in the zip or tar file?
> 2) If it does not come along,Is there a Apache Hadoop for windows, is it
> in binary format or will have to build it?
> 3) Is there a basic tutorial for Hadoop on windows for the basic needs of
> Spark.
>
> Thanks in Advance !
>
> On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma 
> wrote:
>
>> Once you download hadoop and format the namenode , you can use
>> start-dfs.sh to start hdfs.
>> Then use 'jps' to sss if datanode/namenode services are up and running.
>>
>> Thanks
>> Deepak
>>
>> On Mon, Apr 18, 2016 at 5:18 PM, My List  wrote:
>>
>>> Hi ,
>>>
>>> I am a newbie on Spark.I wanted to know how to start and verify if HDFS
>>> has started on Spark stand alone.
>>>
>>> Env -
>>> Windows 7 - 64 bit
>>> Spark 1.4.1 With Hadoop 2.6
>>>
>>> Using Scala Shell - spark-shell
>>>
>>>
>>> --
>>> Thanks,
>>> Harry
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Thanks,
> Harmeet
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: How to start HDFS on Spark Standalone

2016-04-18 Thread My List
Deepak,

The following could be very dumb questions, so pardon me for the same.
1) When I download the binary for Spark with a version of Hadoop (Hadoop
2.6), does it not come in the zip or tar file?
2) If it does not come along, is there an Apache Hadoop for Windows? Is it in
binary format or will I have to build it?
3) Is there a basic tutorial for Hadoop on Windows for the basic needs of
Spark?

Thanks in Advance !

On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma 
wrote:

> Once you download hadoop and format the namenode , you can use
> start-dfs.sh to start hdfs.
> Then use 'jps' to sss if datanode/namenode services are up and running.
>
> Thanks
> Deepak
>
> On Mon, Apr 18, 2016 at 5:18 PM, My List  wrote:
>
>> Hi ,
>>
>> I am a newbie on Spark.I wanted to know how to start and verify if HDFS
>> has started on Spark stand alone.
>>
>> Env -
>> Windows 7 - 64 bit
>> Spark 1.4.1 With Hadoop 2.6
>>
>> Using Scala Shell - spark-shell
>>
>>
>> --
>> Thanks,
>> Harry
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



-- 
Thanks,
Harmeet


Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
Once you download Hadoop and format the namenode, you can use start-dfs.sh
to start HDFS.
Then use 'jps' to see if the datanode/namenode services are up and running.

Thanks
Deepak

On Mon, Apr 18, 2016 at 5:18 PM, My List  wrote:

> Hi ,
>
> I am a newbie on Spark.I wanted to know how to start and verify if HDFS
> has started on Spark stand alone.
>
> Env -
> Windows 7 - 64 bit
> Spark 1.4.1 With Hadoop 2.6
>
> Using Scala Shell - spark-shell
>
>
> --
> Thanks,
> Harry
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


How to start HDFS on Spark Standalone

2016-04-18 Thread My List
Hi ,

I am a newbie on Spark.I wanted to know how to start and verify if HDFS has
started on Spark stand alone.

Env -
Windows 7 - 64 bit
Spark 1.4.1 With Hadoop 2.6

Using Scala Shell - spark-shell


-- 
Thanks,
Harry


Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Joao Azevedo

Hi!

I'm trying to submit Spark applications to Mesos using the 'cluster' 
deploy mode. I'm using Marathon as the container orchestration platform 
and launching the Mesos Cluster Dispatcher through it. I'm using Spark 
1.6.1 with Scala 2.11.


I'm able to successfully communicate with the cluster dispatcher, but 
executors that run on a node different from the one where the driver is 
started fail to run. I believe that they're assuming the same (local) 
classpath as the driver, and thus failing to run. Here's what I see from 
the failed tasks on the executor nodes:


sh -c ' 
"/tmp/mesos/slaves/3b945207-1aac-497d-b2fb-9671d3d0646b-S1/frameworks/b925b271-ce42-48c9-8474-9e19304c2d20-0002/executors/driver-20160418101719-0001/runs/39272646-d305-40a6-9431-5efcacc67b80/spark-1.6.1-bin-2.4.0/bin/spark-class" 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@:35600 --executor-id 
00706e76-f790-465d-b4cd-edd3934ddb6a-S1 --hostname  
--cores 5 --app-id 00706e76-f790-465d-b4cd-edd3934ddb6a-0048'

Command exited with status 127 (pid: 4118)

When the executors are started in the same node where the driver is, the 
application runs ok.


Here's how I'm submitting the applications:

bin/spark-submit --master 
mesos://:7077 --deploy-mode cluster 
--conf "spark.executor.uri=" --executor-memory 5G 
--total-executor-cores 5 --supervise --class 
org.apache.spark.examples.SparkPi  10


Does anyone have any ideas on what I might be doing wrong?


Joao

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Read Parquet in Java Spark

2016-04-18 Thread Ramkumar V
HI,

Any idea on this ?

*Thanks*,



On Mon, Apr 4, 2016 at 2:47 PM, Akhil Das 
wrote:

> I wasn't knowing you have a parquet file containing json data.
>
> Thanks
> Best Regards
>
> On Mon, Apr 4, 2016 at 2:44 PM, Ramkumar V 
> wrote:
>
>> Hi Akhil,
>>
>> Thanks for your help. Why do you use "," as the separator?
>>
>> I have a parquet file that contains only JSON on each line.
>>
>> *Thanks*,
>> 
>>
>>
>> On Mon, Apr 4, 2016 at 2:34 PM, Akhil Das 
>> wrote:
>>
>>> Something like this (in scala):
>>>
>>> val rdd = parquetFile.javaRDD().map(row => row.mkString(","))
>>>
>>> You can create a map operation over your JavaRDD to convert each
>>> org.apache.spark.sql.Row
>>> to a String (the Row.mkString() operation).
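>>>
>>> In Java the same idea looks roughly like this -- an untested sketch against
>>> the Spark 1.x Java API, with the path and variable names as placeholders;
>>> the map step is what turns the JavaRDD<Row> into a JavaRDD<String>:
>>>
>>> import org.apache.spark.api.java.JavaRDD;
>>> import org.apache.spark.api.java.function.Function;
>>> import org.apache.spark.sql.DataFrame;
>>> import org.apache.spark.sql.Row;
>>>
>>> DataFrame parquetFile = sqlContext.read().parquet("hdfs:///path/to/logs.parquet");
>>>
>>> // javaRDD() returns a JavaRDD<Row>; map each Row to its String form
>>> JavaRDD<String> jsonLines = parquetFile.javaRDD().map(new Function<Row, String>() {
>>>   public String call(Row row) {
>>>     // with a single JSON column per row, mkString simply returns that JSON string
>>>     return row.mkString(",");
>>>   }
>>> });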
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Mon, Apr 4, 2016 at 12:02 PM, Ramkumar V 
>>> wrote:
>>>
 Any idea on this ? How to convert parquet file into JavaRDD ?

 *Thanks*,
 


 On Thu, Mar 31, 2016 at 4:30 PM, Ramkumar V 
 wrote:

> Hi,
>
> Thanks for the reply. I tried this. It's returning a JavaRDD<Row>
> instead of a JavaRDD<String>. How do I get a JavaRDD<String>?
>
> Error:
> incompatible types:
> org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> cannot be
> converted to org.apache.spark.api.java.JavaRDD<java.lang.String>
>
>
>
>
>
> *Thanks*,
> 
>
>
> On Thu, Mar 31, 2016 at 2:57 PM, UMESH CHAUDHARY 
> wrote:
>
>> From Spark Documentation:
>>
>> DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
>>
>> JavaRDD<Row> jRDD = parquetFile.javaRDD();
>>
>> The javaRDD() method converts the DataFrame into a JavaRDD<Row>.
>>
>> On Thu, Mar 31, 2016 at 2:51 PM, Ramkumar V 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to read parquet log files in Java Spark. The parquet log
>>> files are stored in HDFS. I want to read that parquet file and convert it
>>> into a JavaRDD. I was able to find the SQLContext DataFrame API. How can I
>>> read it if all I have is a SparkContext and RDDs? What is the best way to
>>> read it?
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>
>

>>>
>>
>


How to handle ActiveMQ message produce in case of spark speculation

2016-04-18 Thread aneesh.mohan
I understand that duplicate results due to speculation are handled on the
driver side. But what if one of my Spark tasks (in a streaming scenario)
involves sending a message to ActiveMQ? The driver has no control over that;
once the message is sent, it's done. How is this handled?
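
For context: as far as I know, Spark only guarantees that the result of one
successful task attempt is used; side effects performed inside a task, such
as producing to ActiveMQ, can run more than once, with speculation as well as
with ordinary task retries. A hedged sketch of the usual mitigations,
assuming nothing beyond the standard spark.speculation property (the
idempotency key is something you would have to add yourself):

spark.speculation    false    # simplest option: no speculative attempts for jobs with external side effects

Otherwise, attach a deterministic key (for example batch time plus partition
id) to every message and deduplicate on the consumer side, so replays become
harmless.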



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-handle-ActiveMQ-message-produce-in-case-of-spark-speculation-tp26795.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



drools on spark, how to reload rule file?

2016-04-18 Thread yaoxiaohua
Hi bros,

I am trying to use Drools on Spark to parse logs, do some rule matching, and
derive some fields.

I am following this blog post on Cloudera:

http://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/

Now I want to know whether it is possible to reload the rules on the fly.

Thanks in advance.

 

Best Regards,

Evan



Re: Apache Flink

2016-04-18 Thread Jörn Franke

What is your exact set of requirements for algo trading? Is it reacting in 
real time, or analysis over a longer time horizon? In the first case, I do not 
think a framework such as Spark or Flink makes sense. They are generic, but to 
compete with the usually custom-developed, highly specialized engines written 
in low-level languages, you need something else.

> On 18 Apr 2016, at 09:19, Mich Talebzadeh  wrote:
> 
> Please forgive me for going off on a tangent.
> 
> Well, there may be many SQL engines, but only 5-6 are in the league. Oracle has 
> Oracle and MySQL plus TimesTen from various acquisitions. SAP has Hana, SAP 
> ASE, SAP IQ and a few others, again from acquiring Sybase. So very few big 
> players.
> 
> Cars: Fiat owns many groups, including Ferrari and Maserati. The beloved 
> Jaguar belongs to Tata Motors, and Rolls-Royce belongs to BMW and actually 
> uses BMW engines. VW has many companies, from Seat to Skoda.
> 
> However, that is what happens when markets get too fragmented: consolidation 
> follows. The big data world is quite young, but I gather it will go the same 
> way most markets do, towards consolidation.
> 
> Anyway, my original point was about finding a tool that allows me to do CEP on 
> algo trading using Kafka plus something else. As of now there is really none. 
> I am still exploring whether Flink can do the job.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 18 April 2016 at 07:46, Sean Owen  wrote:
>> Interesting tangent. I think there will never be a time when an
>> interesting area is covered only by one project, or product. Why are
>> there 30 SQL engines? or 50 car companies? it's a feature not a bug.
>> To the extent they provide different tradeoffs or functionality,
>> they're not entirely duplicative; to the extent they compete directly,
>> it's a win for the user.
>> 
>> As others have said, Flink (née Stratosphere) started quite a while
>> ago. But you can draw lines of influence back earlier than Spark. I
>> presume MS Dryad is the forerunner of all these.
>> 
>> And in case you wanted a third option, Google's DataFlow (now Apache
>> Beam) is really a reinvention of FlumeJava (nothing to do with Apache
>> Flume) from Google, in a way that Crunch was a port and minor update
>> of FlumeJava earlier. And it claims to run on Flink/Spark if you want.
>> 
>> https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
>> 
>> 
>> On Sun, Apr 17, 2016 at 10:25 PM, Mich Talebzadeh
>>  wrote:
>> > Also it always amazes me why they are so many tangential projects in Big
>> > Data space? Would not it be easier if efforts were spent on adding to Spark
>> > functionality rather than creating a new product like Flink?
> 


Re: Apache Flink

2016-04-18 Thread Mich Talebzadeh
Thanks Todd I will have a look.

Regards

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 18 April 2016 at 01:58, Todd Nist  wrote:

> So there is an offering from Stratio, https://github.com/Stratio/Decision
>
> Decision CEP engine is a Complex Event Processing platform built on Spark
>> Streaming.
>>
>
>
>> It is the result of combining the power of Spark Streaming as a
>> continuous computing framework and Siddhi CEP engine as complex event
>> processing engine.
>
>
> https://stratio.atlassian.net/wiki/display/DECISION0x9/Home
>
> I have not used it, only read about it but it may be of some interest to
> you.
>
> -Todd
>
> On Sun, Apr 17, 2016 at 5:49 PM, Peyman Mohajerian 
> wrote:
>
>> Microbatching is certainly not a waste of time; you are making way too
>> strong a statement. In fact, in certain cases one tuple at a time makes no
>> sense; it all depends on the use case. If you know the history of the Storm
>> project, you would know that microbatching was added to Storm later, as
>> Trident, specifically for microbatching/windowing.
>> In certain cases you are doing aggregation/windowing, throughput is the
>> dominant design consideration, and you don't care about each individual
>> event/tuple. For example, if you push different event types to separate
>> Kafka topics and all you care about is a count, what is the need for
>> single-event processing?
>>
>> On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet  wrote:
>>
>>> I have not been intrigued at all by the microbatching concept in Spark.
>>> I am used to CEP in real stream processing environments like InfoSphere
>>> Streams & Storm, where the granularity of processing is at the level of each
>>> individual tuple, and processing units (workers) can react immediately to
>>> events being received and processed. The closest Spark Streaming comes to
>>> this concept is the notion of "state" that can be updated via the
>>> "updateStateByKey()" function, which can only run within a
>>> microbatch. Looking at the expected design changes to Spark Streaming in
>>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>>> the radar for Spark, though I have seen articles stating that more effort
>>> is going to go into the Spark SQL layer in Spark Streaming, which may make
>>> it more reminiscent of Esper.
>>>
>>> For these reasons, I have not even tried to implement CEP in Spark. I
>>> feel it's a waste of time without immediate tuple-at-a-time processing.
>>> Without this, they avoid the whole problem of "back pressure" (though keep
>>> in mind, it is still very possible to overload the Spark streaming layer
>>> with stages that will continue to pile up and never get worked off) but
>>> they lose the granular control that you get in CEP environments by allowing
>>> the rules & processors to react with the receipt of each tuple, right away.
>>>
>>> A while back, I did attempt to implement an InfoSphere Streams-like API
>>> [1] on top of Apache Storm as an example of what such a design may look
>>> like. It looks like Storm is going to be replaced in the not so distant
>>> future by Twitter's new design called Heron. IIRC, Heron does not have an
>>> open source implementation as of yet.
>>>
>>> [1] https://github.com/calrissian/flowmix
>>>
>>> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Corey,

 Can you please point me to docs on using Spark for CEP? Do we have a
 set of CEP libraries somewhere? I am keen on getting hold of adapter
 libraries for Spark, something like the below.



 ​
 Thanks


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 17 April 2016 at 16:07, Corey Nolet  wrote:

> One thing I've noticed about Flink in my following of the project has
> been that it has established, in a few cases, some novel ideas and
> improvements over Spark. The problem with it, however, is that both the
> development team and the community around it are very small and many of
> those novel improvements have been rolled directly into Spark in 
> subsequent
> versions. I was considering changing over my architecture to Flink at one
> point to get better, more real-time CEP streaming support, but in the end 
> I
> decided to stick with Spark and just watch Flink continue to pressure it
> into improvement.
>
> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
> wrote:
>
>> i never found much info that flink w

R: How many disks for spark_local_dirs?

2016-04-18 Thread Luca Guerra
Hi Jan,
It's a physical server, I have launched the application with:
- "spark.cores.max": "12",
- "spark.executor.cores": "3"
- 2 GB RAM per worker
Spark version is 1.6.0, I don't use Hadoop.

Thanks,
Luca

-Messaggio originale-
Da: Jan Rock [mailto:r...@do-hadoop.com] 
Inviato: venerdì 15 aprile 2016 18:04
A: Luca Guerra 
Cc: user@spark.apache.org
Oggetto: Re: How many disks for spark_local_dirs?

Hi, 

is it physical server or AWS/Azure? What are the executed parameters for 
spark-shell command? Hadoop distro/version and Spark version?

Kind Regards,
Jan


> On 15 Apr 2016, at 16:15, luca_guerra  wrote:
> 
> Hi,
> I'm looking for a solution to improve my Spark cluster performances, I 
> have read from http://spark.apache.org/docs/latest/hardware-provisioning.html:
> "We recommend having 4-8 disks per node", I have tried both with one 
> and two disks but I have seen that with 2 disks the execution time is 
> doubled. Any explanations about this?
> 
> This is my configuration:
> 1 machine with 140 GB RAM 2 disks and 32 CPU (I know that is an 
> unusual
> configuration) and on this I have a standalone Spark cluster with 1 Worker.
> 
> Thank you very much for the help.
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-many-disks-for
> -spark-local-dirs-tp26790.html Sent from the Apache Spark User List 
> mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How many disks for spark_local_dirs?

2016-04-18 Thread Mich Talebzadeh
.. to force spills to disk ...

That is pretty smart, as at some point you are inevitably going to run out
of memory.
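
For reference, those spills land in the directories configured via
spark.local.dir (or SPARK_LOCAL_DIRS in spark-env.sh). A hedged sketch of
spreading the scratch space over several disks; the paths are placeholders:

spark.local.dir    /data1/spark-tmp,/data2/spark-tmp

Spark spreads shuffle and spill files across the comma-separated directories,
so a second path only helps when it really sits on a separate physical disk.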

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 18 April 2016 at 08:17, Luca Guerra  wrote:

> Hi Mich,
>
> I have only 32 cores, I have tested with 2 GB of memory per worker to
> force spills to disk. My application had 12 cores and 3 cores per executor.
>
>
>
> Thank you very much.
>
> Luca
>
>
>
> *Da:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Inviato:* venerdì 15 aprile 2016 18:56
> *A:* Luca Guerra 
> *Cc:* user @spark 
> *Oggetto:* Re: How many disks for spark_local_dirs?
>
>
>
> Is that 32 CPUs or 32 cores?
>
>
>
> So in this configuration, assuming 32 cores, you have one worker with how much
> memory (deducting memory for the OS etc.) and 32 cores?
>
>
>
> What is the ratio  of memory per core in this case?
>
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 15 April 2016 at 16:15, luca_guerra  wrote:
>
> Hi,
> I'm looking for a solution to improve my Spark cluster performances, I have
> read from http://spark.apache.org/docs/latest/hardware-provisioning.html:
> "We recommend having 4-8 disks per node", I have tried both with one and
> two
> disks but I have seen that with 2 disks the execution time is doubled. Any
> explanations about this?
>
> This is my configuration:
> 1 machine with 140 GB RAM 2 disks and 32 CPU (I know that is an unusual
> configuration) and on this I have a standalone Spark cluster with 1 Worker.
>
> Thank you very much for the help.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-many-disks-for-spark-local-dirs-tp26790.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


Re: Apache Flink

2016-04-18 Thread Mich Talebzadeh
Please forgive me for going off on a tangent.

Well, there may be many SQL engines, but only 5-6 are in the league. Oracle
has Oracle and MySQL plus TimesTen from various acquisitions. SAP has Hana,
SAP ASE, SAP IQ and a few others, again from acquiring Sybase. So very few
big players.

Cars: Fiat owns many groups, including Ferrari and Maserati. The beloved
Jaguar belongs to Tata Motors, and Rolls-Royce belongs to BMW and actually
uses BMW engines. VW has many companies, from Seat to Skoda.

However, that is what happens when markets get too fragmented: consolidation
follows. The big data world is quite young, but I gather it will go the same
way most markets do, towards consolidation.

Anyway, my original point was about finding a tool that allows me to do CEP
on algo trading using Kafka plus something else. As of now there is really
none. I am still exploring whether Flink can do the job.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 18 April 2016 at 07:46, Sean Owen  wrote:

> Interesting tangent. I think there will never be a time when an
> interesting area is covered only by one project, or product. Why are
> there 30 SQL engines? or 50 car companies? it's a feature not a bug.
> To the extent they provide different tradeoffs or functionality,
> they're not entirely duplicative; to the extent they compete directly,
> it's a win for the user.
>
> As others have said, Flink (née Stratosphere) started quite a while
> ago. But you can draw lines of influence back earlier than Spark. I
> presume MS Dryad is the forerunner of all these.
>
> And in case you wanted a third option, Google's DataFlow (now Apache
> Beam) is really a reinvention of FlumeJava (nothing to do with Apache
> Flume) from Google, in a way that Crunch was a port and minor update
> of FlumeJava earlier. And it claims to run on Flink/Spark if you want.
>
> https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
>
>
> On Sun, Apr 17, 2016 at 10:25 PM, Mich Talebzadeh
>  wrote:
> > Also it always amazes me why they are so many tangential projects in Big
> > Data space? Would not it be easier if efforts were spent on adding to
> Spark
> > functionality rather than creating a new product like Flink?
>


R: How many disks for spark_local_dirs?

2016-04-18 Thread Luca Guerra
Hi Mich,
I have only 32 cores. I have tested with 2 GB of memory per worker to force
spills to disk. My application used 12 cores in total, with 3 cores per executor.

Thank you very much.
Luca

Da: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Inviato: venerdì 15 aprile 2016 18:56
A: Luca Guerra 
Cc: user @spark 
Oggetto: Re: How many disks for spark_local_dirs?

Is that 32 CPUs or 32 cores?

So in this configuration, assuming 32 cores, you have one worker with how much
memory (deducting memory for the OS etc.) and 32 cores?

What is the ratio  of memory per core in this case?

HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 15 April 2016 at 16:15, luca_guerra 
mailto:lgue...@bitbang.com>> wrote:
Hi,
I'm looking for a solution to improve my Spark cluster performances, I have
read from http://spark.apache.org/docs/latest/hardware-provisioning.html:
"We recommend having 4-8 disks per node", I have tried both with one and two
disks but I have seen that with 2 disks the execution time is doubled. Any
explanations about this?

This is my configuration:
1 machine with 140 GB RAM 2 disks and 32 CPU (I know that is an unusual
configuration) and on this I have a standalone Spark cluster with 1 Worker.

Thank you very much for the help.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-many-disks-for-spark-local-dirs-tp26790.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



Re: Agenda Announced for Spark Summit 2016 in San Francisco

2016-04-18 Thread Tim To
Anyone know of a discount promo code for Spark Summit that still works?

Thanks,
Tim

On Wednesday, April 6, 2016 9:44 AM, Scott walent wrote:

 Spark Summit 2016 (www.spark-summit.org/2016) will be held from June 6-8 at 
the Union Square Hilton in San Francisco, and the recently released agenda 
features a stellar lineup of community talks led by top engineers, architects, 
data scientists, researchers, entrepreneurs and analysts from UC Berkeley, 
Duke, Microsoft, Netflix, Oracle, Bloomberg, Viacom, Airbnb, Uber, 
CareerBuilder and, of course, Databricks. There’s also a full day of hands-on 
Spark training, with courses for both beginners and advanced users.

As the excitement around Spark continues to grow, and the rapid adoption rate 
shows no signs of slowing down, Spark Summit is growing, too. More than 2,500 
participants are expected at the San Francisco conference, making it the 
largest event yet.

Join us in June to learn more about data engineering and data science at scale, 
spend time with other members of the Spark community, attend community meetups, 
revel in social activities associated with the Summit, and enjoy the beautiful 
city by the bay.

Developer Day: (June 7)
Aimed at a highly technical audience, this day will focus on topics about Spark 
dealing with memory management, performance, optimization, scale, and 
integration with the ecosystem, including dedicated tracks and sessions 
covering:
- Keynotes focusing on what’s new with Spark, where Spark is heading, and 
technical trends within Big Data
- Five technical tracks, including Developer, Data Science, Spark Ecosystem, 
Use Cases & Experiences, and Research
- Office hours from the Spark project leads at the Expo Hall Theater

Enterprise Day: (June 8)
For anyone interested in understanding how Spark is used in the enterprise, 
this day will include:
- Keynotes from leading vendors contributing to Spark and enterprise use cases
- Full day-long track of enterprise talks featuring use cases and a vendor panel
- Four technical tracks for continued learning from Developer Day

With more than 90 sessions, you’ll be able to pick and choose the topics that 
best suit your interests and expertise.

Get Tickets Online
Registration (www.spark-summit.org/2016/register/) is open now, and you can 
save $200 when you buy tickets before April 8th. 

We hope to see you at Spark Summit 2016 in San Francisco. Follow @spark_summit 
and #SparkSummit for updates.

-Spark Summit Organizers