Re: How to enable core dump in spark

2016-06-02 Thread prateek arora
Please help me to solve my problem. Regards, Prateek -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-core-dump-in-spark-tp27065p27081.html

KafkaLog4jAppender in spark executor and driver does not work well

2016-06-02 Thread jian he
Hi, see http://stackoverflow.com/questions/32843186/custom-log4j-appender-in-spark-executor; I have the same problem in Spark 1.6.1. And in the Spark driver I also see this issue: log4j:ERROR Could not instantiate class [kafka.producer.KafkaLog4jAppender]. java.lang.ClassNotFoundException:

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Chris Fregly
I recently powered through this Spark + Elasticsearch integration as well. You can see this + many other Spark integrations with the PANCAKE STACK here: https://github.com/fluxcapacitor/pipeline. All configs found here:

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Fair enough. However, if you take a look at the deployment guide ( http://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies) you will see that the generally advised approach is to package your app dependencies into a fat JAR and submit (possibly

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Yeah.. thanks Nick. I figured that out since your last email... I deleted the 2.10 jar by accident but then put 2+2 together. Got it working now. Still sticking to my story that it's somewhat complicated to set up :) Kevin On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Which Scala version is Spark built against? I'd guess it's 2.10 since you're using spark-1.6, and you're using the 2.11 jar for es-hadoop. On Thu, 2 Jun 2016 at 15:50 Kevin Burton wrote: > Thanks. > > I'm trying to run it in a standalone cluster with an existing / large 100

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Thanks. I'm trying to run it in a standalone cluster with an existing, large 100-node ES install. I'm using the standard 1.6.1 (Hadoop 2.6) distribution with elasticsearch-hadoop-2.3.2... I *think* I'm only supposed to use the elasticsearch-spark_2.11-2.3.2.jar with it... but now I get the following

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Hey there. When I used es-hadoop, I just pulled the dependency into my pom.xml, with Spark as a "provided" dependency, and built a fat jar with assembly. Then with spark-submit, use the --jars option to include your assembly jar (IIRC I sometimes also needed to use --driver-classpath too, but
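A minimal build sketch of that approach, assuming sbt with the sbt-assembly plugin rather than Maven; the artifact names and versions are illustrative and should match the Scala version your Spark build uses:

    // build.sbt (sketch): Spark is "provided" so it is not bundled; es-hadoop is
    // bundled into the assembly jar. Requires sbt-assembly in project/plugins.sbt.
    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided",
      // pick the artifact matching Spark's Scala version (_2.10 here, _2.11 otherwise)
      "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.3.2"
    )

The jar produced by sbt assembly can then be passed to spark-submit via --jars, or used directly as the application jar.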

Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
I'm trying to get Spark 1.6.1 to work with Elasticsearch 2.3.2... needless to say, it's not super easy. I wish there were an easier way to get this stuff to work.. Last time I tried to use Spark more, I was having similar problems with classpath setup and Cassandra. Seems like a huge opportunity to make this easier

Re: how to increase threads per executor

2016-06-02 Thread Mich Talebzadeh
Interesting. A VM with one core! One simple test: can you try running with --executor-cores=1 and see if it works OK, please? Dr Mich Talebzadeh

Re: how to increase threads per executor

2016-06-02 Thread Andres M Jimenez T
Mich, thanks for your time. I am launching spark-submit as follows: bin/spark-submit --class com.example.SparkStreamingImpl --master spark://dev1.dev:7077 --verbose --driver-memory 1g --executor-memory 1g --conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote

Re: MLLIB, Random Forest and user defined loss function?

2016-06-02 Thread Yan Burdett
I have a similar question about distance function for KMeans. I believe only Euclidean distance is currently supported. On Thursday, June 2, 2016, xweb wrote: > Does MLLIB allow user to specify own loss functions? > Specially need it for Random forests. > > > > -- > View

MLLIB, Random Forest and user defined loss function?

2016-06-02 Thread xweb
Does MLlib allow the user to specify their own loss functions? I especially need it for Random Forests. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-Random-Forest-and-user-defined-loss-function-tp27080.html

Re: Stream reading from database using spark streaming

2016-06-02 Thread Mich Talebzadeh
OK, that is fine. So the source is an IMDB (in-memory database), something like Oracle TimesTen that I have worked with before. The second source is some organised data (I assume you mean structured tabular data). 1. Data is read from source one, the IMDB. The assumption is that within the batch interval that data

Re: ImportError: No module named numpy

2016-06-02 Thread Eike von Seggern
Hi, are you using Spark on one machine or many? If on many, are you sure numpy is correctly installed on all machines? To check that the environment is set up correctly, you can try something like: import os; pythonpaths = sc.range(10).map(lambda i: os.environ.get("PYTHONPATH")).collect()

Re: Seeking advice on realtime querying over JDBC

2016-06-02 Thread Mich Talebzadeh
What is the source of your data? Is that an RDBMS database plus the topics streamed via Kafka from other sources? Dr Mich Talebzadeh

Re: Seeking advice on realtime querying over JDBC

2016-06-02 Thread Cody Koeninger
Why are you wanting to expose Spark over JDBC, as opposed to just inserting the records from Kafka into a JDBC-compatible data store? On Thu, Jun 2, 2016 at 12:47 PM, Sunita Arvind wrote: > Hi Experts, > > We are trying to get a kafka stream ingested in Spark and expose the

Seeking advice on realtime querying over JDBC

2016-06-02 Thread Sunita Arvind
Hi Experts, We are trying to get a Kafka stream ingested in Spark and expose the registered table over JDBC for querying. Here are some questions: 1. Spark Streaming supports a single context per application, right? If I have multiple customers and would like to create a Kafka topic for each of them
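One rough sketch of exposing such a stream over JDBC (not necessarily the approach settled on in the thread; sc, the stream variable, and the table name are placeholders) is to re-register each micro-batch as a temp table and run the Thrift server inside the same application:

    // Sketch: serve streaming results over JDBC from the same application (Spark 1.6 APIs).
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
    import org.apache.spark.streaming.dstream.DStream

    val hiveContext = new HiveContext(sc)           // sc: existing SparkContext
    import hiveContext.implicits._

    HiveThriftServer2.startWithContext(hiveContext) // JDBC/beeline clients connect to this server

    val stream: DStream[(String, String)] = ???     // stand-in for your Kafka-backed DStream
    stream.foreachRDD { rdd =>
      // overwrite the temp table with the latest batch so queries see fresh data
      rdd.toDF("key", "value").registerTempTable("events")
    }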

Re: Stream reading from database using spark streaming

2016-06-02 Thread Mich Talebzadeh
I don't understand this. How are you going to read from the RDBMS database? Through JDBC? How often are you going to sample the transactional tables? You may find that a JDBC connection will take longer than your sliding window length. Is this for real-time analytics? Thanks Dr Mich Talebzadeh

Re: Stream reading from database using spark streaming

2016-06-02 Thread Ted Yu
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/ https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD FYI On Thu, Jun 2, 2016 at 6:26 AM, Zakaria Hili wrote: > I want to use spark streaming to
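For reference, a minimal JdbcRDD sketch in Scala (the table, column, and connection details are placeholders, not from the thread; the JDBC driver jar must be on the classpath):

    // Sketch: read a slice of an RDBMS table into an RDD via JdbcRDD.
    // The SQL must contain two '?' placeholders for the partition bounds.
    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "pass"),
      "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
      1L, 1000000L,   // lower/upper bound on the numeric key
      10,             // number of partitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
    )

For a streaming job, the same query can be re-issued on each batch interval, bearing in mind the JDBC round-trip cost raised earlier in the thread.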

Re: how to increase threads per executor

2016-06-02 Thread Mich Talebzadeh
What are you passing as parameters to spark-submit? ${SPARK_HOME}/bin/spark-submit \ --executor-cores=12 \ Also check http://spark.apache.org/docs/latest/configuration.html under Execution Behavior / spark.executor.cores. HTH Dr Mich Talebzadeh
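The same setting can also be applied programmatically; a minimal sketch with illustrative values (spark-defaults.conf or the command-line flags work equally well):

    // Sketch: request more cores (and hence concurrent tasks) per executor.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("StreamingImpl")
      .set("spark.executor.cores", "4")  // cores per executor = concurrent tasks per executor
      .set("spark.cores.max", "12")      // total cores to claim on a standalone cluster
    val sc = new SparkContext(conf)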

how to increase threads per executor

2016-06-02 Thread Andres M Jimenez T
Hi, I am working with Spark 1.6.1, using Kafka direct connect for streaming data, with the Spark scheduler and 3 slaves. The Kafka topic is partitioned with a value of 10. The problem I have is that there is only one thread per executor running my function (logic implementation). Can anybody tell me

Partitioning Data to optimize combineByKey

2016-06-02 Thread Nathan Case
Hello, I am trying to process a dataset that is approximately 2 TB using a cluster with 4.5 TB of RAM. The data is in Parquet format and is initially loaded into a DataFrame. A subset of the data is then queried for and converted to an RDD for more complicated processing. The first stage of that
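For readers of the thread, a minimal sketch of pre-partitioning before combineByKey so the aggregation avoids a second shuffle (key/value types, the sample data, and the partition count are placeholders):

    // Sketch: hash-partition by key first; combineByKey then reuses that partitioner
    // and the heavy aggregation runs without another shuffle.
    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))  // stand-in for the queried subset
    val partitioned = pairs.partitionBy(new HashPartitioner(400)).persist()

    val combined = partitioned.combineByKey(
      (v: Double) => List(v),                       // createCombiner
      (acc: List[Double], v: Double) => v :: acc,   // mergeValue
      (a: List[Double], b: List[Double]) => a ++ b  // mergeCombiners
    )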

How to generate seeded random numbers in GraphX Pregel API vertex procedure?

2016-06-02 Thread Roman Pastukhov
As far as I understand, the best way to generate seeded random numbers in Spark is to use mapPartitions with a seeded Random instance for each partition. But graph.pregel in GraphX does not have anything similar to mapPartitions. Can something like this be done in the GraphX Pregel API?
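For context, the mapPartitions pattern referred to above looks roughly like this, and one possible (untested) workaround inside a Pregel vertex program is to derive a deterministic seed from the vertex id plus a superstep counter carried in the vertex attribute:

    // The per-partition seeded pattern the question refers to:
    val rdd = sc.parallelize(1 to 100, 4)
    val withRandom = rdd.mapPartitionsWithIndex { (idx, iter) =>
      val rng = new scala.util.Random(42L + idx)   // one seeded RNG per partition
      iter.map(x => (x, rng.nextDouble()))
    }

    // Possible workaround inside a Pregel vprog: seed deterministically per vertex.
    def vertexRandom(vertexId: Long, superstep: Int): Double =
      new scala.util.Random(vertexId * 31L + superstep).nextDouble()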

Re: ImportError: No module named numpy

2016-06-02 Thread nguyen duc tuan
You should set both PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON to the path of your Python interpreter. 2016-06-02 20:32 GMT+07:00 Bhupendra Mishra : > did not resolved. :( > > On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández > wrote: > >> >> On Thu,

Re: ImportError: No module named numpy

2016-06-02 Thread Bhupendra Mishra
It did not resolve the issue. :( On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández wrote: > > On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra < > bhupendra.mis...@gmail.com> wrote: >> >> and i have already exported environment variable in spark-env.sh as >> follows.. error still there

Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Ted Yu
Not much information in the attachment. There was a TimeoutException w.r.t. BlockManagerMaster.removeRdd(). Any chance of more logs? Thanks On Thu, Jun 2, 2016 at 2:07 AM, Vishnu Nair wrote: > Hi Ted > > We use Hadoop 2.6 & Spark 1.3.1. I also attached the error file

Stream reading from database using spark streaming

2016-06-02 Thread Zakaria Hili
I want to use Spark Streaming to read data from an RDBMS database like MySQL, but I don't know how to do this using JavaStreamingContext: JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(500)); DataFrame df = jssc. ?? I searched on the internet but I didn't find

Re: Ignore features in Random Forest

2016-06-02 Thread Neha Mehta
Thanks Yuhao. Regards, Neha On Thu, Jun 2, 2016 at 11:51 AM, Yuhao Yang wrote: > Hi Neha, > > This looks like a feature engineering task. I think VectorSlicer can help > with your case. Please refer to > http://spark.apache.org/docs/latest/ml-features.html#vectorslicer . > >

Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Mich Talebzadeh
Thanks for that. I will have a look. Dr Mich Talebzadeh On 2 June 2016 at

Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Jacek Laskowski
Hi, A few things for closer examination: * Is the yarn master URL accepted in 1.3? I thought it was only in later releases. Since you're seeing the issue, it seems it does work. * I've never seen confs specified using a single string. Can you check in the Web UI that they're applied? * What about this

Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Elliot West
Related to this, there exists an API in Hive to simplify the integration of other frameworks with Hive's ACID feature. See: https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API It contains code for maintaining heartbeats, handling locks and transactions, and

Re: ImportError: No module named numpy

2016-06-02 Thread Sergio Fernández
On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra wrote: > > and i have already exported environment variable in spark-env.sh as > follows.. error still there error: ImportError: No module named numpy > > export PYSPARK_PYTHON=/usr/bin/python > According to the

Fwd: Container preempted by scheduler - Spark job error

2016-06-02 Thread Prabeesh K.
Hi Ted, We use Hadoop 2.6 & Spark 1.3.1. I also attached the error file to this mail; please have a look at it. Thanks On Thu, Jun 2, 2016 at 11:51 AM, Ted Yu wrote: > Can you show the error in bit more detail ? > > Which release of hadoop / Spark are you using ? > > Is

Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Mich Talebzadeh
Hi, Spark does not support transactions because, as I understand it, there is a piece on the execution side that needs to send heartbeats to the Hive metastore saying "a transaction is still alive". That has not been implemented in Spark yet, to my knowledge. Any idea on the timelines when we are going to

Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Ted Yu
Can you show the error in a bit more detail? Which release of Hadoop / Spark are you using? Is CapacityScheduler being used? Thanks On Thu, Jun 2, 2016 at 1:32 AM, Prabeesh K. wrote: > Hi I am using the below command to run a spark job and I get an error like >

Container preempted by scheduler - Spark job error

2016-06-02 Thread Prabeesh K.
Hi, I am using the below command to run a Spark job and I get an error like "Container preempted by scheduler". I am not sure if it's related to wrong usage of memory: nohup ~/spark1.3/bin/spark-submit \ --num-executors 50 \ --master yarn \ --deploy-mode cluster \ --queue adhoc \

Re: ImportError: No module named numpy

2016-06-02 Thread Bhupendra Mishra
It's RHEL, and I have already exported the environment variable in spark-env.sh as follows; the error is still there: ImportError: No module named numpy. export PYSPARK_PYTHON=/usr/bin/python Thanks On Thu, Jun 2, 2016 at 12:04 AM, Julio Antonio Soto de Vicente < ju...@esbet.es> wrote: > Try adding

Fwd: Beeline - Spark thrift server user retrieval Issue

2016-06-02 Thread pooja mehta
-- Forwarded message -- From: pooja mehta Date: Thu, Jun 2, 2016 at 1:25 PM Subject: Fwd: Beeline - Spark thrift server user retrieval Issue To: user-subscr...@spark.apache.org -- Forwarded message -- From: pooja mehta

Re: --driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-02 Thread Holden Karau
Also seems like this might be better suited for dev@ On Thursday, June 2, 2016, Sun Rui wrote: > yes, I think you can fire a JIRA issue for this. > But why removing the default value. Seems the default core is 1 according > to >

Re: --driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-02 Thread Sun Rui
Yes, I think you can file a JIRA issue for this. But why remove the default value? It seems the default is 1 core, according to https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala#L110 On Jun 2, 2016, at 05:18, Jacek Laskowski

Re: spark-submit hive connection through spark Initial job has not accepted any resources

2016-06-02 Thread vinayak
Hi Herman, This error comes when you have started your master but no worker has been added to your cluster. Please check through the Spark master UI: is there any worker registered with the master? Also check in your driver code whether you have set configuration.setMaster(local[]); if so, remove it and give spark
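A minimal sketch of the driver-side point above (the app name is a placeholder): leave the master unset in code so the --master flag passed to spark-submit takes effect, rather than hard-coding a local master.

    // Sketch: do NOT hard-code .setMaster("local[*]") when submitting to a cluster;
    // let spark-submit supply --master spark://host:7077 instead.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("MyApp")  // no setMaster here
    val sc = new SparkContext(conf)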

Re: get and append file name in record being reading

2016-06-02 Thread Sun Rui
You can use SparkContext.wholeTextFiles(). For example, suppose all your files are under /tmp/ABC_input/: val rdd = sc.wholeTextFiles("file:///tmp/ABC_input") val rdd1 = rdd.flatMap { case (path, content) => val fileName = new java.io.File(path).getName content.split("\n").map { line =>
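A completed version of that sketch might look like the following (untested; the output format line + "," + fileName is just one possible choice, and an existing SparkContext sc is assumed):

    // Sketch: append the source file name to every line read from the directory.
    val rdd = sc.wholeTextFiles("file:///tmp/ABC_input")
    val rdd1 = rdd.flatMap { case (path, content) =>
      val fileName = new java.io.File(path).getName
      content.split("\n").map { line =>
        line + "," + fileName   // each record carries the name of the file it came from
      }
    }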

Spark Streaming join

2016-06-02 Thread karthik tunga
Hi, I have a scenario where I need to join a DStream with an RDD. This is to add some metadata info to incoming events. This is fairly straightforward. What I also want to do is refresh this metadata RDD on a fixed schedule (or when the underlying HDFS file changes). I want to "expire" and reload this
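One common pattern for this (a rough sketch, not from the thread; the reload interval, paths, the events stream, and the SparkContext sc are placeholders) is to keep the metadata RDD in a mutable reference and reload it inside transform, which is evaluated on the driver for every batch:

    // Sketch: join each batch against metadata that is reloaded when it goes stale.
    import org.apache.spark.streaming.dstream.DStream

    def loadMetadata() = sc.textFile("hdfs:///metadata/lookup.csv")
      .map(_.split(",")).map(a => (a(0), a(1))).cache()

    var metadata = loadMetadata()
    var lastLoad = System.currentTimeMillis()

    val events: DStream[(String, String)] = ???     // stand-in for the incoming event stream
    val enriched = events.transform { rdd =>
      if (System.currentTimeMillis() - lastLoad > 10 * 60 * 1000) {  // refresh every 10 minutes
        metadata.unpersist(); metadata = loadMetadata(); lastLoad = System.currentTimeMillis()
      }
      rdd.join(metadata)                            // add metadata to incoming events
    }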

Re: Ignore features in Random Forest

2016-06-02 Thread Yuhao Yang
Hi Neha, This looks like a feature engineering task. I think VectorSlicer can help with your case. Please refer to http://spark.apache.org/docs/latest/ml-features.html#vectorslicer . Regards, Yuhao 2016-06-01 21:18 GMT+08:00 Neha Mehta : > Hi, > > I am performing
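For readers of the archive, a minimal VectorSlicer sketch in Scala (column names, the indices to keep, and the dataset variable are placeholders):

    // Sketch: drop unwanted feature positions before training a Random Forest.
    import org.apache.spark.ml.feature.VectorSlicer
    import org.apache.spark.sql.DataFrame

    val dataset: DataFrame = ???             // DataFrame with a "features" vector column
    val slicer = new VectorSlicer()
      .setInputCol("features")
      .setOutputCol("selectedFeatures")
      .setIndices(Array(0, 2, 3))            // keep only these feature positions

    val sliced = slicer.transform(dataset)
    // then train the RandomForestClassifier/Regressor on selectedFeatures instead of features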