Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-13 Thread Dibyendu Bhattacharya
You can get some good pointers in this JIRA: https://issues.apache.org/jira/browse/SPARK-15796 Dibyendu On Thu, Jul 14, 2016 at 12:53 AM, Sunita wrote: > I am facing the same issue. Upgrading to Spark 1.6 is causing a huge performance loss. Could you solve this issue?

Re: Issue in spark job. Remote RPC client disassociated

2016-07-13 Thread Balachandar R.A.
Hello Ted, thanks for the response. Here is the additional information: I am using Spark 1.6.1 (spark-1.6.1-bin-hadoop2.6). Here is the code snippet: JavaRDD add = jsc.parallelize(listFolders, listFolders.size()); JavaRDD test = add.map(new
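
A minimal Scala sketch of the pattern this thread describes, for reference; processFolder is a hypothetical stand-in for the per-folder logic:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("folder-processing"))

    // Placeholder for the real per-folder work (hypothetical).
    def processFolder(path: String): String = s"processed $path"

    val folders = Seq("/data/f1", "/data/f2", "/data/f3")
    // One partition per folder, so each task handles exactly one folder.
    val results = sc.parallelize(folders, folders.size).map(processFolder).collect()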

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Chanh Le
Can you show me the Spark UI -> Executors tab and Storage tab? They will show us how many executors ran and how much memory we use to cache. > On Jul 14, 2016, at 9:49 AM, Jean Georges Perrin wrote: > I use it as a standalone cluster. > I run it through

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
I use it as a standalone cluster. I run it through start-master, then start-slave. I only have one slave now, but I will probably have a few soon. The "application" is run on a separate box. When everything was running on my Mac, I was in local mode, but I never set up anything in local mode.

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Chanh Le
Hi Jean, how do you run your Spark application: local mode or cluster mode? If you run in local mode, did you use --driver-memory and --executor-memory? In local mode your executor and driver settings may not work the way you expect. > On Jul 14, 2016, at 8:43 AM, Jean Georges Perrin
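
For reference, a minimal sketch of the distinction being drawn, using standard spark-submit flags (app jar and master URL are placeholders):

    # Local mode: everything runs inside the driver JVM, so only the
    # driver's heap matters, and it must be set at launch time.
    spark-submit --master "local[*]" --driver-memory 8g app.jar

    # Cluster mode: driver and executor memory are set independently.
    spark-submit --master spark://master:7077 \
      --driver-memory 4g --executor-memory 8g app.jar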

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
Looks like replacing the setExecutorEnv() with set() did the trick... let's see how fast it'll process my 50 x 10^15 data points... > On Jul 13, 2016, at 9:24 PM, Jean Georges Perrin wrote: > I have added: SparkConf conf = new
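
A Scala sketch of the fix being described: spark.executor.memory is a Spark property, so it goes through set(), whereas setExecutorEnv() only exports environment variables to the executor processes:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("app")
      .set("spark.executor.memory", "8g")   // a Spark property, not an env var
      .setMaster("spark://10.0.100.120:7077")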

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
I have added: SparkConf conf = new SparkConf().setAppName("app").setExecutorEnv("spark.executor.memory", "8g").setMaster("spark://10.0.100.120:7077"); but it did not change a thing. > On Jul 13, 2016, at 9:14 PM, Jean Georges Perrin

Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
Hi, I have a Java memory issue with Spark. The same application that works on my 8GB Mac crashes on my 72GB Ubuntu server... I have changed things in the conf file, but it looks like Spark does not care, so I wonder if my issue is with the driver or the executor. I set: spark.driver.memory

Doing record linkage using string comparators in Spark

2016-07-13 Thread Linh Tran
Hi guys, I'm hoping that someone can help me make my setup more efficient. I'm trying to do record linkage across 2.5 billion records and have set myself up in Spark to handle the data. Right now, I'm relying on R (with the stringdist and RecordLinkage packages) to do the actual
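
A sketch of moving the comparison into Spark itself, assuming an existing SparkContext sc and records that carry a blocking key so the 2.5 billion rows are never fully cross-joined; the Rec fields are hypothetical, and the edit distance comes from commons-lang3, which is on Spark's classpath:

    import org.apache.commons.lang3.StringUtils

    case class Rec(id: Long, block: String, name: String)

    val recs = sc.parallelize(Seq(
      Rec(1L, "SM", "SMITH"), Rec(2L, "SM", "SMYTH")))

    // Join records sharing a blocking key, keep each unordered pair once,
    // then score the pair with Levenshtein distance.
    val scored = recs.keyBy(_.block).join(recs.keyBy(_.block))
      .filter { case (_, (a, b)) => a.id < b.id }
      .map { case (_, (a, b)) =>
        (a.id, b.id, StringUtils.getLevenshteinDistance(a.name, b.name)) }

    val candidateMatches = scored.filter(_._3 <= 1).collect()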

Spark HBase bulk load using hfile format

2016-07-13 Thread yeshwanth kumar
Hi, I am doing a bulk load into HBase in HFile format using saveAsNewAPIHadoopFile. When I try to write, I get an exception: java.io.IOException: Added a key not lexically larger than previous. The following is the code snippet: case class HBaseRow(rowKey: ImmutableBytesWritable, kv: KeyValue)
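
That exception usually means the HFile writer is receiving cells out of order: it requires strictly increasing keys. A sketch of the common fix, a global sort on the raw row-key bytes before writing (paths and the HBase Configuration are placeholders; if one row key carries several KeyValues they must also be ordered by column, and a Kryo serializer may be needed since these HBase types are not Java-serializable):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.KeyValue
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.rdd.RDD

    // Unsigned lexicographic order on raw row keys, matching HBase.
    implicit val rowKeyOrdering: Ordering[Array[Byte]] =
      new Ordering[Array[Byte]] {
        def compare(a: Array[Byte], b: Array[Byte]): Int = Bytes.compareTo(a, b)
      }

    def writeHFiles(rdd: RDD[(Array[Byte], KeyValue)],
                    path: String, conf: Configuration): Unit = {
      rdd.sortByKey()  // global sort so the writer sees increasing keys
        .map { case (row, kv) => (new ImmutableBytesWritable(row), kv) }
        .saveAsNewAPIHadoopFile(path,
          classOf[ImmutableBytesWritable], classOf[KeyValue],
          classOf[HFileOutputFormat2], conf)
    }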

Re: Dense Vectors outputs in feature engineering

2016-07-13 Thread disha_dp
Hi Ian, you can create a dense vector of your features as follows: - String-index your features - Invoke one-hot encoding on them, which generates a sparse vector - Now, in case you wish to merge these features, use VectorAssembler (optional) - After transforming the dataframe to return
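
A sketch of the densifying step in Scala (Spark 1.6 mllib Vector types; the "encoded" DataFrame and column names are hypothetical):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.sql.functions.udf

    // Turn the sparse OneHotEncoder output into a dense vector column.
    val toDense = udf((v: Vector) => v.toDense.asInstanceOf[Vector])
    val densified = encoded.withColumn("featuresDense", toDense(encoded("features")))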

Dense Vectors outputs in feature engineering

2016-07-13 Thread rachmaninovquartet
Hi, I'm trying to use the StringIndexer and OneHotEncoder, in order to vectorize some of my features. Unfortunately, OneHotEncoder only returns sparse vectors. I can't find a way, much less an efficient one, to convert the columns generated by OneHotEncoder into dense vectors. I need this as I

Re: Online evaluation of MLLIB model

2016-07-13 Thread Jacek Laskowski
No real time in Spark unless near real time is your real time :-) Use Spark ML Pipeline API inside Spark Streaming workflow. Jacek On 13 Jul 2016 5:57 p.m., "Danilo Rizzo" wrote: > Hi All, I'm trying to create a ML pipeline that is in charge of the model > training. > In
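
A sketch of that suggestion, assuming an existing SparkContext sc: score a stream with a model trained offline (the source, model path, and feature parsing are placeholders):

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val model = LogisticRegressionModel.load(sc, "/models/lr")  // trained offline

    ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .map(model.predict)   // near-real-time scoring, one micro-batch at a time
      .print()

    ssc.start()
    ssc.awaitTermination()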

Re: Structured Streaming and Microbatches

2016-07-13 Thread Jacek Laskowski
Hi, It's still microbatching architecture with triggers as batchIntervals. It's just faster by default and the API is more pleasant, i.e. Dataset-driven. Jacek On 13 Jul 2016 10:35 p.m., "Matthias Niehoff" < matthias.nieh...@codecentric.de> wrote: Hi everybody, as far as I understand with new
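
A sketch of what the trigger looks like in the Spark 2.0 API, where it plays the role of the old batchInterval (streamingDF, source, and sink are placeholders):

    import org.apache.spark.sql.streaming.ProcessingTime

    val query = streamingDF.writeStream
      .format("console")
      .trigger(ProcessingTime("10 seconds"))  // fire a micro-batch every 10s
      .start()

With no trigger set, a new micro-batch starts as soon as the previous one finishes.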

Re: Issue in spark job. Remote RPC client disassociated

2016-07-13 Thread Ted Yu
Which Spark release are you using? Can you disclose what the folder processing does (a code snippet is better)? Thanks On Wed, Jul 13, 2016 at 9:44 AM, Balachandar R.A. wrote: > Hello > In one of my use cases, I need to process a list of folders in parallel. I > used

Structured Streaming and Microbatches

2016-07-13 Thread Matthias Niehoff
Hi everybody, as far as I understand, with the new Structured Streaming API the output will not get processed every x seconds anymore. Instead, the data will be processed as soon as it arrives, though there might be a delay due to the processing time for the data. A small example: data comes in and the

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-13 Thread Sunita
I am facing the same issue. Upgrading to Spark 1.6 is causing a huge performance loss. Could you solve this issue? I am also attempting the memory settings mentioned at http://spark.apache.org/docs/latest/configuration.html#memory-management but it's not making a lot of difference. Appreciate your inputs.
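
For reference, a minimal sketch of the two knobs discussed around SPARK-15796 (the app jar name is a placeholder): either fall back to the pre-1.6 memory manager, or keep the unified manager but give it a smaller share of the heap:

    # Option 1: restore the Spark 1.5 (legacy) memory manager.
    spark-submit --conf spark.memory.useLegacyMode=true streaming-app.jar

    # Option 2: shrink the unified execution/storage fraction
    # (the 1.6 default is 0.75).
    spark-submit --conf spark.memory.fraction=0.6 streaming-app.jar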

Re: Spark Website

2016-07-13 Thread Reynold Xin
Thanks for reporting. This is due to https://issues.apache.org/jira/servicedesk/agent/INFRA/issue/INFRA-12055 On Wed, Jul 13, 2016 at 11:52 AM, Pradeep Gollakota wrote: > Worked for me if I go to https://spark.apache.org/site/ but not > https://spark.apache.org > > On

Re: Spark Website

2016-07-13 Thread Pradeep Gollakota
Worked for me if I go to https://spark.apache.org/site/ but not https://spark.apache.org On Wed, Jul 13, 2016 at 11:48 AM, Maurin Lenglart wrote: > Same here > From: Benjamin Kim > Date: Wednesday, July 13, 2016 at 11:47 AM > To: manish

Re: Spark Website

2016-07-13 Thread Maurin Lenglart
Same here From: Benjamin Kim Date: Wednesday, July 13, 2016 at 11:47 AM To: manish ranjan Cc: user Subject: Re: Spark Website It takes me to the directories instead of the webpage. On Jul 13, 2016, at 11:45 AM, manish ranjan

Re: Spark Website

2016-07-13 Thread Benjamin Kim
It takes me to the directories instead of the webpage. > On Jul 13, 2016, at 11:45 AM, manish ranjan wrote: > > working for me. What do you mean 'as supposed to'? > > ~Manish > > > > On Wed, Jul 13, 2016 at 11:45 AM, Benjamin Kim

Re: Spark Website

2016-07-13 Thread manish ranjan
Working for me. What do you mean 'as supposed to'? ~Manish On Wed, Jul 13, 2016 at 11:45 AM, Benjamin Kim wrote: > Has anyone noticed that spark.apache.org is not working as it's supposed > to?

Spark Website

2016-07-13 Thread Benjamin Kim
Has anyone noticed that spark.apache.org is not working as it's supposed to?

Re: Tools for Balancing Partitions by Size

2016-07-13 Thread Pedro Rodriguez
Hi Gourav, In our case, we process raw logs into parquet tables that downstream applications can use for other jobs. The desired outcome is that we only need to worry about unbalanced input data at the preprocess step so that downstream jobs can assume balanced input data. In our specific case,
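
A sketch of that preprocessing step, assuming the raw input lives on HDFS and an existing SparkContext sc and SQLContext sqlContext: derive a partition count from the input size so downstream tables land near a target partition size (the ~128 MB target and paths are placeholders, and raw input size is only a proxy since parquet compresses):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val targetBytes = 128L * 1024 * 1024  // aim for ~128 MB per partition
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val inputBytes = fs.getContentSummary(new Path("/raw/logs")).getLength
    val numPartitions = math.max(1, (inputBytes / targetBytes).toInt)

    sqlContext.read.json("/raw/logs")
      .repartition(numPartitions)
      .write.parquet("/tables/logs")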

Re: When worker is killed driver continues to run causing issues in supervise mode

2016-07-13 Thread Noorul Islam Kamal Malmiyoda
Adding dev list. On Jul 13, 2016 5:38 PM, "Noorul Islam K M" wrote: > Spark version: 1.6.1 > Cluster Manager: Standalone > I am experimenting with cluster mode deployment along with supervise for > high availability of streaming applications. > 1. Submit a streaming job

Issue in spark job. Remote RPC client disassociated

2016-07-13 Thread Balachandar R.A.
Hello, in one of my use cases, I need to process a list of folders in parallel. I used sc.parallelize(list, list.size).map("logic to process the folder"). I have a six-node cluster and there are six folders to process. Ideally, I expect each of my nodes to process one folder. But I see that a

Re: Spark Thrift Server performance

2016-07-13 Thread Mich Talebzadeh
Thanks guys. Any idea on this: what is the limit on the number of users accessing the Thrift Server concurrently? Say using YARN, will YARN control apps accessing the Thrift Server, or can users, each armed with beeline, connect to the Thrift Server? Say my STS has this conf below

Re: Spark Thrift Server performance

2016-07-13 Thread ayan guha
Not really, that is not the primary intention. Our main goal is poor man's high availability (as STS does not provide HA mechanism like HS2) :). Additionally, we have made STS part of Ambari AUTO_START group, so Ambari brings up STS if it goes down for some intermittent reason. On Thu, Jul 14,

Online evaluation of MLLIB model

2016-07-13 Thread Danilo Rizzo
Hi all, I'm trying to create an ML pipeline that is in charge of the model training. In my use case I need to evaluate the model in real time from an external application; googling, I saw that I can submit a Spark job using the submit API. Not sure if this is the best way to achieve that,

Re: Dependencies with runing Spark Streaming on Mesos cluster using Python

2016-07-13 Thread Shuai Lin
I think there are two options for you. First, you can set `--conf spark.mesos.executor.docker.image=adolphlwq/mesos-for-spark-exector-image:1.6.0.beta2` in your spark-submit args, so Mesos would launch the executor with your custom image. Or you can remove the `local:` prefix in the --jars flag,
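
A sketch of the first option, reusing the image name from this thread (the script and its remaining arguments are placeholders):

    dcos spark run --submit-args='--conf spark.mesos.executor.docker.image=adolphlwq/mesos-for-spark-exector-image:1.6.0.beta2 spark2cassandra.py'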

Re: Spark Thrift Server performance

2016-07-13 Thread Michael Segel
Hey, silly question? If you’re running a load balancer, are you trying to reuse the RDDs between jobs? TIA -Mike > On Jul 13, 2016, at 9:08 AM, ayan guha > wrote: > > My 2 cents: > > Yes, we are running multiple STS (we are running on

Re: Tools for Balancing Partitions by Size

2016-07-13 Thread Gourav Sengupta
Hi, using file size is a very bad way of managing data, provided you think that volume, variety, and veracity do not hold true. It is a very bad way of thinking about and designing data solutions; you are bound to hit bottlenecks, optimization issues, and manual interventions. I have found

Re: Spark Thrift Server performance

2016-07-13 Thread ayan guha
My 2 cents: Yes, we are running multiple STS (we are running on different nodes, but you can run on same node, different ports). Using Ambari, it is really convenient to manage. We have set up a nginx load balancer as well pointing to both services and all our external BI tools connect to the

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-13 Thread Michael Segel
I believe that there is one JVM for the Thrift Service and that there is only one context for the service. This would allow you to share RDDs across multiple jobs, however… not so great for security. HTH… > On Jul 10, 2016, at 10:05 PM, Takeshi Yamamuro

Re: Dependencies with runing Spark Streaming on Mesos cluster using Python

2016-07-13 Thread Luke Adolph
Update: I rebuilt my mesos-exector-image and downloaded spark-streaming-kafka_2.10-1.6.0.jar into /linker/jars. I changed my submit command: dcos spark run \ --submit-args='--jars local:/linker/jars/spark-streaming-kafka_2.10-1.6.0.jar spark2cassandra.py 10.140.0.14:2181

When worker is killed driver continues to run causing issues in supervise mode

2016-07-13 Thread Noorul Islam K M
Spark version: 1.6.1 Cluster Manager: Standalone I am experimenting with cluster mode deployment along with supervise for high availability of streaming applications. 1. Submit a streaming job in cluster mode with supervise 2. Say that driver is scheduled on worker1. The app started

Spark 7736

2016-07-13 Thread ayan guha
Hi, I am facing the same issue reported in SPARK-7736, on Spark 1.6.0. Is there any way to reopen the JIRA? Reproduction steps attached. -- Best Regards, Ayan Guha Spark 7736.docx Description: MS-Word 2007 document

Dependencies with runing Spark Streaming on Mesos cluster using Python

2016-07-13 Thread Luke Adolph
Hi all: My Spark runs on Mesos. I wrote a Spark Streaming app using Python; the code is on GitHub. The app has the dependency org.apache.spark:spark-streaming-kafka_2.10:1.6.1. Spark on Mesos has two important concepts: Spark Framework and Spark

Flume integration

2016-07-13 Thread Ian Brooks
Hi, I'm currently trying to implement a prototype Spark application that gets data from Flume and processes it. I'm using the pull-based method mentioned in https://spark.apache.org/docs/1.6.1/streaming-flume-integration.html This initially works fine for getting data from Flume; however
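
For reference, a minimal Scala sketch of the pull-based receiver from those docs, assuming an existing SparkContext sc (the host, port, and batch interval are placeholders):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.flume.FlumeUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    // Polls the Flume agent's Spark sink rather than listening for pushes.
    val events = FlumeUtils.createPollingStream(
      ssc, "flume-agent-host", 9988, StorageLevel.MEMORY_AND_DISK_SER_2)
    events.map(e => new String(e.event.getBody.array())).print()
    ssc.start()
    ssc.awaitTermination()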

Re: Inode for STS

2016-07-13 Thread ayan guha
Thanks Christophe. Any comment on the JIRA from a Spark dev community member would be really helpful. What I saw today is that shutting down the Thrift Server process leads to a cleanup. Also, we started removing any empty folders from /tmp. Is there any other or better method? On Wed, Jul 13, 2016 at

Re: Issue with Spark on 25 nodes cluster

2016-07-13 Thread ANDREA SPINA
Hi, I solved it by increasing the Akka timeout. All the best, 2016-06-28 15:04 GMT+02:00 ANDREA SPINA <74...@studenti.unimore.it>: > Hello everyone, > I am running some experiments with Spark 1.4.0 on a ~80GiB dataset located > on hdfs-2.7.1. The environment is a 25-node cluster, 16 cores
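
For reference, an example of that kind of setting on Spark 1.4-era deployments (the value and app jar are illustrative; these Akka properties were removed in later releases):

    spark-submit --conf spark.akka.timeout=300 app.jar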

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-13 Thread Chanh Le
Hi Ayan, I don't know if I did something wrong, but I still couldn't set the hive.metastore.warehouse.dir property. I set 3 hive-site.xml files in the Spark, Zeppelin, and Hive locations, but it still didn't work: zeppelin/conf/hive-site.xml spark/conf/hive-site.xml hive/conf/hive-site.xml My

Spark, Kryo Serialization Issue with ProtoBuf field

2016-07-13 Thread Nkechi Achara
Hi, I am seeing an error relating to serialization of a protobuf field when my Spark job transforms an RDD. com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException Serialization trace: otherAuthors_
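
One common fix, sketched below on the assumption that the failing field belongs to a generated protobuf class: register chill-protobuf's ProtobufSerializer for it. MyProtoMessage is a hypothetical stand-in, and the chill-protobuf dependency must be added to the job:

    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.protobuf.ProtobufSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class ProtoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // Serialize the generated protobuf class via its own byte format.
        kryo.register(classOf[MyProtoMessage], new ProtobufSerializer)
      }
    }

    // Enabled with: --conf spark.kryo.registrator=ProtoRegistrator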

Re: Handling categorical variables in StreamingLogisticRegressionwithSGD

2016-07-13 Thread kundan kumar
Hi Sean, thanks for the reply!! Is there anything already available in Spark that can fix the depth of categorical variables? The OneHotEncoder changes the length of the vector it creates depending on the number of distinct values coming in the stream. Is there any parameter available with the

Problem saving Hive table with Overwrite mode

2016-07-13 Thread nimrodo
Hi, I'm trying to write a partitioned parquet table and save it as a Hive table at a specific path. The code I'm using is in Java (columns and table names are a bit different in my real code) and the code is executed using Airflow, which calls spark-submit:
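
A sketch of the pattern being described, in Scala (the thread's code is Java, but the DataFrameWriter calls are the same; the DataFrame, names, and path are placeholders):

    import org.apache.spark.sql.SaveMode

    df.write
      .mode(SaveMode.Overwrite)
      .partitionBy("event_date")
      .format("parquet")
      .option("path", "/warehouse/custom/events")  // external table location
      .saveAsTable("mydb.events")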

Spark Thrift Server performance

2016-07-13 Thread Mich Talebzadeh
Hi, I need some feedback on the performance of the Spark Thrift Server (STS). As far as I can ascertain, one can start STS passing the usual Spark parameters: ${SPARK_HOME}/sbin/start-thriftserver.sh \ --master spark://50.140.197.217:7077 \ --hiveconf

Re: Inode for STS

2016-07-13 Thread Christophe Préaud
Hi Ayan, I have opened a JIRA about this issue, but there has been no answer so far: SPARK-15401 Regards, Christophe. On 13/07/16 05:54, ayan guha wrote: Hi We are running Spark Thrift Server as a long running application. However, it looks

Any Idea about this error : IllegalArgumentException: File segment length cannot be negative ?

2016-07-13 Thread Dibyendu Bhattacharya
In a Spark Streaming job, I see a batch failed with the following error. I haven't seen anything like this earlier. This happened during a shuffle for one batch (it hasn't recurred since). Just curious to know what can cause this error. I am running Spark 1.5.1. Regards, Dibyendu Job aborted