Re: Spark 2.0 issue with left_outer join

2017-03-03 Thread Ankur Srivastava
Adding DEV. Or is there any other way to do subtractByKey using Dataset APIs? Thanks Ankur On Wed, Mar 1, 2017 at 1:28 PM, Ankur Srivastava wrote: > Hi Users, > > We are facing an issue with left_outer join using Spark Dataset API in 2.0 > Java API. Below is the code we have > > Dataset badIds
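
One way to get subtractByKey-style behaviour with the Dataset/DataFrame API is a left anti join on the key column (available since Spark 2.0). A minimal PySpark sketch under that assumption, with made-up badIds/records data standing in for the Datasets in the thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anti-join-sketch").getOrCreate()

# Made-up stand-ins for the badIds / records Datasets mentioned in the thread.
bad_ids = spark.createDataFrame([(2,), (4,)], ["id"])
records = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c"), (4, "d")], ["id", "value"])

# subtractByKey equivalent: keep only rows whose id has no match in bad_ids.
kept = records.join(bad_ids, on="id", how="left_anti")
kept.show()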

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Koert Kuipers
For RDDs the shuffle is already skipped but the sort is not. In spark-sorted we track partitioning and sorting within partitions for key-value RDDs and can avoid the sort. See: https://github.com/tresata/spark-sorted For Dataset/DataFrame such optimizations are done automatically; however, it's curr
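
On the DataFrame side, one way to get a join that avoids the shuffle (and, with a matching sort, the per-partition sort) is bucketing at write time. A minimal sketch with hypothetical table names; bucketBy/sortBy live on the DataFrameWriter (Scala/Java since 2.0, exposed in the Python writer in later releases), and whether the sort is actually skipped depends on matching bucket counts and sort columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketed-join-sketch").getOrCreate()

# Hypothetical data standing in for the two pre-sorted datasets in the thread.
left = spark.range(0, 1000).withColumnRenamed("id", "key")
right = spark.range(0, 1000).withColumnRenamed("id", "key")

# Persist both sides bucketed and sorted on the join key.
left.write.bucketBy(8, "key").sortBy("key").saveAsTable("left_bucketed")
right.write.bucketBy(8, "key").sortBy("key").saveAsTable("right_bucketed")

# The sort-merge join on the bucket key can then skip the exchange (and ideally the sort).
joined = spark.table("left_bucketed").join(spark.table("right_bucketed"), "key")
joined.explain()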

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Rohit Verma
Sending it to the dev list. Can you please help me by providing some ideas for the below. Regards Rohit > On Feb 23, 2017, at 3:47 PM, Rohit Verma wrote: > > Hi > > While joining two columns of different datasets, how to optimize the join if both > the columns are pre-sorted within the dataset. > So that when sp

Re: How to run a spark on Pycharm

2017-03-03 Thread TheGeorge1918 .
Hey, Depends on your configuration. I configure my Dockerfile with Spark 2.0 installed and, in PyCharm, properly configure the interpreter to use Docker and add the following env vars in your script configuration. You can check the Dockerfile here: https://github.com/zhangxuan1918/spark2.0 PYSPARK_PYTHON /u
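
A sketch of the kind of environment the run/script configuration needs before pyspark is imported; the paths below are placeholders for wherever Spark and the Python interpreter live inside the Docker image:

import os

# Placeholder locations -- adjust to the image used as the PyCharm remote interpreter.
os.environ["SPARK_HOME"] = "/usr/local/spark"
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

The same three variables can also be entered directly in the environment-variables field of the PyCharm run configuration instead of being set in code.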

RE: How to run a spark on Pycharm

2017-03-03 Thread Sidney Feiner
Hey, I once found an article about that: https://mengdong.github.io/2016/08/08/fully-armed-pyspark-with-ipython-and-jupyter/ And I once managed to set it up in PyCharm as well. What I had to do was add /path/to/spark to a system variable called "PYTHONPATH". Try that one, it might help ☺ Fro
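
Setting PYTHONPATH this way amounts to making the pyspark package and the bundled py4j zip importable. A rough in-code equivalent, keeping /path/to/spark as the placeholder used above (the py4j zip name varies by Spark version, hence the glob):

import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "/path/to/spark")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

import pyspark  # should now resolve without a pip-installed pyspark
print(pyspark.__version__)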

Re: How to run a spark on Pycharm

2017-03-03 Thread Pushkar.Gujar
There are lots of articles available online which guide you through setting up Jupyter notebooks to run Spark programs. For example: http://blog.insightdatalabs.com/jupyter-on-apache-spark-step-by-step/ https://www.cloudera.com/documentation/enterprise/5-6-x/topics/spark_ipython.html https://gist.github.c

Re: How to run a spark on Pycharm

2017-03-03 Thread Anahita Talebi
Hi, Thanks for your answer. Sorry, I am a complete beginner at running code in Spark. Could you please tell me in a bit more detail how to do that? I installed IPython and Jupyter notebook on my local machine. But how can I run the code using them? Before, I tried to run the code with Pycha

Re: How to run a spark on Pycharm

2017-03-03 Thread Pushkar.Gujar
Jupyter notebook/IPython can be connected to Apache Spark. Thank you, *Pushkar Gujar* On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi wrote: > Hi everyone, > > I am trying to run Spark code in PyCharm. I tried to give the path of > Spark as an environment variable to the configuration of Pycha
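
One common way to do that is the third-party findspark package (pip install findspark), which does the PYTHONPATH wiring inside the notebook kernel. A minimal sketch, with the Spark install location as a placeholder:

import findspark
findspark.init("/usr/local/spark")  # placeholder install path

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-session").getOrCreate()
spark.range(10).show()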

How to run a spark on Pycharm

2017-03-03 Thread Anahita Talebi
Hi everyone, I am trying to run Spark code in PyCharm. I tried to give the path of Spark as an environment variable to the configuration of PyCharm. Unfortunately, I get an error. Does anyone know how I can run the Spark code in PyCharm? It doesn't necessarily have to be PyCharm; if you know any ot

Re: Problems when submitting a spark job via the REST API

2017-03-03 Thread Kristinn Rúnarsson
Hi, I think I have found what was causing the exception. "spark.app.name" seems to be required in sparkProperties to successfully submit the job. At least when I include the app name, my job is successfully submitted to the Spark cluster. Silly mistakes, but the error message is not helping m
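
For reference, a sketch of a CreateSubmissionRequest payload along the lines of the blog post cited in this thread, with spark.app.name included; the host, jar path, class name and Spark version below are placeholders:

import requests

payload = {
    "action": "CreateSubmissionRequest",
    "appResource": "hdfs:///jobs/my-job.jar",       # placeholder
    "mainClass": "com.example.MyJob",               # placeholder
    "appArgs": ["arg1"],
    "clientSparkVersion": "2.0.2",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.app.name": "my-job",                 # the property that fixed the ErrorResponse
        "spark.master": "spark://spark-cluster-ip:6066",
        "spark.submit.deployMode": "cluster",
        "spark.jars": "hdfs:///jobs/my-job.jar",
        "spark.driver.supervise": "false",
    },
}

response = requests.post(
    "http://spark-cluster-ip:6066/v1/submissions/create", json=payload)
print(response.json())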

Re: Server Log Processing - Regex or ElasticSearch?

2017-03-03 Thread veera satya nv Dantuluri
Gaurav, I would suggest Elasticsearch. > On Mar 3, 2017, at 3:27 AM, Gaurav1809 wrote: > > Hello All, > One small question if you can help me out. I am working on server log > processing in Spark for my organization. I am using regular expressions > (regex) for pattern matching and then do fu

Problems when submitting a spark job via the REST API

2017-03-03 Thread Kristinn Rúnarsson
Hi, I am trying to submit Spark jobs via the "hidden" REST API (http://spark-cluster-ip:6066/v1/submissions/..), but I am getting an ErrorResponse and I can't find what I am doing wrong. I have been following the instructions from this blog post: http://arturmkrtchyan.com/apache-spark-hidden-rest-api

Resource manager: estimation of application execution time/remaining time.

2017-03-03 Thread Mazen
Dear all, For a particular Spark extension, I would like to know whether it would be possible for the Resource Manager (e.g. the Standalone cluster manager) to know or estimate the total execution time of a submitted application, or the remaining execution time of such an application. Thanks.

Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Mich Talebzadeh
Thanks all. How about Kafka HA, which is important? Is it best to use application-specific Kafka delivery or Kafka MirrorMaker? Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-03-03 Thread Noorul Islam K M
> When initial jobs have not accepted any resources, then what all can be > wrong? Going through Stack Overflow and various blogs does not help. Maybe > need better logging for this? Adding dev > Did you take a look at the Spark UI to see your resource availability? Thanks and Regards Noorul
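
That warning usually means the application is requesting more memory or cores per executor than any registered worker can currently offer. A minimal sketch of asking for conservative resources while comparing against what the Master/worker pages in the UI report; the figures are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("resource-check")
         .config("spark.executor.memory", "1g")   # keep below the memory a worker advertises
         .config("spark.executor.cores", "1")
         .config("spark.cores.max", "2")          # standalone mode: total cores for the app
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)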

Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread vincent gromakowski
I forgot to mention it also depends on the Spark Kafka connector you use. If it's receiver-based, I recommend a dedicated ZooKeeper cluster because it is used to store offsets. If it's receiver-less, ZooKeeper can be shared. 2017-03-03 9:29 GMT+01:00 Jörn Franke : > I think this highly depends on
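
For illustration, the direct (receiver-less) DStream connector bundled with Spark 2.x tracks offsets itself (e.g. via checkpoints) rather than in ZooKeeper. A minimal sketch using the 0.8-style Python API, assuming the matching spark-streaming-kafka package is on the classpath; the topic and broker addresses are placeholders:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # 0.8 direct connector bundled with Spark 2.x

sc = SparkContext(appName="kafka-direct-sketch")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Receiver-less stream: Spark asks the brokers for offset ranges each batch,
# so ZooKeeper is not used for offset storage.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092,broker2:9092"})

stream.count().pprint()
ssc.start()
ssc.awaitTermination()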

Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread vincent gromakowski
Hi, Depending on the Kafka version (< 0.8.2 I think), offsets are managed in ZooKeeper, and if you have lots of consumers it's recommended to use a dedicated ZooKeeper cluster (always with dedicated disks; even SSDs are better). On newer versions offsets are managed in special Kafka topics and ZooKeeper

Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Jörn Franke
I think this highly depends on the risk that you want to be exposed to. If you have it on dedicated nodes there is less influence from other processes. I have seen both: on Hadoop nodes or dedicated. On Hadoop I would not recommend putting it on data nodes / heavily utilized nodes. ZooKeeper does no

Server Log Processing - Regex or ElasticSearch?

2017-03-03 Thread Gaurav1809
Hello All, One small question if you can help me out. I am working on server log processing in Spark for my organization. I am using regular expressions (regex) for pattern matching and then doing further analysis on the identified pieces: IP, username, date, etc. Is this a good approach? Shall I go fo
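
The regex approach can also stay inside Spark SQL via regexp_extract, which keeps the parsed fields as DataFrame columns for further analysis. A minimal sketch with a hypothetical log path and a simplified Common Log Format pattern:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("log-parse-sketch").getOrCreate()

logs = spark.read.text("hdfs:///logs/access.log")  # placeholder path
pattern = r'^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+)'

parsed = logs.select(
    regexp_extract("value", pattern, 1).alias("ip"),
    regexp_extract("value", pattern, 2).alias("user"),
    regexp_extract("value", pattern, 3).alias("timestamp"),
    regexp_extract("value", pattern, 4).alias("method"),
    regexp_extract("value", pattern, 5).alias("path"),
)
parsed.show(5, truncate=False)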

kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Mich Talebzadeh
Hi, In DEV, Kafka and ZooKeeper services can be co-located on the same physical hosts. In PROD, moving forward, do we need to set up ZooKeeper on its own cluster, not shared with the Hadoop cluster? Can these services be shared within the Hadoop cluster? How best to set up ZooKeeper that is needed for