Adding DEV.
Or is there any other way to do subtractByKey using Dataset APIs?
Thanks
Ankur
On Wed, Mar 1, 2017 at 1:28 PM, Ankur Srivastava wrote:
> Hi Users,
>
> We are facing an issue with left_outer join using the Spark Dataset API in the 2.0
> Java API. Below is the
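For the subtractByKey question above, one option with the Dataset/DataFrame API is a
left anti join: keep rows of the left side whose key has no match on the right. A
minimal sketch in PySpark, purely illustrative (the column names and data are made
up, and the same "left_anti" join type exists in the Scala/Java Dataset API):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("subtract-by-key-sketch").getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["key", "value"])
    df2 = spark.createDataFrame([(2, "x")], ["key", "other"])

    # Rows of df1 whose key does not appear in df2 -- roughly RDD.subtractByKey.
    result = df1.join(df2, on="key", how="left_anti")
    result.show()

    spark.stop()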
For RDDs the shuffle is already skipped, but the sort is not. In spark-sorted
we track partitioning and sorting within partitions for key-value RDDs and
can avoid the sort. See:
https://github.com/tresata/spark-sorted
For Dataset/DataFrame such optimizations are done automatically, however
it's
Sending it to dev.
Can you please help by providing some ideas on the below.
Regards
Rohit
> On Feb 23, 2017, at 3:47 PM, Rohit Verma wrote:
>
> Hi
>
> While joining two columns of different datasets, how can we optimize the join if both
> columns are pre-sorted within
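One way to let the optimizer skip the shuffle (and, depending on version and file
layout, the sort) at join time is to pre-bucket and pre-sort both sides with
DataFrameWriter.bucketBy/sortBy and saveAsTable. A hedged sketch in PySpark: bucketBy
appeared in the PySpark writer API around 2.3, the table names and sizes are
illustrative, and explain() is only there to inspect the resulting physical plan:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("bucketed-join-sketch")
             # disable broadcast joins so the example actually exercises a sort-merge join
             .config("spark.sql.autoBroadcastJoinThreshold", "-1")
             .getOrCreate())

    left = spark.range(0, 1000).withColumnRenamed("id", "key")
    right = spark.range(0, 1000, 2).withColumnRenamed("id", "key")

    # Both sides bucketed and sorted on the join key, with the same number of buckets.
    left.write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("left_b")
    right.write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("right_b")

    joined = spark.table("left_b").join(spark.table("right_b"), "key")
    joined.explain()  # plan should show no Exchange before the join; check whether a Sort remains

    spark.stop()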
Hey,
It depends on your configuration. I build my Dockerfile with Spark 2.0
installed, and in PyCharm I configure the interpreter to use Docker
and add the following env variable in the script's run configuration. You can check the
Dockerfile here: https://github.com/zhangxuan1918/spark2.0
PYSPARK_PYTHON
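For reference, the same idea can also be done in the script itself rather than in the
PyCharm run configuration. A minimal sketch; the interpreter paths and SPARK_HOME
below are placeholders for whatever is inside your Docker image:

    import os

    # Placeholders: point these at the Python binary and Spark install in the image.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
    os.environ.setdefault("SPARK_HOME", "/opt/spark")

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("pycharm-docker-test").getOrCreate()
    print(spark.range(10).count())
    spark.stop()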
Hey,
I once found an article about that:
https://mengdong.github.io/2016/08/08/fully-armed-pyspark-with-ipython-and-jupyter/
And I once managed to set it up in PyCharm as well. What I had to do was add
/path/to/spark to a system variable called "PYTHONPATH".
Try that one, it might help ☺
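If editing PYTHONPATH by hand is a hassle, the third-party findspark package
(pip install findspark) does the same thing programmatically. A minimal sketch,
reusing the /path/to/spark placeholder from above:

    import findspark
    findspark.init("/path/to/spark")  # adds SPARK_HOME's python/ and py4j dirs to sys.path

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("pycharm-test").getOrCreate()
    print(spark.version)
    spark.stop()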
There are a lot of articles available online that guide you through setting up
Jupyter notebooks to run Spark programs. For example:
http://blog.insightdatalabs.com/jupyter-on-apache-spark-step-by-step/
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/spark_ipython.html
Hi,
Thanks for your answer.
Sorry, I am a complete beginner at running code in Spark.
Could you please tell me a bit more in details how to do that?
I installed ipython and Jupyter notebook on my local machine. But how can I
run the code using them? Before, I tried to run the code with
Jupyter notebook/IPython can be connected to Apache Spark.
Thank you,
Pushkar Gujar
On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi
wrote:
> Hi everyone,
>
> I am trying to run a spark code on Pycharm. I tried to give the path of
> spark as a environment variable
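For completeness: a common way to wire Jupyter to Spark is to set
PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS="notebook" before
launching pyspark (this is roughly what the linked articles do). Alternatively,
once pyspark is importable (see the PYTHONPATH/findspark notes above), a first
notebook cell can simply build a session itself. A minimal, illustrative cell:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("jupyter-test").getOrCreate()
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"]).show()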
Hi everyone,
I am trying to run Spark code in PyCharm. I tried to give the path of
Spark as an environment variable in the PyCharm run configuration.
Unfortunately, I get an error. Does anyone know how I can run Spark
code in PyCharm?
It doesn't necessarily have to be PyCharm; if you know any
Hi,
I think I have found what was causing the exception.
"spark.app.name" seams to be required in sparkProperties to
successfully submit the job. At least when I include the app name my
job is successfully submitted to the spark cluster.
Silly mistakes, but the error message is not helping
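To make that concrete, here is a hedged sketch of a CreateSubmissionRequest against
the (unofficial, undocumented) port-6066 endpoint, with spark.app.name included in
sparkProperties. The host, jar path, and main class are placeholders, and the field
names follow the payload shape commonly shown in blog posts about this API, so verify
them against your Spark version:

    import json
    import requests  # third-party: pip install requests

    payload = {
        "action": "CreateSubmissionRequest",
        "appResource": "hdfs:///jars/my-app.jar",        # placeholder
        "mainClass": "com.example.Main",                 # placeholder
        "appArgs": [],
        "clientSparkVersion": "2.1.0",
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.app.name": "my-app",                  # required, per the message above
            "spark.master": "spark://spark-cluster-ip:6066",
            "spark.submit.deployMode": "cluster",
            "spark.jars": "hdfs:///jars/my-app.jar",
            "spark.driver.supervise": "false",
        },
    }

    resp = requests.post(
        "http://spark-cluster-ip:6066/v1/submissions/create",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    print(resp.json())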
Gaurav,
I would suggest elastic search.
> On Mar 3, 2017, at 3:27 AM, Gaurav1809 wrote:
>
> Hello All,
> One small question if you can help me out. I am working on Server log
> processing in Spark for my organization. I am using regular expressions
> (Regex) for
Hi,
I am trying to submit Spark jobs via the "hidden" REST API
(http://spark-cluster-ip:6066/v1/submissions/..), but I am getting an
ErrorResponse and I can't find what I am doing wrong.
I have been following the instructions from this blog post:
Dear all,
For a particular Spark extension, I would like to know whether it would be
possible for the Resource Manager (e.g. the Standalone cluster manager) to know
or estimate the total execution time of a submitted application, or the
remaining execution time of such an application.
Thanks.
--
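One rough approximation (not a cluster-manager feature) is to poll the driver's
monitoring REST API and extrapolate from task progress. A hedged sketch: the host and
port are placeholders, it only sees jobs that have already been submitted, and the
field names should be checked against the /api/v1 documentation for your Spark
version:

    import requests  # third-party: pip install requests

    driver_ui = "http://driver-host:4040"  # placeholder

    app_id = requests.get(driver_ui + "/api/v1/applications").json()[0]["id"]
    jobs = requests.get(driver_ui + "/api/v1/applications/" + app_id + "/jobs").json()

    total = sum(j["numTasks"] for j in jobs)
    done = sum(j["numCompletedTasks"] for j in jobs)
    print("%d/%d tasks finished (~%.1f%% of the work submitted so far)"
          % (done, total, 100.0 * done / max(total, 1)))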
Thanks all. How about Kafka HA, which is important? Is it best to use
application-specific Kafka delivery or Kafka MirrorMaker?
Cheers
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> When initial jobs have not accepted any resources, what all can be
> wrong? Going through Stack Overflow and various blogs does not help. Maybe we
> need better logging for this? Adding dev
>
Did you take a look at the Spark UI to see your resource availability?
Thanks and Regards
Noorul
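"Initial job has not accepted any resources" usually means the application is asking
for more cores or memory than the cluster currently has free. A hedged sketch of
capping the request when building the session; the master URL and values are
illustrative only and should be set below what the UI shows as available:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("spark://master-host:7077")        # placeholder master URL
             .appName("resource-check")
             .config("spark.executor.memory", "1g")     # keep below each worker's free memory
             .config("spark.cores.max", "2")            # keep below the cluster's free cores
             .getOrCreate())

    print(spark.range(10).count())
    spark.stop()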
I forgot to mention it also depends on the Spark Kafka connector you use.
If it's receiver-based, I recommend a dedicated ZooKeeper cluster because
it is used to store offsets. If it's receiverless, ZooKeeper can be shared.
2017-03-03 9:29 GMT+01:00 Jörn Franke :
> I think
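For context, a receiverless ("direct") stream looks like the sketch below in PySpark;
with this approach Spark tracks offsets itself (in checkpoints) rather than in
ZooKeeper. This assumes the spark-streaming-kafka-0-8 artifact is on the classpath,
and the topic and broker names are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="direct-kafka-sketch")
    ssc = StreamingContext(sc, batchDuration=5)

    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["logs"],                                       # placeholder topic
        kafkaParams={"metadata.broker.list": "broker1:9092"},  # placeholder broker list
    )
    stream.map(lambda kv: kv[1]).count().pprint()  # kv = (key, value)

    ssc.start()
    ssc.awaitTermination()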
Hi,
Depending on the Kafka version (< 0.8.2, I think), offsets are managed in
ZooKeeper, and if you have lots of consumers it's recommended to use a
dedicated ZooKeeper cluster (always with dedicated disks; SSDs are even
better). On newer versions offsets are managed in special Kafka topics and
I think this highly depends on the risk that you want to be exposed to. If you
have it on dedicated nodes, there is less influence from other processes.
I have seen both: on Hadoop nodes or dedicated. On Hadoop I would not recommend
putting it on data nodes/heavily utilized nodes.
ZooKeeper does
Hello All,
One small question, if you can help me out. I am working on server log
processing in Spark for my organization. I am using regular expressions
(regex) for pattern matching and then do further analysis on the identified
pieces: IP, username, date, etc.
Is this a good approach?
Shall I go
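For reference, a minimal sketch of the regex approach described above, using
regexp_extract on a DataFrame; the pattern and the sample log line are illustrative
only:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract

    spark = SparkSession.builder.master("local[*]").appName("log-regex-sketch").getOrCreate()

    lines = spark.createDataFrame(
        [('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',)],
        ["value"],
    )

    pattern = r'^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'
    parsed = lines.select(
        regexp_extract("value", pattern, 1).alias("ip"),
        regexp_extract("value", pattern, 2).alias("user"),
        regexp_extract("value", pattern, 3).alias("timestamp"),
        regexp_extract("value", pattern, 4).alias("method"),
        regexp_extract("value", pattern, 5).alias("path"),
        regexp_extract("value", pattern, 6).alias("status"),
    )
    parsed.show(truncate=False)

    spark.stop()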
hi,
In DEV, Kafka and ZooKeeper services can be co-located on the same
physical hosts.
In Prod, moving forward, do we need to set up ZooKeeper on its own cluster,
not shared with the Hadoop cluster? Can these services be shared within the
Hadoop cluster?
How best to set up the ZooKeeper that is needed