Re: Spark 2.0 issue with left_outer join

2017-03-03 Thread Ankur Srivastava
Adding DEV.

Or is there any other way to do subtractByKey using Dataset APIs?

Thanks
Ankur

On Wed, Mar 1, 2017 at 1:28 PM, Ankur Srivastava  wrote:

> Hi Users,
>
> We are facing an issue with a left_outer join using the Spark 2.0 Dataset API
> (Java). Below is the code we have:
>
> Dataset<Row> badIds = filteredDS.groupBy(col("id").alias("bid")).count()
>     .filter((FilterFunction<Row>) row -> (Long) row.getAs("count") > 75000);
> _logger.info("Id count with over 75K records that will be filtered: " + badIds.count());
>
> Dataset<DeviceData> filteredRows = filteredDS.join(broadcast(badIds),
>         filteredDS.col("id").equalTo(badIds.col("bid")), "left_outer")
>     .filter((FilterFunction<Row>) row -> row.getAs("bid") == null)
>     .map((MapFunction<Row, DeviceData>) row -> SomeDataFactory.createObjectFromDDRow(row),
>         Encoders.bean(DeviceData.class));
>
>
> We get the counts in the log file and then the application fails with the below
> exception:
> Exception in thread "main" java.lang.UnsupportedOperationException: Only code-generated evaluation is supported.
> at org.apache.spark.sql.catalyst.expressions.objects.Invoke.eval(objects.scala:118)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:109)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$7.apply(joins.scala:118)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$7.apply(joins.scala:118)
> at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
> at scala.collection.immutable.List.exists(List.scala:84)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:118)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:133)
> at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:131)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(
> 
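
For readers hitting the same EliminateOuterJoin error, one way to express a subtractByKey on Datasets is a left_anti join, which keeps only the left-side rows with no match and so avoids the null-check filter on the outer side that the optimizer trips over. A minimal Java sketch reusing the names from the snippet above (filteredDS, badIds, DeviceData and SomeDataFactory are as defined there); treat it as an illustration, not a verified fix:

import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Keep only rows of filteredDS whose id does NOT appear in badIds (i.e. subtractByKey).
// left_anti returns unmatched left-side rows directly, so no "bid == null" filter is needed.
Dataset<Row> keptRows = filteredDS.join(
        broadcast(badIds),
        filteredDS.col("id").equalTo(badIds.col("bid")),
        "left_anti");

Dataset<DeviceData> result = keptRows.map(
        (MapFunction<Row, DeviceData>) row -> SomeDataFactory.createObjectFromDDRow(row),
        Encoders.bean(DeviceData.class));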

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Koert Kuipers
For RDD the shuffle is already skipped but the sort is not. In spark-sorted
we track partitioning and sorting within partitions for key-value RDDs and
can avoid the sort. See:
https://github.com/tresata/spark-sorted

For Dataset/DataFrame such optimizations are done automatically; however,
they currently do not always kick in for Dataset. See:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-19468
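
For the DataFrame case, one concrete way to get a shuffle-free and usually sort-free sort-merge join is to persist both sides as bucketed, sorted tables on the join key. A rough Java sketch, where the input paths, table names, column name and bucket count are illustrative assumptions:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("sorted-join-sketch").getOrCreate();

Dataset<Row> left = spark.read().parquet("/data/left");    // assumed inputs
Dataset<Row> right = spark.read().parquet("/data/right");

// Write both sides bucketed and sorted by the join key.
left.write().bucketBy(200, "key").sortBy("key").mode(SaveMode.Overwrite).saveAsTable("left_bucketed");
right.write().bucketBy(200, "key").sortBy("key").mode(SaveMode.Overwrite).saveAsTable("right_bucketed");

// With matching bucketing and sort order on both tables, the sort-merge join
// can be planned without an Exchange and, in the common case, without a Sort.
Dataset<Row> joined = spark.table("left_bucketed").join(spark.table("right_bucketed"), "key");
joined.explain();  // verify the physical plan before relying on it

Whether the Sort is actually elided depends on the Spark version and on the typed-Dataset caveats tracked in SPARK-19468, so checking the plan with explain() is worthwhile.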

On Mar 3, 2017 11:06 AM, "Rohit Verma"  wrote:

Sending it to the dev list.
Can you please help by providing some ideas for the question below?

Regards
Rohit
> On Feb 23, 2017, at 3:47 PM, Rohit Verma 
wrote:
>
> Hi
>
> While joining two columns of different datasets, how can the join be optimized
> if both columns are pre-sorted within their datasets,
> so that when Spark does a sort-merge join the sorting phase can be skipped?
>
> Regards
> Rohit


Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Rohit Verma
Sending it to the dev list.
Can you please help by providing some ideas for the question below?

Regards
Rohit
> On Feb 23, 2017, at 3:47 PM, Rohit Verma  wrote:
> 
> Hi
> 
> While joining two columns of different datasets, how can the join be optimized
> if both columns are pre-sorted within their datasets,
> so that when Spark does a sort-merge join the sorting phase can be skipped?
> 
> Regards
> Rohit



Re: How to run a spark on Pycharm

2017-03-03 Thread TheGeorge1918 .
Hey,

It depends on your configuration. I build my Docker image with Spark 2.0
installed; in PyCharm, configure the interpreter to use Docker and add the
following environment variables to your run configuration. You can check the
Dockerfile here: https://github.com/zhangxuan1918/spark2.0

PYSPARK_PYTHON  /usr/bin/python
PYSPARK_DRIVER_PYTHON   /usr/bin/python
PYTHONPATH  
/usr/spark/python:/usr/spark/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH
SPARK_HOME  /usr/spark
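
For a plain local PyCharm setup without Docker, the same environment can also be set from inside the script before creating the SparkSession. A small sketch; the Spark install path and py4j version mirror the listing above and must be adjusted to your machine:

import os
import sys

# Point at a local Spark install (adjust these paths; they are only examples).
os.environ.setdefault("SPARK_HOME", "/usr/spark")
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)

# Make pyspark and its bundled py4j importable without touching PYTHONPATH globally.
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.10.1-src.zip"))

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("pycharm-test").getOrCreate()
print(spark.range(10).count())
spark.stop()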



On 3 Mar 2017, at 16:11, Sidney Feiner  wrote:

Hey,
I once found an article about that:
https://mengdong.github.io/2016/08/08/fully-armed-pyspark-with-ipython-and-
jupyter/

And I once managed to set it up in PyCharm as well. What I had to do was add
/path/to/spark to a system variable called "PYTHONPATH".
Try that one, it might help ☺

*From:* Anahita Talebi [mailto:anahita.t.am...@gmail.com]
*Sent:* Friday, March 3, 2017 5:05 PM
*To:* Pushkar.Gujar 
*Cc:* User 
*Subject:* Re: How to run a spark on Pycharm


Hi,

Thanks for your answer.

Sorry, I am a complete beginner at running code in Spark.
Could you please tell me in a bit more detail how to do that?

I installed IPython and Jupyter Notebook on my local machine. But how can I
run the code using them? Before that, I tried to run the code with PyCharm
and failed.
Thanks,
Anahita

On Fri, Mar 3, 2017 at 3:48 PM, Pushkar.Gujar 
wrote:

Jupyter notebook/ipython can be connected to apache spark


Thank you,
*Pushkar Gujar*


On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi 
wrote:

Hi everyone,

I am trying to run Spark code in PyCharm. I tried to give the path of
Spark as an environment variable in PyCharm's run configuration.
Unfortunately, I get an error. Does anyone know how I can run Spark
code in PyCharm?
It doesn't necessarily have to be PyCharm; if you know of any other software, it
would be nice to tell me.
Thanks a lot,
Anahita


RE: How to run a spark on Pycharm

2017-03-03 Thread Sidney Feiner
Hey,
I once found an article about that:
https://mengdong.github.io/2016/08/08/fully-armed-pyspark-with-ipython-and-jupyter/

And I once managed to set it up in PyCharm as well. What I had to do was add
/path/to/spark to a system variable called "PYTHONPATH".
Try that one, it might help ☺

From: Anahita Talebi [mailto:anahita.t.am...@gmail.com]
Sent: Friday, March 3, 2017 5:05 PM
To: Pushkar.Gujar 
Cc: User 
Subject: Re: How to run a spark on Pycharm

Hi,
Thanks for your answer.
Sorry, I am a complete beginner at running code in Spark.
Could you please tell me in a bit more detail how to do that?
I installed IPython and Jupyter Notebook on my local machine. But how can I run
the code using them? Before that, I tried to run the code with PyCharm and
failed.
Thanks,
Anahita

On Fri, Mar 3, 2017 at 3:48 PM, Pushkar.Gujar 
> wrote:
Jupyter notebook/ipython can be connected to apache spark


Thank you,
Pushkar Gujar


On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi 
> wrote:
Hi everyone,
I am trying to run Spark code in PyCharm. I tried to give the path of Spark
as an environment variable in PyCharm's run configuration. Unfortunately, I get
an error. Does anyone know how I can run Spark code in PyCharm?
It doesn't necessarily have to be PyCharm; if you know of any other software, it
would be nice to tell me.
Thanks a lot,
Anahita





Re: How to run a spark on Pycharm

2017-03-03 Thread Pushkar.Gujar
There are lots of articles available online that guide you through setting up
Jupyter notebooks to run Spark programs. For example:

http://blog.insightdatalabs.com/jupyter-on-apache-spark-step-by-step/
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/spark_ipython.html
https://gist.github.com/tommycarpi/f5a67c66a8f2170e263c




Thank you,
*Pushkar Gujar*


On Fri, Mar 3, 2017 at 10:05 AM, Anahita Talebi 
wrote:

> Hi,
>
> Thanks for your answer.
>
> Sorry, I am a complete beginner at running code in Spark.
>
> Could you please tell me in a bit more detail how to do that?
> I installed IPython and Jupyter Notebook on my local machine. But how can
> I run the code using them? Before that, I tried to run the code with PyCharm
> and failed.
>
> Thanks,
> Anahita
>
> On Fri, Mar 3, 2017 at 3:48 PM, Pushkar.Gujar 
> wrote:
>
>> Jupyter notebook/ipython can be connected to apache spark
>>
>>
>> Thank you,
>> *Pushkar Gujar*
>>
>>
>> On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi > > wrote:
>>
>>> Hi everyone,
>>>
>>> I am trying to run Spark code in PyCharm. I tried to give the path of
>>> Spark as an environment variable in PyCharm's run configuration.
>>> Unfortunately, I get an error. Does anyone know how I can run Spark
>>> code in PyCharm?
>>> It doesn't necessarily have to be PyCharm; if you know of any other software,
>>> it would be nice to tell me.
>>>
>>> Thanks a lot,
>>> Anahita
>>>
>>>
>>>
>>
>


Re: How to run a spark on Pycharm

2017-03-03 Thread Anahita Talebi
Hi,

Thanks for your answer.

Sorry, I am a complete beginner at running code in Spark.

Could you please tell me in a bit more detail how to do that?
I installed IPython and Jupyter Notebook on my local machine. But how can I
run the code using them? Before that, I tried to run the code with PyCharm
and failed.

Thanks,
Anahita

On Fri, Mar 3, 2017 at 3:48 PM, Pushkar.Gujar 
wrote:

> Jupyter notebook/ipython can be connected to apache spark
>
>
> Thank you,
> *Pushkar Gujar*
>
>
> On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi 
> wrote:
>
>> Hi everyone,
>>
>> I am trying to run Spark code in PyCharm. I tried to give the path of
>> Spark as an environment variable in PyCharm's run configuration.
>> Unfortunately, I get an error. Does anyone know how I can run Spark
>> code in PyCharm?
>> It doesn't necessarily have to be PyCharm; if you know of any other software,
>> it would be nice to tell me.
>>
>> Thanks a lot,
>> Anahita
>>
>>
>>
>


Re: How to run a spark on Pycharm

2017-03-03 Thread Pushkar.Gujar
Jupyter notebook/ipython can be connected to apache spark


Thank you,
*Pushkar Gujar*


On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi 
wrote:

> Hi everyone,
>
> I am trying to run Spark code in PyCharm. I tried to give the path of
> Spark as an environment variable in PyCharm's run configuration.
> Unfortunately, I get an error. Does anyone know how I can run Spark
> code in PyCharm?
> It doesn't necessarily have to be PyCharm; if you know of any other software, it
> would be nice to tell me.
>
> Thanks a lot,
> Anahita
>
>
>


How to run a spark on Pycharm

2017-03-03 Thread Anahita Talebi
Hi everyone,

I am trying to run Spark code in PyCharm. I tried to give the path of
Spark as an environment variable in PyCharm's run configuration.
Unfortunately, I get an error. Does anyone know how I can run Spark
code in PyCharm?
It doesn't necessarily have to be PyCharm; if you know of any other software, it
would be nice to tell me.

Thanks a lot,
Anahita


Re: Problems when submitting a spark job via the REST API

2017-03-03 Thread Kristinn Rúnarsson

Hi,

I think I have found what was causing the exception.
"spark.app.name" seems to be required in sparkProperties to successfully
submit the job. At least when I include the app name, my job is successfully
submitted to the Spark cluster.

A silly mistake, but the error message is not helping much :)

Best,
Kristinn R.
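
For completeness, the original request with spark.app.name added to sparkProperties (an arbitrary name here, and appArgs given as strings) would look roughly like this:

curl -X POST http://kristinn:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
  "action": "CreateSubmissionRequest",
  "appArgs": [ "100" ],
  "appResource": "/opt/spark-2.1.0/examples/jars/spark-examples_2.11-2.1.0.jar",
  "clientSparkVersion": "2.1.0",
  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
  "mainClass": "org.apache.spark.examples.SparkPi",
  "sparkProperties": {
    "spark.app.name": "SparkPi",
    "spark.master": "spark://kristinn:6066",
    "spark.submit.deployMode": "cluster",
    "spark.jars": "/opt/spark-2.1.0/examples/jars/spark-examples_2.11-2.1.0.jar"
  }
}'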

On Fri, Mar 3, 2017 at 1:22 PM, Kristinn Rúnarsson 
 wrote:

Hi,

I am trying to submit Spark jobs via the "hidden" REST
API (http://spark-cluster-ip:6066/v1/submissions/..), but I am getting an
ErrorResponse and I can't find what I am doing wrong.


I have been following the instructions from this blog post: 
http://arturmkrtchyan.com/apache-spark-hidden-rest-api


When I try to send a CreateSubmissionRequest action I get the 
following ErrorResponse:

{
  "action" : "ErrorResponse",
  "message" : "Malformed request: 
org.apache.spark.deploy.rest.SubmitRestProtocolException: Validation 
of message CreateSubmissionRequest 
failed!\n\torg.apache.spark.deploy.rest.SubmitRestProtocolMessage.validate(SubmitRestProtocolMessage.scala:70)\n\torg.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:272)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:707)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\torg.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)\n\torg.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)\n\torg.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)\n\torg.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)\n\torg.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)\n\torg.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\torg.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)\n\torg.spark_project.jetty.server.Server.handle(Server.java:499)\n\torg.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)\n\torg.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)\n\torg.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)\n\tjava.lang.Thread.run(Thread.java:745)",
  "serverSparkVersion" : "2.1.0"
}

This is what my request looks like:

curl -X POST http://kristinn:6066/v1/submissions/create --header 
"Content-Type:application/json;charset=UTF-8" --data '{

  "action": "CreateSubmissionRequest",
  "appArgs": [
100
  ],
  "appResource": 
"/opt/spark-2.1.0/examples/jars/spark-examples_2.11-2.1.0.jar",

  "clientSparkVersion": "2.1.0",
  "environmentVariables": {
"SPARK_ENV_LOADED": "1"
  },
  "mainClass": "org.apache.spark.examples.SparkPi",
  "sparkProperties": {
"spark.master": "spark://kristinn:6066",
"spark.submit.deployMode": "cluster",
"spark.jars": 
"/opt/spark-2.1.0/examples/jars/spark-examples_2.11-2.1.0.jar"

  }
}'

I cannot see what is causing this by looking at the source code.

Is something wrong in my request, or does anyone have a solution for this
issue?


Best,
Kristinn Rúnarrson.






Re: Server Log Processing - Regex or ElasticSearch?

2017-03-03 Thread veera satya nv Dantuluri
Gaurav,

I would suggest Elasticsearch.


> On Mar 3, 2017, at 3:27 AM, Gaurav1809  wrote:
> 
> Hello All,
> One small question, if you can help me out. I am working on server log
> processing in Spark for my organization. I am using regular expressions
> (regex) for pattern matching and then do further analysis on the identified
> pieces: IP, username, date, etc.
> Is this a good approach?
> Should I go for Elasticsearch for pattern matching?
> What would be your take on this?
> Your inputs would help me identify a robust and effective approach
> to this activity. Have a nice day ahead.
> 
> Thanks.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Server-Log-Processing-Regex-or-ElasticSearch-tp28452.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Problems when submitting a spark job via the REST API

2017-03-03 Thread Kristinn Rúnarsson
Hi,

I am trying to submit Spark jobs via the "hidden" REST API
(http://spark-cluster-ip:6066/v1/submissions/..), but I am getting an
ErrorResponse and I can't find what I am doing wrong.

I have been following the instructions from this blog post:
http://arturmkrtchyan.com/apache-spark-hidden-rest-api

When I try to send a CreateSubmissionRequest action I get the following
ErrorResponse:
{
  "action" : "ErrorResponse",
  "message" : "Malformed request:
org.apache.spark.deploy.rest.SubmitRestProtocolException: Validation of
message CreateSubmissionRequest
failed!\n\torg.apache.spark.deploy.rest.SubmitRestProtocolMessage.validate(SubmitRestProtocolMessage.scala:70)\n\torg.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:272)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:707)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\torg.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)\n\torg.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)\n\torg.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)\n\torg.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)\n\torg.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)\n\torg.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\torg.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)\n\torg.spark_project.jetty.server.Server.handle(Server.java:499)\n\torg.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)\n\torg.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)\n\torg.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)\n\tjava.lang.Thread.run(Thread.java:745)",
  "serverSparkVersion" : "2.1.0"
}

This is what my request looks like:

curl -X POST http://kristinn:6066/v1/submissions/create --header
"Content-Type:application/json;charset=UTF-8" --data '{
  "action": "CreateSubmissionRequest",
  "appArgs": [
100
  ],
  "appResource":
"/opt/spark-2.1.0/examples/jars/spark-examples_2.11-2.1.0.jar",
  "clientSparkVersion": "2.1.0",
  "environmentVariables": {
"SPARK_ENV_LOADED": "1"
  },
  "mainClass": "org.apache.spark.examples.SparkPi",
  "sparkProperties": {
"spark.master": "spark://kristinn:6066",
"spark.submit.deployMode": "cluster",
"spark.jars":
"/opt/spark-2.1.0/examples/jars/spark-examples_2.11-2.1.0.jar"
  }
}'

I cannot see what is causing this by looking at the source code.

Is something wrong in my request, or does anyone have a solution for this
issue?

Best,
Kristinn Rúnarrson.


Resource manager: estimation of application execution time/remaining time.

2017-03-03 Thread Mazen
Dear all,

For a particular Spark extension, I would like to know whether it would be
possible for the resource manager (e.g. the standalone cluster manager) to know
or estimate the total execution time of a submitted application, or the
remaining execution time of such an application.

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Resource-manager-estimation-of-application-execution-time-remaining-time-tp28453.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Mich Talebzadeh
Thanks all. How about Kafka HA, which is important? Is it best to use
application-specific Kafka delivery or Kafka MirrorMaker?

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 March 2017 at 10:22, Mich Talebzadeh  wrote:

>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> Forwarded conversation
> Subject: kafka and zookeeper set up in prod for spark streaming
> 
>
> From: Mich Talebzadeh 
> Date: 3 March 2017 at 08:15
> To: "user @spark" 
>
>
>
> hi,
>
> In DEV, Kafka and ZooKeeper services can be co-located on the same
> physical hosts.
>
> In Prod moving forward do we need to set up Zookeeper on its own cluster
> not sharing with Hadoop cluster? Can these services be shared within the
> Hadoop cluster?
>
> How best to set up Zookeeper that is needed for Kafka for use with Spark
> Streaming?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> --
> From: Jörn Franke 
> Date: 3 March 2017 at 08:29
> To: Mich Talebzadeh 
> Cc: "user @spark" 
>
>
> I think this highly depends on the risk that you want to be exposed to. If
> you have it on dedicated nodes there is less influence of other processes.
>
> I have seen both: on Hadoop nodes or dedicated. On Hadoop I would not
> recommend to put it on data nodes/heavily utilized nodes.
>
> Zookeeper does not need many resources (if you do not abuse it) and you
> may think about putting it on a dedicated small infrastructure of several
> nodes.
>
> --
> From: vincent gromakowski 
> Date: 3 March 2017 at 08:29
> To: Mich Talebzadeh 
> Cc: "user @spark" 
>
>
> Hi,
> Depending on the Kafka version (< 0.8.2, I think), offsets are managed in
> ZooKeeper, and if you have lots of consumers it's recommended to use a
> dedicated ZooKeeper cluster (always with dedicated disks; even SSD is better).
> On newer versions, offsets are managed in special Kafka topics and ZooKeeper
> is only used to store metadata, so you can share it with Hadoop. Maybe you can
> reach a limit depending on the size of your Kafka, the number of topics,
> producers/consumers... but I have never heard of one yet. Another point is to
> be careful about security on ZooKeeper: sharing a cluster means you get the
> same security level (authentication or not).
>
> --
> From: vincent gromakowski 
> Date: 3 March 2017 at 08:31
> To: Jörn Franke 
> Cc: Mich Talebzadeh , "user @spark" <
> user@spark.apache.org>
>
>
> I forgot to mention that it also depends on the Spark Kafka connector you use.
> If it's receiver-based, I recommend a dedicated ZooKeeper cluster because
> it is used to store offsets. If it's receiver-less, ZooKeeper can be shared.
>
>
>


Re: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-03-03 Thread Noorul Islam K M

> When the initial job has not accepted any resources, what all can be
> wrong? Going through Stack Overflow and various blogs does not help. Maybe
> we need better logging for this? Adding dev.
>

Did you take a look at the spark UI to see your resource availability?

Thanks and Regards
Noorul
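
For reference, that message usually means the application is asking for more memory or cores than any worker currently has free. A minimal Java sketch of the knobs to compare against what the master UI reports (the values and app name are placeholders, not recommendations):

import org.apache.spark.sql.SparkSession;

// Keep requested resources within what the workers advertise on the master UI
// (http://<master-host>:8080); requesting more memory or cores than any single
// worker offers leaves the job stuck with "Initial job has not accepted any resources".
SparkSession spark = SparkSession.builder()
        .appName("resource-check")                 // placeholder name
        .config("spark.executor.memory", "2g")     // <= free memory per worker
        .config("spark.executor.cores", "2")       // <= free cores per worker
        .config("spark.cores.max", "4")            // total cores for this app (standalone mode)
        .getOrCreate();

System.out.println("Running Spark " + spark.version());
spark.stop();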

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread vincent gromakowski
I forgot to mention that it also depends on the Spark Kafka connector you use.
If it's receiver-based, I recommend a dedicated ZooKeeper cluster because it
is used to store offsets. If it's receiver-less, ZooKeeper can be shared.
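
To make the receiver-less case concrete: with the direct (spark-streaming-kafka-0-10) connector, offsets are tracked by Kafka/Spark rather than ZooKeeper, so a ZooKeeper ensemble shared with Hadoop is usually acceptable. A minimal Java sketch, with broker addresses, group id and topic name as assumptions:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

SparkConf conf = new SparkConf().setAppName("kafka-direct-sketch");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "broker1:9092,broker2:9092");  // assumed brokers
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "spark-streaming-example");             // assumed group id
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);

// Direct stream: no receiver, offsets handled by the Kafka consumer / Spark checkpoint.
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(
            Collections.singletonList("events"), kafkaParams));      // assumed topic

stream.foreachRDD(rdd -> System.out.println("batch size: " + rdd.count()));

jssc.start();
jssc.awaitTermination();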

2017-03-03 9:29 GMT+01:00 Jörn Franke :

> I think this highly depends on the risk that you want to be exposed to. If
> you have it on dedicated nodes there is less influence of other processes.
>
> I have seen both: on Hadoop nodes or dedicated. On Hadoop I would not
> recommend to put it on data nodes/heavily utilized nodes.
>
> Zookeeper does not need many resources (if you do not abuse it) and you
> may think about putting it on a dedicated small infrastructure of several
> nodes.
>
> On 3 Mar 2017, at 09:15, Mich Talebzadeh 
> wrote:
>
>
> hi,
>
> In DEV, Kafka and ZooKeeper services can be co-located on the same
> physical hosts.
>
> In Prod moving forward do we need to set up Zookeeper on its own cluster
> not sharing with Hadoop cluster? Can these services be shared within the
> Hadoop cluster?
>
> How best to set up Zookeeper that is needed for Kafka for use with Spark
> Streaming?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread vincent gromakowski
Hi,
Depending on the Kafka version (< 0.8.2, I think), offsets are managed in
ZooKeeper, and if you have lots of consumers it's recommended to use a
dedicated ZooKeeper cluster (always with dedicated disks; even SSD is better).
On newer versions, offsets are managed in special Kafka topics and ZooKeeper
is only used to store metadata, so you can share it with Hadoop. Maybe you can
reach a limit depending on the size of your Kafka, the number of topics,
producers/consumers... but I have never heard of one yet. Another point is to
be careful about security on ZooKeeper: sharing a cluster means you get the
same security level (authentication or not).

2017-03-03 9:15 GMT+01:00 Mich Talebzadeh :

>
> hi,
>
> In DEV, Kafka and ZooKeeper services can be co-located on the same
> physical hosts.
>
> In Prod moving forward do we need to set up Zookeeper on its own cluster
> not sharing with Hadoop cluster? Can these services be shared within the
> Hadoop cluster?
>
> How best to set up Zookeeper that is needed for Kafka for use with Spark
> Streaming?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Jörn Franke
I think this highly depends on the risk that you want to be exposed to. If you 
have it on dedicated nodes there is less influence of other processes.

I have seen both: on Hadoop nodes or dedicated. On Hadoop I would not recommend 
to put it on data nodes/heavily utilized nodes.

Zookeeper does not need many resources (if you do not abuse it) and you may 
think about putting it on a dedicated small infrastructure of several nodes.

> On 3 Mar 2017, at 09:15, Mich Talebzadeh  wrote:
> 
> 
> hi,
> 
> In DEV, Kafka and ZooKeeper services can be co-located on the same physical
> hosts.
> 
> In Prod moving forward do we need to set up Zookeeper on its own cluster not 
> sharing with Hadoop cluster? Can these services be shared within the Hadoop 
> cluster?
> 
> How best to set up Zookeeper that is needed for Kafka for use with Spark 
> Streaming?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Server Log Processing - Regex or ElasticSearch?

2017-03-03 Thread Gaurav1809
Hello All,
One small question, if you can help me out. I am working on server log
processing in Spark for my organization. I am using regular expressions
(regex) for pattern matching and then do further analysis on the identified
pieces: IP, username, date, etc.
Is this a good approach?
Should I go for Elasticsearch for pattern matching?
What would be your take on this?
Your inputs would help me identify a robust and effective approach
to this activity. Have a nice day ahead.

Thanks.
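
For what it's worth, the regex approach usually comes down to one pattern per log format, compiled once and applied per line. A small Java sketch for the Apache common log format; the pattern, sample line and field positions are assumptions about the actual logs:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Apache common log format: ip, identity, user, timestamp, request, status, size.
Pattern LOG = Pattern.compile(
    "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 2326";
Matcher m = LOG.matcher(line);
if (m.find()) {
    String ip = m.group(1);
    String user = m.group(3);
    String date = m.group(4);
    int status = Integer.parseInt(m.group(6));
    // ...feed the parsed fields into a Dataset/DataFrame for further analysis
}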



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Server-Log-Processing-Regex-or-ElasticSearch-tp28452.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Mich Talebzadeh
hi,

In DEV, Kafka and ZooKeeper services can be co-located on the same
physical hosts.

In Prod, moving forward, do we need to set up ZooKeeper on its own cluster,
not shared with the Hadoop cluster? Or can these services be shared within
the Hadoop cluster?

How best to set up Zookeeper that is needed for Kafka for use with Spark
Streaming?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.