Re: jdbc spark streaming

2018-12-28 Thread Thakrar, Jayesh
Yes, you can certainly use Spark streaming, but reading from the original 
source table may still be time-consuming and resource-intensive.

Having some context on the RDBMS platform, the data sizes/volumes involved, and 
the tolerable lag (between changes being created and them being processed by 
Spark) will help people give you better recommendations/best practices.

All the same, one approach is to create triggers on the source table and insert 
data into a different table and then read from there.
Another approach is to push the delta data into something like Kafka and then 
use Spark streaming against that.
Taking that Kafka approach further, you can capture the delta upstream so that 
the processing that pushes it into the RDBMS can also push it to Kafka directly.
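
For reference, here is a minimal sketch of that last approach, assuming Spark 
2.4+ (for foreachBatch) with the spark-sql-kafka-0-10 package on the classpath; 
the broker address, topic name and output path are placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("delta-consumer").getOrCreate()

// read the change stream that the trigger/upstream process pushes into Kafka
val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "table_changes")
  .load()
  .selectExpr("CAST(value AS STRING) AS payload")

// run the existing batch logic per micro-batch, paying the SparkContext
// startup cost only once
changes.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write.mode("append").parquet("/data/deltas")
  }
  .start()
  .awaitTermination()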

On 12/27/18, 4:52 PM, "Nicolas Paris"  wrote:

Hi

I have this living RDBMS and I'd like to apply a spark job on several
tables once new data gets in.

I could run batch spark jobs through cron jobs every minute. But the
job takes time and resources to start (sparkcontext, yarn).

I wonder if I could run one instance of a spark streaming job to save
those resources. However, I haven't seen anything about a structured
streaming jdbc source in the documentation.

Any recommendations?


-- 
nicolas







Re: Custom Metric Sink on Executor Always ClassNotFound

2018-12-21 Thread Thakrar, Jayesh
Just curious - is this HttpSink your own custom sink or Dropwizard 
configuration?

If it is your own custom code, I would suggest looking at/trying out Dropwizard.
See 
http://spark.apache.org/docs/latest/monitoring.html#metrics
https://metrics.dropwizard.io/4.0.0/

Also, from what I know, the metrics from the tasks/executors are sent as 
accumulator values to the driver, and the driver makes them available to the 
desired sink.

Furthermore, even without a custom HttpSink, there's already a built-in REST API 
available that provides metrics.
See http://spark.apache.org/docs/latest/monitoring.html#rest-api

While you can surely create your own custom sink (code), I would say try out 
custom configuration first, as it will make Spark upgrades easier.
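
For example, a sketch of conf/metrics.properties wiring up one of the built-in 
Dropwizard sinks (the Graphite host/port are placeholders):

# send all metrics to a Graphite endpoint every 10 seconds
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
# also expose JVM metrics from the driver and executors
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource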

On 12/20/18, 3:53 PM, "Marcelo Vanzin"  wrote:

First, it's really weird to use "org.apache.spark" for a class that is
not in Spark.

For executors, the jar file of the sink needs to be in the system
classpath; the application jar is not in the system classpath, so that
does not work. There are different ways for you to get it there, most
of them manual (YARN is, I think, the only RM supported in Spark where
the application itself can do it).
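
As a follow-up note: one commonly cited workaround on YARN (a sketch, untested 
here; the jar and class names are hypothetical) is to ship the sink jar 
separately and prepend it to the executor classpath:

spark-submit \
  --master yarn \
  --jars /local/path/metrics-httpsink.jar \
  --conf spark.executor.extraClassPath=metrics-httpsink.jar \
  --class your.app.Main your-app.jar

On YARN, files passed via --jars are localized into each container's working 
directory, so the bare file name on spark.executor.extraClassPath can resolve 
there before the metrics system starts.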

On Thu, Dec 20, 2018 at 1:48 PM prosp4300  wrote:
>
> Hi, Spark Users
>
> I'm playing with spark metric monitoring, and want to add a custom sink, 
HttpSink, that sends the metrics through a RESTful API.
> A subclass of Sink, "org.apache.spark.metrics.sink.HttpSink", is created 
and packaged within the application jar.
>
> It works for the driver instance, but once enabled for an executor instance, 
the following ClassNotFoundException is thrown. This seems to be because the 
MetricsSystem is started very early for executors, before the application jar 
is loaded.
>
> I wonder whether there is any way or best practice to add a custom sink for 
executor instances?
>
> 18/12/21 04:58:32 ERROR MetricsSystem: Sink class 
org.apache.spark.metrics.sink.HttpSink cannot be instantiated
> 18/12/21 04:58:32 WARN UserGroupInformation: PriviledgedActionException 
as:yarn (auth:SIMPLE) cause:java.lang.ClassNotFoundException: 
org.apache.spark.metrics.sink.HttpSink
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1933)
> at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
> at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
> at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.metrics.sink.HttpSink
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
> at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
> at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
> at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> at 
org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
> at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
> at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:223)
> at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
> at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
> ... 4 more

Re: How to track batch jobs in spark ?

2018-12-06 Thread Thakrar, Jayesh
See if https://spark.apache.org/docs/latest/monitoring.html helps.

Essentially, whether you are running an app as spark-shell or via spark-submit 
(local, Spark standalone, YARN, Kubernetes, Mesos), the driver will provide a UI 
on port 4040.

You can monitor via the UI and via a REST API.

E.g. running a job locally on your laptop, you can run something like this
http://127.0.0.1:4040/api/v1/applications

To see the “jobs”, you can use something like this
(local-1544110095543 is just the id of my spark-shell on my laptop which I got 
from the command above)

http://127.0.0.1:4040/api/v1/applications/local-1544110095543/jobs

If you are only looking for job completion, you can just monitor whether there 
is still a listener on the port.
Once the job completes/fails, the driver (and hence the listener) will exit, 
which means the job is complete.

As far as giving you the percentage complete for the application, there is no 
such thing, because unlike MapReduce, a Spark app is not a single step/job.
Using the REST API, you can see which job/stage is running and determine what 
percentage of your job is complete.
Even when getting the stage info, you only get the number of "tasks" complete 
vs. a percentage complete.
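
For instance, a rough Scala sketch of polling that endpoint programmatically 
(the host, port and application id are placeholders; use any JSON library to 
pull out the numCompletedTasks/numTasks fields per job):

import scala.io.Source

val base = "http://127.0.0.1:4040/api/v1"
// list of applications known to this driver, as JSON
val apps = Source.fromURL(s"$base/applications").mkString
// per-job detail, including task counts for a rough "% complete"
val jobs = Source.fromURL(s"$base/applications/local-1544110095543/jobs").mkString
println(jobs)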

From: kant kodali 
Date: Thursday, December 6, 2018 at 4:40 AM
To: Mark Hamstra 
Cc: , "user @spark" 
Subject: Re: How to track batch jobs in spark ?

Thanks for all responses.

1) I am not using YARN. I am using Spark Standalone.
2) yes I want to be able to kill the whole Application.
3) I want to be able to monitor the status of the Application which is running 
a batch query and expected to run for an hour or so, therefore, I am looking 
for some mechanism where I can monitor the progress like a percentage or 
something.

Thanks!


On Wed, Dec 5, 2018 at 3:12 PM Mark Hamstra 
mailto:m...@clearstorydata.com>> wrote:
That will kill an entire Spark application, not a batch Job.

On Wed, Dec 5, 2018 at 3:07 PM Priya Matpadi 
mailto:pmatp...@gmail.com>> wrote:
if you are deploying your spark application on YARN cluster,
1. ssh into master node
2. List the currently running applications and retrieve the application_id
yarn application --list
3. Kill the application using the application_id of the form application_x_ 
from the output of the list command
yarn application --kill <application_id>

On Wed, Dec 5, 2018 at 1:42 PM kant kodali 
mailto:kanth...@gmail.com>> wrote:
Hi All,

How to track batch jobs in spark? For example, is there some id or token i can 
get after I spawn a batch job and use it to track the progress or to kill the 
batch job itself?

For Streaming, we have StreamingQuery.id()

Thanks!


Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
So 164 GB of parquet data - which, since the data is compressed, can potentially 
explode to up to 1000 GB (in practice it would be more like 400-600 GB).
Your executors have about 96 GB of data.
With that kind of volume, 100-300 executors is OK (I would do tests with 
100-300), but 30k shuffle partitions are still excessive.

From: Vitaliy Pisarev 
Date: Thursday, November 15, 2018 at 1:58 PM
To: "Thakrar, Jayesh" 
Cc: Shahbaz , user , David 
Markovitz 
Subject: Re: How to address seemingly low core utilization on a spark workload?

Small update, my initial estimate was incorrect. I have one location with 16*4G 
= 64G parquet files (in snappy) + 20 * 5G = 100G parquet files. So a total of 164G.

I am running on Databricks.
Here are some settings:

spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=256m 
-XX:+UseCodeCacheFlushing -Ddatabricks.serviceName=spark-executor-1 
-javaagent:/databricks/DatabricksAgent.jar -XX:+PrintFlagsFinal 
-XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -Xss4m 
-Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl
 
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
 
-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
 
-Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory
 -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser 
-Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl
spark.executor.memory=107407m
spark.executor.tempDirectory=/local_disk0/tmp

These are the only relevant settings that I see set when looking at the logs. I 
am guessing this means that the others are simply left at their defaults.
Are there any settings I should pay special attention to? (A reference is also 
good.)

My assumption is that the Databricks runtime is already preconfigured with known 
best practices (like cores per executor...). Now that I think of it, I need to 
validate this assumption.

On Thu, Nov 15, 2018 at 9:14 PM Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
While there is some merit to that thought process, I would steer away from 
premature JVM GC optimization of this kind.
What are the memory, cpu and other settings (e.g. any JVM/GC settings) for the 
executors and driver?
So assuming that you are reading about 16 files of say 2-4 GB each, that’s 
about 32-64 GB of (compressed?) data in parquet files.
Do you have access to the Spark UI – what is the peak memory that you see for 
the executors?
The UI will also give you the time spent on GC by each executor.
So even if you completely eliminated all GC, that’s the max time you can 
potentially save.


From: Vitaliy Pisarev 
mailto:vitaliy.pisa...@biocatch.com>>
Date: Thursday, November 15, 2018 at 1:03 PM
To: Shahbaz mailto:shahzadh...@gmail.com>>
Cc: "Thakrar, Jayesh" 
mailto:jthak...@conversantmedia.com>>, user 
mailto:user@spark.apache.org>>, 
"dudu.markov...@microsoft.com<mailto:dudu.markov...@microsoft.com>" 
mailto:dudu.markov...@microsoft.com>>
Subject: Re: How to address seemingly low core utilization on a spark workload?

Agree, and I will try it. One clarification though: the number of partitions 
also affects their in-memory size. So fewer partitions may result in higher 
memory pressure and OOMs. I think this was the original intention.

So the motivation for partitioning is also to break down volumes to fit the 
machines.

Is this premise wrong?

On Thu, Nov 15, 2018, 19:49 Shahbaz 
mailto:shahzadh...@gmail.com> wrote:
30k SQL shuffle partitions is extremely high. Core to partition is 1 to 1; the 
default value of SQL shuffle partitions is 200. Set it to 300 or leave it at the 
default, see which one gives the best performance, and after you do that, see 
how the cores are being used.

Regards,
Shahbaz

On Thu, Nov 15, 2018 at 10:58 PM Vitaliy Pisarev 
mailto:vitaliy.pisa...@biocatch.com>> wrote:
Oh, regarding shuffle.partitions being 30k: I don't know. I inherited the 
workload from an engineer who is no longer around, and am trying to make sense 
of things in general.

On Thu, Nov 15, 2018 at 7:26 PM Vitaliy Pisarev 
mailto:vitaliy.pisa...@biocatch.com>> wrote:
The quest is dual:


  *   Increase utilisation - because cores cost money and I want to make sure 
that I fully utilise what I pay for. This is very blunt, of course, because 
there is always I/O and at least some degree of skew. Bottom line: do the 
same work over the same time but with fewer (but better utilised) resources.
  *   Reduce runtime by increasing parallelism.
While not the same, I am looking at these as two sides of the same coin.





On Thu, Nov 15, 2018 at 6:58 PM Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
For that little data, I find spark.sql.shuffle.partitions = 30,000 to be very high.

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
While there is some merit to that thought process, I would steer away from 
premature JVM GC optimization of this kind.
What are the memory, cpu and other settings (e.g. any JVM/GC settings) for the 
executors and driver?
So assuming that you are reading about 16 files of say 2-4 GB each, that’s 
about 32-64 GB of (compressed?) data in parquet files.
Do you have access to the Spark UI – what is the peak memory that you see for 
the executors?
The UI will also give you the time spent on GC by each executor.
So even if you completely eliminated all GC, that’s the max time you can 
potentially save.


From: Vitaliy Pisarev 
Date: Thursday, November 15, 2018 at 1:03 PM
To: Shahbaz 
Cc: "Thakrar, Jayesh" , user 
, "dudu.markov...@microsoft.com" 

Subject: Re: How to address seemingly low core utilization on a spark workload?

Agree, and I will try it. One clarification though: the number of partitions 
also affects their in-memory size. So fewer partitions may result in higher 
memory pressure and OOMs. I think this was the original intention.

So the motivation for partitioning is also to break down volumes to fit the 
machines.

Is this premise wrong?

On Thu, Nov 15, 2018, 19:49 Shahbaz 
mailto:shahzadh...@gmail.com> wrote:
30k SQL shuffle partitions is extremely high. Core to partition is 1 to 1; the 
default value of SQL shuffle partitions is 200. Set it to 300 or leave it at the 
default, see which one gives the best performance, and after you do that, see 
how the cores are being used.

Regards,
Shahbaz

On Thu, Nov 15, 2018 at 10:58 PM Vitaliy Pisarev 
mailto:vitaliy.pisa...@biocatch.com>> wrote:
Oh, regarding shuffle.partitions being 30k: I don't know. I inherited the 
workload from an engineer who is no longer around, and am trying to make sense 
of things in general.

On Thu, Nov 15, 2018 at 7:26 PM Vitaliy Pisarev 
mailto:vitaliy.pisa...@biocatch.com>> wrote:
The quest is dual:


  *   Increase utilisation - because cores cost money and I want to make sure 
that I fully utilise what I pay for. This is very blunt, of course, because 
there is always I/O and at least some degree of skew. Bottom line: do the 
same work over the same time but with fewer (but better utilised) resources.
  *   Reduce runtime by increasing parallelism.
While not the same, I am looking at these as two sides of the same coin.





On Thu, Nov 15, 2018 at 6:58 PM Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
For that little data, I find spark.sql.shuffle.partitions = 30,000 to be very 
high.
Any reason for that high value?

Do you have a baseline observation with the default value?

Also, enabling the jobgroup and job info through the API and observing through 
the UI will help you understand the code snippets when you have low utilization.

Finally, high utilization does not equate to high efficiency.
It's very likely that for your workload, you may only need 16-128 executors.
I would suggest getting the partition count for the various 
datasets/dataframes/rdds in your code by using

dataset.rdd.getNumPartitions

I would also suggest doing a number of tests with different number of executors 
too.

But coming back to the objective behind your quest - are you trying to maximize 
utilization, hoping that high parallelism will reduce your total runtime?


From: Vitaliy Pisarev 
mailto:vitaliy.pisa...@biocatch.com>>
Date: Thursday, November 15, 2018 at 10:07 AM
To: mailto:jthak...@conversantmedia.com>>
Cc: user mailto:user@spark.apache.org>>, David Markovitz 
mailto:dudu.markov...@microsoft.com>>
Subject: Re: How to address seemingly low core utilization on a spark workload?

I am working with parquet files, and the metadata reading there is quite fast as 
there are at most 16 files (a couple of gigs each).

I find it very hard to answer the question "how many partitions do you have?", 
as many spark operations do not preserve partitioning and I have a lot of 
filtering and grouping going on.
What I can say is that I specified spark.sql.shuffle.partitions to 30,000.

I am not worried that there are not enough partitions to keep the cores 
working. Having said that I do see that the high utilisation correlates heavily 
with shuffle read/write. Whereas low utilisation correlates with no shuffling.
This leads me to the conclusion that compared to the amount of shuffling, the 
cluster is doing very little work.

Question is what can I do about it.

On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
Can you shed more light on what kind of processing you are doing?

One common pattern that I have seen for active core/executor utilization 
dropping to zero is while reading ORC data and the driver seems (I think) to be 
doing schema validation.
In my case I would have hundreds of thousands of ORC data files and there is 
dead silence for about 1-2 hours.
I have tried providing a schema and disabling schema validation while reading 
the ORC data, but that does not seem to help (Spark 2.2.1).

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
For that little data, I find spark.sql.shuffle.partitions = 30,000 to be very 
high.
Any reason for that high value?

Do you have a baseline observation with the default value?

Also, enabling the jobgroup and job info through the API and observing through 
the UI will help you understand the code snippets when you have low utilization.

Finally, high utilization does not equate to high efficiency.
It's very likely that for your workload, you may only need 16-128 executors.
I would suggest getting the partition count for the various 
datasets/dataframes/rdds in your code by using

dataset.rdd.getNumPartitions

I would also suggest doing a number of tests with different number of executors 
too.

But coming back to the objective behind your quest - are you trying to maximize 
utilization, hoping that high parallelism will reduce your total runtime?
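
A quick sketch of those checks (the input path is a placeholder):

val df = spark.read.parquet("/path/to/data")
println(s"input partitions = ${df.rdd.getNumPartitions}")

// establish a baseline with the default before tuning upward
spark.conf.set("spark.sql.shuffle.partitions", "200")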


From: Vitaliy Pisarev 
Date: Thursday, November 15, 2018 at 10:07 AM
To: 
Cc: user , David Markovitz 
Subject: Re: How to address seemingly low core utilization on a spark workload?

I am working with parquet files, and the metadata reading there is quite fast as 
there are at most 16 files (a couple of gigs each).

I find it very hard to answer the question "how many partitions do you have?", 
as many spark operations do not preserve partitioning and I have a lot of 
filtering and grouping going on.
What I can say is that I specified spark.sql.shuffle.partitions to 30,000.

I am not worried that there are not enough partitions to keep the cores 
working. Having said that I do see that the high utilisation correlates heavily 
with shuffle read/write. Whereas low utilisation correlates with no shuffling.
This leads me to the conclusion that compared to the amount of shuffling, the 
cluster is doing very little work.

Question is what can I do about it.

On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
Can you shed more light on what kind of processing you are doing?

One common pattern that I have seen for active core/executor utilization 
dropping to zero is while reading ORC data and the driver seems (I think) to be 
doing schema validation.
In my case I would have hundreds of thousands of ORC data files and there is 
dead silence for about 1-2 hours.
I have tried providing a schema and disabling schema validation while reading 
the ORC data, but that does not seem to help (Spark 2.2.1).

And as you know, in most cases, there is a linear relationship between number 
of partitions in your data and the concurrently active executors.

Another thing I would suggest is to use the following two API calls/methods - 
they will annotate the spark stages and jobs in the Spark UI with what is being 
executed.
SparkContext.setJobGroup(….)
SparkContext.setJobDescription(….)

From: Vitaliy Pisarev 
mailto:vitaliy.pisa...@biocatch.com>>
Date: Thursday, November 15, 2018 at 8:51 AM
To: user mailto:user@spark.apache.org>>
Cc: David Markovitz 
mailto:dudu.markov...@microsoft.com>>
Subject: How to address seemingly low core utilization on a spark workload?

I have a workload that runs on a cluster of 300 cores.
Below is a plot of the amount of active tasks over time during the execution of 
this workload:

[image: plot of active task count over time]

What I deduce is that there are substantial intervals where the cores are 
heavily under-utilised.

What actions can I take to:

  *   Increase the efficiency (== core utilisation) of the cluster?
  *   Understand the root causes behind the drops in core utilisation?


Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
Can you shed more light on what kind of processing you are doing?

One common pattern that I have seen for active core/executor utilization 
dropping to zero is while reading ORC data and the driver seems (I think) to be 
doing schema validation.
In my case I would have hundreds of thousands of ORC data files and there is 
dead silence for about 1-2 hours.
I have tried providing a schema and disabling schema validation while reading 
the ORC data, but that does not seem to help (Spark 2.2.1).

And as you know, in most cases, there is a linear relationship between number 
of partitions in your data and the concurrently active executors.

Another thing I would suggest is to use the following two API calls/methods - 
they will annotate the spark stages and jobs in the Spark UI with what is being 
executed.
SparkContext.setJobGroup(….)
SparkContext.setJobDescription(….)
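
A small sketch of how that looks in practice (the group id, description and 
dataframe name are placeholders):

sc.setJobGroup("join-phase", "join tables A and B")
sc.setJobDescription("count after join")
val n = joinedDf.count()   // this job now shows the description in the Spark UI
sc.clearJobGroup()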

From: Vitaliy Pisarev 
Date: Thursday, November 15, 2018 at 8:51 AM
To: user 
Cc: David Markovitz 
Subject: How to address seemingly low core utilization on a spark workload?

I have a workload that runs on a cluster of 300 cores.
Below is a plot of the amount of active tasks over time during the execution of 
this workload:

[image: plot of active task count over time]

What I deduce is that there are substantial intervals where the cores are 
heavily under-utilised.

What actions can I take to:

  *   Increase the efficiency (== core utilisation) of the cluster?
  *   Understand the root causes behind the drops in core utilisation?


Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-28 Thread Thakrar, Jayesh
Not sure I get what you mean….

I ran the query that you had – and don’t get the same hash as you.


From: Gokula Krishnan D 
Date: Friday, September 28, 2018 at 10:40 AM
To: "Thakrar, Jayesh" 
Cc: user 
Subject: Re: [Spark SQL] why spark sql hash() are returns the same hash value 
though the keys/expr are not same

Hello Jayesh,

I have masked the input values with .


Thanks & Regards,
Gokula Krishnan (Gokul)


On Wed, Sep 26, 2018 at 2:20 PM Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
Cannot reproduce your situation.
Can you share Spark version?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select hash('40514X'),hash('41751')").show()
+------------+-----------+
|hash(40514X)|hash(41751)|
+------------+-----------+
| -1898845883|  916273350|
+------------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()
+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|-1593820563|
+-----------+-----------+


scala>
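
As an aside: hash() is a 32-bit hash (Murmur3), so with enough distinct keys 
some collisions are unavoidable. If the goal is change detection with a much 
lower collision probability, a longer digest is one option - a sketch (the 
column values are placeholders):

scala> spark.sql("select sha2(concat_ws('|', '40514X', '41751'), 256) as row_digest").show(false)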

From: Gokula Krishnan D mailto:email2...@gmail.com>>
Date: Tuesday, September 25, 2018 at 8:57 PM
To: user mailto:user@spark.apache.org>>
Subject: [Spark SQL] why spark sql hash() are returns the same hash value 
though the keys/expr are not same

Hello All,

I am calculating the hash value of a few columns and determining whether a row 
is an Insert/Delete/Update record, but found a scenario which is a little weird, 
since some of the records return the same hash value though the keys are totally 
different.

For the instance,


scala> spark.sql("select hash('40514X'),hash('41751')").show()

+-----------+-----------+
|hash(40514)|hash(41751)|
+-----------+-----------+
|  976573657|  976573657|
+-----------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()

+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|  777096871|
+-----------+-----------+
I do understand that hash() returns an integer; have these reached the max 
value?

Thanks & Regards,
Gokula Krishnan (Gokul)


Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-26 Thread Thakrar, Jayesh
Cannot reproduce your situation.
Can you share Spark version?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select hash('40514X'),hash('41751')").show()
+------------+-----------+
|hash(40514X)|hash(41751)|
+------------+-----------+
| -1898845883|  916273350|
+------------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()
+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|-1593820563|
+-----------+-----------+


scala>

From: Gokula Krishnan D 
Date: Tuesday, September 25, 2018 at 8:57 PM
To: user 
Subject: [Spark SQL] why spark sql hash() are returns the same hash value 
though the keys/expr are not same

Hello All,

I am calculating the hash value of a few columns and determining whether a row 
is an Insert/Delete/Update record, but found a scenario which is a little weird, 
since some of the records return the same hash value though the keys are totally 
different.

For the instance,


scala> spark.sql("select hash('40514X'),hash('41751')").show()

+-----------+-----------+
|hash(40514)|hash(41751)|
+-----------+-----------+
|  976573657|  976573657|
+-----------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()

+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|  777096871|
+-----------+-----------+
I do understand that hash() returns an integer; have these reached the max 
value?

Thanks & Regards,
Gokula Krishnan (Gokul)


Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Thakrar, Jayesh
Disclaimer - I use Spark with Scala and not Python.

But I am guessing that Jörn's reference to modularization is to ensure that you 
do the processing inside methods/functions and call those methods sequentially.
I believe that as long as an RDD/dataset variable is in scope, its memory may 
not be getting released.
By using functions, the variables go out of scope and their memory can be 
released.

Also, this assumes that the variables are not daisy-chained/inter-related, as 
that too would make releasing memory harder.
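
A rough Scala illustration of that idea (the method, column and paths are made 
up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

def processOne(spark: SparkSession, path: String): Unit = {
  val df = spark.read.parquet(path)      // df is local to this method
  df.filter(col("x") > 0).write.mode("overwrite").parquet(path + "_out")
}                                        // df goes out of scope here

val spark = SparkSession.builder.getOrCreate()
Seq("/data/set1", "/data/set2").foreach(p => processOne(spark, p))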


From: Jay 
Date: Monday, June 4, 2018 at 9:41 PM
To: Shuporno Choudhury 
Cc: "Jörn Franke [via Apache Spark User List]" 
, 
Subject: Re: [PySpark] Releasing memory after a spark job is finished

Can you tell us what version of Spark you are using and if Dynamic Allocation 
is enabled?

Also, how are the files being read? Is it a single read of all files using a 
file-matching regex, or are you running different threads in the same pyspark 
job?


On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhury, 
mailto:shuporno.choudh...@gmail.com>> wrote:
Thanks a lot for the insight.
Actually I have the exact same transformations for all the datasets, hence only 
1 python code.
Now, do you suggest that I run different spark-submit for all the different 
datasets given that I have the exact same transformations?

On Tue 5 Jun, 2018, 1:48 AM Jörn Franke [via Apache Spark User List], 
mailto:ml%2bs1001560n32458...@n3.nabble.com>>
 wrote:
Yes, if they are independent with different transformations, then I would create 
a separate python program. Especially with big data processing frameworks, one 
should avoid putting everything in one big monolithic application.


On 4. Jun 2018, at 22:02, Shuporno Choudhury <[hidden 
email]> wrote:
Hi,

Thanks for the input.
I was trying to get the functionality first, hence I was using local mode. I 
will be running on a cluster definitely but later.

Sorry for my naivety, but can you please elaborate on the modularity concept 
that you mentioned and how it will affect whatever I am already doing?
Do you mean running a different spark-submit for each different dataset when 
you say 'an independent python program for each process '?

On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] <[hidden 
email]> wrote:
Why don’t you modularize your code and write for each process an independent 
python program that is submitted via Spark?

Not sure though if Spark local make sense. If you don’t have a cluster then a 
normal python program can be much better.

On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden 
email]> wrote:
Hi everyone,
I am trying to run a pyspark code on some data sets sequentially [basically 1. 
Read data into a dataframe 2. Perform some join/filter/aggregation 3. Write 
modified data in parquet format to a target location]
Now, while running this pyspark code across multiple independent data sets 
sequentially, the memory usage from the previous data set doesn't seem to get 
released/cleared and hence spark's memory consumption (JVM memory consumption 
from Task Manager) keeps on increasing till it fails at some data set.
So, is there a way to clear/remove dataframes that I know are not going to be 
used later?
Basically, can I clear out some memory programmatically (in the pyspark code) 
when processing for a particular data set ends?
At no point, I am caching any dataframe (so unpersist() is also not a solution).

I am running spark using local[*] as master. There is a single SparkSession 
that is doing all the processing.
If it is not possible to clear out memory, what can be a better approach for 
this problem?

Can someone please help me with this and tell me if I am going wrong anywhere?

--Thanks,
Shuporno Choudhury




--
--Thanks,
Shuporno Choudhury



Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread Thakrar, Jayesh
Junfeng,

I would suggest preprocessing/validating the paths in plain Python (and not 
Spark) before you try to fetch data.
I am not familiar with Python Hadoop libraries, but see if this helps - 
http://crs4.github.io/pydoop/tutorial/hdfs_api.html
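
In Scala (which is what I use), the pre-filtering would look roughly like this - 
a sketch, with the paths and file format as placeholders:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val wanted = Seq("/data/*/12", "/data/*/13")
// globStatus returns null or an empty array when nothing matches the pattern
val existing = wanted.filter { p =>
  val matches = fs.globStatus(new Path(p))
  matches != null && matches.nonEmpty
}
if (existing.nonEmpty) spark.read.parquet(existing: _*)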

Best,
Jayesh

From: JF Chen 
Date: Monday, May 21, 2018 at 10:20 PM
To: ayan guha 
Cc: "Thakrar, Jayesh" , user 

Subject: Re: How to skip nonexistent file when read files with spark?

Thanks ayan,

Also I have tried this method; the trickiest part is that the dataframe union 
method requires the same schema structure, while for my files the schema is 
variable.


Regard,
Junfeng Chen

On Tue, May 22, 2018 at 10:33 AM, ayan guha 
mailto:guha.a...@gmail.com>> wrote:
A relatively naive solution will be:

0. Create a dummy blank dataframe
1. Loop through the list of paths.
2. Try to create the dataframe from the path. If success then union it 
cumulatively.
3. If error, just ignore it or handle as you wish.

At the end of the loop, just use the unioned df. This should not have any 
additional performance overhead as declaring dataframes and union is not 
expensive, unless you call any action within the loop.

Best
Ayan

On Tue, 22 May 2018 at 11:27 am, JF Chen 
mailto:darou...@gmail.com>> wrote:
Thanks, Thakrar,

I have tried to check the existence of a path before reading it, but the HDFSCli 
python package seems not to support wildcards. "FileSystem.globStatus" is a Java 
API, while I am using Python via Livy. Do you know of any Python API implementing 
the same function?


Regard,
Junfeng Chen

On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
Probably you can do some preprocessing/checking of the paths before you attempt 
to read them via Spark.
Whether it is local or hdfs filesystem, you can try to check for existence and 
other details by using the "FileSystem.globStatus" method from the Hadoop API.

From: JF Chen mailto:darou...@gmail.com>>
Date: Sunday, May 20, 2018 at 10:30 PM
To: user mailto:user@spark.apache.org>>
Subject: How to skip nonexistent file when read files with spark?

Hi Everyone
I met a tricky problem recently. I am trying to read some file paths generated 
by another method. The file paths are represented by wildcards in a list, like [ 
'/data/*/12', '/data/*/13']
But in practice, if a wildcard cannot match any existing path, it will throw 
an exception: "pyspark.sql.utils.AnalysisException: 'Path does not exist: ...'", 
and the program stops after that.
Actually I want spark to just ignore and skip these nonexistent file paths and 
continue to run. I have tried the python HDFSCli api to check the existence of a 
path, but hdfs cli cannot support wildcards.

Any good idea to solve my problem? Thanks~

Regard,
Junfeng Chen

--
Best Regards,
Ayan Guha



Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread Thakrar, Jayesh
Probably you can do some preprocessing/checking of the paths before you attempt 
to read them via Spark.
Whether it is local or hdfs filesystem, you can try to check for existence and 
other details by using the "FileSystem.globStatus" method from the Hadoop API.

From: JF Chen 
Date: Sunday, May 20, 2018 at 10:30 PM
To: user 
Subject: How to skip nonexistent file when read files with spark?

Hi Everyone
I met a tricky problem recently. I am trying to read some file paths generated 
by another method. The file paths are represented by wildcards in a list, like [ 
'/data/*/12', '/data/*/13']
But in practice, if a wildcard cannot match any existing path, it will throw 
an exception: "pyspark.sql.utils.AnalysisException: 'Path does not exist: ...'", 
and the program stops after that.
Actually I want spark to just ignore and skip these nonexistent file paths and 
continue to run. I have tried the python HDFSCli api to check the existence of a 
path, but hdfs cli cannot support wildcards.

Any good idea to solve my problem? Thanks~

Regard,
Junfeng Chen


Re: Spark Monitoring using Jolokia

2018-01-08 Thread Thakrar, Jayesh
And here's some more info on Spark Metrics

https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling
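
If it helps, a sketch of what this setup usually looks like: enable Spark's 
built-in JMX sink in conf/metrics.properties, then attach the Jolokia JVM agent 
so those MBeans are served over HTTP (the agent path and port are placeholders):

*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# e.g. in spark-defaults.conf, attach Jolokia to the driver JVM
spark.driver.extraJavaOptions=-javaagent:/path/to/jolokia-jvm-agent.jar=port=8778

Note that the REST API numbers (jobs/stages/tasks) come from the driver's 
internal listener/status store, not from this metrics system, so the two sets of 
metrics only partially overlap.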


From: Maximiliano Felice 
Date: Monday, January 8, 2018 at 8:14 AM
To: Irtiza Ali 
Cc: 
Subject: Re: Spark Monitoring using Jolokia

Hi!

I don't know very much about them, but I'm currently working on posting custom 
metrics into Graphite. I found the internals described in this library useful: 
https://github.com/groupon/spark-metrics

Hope this at least can give you a hint.

Best of luck!

2018-01-08 10:55 GMT-03:00 Irtiza Ali mailto:i...@an10.io>>:
Hello everyone,

I am building a monitoring tool for Spark, and for that I need Spark's metrics. 
I am using Jolokia to get the metrics.

I have two questions:

Can I get all the metrics provided by the spark rest api using Jolokia?

How does the spark rest api get the metrics internally?


Thanks



Re: Access to Applications metrics

2017-12-05 Thread Thakrar, Jayesh
You can also get the metrics from the Spark application event log file.

See https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling
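
And if you need task-level numbers pushed somewhere yourself, a rough sketch of 
a listener-based bridge (what you do with the values - Graphite, JMX, logs - is 
up to you; println is a stand-in):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskMetricsBridge extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null)
      println(s"stage=${taskEnd.stageId} runTimeMs=${m.executorRunTime} " +
        s"shuffleReadBytes=${m.shuffleReadMetrics.totalBytesRead}")
  }
}

spark.sparkContext.addSparkListener(new TaskMetricsBridge)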


From: "Qiao, Richard" 
Date: Monday, December 4, 2017 at 6:09 PM
To: Nick Dimiduk , "user@spark.apache.org" 

Subject: Re: Access to Applications metrics

It works to collect job-level metrics, through the Jolokia java agent.

Best Regards
Richard


From: Nick Dimiduk 
Date: Monday, December 4, 2017 at 6:53 PM
To: "user@spark.apache.org" 
Subject: Re: Access to Applications metrics

Bump.

On Wed, Nov 15, 2017 at 2:28 PM, Nick Dimiduk 
mailto:ndimi...@gmail.com>> wrote:
Hello,

I'm wondering if it's possible to get access to the detailed job/stage/task 
level metrics via the metrics system (JMX, Graphite, &c). I've enabled the 
wildcard sink and I do not see them. It seems these values are only available 
over http/json and to SparkListener instances, is this the case? Has anyone 
worked on a SparkListener that would bridge data from one to the other?

Thanks,
Nick




The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Thakrar, Jayesh
What you have is sequential code and hence sequential processing.
Also, Spark/Scala are not parallel programming languages.
But even if they were, statements are executed sequentially unless you exploit 
the parallel/concurrent execution features.

Anyway, see if this works:

val (RDD1, RDD2) = (JavaFunctions.cassandraTable(...), 
JavaFunctions.cassandraTable(...))

val (RDD3, RDD4) = (RDD1.flatMap(..), RDD2.flatMap(..))


I am hoping that Spark being based on Scala, the behavior below will apply:
scala> var x = 0
x: Int = 0

scala> val (a,b) = (x + 1, x+1)
a: Int = 1
b: Int = 1
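
Since count() is a blocking action, another route is to submit the two actions 
from separate threads so the scheduler can run the jobs concurrently - a minimal 
sketch with Scala Futures (using the RDD names from the code below):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val f3 = Future { RDD3.count() }
val f4 = Future { RDD4.count() }
val (count3, count4) = (Await.result(f3, Duration.Inf), Await.result(f4, Duration.Inf))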



From: Cassa L 
Date: Friday, October 27, 2017 at 1:50 AM
To: Jörn Franke 
Cc: user , 
Subject: Re: Why don't I see my spark jobs running in parallel in 
Cassandra/Spark DSE cluster?

No, I dont use Yarn.  This is standalone spark that comes with DataStax 
Enterprise version of Cassandra.

On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke 
mailto:jornfra...@gmail.com>> wrote:
Do you use yarn ? Then you need to configure the queues with the right 
scheduler and method.

On 27. Oct 2017, at 08:05, Cassa L 
mailto:lcas...@gmail.com>> wrote:
Hi,
I have a spark job that has use case as below:
RRD1 and RDD2 read from Cassandra tables. These two RDDs then do some 
transformation and after that I do a count on transformed data.

Code somewhat  looks like this:

RDD1=JavaFunctions.cassandraTable(...)
RDD2=JavaFunctions.cassandraTable(...)
RDD3 = RDD1.flatMap(..)
RDD4 = RDD2.flatMap()

RDD3.count
RDD4.count

In the Spark UI I see the count() functions getting called one after another. How 
do I make them parallel? I also looked at the discussion below from Cloudera, but 
it does not show how to run driver functions in parallel. Do I just add an 
Executor and run them in threads?

https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515

Attaching UI snapshot here?


Thanks.
LCassa



Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-12 Thread Thakrar, Jayesh
Could this be due to https://issues.apache.org/jira/browse/HIVE-6 ?

From: Patrik Medvedev 
Date: Monday, June 12, 2017 at 2:31 AM
To: Jörn Franke , vaquar khan 
Cc: Jean Georges Perrin , User 
Subject: Re: [Spark JDBC] Does spark support read from remote Hive server via 
JDBC

Hello,

All security checks are disabled, but I still don't have any data in the result.


Sun, 11 Jun 2017 at 14:24, Jörn Franke 
mailto:jornfra...@gmail.com>>:
Is sentry preventing the access?

On 11. Jun 2017, at 01:55, vaquar khan 
mailto:vaquar.k...@gmail.com>> wrote:
Hi,
Please check your firewall security settings. Sharing one good link:
http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html?m=1



Regards,
Vaquar khan

On Jun 8, 2017 1:53 AM, "Patrik Medvedev" 
mailto:patrik.medve...@gmail.com>> wrote:
Hello guys,

Can somebody help me with my problem?
Let me know, if you need more details.


Wed, 7 Jun 2017 at 16:43, Patrik Medvedev 
mailto:patrik.medve...@gmail.com>>:
No, I don't.

Wed, 7 Jun 2017 at 16:42, Jean Georges Perrin 
mailto:j...@jgp.net>>:
Do you have some other security in place like Kerberos or impersonation? It may 
affect your access.


jg


On Jun 7, 2017, at 02:15, Patrik Medvedev 
mailto:patrik.medve...@gmail.com>> wrote:
Hello guys,

I need to execute Hive queries on a remote Hive server from Spark, but for some 
reason I receive only column names (without data).
The data is available in the table; I checked it via HUE and a Java JDBC connection.

Here is my code example:
val test = spark.read
  .option("url", "jdbc:hive2://remote.hive.server:1/work_base")
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", "some_table_with_data")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .format("jdbc")
  .load()
test.show()


Scala version: 2.11
Spark version: 2.1.0, i also tried 2.1.1
Hive version: CDH 5.7 Hive 1.1.1
Hive JDBC version: 1.1.1

But this problem exists on Hive with later versions, too.
Could you help me with this issue? I didn't find anything in the mailing list 
archives or on StackOverflow.
Or could you help me find the correct way to query a remote Hive from Spark?

--
Cheers,
Patrick


Re: spark-submit config via file

2017-03-27 Thread Thakrar, Jayesh
Roy - can you check if you have HADOOP_CONF_DIR and YARN_CONF_DIR set to the 
directory containing the HDFS and YARN configuration files?

From: Sandeep Nemuri 
Date: Monday, March 27, 2017 at 9:44 AM
To: Saisai Shao 
Cc: Yong Zhang , ", Roy" , user 

Subject: Re: spark-submit config via file

You should try adding your NN host and port in the URL.

On Mon, Mar 27, 2017 at 11:03 AM, Saisai Shao 
mailto:sai.sai.s...@gmail.com>> wrote:
It's quite obvious your hdfs URL is not complete; please look at the 
exception: your hdfs URI doesn't have a host or port. Normally it should be OK 
if HDFS is your default FS.

I think the problem is you're running on HDI, in which the default FS is wasb. 
So here a short name without host:port will lead to an error. This looks like 
an HDI-specific issue; you'd better ask HDI.


Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: 
hdfs:///hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz

at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:154)

at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2791)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)

at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2825)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2807)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)





On Fri, Mar 24, 2017 at 9:18 PM, Yong Zhang 
mailto:java8...@hotmail.com>> wrote:

Of course it is possible.



You can always set any configuration in your application using the API, instead 
of passing it in through the CLI.



val sparkConf = new SparkConf()
  .setAppName(properties.get("appName"))
  .setMaster(properties.get("master"))
  .set(xxx, properties.get("xxx"))
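
A slightly fuller sketch of the same idea, assuming a java.util.Properties file 
(the file name and keys are placeholders):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkConf

val props = new Properties()
props.load(new FileInputStream("app.properties"))

val sparkConf = new SparkConf()
  .setAppName(props.getProperty("appName"))
  .setMaster(props.getProperty("master"))
  .set("spark.executor.memory", props.getProperty("spark.executor.memory"))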

Your error is your environment problem.

Yong

From: , Roy mailto:rp...@njit.edu>>
Sent: Friday, March 24, 2017 7:38 AM
To: user
Subject: spark-submit config via file

Hi,

I am trying to deploy a spark job by using spark-submit, which takes a bunch of 
parameters, like

spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode 
cluster --executor-memory 3072m --executor-cores 4 --files streaming.conf 
spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf "streaming.conf"

I was looking for a way to put all these flags in a file to pass to spark-submit, 
to make my spark-submit command simple, like this

spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode 
cluster --properties-file properties.conf --files streaming.conf 
spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf "streaming.conf"

properties.conf has following contents



spark.executor.memory 3072m

spark.executor.cores 4



But I am getting following error



17/03/24 11:36:26 INFO Client: Use hdfs cache file as spark.yarn.archive for 
HDP, 
hdfsCacheFile:hdfs:///hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz

17/03/24 11:36:26 WARN AzureFileSystemThreadPoolExecutor: Disabling threads for 
Delete operation as thread count 0 is <= 1

17/03/24 11:36:26 INFO AzureFileSystemThreadPoolExecutor: Time taken for Delete 
operation is: 1 ms with threads: 0

17/03/24 11:36:27 INFO Client: Deleted staging directory 
wasb://a...@abc.blob.core.windows.net/user/sshuser/.sparkStaging/application_1488402758319_0492

Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: 
hdfs:///hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz

at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:154)

at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2791)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)

at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2825)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2807)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

at 
org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:364)

at 
org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:480)

at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:552)

at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)

at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)

at org.apache.spark.deploy.yarn.Client.run(Client.scala:1218)

at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1277)

at org.apache.spark.deploy.yarn.Client.main(Client.scala)

at sun.reflect.

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-14 Thread Thakrar, Jayesh
Nancy,

As the message from Spark indicates, spark.shuffle.memoryFraction is no longer 
used.
It’s a unified heap space for both data caching and other things.
Also, the previous 11 GB was not sufficient, and you are making the executor 
memory even smaller; I am not sure how that will work.

From: nancy henry 
Date: Tuesday, February 14, 2017 at 1:04 AM
To: Conversant 
Cc: Jon Gregg , "user @spark" 
Subject: Re: Lost executor 4 Container killed by YARN for exceeding memory 
limits.

Hi,


How to set this parameters while launching spark shell

spark.shuffle.memoryFraction=0.5

and

spark.yarn.executor.memoryOverhead=1024

I tried giving it like this, but I get the warning below

spark-shell --master yarn --deploy-mode client --driver-memory 16G 
--num-executors 500 --executor-cores 4 --executor-memory 7G --conf 
spark.shuffle.memoryFraction=0.5 --conf spark.yarn.executor.memoryOverhead=1024

Warning
17/02/13 22:42:02 WARN SparkConf: Detected deprecated memory fraction settings: 
[spark.shuffle.memoryFraction]. As of Spark 1.6, execution and storage memory 
management are unified. All memory fractions used in the old model are now 
deprecated and no longer read. If you wish to use the old memory management, 
you may explicitly enable `spark.memory.useLegacyMode` (not recommended).



On Mon, Feb 13, 2017 at 11:23 PM, Thakrar, Jayesh 
mailto:jthak...@conversantmedia.com>> wrote:
Nancy,

As your log output indicated, your executor hit its 11 GB memory limit.
While you might want to address the root cause/data volume as suggested by Jon, 
you can do an immediate test by changing your command as follows

spark-shell --master yarn --deploy-mode client --driver-memory 16G 
--num-executors 500 --executor-cores 7 --executor-memory 14G

This essentially increases your executor memory from 11 GB to 14 GB.
Note that it will result in a potentially large footprint - from 500x11 to 
500x14 GB.
You may want to consult with your DevOps/Operations/Spark Admin team first.

From: Jon Gregg mailto:coble...@gmail.com>>
Date: Monday, February 13, 2017 at 8:58 AM
To: nancy henry mailto:nancyhenry6...@gmail.com>>
Cc: "user @spark" mailto:user@spark.apache.org>>
Subject: Re: Lost executor 4 Container killed by YARN for exceeding memory 
limits.

Setting Spark's memoryOverhead configuration variable is recommended in your 
logs, and has helped me with these issues in the past.  Search for 
"memoryOverhead" here:  http://spark.apache.org/docs/latest/running-on-yarn.html

That said, you're running on a huge cluster as it is.  If it's possible to 
filter your tables down before the join (keeping just the rows/columns you 
need), that may be a better solution.

Jon

On Mon, Feb 13, 2017 at 5:27 AM, nancy henry 
mailto:nancyhenry6...@gmail.com>> wrote:
Hi All,,

I am getting the below error while I am trying to join 3 tables, which are in ORC 
format in Hive (five 10 GB tables), through the Hive context in Spark
Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical 
memory used. Consider boosting spark.yarn.executor.memoryOverhead.
17/02/13 02:21:19 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container 
killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical memory 
used


I am using the below memory parameters to launch the shell. What else could I 
increase among these parameters, or do I need to change any configuration 
settings? Please let me know

spark-shell --master yarn --deploy-mode client --driver-memory 16G 
--num-executors 500 --executor-cores 7 --executor-memory 10G





Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread Thakrar, Jayesh
Nancy,

As your log output indicated, your executor hit its 11 GB memory limit.
While you might want to address the root cause/data volume as suggested by Jon, 
you can do an immediate test by changing your command as follows

spark-shell --master yarn --deploy-mode client --driver-memory 16G 
--num-executors 500 --executor-cores 7 --executor-memory 14G

This essentially increases your executor memory from 11 GB to 14 GB.
Note that it will result in a potentially large footprint - from 500x11 to 
500x14 GB.
You may want to consult with your DevOps/Operations/Spark Admin team first.

From: Jon Gregg 
Date: Monday, February 13, 2017 at 8:58 AM
To: nancy henry 
Cc: "user @spark" 
Subject: Re: Lost executor 4 Container killed by YARN for exceeding memory 
limits.

Setting Spark's memoryOverhead configuration variable is recommended in your 
logs, and has helped me with these issues in the past.  Search for 
"memoryOverhead" here:  http://spark.apache.org/docs/latest/running-on-yarn.html

That said, you're running on a huge cluster as it is.  If it's possible to 
filter your tables down before the join (keeping just the rows/columns you 
need), that may be a better solution.

Jon

On Mon, Feb 13, 2017 at 5:27 AM, nancy henry 
mailto:nancyhenry6...@gmail.com>> wrote:
Hi All,,

I am getting the below error while I am trying to join 3 tables, which are in ORC 
format in Hive (five 10 GB tables), through the Hive context in Spark
Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical 
memory used. Consider boosting spark.yarn.executor.memoryOverhead.
17/02/13 02:21:19 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container 
killed by YARN for exceeding memory limits. 11.1 GB of 11 GB physical memory 
used


I am using the below memory parameters to launch the shell. What else could I 
increase among these parameters, or do I need to change any configuration 
settings? Please let me know

spark-shell --master yarn --deploy-mode client --driver-memory 16G 
--num-executors 500 --executor-cores 7 --executor-memory 10G




Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Thakrar, Jayesh
Ben,

Also look at Phoenix (Apache project) which provides a better (one of the best) 
SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/

Cheers,
Jayesh


From: vincent gromakowski 
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim 
Cc: Michael Segel , Jörn Franke 
, Mich Talebzadeh , Felix 
Cheung , "user@spark.apache.org" 

Subject: Re: Spark SQL Thriftserver with HBase

Instead of (or in addition to) saving results somewhere, you just start a 
thriftserver that exposes the Spark tables of the SQLContext (or SparkSession 
now). That means you can implement any logic (and maybe use structured 
streaming) to expose your data. Today, using the thriftserver means reading data 
from the persistent store on every query, so if the data modeling doesn't fit 
the query it can be quite slow. What you generally do in a common spark job is 
to load the data and cache the spark table in an in-memory columnar table, which 
is quite efficient for any kind of query; the counterpart is that the cache 
isn't updated, so you have to implement a reload mechanism, and this solution 
isn't available using the thriftserver.
What I propose is to mix the two worlds: periodically/delta-load data into the 
spark table cache and expose it through the thriftserver (see the sketch below). 
But you have to implement the loading logic; it can be very simple to very 
complex depending on your needs.
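
The core of that approach, as a sketch (the load logic and table name are 
placeholders; HiveThriftServer2.startWithContext is the hook referenced in the 
linked post):

import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val df = spark.read.parquet("/path/to/extract")   // your periodic/delta load
df.createOrReplaceTempView("served_table")
spark.catalog.cacheTable("served_table")          // in-memory columnar cache

// expose this session's tables over JDBC/ODBC
HiveThriftServer2.startWithContext(spark.sqlContext)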


2016-10-17 19:48 GMT+02:00 Benjamin Kim 
mailto:bbuil...@gmail.com>>:
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
mailto:vincent.gromakow...@gmail.com>> wrote:

I would suggest coding your own Spark thriftserver, which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic, 
because it's a spark job, and then start a thrift server on temporary tables. For 
example you can query a micro-batch rdd from a kafka stream, or pre-load some 
tables and implement a rolling cache to periodically update the spark in-memory 
tables from the persistent store...
It's not part of the public API and I don't know yet what the issues with doing 
this are, but I think the Spark community should look at this path: making the 
thriftserver instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel 
mailto:msegel_had...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
mailto:bbuil...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
mailto:jornfra...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Thakrar, Jayesh
Yes, iterating over a dataframe and making changes is not uncommon.
Of course RDDs, dataframes and datasets are immutable, but there is some 
optimization in the optimizer that can potentially help to dampen the 
effect/impact of creating a new RDD, DF or DS.
Also, the use-case you cited is similar to what is done in regression, 
clustering and other algorithms, i.e. you iterate, making a change to a 
dataframe/dataset until the desired condition is met.
E.g. see 
https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression
and the setting of the iteration ceiling:

// instantiate the base classifier
val classifier = new LogisticRegression()
  .setMaxIter(params.maxIter)
  .setTol(params.tol)
  .setFitIntercept(params.fitIntercept)

Now the impact of that depends on a variety of things.
E.g. if the data is completely contained in memory and there is no spillover 
to disk, it might not be a big issue (of course there will still be memory, CPU 
and network overhead/latency).
If you are looking at storing the data on disk (e.g. as part of a checkpoint or 
explicit storage), then there can be substantial I/O activity.
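
To make the loop concrete, a minimal sketch of the iterate-until-done pattern, 
with periodic checkpointing so the lineage does not grow without bound (assumes 
Spark 2.1+ for Dataset.checkpoint; the predicate and update rule are made-up 
examples):

import org.apache.spark.sql.functions.{col, when}

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

var df = spark.read.parquet("/path/in")
var iter = 0
// stop when no row still needs updating, or at an iteration ceiling
while (df.filter(col("status") === "pending").count() > 0 && iter < 100) {
  df = df.withColumn("status",
    when(col("status") === "pending", "done").otherwise(col("status")))
  if (iter % 10 == 0) df = df.checkpoint()   // truncate lineage periodically
  iter += 1
}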



From: Xi Shen 
Date: Monday, October 17, 2016 at 2:54 AM
To: Divya Gehlot , Mungeol Heo 
Cc: "user @spark" 
Subject: Re: Is spark a right tool for updating a dataframe repeatedly

I think most of the "big data" tools, like Spark and Hive, are not designed to 
edit data; they are only designed to query data. I wonder in what scenario you 
need to update large volumes of data repeatedly.


On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
mailto:divya.htco...@gmail.com>> wrote:
If my understanding of your query is correct:
in Spark, dataframes are immutable; you can't update a dataframe.
You have to create a new dataframe to "update" the current one.


Thanks,
Divya


On 17 October 2016 at 09:50, Mungeol Heo 
mailto:mungeol@gmail.com>> wrote:
Hello, everyone.

As I mentioned in the title, I wonder whether spark is the right tool for
updating a data frame repeatedly until there is no more data to
update.

For example.

while (there was an update in the last pass) {
update a data frame A
}

If it is the right tool, then what is the best practice for this kind of work?
Thank you.


--

Thanks,
David S.