Write data from Hbase using Spark Failing with NPE

2018-05-23 Thread Alchemist
I am using Spark to write data to HBase. I can read data just fine, but the write
is failing with the exception below. I found a similar issue that was resolved by
adding the *-site.xml files and HBase JARs to the classpath, but that is not working for me.
      JavaPairRDD<ImmutableBytesWritable, Put> tablePuts =
          hBaseRDD.mapToPair(new PairFunction<Tuple2<ImmutableBytesWritable, Result>,
              ImmutableBytesWritable, Put>() {

            @Override
            public Tuple2<ImmutableBytesWritable, Put> call(
                Tuple2<ImmutableBytesWritable, Result> results) throws Exception {
              byte[] accountId = results._2().getValue(Bytes.toBytes(COLFAMILY),
                  Bytes.toBytes("accountId"));
              String rowKey = new String(results._2().getRow());
              String accountId2 = Bytes.toString(accountId);
              String vbMedia2 = Bytes.toString(vbmedia); // vbmedia is not defined in the snippet as posted
              System.out.println("  accountId " + accountId2);

              // int prefix = getHash(rowKey);
              String prefix = getMd5Hash(rowKey);
              String newrowKey = prefix + rowKey;
              System.out.println("  newrowKey &&&" + newrowKey);
              LOG.info("  newrowKey &&&" + newrowKey);

              // Add a single cell def:vbmedia
              Put put = new Put(Bytes.toBytes(newrowKey));
              put.addColumn(Bytes.toBytes("def"), Bytes.toBytes("accountId"), accountId);
              // return statement was missing from the snippet as posted
              return new Tuple2<>(new ImmutableBytesWritable(Bytes.toBytes(newrowKey)), put);
            }
          });

      Job newAPIJobConfiguration = Job.getInstance(conf);
      newAPIJobConfiguration.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, OUT_TABLE_NAME);
      newAPIJobConfiguration.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);
      newAPIJobConfiguration.setOutputKeyClass(org.apache.hadoop.hbase.io.ImmutableBytesWritable.class);
      newAPIJobConfiguration.setOutputValueClass(org.apache.hadoop.io.Writable.class);

      tablePuts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration.getConfiguration());

Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:123)
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:177)
    at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.assertConf(SparkHadoopWriter.scala:387)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
    at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:831)
    at com.voicebase.etl.s3tohbase.HbaseScan2.main(HbaseScan2.java:148)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
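
The trace shows the NPE coming out of HBase's UserProvider while TableOutputFormat.checkOutputSpecs creates a connection, which is commonly a sign that the configuration TableOutputFormat ends up with does not carry the HBase settings. One thing worth checking is how the job configuration is built; a minimal sketch (in Scala, but the same calls apply from Java), with a hypothetical site-file path and table name:

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
  import org.apache.hadoop.mapreduce.Job

  // Start from HBaseConfiguration.create() so hbase-default.xml / hbase-site.xml on the
  // classpath are loaded; add the site file explicitly if it is not on the classpath.
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.addResource(new Path("/etc/hbase/conf/hbase-site.xml")) // hypothetical path

  val job = Job.getInstance(hbaseConf)
  job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "output_table") // hypothetical table name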

Re: help in copying data from one azure subscription to another azure subscription

2018-05-23 Thread Pushkar.Gujar
What are you using for storing data in those subscriptions? Data Lake or
Blobs? Azure Data Factory is already available and can copy
between these cloud storage services without having to go through Spark.


Thank you,
*Pushkar Gujar*


On Mon, May 21, 2018 at 8:59 AM, amit kumar singh 
wrote:

> HI Team,
>
> We are trying to move data between one azure subscription to another azure
> subscription is there a faster way to do through spark
>
> i am using distcp and its taking for ever
>
> thanks
> rohit
>


PySpark API on top of Apache Arrow

2018-05-23 Thread Corey Nolet
Please forgive me if this question has been asked already.

I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if
anyone knows of any efforts to implement the PySpark API on top of Apache
Arrow directly. In my case, I'm doing data science on a machine with 288
cores and 1TB of ram.

It would make life much easier if I was able to use the flexibility of the
PySpark API (rather than having to be tied to the operations in Pandas). It
seems like an implementation would be fairly straightforward using the
Plasma server and object_ids.

If you have not heard of an effort underway to accomplish this, any reasons
why it would be a bad idea?


Thanks!


Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread ayan guha
Curious question: what is the reason for using Spark here? Why not simple
SQL-based ETL?

On Thu, May 24, 2018 at 5:09 AM, Ajay  wrote:

> Do you worry about spark overloading the SQL server?  We have had this
> issue in the past where all spark slaves tend to send lots of data at once
> to SQL and that slows down the latency of the rest of the system. We
> overcame this by using sqoop and running it in a controlled environment.
>
> On Wed, May 23, 2018 at 7:32 AM Chetan Khatri 
> wrote:
>
>> Super, just giving high level idea what i want to do. I have one source
>> schema which is MS SQL Server 2008 and target is also MS SQL Server 2008.
>> Currently there is c# based ETL application which does extract transform
>> and load as customer specific schema including indexing etc.
>>
>>
>> Thanks
>>
>> On Wed, May 23, 2018 at 7:11 PM, kedarsdixit <kedarnath_di...@persistent.com> wrote:
>>
>>> Yes.
>>>
>>> Regards,
>>> Kedar Dixit
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
> --
> Thanks,
> Ajay
>



-- 
Best Regards,
Ayan Guha


Re: Alternative for numpy in Spark Mlib

2018-05-23 Thread Suzen, Mehmet
You can use Breeze, which is part of the Spark distribution:
https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra

Check out the modules under import breeze._
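
For example, a minimal sketch of a Shannon entropy calculation with Breeze (the probability values below are made up and assumed to be already normalised):

  import breeze.linalg.DenseVector
  import breeze.numerics.log

  // Shannon entropy H(p) = -sum_i p_i * ln(p_i) of a discrete distribution, in nats
  val p = DenseVector(0.2, 0.3, 0.5)
  val entropy = -(p dot log(p))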

On 23 May 2018 at 07:04, umargeek  wrote:
> Hi Folks,
>
> I am planning to rewrite one of my python module written for entropy
> calculation using numpy into Spark Mlib so that it can be processed in
> distributed manner.
>
> Can you please advise on the possibilities of the same approach or any
> alternatives.
>
> Thanks,
> Umar
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Ajay
Do you worry about spark overloading the SQL server?  We have had this
issue in the past where all spark slaves tend to send lots of data at once
to SQL and that slows down the latency of the rest of the system. We
overcame this by using sqoop and running it in a controlled environment.

On Wed, May 23, 2018 at 7:32 AM Chetan Khatri 
wrote:

> Super, just giving high level idea what i want to do. I have one source
> schema which is MS SQL Server 2008 and target is also MS SQL Server 2008.
> Currently there is c# based ETL application which does extract transform
> and load as customer specific schema including indexing etc.
>
>
> Thanks
>
> On Wed, May 23, 2018 at 7:11 PM, kedarsdixit <
> kedarnath_di...@persistent.com> wrote:
>
>> Yes.
>>
>> Regards,
>> Kedar Dixit
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>

-- 
Thanks,
Ajay


Re: Submit many spark applications

2018-05-23 Thread Marcelo Vanzin
On Wed, May 23, 2018 at 12:04 PM, raksja  wrote:
> So InProcessLauncher wouldnt use the native memory, so will it overload the
> mem of parent process?

It will still use "native memory" (since the parent process will still
use memory), just less of it. But yes, it will use more memory in the
parent process.

> Is there any way that we can overcome this?

Try to launch fewer applications concurrently.
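
For reference, a minimal sketch of launching through InProcessLauncher (available in newer Spark releases; the jar path and main class below are assumptions):

  import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

  // Runs the application inside the current JVM instead of forking a spark-submit
  // process, so the per-launch process overhead is lower.
  val handle: SparkAppHandle = new InProcessLauncher()
    .setAppResource("/path/to/my-app.jar") // assumption
    .setMainClass("com.example.MyApp")     // assumption
    .startApplication()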


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



CMR: An open-source Data acquisition API for Spark is available

2018-05-23 Thread Thomas Fuller
Hi Folks,

Today I've released my open-source CMR API, which is used to acquire data
from several data providers directly in Spark.

Currently the CMR API offers integration with the following:

- Federal Reserve Bank of St. Louis
- World Bank
- TreasuryDirect.gov
- OpenFIGI.com

Of note:

- The project page, including a jar file built from the current source and
a few demonstration videos: https://coherentlogic.com/wordpress/cmr/
- Source code: https://bitbucket.org/CoherentLogic/cmrapi/
- A very simple example of the CMR API configured with the Infinispan
distributed cache (http://infinispan.org/):
https://coherentlogic.com/wordpress/cmr-infinispan-lightning-fast-data-acquisition/

Sample:

Bring S&P 500 data directly into Spark from the Federal Reserve Bank of St.
Louis (https://fred.stlouisfed.org/ and
https://research.stlouisfed.org/docs/api/fred/) as follows:

val observationsDS = cmr.fred.series.observations.withApiKey(FRED_API_KEY)
  .withSeriesId("SP500").doGetAsObservationsDataset(spark)

Feedback is welcomed so please feel free to send me your comments.

Tom
Coherent Logic Limited  | LinkedIn



Re: Submit many spark applications

2018-05-23 Thread raksja
Hi Marcelo, 

I'm facing the same issue when making spark-submits from an EC2 instance and
reaching the native memory limit sooner. We have #1, but we are still on
Spark 2.1.0, so I couldn't try #2.

So InProcessLauncher wouldn't use native memory; will it overload the
memory of the parent process?

Is there any way that we can overcome this?




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark driver pod garbage collection

2018-05-23 Thread Anirudh Ramanathan
There's a flag to the controller manager that is in charge of the retention
policy for terminated or completed pods:

https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/#options

--terminated-pod-gc-threshold int32 (default: 12500)
Number of terminated pods that can exist before the terminated pod garbage
collector starts deleting terminated pods. If <= 0, the terminated pod
garbage collector is disabled.

On Wed, May 23, 2018, 8:34 AM purna pradeep  wrote:

> Hello,
>
> Currently I observe dead pods are not getting garbage collected (aka spark
> driver pods which have completed execution). So pods could sit in the
> namespace for weeks potentially. This makes listing, parsing, and reading
> pods slower and well as having junk sit on the cluster.
>
> I believe minimum-container-ttl-duration kubelet flag is by default set to
> 0 minute but I don’t see the completed spark driver pods are garbage
> collected
>
> Do I need to set any flag explicitly @ kubelet level?
>
>


Cannot make Spark to honour the spark.jars.ivySettings config

2018-05-23 Thread Bruno Aranda
Hi,

I am trying to use my own ivy settings file. For that, I am submitting to
Spark using a command such as the following to test:

spark-shell --packages some-org:some-artifact:102 --conf
spark.jars.ivySettings=/home/hadoop/ivysettings.xml

The idea is to be able to get the artifact from a private repository.

The ivy settings file at /home/hadoop/ivysettings.xml does exist and it is
valid. I have used it to resolve the artifact successfully, with something
such as:

java -jar ivy-2.4.0.jar -settings /home/hadoop/ivysettings.xml -dependency
some-org some-artifact 102

If I run the spark-shell command in verbose mode, I can see:

Using properties file: /usr/lib/spark/conf/spark-defaults.conf
[...]
Spark properties used, including those specified through
 --conf and those from the properties file
/usr/lib/spark/conf/spark-defaults.conf:
[...]
(spark.jars.ivySettings,/home/hadoop/ivysettings.xml)

Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url =
jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-kafka-0-10_2.11 added as a dependency
some-org#some-artifact added as a dependency
[...]

So why is Spark loading the default settings when I provide my
own, even though it seems to pick up the config property correctly?

Thanks!!

Bruno


Spark driver pod garbage collection

2018-05-23 Thread purna pradeep
Hello,

Currently I observe that dead pods are not getting garbage collected (i.e. Spark
driver pods which have completed execution), so pods could potentially sit in the
namespace for weeks. This makes listing, parsing, and reading
pods slower, as well as leaving junk sitting on the cluster.

I believe the minimum-container-ttl-duration kubelet flag is set to
0 minutes by default, but I don't see the completed Spark driver pods being
garbage collected.

Do I need to set any flag explicitly at the kubelet level?


Re: [structured-streaming]How to reset Kafka offset in readStream and read from beginning

2018-05-23 Thread Sushil Kotnala
You can use .option("auto.offset.reset", "earliest") while reading from
Kafka.
With this, a new stream will read from the first offset present for the topic.
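
For instance, a minimal sketch of a Structured Streaming read that starts from the earliest available offsets; the structured streaming Kafka source exposes this through the startingOffsets option (broker and topic names below are assumptions):

  // Assumes an existing SparkSession named `spark` and the spark-sql-kafka-0-10 package.
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092") // assumption
    .option("subscribe", "my-topic")                   // assumption
    .option("startingOffsets", "earliest")             // read from the beginning of the topic
    .load()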




On Wed, May 23, 2018 at 11:32 AM, karthikjay  wrote:

> Chris,
>
> Thank you for responding. I get it.
>
> But, if I am using a console sink without checkpoint location, I do not see
> any messages in the console in IntellijIDEA IDE. I do not explicitly
> specify
> checkpointLocation in this case. How do I clear the working directory data
> and force Spark to read Kafka messages from the beginning. ?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Super. Just giving a high-level idea of what I want to do: I have one source
schema, which is MS SQL Server 2008, and the target is also MS SQL Server 2008.
Currently there is a C#-based ETL application which does extract, transform
and load into a customer-specific schema, including indexing etc.


Thanks

On Wed, May 23, 2018 at 7:11 PM, kedarsdixit  wrote:

> Yes.
>
> Regards,
> Kedar Dixit
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread kedarsdixit
Yes.

Regards,
Kedar Dixit



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Thank you Kedar Dixit, Silvio Fiorito.

Just one question: even if it's not an Azure cloud MS SQL Server, it
should still support an MS SQL Server installed on a local machine, right?

Thank you.

On Wed, May 23, 2018 at 6:18 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Try this https://docs.microsoft.com/en-us/azure/sql-database/sql-
> database-spark-connector
>
>
>
>
>
> From: Chetan Khatri
> Date: Wednesday, May 23, 2018 at 7:47 AM
> To: user
> Subject: Bulk / Fast Read and Write with MSSQL Server and Spark
>
>
>
> All,
>
>
>
> I am looking for approach to do bulk read / write with MSSQL Server and
> Apache Spark 2.2 , please let me know if any library / driver for the same.
>
>
>
> Thank you.
>
> Chetan
>


Re: Adding jars

2018-05-23 Thread kedarsdixit
This can help solve the immediate issue; however, ideally one should
submit the jars at the beginning of the job.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Silvio Fiorito
Try this 
https://docs.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector


From: Chetan Khatri 
Date: Wednesday, May 23, 2018 at 7:47 AM
To: user 
Subject: Bulk / Fast Read and Write with MSSQL Server and Spark

All,

I am looking for approach to do bulk read / write with MSSQL Server and Apache 
Spark 2.2 , please let me know if any library / driver for the same.

Thank you.
Chetan


Re: Adding jars

2018-05-23 Thread Sushil Kotnala
The purpose of a broadcast variable is different.

@Malveeka, could you please explain your use case and issue?
If the fat/uber jar does not contain the required dependent jars, then the Spark
job will fail at the start itself.

What is your scenario in which you want to add new jars?
Also, what do you mean by adding spark.jars in the middle? If you mean the middle of
processing, then you cannot change Spark jars in the middle of execution.
However, you can change them for the next processing iteration.


On Wed, May 23, 2018 at 5:39 PM, kedarsdixit  wrote:

> In case of already running jobs, you can make use of broadcasters which
> will
> broadcast the jars to workers, if you want to change it on the fly you can
> rebroadcast it.
>
> You can explore broadcasters bit more to make use of.
>
> Regards,
> Kedar Dixit
> Data Science at Persistent Systems Ltd.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Adding jars

2018-05-23 Thread kedarsdixit
In the case of already running jobs, you can make use of broadcasters, which will
broadcast the jars to workers; if you want to change them on the fly, you can
rebroadcast.

You can explore broadcasters a bit more to make use of them.

Regards,
Kedar Dixit
Data Science at Persistent Systems Ltd.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Adding jars

2018-05-23 Thread Sushil Kotnala
Hi

With spark-submit we can start a new Spark job, but it will not add new
jar files to an already running job.

~Sushil

On Wed, May 23, 2018, 17:28 kedarsdixit 
wrote:

> Hi,
>
> You can add dependencies in spark-submit as below:
>
> ./bin/spark-submit \
>   --class <main-class> \
>   --master <master-url> \
>   --deploy-mode <deploy-mode> \
>   --conf <key>=<value> \
>   --jars <comma-separated-list-of-jars> \
>   ... # other options
>   <application-jar> \
>   [application-arguments]
>
> Hope this helps.
>
> Regards,
>
> Kedar Dixit
> Data Science at Persistent Systems Ltd
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark is not evenly distributing data

2018-05-23 Thread kedarsdixit
Hi,

Can you elaborate more here? We don't understand the issue in detail.

Regards,
Kedar Dixit
Data Science at Persistent Systems Ltd.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Adding jars

2018-05-23 Thread kedarsdixit
Hi,

You can add dependencies in spark-submit as below:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --jars <comma-separated-list-of-jars> \
  ... # other options
  <application-jar> \
  [application-arguments]

Hope this helps.

Regards,

Kedar Dixit
Data Science at Persistent Systems Ltd



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread kedarsdixit
Hi,

I came across this a while ago; check if it is helpful.

Regards,
~Kedar Dixit
Data Science @ Persistent Systems Ltd.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
All,

I am looking for an approach to do bulk read/write with MSSQL Server and
Apache Spark 2.2; please let me know if there is any library/driver for the same.

Thank you.
Chetan
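
For the baseline (non-bulk) path, a minimal sketch of Spark's generic JDBC data source against SQL Server; the connection details and table names are assumptions, and the Microsoft JDBC driver must be on the classpath:

  // Assumes an existing SparkSession named `spark`.
  val jdbcUrl = "jdbc:sqlserver://dbhost:1433;databaseName=mydb" // assumption

  val df = spark.read
    .format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", "dbo.source_table") // assumption
    .option("user", "spark_user")          // assumption
    .option("password", "secret")          // assumption
    .load()

  df.write
    .format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", "dbo.target_table") // assumption
    .option("user", "spark_user")
    .option("password", "secret")
    .option("batchsize", "10000")          // larger batches improve write throughput
    .mode("append")
    .save()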


Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-23 Thread Jungtaek Lim
The issue looks like it was fixed in
https://issues.apache.org/jira/browse/SPARK-23670, and 2.3.1 will likely
include the fix.

-Jungtaek Lim (HeartSaVioR)

On Wed, May 23, 2018 at 7:12 PM, weand wrote:

> Thanks for clarification. So it really seem a Spark UI OOM Issue.
>
> After setting:
> --conf spark.sql.ui.retainedExecutions=10
> --conf spark.worker.ui.retainedExecutors=10
> --conf spark.worker.ui.retainedDrivers=10
> --conf spark.ui.retainedJobs=10
> --conf spark.ui.retainedStages=10
> --conf spark.ui.retainedTasks=10
> --conf spark.streaming.ui.retainedBatches=10
>
> ...driver memory consumption still increases constantly over time (ending
> in
> OOM).
>
> TOP 10 Records by Heap Consumption:
> Class Name                                                      |  Objects | Shallow Heap |    Retained Heap
> ----------------------------------------------------------------|----------|--------------|-----------------
> org.apache.spark.status.ElementTrackingStore                    |        1 |           40 | >= 1.793.945.416
> org.apache.spark.util.kvstore.InMemoryStore                     |        1 |           24 | >= 1.793.944.760
> org.apache.spark.util.kvstore.InMemoryStore$InstanceList        |       13 |          416 | >= 1.792.311.104
> org.apache.spark.sql.execution.ui.SparkPlanGraphWrapper         |   16.472 |      527.104 | >= 1.430.379.120
> org.apache.spark.sql.execution.ui.SparkPlanGraphNodeWrapper     |  378.856 |    9.092.544 | >= 1.415.224.880
> org.apache.spark.sql.execution.ui.SparkPlanGraphNode            |  329.440 |   10.542.080 | >= 1.389.888.112
> org.apache.spark.sql.execution.ui.SparkPlanGraphClusterWrapper  |   49.416 |    1.976.640 |   >= 957.701.152
> org.apache.spark.sql.execution.ui.SQLExecutionUIData            |    1.000 |       64.000 |   >= 344.103.096
> org.apache.spark.sql.execution.ui.SQLPlanMetric                 |  444.744 |   14.231.808 |    >= 14.231.808
> org.apache.spark.sql.execution.ui.SparkPlanGraphEdge            |  312.968 |   10.014.976 |    >= 10.014.976
>
> >300k instances per SparkPlanGraphNodeWrapper, SparkPlanGraphNode and
> SQLPlanMetric.
>
> BTW: we are using 2.3.0.
>
> Shall I fill a new Jira for that memory leak in Spark UI? Only found
> https://issues.apache.org/jira/browse/SPARK-15716 but seems something
> different.
>
> Trying with spark.ui.enabled=false in the meantime.
>
>
> Tathagata Das wrote
> > Just to be clear, these screenshots are about the memory consumption of
> > the
> > driver. So this is nothing to do with streaming aggregation state which
> > are
> > kept in the memory of the executors, not the driver.
> >
> > On Tue, May 22, 2018 at 10:21 AM, Jungtaek Lim <kabhwan@...> wrote:
> >
> >> 1. Could you share your Spark version?
> >> 2. Could you reduce "spark.sql.ui.retainedExecutions" and see whether it
> >> helps? This configuration is available in 2.3.0, and default value is
> >> 1000.
> >>
> >> Thanks,
> >> Jungtaek Lim (HeartSaVioR)
> >>
> >> On Tue, May 22, 2018 at 4:29 PM, weand <andreas.weise@...> wrote:
> >>
> >>> You can see it even better on this screenshot:
> >>>
> >>> TOP Entries Collapsed #2
> >>> http://apache-spark-user-list.1001560.n3.nabble.com/file/t8542/27_001.png
> >>>
> >>> Sorry for the spam, attached a not so perfect screen in the mail
> before.
> >>>
> >>>
> >>>
> >>> --
> >>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >>>
> >>> -
> >>> To unsubscribe e-mail:
>
> > user-unsubscribe@.apache
>
> >>>
> >>>
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-23 Thread weand
Thanks for the clarification. So it really seems to be a Spark UI OOM issue.

After setting:
--conf spark.sql.ui.retainedExecutions=10
--conf spark.worker.ui.retainedExecutors=10
--conf spark.worker.ui.retainedDrivers=10
--conf spark.ui.retainedJobs=10
--conf spark.ui.retainedStages=10
--conf spark.ui.retainedTasks=10
--conf spark.streaming.ui.retainedBatches=10

...driver memory consumption still increases constantly over time (ending in
OOM).

TOP 10 Records by Heap Consumption:
Class Name                                                      |  Objects | Shallow Heap |    Retained Heap
----------------------------------------------------------------|----------|--------------|-----------------
org.apache.spark.status.ElementTrackingStore                    |        1 |           40 | >= 1.793.945.416
org.apache.spark.util.kvstore.InMemoryStore                     |        1 |           24 | >= 1.793.944.760
org.apache.spark.util.kvstore.InMemoryStore$InstanceList        |       13 |          416 | >= 1.792.311.104
org.apache.spark.sql.execution.ui.SparkPlanGraphWrapper         |   16.472 |      527.104 | >= 1.430.379.120
org.apache.spark.sql.execution.ui.SparkPlanGraphNodeWrapper     |  378.856 |    9.092.544 | >= 1.415.224.880
org.apache.spark.sql.execution.ui.SparkPlanGraphNode            |  329.440 |   10.542.080 | >= 1.389.888.112
org.apache.spark.sql.execution.ui.SparkPlanGraphClusterWrapper  |   49.416 |    1.976.640 |   >= 957.701.152
org.apache.spark.sql.execution.ui.SQLExecutionUIData            |    1.000 |       64.000 |   >= 344.103.096
org.apache.spark.sql.execution.ui.SQLPlanMetric                 |  444.744 |   14.231.808 |    >= 14.231.808
org.apache.spark.sql.execution.ui.SparkPlanGraphEdge            |  312.968 |   10.014.976 |    >= 10.014.976

>300k instances per SparkPlanGraphNodeWrapper, SparkPlanGraphNode and
SQLPlanMetric.

BTW: we are using 2.3.0.

Shall I file a new Jira for that memory leak in the Spark UI? I only found
https://issues.apache.org/jira/browse/SPARK-15716, but that seems to be something
different.

Trying with spark.ui.enabled=false in the meantime.


Tathagata Das wrote
> Just to be clear, these screenshots are about the memory consumption of
> the
> driver. So this is nothing to do with streaming aggregation state which
> are
> kept in the memory of the executors, not the driver.
> 
> On Tue, May 22, 2018 at 10:21 AM, Jungtaek Lim <kabhwan@...> wrote:
> 
>> 1. Could you share your Spark version?
>> 2. Could you reduce "spark.sql.ui.retainedExecutions" and see whether it
>> helps? This configuration is available in 2.3.0, and default value is
>> 1000.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Tue, May 22, 2018 at 4:29 PM, weand <andreas.weise@...> wrote:
>>
>>> You can see it even better on this screenshot:
>>>
>>> TOP Entries Collapsed #2
>>> http://apache-spark-user-list.1001560.n3.nabble.com/file/t8542/27_001.png
>>>
>>> Sorry for the spam, attached a not so perfect screen in the mail before.
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: 

> user-unsubscribe@.apache

>>>
>>>





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Recall: spark sql in-clause problem

2018-05-23 Thread Shiva Prashanth Vallabhaneni
Shiva Prashanth Vallabhaneni would like to recall the message, "spark sql 
in-clause problem".


Any comments or statements made in this email are not necessarily those of 
Tavant Technologies. The information transmitted is intended only for the 
person or entity to which it is addressed and may contain confidential and/or 
privileged material. If you have received this in error, please contact the 
sender and delete the material from any computer. All emails sent from or to 
Tavant Technologies may be subject to our monitoring procedures.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Beginner][StructuredStreaming] Using Spark aggregation - WithWatermark on old data

2018-05-23 Thread karthikjay
I am doing the following aggregation on the data

val channelChangesAgg = tunerDataJsonDF
  .withWatermark("ts2", "10 seconds")
  .groupBy(window(col("ts2"), "10 seconds"),
           col("env"),
           col("servicegroupid"))
  .agg(count("linetransactionid") as "count1")

The only constraint here is that the data is backdated; even though the data
is chronologically ordered, ts2 will be an old date. Given this
condition, will the watermarking and aggregation still work?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [structured-streaming]How to reset Kafka offset in readStream and read from beginning

2018-05-23 Thread karthikjay
Chris,

Thank you for responding. I get it. 

But if I am using a console sink without a checkpoint location, I do not see
any messages in the console in the IntelliJ IDEA IDE. I do not explicitly specify
checkpointLocation in this case. How do I clear the working directory data
and force Spark to read Kafka messages from the beginning?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Fwd: XGBoost on PySpark

2018-05-23 Thread Aakash Basu
Guys, any insight on the below?

-- Forwarded message --
From: Aakash Basu 
Date: Sat, May 19, 2018 at 12:21 PM
Subject: XGBoost on PySpark
To: user 


Hi guys,

I need help in implementing XG-Boost in PySpark.

As per the conversation in a popular thread regarding XGB, it is
available in Scala and Java versions but not Python. But we have to
implement a Pythonic distributed solution (on Spark), maybe using DMLC or
similar, to go ahead with XGB solutioning.

Anybody implemented the same? If yes, please share some insight on how to
go about it.

Thanks,
Aakash.