[ANNOUNCE] Announcing Apache Spark 2.3.2

2018-09-26 Thread Saisai Shao
We are happy to announce the availability of Spark 2.3.2!

Apache Spark 2.3.2 is a maintenance release, based on the branch-2.3
maintenance branch of Spark. We strongly recommend that all 2.3.x users
upgrade to this stable release.

To download Spark 2.3.2, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-3-2.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.


Best regards
Saisai


Re: Spark 2.3.1: k8s driver pods stuck in Initializing state

2018-09-26 Thread Yinan Li
The spark-init ConfigMap is used for the init-container that is responsible
for downloading remote dependencies. The k8s submission client run by
spark-submit should create the ConfigMap and add a ConfigMap volume in the
driver pod. Can you provide the command you used to run the job?

On Wed, Sep 26, 2018 at 2:36 PM purna pradeep 
wrote:

> Hello ,
>
>
> We're running spark 2.3.1 on kubernetes v1.11.0 and our driver pods from
> k8s are getting stuck in initializing state like so:
>
> NAME                                             READY   STATUS     RESTARTS   AGE
> my-pod-fd79926b819d3b34b05250e23347d0e7-driver   0/1     Init:0/1   0          18h
>
>
> And from *kubectl describe pod*:
>
> *Warning  FailedMount  9m (x128 over 4h) * kubelet, 10.47.96.167  Unable
> to mount volumes for pod
> "my-pod-fd79926b819d3b34b05250e23347d0e7-driver_spark(1f3aba7b-c10f-11e8-bcec-1292fec79aba)":
> timeout expired waiting for volumes to attach or mount for pod
> "spark"/"my-pod-fd79926b819d3b34b05250e23347d0e7-driver". list of unmounted
> volumes=[spark-init-properties]. list of unattached
> volumes=[spark-init-properties download-jars-volume download-files-volume
> spark-token-tfpvp]
>   *Warning  FailedMount  4m (x153 over 4h)  kubelet,* 10.47.96.167
> MountVolume.SetUp failed for volume "spark-init-properties" : configmaps
> "my-pod-fd79926b819d3b34b05250e23347d0e7-init-config" not found
>
> From what I can see in *kubectl get configmap* the init config map for
> the driver pod isn't there.
>
> Am I correct in assuming since the configmap isn't being created the
> driver pod will never start (hence stuck in init)?
>
> Where does the init config map come from?
>
> Why would it not be created?
>
>
> Please suggest
>
> Thanks,
> Purna
>
>


Spark 2.3.1: k8s driver pods stuck in Initializing state

2018-09-26 Thread purna pradeep
Hello ,


We're running spark 2.3.1 on kubernetes v1.11.0 and our driver pods from
k8s are getting stuck in initializing state like so:

NAME                                             READY   STATUS     RESTARTS   AGE
my-pod-fd79926b819d3b34b05250e23347d0e7-driver   0/1     Init:0/1   0          18h


And from *kubectl describe pod*:

*Warning  FailedMount  9m (x128 over 4h) * kubelet, 10.47.96.167  Unable to
mount volumes for pod
"my-pod-fd79926b819d3b34b05250e23347d0e7-driver_spark(1f3aba7b-c10f-11e8-bcec-1292fec79aba)":
timeout expired waiting for volumes to attach or mount for pod
"spark"/"my-pod-fd79926b819d3b34b05250e23347d0e7-driver". list of unmounted
volumes=[spark-init-properties]. list of unattached
volumes=[spark-init-properties download-jars-volume download-files-volume
spark-token-tfpvp]
  *Warning  FailedMount  4m (x153 over 4h)  kubelet,* 10.47.96.167
MountVolume.SetUp failed for volume "spark-init-properties" : configmaps
"my-pod-fd79926b819d3b34b05250e23347d0e7-init-config" not found

From what I can see in *kubectl get configmap* the init config map for the
driver pod isn't there.

Am I correct in assuming since the configmap isn't being created the driver
pod will never start (hence stuck in init)?

Where does the init config map come from?

Why would it not be created?


Please suggest

Thanks,
Purna


Spark 2.3.1: k8s driver pods stuck in Initializing state

2018-09-26 Thread Purna Pradeep Mamillapalli
We're running spark 2.3.1 on kubernetes v1.11.0 and our driver pods from
k8s are getting stuck in initializing state like so:

NAME                                             READY   STATUS     RESTARTS   AGE
my-pod-fd79926b819d3b34b05250e23347d0e7-driver   0/1     Init:0/1   0          18h


And from *kubectl describe pod*:

*Warning  FailedMount  9m (x128 over 4h) * kubelet, 10.47.96.167  Unable to
mount volumes for pod
"my-pod-fd79926b819d3b34b05250e23347d0e7-driver_spark(1f3aba7b-c10f-11e8-bcec-1292fec79aba)":
timeout expired waiting for volumes to attach or mount for pod
"spark"/"my-pod-fd79926b819d3b34b05250e23347d0e7-driver". list of unmounted
volumes=[spark-init-properties]. list of unattached
volumes=[spark-init-properties download-jars-volume download-files-volume
spark-token-tfpvp]
  *Warning  FailedMount  4m (x153 over 4h)  kubelet,* 10.47.96.167
MountVolume.SetUp failed for volume "spark-init-properties" : configmaps
"my-pod-fd79926b819d3b34b05250e23347d0e7-init-config" not found

From what I can see in *kubectl get configmap* the init config map for the
driver pod isn't there.

Am I correct in assuming since the configmap isn't being created the driver
pod will never start (hence stuck in init)?

Where does the init config map come from?

Why would it not be created?

Thanks,
Christopher Carney




Spark 2.3.1: k8s driver pods stuck in Initializing state

2018-09-26 Thread Christopher Carney
Our driver pods from k8s are getting stuck in initializing state like so:

NAME                                             READY   STATUS     RESTARTS   AGE
my-pod-fd79926b819d3b34b05250e23347d0e7-driver   0/1     Init:0/1   0          18h


And from *kubectl describe pod*:

*Warning  FailedMount  9m (x128 over 4h) * kubelet, 10.47.96.167  Unable to
mount volumes for pod
"my-pod-fd79926b819d3b34b05250e23347d0e7-driver_spark(1f3aba7b-c10f-11e8-bcec-1292fec79aba)":
timeout expired waiting for volumes to attach or mount for pod
"spark"/"my-pod-fd79926b819d3b34b05250e23347d0e7-driver". list of unmounted
volumes=[spark-init-properties]. list of unattached
volumes=[spark-init-properties download-jars-volume download-files-volume
spark-token-tfpvp]
  *Warning  FailedMount  4m (x153 over 4h)  kubelet,* 10.47.96.167
MountVolume.SetUp failed for volume "spark-init-properties" : configmaps
"my-pod-fd79926b819d3b34b05250e23347d0e7-init-config" not found

From what I can see in *kubectl get configmap* the init config map for the
driver pod isn't there.

Am I correct in assuming since the configmap isn't being created the driver
pod will never start (hence stuck in init)?

Where does the init config map come from?

Why would it not be created?

Thanks,
Christopher Carney




Re: Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread kathleen li
You can use a Spark SQL window function, something like
df.createOrReplaceTempView("dfv")
select count(eventid) over (partition by start_time, end_time order by
start_time) from dfv

Sent from my iPhone

> On Sep 26, 2018, at 11:32 AM, Debajyoti Roy  wrote:
> 
> The problem statement and an approach to solve it using windows is described 
> here:
> 
> https://stackoverflow.com/questions/52509498/given-events-with-start-and-end-times-how-to-count-the-number-of-simultaneous-e
> 
> Looking for more elegant/performant solutions, if they exist. TIA !


Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

2018-09-26 Thread Thakrar, Jayesh
Cannot reproduce your situation.
Can you share Spark version?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select hash('40514X'),hash('41751')").show()
+------------+-----------+
|hash(40514X)|hash(41751)|
+------------+-----------+
| -1898845883|  916273350|
+------------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()
+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|-1593820563|
+-----------+-----------+


scala>

From: Gokula Krishnan D 
Date: Tuesday, September 25, 2018 at 8:57 PM
To: user 
Subject: [Spark SQL] why spark sql hash() are returns the same hash value 
though the keys/expr are not same

Hello All,

I am calculating the hash value of a few columns and determining whether a
record is an Insert/Delete/Update record, but found a scenario which is a little
weird, since some of the records return the same hash value even though the keys
are totally different.

For instance,


scala> spark.sql("select hash('40514X'),hash('41751')").show()

+-----------+-----------+
|hash(40514)|hash(41751)|
+-----------+-----------+
|  976573657|  976573657|
+-----------+-----------+


scala> spark.sql("select hash('14589'),hash('40004')").show()

+-----------+-----------+
|hash(14589)|hash(40004)|
+-----------+-----------+
|  777096871|  777096871|
+-----------+-----------+
I do understand that hash() returns an integer; have these reached the max
value?
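
As an aside (a sketch, not part of the original mail): spark.sql.functions.hash
is a 32-bit Murmur3 hash, so occasional collisions between distinct keys are
expected and nothing has "reached the max value". One common mitigation for
insert/update/delete detection is a wider digest such as sha2 over the
concatenated columns; the column names and sample values below are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("hash-vs-sha2").master("local[*]").getOrCreate()
import spark.implicits._

// hash() is a 32-bit Murmur3 hash, so different keys can legitimately collide;
// a 256-bit digest makes collisions practically negligible for change detection.
val df = Seq(("40514X", "v1"), ("41751", "v2")).toDF("key_col", "value_col")

df.select(
  col("key_col"),
  hash(col("key_col"), col("value_col")).as("murmur3_32"),
  sha2(concat_ws("||", col("key_col"), col("value_col")), 256).as("sha256_digest")
).show(false)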

Thanks & Regards,
Gokula Krishnan (Gokul)


Fwd: Spark 2.3.1: k8s driver pods stuck in Initializing state

2018-09-26 Thread Christopher Carney
Our driver pods from k8s are getting stuck in initializing state like so:

NAME                                             READY   STATUS     RESTARTS   AGE
my-pod-fd79926b819d3b34b05250e23347d0e7-driver   0/1     Init:0/1   0          18h


And from *kubectl describe pod*:

*Warning  FailedMount  9m (x128 over 4h) * kubelet, 10.47.96.167  Unable to
mount volumes for pod
"my-pod-fd79926b819d3b34b05250e23347d0e7-driver_spark(1f3aba7b-c10f-11e8-bcec-1292fec79aba)":
timeout expired waiting for volumes to attach or mount for pod
"spark"/"my-pod-fd79926b819d3b34b05250e23347d0e7-driver". list of unmounted
volumes=[spark-init-properties]. list of unattached
volumes=[spark-init-properties download-jars-volume download-files-volume
spark-token-tfpvp]
  *Warning  FailedMount  4m (x153 over 4h)  kubelet,* 10.47.96.167
MountVolume.SetUp failed for volume "spark-init-properties" : configmaps
"my-pod-fd79926b819d3b34b05250e23347d0e7-init-config" not found

From what I can see in *kubectl get configmap* the init config map for the
driver pod isn't there.

Am I correct in assuming since the configmap isn't being created the driver
pod will never start (hence stuck in init)?

Where does the init config map come from?

Why would it not be created?

Thanks,
Christopher Carney




RE: Python kubernetes spark 2.4 branch

2018-09-26 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi Ilan/Yinan,
My observation is as follows:
The dependent files specified with “--py-files
http://10.75.145.25:80/Spark/getNN.py” are being downloaded and are available in
the container at
“/var/data/spark-c163f15e-d59d-4975-b9be-91b6be062da9/spark-61094ca2-125b-48de-a154-214304dbe74/”.
I guess we need to export PYTHONPATH with this path as well, with the following
code change in entrypoint.sh:


if [ -n "$PYSPARK_FILES" ]; then
PYTHONPATH="$PYTHONPATH:$PYSPARK_FILES"
fi

to

if [ -n "$PYSPARK_FILES" ]; then
PYTHONPATH="$PYTHONPATH:"
fi
Let me know, if this approach is fine.

Please correct me if my understanding is wrong with this approach.

Regards
Surya

From: Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Sent: Wednesday, September 26, 2018 9:14 AM
To: Ilan Filonenko ; liyinan...@gmail.com
Cc: Spark dev list ; user@spark.apache.org
Subject: RE: Python kubernetes spark 2.4 branch

Hi Ilan/ Yinan,
Yes my test case is also similar to the one described in 
https://issues.apache.org/jira/browse/SPARK-24736

My spark-submit is as follows:
./spark-submit --deploy-mode cluster --master 
k8s://https://10.75.145.23:8443 --conf 
spark.app.name=spark-py --properties-file /tmp/program_files/spark_py.conf 
--py-files http://10.75.145.25:80/Spark/getNN.py 
http://10.75.145.25:80/Spark/test.py

Following is the error observed:

+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=192.168.1.22 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
http://10.75.145.25:80/Spark/test.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/spark/jars/phoenix-4.13.1-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Traceback (most recent call last):
File "/tmp/spark-4c428c98-e123-4c29-a9f5-ef85f207e229/test.py", line 13, in 

from getNN import *
ImportError: No module named getNN
2018-09-25 16:19:57 INFO ShutdownHookManager:54 - Shutdown hook called
2018-09-25 16:19:57 INFO ShutdownHookManager:54 - Deleting directory 
/tmp/spark-4c428c98-e123-4c29-a9f5-ef85f207e229

Observing the same kind of behaviour as mentioned in 
https://issues.apache.org/jira/browse/SPARK-24736 (file getting downloaded and 
available in pod)

The same happens with local files as well:

./spark-submit --deploy-mode cluster --master 
k8s://https://10.75.145.23:8443 --conf 
spark.app.name=spark-py --properties-file /tmp/program_files/spark_py.conf 
--py-files ./getNN.py http://10.75.145.25:80/Spark/test.py

test.py has dependencies from getNN.py.


But the same is working in spark 2.2 k8s branch.


Regards
Surya

From: Ilan Filonenko <i...@cornell.edu>
Sent: Wednesday, September 26, 2018 2:06 AM
To: liyinan...@gmail.com
Cc: Garlapati, Suryanarayana (Nokia - IN/Bangalore) <suryanarayana.garlap...@nokia.com>;
Spark dev list <d...@spark.apache.org>; user@spark.apache.org
Subject: Re: Python kubernetes spark 2.4 branch

Is this in reference to: https://issues.apache.org/jira/browse/SPARK-24736 ?

On Tue, Sep 25, 2018 at 12:38 PM Yinan Li <liyinan...@gmail.com> wrote:
Can you give more details on how you ran your app, did you build your own 
image, and which image are you using?

On Tue, Sep 25, 2018 at 10:23 AM Garlapati, Suryanarayana (Nokia -
IN/Bangalore) <suryanarayana.garlap...@nokia.com> wrote:
Hi,
I am trying to run spark python testcases on k8s based on tag spark-2.4-rc1. 
When the dependent files are passed through the --py-files option, they are not 
getting resolved by the main python script. Please let me know, is this a known 
issue?

Regards
Surya



Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread Debajyoti Roy
The problem statement and an approach to solve it using windows is
described here:

https://stackoverflow.com/questions/52509498/given-events-with-start-and-end-times-how-to-count-the-number-of-simultaneous-e

Looking for more elegant/performant solutions, if they exist. TIA !
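
One common window-based pattern (a minimal Scala sketch with a made-up event
schema, offered as an illustration rather than the approach from the linked
answer): turn each event into a +1 marker at its start and a -1 marker at its
end, then take a running sum ordered by time.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("simultaneous-events").master("local[*]").getOrCreate()
import spark.implicits._

// Toy input: one row per event with start and end times (epoch seconds for brevity).
val events = Seq(
  ("e1", 1L, 5L),
  ("e2", 2L, 6L),
  ("e3", 7L, 9L)
).toDF("event_id", "start_time", "end_time")

// Emit a +1 marker at each start and a -1 marker at each end, then take a running
// sum ordered by time: the sum is the number of events open at that instant.
// Ordering +1 before -1 on ties treats touching intervals as overlapping.
val markers = events.select($"start_time".as("ts"), lit(1).as("delta"))
  .union(events.select($"end_time".as("ts"), lit(-1).as("delta")))

val w = Window.orderBy($"ts", $"delta".desc)
markers.withColumn("concurrent_events", sum($"delta").over(w)).orderBy("ts").show()

The maximum of concurrent_events is then the peak number of simultaneous events.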


Re: Creating spark Row from database values

2018-09-26 Thread Kuttaiah Robin
Thanks, I'll check it out.

On Wed, Sep 26, 2018 at 6:25 PM Shahab Yunus  wrote:

> Hi there. Have you seen this link?
> https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393
>
>
> It shows you multiple ways to manually create a dataframe.
>
> Hope it helps.
>
> Regards,
> Shahab
>
> On Wed, Sep 26, 2018 at 8:02 AM Kuttaiah Robin  wrote:
>
>> Hello,
>>
>> Currently I have Oracle database table with description as shown below;
>>
>> Table INSIGHT_ID_FED_IDENTIFIERS
>>   -
>> CURRENT_INSTANCE_ID   VARCHAR2(100)
>> PREVIOUS_INSTANCE_ID  VARCHAR2(100)
>>
>>
>> Sample values in the table basically output of select * from
>> INSIGHT_ID_FED_IDENTIFIERS. For simplicity I have put only one row.
>>
>>
>> CURRENT_INSTANCE_ID   PREVIOUS_INSTANCE_ID
>> ---   ---
>> curInstanceId1  prevInstanceId1
>>
>>
>> I have the spark schema associated with it.
>>
>>
>> Now I need to create a Spark row(org.apache.spark.sql.Row) out of it.
>>
>> Can someone help me understanding on how this can be achieved?
>>
>> regards,
>> Robin Kuttaiah
>>
>


Re: spark.lapply

2018-09-26 Thread Felix Cheung
It looks like the native R process was terminated by a buffer overflow. Do you
know how much data is involved?



From: Junior Alvarez 
Sent: Wednesday, September 26, 2018 7:33 AM
To: user@spark.apache.org
Subject: spark.lapply

Hi!

I’m using spark.lapply() in SparkR on a Mesos service, and I get the following crash
randomly. The spark.lapply() function is called around 150 times; sometimes it crashes
after 16 calls, other times after 25 calls, and so on. It is completely random, even
though the data used in the actual call is always the same across the 150 times I
called that function:

…

18/09/26 07:30:42 INFO TaskSetManager: Finished task 129.0 in stage 78.0 (TID 
1192) in 98 ms on 10.255.0.18 (executor 0) (121/143)

18/09/26 07:30:42 WARN TaskSetManager: Lost task 128.0 in stage 78.0 (TID 1191, 
10.255.0.18, executor 0): org.apache.spark.SparkException: R computation failed 
with

 7f327f4dd000-7f327f50 r-xp  08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f51c000-7f327f6f2000 rw-p  00:00 0

7f327f6fc000-7f327f6fd000 rw-p  00:00 0

7f327f6fd000-7f327f6ff000 rw-p  00:00 0

7f327f6ff000-7f327f70 r--p 00022000 08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f70-7f327f701000 rw-p 00023000 08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f701000-7f327f702000 rw-p  00:00 0

7fff6070f000-7fff60767000 rw-p  00:00 0  [stack]

7fff6077f000-7fff60781000 r-xp  00:00 0  [vdso]

ff60-ff601000 r-xp  00:00 0  
[vsyscall]

*** buffer overflow detected ***: /usr/local/lib/R/bin/exec/R terminated

=== Backtrace: =

/lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x7f327db9529f]

/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f327dc3087c]

/lib/x86_64-linux-gnu/libc.so.6(+0x10d750)[0x7f327dc2f750]

…

Of course, if I use the native R lapply(), everything works fine.

I wonder if this is a known issue, and/or whether there is a way to avoid it when
using SparkR.

B r
/Junior



Re: Creating spark Row from database values

2018-09-26 Thread Shahab Yunus
Hi there. Have you seen this link?
https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393


It shows you multiple ways to manually create a dataframe.

Hope it helps.

Regards,
Shahab
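
A minimal Scala sketch of both routes (building Rows by hand against the schema
from the question, or reading the table directly over JDBC); the JDBC URL,
credentials, and the Oracle JDBC driver being on the classpath are assumptions.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("rows-from-db").master("local[*]").getOrCreate()

// Schema mirroring the Oracle table (both columns VARCHAR2 -> StringType).
val schema = StructType(Seq(
  StructField("CURRENT_INSTANCE_ID", StringType, nullable = true),
  StructField("PREVIOUS_INSTANCE_ID", StringType, nullable = true)
))

// Option 1: build Rows manually from values already fetched elsewhere.
val rows = Seq(Row("curInstanceId1", "prevInstanceId1"))
val manualDf = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Option 2: let Spark read the table over JDBC; each record of the resulting
// DataFrame is already an org.apache.spark.sql.Row. URL and credentials are placeholders.
val jdbcDf = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")
  .option("dbtable", "INSIGHT_ID_FED_IDENTIFIERS")
  .option("user", "scott")
  .option("password", "tiger")
  .load()

manualDf.show(false)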

On Wed, Sep 26, 2018 at 8:02 AM Kuttaiah Robin  wrote:

> Hello,
>
> Currently I have Oracle database table with description as shown below;
>
> Table INSIGHT_ID_FED_IDENTIFIERS
>   -
> CURRENT_INSTANCE_ID   VARCHAR2(100)
> PREVIOUS_INSTANCE_ID  VARCHAR2(100)
>
>
> Sample values in the table basically output of select * from
> INSIGHT_ID_FED_IDENTIFIERS. For simplicity I have put only one row.
>
>
> CURRENT_INSTANCE_ID   PREVIOUS_INSTANCE_ID
> ---   ---
> curInstanceId1  prevInstanceId1
>
>
> I have the spark schema associated with it.
>
>
> Now I need to create a Spark row(org.apache.spark.sql.Row) out of it.
>
> Can someone help me understanding on how this can be achieved?
>
> regards,
> Robin Kuttaiah
>


spark and STS tokens (Federation Tokens)

2018-09-26 Thread Ashic Mahtab
Hi,
I'm looking to have spark jobs access S3 with temporary credentials. I've seen 
some examples around AssumeRole, but I have a scenario where the temp 
credentials are provided by GetFederationToken. Is there anything that can 
help, or do I need to use boto to execute GetFederationToken, and then pass the 
temp credentials as config params?

Also, for both GetFederationToken and AssumeRole, is there a valid way of 
refreshing the tokens once the job executes? Temp credentials from AssumeRole 
are quite limited in lifetime, and even with GetFederationToken, the maximum a 
set of temp credentials are valid is limited to 36 hours. If there a callback 
or similar thing we can give to spark that will be called when credentials are 
about to (have) expire (expired)?
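
One route, sketched below under the assumption that the temporary credentials have
already been fetched out of band (e.g. via GetFederationToken through the AWS SDK
or boto), is to hand them to Hadoop's S3A connector as session credentials. The
credential values and bucket are placeholders, and this sketch does not address
refreshing them mid-job.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-temp-creds").getOrCreate()

// Temporary credentials obtained out of band; placeholder values.
// Assumes the hadoop-aws and AWS SDK jars are on the classpath.
val accessKey = "ASIA..."
val secretKey = "..."
val sessionToken = "..."

val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hc.set("fs.s3a.access.key", accessKey)
hc.set("fs.s3a.secret.key", secretKey)
hc.set("fs.s3a.session.token", sessionToken)

// Reads/writes against s3a:// paths now use the session credentials.
val df = spark.read.parquet("s3a://my-bucket/some/prefix/")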

Thanks,
Ashic.


Creating spark Row from database values

2018-09-26 Thread Kuttaiah Robin
Hello,

Currently I have Oracle database table with description as shown below;

Table INSIGHT_ID_FED_IDENTIFIERS
  -
CURRENT_INSTANCE_ID   VARCHAR2(100)
PREVIOUS_INSTANCE_ID  VARCHAR2(100)


Sample values in the table basically output of select * from
INSIGHT_ID_FED_IDENTIFIERS. For simplicity I have put only one row.


CURRENT_INSTANCE_ID   PREVIOUS_INSTANCE_ID
---   ---
curInstanceId1  prevInstanceId1


I have the spark schema associated with it.


Now I need to create a Spark row(org.apache.spark.sql.Row) out of it.

Can someone help me understanding on how this can be achieved?

regards,
Robin Kuttaiah


spark.lapply

2018-09-26 Thread Junior Alvarez
Hi!

I'm using spark.lapply() in SparkR on a Mesos service, and I get the following crash
randomly. The spark.lapply() function is called around 150 times; sometimes it crashes
after 16 calls, other times after 25 calls, and so on. It is completely random, even
though the data used in the actual call is always the same across the 150 times I
called that function:

...

18/09/26 07:30:42 INFO TaskSetManager: Finished task 129.0 in stage 78.0 (TID 
1192) in 98 ms on 10.255.0.18 (executor 0) (121/143)

18/09/26 07:30:42 WARN TaskSetManager: Lost task 128.0 in stage 78.0 (TID 1191, 
10.255.0.18, executor 0): org.apache.spark.SparkException: R computation failed 
with

 7f327f4dd000-7f327f50 r-xp  08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f51c000-7f327f6f2000 rw-p  00:00 0

7f327f6fc000-7f327f6fd000 rw-p  00:00 0

7f327f6fd000-7f327f6ff000 rw-p  00:00 0

7f327f6ff000-7f327f70 r--p 00022000 08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f70-7f327f701000 rw-p 00023000 08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f701000-7f327f702000 rw-p  00:00 0

7fff6070f000-7fff60767000 rw-p  00:00 0  [stack]

7fff6077f000-7fff60781000 r-xp  00:00 0  [vdso]

ff60-ff601000 r-xp  00:00 0  
[vsyscall]

*** buffer overflow detected ***: /usr/local/lib/R/bin/exec/R terminated

=== Backtrace: =

/lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x7f327db9529f]

/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f327dc3087c]

/lib/x86_64-linux-gnu/libc.so.6(+0x10d750)[0x7f327dc2f750]

...

Of course, if I use the native R lapply(), everything works fine.

I wonder if this is a known issue, and/or whether there is a way to avoid it when
using SparkR.

B r
/Junior




Re: Lightweight pipeline execution for single row

2018-09-26 Thread Jatin Puri
Using FAIR mode.

If there is no other way: I think there is a limit on the number of parallel jobs
that spark can run. Is there a way to run more jobs in parallel? This is alright
because this SparkContext would only be used during web service calls.
I looked at the Spark configuration page and tried a few settings, but they didn't
seem to work. I am using spark 2.3.1
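
For reference, a minimal Scala sketch of the usual pattern: with one SparkContext,
jobs only run in parallel when they are submitted from separate threads, and FAIR
scheduling then lets those jobs share executors. The pool name, thread count, and
dummy job below are illustrative.

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fair-pools")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// One thread per in-flight web request; each thread submits its own Spark job.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

def handleRequest(requestId: Int): Future[Long] = Future {
  // Jobs from different threads land in the "requests" pool and run concurrently.
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "requests")
  // Stand-in for pipelineModel.transform(singleRowDf) followed by an action.
  spark.range(0, 1000).count()
}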

Thanks.

On Sun, Sep 23, 2018 at 6:00 PM Michael Artz  wrote:

> Are you using the scheduler in fair mode instead of fifo mode?
>
> Sent from my iPhone
>
> > On Sep 22, 2018, at 12:58 AM, Jatin Puri  wrote:
> >
> > Hi.
> >
> > What tactics can I apply for such a scenario.
> >
> > I have a pipeline of 10 stages. Simple text processing. I train the data
> with the pipeline and for the fitted data, do some modelling and store the
> results.
> >
> > I also have a web-server, where I receive requests. For each request
> (dataframe of a single row), I transform against the same pipeline created
> above, and perform the respective action. The problem is: calling spark for
> a single row takes less than 1 second, but under higher load, spark
> becomes a major bottleneck.
> >
> > One solution that I can think of is to have a scala re-implementation
> of the same pipeline and, with the help of the model generated above,
> process the requests. But this results in duplication of code and hence
> maintenance.
> >
> > Is there any way that I can call the same pipeline (transform) in a
> very light manner, and just for a single row, so that it just works
> concurrently and spark does not remain a bottleneck?
> >
> > Thanks
> > Jatin
>


-- 
Jatin Puri
http://jatinpuri.com 


Re: How to access line fileName in loading file using the textFile method

2018-09-26 Thread vermanurag
Spark has sc.wholeTextFiles(), which returns an RDD of tuples. The first element
of the tuple is the file name and the second element is the file content.
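
A minimal Scala sketch of that approach (the input path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("files-with-names").master("local[*]").getOrCreate()

// wholeTextFiles returns RDD[(String, String)]: (file path, entire file content).
val filesRdd = spark.sparkContext.wholeTextFiles("hdfs:///data/input/*.txt")

// Pair every line with the file it came from.
val linesWithFileName = filesRdd.flatMap { case (fileName, content) =>
  content.split("\n").map(line => (fileName, line))
}

linesWithFileName.take(5).foreach(println)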



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Pivot Column ordering in spark

2018-09-26 Thread Manohar Rao
I am doing a pivot transformation on an input dataset


Following input schema
=
 |-- c_salutation: string (nullable = true)
 |-- c_preferred_cust_flag: string (nullable = true)
 |-- integer_type_col: integer (nullable = false)
 |-- long_type_col: long (nullable = false)
 |-- string_type_col: string (nullable = true)
 |-- decimal_type_col: decimal(38,0) (nullable = true)

 My pivot column is c_preferred_cust_flag, the pivot values are "Y", "N", "R",
 and the group-by column is c_salutation.

 I am using the API pivot(String pivotColumn, java.util.List values)
 on RelationalGroupedDataset.



My aggregation functions after this pivot are
===
count(`string_type_col`) ,sum(`string_type_col`) ,sum(`integer_type_col`)
,avg(`integer_type_col`)
,sum(`long_type_col`) ,avg(`long_type_col`) ,avg(`decimal_type_col`)

===
My output dataset schema after the groupby.pivot.agg()
 is

 |-- c_salutation: string (nullable = true)
 |-- Y_count(`string_type_col`): long (nullable = true)
 |-- Y_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
 |-- Y_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
 |-- Y_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
 |-- Y_sum(`long_type_col`): long (nullable = true)
 |-- Y_avg(`long_type_col`): double (nullable = true)
 |-- Y_avg(`decimal_type_col`): decimal(38,4) (nullable = true)
 |-- N_count(`string_type_col`): long (nullable = true)
 |-- N_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
 |-- N_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
 |-- N_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
 |-- N_sum(`long_type_col`): long (nullable = true)
 |-- N_avg(`long_type_col`): double (nullable = true)
 |-- N_avg(`decimal_type_col`): decimal(38,4) (nullable = true)
 |-- R_count(`string_type_col`): long (nullable = true)
 |-- R_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
 |-- R_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
 |-- R_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
 |-- R_sum(`long_type_col`): long (nullable = true)
 |-- R_avg(`long_type_col`): double (nullable = true)
 |-- R_avg(`decimal_type_col`): decimal(38,4) (nullable = true)

==
 My requirement is:
 ==
to rename the system-generated column names such as
Y_count(`string_type_col`), N_avg(`decimal_type_col`)
etc. to a user-defined name based on a mapping.
I need to be able to do this programmatically given a mapping of the form:
(pivotvalue + aggregationfunction) --> (requiredcolumnname)

===
 My question is :
 ===
 Can I rely on the order of the output columns generated?
The order looks to conform to this pattern:
PivotValue1-aggregationfunction1
PivotValue1-aggregationfunction2

PivotValue1-aggregationfunctionN

PivotValue2-aggregationfunction1
PivotValue2-aggregationfunction2
..
 Is this order standard across spark versions 2+?
 Is this subject to change, or is it not reliable from a user's point of view?

 If not reliable, is there another way by which I can
logically/programmatically identify that a column such as
R_sum(CAST(`integer_type_col` AS BIGINT)) corresponds to the input pivot value
"R" and the aggregation function sum(`integer_type_col`)?
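
One way to avoid relying on ordering at all (a sketch with illustrative aliases
and toy data, based on the Spark 2.x behaviour I have seen rather than a stated
guarantee): give each aggregation an explicit alias before the pivot, so the
generated columns are named <pivotValue>_<alias> and can be mapped directly.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("pivot-aliases").master("local[*]").getOrCreate()
import spark.implicits._

// Toy input mirroring the question's columns.
val inputDf = Seq(
  ("Mr", "Y", 10, 100L),
  ("Mr", "N", 20, 200L),
  ("Ms", "R", 30, 300L)
).toDF("c_salutation", "c_preferred_cust_flag", "integer_type_col", "long_type_col")

// With aliased aggregations, the pivot names its output columns
// "<pivotValue>_<alias>" (e.g. "Y_sum_int"), independent of position.
val pivoted = inputDf
  .groupBy("c_salutation")
  .pivot("c_preferred_cust_flag", Seq("Y", "N", "R"))
  .agg(
    sum("integer_type_col").alias("sum_int"),
    avg("long_type_col").alias("avg_long")
  )

// Rename via a (pivotValue_alias) -> requiredColumnName mapping.
val renameMap = Map("Y_sum_int" -> "y_integer_sum", "R_avg_long" -> "r_long_avg")
val renamed = renameMap.foldLeft(pivoted) { case (df, (from, to)) =>
  if (df.columns.contains(from)) df.withColumnRenamed(from, to) else df
}
renamed.printSchema()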


Thanks

Manohar