Re: Log file location in Spark on K8s

2023-10-09 Thread Prashant Sharma
Hi Sanket,

Driver and executor logs are written to stdout by default; this can be
configured using the SPARK_HOME/conf/log4j.properties file. That file, along
with the entire SPARK_HOME/conf directory, is automatically propagated to all
driver and executor containers and mounted as a volume.
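
For reference, a minimal sketch of what that file typically contains for
log4j 1.x-based builds, keeping all logging on the console so that it ends up
in the pod's container log that fluent-bit tails (note that Spark 3.3 and
later use log4j2 and read conf/log4j2.properties instead):

    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n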

Thanks

On Mon, 9 Oct, 2023, 5:37 pm Agrawal, Sanket,
 wrote:

> Hi All,
>
>
>
> We are trying to send the spark logs using fluent-bit. We validated that
> fluent-bit is able to move logs of all other pods except the
> driver/executor pods.
>
>
>
> It would be great if someone can guide us where should I look for spark
> logs in Spark on Kubernetes with client/cluster mode deployment.
>
>
>
> Thanks,
> Sanket A.
>


Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Prashant Sharma
Hi Sanket, more details might help here.

What does your Spark configuration look like?

What exactly was done when this happened?

On Thu, 5 Oct, 2023, 2:29 pm Agrawal, Sanket,
 wrote:

> Hello Everyone,
>
>
>
> We are trying to stream the changes in our Iceberg tables stored in AWS
> S3. We are achieving this through Spark-Iceberg Connector and using JAR
> files for Spark-AWS. Suddenly we have started receiving error “Connection
> pool shut down”.
>
>
>
> Spark Version: 3.4.1
>
> Iceberg: 1.3.1
>
>
>
> Any help or guidance would be of great help.
>
>
>
> Thank You,
>
> Sanket A.
>
>
>


Re: EOF Exception Spark Structured Streams - Kubernetes

2021-02-01 Thread Prashant Sharma
Hi Sachit,

The fix version on that JIRA says 3.0.2, so this fix is not yet released.
A 3.1.1 release is coming soon; in the meantime you can try out the 3.1.1 RC,
which also has the fix, and let us know your findings.

Thanks,


On Mon, Feb 1, 2021 at 10:24 AM Sachit Murarka 
wrote:

> Following is the related JIRA , Can someone pls check
>
> https://issues.apache.org/jira/browse/SPARK-24266
>
> I am using 3.0.1 , It says fixed in 3.0.0 and 3.1.0 . Could you please
> suggest what can be done to avoid this?
>
> Kind Regards,
> Sachit Murarka
>
>
> On Sun, Jan 31, 2021 at 6:38 PM Sachit Murarka 
> wrote:
>
>> Hi Users,
>>
>> I am running Spark application on Kubernetes and getting the following
>> exception in the driver pod. Though it is not affecting the output.
>>
>> This exception is coming every 5 minutes and this is a structured
>> streaming job.
>>
>> Could anyone please advise ?
>>
>> 21/01/29 06:33:15 WARN WatchConnectionManager: Exec
>> Failurejava.io.EOFException at
>> okio.RealBufferedSource.require(RealBufferedSource.java:61) at
>> okio.RealBufferedSource.readByte(RealBufferedSource.java:74) at
>> okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117) at
>> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
>> at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) at
>> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) at
>> okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) at
>> okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> at
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> at java.base/java.lang.Thread.run(Thread.java:834)21/01/29 06:38:16 WARN
>> WatchConnectionManager: Exec Failure
>>
>>
>> Kind Regards,
>> Sachit Murarka
>>
>


Re: Suggestion on Spark 2.4.7 vs Spark 3 for Kubernetes

2021-01-05 Thread Prashant Sharma
A lot of developers may have already moved to 3.0.x. FYI, 3.1.0 is just
around the corner, hopefully in a few days, and has a lot of improvements to
Spark on K8s, including the transition from experimental to GA in this
release.

See: https://issues.apache.org/jira/browse/SPARK-33005

Thanks,

On Tue, Jan 5, 2021 at 12:41 AM Sachit Murarka 
wrote:

> Hi Users,
>
> Could you please tell which Spark version have you used in Production for
> Kubernetes.
> Which is a recommended version for Production provided that both Streaming
> and core apis have to be used using Pyspark.
>
> Thanks !
>
> Kind Regards,
> Sachit Murarka
>


Re: Error while running Spark on K8s

2021-01-04 Thread Prashant Sharma
Can you please paste the full exception trace, and mention the Spark and K8s
versions?

On Mon, Jan 4, 2021 at 6:19 PM Sachit Murarka 
wrote:

> Hi Prashant
>
> Thanks for the response!
>
> I created the service account with the permissions  and following is the
> command:
>
> spark-submit --deploy-mode cluster --master k8s://http://ip:port --name
> "sachit"   --conf spark.kubernetes.pyspark.pythonVersion=3
> --conf spark.kubernetes.namespace=spark-test --conf
> spark.executor.instances=5 --conf
> spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa  --conf
> spark.kubernetes.container.image=sparkpy local:///opt/spark/da/main.py
>
> Kind Regards,
> Sachit Murarka
>
>
> On Mon, Jan 4, 2021 at 5:46 PM Prashant Sharma 
> wrote:
>
>> Hi Sachit,
>>
>> Can you give more details on how did you run? i.e. spark submit command.
>> My guess is, a service account with sufficient privilege is not provided.
>> Please see:
>> http://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac
>>
>> Thanks,
>>
>> On Mon, Jan 4, 2021 at 5:27 PM Sachit Murarka 
>> wrote:
>>
>>> Hi All,
>>> I am getting the below error when I am trying to run the spark job on
>>> Kubernetes, I am running it in cluster mode.
>>>
>>> Exception in thread "main"
>>> io.fabric8.kubernetes.client.KubernetesClientException: Operation:
>>> [create] for kind: [Pod] with name: [null] in namespace:
>>> [spark-test] failed.
>>> at
>>> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>>> at
>>> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>>>
>>>
>>> I saw there was an JIRA opened already. SPARK-31786
>>> I tried the parameter mentioned in JIRA
>>> too(spark.kubernetes.driverEnv.HTTP2_DISABLE=true), that also did not work.
>>> Can anyone suggest what can be done?
>>>
>>> Kind Regards,
>>> Sachit Murarka
>>>
>>


Re: Error while running Spark on K8s

2021-01-04 Thread Prashant Sharma
Hi Sachit,

Can you give more details on how you ran it, i.e. the spark-submit command?
My guess is that a service account with sufficient privileges is not provided.
Please see:
http://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac
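
For example, a minimal RBAC setup along the lines of the documentation above.
The namespace below matches the spark-test namespace from your error message;
the service account name is just an example:

    kubectl create serviceaccount spark --namespace spark-test
    kubectl create clusterrolebinding spark-role --clusterrole=edit \
      --serviceaccount=spark-test:spark --namespace=spark-test

and then point spark-submit at it with
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark.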

Thanks,

On Mon, Jan 4, 2021 at 5:27 PM Sachit Murarka 
wrote:

> Hi All,
> I am getting the below error when I am trying to run the spark job on
> Kubernetes, I am running it in cluster mode.
>
> Exception in thread "main"
> io.fabric8.kubernetes.client.KubernetesClientException: Operation:
> [create] for kind: [Pod] with name: [null] in namespace:
> [spark-test] failed.
> at
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> at
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>
>
> I saw there was an JIRA opened already. SPARK-31786
> I tried the parameter mentioned in JIRA
> too(spark.kubernetes.driverEnv.HTTP2_DISABLE=true), that also did not work.
> Can anyone suggest what can be done?
>
> Kind Regards,
> Sachit Murarka
>


Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-19 Thread Prashant Sharma
-dev
Hi,

I have used Spark with HDFS encrypted with Hadoop KMS, and it worked well,
though I cannot recall whether Kubernetes was in the mix. From the error alone
it is not clear what caused the failure. Can I reproduce this somehow?

Thanks,

On Sat, Aug 15, 2020 at 7:18 PM Michel Sumbul 
wrote:

> Hi guys,
>
> Does anyone have an idea on this issue? even some tips to troubleshoot it?
> I got the impression that after the creation of the delegation for the
> KMS, the token is not sent to the executor or maybe not saved?
>
> I'm sure I'm not the only one using Spark with HDFS encrypted with KMS :-)
>
> Thanks,
> Michel
>
> Le jeu. 13 août 2020 à 14:32, Michel Sumbul  a
> écrit :
>
>> Hi guys,
>>
>> Does anyone try Spark3 on k8s reading data from HDFS encrypted with KMS
>> in HA mode (with kerberos)?
>>
>> I have a wordcount job running with Spark3 reading data on HDFS (hadoop
>> 3.1) everything secure with kerberos. Everything works fine if the data
>> folder is not encrypted (spark on k8s). If the data is on an encrypted
>> folder, Spark3 on yarn is working fine but it doesn't work when Spark3 is
>> running on K8S.
>> I submit the job with spark-submit command and I provide the keytab and
>> the principal to use.
>> I got the kerberos error saying that there is no TGT to authenticate to
>> the KMS (ranger kms, full stack trace of the error at the end of the mail)
>> servers but in the log I can see that Spark get 2 delegation token, one for
>> each KMS servers:
>>
>> -- --
>>
>> 20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Attempting to login
>> to KDC using principal: mytestu...@paf.com
>>
>> 20/08/13 10:50:50 INFO HadoopDelegationTokenManager: Successfully logged
>> into KDC.
>>
>> 20/08/13 10:50:52 WARN DomainSocketFactory: The short-circuit local reads
>> feature cannot be used because libhadoop cannot be loaded.
>>
>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token
>> for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
>> mytestu...@paf.com (auth:KERBEROS)]] with renewer testuser
>>
>> 20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
>> HDFS_DELEGATION_TOKEN owner= mytestu...@paf.com, renewer=testuser,
>> realUser=, issueDate=1597315852353, maxDate=1597920652353,
>> sequenceNumber=55185062, masterKeyId=1964 on ha-hdfs:cluster2
>>
>> 20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind:
>> kms-dt, Service: kms://ht...@server2.paf.com:9393/kms, Ident: (kms-dt
>> owner=testuser, renewer=testuser, realUser=, issueDate=1597315852642,
>> maxDate=1597920652642, sequenceNumber=3929883, masterKeyId=623))
>>
>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: getting token
>> for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-237056190_16, ugi=
>> testu...@paf.com (auth:KERBEROS)]] with renewer testu...@paf.com
>>
>> 20/08/13 10:50:52 INFO DFSClient: Created token for testuser:
>> HDFS_DELEGATION_TOKEN owner=testu...@paf.com, renewer=testuser,
>> realUser=, issueDate=1597315852744, maxDate=1597920652744,
>> sequenceNumber=55185063, masterKeyId=1964 on ha-hdfs:cluster2
>>
>> 20/08/13 10:50:52 INFO KMSClientProvider: New token created: (Kind:
>> kms-dt, Service: kms://ht...@server.paf.com:9393/kms, Ident: (kms-dt
>> owner=testuser, renewer=testuser, realUser=, issueDate=1597315852839,
>> maxDate=1597920652839, sequenceNumber=3929884, masterKeyId=624))
>>
>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval
>> is 86400104 for token HDFS_DELEGATION_TOKEN
>>
>> 20/08/13 10:50:52 INFO HadoopFSDelegationTokenProvider: Renewal interval
>> is 86400108 for token kms-dt
>>
>> 20/08/13 10:50:54 INFO HiveConf: Found configuration file null
>>
>> 20/08/13 10:50:54 INFO HadoopDelegationTokenManager: Scheduling renewal
>> in 18.0 h.
>>
>> 20/08/13 10:50:54 INFO HadoopDelegationTokenManager: Updating delegation
>> tokens.
>>
>> 20/08/13 10:50:54 INFO SparkHadoopUtil: Updating delegation tokens for
>> current user.
>>
>> 20/08/13 10:50:55 INFO SparkHadoopUtil: Updating delegation tokens for
>> current user.
>> --- --
>>
>> In the core-site.xml, I have the following property for the 2 KMS servers:
>>
>> --
>>
>> <property>
>>   <name>hadoop.security.key.provider.path</name>
>>   <value>kms://ht...@server.paf.com;server2.paf.com:9393/kms</value>
>> </property>
>>
>> -
>>
>>
>> Does anyone have an idea how to make it work? Or at least anyone has been
>> able to make it work?
>> Does anyone know where the delegation tokens are saved during the
>> execution of jobs on k8s and how it is shared between the executors?
>>
>>
>> Thanks,
>> Michel
>>
>> PS: The full stack trace of the error:
>>
>> 
>>
>> Caused by: org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 22 in stage 0.0 failed 4 times, most recent failure: Lost
>> task 22.3 in stage 0.0 (TID 23, 10.5.5.5, executor 1): java.io.IOException:
>> 

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread Prashant Sharma
Hi Ashika,

Hadoop 2.6 is no longer supported, and since it has not been maintained in
the last two years, it may have unpatched security issues. From Spark 3.0
onwards we no longer support it; in other words, we have modified our codebase
in a way that Hadoop 2.6 won't work. However, if you are determined, you can
always apply a custom patch to the Spark codebase and support it. I would
recommend moving to a newer Hadoop.
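
For reference, a hedged sketch of building against the nearest supported
Hadoop line instead (the profile and version flags below are illustrative;
check the build documentation of your exact release):

    ./build/mvn -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.4 -DskipTests clean package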

Thanks,

On Mon, Jul 20, 2020 at 8:41 AM Ashika Umanga 
wrote:

> Greetings,
>
> Hadoop 2.6 has been removed according to this ticket
> https://issues.apache.org/jira/browse/SPARK-25016
>
> We run our Spark cluster on K8s in standalone mode.
> We access HDFS/Hive running on a Hadoop 2.6 cluster.
> We've been using Spark 2.4.5 and planning on upgrading to Spark 3.0.0
> However, we dont have any control over the Hadoop cluster and it will
> remain in 2.6
>
> Is Spark 3.0 still compatible with HDFS/Hive running on Hadoop 2.6 ?
>
> Best Regards,
>


Re: Spark Compatibility with Java 11

2020-07-14 Thread Prashant Sharma
Hi Ankur,

Java 11 support was added in Spark 3.0.
https://issues.apache.org/jira/browse/SPARK-24417

Thanks,


On Tue, Jul 14, 2020 at 6:12 PM Ankur Mittal 
wrote:

> Hi,
>
> I am using Spark 2.x and need to execute Java 11. It's not able to execute
> Java 11 using Spark 2.x.
>
> Is there any way we can use Java 11 with Spark2.X?
>
> Has this issue been resolved  in Spark 3.0 ?
>
>
> --
> Regards
> Ankur Mittal
>
>


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-12 Thread Prashant Sharma
Driver HA is not yet available in K8s mode. It can be a good area to work on,
and I want to take a look at it. I personally refer to the official Spark
documentation for reference.
Thanks,



On Fri, Jul 10, 2020, 9:30 PM Varshney, Vaibhav <
vaibhav.varsh...@siemens.com> wrote:

> Hi Prashant,
>
>
>
> It sounds encouraging. During scale down of the cluster, probably few of
> the spark jobs are impacted due to re-computation of shuffle data. This is
> not of supreme importance for us for now.
>
> Is there any reference deployment architecture available, which is HA ,
> scalable and dynamic-allocation-enabled for deploying Spark on K8s? Any
> suggested github repo or link?
>
>
>
> Thanks,
>
> Vaibhav V
>
>
>
>
>
> *From:* Prashant Sharma 
> *Sent:* Friday, July 10, 2020 12:57 AM
> *To:* user@spark.apache.org
> *Cc:* Sean Owen ; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>; Varshney, Vaibhav (DI SW CAS MP AFC ARC) <
> vaibhav.varsh...@siemens.com>
> *Subject:* Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
>
>
> Hi,
>
>
>
> Whether it is a blocker or not, is upto you to decide. But, spark k8s
> cluster supports dynamic allocation, through a different mechanism, that
> is, without using an external shuffle service.
> https://issues.apache.org/jira/browse/SPARK-27963. There are pros and
> cons of both approaches. The only disadvantage of scaling without external
> shuffle service is, when the cluster scales down or it loses executors due
> to some external cause ( for example losing spot instances), we lose the
> shuffle data (data that was computed as an intermediate to some overall
> computation) on that executor. This situation may not lead to data loss, as
> spark can recompute the lost shuffle data.
>
>
>
> Dynamically, scaling up and down scaling, is helpful when the spark
> cluster is running off, "spot instances on AWS" for example or when the
> size of data is not known in advance. In other words, we cannot estimate
> how much resources would be needed to process the data. Dynamic scaling,
> lets the cluster increase its size only based on the number of pending
> tasks, currently this is the only metric implemented.
>
>
>
> I don't think it is a blocker for my production use cases.
>
>
>
> Thanks,
>
> Prashant
>
>
>
> On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
>
> Thanks for response. We have tried it in dev env. For production, if Spark
> 3.0 is not leveraging k8s scheduler, then would Spark Cluster in K8s be
> "static"?
> As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is
> still blocker for production workloads?
>
> Thanks,
> Vaibhav V
>
> -Original Message-
> From: Sean Owen 
> Sent: Thursday, July 9, 2020 3:20 PM
> To: Varshney, Vaibhav (DI SW CAS MP AFC ARC)  >
> Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>
> Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
> I haven't used the K8S scheduler personally, but, just based on that
> comment I wouldn't worry too much. It's been around for several versions
> and AFAIK works fine in general. We sometimes aren't so great about
> removing "experimental" labels. That said I know there are still some
> things that could be added to it and more work going on, and maybe people
> closer to that work can comment. But yeah you shouldn't be afraid to try it.
>
> On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
> >
> > Hi Spark Experts,
> >
> >
> >
> > We are trying to deploy spark on Kubernetes.
> >
> > As per doc
> http://spark.apache.org/docs/latest/running-on-kubernetes.html, it looks
> like K8s deployment is experimental.
> >
> > "The Kubernetes scheduler is currently experimental ".
> >
> >
> >
> > Spark 3.0 does not support production deployment using k8s scheduler?
> >
> > What’s the plan on full support of K8s scheduler?
> >
> >
> >
> > Thanks,
> >
> > Vaibhav V
>
>


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Prashant Sharma
Hi,

Whether it is a blocker or not is up to you to decide. But the Spark K8s
cluster supports dynamic allocation through a different mechanism, that is,
without using an external shuffle service:
https://issues.apache.org/jira/browse/SPARK-27963. There are pros and cons to
both approaches. The only disadvantage of scaling without an external shuffle
service is that when the cluster scales down, or loses executors due to some
external cause (for example losing spot instances), we lose the shuffle data
(data that was computed as an intermediate step of some overall computation)
on those executors. This situation may not lead to data loss, as Spark can
recompute the lost shuffle data.

Dynamic scaling, up and down, is helpful when the Spark cluster is running
off "spot instances on AWS", for example, or when the size of the data is not
known in advance. In other words, we cannot estimate how many resources would
be needed to process the data. Dynamic scaling lets the cluster increase its
size based only on the number of pending tasks; currently this is the only
metric implemented.
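
As a sketch, enabling it looks roughly like this (the executor bounds are
arbitrary examples):

    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=20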

I don't think it is a blocker for my production use cases.

Thanks,
Prashant

On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav <
vaibhav.varsh...@siemens.com> wrote:

> Thanks for response. We have tried it in dev env. For production, if Spark
> 3.0 is not leveraging k8s scheduler, then would Spark Cluster in K8s be
> "static"?
> As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is
> still blocker for production workloads?
>
> Thanks,
> Vaibhav V
>
> -Original Message-
> From: Sean Owen 
> Sent: Thursday, July 9, 2020 3:20 PM
> To: Varshney, Vaibhav (DI SW CAS MP AFC ARC)  >
> Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>
> Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
> I haven't used the K8S scheduler personally, but, just based on that
> comment I wouldn't worry too much. It's been around for several versions
> and AFAIK works fine in general. We sometimes aren't so great about
> removing "experimental" labels. That said I know there are still some
> things that could be added to it and more work going on, and maybe people
> closer to that work can comment. But yeah you shouldn't be afraid to try it.
>
> On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
> >
> > Hi Spark Experts,
> >
> >
> >
> > We are trying to deploy spark on Kubernetes.
> >
> > As per doc
> http://spark.apache.org/docs/latest/running-on-kubernetes.html, it looks
> like K8s deployment is experimental.
> >
> > "The Kubernetes scheduler is currently experimental ".
> >
> >
> >
> > Spark 3.0 does not support production deployment using k8s scheduler?
> >
> > What’s the plan on full support of K8s scheduler?
> >
> >
> >
> > Thanks,
> >
> > Vaibhav V
>


Employment opportunities.

2019-06-12 Thread Prashant Sharma
Hi,

My employer (IBM) is interested in hiring people in Hyderabad who are
committers on any Apache project and are interested in Spark and its
ecosystem.
ecosystem.

Thanks,
Prashant.


Spark Streaming RDD Cleanup too slow

2018-09-05 Thread Prashant Sharma
I have a Spark Streaming job which takes too long to delete temp RDDs. I
collect about 4MM telemetry metrics per minute and do minor aggregations in
the streaming job.

I am using Amazon R4 instances. The driver RPC call, although async I believe,
is slow in getting the handle for the future object at the askAsync call.
Here is the Spark code which does the cleanup:
https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L125

Has anyone else encountered a similar issue with their streaming jobs?
About 20% of our time (~60 secs) is spent in cleaning the temp RDDs.
best,
Prashant


Re: Spark Structured Streaming not connecting to Kafka using kerberos

2017-10-26 Thread Prashant Sharma
Hi Darshan,

Did you try passing the config directly as an option, like this:

.option("kafka.sasl.jaas.config", saslConfig)


Where saslConfig can look like:

com.sun.security.auth.module.Krb5LoginModule required \
useKeyTab=true \
storeKey=true  \
keyTab="/etc/security/keytabs/kafka_client.keytab" \
principal="kafka-clien...@example.com";

Reference:
http://kafka.apache.org/documentation.html#security_kerberos_sasl_clientconfig

Thanks,
Prashant.

On Tue, Oct 17, 2017 at 11:21 AM, Darshan Pandya 
wrote:

> HI Burak,
>
> Well turns out it worked fine when i submit in cluster mode. I also tried
> to convert my app in dstreams. In dstreams too it works well only when
> deployed in cluster mode.
>
> Here is how i configured the stream.
>
>
> val lines = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", jobParams.boorstrapServer)
>   .option("subscribe", jobParams.sourceTopic)
>   .option("startingOffsets", "latest")
>   .option("minPartitions", "10")
>   .option("failOnDataLoss", "true")
>   .load()
>
>
>
> Sincerely,
> Darshan
>
> On Mon, Oct 16, 2017 at 12:08 PM, Burak Yavuz  wrote:
>
>> Hi Darshan,
>>
>> How are you creating your kafka stream? Can you please share the options
>> you provide?
>>
>> spark.readStream.format("kafka")
>>   .option(...) // all these please
>>   .load()
>>
>>
>> On Sat, Oct 14, 2017 at 1:55 AM, Darshan Pandya 
>> wrote:
>>
>>> Hello,
>>>
>>> I'm using Spark 2.1.0 on CDH 5.8 with kafka 0.10.0.1 + kerberos
>>>
>>> I am unable to connect to the kafka broker with the following message
>>>
>>>
>>> 17/10/14 14:29:10 WARN clients.NetworkClient: Bootstrap broker
>>> 10.197.19.25:9092 disconnected
>>>
>>> and is unable to consume any messages.
>>>
>>> And am using it as follows
>>>
>>> jaas.conf
>>>
>>> KafkaClient {
>>> com.sun.security.auth.module.Krb5LoginModule required
>>> useKeyTab=true
>>> keyTab="./gandalf.keytab"
>>> storeKey=true
>>> useTicketCache=false
>>> serviceName="kafka"
>>> principal="gand...@domain.com";
>>> };
>>>
>>> $SPARK_HOME/bin/spark-submit \
>>> --master yarn \
>>> --files jaas.conf,gandalf.keytab \
>>> --driver-java-options "-Djava.security.auth.login.config=./jaas.conf 
>>> -Dhdp.version=2.4.2.0-258" \
>>> --conf 
>>> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"
>>>  \
>>> --class com.example.ClassName uber-jar-with-deps-and-hive-site.jar
>>>
>>> Thanks in advance.
>>>
>>>
>>> --
>>> Sincerely,
>>> Darshan
>>>
>>>
>>
>
>
> --
> Sincerely,
> Darshan
>
>


Kafka Spark structured streaming latency benchmark.

2016-12-17 Thread Prashant Sharma
Hi,

The goal of my benchmark is to arrive at an end-to-end latency lower than
100 ms and sustain it over time, by consuming from a Kafka topic and writing
back to another Kafka topic using Spark. Since the job does no aggregation and
does constant-time processing on each message, this appeared to me to be an
achievable target. But then there are some surprising and interesting patterns
to observe.

Basically, it has four components, namely:
1) Kafka
2) A long-running Kafka producer, rate limited to 1000 msgs/sec, with each
message of about 1 KB.
3) A Spark job subscribed to the `test` topic that writes out to another
topic, `output`.
4) A Kafka consumer reading from the `output` topic.

How was the latency measured?

While sending messages from the Kafka producer, each message has embedded in
it the timestamp at which it is pushed to the Kafka `test` topic. Spark
receives each message and writes it out to the `output` topic as is. When
these messages arrive at the Kafka consumer, their embedded time is subtracted
from the time of arrival at the consumer, and a scatter plot of the result is
attached.

The scatter plots sample only 10 minutes of data received during the initial
hour, and then again 10 minutes of data received after 2 hours of running.



These plots indicate a significant slowdown in latency: the later scatter
plot shows almost all messages arriving with a delay larger than 2 seconds,
whereas the first plot shows most messages arriving with less than 100 ms
latency. The two samples were taken roughly 2 hours apart.

After running the test for 24 hours, the jstat and jmap output for the job
indicates the possibility of memory constraints. To be more clear, the job was
run with local[20] and 5 GB of memory (spark.driver.memory). The job is
straightforward and located here:
https://github.com/ScrapCodes/KafkaProducer/blob/master/src/main/scala/com/github/scrapcodes/kafka/SparkSQLKafkaConsumer.scala
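
For reference, one way to confirm GC pressure is to rerun the job with more
driver memory and GC logging enabled; a sketch only (the flags are standard
JVM/Spark options, and the memory value is arbitrary):

    --conf spark.driver.memory=8g \
    --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"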


What is causing the gradual slowdown? I need help in diagnosing the
problem.

Thanks,

--Prashant


Re: If we run sc.textfile(path,xxx) many times, will the elements be the same in each partition

2016-11-10 Thread Prashant Sharma
+user -dev

Since the same hash-based partitioner is in action by default, in my
understanding the same partitioning will happen every time.
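
A small sketch one could run in the shell to check this empirically (the path
and partition count are placeholders):

    val a = sc.textFile("/data/input.txt", 8).glom().collect()
    val b = sc.textFile("/data/input.txt", 8).glom().collect()
    // true only if every partition holds the same elements in the same order
    val identical = a.length == b.length &&
      a.zip(b).forall { case (x, y) => x.sameElements(y) }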

Thanks,

On Nov 10, 2016 7:13 PM, "WangJianfei" 
wrote:

> Hi Devs:
> If  i run sc.textFile(path,xxx) many times, will the elements be the
> same(same element,same order)in each partitions?
> My experiment show that it's the same, but which may not cover all the
> cases. Thank you!
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/If-we-run-sc-
> textfile-path-xxx-many-times-will-the-elements-be-the-same-
> in-each-partition-tp19814.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
>


Re: Large files with wholetextfile()

2016-07-12 Thread Prashant Sharma
Hi Baahu,

That should not be a problem, given that you allocate a sufficient buffer for
reading.

I was just working on implementing a patch[1] to support reading whole text
files in SQL. This can actually be a slightly better approach, because there
we read into off-heap memory for holding the data (using the unsafe
interface).

1. https://github.com/apache/spark/pull/14151
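
For reference, a minimal sketch of the wholeTextFiles() approach being
discussed; the path, the partition hint, and the record-splitting logic are
all placeholders:

    val files = sc.wholeTextFiles("hdfs:///data/xml/*.xml", minPartitions = 200)
    val records = files.flatMap { case (fileName, content) =>
      // placeholder parsing: split the file body into records, keeping the file name
      content.split("</record>").filter(_.contains("<record")).map(r => (fileName, r))
    }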

Thanks,



--Prashant


On Tue, Jul 12, 2016 at 6:24 PM, Bahubali Jain  wrote:

> Hi,
> We have a requirement where in we need to process set of xml files, each
> of the xml files contain several records (eg:
> 
>  data of record 1..
> 
>
> 
> data of record 2..
> 
>
> Expected output is   
>
> Since we needed file name as well in output ,we chose wholetextfile() . We
> had to go against using StreamXmlRecordReader and StreamInputFormat since I
> could not find a way to retreive the filename.
>
> These xml files could be pretty big, occasionally they could reach a size
> of 1GB.Since contents of each file would be put into a single partition,would
> such big files be a issue ?
> The AWS cluster(50 Nodes) that we use is fairly strong , with each machine
> having memory of around 60GB.
>
> Thanks,
> Baahu
>


Re: Streaming K-means not printing predictions

2016-04-26 Thread Prashant Sharma
Since you are reading from a file stream, I would suggest saving the output
to a file instead of printing it. There may be output the first time and then
no data in subsequent iterations.
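
A sketch of that, modeled on the standard StreamingKMeans example (ssc, the
input directories, and the output prefix are placeholders from your own setup;
k and the vector dimension are arbitrary):

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val trainingData = ssc.textFileStream("/data/train").map(Vectors.parse)
    val testData = ssc.textFileStream("/data/test").map(LabeledPoint.parse)

    val model = new StreamingKMeans().setK(3).setDecayFactor(1.0).setRandomCenters(2, 0.0)
    model.trainOn(trainingData)
    // write each batch of predictions out instead of print()
    model.predictOnValues(testData.map(lp => (lp.label, lp.features)))
      .saveAsTextFiles("hdfs:///tmp/streaming-kmeans/predictions")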

Prashant Sharma



On Tue, Apr 26, 2016 at 7:40 PM, Ashutosh Kumar <kmr.ashutos...@gmail.com>
wrote:

> I created a Streaming k means based on scala example. It keeps running
> without any error but never prints predictions
>
> Here is Log
>
> 19:15:05,050 INFO
> org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
> batch metadata: 146167824 ms
> 19:15:10,001 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
> files took 1 ms
> 19:15:10,001 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - New files
> at time 146167831 ms:
>
> 19:15:10,007 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
> files took 2 ms
> 19:15:10,007 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - New files
> at time 146167831 ms:
>
> 19:15:10,014 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Added jobs
> for time 146167831 ms
> 19:15:10,015 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Starting
> job streaming job 146167831 ms.0 from job set of time 146167831 ms
> 19:15:10,028 INFO
> org.apache.spark.SparkContext - Starting
> job: collect at StreamingKMeans.scala:89
> 19:15:10,028 INFO
> org.apache.spark.scheduler.DAGScheduler   - Job 292
> finished: collect at StreamingKMeans.scala:89, took 0.41 s
> 19:15:10,029 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Finished
> job streaming job 146167831 ms.0 from job set of time 146167831 ms
> 19:15:10,029 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Starting
> job streaming job 146167831 ms.1 from job set of time 146167831 ms
> ---
> Time: 146167831 ms
> ---
>
> 19:15:10,036 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Finished
> job streaming job 146167831 ms.1 from job set of time 146167831 ms
> 19:15:10,036 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2912 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2911 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2912
> 19:15:10,037 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Total
> delay: 0.036 s for time 146167831 ms (execution: 0.021 s)
> 19:15:10,037 INFO
> org.apache.spark.rdd.UnionRDD - Removing
> RDD 2800 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2911
> 19:15:10,037 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
> old files that were older than 146167825 ms: 1461678245000 ms
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2917 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2800
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2916 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2915 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2914 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.rdd.UnionRDD - Removing
> RDD 2803 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
> old files that were older than 146167825 ms: 1461678245000 ms
> 19:15:10,038 INFO
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker - Deleting
> batches ArrayBuffer()
> 19:15:10,038 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2917
> 19:15:10,038 INFO
> org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
> batch metadata: 1461678245000 ms
> 19:15:10,038 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2914
> 19:15:10,038 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2916
> 19:15:10,038

Re: Save RDD to HDFS using Spark Python API

2016-04-26 Thread Prashant Sharma
What Davies said is correct: the second argument is Hadoop's output format.
Hadoop supports many types of output formats, and all of them have their own
advantages. Apart from the one specified above,
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html
is one such output format class.


thanks,

Prashant Sharma



On Wed, Apr 27, 2016 at 5:22 AM, Davies Liu <dav...@databricks.com> wrote:

> hdfs://192.168.10.130:9000/dev/output/test already exists, so you need
> to remove it first.
>
> On Tue, Apr 26, 2016 at 5:28 AM, Luke Adolph <kenan3...@gmail.com> wrote:
> > Hi, all:
> > Below is my code:
> >
> > from pyspark import *
> > import re
> >
> > def getDateByLine(input_str):
> > str_pattern = '^\d{4}-\d{2}-\d{2}'
> > pattern = re.compile(str_pattern)
> > match = pattern.match(input_str)
> > if match:
> > return match.group()
> > else:
> > return None
> >
> > file_url = "hdfs://192.168.10.130:9000/dev/test/test.log"
> > input_file = sc.textFile(file_url)
> > line = input_file.filter(getDateByLine).map(lambda x: (x[:10], 1))
> > counts = line.reduceByKey(lambda a,b: a+b)
> > print counts.collect()
> > counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",\
> >
>  "org.apache.hadoop.mapred.SequenceFileOutputFormat")
> >
> >
> > What I confused is the method saveAsHadoopFile,I have read the pyspark
> API,
> > But I still don’t understand the second arg mean
> >
> > Below is the output when I run above code:
> > ```
> >
> > [(u'2016-02-29', 99), (u'2016-03-02', 30)]
> >
> >
> ---
> > Py4JJavaError Traceback (most recent call
> last)
> >  in ()
> >  18 counts = line.reduceByKey(lambda a,b: a+b)
> >  19 print counts.collect()
> > ---> 20
> > counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",
> > "org.apache.hadoop.mapred.SequenceFileOutputFormat")
> >
> > /mydata/softwares/spark-1.6.1/python/pyspark/rdd.pyc in
> > saveAsHadoopFile(self, path, outputFormatClass, keyClass, valueClass,
> > keyConverter, valueConverter, conf, compressionCodecClass)
> >1419  keyClass,
> > valueClass,
> >1420  keyConverter,
> > valueConverter,
> > -> 1421  jconf,
> > compressionCodecClass)
> >1422
> >1423 def saveAsSequenceFile(self, path,
> compressionCodecClass=None):
> >
> >
> /mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
> > in __call__(self, *args)
> > 811 answer = self.gateway_client.send_command(command)
> > 812 return_value = get_return_value(
> > --> 813 answer, self.gateway_client, self.target_id,
> self.name)
> > 814
> > 815 for temp_arg in temp_args:
> >
> > /mydata/softwares/spark-1.6.1/python/pyspark/sql/utils.pyc in deco(*a,
> **kw)
> >  43 def deco(*a, **kw):
> >  44 try:
> > ---> 45 return f(*a, **kw)
> >  46 except py4j.protocol.Py4JJavaError as e:
> >  47 s = e.java_exception.toString()
> >
> >
> /mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
> > in get_return_value(answer, gateway_client, target_id, name)
> > 306 raise Py4JJavaError(
> > 307 "An error occurred while calling
> {0}{1}{2}.\n".
> > --> 308 format(target_id, ".", name), value)
> > 309 else:
> > 310 raise Py4JError(
> >
> > Py4JJavaError: An error occurred while calling
> > z:org.apache.spark.api.python.PythonRDD.saveAsHadoopFile.
> > : org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> > hdfs://192.168.10.130:9000/dev/output/test already exists
> >   at
> >
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
> >   at
> >
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1179)
> >   at
> >
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
> >   at
> >
> org.apa

Re: Choosing an Algorithm in Spark MLib

2016-04-21 Thread Prashant Sharma
As far as I can understand, your requirements are pretty straightforward and
doable with just simple SQL queries. Take a look at Spark SQL in the Spark
documentation.
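
For illustration only, a sketch of such a query; the table and column names
are invented, and it assumes the activation history has been registered as a
temporary table:

    // e.g. activationsDF.registerTempTable("activations")
    val ranked = sqlContext.sql("""
      SELECT unit_id,
             COUNT(*)                                   AS total_activations,
             SUM(CASE WHEN had_issue THEN 1 ELSE 0 END) AS failed_activations
      FROM activations
      GROUP BY unit_id
      ORDER BY total_activations DESC, failed_activations ASC
    """)
    ranked.show()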

Prashant Sharma



On Tue, Apr 12, 2016 at 8:13 PM, Joe San <codeintheo...@gmail.com> wrote:

> <http://datascience.stackexchange.com/questions/11167/algorithm-suggestion-for-a-specific-problem/11174?noredirect=1#>
>
> I'm working on a problem where in I have some data sets about some power
> generating units. Each of these units have been activated to run in the
> past and while activation, some units went into some issues. I now have all
> these data and I would like to come up with some sort of Ranking for these
> generating units. The criteria for ranking would be pretty simple to start
> with. They are:
>
>1. Maximum number of times a particular generating unit was activated
>2. How many times did the generating unit ran into problems during
>activation
>
> Later on I would expand on this ranking algorithm by adding more criteria.
> I will be using Apache Spark MLIB library and I can already see that there
> are quite a few algorithms already in place.
>
> http://spark.apache.org/docs/latest/mllib-guide.html
>
> I'm just not sure which algorithm would fit my purpose. Any suggestions?
>


Re: Spark streaming batch time displayed is not current system time but it is processing current messages

2016-04-19 Thread Prashant Sharma
This can happen if the system time is not in sync. By default, streaming uses
SystemClock (it also supports ManualClock), and that relies on
System.currentTimeMillis() for determining the start time.

Prashant Sharma



On Sat, Apr 16, 2016 at 10:09 PM, Hemalatha A <
hemalatha.amru...@googlemail.com> wrote:

> Can anyone help me in debugging  this issue please.
>
>
> On Thu, Apr 14, 2016 at 12:24 PM, Hemalatha A <
> hemalatha.amru...@googlemail.com> wrote:
>
>> Hi,
>>
>> I am facing a problem in Spark streaming.
>>
>
>
> Time: 1460823006000 ms
> ---
>
> ---
> Time: 1460823008000 ms
> ---
>
>
>
>
>> The time displayed in Spark streaming console as above is 4 days prior
>> i.e.,  April 10th, which is not current system time of the cluster  but the
>> job is processing current messages that is pushed right now April 14th.
>>
>> Can anyone please advice what time does Spark streaming display? Also,
>> when there  is scheduling delay of say 8 hours, what time does Spark
>> display- current rime or   hours behind?
>>
>> --
>>
>>
>> Regards
>> Hemalatha
>>
>
>
>
> --
>
>
> Regards
> Hemalatha
>


Re: [Spark 1.5.2] Log4j Configuration for executors

2016-04-19 Thread Prashant Sharma
Maybe you can try creating it before running the app.


Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-18 Thread Prashant Sharma
Hello Deepak,

It is not clear what you want to do. Are you talking about Spark Streaming?
It is possible to process historical data in Spark batch mode too. You can add
a timestamp field to the XML/JSON. The Spark documentation is at
spark.apache.org. Spark has good built-in features for processing JSON and
XML[1] messages.
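
For example, a rough sketch of reading timestamped JSON in batch mode (the
path, field names, and timestamp format are assumptions):

    import sqlContext.implicits._
    val events = sqlContext.read.json("hdfs:///data/incoming/*.json")
    events.filter($"eventTime" >= "2016-04-18T00:00:00")
      .groupBy($"messageType")
      .count()
      .show()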

Thanks,
Prashant Sharma

1. https://github.com/databricks/spark-xml

On Tue, Apr 19, 2016 at 10:31 AM, Deepak Sharma <deepakmc...@gmail.com>
wrote:

> Hi all,
> I am looking for an architecture to ingest 10 mils of messages in the
> micro batches of seconds.
> If anyone has worked on similar kind of architecture  , can you please
> point me to any documentation around the same like what should be the
> architecture , which all components/big data ecosystem tools should i
> consider etc.
> The messages has to be in xml/json format , a preprocessor engine or
> message enhancer and then finally a processor.
> I thought about using data cache as well for serving the data
> The data cache should have the capability to serve the historical  data in
> milliseconds (may be upto 30 days of data)
> --
> Thanks
> Deepak
> www.bigdatabig.com
>
>


Re: Renaming sc variable in sparkcontext throws task not serializable

2016-03-02 Thread Prashant Sharma
This is a known issue:
https://issues.apache.org/jira/browse/SPARK-3200


Prashant Sharma



On Thu, Mar 3, 2016 at 9:01 AM, Rahul Palamuttam <rahulpala...@gmail.com>
wrote:

> Thank you Jeff.
>
> I have filed a JIRA under the following link :
>
> https://issues.apache.org/jira/browse/SPARK-13634
>
> For some reason the spark context is being pulled into the referencing
> environment of the closure.
> I also had no problems with batch jobs.
>
> On Wed, Mar 2, 2016 at 7:18 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> I can reproduce it in spark-shell. But it works for batch job. Looks like
>> spark repl issue.
>>
>> On Thu, Mar 3, 2016 at 10:43 AM, Rahul Palamuttam <rahulpala...@gmail.com
>> > wrote:
>>
>>> Hi All,
>>>
>>> We recently came across this issue when using the spark-shell and
>>> zeppelin.
>>> If we assign the sparkcontext variable (sc) to a new variable and
>>> reference
>>> another variable in an RDD lambda expression we get a task not
>>> serializable exception.
>>>
>>> The following three lines of code illustrate this :
>>>
>>> val temp = 10
>>> val newSC = sc
>>> val new RDD = newSC.parallelize(0 to 100).map(p => p + temp).
>>>
>>> I am not sure if this is a known issue, or we should file a JIRA for it.
>>> We originally came across this bug in the SciSpark project.
>>>
>>> Best,
>>>
>>> Rahul P
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Prashant Sharma
You are right, this needs to be done. I can work on it soon; I was not sure
if there was anyone even using the Scala 2.11 Spark REPL. Actually, there is a
patch in the Scala 2.10 shell to support adding jars (I lost the JIRA ID),
which has to be ported for Scala 2.11 too. If, however, you (or anyone else)
are planning to work on it, I can help.
Prashant Sharma



On Thu, Apr 9, 2015 at 3:08 PM, anakos ana...@gmail.com wrote:

 Hi-

 I am having difficulty getting the 1.3.0 Spark shell to find an external
 jar.  I have build Spark locally for Scala 2.11 and I am starting the REPL
 as follows:

 bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar

 I see the following line in the console output:

 15/04/09 09:52:15 INFO spark.SparkContext: Added JAR

 file:/opt/spark/spark-1.3.0_2.11-hadoop2.3/data-api-es-data-export-4.0.0.jar
 at http://192.168.115.31:54421/jars/data-api-es-data-export-4.0.0.jar with
 timestamp 1428569535904

 but when i try to import anything from this jar, it's simply not available.
 When I try to add the jar manually using the command

 :cp /path/to/jar

 the classes in the jar are still unavailable. I understand that 2.11 is not
 officially supported, but has anyone been able to get an external jar
 loaded
 in the 1.3.0 release?  Is this a known issue? I have tried searching around
 for answers but the only thing I've found that may be related is this:

 https://issues.apache.org/jira/browse/SPARK-3257

 Any/all help is much appreciated.
 Thanks
 Alex



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/External-JARs-not-loading-Spark-Shell-Scala-2-11-tp22434.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





UnsatisfiedLinkError related to libgfortran when running MLLIB code on RHEL 5.8

2015-03-03 Thread Prashant Sharma
Hi Folks,

We are trying to run the following code from the spark shell in a CDH 5.3
cluster running on RHEL 5.8.

spark-shell --master yarn --deploy-mode client --num-executors 15 --executor-cores 6 --executor-memory 12G

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val users_item_score_clean = sc.textFile("/tmp/spark_mllib_test").map(_.split(","))
val ratings = users_item_score_clean.map(x => Rating(x(0).toInt, x(1).toInt, x(2).toDouble))
val rank = 10
val numIterations = 20
val alpha = 1.0
val lambda = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)



We are getting the following error (detailed error is attached in
error.log):


-- org.jblas ERROR Couldn't load copied link file: java.lang.UnsatisfiedLinkError:
/u08/hadoop/yarn/nm/usercache/sharma.p/appcache/application_1425015707226_0128/container_e12_1425015707226_0128_01_10/tmp/jblas7605371780736016929libjblas_arch_flavor.so:
libgfortran.so.3: cannot open shared object file: No such file or directory.

On Linux 64bit, you need additional support libraries.
You need to install libgfortran3.

For example for debian or Ubuntu, type sudo apt-get install libgfortran3

For more information, see https://github.com/mikiobraun/jblas/wiki/Missing-Libraries
15/03/02 14:50:25 ERROR executor.Executor: Exception in task 22.0 in stage 6.0 (TID 374)
java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dposv(CII[DII[DII)I
at org.jblas.NativeBlas.dposv(Native Method)
at org.jblas.SimpleBlas.posv(SimpleBlas.java:369)


This exact code runs fine on another CDH 5.3 cluster which runs on RHEL
6.5.

libgfortran.so.3 is not present on the problematic cluster.

[root@node04 ~]# find / -name libgfortran.so.3 2>/dev/null


I am able to find libgfortran.so.3 on the cluster where the above code
works:

[root@workingclusternode04 ~]# find / -name libgfortran.so.3 2>/dev/null
/usr/lib64/libgfortran.so.3



The following output shows that the fortran packages are installed on both
the clusters:

On the cluster where this is not working:

[root@node04 ~]# yum list | grep -i fortran
gcc-gfortran.x86_64 4.1.2-52.el5_8.1 installed
libgfortran.i386 4.1.2-52.el5_8.1 installed
libgfortran.x86_64 4.1.2-52.el5_8.1 installed


On the cluster where the spark job is working:

[root@workingclusternode04 ~]# yum list | grep -i fortran
Repository 'bda' is missing name in configuration, using id
compat-libgfortran-41.x86_64 4.1.2-39.el6 @bda
gcc-gfortran.x86_64 4.4.7-4.el6 @bda
libgfortran.x86_64 4.4.7-4.el6 @bda


Has anybody run into this? Any pointers are much appreciated.

Regards,
Prashant
-- org.jblas ERROR Couldn't load copied link file: 
java.lang.UnsatisfiedLinkError: 
/u08/hadoop/yarn/nm/usercache/sharma.p/appcache/application_1425015707226_0128/container_e12_1425015707226_0128_01_10/tmp/jblas7605371780736016929libjblas_arch_flavor.so:
 libgfortran.so.3: cannot open shared object file: No such file or directory.

On Linux 64bit, you need additional support libraries.
You need to install libgfortran3.

For example for debian or Ubuntu, type sudo apt-get install libgfortran3

For more information, see 
https://github.com/mikiobraun/jblas/wiki/Missing-Libraries
15/03/02 14:50:25 ERROR executor.Executor: Exception in task 22.0 in stage 6.0 
(TID 374)
java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dposv(CII[DII[DII)I
at org.jblas.NativeBlas.dposv(Native Method)
at org.jblas.SimpleBlas.posv(SimpleBlas.java:369)
at org.jblas.Solve.solvePositive(Solve.java:68)
at 
org.apache.spark.mllib.recommendation.ALS.solveLeastSquares(ALS.scala:607)
at 
org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$2.apply(ALS.scala:593)
at 
org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$2.apply(ALS.scala:581)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:156)
at 
org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:581)
at 
org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:510)
at 

Re: Bind Exception

2015-01-19 Thread Prashant Sharma
Deep, yes, you have another Spark shell or application sticking around
somewhere. Try to inspect the running processes, look out for the Java
process, and kill it. This might be helpful:
https://www.digitalocean.com/community/tutorials/how-to-use-ps-kill-and-nice-to-manage-processes-in-linux



Also, that is just a warning. FYI, Spark ignores the BindException, probes
for the next available port, and continues. So your application is fine if
that particular error comes up.
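
If the warning gets noisy, the starting port and the number of retries are
configurable; a sketch (the values are arbitrary):

    --conf spark.ui.port=4050 --conf spark.port.maxRetries=32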

Prashant Sharma



On Tue, Jan 20, 2015 at 10:30 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Yes, I have increased the driver memory in spark-default.conf to 2g. Still
 the error persists.

 On Tue, Jan 20, 2015 at 10:18 AM, Ted Yu yuzhih...@gmail.com wrote:

 Have you seen these threads ?

 http://search-hadoop.com/m/JW1q5tMFlb
 http://search-hadoop.com/m/JW1q5dabji1

 Cheers

 On Mon, Jan 19, 2015 at 8:33 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi Ted,
 When I am running the same job with small data, I am able to run. But
 when I run it with relatively bigger set of data, it is giving me
 OutOfMemoryError: GC overhead limit exceeded.
 The first time I run the job, no output. When I run for second time, I
 am getting this error. I am aware that, the memory is getting full, but is
 there any way to avoid this?
 I have a single node Spark cluster.

 Thank You

 On Tue, Jan 20, 2015 at 9:52 AM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 I had the Spark Shell running through out. Is it because of that?

 On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote:

 Was there another instance of Spark running on the same machine ?

 Can you pastebin the full stack trace ?

 Cheers

 On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 Hi,
 I am running a Spark job. I get the output correctly but when I see
 the logs file I see the following:
 AbstractLifeCycle: FAILED.: java.net.BindException: Address
 already in use...

 What could be the reason for this?

 Thank You









Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Prashant Sharma
It is safe in the sense that we would help you with a fix if you run into
issues. I have used it, but since I worked on the patch, my opinion may be
biased. I am using Scala 2.11 for day-to-day development. You should check out
the build instructions here:
https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md

Prashant Sharma



On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Any notable issues for using Scala 2.11? Is it stable now?

 Or can I use Scala 2.11 in my spark application and use Spark dist build
 with 2.10 ?

 I'm looking forward to migrate to 2.11 for some quasiquote features.
 Couldn't make it run in 2.10...

 Cheers,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Prashant Sharma
Looks like sbt/sbt -Pscala-2.11 is broken by a recent patch for improving the
Maven build.

Prashant Sharma



On Tue, Nov 18, 2014 at 12:57 PM, Prashant Sharma scrapco...@gmail.com
wrote:

 It is safe in the sense we would help you with the fix if you run into
 issues. I have used it, but since I worked on the patch the opinion can be
 biased. I am using scala 2.11 for day to day development. You should
 checkout the build instructions here :
 https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md

 Prashant Sharma



 On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Any notable issues for using Scala 2.11? Is it stable now?

 Or can I use Scala 2.11 in my spark application and use Spark dist build
 with 2.10 ?

 I'm looking forward to migrate to 2.11 for some quasiquote features.
 Couldn't make it run in 2.10...

 Cheers,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-29 Thread Prashant Sharma
Yes, we shade Akka to change its protobuf version (if I am not wrong), so
binary compatibility with other Akka modules is compromised. One thing you can
try is using the Akka from org.spark-project.akka. I have not tried this and
am not sure it will help you, but maybe you could exclude the Akka that Spray
depends on and use the Akka that Spark depends on.
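
A sketch of the exclusion idea in sbt (the Spray coordinates are only an
example; double-check the module names for your Spray version):

    libraryDependencies += "io.spray" %% "spray-client" % "1.3.2" excludeAll (
      ExclusionRule(organization = "com.typesafe.akka")
    )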

Prashant Sharma



On Wed, Oct 29, 2014 at 9:27 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 I'm using Spark built from HEAD, I think it uses modified Akka 2.3.4,
 right?

 Jianshi

 On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller moham...@glassbeam.com
 wrote:

  Try a version built with Akka 2.2.x



 Mohammed



 *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
 *Sent:* Tuesday, October 28, 2014 3:03 AM
 *To:* user
 *Subject:* Spray client reports Exception:
 akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext



 Hi,



 I got the following exceptions when using Spray client to write to
 OpenTSDB using its REST API.



   Exception in thread pool-10-thread-2 java.lang.NoSuchMethodError:
 akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext;



 It worked locally in my Intellij but failed when I launch it from
 Spark-submit.



 Google suggested it's a compatibility issue in Akka. And I'm using latest
 Spark built from the HEAD, so the Akka used in Spark-submit is 2.3.4-spark.



 I tried both Spray 1.3.2 (built for Akka 2.3.6) and 1.3.1 (built for
 2.3.4). Both failed with the same exception.



 Anyone has idea what went wrong? Need help!



 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: Spark SQL reduce number of java threads

2014-10-28 Thread Prashant Sharma
What is the motivation behind this?

You can start with the master set to local[NO_OF_THREADS]. Reducing the
threads in all other places can have unexpected results. Take a look at
http://spark.apache.org/docs/latest/configuration.html.
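
For example (the thread count is illustrative), this limits task execution to
four threads, although internal pools such as the web UI and RPC threads are
sized by the separate options on that page:

    bin/spark-shell --master local[4]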

Prashant Sharma



On Tue, Oct 28, 2014 at 2:08 PM, Wanda Hawk wanda_haw...@yahoo.com.invalid
wrote:

 Hello

 I am trying to reduce the number of java threads (about 80 on my system)
 to as few as there can be.
 What settings can be done in spark-1.1.0/conf/spark-env.sh ? (or other
 places as well)
 I am also using hadoop for storing data on hdfs

 Thank you,
 Wanda



Re: unable to make a custom class as a key in a pairrdd

2014-10-23 Thread Prashant Sharma
Are you doing this in the REPL? There is a bug filed for this; I just
can't recall the bug ID at the moment.
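
One workaround that usually helps with REPL-defined key classes (an assumption
on my part, not a confirmed fix for that particular bug) is to compile the key
class into a jar and put it on the shell's classpath instead of defining it at
the prompt:

    // PersonID.scala, compiled into mykeys.jar rather than typed into the shell
    case class PersonID(id: String)

    // then start the shell with: spark-shell --jars mykeys.jar
    // and use the shell's sc as usual
    val p = sc.parallelize((1 until 10).map(x => (PersonID("1"), x)))
    p.groupByKey.collect foreach println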

Prashant Sharma



On Fri, Oct 24, 2014 at 4:07 AM, Niklas Wilcke 
1wil...@informatik.uni-hamburg.de wrote:

  Hi Jao,

 I don't really know why this doesn't work, but I have two hints.
 You don't need to override hashCode and equals; the case modifier does
 that for you. Writing

 case class PersonID(id: String)

 would be enough to get the class you want, I think.
 If I change the type of the id param to Int it works for me, but I don't
 know why.

 case class PersonID(id: Int)

 Looks like strange behavior to me. Have a try.

 Good luck,
 Niklas


 On 23.10.2014 21:52, Jaonary Rabarisoa wrote:

  Hi all,

  I have the following case class that I want to use as a key in a
 key-value rdd. I defined the equals and hashCode methode but it's not
 working. What I'm doing wrong ?

  case class PersonID(id: String) {
    override def hashCode = id.hashCode
    override def equals(other: Any) = other match {
      case that: PersonID => this.id == that.id && this.getClass == that.getClass
      case _ => false
    }
  }

  val p = sc.parallelize((1 until 10).map(x => (PersonID("1"), x)))

  p.groupByKey.collect foreach println

  (PersonID(1),CompactBuffer(5))
  (PersonID(1),CompactBuffer(6))
  (PersonID(1),CompactBuffer(7))
  (PersonID(1),CompactBuffer(8, 9))
  (PersonID(1),CompactBuffer(1))
  (PersonID(1),CompactBuffer(2))
  (PersonID(1),CompactBuffer(3))
  (PersonID(1),CompactBuffer(4))


  Best,

  Jao





Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Prashant Sharma
[Removing dev lists]

You are absolutely correct about that.

Prashant Sharma



On Tue, Oct 14, 2014 at 5:03 PM, Priya Ch learnings.chitt...@gmail.com
wrote:

 Hi Spark users/experts,

 In the Spark source code (Master.scala & Worker.scala), when registering the
 worker with the master, I see the usage of *persistenceEngine*. When we don't
 specify spark.deploy.recoveryMode explicitly, what is the default value
 used? This recovery mode is used to persist and restore the application &
 worker details.

 I see that when the recovery mode is not specified explicitly,
 *BlackHolePersistenceEngine* is being used. Am I right?


 Thanks,
 Padma Ch



Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Prashant Sharma
So if you need those features, you can go ahead and set up one of the
FILESYSTEM or ZOOKEEPER options. Please take a look at:
http://spark.apache.org/docs/latest/spark-standalone.html.
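
For example, for a standalone master this is configured through
SPARK_DAEMON_JAVA_OPTS in spark-env.sh (a sketch; the ZooKeeper URL and the
recovery directory below are placeholders):

    # ZooKeeper-based recovery
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

    # or single-node FILESYSTEM recovery
    # SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
    #   -Dspark.deploy.recoveryDirectory=/var/spark/recovery"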

Prashant Sharma



On Wed, Oct 15, 2014 at 3:25 PM, Chitturi Padma 
learnings.chitt...@gmail.com wrote:

 which means the details are not persisted, and hence any failures in the
 workers or master wouldn't start the daemons normally... right?

 On Wed, Oct 15, 2014 at 12:17 PM, Prashant Sharma [via Apache Spark User
 List] [hidden email] wrote:

 [Removing dev lists]

 You are absolutely correct about that.

 Prashant Sharma



  On Tue, Oct 14, 2014 at 5:03 PM, Priya Ch [hidden email] wrote:

 Hi Spark users/experts,

  In the Spark source code (Master.scala & Worker.scala), when registering
  the worker with the master, I see the usage of *persistenceEngine*. When we
  don't specify spark.deploy.recoveryMode explicitly, what is the default
  value used? This recovery mode is used to persist and restore the
  application & worker details.

  I see that when the recovery mode is not specified explicitly,
  *BlackHolePersistenceEngine* is being used. Am I right?


 Thanks,
 Padma Ch





 --
 View this message in context: Re: Default spark.deploy.recoveryMode
 http://apache-spark-user-list.1001560.n3.nabble.com/Default-spark-deploy-recoveryMode-tp16375p16483.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: Nested Case Classes (Found and Required Same)

2014-09-12 Thread Prashant Sharma
What is your Spark version? This was fixed, I suppose. Can you try it with the
latest release?
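
If upgrading alone does not help, one thing worth trying (an assumption on my
part, not a verified fix for your exact version) is the shell's :paste mode, so
that the case classes and the code using them are compiled as one unit:

    scala> :paste
    // Entering paste mode (ctrl-D to finish)

    case class ColumnSet(C1: Double, C2: Double, C3: Double)
    case class InputData(EQN: String, ts: String, Set1: ColumnSet, Set2: ColumnSet)
    val set1 = ColumnSet(1, 2, 3)
    val a = InputData("a", "a", set1, set1)

    // Exiting paste mode, now interpreting.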

Prashant Sharma



On Fri, Sep 12, 2014 at 9:47 PM, Ramaraju Indukuri iramar...@gmail.com
wrote:

 This is only a problem in the shell; it works fine in batch mode. I am
 also interested in how others are working around the case class
 limitation on the number of variables.

 Regards
 Ram

 On Fri, Sep 12, 2014 at 12:12 PM, iramaraju iramar...@gmail.com wrote:

 I think this is a popular issue, but I need help figuring out a way around it
 if this issue is unresolved. I have a dataset that has more than 70 columns.
 To have all the columns fit into my RDD, I am experimenting with the
 following. (I intend to use InputData to parse the file and have 3 or 4
 column sets to accommodate the full list of variables.)

 case class ColumnSet(C1: Double, C2: Double, C3: Double)
 case class InputData(EQN: String, ts: String, Set1: ColumnSet, Set2: ColumnSet)

 val set1 = ColumnSet(1, 2, 3)
 val a = InputData("a", "a", set1, set1)

 returns the following

 <console>:16: error: type mismatch;
  found   : ColumnSet
  required: ColumnSet
        val a = InputData("a", "a", set1, set1)

 Whereas the same code works fine in my Scala console.

 Is there a workaround for my problem?

 Regards
 Ram



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Nested-Case-Classes-Found-and-Required-Same-tp14096.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 --
 Ramaraju Indukuri



Re: .sparkrc for Spark shell?

2014-09-03 Thread Prashant Sharma
Hey,

You can use spark-shell -i sparkrc to do this.
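
For example, something along these lines as the init file (a sketch assuming a
1.x-style shell where sc is already defined; adjust the imports to taste):

    // ~/.sparkrc, loaded with: spark-shell -i ~/.sparkrc
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    implicit val sqlContext = new SQLContext(sc)
    import sqlContext._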

Prashant Sharma




On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 To make my shell experience merrier, I need to import several packages,
 and define implicit sparkContext and sqlContext.

 Is there a startup file (e.g. ~/.sparkrc) that Spark shell will load when
 it's started?


 Cheers,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-31 Thread Prashant Sharma
This looks like a bug to me. It happens because we serialize the code that
starts the receiver and send it across, and since the Akka library's classes
have not been registered, it does not work. I have not tried it myself, but
including something like chill-akka
(https://github.com/xitrum-framework/chill-akka) might help. I am not well
aware of how Kryo works internally; maybe someone else can throw some light on
this.
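
For the general mechanism, this is roughly how extra classes get registered
with Kryo in Spark (a sketch only; Message is a made-up application class, and
whether chill-akka's serializers can be wired in this way is exactly the part I
have not tried):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class Message(text: String)

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // application classes are easy; Akka-internal classes are the hard part
        kryo.register(classOf[Message])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // use the fully qualified class name if MyRegistrator lives in a package
      .set("spark.kryo.registrator", "MyRegistrator")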

Prashant Sharma




On Sat, Jul 26, 2014 at 6:26 AM, Alan Ngai a...@opsclarity.com wrote:

 The stack trace was from running the Actor count sample directly, without
 a spark cluster, so I guess the logs would be from both?  I enabled more
 logging and got this stack trace

 4/07/25 17:55:26 [INFO] SecurityManager: Changing view acls to: alan
  14/07/25 17:55:26 [INFO] SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(alan)
  14/07/25 17:55:26 [DEBUG] AkkaUtils: In createActorSystem, requireCookie
 is: off
  14/07/25 17:55:26 [INFO] Slf4jLogger: Slf4jLogger started
  14/07/25 17:55:27 [INFO] Remoting: Starting remoting
  14/07/25 17:55:27 [INFO] Remoting: Remoting started; listening on
 addresses :[akka.tcp://spark@leungshwingchun:52156]
  14/07/25 17:55:27 [INFO] Remoting: Remoting now listens on addresses: [
 akka.tcp://spark@leungshwingchun:52156]
  14/07/25 17:55:27 [INFO] SparkEnv: Registering MapOutputTracker
  14/07/25 17:55:27 [INFO] SparkEnv: Registering BlockManagerMaster
  14/07/25 17:55:27 [DEBUG] DiskBlockManager: Creating local directories at
 root dirs '/var/folders/fq/fzcyqkcx7rgbycg4kr8z3m18gn/T/'
  14/07/25 17:55:27 [INFO] DiskBlockManager: Created local directory at
 /var/folders/fq/fzcyqkcx7rgbycg4kr8z3m18gn/T/spark-local-20140725175527-32f2
  14/07/25 17:55:27 [INFO] MemoryStore: MemoryStore started with capacity
 297.0 MB.
  14/07/25 17:55:27 [INFO] ConnectionManager: Bound socket to port 52157
 with id = ConnectionManagerId(leungshwingchun,52157)
  14/07/25 17:55:27 [INFO] BlockManagerMaster: Trying to register
 BlockManager
  14/07/25 17:55:27 [INFO] BlockManagerInfo: Registering block manager
 leungshwingchun:52157 with 297.0 MB RAM
  14/07/25 17:55:27 [INFO] BlockManagerMaster: Registered BlockManager
  14/07/25 17:55:27 [INFO] HttpServer: Starting HTTP Server
  14/07/25 17:55:27 [DEBUG] HttpServer: HttpServer is not using security
  14/07/25 17:55:27 [INFO] HttpBroadcast: Broadcast server started at
 http://192.168.1.233:52158
  14/07/25 17:55:27 [INFO] HttpFileServer: HTTP File server directory is
 /var/folders/fq/fzcyqkcx7rgbycg4kr8z3m18gn/T/spark-5254ba11-037b-4761-b92a-3b18d42762de
  14/07/25 17:55:27 [INFO] HttpServer: Starting HTTP Server
  14/07/25 17:55:27 [DEBUG] HttpServer: HttpServer is not using security
  14/07/25 17:55:27 [DEBUG] HttpFileServer: HTTP file server started at:
 http://192.168.1.233:52159
  14/07/25 17:55:27 [INFO] SparkUI: Started SparkUI at
 http://leungshwingchun:4040
  14/07/25 17:55:27 [DEBUG] MutableMetricsFactory: field
 org.apache.hadoop.metrics2.lib.MutableRate
 org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess
 with annotation
 @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=,
 value=[Rate of successful kerberos logins and latency (milliseconds)],
 always=false, type=DEFAULT, sampleName=Ops)
  14/07/25 17:55:27 [DEBUG] MutableMetricsFactory: field
 org.apache.hadoop.metrics2.lib.MutableRate
 org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure
 with annotation
 @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=,
 value=[Rate of failed kerberos logins and latency (milliseconds)],
 always=false, type=DEFAULT, sampleName=Ops)
  14/07/25 17:55:27 [DEBUG] MetricsSystemImpl: UgiMetrics, User and group
 related metrics
  2014-07-25 17:55:27.796 java[79107:1703] Unable to load realm info from
 SCDynamicStore
 14/07/25 17:55:27 [DEBUG] KerberosName: Kerberos krb5 configuration not
 found, setting default realm to empty
  14/07/25 17:55:27 [DEBUG] Groups:  Creating new Groups object
  14/07/25 17:55:27 [DEBUG] NativeCodeLoader: Trying to load the
 custom-built native-hadoop library...
  14/07/25 17:55:27 [DEBUG] NativeCodeLoader: Failed to load native-hadoop
 with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
  14/07/25 17:55:27 [DEBUG] NativeCodeLoader: java.library.path=
  14/07/25 17:55:27 [WARN] NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
  14/07/25 17:55:27 [DEBUG] JniBasedUnixGroupsMappingWithFallback: Falling
 back to shell based
  14/07/25 17:55:27 [DEBUG] JniBasedUnixGroupsMappingWithFallback: Group
 mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
  14/07/25 17:55:27 [DEBUG] Groups: Group mapping
 impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
 cacheTimeout=30
  14/07/25 17:55:28 [INFO] SparkContext: Added JAR
 file

Re: Emacs Setup Anyone?

2014-07-26 Thread Prashant Sharma
Normally any setup that has an inferior mode for the Scala REPL will also
support the Spark REPL (with little or no modification).

Apart from that, I personally use the Spark REPL by invoking spark-shell in a
shell inside Emacs, and I keep the Scala tags (etags) for the Spark sources
loaded. With this setup it is reasonably fast to do tag prediction at point;
it is not always accurate, but it is useful.

In case you are working on building this (an inferior mode for the Spark REPL)
for us, I can come up with a wishlist.



Prashant Sharma


On Sat, Jul 26, 2014 at 3:07 AM, Andrei faithlessfri...@gmail.com wrote:

 I have never tried Spark REPL from within Emacs, but I remember that
 switching from normal Python to Pyspark was as simple as changing
 interpreter name at the beginning of session. Seems like ensime [1]
 (together with ensime-emacs [2]) should be a good point to start. For
 example, take a look at ensime-sbt.el [3] that defines a number of
 Scala/SBT commands.

 [1]: https://github.com/ensime/ensime-server
 [2]: https://github.com/ensime/ensime-emacs
 [3]: https://github.com/ensime/ensime-emacs/blob/master/ensime-sbt.el




 On Thu, Jul 24, 2014 at 10:14 PM, Steve Nunez snu...@hortonworks.com
 wrote:

 Anyone out there have a good configuration for emacs? Scala-mode sort of
 works, but I’d love to see a fully-supported spark-mode with an inferior
 shell. Searching didn’t turn up much of anything.

 Any emacs users out there? What setup are you using?

 Cheers,
 - SteveN



 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity
 to which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.





Re: ZeroMQ Stream - stack guard problem and no data

2014-06-04 Thread Prashant Sharma
Hi,

What is your ZeroMQ version? It is known to work well with 2.2.

The output of `sudo ldconfig -v | grep zmq` would be helpful in this regard.

Thanks

Prashant Sharma


On Wed, Jun 4, 2014 at 11:40 AM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 I am trying to use Spark Streaming (1.0.0) with ZeroMQ, i.e. I say

  def bytesToStringIterator(x: Seq[ByteString]) =
    (x.map(_.utf8String)).iterator
  val lines: DStream[String] =
    ZeroMQUtils.createStream(ssc,
      "tcp://localhost:5556",
      Subscribe("mytopic"),
      bytesToStringIterator _)
  lines.print()

 but when I start this program (in local mode), I get

 OpenJDK 64-Bit Server VM warning: You have loaded library
 /tmp/jna2713405829859698528.tmp which might have disabled stack guard.
 The VM will try to fix the stack guard now.
 It's highly recommended that you fix the library with 'execstack -c
 libfile', or link it with '-z noexecstack'.

 and no data is received. The ZeroMQ setup should be ok, though; the Python
 code

  import time
  import zmq

  context = zmq.Context()
  socket = context.socket(zmq.SUB)
  socket.setsockopt(zmq.SUBSCRIBE, "mytopic")
  socket.connect("tcp://localhost:5556")
  while True:
      msg = socket.recv()
      print msg
      time.sleep(1)

 works fine and prints the messages issued by the publisher.

 Any suggestions on what is going wrong here?

 Thanks
 Tobias



Re: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread Prashant Sharma
You will need to change the sbt version to 0.13.2. I think Spark 0.9.1 was
released with sbt 0.13? In case not, then it may not work with Java 8. Just
wait for the 1.0 release or give a 1.0 release candidate a try!

http://mail-archives.apache.org/mod_mbox/spark-dev/201404.mbox/%3CCABPQxstL6nwTO2H9p8%3DGJh1g2zxOJd02Wt7L06mCLjo-vwwG9Q%40mail.gmail.com%3E
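
Where the sbt version is pinned depends on the release: in a stock sbt project
it lives in project/build.properties, while the Spark source tree also pins the
launcher jar version inside the sbt/sbt script (the sbt-launch-0.12.4.jar in
your log). As a sketch:

    # project/build.properties
    sbt.version=0.13.2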

Prashant Sharma


On Fri, May 2, 2014 at 3:56 PM, N.Venkata Naga Ravi nvn_r...@hotmail.comwrote:


 Hi,


 I am tyring to build Apache Spark with Java 8 in my Mac system ( OS X
 10.8.5) , but getting following exception.
 Please help on resolving it.


 dhcp-173-39-68-28:spark-0.9.1 neravi$ java -version
 java version 1.8.0
 Java(TM) SE Runtime Environment (build 1.8.0-b132)
 Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
 dhcp-173-39-68-28:spark-0.9.1 neravi$ ./sbt/sbt assembly
 Launching sbt from sbt/sbt-launch-0.12.4.jar
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
 MaxPermSize=350m; support was removed in 8.0
 [info] Loading project definition from
 /Applications/spark-0.9.1/project/project

 *[info] Compiling 1 Scala source to
 /Applications/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes...
 [error] error while loading CharSequence, class file
 '/Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/rt.jar(java/lang/CharSequence.class)'
 is broken*
 [error] (bad constant pool tag 18 at byte 10)
 [error] error while loading Comparator, class file
 '/Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/rt.jar(java/util/Comparator.class)'
 is broken
 [error] (bad constant pool tag 18 at byte 20)
 [error] two errors found
 [error] (compile:compile) Compilation failed
 Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? r
 [info] Loading project definition from
 /Applications/spark-0.9.1/project/project
 [info] Compiling 1 Scala source to
 /Applications/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes...
 [error] error while loading CharSequence, class file
 '/Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/rt.jar(java/lang/CharSequence.class)'
 is broken
 [error] (bad constant pool tag 18 at byte 10)
 [error] error while loading Comparator, class file
 '/Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/rt.jar(java/util/Comparator.class)'
 is broken
 [error] (bad constant pool tag 18 at byte 20)
 [error] two errors found
 [error] (compile:compile) Compilation failed
 Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q


 Thanks,
 Ravi



Re: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread Prashant Sharma
I have pasted the link in my previous post.

Prashant Sharma


On Fri, May 2, 2014 at 4:15 PM, N.Venkata Naga Ravi nvn_r...@hotmail.comwrote:

 Thanks for your quick reply.

 I tried with a fresh installation; it downloads sbt 0.12.4 only (please
 check the logs below), so it is not working. Can you tell me where this 1.0
 release candidate is located so that I can try it?


 dhcp-173-39-68-28:spark-0.9.1 neravi$ ./sbt/sbt assembly
 Attempting to fetch sbt
 
 100.0%

 Launching sbt from sbt/sbt-launch-0.12.4.jar
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
 MaxPermSize=350m; support was removed in 8.0
 [info] Loading project definition from
 /Applications/spark-0.9.1/project/project
 [info] Updating
 {file:/Applications/spark-0.9.1/project/project/}default-f15d5a...
 [info] Resolving org.scala-sbt#precompiled-2_10_1;0.12.4 ...
 [info] Done updating.

 [info] Compiling 1 Scala source to
 /Applications/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes...
 [error] error while loading CharSequence, class file
 '/Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/rt.jar(java/lang/CharSequence.class)'
 is broken
 [error] (bad constant pool tag 18 at byte 10)
 [error] error while loading Comparator, class file
 '/Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/rt.jar(java/util/Comparator.class)'
 is broken
 [error] (bad constant pool tag 18 at byte 20)
 [error] two errors found
 [error] (compile:compile) Compilation failed
 Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q
 dhcp-173-39-68-28:spark-0.9.1 neravi$ ls
 CHANGES.txtassemblycoredocs
 extraspom.xmlsbinyarn
 LICENSEbageldataec2
 graphxprojectsbt
 NOTICEbindevexamples
 make-distribution.shpythonstreaming
 README.mdconfdockerexternal
 mllibrepltools
 dhcp-173-39-68-28:spark-0.9.1 neravi$ cd sbt/
 dhcp-173-39-68-28:sbt neravi$ ls
 sbt


 sbt-launch-0.12.4.jar
 --





 From: scrapco...@gmail.com
 Date: Fri, 2 May 2014 16:02:48 +0530
 Subject: Re: Apache Spark is not building in Mac/Java 8
 To: user@spark.apache.org

 You will need to change the sbt version to 0.13.2. I think Spark 0.9.1 was
 released with sbt 0.13? In case not, then it may not work with Java 8. Just
 wait for the 1.0 release or give a 1.0 release candidate a try!

 http://mail-archives.apache.org/mod_mbox/spark-dev/201404.mbox/%3CCABPQxstL6nwTO2H9p8%3DGJh1g2zxOJd02Wt7L06mCLjo-vwwG9Q%40mail.gmail.com%3E

 Prashant Sharma

 On Fri, May 2, 2014 at 3:56 PM, N.Venkata Naga Ravi nvn_r...@hotmail.com wrote:
Re: when to use broadcast variables

2014-05-02 Thread Prashant Sharma
I would like to be corrected on this, but I am just trying to say small
enough, on the order of a few hundred MBs. Remember that the data gets shipped
to all nodes; it can be a GB, but not several GBs, and it also depends on the
network.
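
As a rough illustration of the broadcast route for the web-logs/accounts case
below (a sketch; the names, the shape of the data, and the join key are made
up, and sc is assumed from the shell):

    // accounts is the small side, logs the big side
    val accounts: Map[String, String] = Map("u1" -> "premium", "u2" -> "free")
    val accountsBc = sc.broadcast(accounts)

    val logs = sc.parallelize(Seq(("u1", "GET /a"), ("u2", "GET /b")))
    // map-side join: each task reads the broadcast copy instead of shuffling an RDD join
    val joined = logs.map { case (user, line) =>
      (user, line, accountsBc.value.getOrElse(user, "unknown"))
    }
    joined.collect().foreach(println)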

Prashant Sharma


On Fri, May 2, 2014 at 6:42 PM, Diana Carroll dcarr...@cloudera.com wrote:

 Anyone have any guidance on using a broadcast variable to ship data to
 workers vs. an RDD?

 Like, say I'm joining web logs in an RDD with user account data.  I could
 keep the account data in an RDD or if it's small, a broadcast variable
 instead.  How small is small?  Small enough that I know it can easily fit
 in memory on a single node?  Some other guideline?

 Thanks!

 Diana



Re: Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Prashant Sharma
Unfortunately ZeroMQ 4.0.1 is not supported.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala#L63
says which version is required. You will need that version of ZeroMQ to see it
work. Basically, I have seen it working nicely with ZeroMQ 2.2.0, and if you
have the jzmq libraries installed, performance is much better.

Prashant Sharma


On Tue, Apr 29, 2014 at 12:29 PM, Francis.Hu
francis...@reachjunction.comwrote:

  Hi, all



 I installed spark-0.9.1 and zeromq 4.0.1, and then ran the example below:



  ./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher tcp://127.0.1.1:1234 foo.bar

  ./bin/run-example org.apache.spark.streaming.examples.ZeroMQWordCount local[2] tcp://127.0.1.1:1234 foo



 No message was received on the ZeroMQWordCount side.



 Does anyone know what the issue is ?





 Thanks,

 Francis





Re: 答复: Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Prashant Sharma
Well, that is not going to be easy, simply because we depend on akka-zeromq
for ZeroMQ support. And since Akka does not support the latest ZeroMQ library
yet, I doubt there is anything simple that can be done to support it.

Prashant Sharma


On Tue, Apr 29, 2014 at 2:44 PM, Francis.Hu francis...@reachjunction.comwrote:

  Thanks, Prashant Sharma





 It works right now after downgrading ZeroMQ from 4.0.1 to 2.2.

 Do you know whether the new release of Spark will upgrade ZeroMQ?

 Many of our programs are using ZeroMQ 4.0.1, so if in the next release Spark
 Streaming can ship with a newer ZeroMQ, that would be better for us.





 Francis.



 *From:* Prashant Sharma [mailto:scrapco...@gmail.com]
 *Sent:* Tuesday, April 29, 2014 15:53
 *To:* user@spark.apache.org
 *Subject:* Re: Issue during Spark streaming with ZeroMQ source



 Unfortunately ZeroMQ 4.0.1 is not supported.
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala#L63
 says which version is required. You will need that version of ZeroMQ to see it
 work. Basically, I have seen it working nicely with ZeroMQ 2.2.0, and if you
 have the jzmq libraries installed, performance is much better.


   Prashant Sharma



 On Tue, Apr 29, 2014 at 12:29 PM, Francis.Hu francis...@reachjunction.com
 wrote:

 Hi, all



 I installed spark-0.9.1 and zeromq 4.0.1 , and then run below example:



  ./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher tcp://127.0.1.1:1234 foo.bar

  ./bin/run-example org.apache.spark.streaming.examples.ZeroMQWordCount local[2] tcp://127.0.1.1:1234 foo



 No message was received on the ZeroMQWordCount side.



 Does anyone know what the issue is ?





 Thanks,

 Francis







Re: Need help about how hadoop works.

2014-04-24 Thread Prashant Sharma
Prashant Sharma


On Thu, Apr 24, 2014 at 12:15 PM, Carter gyz...@hotmail.com wrote:

 Thanks Mayur.

 So without Hadoop and any other distributed file systems, by running:
  val doc = sc.textFile("/home/scalatest.txt", 5)
  doc.count
 we can only get parallelization within the computer where the file is
 loaded, but not the parallelization within the computers in the cluster
 (Spark can not automatically duplicate the file to the other computers in
 the cluster), is this understanding correct? Thank you.


Spark will not distribute that file to other systems for you; however, if the
file (/home/scalatest.txt) is present at the same path on all systems, it will
be processed on all nodes. We generally use HDFS, which takes care of this
distribution.
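
In other words (a sketch; the HDFS URI below is a placeholder for your own
namenode and path):

    // local path: the file must exist at this exact path on every worker node
    val localDoc = sc.textFile("/home/scalatest.txt", 5)

    // HDFS path: the blocks are already distributed, so each node reads its own split
    val hdfsDoc = sc.textFile("hdfs://namenode:8020/user/carter/scalatest.txt", 5)
    hdfsDoc.count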





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4734.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Need help about how hadoop works.

2014-04-24 Thread Prashant Sharma
It is the same file, and the Hadoop library that we use for splitting takes
care of assigning the right split to each node.

Prashant Sharma


On Thu, Apr 24, 2014 at 1:36 PM, Carter gyz...@hotmail.com wrote:

 Thank you very much for your help Prashant.

 Sorry I still have another question about your answer: however if the
 file(/home/scalatest.txt) is present on the same path on all systems it
 will be processed on all nodes.

 When presenting the file to the same path on all nodes, do we just simply
 copy the same file to all nodes, or do we need to split the original file
 into different parts (each part is still with the same file name
 scalatest.txt), and copy each part to a different node for
 parallelization?

 Thank you very much.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638p4738.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: K-means faster on Mahout then on Spark

2014-03-25 Thread Prashant Sharma
I think Mahout uses FuzzyKMeans, which is a different algorithm and is not
iterative.
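
For reference, the Spark side of such a benchmark typically boils down to
MLlib's KMeans.train (a sketch against a 1.x-style API; signatures differ
across versions, and the data here is just random points):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // randomly generated 3-dimensional points, cached so iterations reuse them
    val points = sc.parallelize(1 to 1000000).map { _ =>
      Vectors.dense(Array.fill(3)(scala.util.Random.nextDouble()))
    }.cache()

    // k = 10 clusters, 10 iterations
    val model = KMeans.train(points, 10, 10)
    println(model.clusterCenters.mkString("\n"))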

Prashant Sharma


On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov pahomov.e...@gmail.comwrote:

 Hi, I'm running benchmark, which compares Mahout and SparkML. For now I
 have next results for k-means:
 Number of iterations= 10, number of elements = 1000, mahouttime= 602,
 spark time = 138
 Number of iterations= 40, number of elements = 1000, mahouttime= 1917,
 spark time = 330
 Number of iterations= 70, number of elements = 1000, mahouttime= 3203,
 spark time = 388
 Number of iterations= 10, number of elements = 1, mahouttime=
 1235, spark time = 2226
 Number of iterations= 40, number of elements = 1, mahouttime=
 2755, spark time = 6388
 Number of iterations= 70, number of elements = 1, mahouttime=
 4107, spark time = 10967
 Number of iterations= 10, number of elements = 10, mahouttime=
 7070, spark time = 25268

 Time in seconds. It runs on Yarn cluster with about 40 machines. Elements
 for clusterization are randomly created. When I changed persistence level
 from Memory to Memory_and_disk, on big data spark started to work faster.

 What am I missing?

 See my benchmarking code in attachment.


 --



 Sincerely yours,
 Egor Pakhomov
 Scala Developer, Yandex