Avro RDD to DataFrame

2015-11-16 Thread Deenar Toraskar
Hi

The spark-avro module supports creation of a DataFrame from Avro files. How
can I convert an RDD of Avro objects that I get via Spark Streaming into a
DataFrame?

  val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord],
NullWritable, AvroKeyInputFormat[GenericRecord]](..)

https://github.com/databricks/spark-avro

// Creates a DataFrame from a specified file
DataFrame df = sqlContext.read().format("com.databricks.spark.avro")
    .load("src/test/resources/episodes.avro");
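For reference, a minimal sketch of one way to build a DataFrame from an
RDD[GenericRecord], e.g. inside foreachRDD on the stream above; the Episode
case class and its field names are hypothetical placeholders:

import org.apache.avro.generic.GenericRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

case class Episode(title: String, airDate: String)

def toEpisodeDF(sqlContext: SQLContext, records: RDD[GenericRecord]): DataFrame = {
  import sqlContext.implicits._
  records.map { r =>
    // GenericRecord values come back as Utf8/Object, so convert explicitly
    Episode(r.get("title").toString, r.get("air_date").toString)
  }.toDF()
}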



Regards
Deenar


Reading non UTF-8 files via spark streaming

2015-11-16 Thread tarek_abouzeid
Hi,

I am trying to read files which are ISO-8859-6 encoded via Spark Streaming,
but the default encoding for ssc.textFileStream is UTF-8, so I don't get the
data properly. Is there a way to change the default encoding for
textFileStream, or a way to read the file's raw bytes so that I can handle the
encoding myself?

Thanks so much in advance  
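A hedged sketch of one possible workaround: read the files as Hadoop Text
records via fileStream and decode the raw bytes with the desired charset,
instead of relying on textFileStream's UTF-8 decoding.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("/path/to/input/dir")
  .map { case (_, text) =>
    // Text.getBytes returns the underlying buffer; only the first getLength bytes are valid
    new String(text.getBytes, 0, text.getLength, "ISO-8859-6")
  }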



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-non-UTF-8-files-via-spark-streaming-tp25397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ISDATE Function

2015-11-16 Thread Ravisankar Mani
Hi Everyone,


In MSSQL Server, the ISDATE() function is used to check whether a column's
values are dates or not. Is it possible to achieve the same check on column
values in Spark SQL?
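There does not appear to be a built-in ISDATE equivalent in Spark SQL; a rough
sketch of approximating it with a UDF that tries to parse the value (the
"yyyy-MM-dd" format is a placeholder for whatever formats should count as
dates):

import java.text.SimpleDateFormat

sqlContext.udf.register("isdate", (s: String) => {
  if (s == null) false
  else {
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    fmt.setLenient(false)
    try { fmt.parse(s.trim); true }
    catch { case _: java.text.ParseException => false }
  }
})

// usage: sqlContext.sql("SELECT col, isdate(col) FROM some_table")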


Regards,
Ravi


Re: Hive on Spark orc file empty

2015-11-16 Thread 张炜
Hi Deepak and all,
write() is a function of DataFrame; please check
https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html
(the last method listed is write()).
The problem is that writing to HDFS is successful, but
results.write.format("orc").save("yahoo_stocks_orc")
produces an empty folder.

Could anyone help?
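A sketch of two alternatives that avoid an ambiguous relative output path
(which may resolve against the default filesystem or against each node's local
filesystem, depending on the setup); the namenode host and port are
placeholders:

// 1) fully qualified URI, as in the workaround mentioned further down the thread
results.write.format("orc").save("hdfs://namenode:8020/user/hive/warehouse/yahoo_stocks_orc")

// 2) let Hive manage the location by writing a table directly (assumes a HiveContext)
results.write.format("orc").saveAsTable("yahoo_stocks_orc")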

Regards,
Sai

On Mon, Nov 16, 2015 at 8:00 PM Deepak Sharma  wrote:

> Sai,
> I am a bit confused here.
> How are you using write with results?
> I am using Spark 1.4.1, and when I use write it complains about write not
> being a member of DataFrame:
> error: value write is not a member of org.apache.spark.sql.DataFrame
>
> Thanks
> Deepak
>
> On Mon, Nov 16, 2015 at 4:10 PM, 张炜  wrote:
>
>> Dear all,
>> I am following this article to try Hive on Spark
>>
>> http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/
>>
>> My environment:
>> Hive 1.2.1
>> Spark 1.5.1
>>
>> in a nutshell, I ran spark-shell, created a hive table
>>
>> hiveContext.sql("create table yahoo_orc_table (date STRING, open_price
>> FLOAT, high_price FLOAT, low_price FLOAT, close_price FLOAT, volume INT,
>> adj_price FLOAT) stored as orc")
>>
>> I also computed a dataframe and can show correct contents.
>> val results = sqlContext.sql("SELECT * FROM yahoo_stocks_temp")
>>
>> Then I executed the save command
>> results.write.format("orc").save("yahoo_stocks_orc")
>>
>> I can see a folder named "yahoo_stocks_orc" was created successfully and there is
>> a _SUCCESS file inside it, but no ORC files at all. I repeated this many
>> times and got the same result.
>>
>> But
>> results.write.format("orc").save("hdfs://*:8020/yahoo_stocks_orc")
>> can successfully write contents.
>>
>> Please kindly help.
>>
>> Regards,
>> Sai
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Spark Job is getting killed after certain hours

2015-11-16 Thread Ilya Ganelin
Your Kerberos cert is likely expiring. Check your expiration settings.

-Ilya Ganelin

On Mon, Nov 16, 2015 at 9:20 PM, Vipul Rai  wrote:

> Hi Nikhil,
> It seems you have Kerberos enabled cluster and it is unable to
> authenticate using the ticket.
> Please check the Kerberos settings, it could also be because of Kerberos
> version mismatch on nodes.
>
> Thanks,
> Vipul
>
> On Tue 17 Nov, 2015 07:31 Nikhil Gs  wrote:
>
>> Hello Team,
>>
>> Below is the error which we are facing in our cluster after 14 hours of
>> starting the spark submit job. Not able to understand the issue and why its
>> facing the below error after certain time.
>>
>> If any of you have faced the same scenario or if you have any idea then
>> please guide us. To identify the issue, if you need any other info then
>> please revert me back with the requirement.Thanks a lot in advance.
>>
>> *Log Error:  *
>>
>> 15/11/16 04:54:48 ERROR ipc.AbstractRpcClient: SASL authentication
>> failed. The most likely cause is missing or invalid credentials. Consider
>> 'kinit'.
>>
>> javax.security.sasl.SaslException: *GSS initiate failed [Caused by
>> GSSException: No valid credentials provided (Mechanism level: Failed to
>> find any Kerberos tgt)]*
>>
>> at
>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>
>> at
>> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>>
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
>>
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
>>
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)
>>
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)
>>
>> at java.security.AccessController.doPrivileged(Native
>> Method)
>>
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>>
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)
>>
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:881)
>>
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:850)
>>
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)
>>
>> at
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>>
>> at
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>>
>> at
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:31865)
>>
>> at
>> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1580)
>>
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1294)
>>
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1126)
>>
>> at
>> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
>>
>> at
>> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
>>
>> at
>> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
>>
>> at
>> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
>>
>> at
>> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1482)
>>
>> at
>> org.apache.hadoop.hbase.client.HTable.put(HTable.java:1095)
>>
>> at
>> com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper$1.run(ModempollHbaseLoadHelper.java:89)
>>
>> at java.security.AccessController.doPrivileged(Native
>> Method)
>>
>> at javax.security.auth.Subject.doAs(Subject.java:356)
>>
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
>>
>> at
>> com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper.loadToHbase(ModempollHbaseLoadHelper.java:48)
>>
>> at
>> com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:52)
>>
>> at
>> com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:48)
>>
>> at
>> org.apache.spark.api.java.Java

Re: Spark Job is getting killed after certain hours

2015-11-16 Thread Vipul Rai
Hi Nikhil,
It seems you have Kerberos enabled cluster and it is unable to authenticate
using the ticket.
Please check the Kerberos settings, it could also be because of Kerberos
version mismatch on nodes.

Thanks,
Vipul

On Tue 17 Nov, 2015 07:31 Nikhil Gs  wrote:

> Hello Team,
>
> Below is the error which we are facing in our cluster after 14 hours of
> starting the spark submit job. Not able to understand the issue and why its
> facing the below error after certain time.
>
> If any of you have faced the same scenario or if you have any idea then
> please guide us. To identify the issue, if you need any other info then
> please revert me back with the requirement.Thanks a lot in advance.
>
> *Log Error:  *
>
> 15/11/16 04:54:48 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
>
> javax.security.sasl.SaslException: *GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]*
>
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>
> at
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)
>
> at java.security.AccessController.doPrivileged(Native
> Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:881)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:850)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)
>
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:31865)
>
> at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1580)
>
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1294)
>
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1126)
>
> at
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
>
> at
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
>
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
>
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
>
> at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1482)
>
> at
> org.apache.hadoop.hbase.client.HTable.put(HTable.java:1095)
>
> at
> com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper$1.run(ModempollHbaseLoadHelper.java:89)
>
> at java.security.AccessController.doPrivileged(Native
> Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:356)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
>
> at
> com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper.loadToHbase(ModempollHbaseLoadHelper.java:48)
>
> at
> com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:52)
>
> at
> com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:48)
>
> at
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:999)
>
> at
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at
> scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>
> at
> scala.collection.

Re: Conf Settings in Mesos

2015-11-16 Thread Jo Voordeckers
I have run into a related issue, I think: args passed to spark-submit for my
cluster dispatcher get lost in translation when launching the driver from
Mesos. I'm suggesting this patch:

https://github.com/jayv/spark/commit/b2025ddc1d565d1cc3036200fc3b3046578f4b02

- Jo Voordeckers


On Thu, Nov 12, 2015 at 6:05 AM, John Omernik  wrote:

> Hey all,
>
> I noticed today that if I take a tgz as my URI for Mesos, that I have to
> repackaged it with my conf settings from where I execute say pyspark for
> the executors to have the right configuration settings.
>
> That is...
>
> If I take a "stock" tgz from make-distribution.sh, unpack it, set the URI in
> spark-defaults to be the unmodified tgz, change other settings in both
> spark-defaults.conf and spark-env.sh, and then run ./bin/pyspark from that
> unpacked directory, I guess I would have thought that when the executor spun
> up, some sort of magic would happen where the conf directory or the conf
> settings would propagate out to the executors (thus making configuration
> changes easier to manage).
>
> For things to work, I had to unpack the tgz, change conf settings, then
> repackage the tgz with all my conf settings for the tgz in the URI then run
> it. Then it seemed to work.
>
> I have a workaround, but I guess, from a usability point of view, it would
> be nice to have a tgz that is just "binaries" and that picks up the conf at
> run time when it's run. It would help with managing multiple configurations
> that use the same binaries (different models/apps etc.). Instead of having
> to repackage a tgz for each app, the conf would just propagate... am I
> looking at this wrong?
>
> John
>
>
>


Re: Spark-shell connecting to Mesos stuck at sched.cpp

2015-11-16 Thread Jo Voordeckers
I've seen this issue when the Mesos cluster couldn't figure out my IP address
correctly. Have you tried setting the following environment variable to your
IP address when launching Spark or the Mesos cluster dispatcher?

 LIBPROCESS_IP="172.16.0.180"


- Jo Voordeckers


On Sun, Nov 15, 2015 at 6:59 PM, Jong Wook Kim  wrote:

> I'm having problem connecting my spark app to a Mesos cluster; any help on
> the below question would be appreciated.
>
>
> http://stackoverflow.com/questions/33727154/spark-shell-connecting-to-mesos-stuck-at-sched-cpp
>
> Thanks,
> Jong Wook
>


Re: how can evenly distribute my records in all partition

2015-11-16 Thread Sabarish Sasidharan
You can write your own custom partitioner to achieve this

Regards
Sab
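A minimal sketch of that suggestion, assuming the 30 keys can be mapped to the
integers 0..29 (adapt getPartition to however your keys map to 0..n-1):

import org.apache.spark.Partitioner

class ExactPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int =
    key.asInstanceOf[Int] % partitions  // exactly one key per partition when keys are 0..n-1
}

val evenlySpread = rdd.partitionBy(new ExactPartitioner(30))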
On 17-Nov-2015 1:11 am, "prateek arora"  wrote:

> Hi
>
> I have an RDD with 30 records (key/value pairs) and am running 30 executors. I
> want to repartition this RDD into 30 partitions so that every partition gets
> one record and is assigned to one executor.
>
> When I use rdd.repartition(30) it repartitions my RDD into 30 partitions, but
> some partitions get 2 records, some get 1 record, and some don't get any
> record at all.
>
> Is there any way in Spark to evenly distribute my records across all
> partitions?
>
> Regards
> Prateek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-can-evenly-distribute-my-records-in-all-partition-tp25394.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark Job is getting killed after certain hours

2015-11-16 Thread Nikhil Gs
Hello Team,

Below is the error which we are facing in our cluster after 14 hours of
running the spark-submit job. We are not able to understand the issue and why
it hits the below error after a certain time.

If any of you have faced the same scenario or have any idea, then please guide
us. To identify the issue, if you need any other info then please get back to
me with the requirements. Thanks a lot in advance.

*Log Error:  *

15/11/16 04:54:48 ERROR ipc.AbstractRpcClient: SASL authentication failed.
The most likely cause is missing or invalid credentials. Consider 'kinit'.

javax.security.sasl.SaslException: *GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]*

at
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)

at
org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)

at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)

at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)

at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)

at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)

at java.security.AccessController.doPrivileged(Native
Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)

at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)

at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:881)

at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:850)

at
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)

at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)

at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)

at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:31865)

at
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1580)

at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1294)

at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1126)

at
org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)

at
org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)

at
org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)

at
org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)

at
org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1482)

at
org.apache.hadoop.hbase.client.HTable.put(HTable.java:1095)

at
com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper$1.run(ModempollHbaseLoadHelper.java:89)

at java.security.AccessController.doPrivileged(Native
Method)

at javax.security.auth.Subject.doAs(Subject.java:356)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)

at
com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper.loadToHbase(ModempollHbaseLoadHelper.java:48)

at
com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:52)

at
com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:48)

at
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:999)

at
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at
scala.collection.Iterator$$anon$10.next(Iterator.scala:312)

at
scala.collection.Iterator$class.foreach(Iterator.scala:727)

at
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

at scala.collection.TraversableOnce$class.to
(Trav

Mesos cluster dispatcher doesn't respect most args from the submit req

2015-11-16 Thread Jo Voordeckers
Hi all,

I'm running the Mesos cluster dispatcher; however, when I submit jobs, things
like JVM args, classpath order and the UI port aren't added to the command
line executed by the Mesos scheduler. In fact it only cares about the class,
the jar and the number of cores / amount of memory.

https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L412-L424

I've made an attempt at adding a few of the args that I believe are useful
to the MesosClusterScheduler class, which seems to solve my problem.

Please have a look:

https://github.com/apache/spark/pull/9752

Thanks

- Jo Voordeckers


Stage retry limit

2015-11-16 Thread pnpritchard
In my app, I see a condition where a stage fails and Spark retries it
endlessly. I see the configuration for task retry limit
(spark.task.maxFailures), but is there a configuration for limiting the
number of stage retries?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Stage-retry-limit-tp25396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: YARN Labels

2015-11-16 Thread Saisai Shao
Node labels for the AM are not yet supported in Spark; currently only executor
node labels are supported.

On Tue, Nov 17, 2015 at 7:57 AM, Ted Yu  wrote:

> Wangda, YARN committer, told me that support for selecting which nodes the
> application master is running on is integrated to the upcoming hadoop 2.8.0
> release.
>
> Stay tuned.
>
> On Mon, Nov 16, 2015 at 1:36 PM, Ted Yu  wrote:
>
>> There is no such configuration parameter for selecting which nodes the
>> application master is running on.
>>
>> Cheers
>>
>> On Mon, Nov 16, 2015 at 12:52 PM, Alex Rovner 
>> wrote:
>>
>>> I was wondering if there is analogues configuration parameter to 
>>> "spark.yarn.executor.nodeLabelExpression"
>>> which restricts which nodes the application master is running on.
>>>
>>> One of our clusters runs on AWS with a portion of the nodes being spot
>>> nodes. We would like to force the application master not to run on spot
>>> nodes. For what ever reason, application master is not able to recover in
>>> cases the node where it was running suddenly disappears, which is the case
>>> with spot nodes.
>>>
>>> Any guidance on this topic is appreciated.
>>>
>>> *Alex Rovner*
>>> *Director, Data Engineering *
>>> *o:* 646.759.0052
>>>
>>> * *
>>>
>>
>>
>


Re: YARN Labels

2015-11-16 Thread Ted Yu
Wangda, YARN committer, told me that support for selecting which nodes the
application master is running on is integrated to the upcoming hadoop 2.8.0
release.

Stay tuned.

On Mon, Nov 16, 2015 at 1:36 PM, Ted Yu  wrote:

> There is no such configuration parameter for selecting which nodes the
> application master is running on.
>
> Cheers
>
> On Mon, Nov 16, 2015 at 12:52 PM, Alex Rovner 
> wrote:
>
>> I was wondering if there is analogues configuration parameter to 
>> "spark.yarn.executor.nodeLabelExpression"
>> which restricts which nodes the application master is running on.
>>
>> One of our clusters runs on AWS with a portion of the nodes being spot
>> nodes. We would like to force the application master not to run on spot
>> nodes. For what ever reason, application master is not able to recover in
>> cases the node where it was running suddenly disappears, which is the case
>> with spot nodes.
>>
>> Any guidance on this topic is appreciated.
>>
>> *Alex Rovner*
>> *Director, Data Engineering *
>> *o:* 646.759.0052
>>
>> * *
>>
>
>


Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about
"""
1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree based
methods. How about I open a jira for this?
"""

--> MLlib trees do this currently, so you could check out that code as an
example.
I'm not sure how this would work as a generic transformer, though; it seems
more like an internal part of space-partitioning algorithms.



On Tue, Oct 27, 2015 at 5:04 PM, Meihua Wu 
wrote:

> Hi DB Tsai,
>
> Thank you again for your insightful comments!
>
> 1) I agree the sorting method you suggested is a very efficient way to
> handle the unordered categorical variables in binary classification
> and regression. I propose we have a Spark ML Transformer to do the
> sorting and encoding, bringing the benefits to many tree based
> methods. How about I open a jira for this?
>
> 2) For L2/L1 regularization vs Learning rate (I use this name instead
> shrinkage to avoid confusion), I have the following observations:
>
> Suppose G and H are the sum (over the data assigned to a leaf node) of
> the 1st and 2nd derivative of the loss evaluated at f_m, respectively.
> Then for this leaf node,
>
> * With a learning rate eta, f_{m+1} = f_m - G/H*eta
>
> * With a L2 regularization coefficient lambda, f_{m+1} =f_m - G/(H+lambda)
>
> If H>0 (convex loss), both approach lead to "shrinkage":
>
> * For the learning rate approach, the percentage of shrinkage is
> uniform for any leaf node.
>
> * For L2 regularization, the percentage of shrinkage would adapt to
> the number of instances assigned to a leaf node: more instances =>
> larger G and H => less shrinkage. This behavior is intuitive to me. If
> the value estimated from this node is based on a large amount of data,
> the value should be reliable and less shrinkage is needed.
>
> I suppose we could have something similar for L1.
>
> I am not aware of theoretical results to conclude which method is
> better. Likely to be dependent on the data at hand. Implementing
> learning rate is on my radar for version 0.2. I should be able to add
> it in a week or so. I will send you a note once it is done.
>
> Thanks,
>
> Meihua
>
> On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai  wrote:
> > Hi Meihua,
> >
> > For categorical features, the ordinal issue can be solved by trying
> > all kind of different partitions 2^(q-1) -1 for q values into two
> > groups. However, it's computational expensive. In Hastie's book, in
> > 9.2.4, the trees can be trained by sorting the residuals and being
> > learnt as if they are ordered. It can be proven that it will give the
> > optimal solution. I have a proof that this works for learning
> > regression trees through variance reduction.
> >
> > I'm also interested in understanding how the L1 and L2 regularization
> > within the boosting works (and if it helps with overfitting more than
> > shrinkage).
> >
> > Thanks.
> >
> > Sincerely,
> >
> > DB Tsai
> > --
> > Web: https://www.dbtsai.com
> > PGP Key ID: 0xAF08DF8D
> >
> >
> > On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu 
> wrote:
> >> Hi DB Tsai,
> >>
> >> Thank you very much for your interest and comment.
> >>
> >> 1) feature sub-sample is per-node, like random forest.
> >>
> >> 2) The current code heavily exploits the tree structure to speed up
> >> the learning (such as processing multiple learning node in one pass of
> >> the training data). So a generic GBM is likely to be a different
> >> codebase. Do you have any nice reference of efficient GBM? I am more
> >> than happy to look into that.
> >>
> >> 3) The algorithm accept training data as a DataFrame with the
> >> featureCol indexed by VectorIndexer. You can specify which variable is
> >> categorical in the VectorIndexer. Please note that currently all
> >> categorical variables are treated as ordered. If you want some
> >> categorical variables as unordered, you can pass the data through
> >> OneHotEncoder before the VectorIndexer. I do have a plan to handle
> >> unordered categorical variable using the approach in RF in Spark ML
> >> (Please see roadmap in the README.md)
> >>
> >> Thanks,
> >>
> >> Meihua
> >>
> >>
> >>
> >> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> >>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> >>> you think you can implement generic GBM and have it merged as part of
> >>> Spark codebase?
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> --
> >>> Web: https://www.dbtsai.com
> >>> PGP Key ID: 0xAF08DF8D
> >>>
> >>>
> >>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
> >>>  wrote:
>  Hi Spark User/Dev,
> 
>  Inspired by the success of XGBoost, I have created a Spark package for
>  gradient boosting tree with 2nd order approximation of arbitrary
>  us

Parallelizing operations using Spark

2015-11-16 Thread Susheel Kumar
Hello Spark Users,

This is my first email to the Spark mailing list, and I'm looking forward to
being part of it. I have been working on Solr and in the past have used Java
thread pooling to parallelize Solr indexing using SolrJ.

Now I am again working on indexing data, this time from JSON files (in the
hundreds of thousands), and before I try parallelizing the operations using
Spark (reading each JSON file and posting its content to Solr) I wanted to
confirm my understanding.


Reading the JSON files using wholeTextFiles and then posting the content to
Solr:

- would this be similar to what I would achieve using Java multi-threading /
thread pooling with the Executor framework?
- what additional advantages would I get by using Spark (less code...)?
- how can we parallelize/batch this further? For example, in my Java
multi-threaded version I parallelize not only the reading / data acquisition
but also the posting, in parallel batches.


Below is a code snippet to give you an idea of what I am thinking of starting
with. Please feel free to suggest corrections to my understanding and the code
structure below.

SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]");

JavaSparkContext sc = new JavaSparkContext(conf);

// wholeTextFiles yields (file path, file content) pairs
JavaPairRDD<String, String> rdd = sc.wholeTextFiles("/../*.json");

rdd.foreach(new VoidFunction<Tuple2<String, String>>() {

    @Override
    public void call(Tuple2<String, String> arg0) throws Exception {

        // post content to Solr
        String content = arg0._2;

        ...

    }

});
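On the batching question above, a hedged sketch (in Scala for brevity;
SolrBatchClient is a hypothetical helper, not a real SolrJ class) of how the
posting side can also be batched: foreachPartition hands each partition to one
task as an iterator, so each partition can reuse a single client and post
documents in groups.

rdd.foreachPartition { files =>
  val solr = SolrBatchClient.create()        // one client per partition, created on the executor
  files.map(_._2)                            // (path, content) pairs -> just the content
       .grouped(100)                         // post in batches of 100 documents
       .foreach(batch => solr.postBatch(batch))
  solr.close()
}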


Thanks,

Susheel


Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Cody Koeninger
Ordering would be on a per-partition basis, not global ordering.

You typically want to acquire resources inside the foreachpartition
closure, just before handling the iterator.

http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
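A minimal sketch of that pattern (ConnectionPool stands in for the static
Kafka connection pool and is hypothetical): the connection is fetched inside
foreachPartition, i.e. on the executor, not on the driver.

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val producer = ConnectionPool.getConnection()  // acquired per partition, on the executor
    records.foreach(record => producer.send(record))
    ConnectionPool.returnConnection(producer)      // return it so later tasks can reuse it
  }
}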

On Mon, Nov 16, 2015 at 4:02 PM, Nipun Arora 
wrote:

> Hi,
> I wanted to understand forEachPartition logic. In the code below, I am
> assuming the iterator is executing in a distributed fashion.
>
> 1. Assuming I have a stream which has timestamp data which is sorted. Will
> the stringiterator in foreachPartition process each line in order?
>
> 2. Assuming I have a static pool of Kafka connections, where should I get
> a connection from a pool to be used to send data to Kafka?
>
> addMTSUnmatched.foreachRDD(
> new Function, Void>() {
> @Override
> public Void call(JavaRDD stringJavaRDD) throws Exception {
> stringJavaRDD.foreachPartition(
>
> new VoidFunction>() {
> @Override
> public void call(Iterator stringIterator) 
> throws Exception {
> while(stringIterator.hasNext()){
> String str = stringIterator.next();
> if(OnlineUtils.ESFlag) {
> OnlineUtils.printToFile(str, 1, 
> type1_outputFile, OnlineUtils.client);
> }else{
> OnlineUtils.printToFile(str, 1, 
> type1_outputFile);
> }
> }
> }
> }
> );
> return null;
> }
> }
> );
>
>
>
> Thanks
>
> Nipun
>
>


[SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Nipun Arora
Hi,
I wanted to understand the foreachPartition logic. In the code below, I am
assuming the iterator executes in a distributed fashion.

1. Assuming I have a stream whose timestamped data is sorted, will the
stringIterator in foreachPartition process each line in order?

2. Assuming I have a static pool of Kafka connections, where should I get a
connection from the pool to be used to send data to Kafka?

addMTSUnmatched.foreachRDD(
    new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
            stringJavaRDD.foreachPartition(
                new VoidFunction<Iterator<String>>() {
                    @Override
                    public void call(Iterator<String> stringIterator) throws Exception {
                        while (stringIterator.hasNext()) {
                            String str = stringIterator.next();
                            if (OnlineUtils.ESFlag) {
                                OnlineUtils.printToFile(str, 1, type1_outputFile, OnlineUtils.client);
                            } else {
                                OnlineUtils.printToFile(str, 1, type1_outputFile);
                            }
                        }
                    }
                }
            );
            return null;
        }
    }
);



Thanks

Nipun


Re: YARN Labels

2015-11-16 Thread Ted Yu
There is no such configuration parameter for selecting which nodes the
application master is running on.

Cheers

On Mon, Nov 16, 2015 at 12:52 PM, Alex Rovner 
wrote:

> I was wondering if there is analogues configuration parameter to 
> "spark.yarn.executor.nodeLabelExpression"
> which restricts which nodes the application master is running on.
>
> One of our clusters runs on AWS with a portion of the nodes being spot
> nodes. We would like to force the application master not to run on spot
> nodes. For what ever reason, application master is not able to recover in
> cases the node where it was running suddenly disappears, which is the case
> with spot nodes.
>
> Any guidance on this topic is appreciated.
>
> *Alex Rovner*
> *Director, Data Engineering *
> *o:* 646.759.0052
>
> * *
>


YARN Labels

2015-11-16 Thread Alex Rovner
I was wondering if there is an analogous configuration parameter to
"spark.yarn.executor.nodeLabelExpression"
that restricts which nodes the application master runs on.

One of our clusters runs on AWS with a portion of the nodes being spot
nodes. We would like to force the application master not to run on spot
nodes. For whatever reason, the application master is not able to recover when
the node where it was running suddenly disappears, which is the case
with spot nodes.

Any guidance on this topic is appreciated.

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052

* *


Re: ReduceByKeyAndWindow does repartitioning twice on recovering from checkpoint

2015-11-16 Thread Tathagata Das
ReduceByKeyAndWindow with an inverse function has certain unique
characteristics - it reuses a lot of the intermediate partitioned,
partially-reduced data. For every new batch, it "reduces" the new data, and
"inverse reduces" the old out-of-window data. That inverse-reduce needs
data from an old batch. Under normal operation, that old batch is
partitioned and cached, and so there is only one partitioning - of the new
data. That's not the case at recovery time - the to-be-inverse-reduced
old batch data needs to be re-read from Kafka and repartitioned again.
Hence the two repartitions.
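For reference, a sketch of the operation being described (a 2 hour window
sliding every 15 minutes with an inverse-reduce function), assuming a DStream
of (key, count) pairs named "pairs":

import org.apache.spark.streaming.Minutes

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // fold the new 15-minute batch into the window
  (a: Int, b: Int) => a - b,  // "inverse reduce" the batch that just left the window
  Minutes(120),               // window duration
  Minutes(15))                // slide interval
// note: the inverse-reduce variant requires checkpointing, e.g. ssc.checkpoint(dir)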

On Sun, Nov 15, 2015 at 9:05 AM, kundan kumar  wrote:

> Hi,
>
> I am using spark streaming check-pointing mechanism and reading the data
> from Kafka. The window duration for my application is 2 hrs with a sliding
> interval of 15 minutes.
>
> So, my batches run at following intervals...
>
>- 09:45
>- 10:00
>- 10:15
>- 10:30
>- and so on
>
> When my job is restarted, and recovers from the checkpoint it does the
> re-partitioning step twice for each 15 minute job until the window of 2
> hours is complete. Then the re-partitioning takes place only once.
>
> For example - when the job recovers at 16:15 it does re-partitioning for
> the 16:15 Kafka stream and the 14:15 Kafka stream as well. Also, all the
> other intermediate stages are computed for 16:15 batch. I am using
> reduceByKeyAndWindow with inverse function. Now once the 2 hrs window is
> complete 18:15 onward re-partitioning takes place only once. Seems like the
> checkpoint does not have RDD stored for beyond 2 hrs which is my window
> duration. Because of this my job takes more time than usual.
>
> Is there a way or some configuration parameter which would help avoid
> repartitioning twice ?
>
> Attaching the snaps when repartitioning takes place twice after recovery
> from checkpoint.
>
> Thanks !!
>
> Kundan
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: send transformed RDD to s3 from slaves

2015-11-16 Thread Walrus theCat
Update:

You can now answer this on stackoverflow for 100 bounty:

http://stackoverflow.com/questions/33704073/how-to-send-transformed-data-from-partitions-to-s3

On Fri, Nov 13, 2015 at 4:56 PM, Walrus theCat 
wrote:

> Hi,
>
> I have an RDD which crashes the driver when being collected.  I want to
> send the data on its partitions out to S3 without bringing it back to the
> driver. I try calling rdd.foreachPartition, but the data that gets sent has
> not gone through the chain of transformations that I need.  It's the data
> as it was ingested initially.  After specifying my chain of
> transformations, but before calling foreachPartition, I call rdd.count in
> order to force the RDD to transform.  The data it sends out is still not
> transformed.  How do I get the RDD to send out transformed data when
> calling foreachPartition?
>
> Thanks
>


how can evenly distribute my records in all partition

2015-11-16 Thread prateek arora
Hi

I have an RDD with 30 records (key/value pairs) and am running 30 executors. I
want to repartition this RDD into 30 partitions so that every partition gets
one record and is assigned to one executor.

When I use rdd.repartition(30) it repartitions my RDD into 30 partitions, but
some partitions get 2 records, some get 1 record, and some don't get any
record at all.

Is there any way in Spark to evenly distribute my records across all
partitions?

Regards
Prateek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-can-evenly-distribute-my-records-in-all-partition-tp25394.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Powered By Page

2015-11-16 Thread Alex Rovner
I would like to list our organization on the Powered by Page.

Company: Magnetic
Description: We are leveraging Spark Core, Streaming and YARN to process
our massive datasets.

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052

* *


Re: bin/pyspark SparkContext is missing?

2015-11-16 Thread Andy Davidson
Thanks

andy

From:  Davies Liu 
Date:  Friday, November 13, 2015 at 3:42 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: bin/pyspark SparkContext is missing?

> You forgot to create a SparkContext instance:
> 
> sc = SparkContext()
> 
> On Tue, Nov 3, 2015 at 9:59 AM, Andy Davidson
>  wrote:
>>  I am having a heck of a time getting Ipython notebooks to work on my 1.5.1
>>  AWS cluster I created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>> 
>>  I have read the instructions for using iPython notebook on
>>  http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
>> 
>>  I want to run the notebook server on my master and use an ssh tunnel to
>>  connect a web browser running on my mac.
>> 
>>  I am confident the cluster is set up correctly because the sparkPi example
>>  runs.
>> 
>>  I am able to use IPython notebooks on my local mac and work with spark and
>>  local files with out any problems.
>> 
>>  I know the ssh tunnel is working.
>> 
>>  On my cluster I am able to use python shell in general
>> 
>>  [ec2-user@ip-172-31-29-60 dataScience]$ /root/spark/bin/pyspark --master
>>  local[2]
>> 
>> 
>  from pyspark import SparkContext
>> 
>  textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.txt")
>> 
>  textFile.take(1)
>> 
>> 
>> 
>>  When I run the exact same code in iPython notebook I get
>> 
>>  ---
>>  NameError Traceback (most recent call last)
>>   in ()
>>   11 from pyspark import SparkContext, SparkConf
>>   12
>>  ---> 13 textFile =
>>  sc.textFile("file:///home/ec2-user/dataScience/readme.txt")
>>   14
>>   15 textFile.take(1)
>> 
>>  NameError: name 'sc' is not defined
>> 
>> 
>> 
>> 
>>  To try and debug I wrote a script to launch pyspark and added 'set -x' to
>>  pyspark so I could see what the script was doing
>> 
>>  Any idea how I can debug this?
>> 
>>  Thanks in advance
>> 
>>  Andy
>> 
>>  $ cat notebook.sh
>> 
>>  set -x
>> 
>>  export PYSPARK_DRIVER_PYTHON=ipython
>> 
>>  export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000"
>> 
>>  /root/spark/bin/pyspark --master local[2]
>> 
>> 
>> 
>> 
>>  [ec2-user@ip-172-31-29-60 dataScience]$ ./notebook.sh
>> 
>>  ++ export PYSPARK_DRIVER_PYTHON=ipython
>> 
>>  ++ PYSPARK_DRIVER_PYTHON=ipython
>> 
>>  ++ export 'PYSPARK_DRIVER_PYTHON_OPTS=notebook --no-browser --port=7000'
>> 
>>  ++ PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7000'
>> 
>>  ++ /root/spark/bin/pyspark --master 'local[2]'
>> 
>>  +++ dirname /root/spark/bin/pyspark
>> 
>>  ++ cd /root/spark/bin/..
>> 
>>  ++ pwd
>> 
>>  + export SPARK_HOME=/root/spark
>> 
>>  + SPARK_HOME=/root/spark
>> 
>>  + source /root/spark/bin/load-spark-env.sh
>> 
>>   dirname /root/spark/bin/pyspark
>> 
>>  +++ cd /root/spark/bin/..
>> 
>>  +++ pwd
>> 
>>  ++ FWDIR=/root/spark
>> 
>>  ++ '[' -z '' ']'
>> 
>>  ++ export SPARK_ENV_LOADED=1
>> 
>>  ++ SPARK_ENV_LOADED=1
>> 
>>   dirname /root/spark/bin/pyspark
>> 
>>  +++ cd /root/spark/bin/..
>> 
>>  +++ pwd
>> 
>>  ++ parent_dir=/root/spark
>> 
>>  ++ user_conf_dir=/root/spark/conf
>> 
>>  ++ '[' -f /root/spark/conf/spark-env.sh ']'
>> 
>>  ++ set -a
>> 
>>  ++ . /root/spark/conf/spark-env.sh
>> 
>>  +++ export JAVA_HOME=/usr/java/latest
>> 
>>  +++ JAVA_HOME=/usr/java/latest
>> 
>>  +++ export SPARK_LOCAL_DIRS=/mnt/spark,/mnt2/spark
>> 
>>  +++ SPARK_LOCAL_DIRS=/mnt/spark,/mnt2/spark
>> 
>>  +++ export SPARK_MASTER_OPTS=
>> 
>>  +++ SPARK_MASTER_OPTS=
>> 
>>  +++ '[' -n 1 ']'
>> 
>>  +++ export SPARK_WORKER_INSTANCES=1
>> 
>>  +++ SPARK_WORKER_INSTANCES=1
>> 
>>  +++ export SPARK_WORKER_CORES=2
>> 
>>  +++ SPARK_WORKER_CORES=2
>> 
>>  +++ export HADOOP_HOME=/root/ephemeral-hdfs
>> 
>>  +++ HADOOP_HOME=/root/ephemeral-hdfs
>> 
>>  +++ export
>>  SPARK_MASTER_IP=ec2-54-215-207-132.us-west-1.compute.amazonaws.com
>> 
>>  +++ SPARK_MASTER_IP=ec2-54-215-207-132.us-west-1.compute.amazonaws.com
>> 
>>   cat /root/spark-ec2/cluster-url
>> 
>>  +++ export
>>  MASTER=spark://ec2-54-215-207-132.us-west-1.compute.amazonaws.com:7077
>> 
>>  +++ MASTER=spark://ec2-54-215-207-132.us-west-1.compute.amazonaws.com:7077
>> 
>>  +++ export SPARK_SUBMIT_LIBRARY_PATH=:/root/ephemeral-hdfs/lib/native/
>> 
>>  +++ SPARK_SUBMIT_LIBRARY_PATH=:/root/ephemeral-hdfs/lib/native/
>> 
>>  +++ export SPARK_SUBMIT_CLASSPATH=::/root/ephemeral-hdfs/conf
>> 
>>  +++ SPARK_SUBMIT_CLASSPATH=::/root/ephemeral-hdfs/conf
>> 
>>   wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname
>> 
>>  +++ export
>>  SPARK_PUBLIC_DNS=ec2-54-215-207-132.us-west-1.compute.amazonaws.com
>> 
>>  +++ SPARK_PUBLIC_DNS=ec2-54-215-207-132.us-west-1.compute.amazonaws.com
>> 
>>  +++ export YARN_CONF_DIR=/root/ephemeral-hdfs/conf
>> 
>>  +++ YARN_CONF_DIR=/root/ephemeral-hdfs/conf
>> 
>>   id -u
>> 
>>  +++ '[' 222 == 0 ']'
>> 
>>  ++ set +a
>> 
>>  ++ '[' -z '' ']'
>> 
>>  ++ ASS

Re: spark-submit stuck and no output in console

2015-11-16 Thread Kayode Odeyemi
> Or are you saying that the Java process never even starts?


Exactly.

Here's what I got back from jstack as expected:

hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316
31316: Unable to open socket file: target process not responding or HotSpot
VM not loaded
The -F option can be used when the target process is not responding
hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ jstack 31316 -F
Attaching to core -F from executable 31316, please wait...
Error attaching to core file: Can't attach to the core file


Re: How 'select name,age from TBL_STUDENT where age = 37' is optimized when caching it

2015-11-16 Thread Xiao Li
Your DataFrame is cached, so its plan is stored as an InMemoryRelation.

You can read the logic in CacheManager.scala.

Good luck,

Xiao Li
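A small sketch of how to see this: once the DataFrame is cached and
materialized, the explained plan reads from the in-memory columnar cache
rather than re-running the original scan.

val df = sqlContext.sql("select name, age from TBL_STUDENT where age = 37")
df.cache()
df.count()        // materializes the cache
df.explain(true)  // the extended plan now shows the InMemoryRelation instead of the base relation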

2015-11-16 6:35 GMT-08:00 Todd :

> Hi,
>
> When I cache the dataframe and run the query,
>
> val df = sqlContext.sql("select  name,age from TBL_STUDENT where age =
> 37")
> df.cache()
> df.show
> println(df.queryExecution)
>
>  I got the following execution plan,from the optimized logical plan,I can
> see the whole analyzed logical plan is totally replaced with the
> InMemoryRelation logical plan. But when I look into the Optimizer, I didn't
> see any optimizer that relates to the InMemoryRelation.
>
> Could you please explain how the optimization works?
>
>
>
>
> == Parsed Logical Plan ==
> ', argString:< [unresolvedalias(UnresolvedAttribute:
> 'name),unresolvedalias(UnresolvedAttribute: 'age)]>
>   ', argString:< (UnresolvedAttribute: 'age = 37)>
> ', argString:< [TBL_STUDENT], None>
>
> == Analyzed Logical Plan ==
> name: string, age: int
> , argString:<
> [AttributeReference:name#1,AttributeReference:age#3]>
>   , argString:< (AttributeReference:age#3 = 37)>
> , argString:< TBL_STUDENT>
>   , argString:<
> [AttributeReference:id#0,AttributeReference:name#1,AttributeReference:classId#2,AttributeReference:age#3],
> MapPartitionsRDD[4] at main at NativeMethodAccessorImpl.java:-2>
>
> == Optimized Logical Plan ==
> , argString:<
> [AttributeReference:name#1,AttributeReference:age#3], true, 1,
> StorageLevel(true, true, false, true, 1), (, argString:<
> [AttributeReference:name#1,AttributeReference:age#3]>), None>
>
> == Physical Plan ==
> , argString:<
> [AttributeReference:name#1,AttributeReference:age#3], (,
> argString:< [AttributeReference:name#1,AttributeReference:age#3], true,
> 1, StorageLevel(true, true, false, true, 1), (,
> argString:< [AttributeReference:name#1,AttributeReference:age#3]>), None>)>
>
>
>
>
>
>
>
>
>
>
>


Spark SQL UDAF works fine locally, OutOfMemory on YARN

2015-11-16 Thread Alex Nastetsky
Hi,

I am using Spark 1.5.1.

I have a Spark SQL UDAF that works fine on a tiny dataset (13 narrow rows)
in local mode, but runs out of memory on YARN about half the time
(OutOfMemory: Java Heap Space). The rest of the time, it works on YARN.

Note that in all instances, the input data is the same.

Here is the UDAF: https://gist.github.com/alexnastetsky/581af2672328c4b8b023

I am also using a trivial UDT to keep track of each unique value and its
count.

The basic idea is to have a secondary grouping and to count the number
occurrences of each value in the group. For example, we want to group on
column X; then for each group, we want to aggregate the rows by column Y
and count how many times each unique value of Y appears.

So, given the following data:

X Y
a 1
a 2
a 2
a 2
b 3
b 3
b 4

I would do

myudaf = new MergeArraysOfElementWithCountUDAF()
df = // load data
df.groupBy($"X")
.agg(
myudaf($"Y").as("aggY")
)

should provide data like the following

X aggY
a [{"element":"1", "count":"1"}, {"element":"2", "count":"3"}]
b [{"element":"3", "count":"2"}, {"element":"4", "count":"1"}]

There's also an option to take as input an array, instead of a scalar, in
which case it just loops through the array and performs the same operation.

I've added some logging to show the Runtime.getRuntime.freeMemory right
before it throws the OOM error, and it shows plenty of memory (16 GB, when
I was running on a large node) still available. So I'm not sure if it's
some huge memory spike, or it's not actually seeing that available memory.

When the OOM does happen, it consistently happens at this line:
https://gist.github.com/alexnastetsky/581af2672328c4b8b023#file-mergearraysofelementwithcountudaf-scala-L59

java.lang.OutOfMemoryError: Java heap space
at
scala.reflect.ManifestFactory$$anon$1.newArray(Manifest.scala:165)
at
scala.reflect.ManifestFactory$$anon$1.newArray(Manifest.scala:164)
at org.apache.spark.sql.types.ArrayData.toArray(ArrayData.scala:108)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$MapConverter.toScala(CatalystTypeConverters.scala:235)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$MapConverter.toScala(CatalystTypeConverters.scala:193)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$2.apply(CatalystTypeConverters.scala:414)
at
org.apache.spark.sql.execution.aggregate.InputAggregationBuffer.get(udaf.scala:298)
at org.apache.spark.sql.Row$class.getAs(Row.scala:316)
at
org.apache.spark.sql.execution.aggregate.InputAggregationBuffer.getAs(udaf.scala:269)
at
com.verve.scorecard.spark.sql.MergeArraysOfElementWithCountUDAF.merge(MergeArraysOfElementWithCountUDAF.scala:59)

Again, the data is tiny and it doesn't explain why it only happens some of
the time on YARN, and never when running in local mode.

Here's how I am running the app:

spark-submit \
--deploy-mode cluster \
--master yarn \
--num-executors 1 \
--executor-cores 1 \
--executor-memory 18g \
--driver-java-options "-XX:MaxPermSize=256m" \
--conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=256m" \
--conf spark.storage.memoryFraction=0.2 \
--conf spark.shuffle.memoryFraction=0.6 \
--conf spark.sql.shuffle.partitions=1000 \
[...app specific stuff...]

Note that I am using a single executor and single core to help with
debugging, but I have the same issue with more executors/nodes.

I am running this on EMR on AWS, so this is unlikely to be a hardware issue
(different hardware each time I launch a cluster).

I've also isolated the issue to this UDAF, as removing it from my Spark SQL
makes the issue go away.

Any ideas would be appreciated.

Thanks,
Alex.


Re: Join and HashPartitioner question

2015-11-16 Thread Erwan ALLAIN
You may need to persist r1 after the partitionBy call; the second join will
then be more efficient.
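A sketch of that suggestion: keep the partitioned r1 around so that further
joins against it reuse the existing partitioning instead of reshuffling it.

val r1Partitioned = r1.partitionBy(partitioner).persist()

val res1 = r1Partitioned.join(r2.partitionBy(partitioner))
// a later join (r3 here is hypothetical) would not shuffle r1Partitioned again:
// val resLater = r1Partitioned.join(r3)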

On Mon, Nov 16, 2015 at 2:48 PM, Rishi Mishra  wrote:

> AFAIK and can see in the code both of them should behave same.
>
> On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov  > wrote:
>
>> Hi Everyone
>>
>> Is there any difference in performance btw the following two joins?
>>
>>
>> val r1: RDD[(String, String]) = ???
>> val r2: RDD[(String, String]) = ???
>>
>> val partNum = 80
>> val partitioner = new HashPartitioner(partNum)
>>
>> // Join 1
>> val res1 = r1.partitionBy(partitioner).join(r2.partitionBy(partitioner))
>>
>> // Join 2
>> val res2 = r1.join(r2, partNum)
>>
>>
>>
>
>
> --
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>


Re: spark-submit stuck and no output in console

2015-11-16 Thread Jonathan Kelly
He means for you to use jstack to obtain a stacktrace of all of the
threads. Or are you saying that the Java process never even starts?

On Mon, Nov 16, 2015 at 7:48 AM, Kayode Odeyemi  wrote:

> Spark 1.5.1
>
> The fact is that there's no stack trace. No output from that command at
> all to the console.
>
> This is all I get:
>
> hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ tail -1
> /tmp/spark-profile-job.log
> nohup: ignoring input
> /usr/local/spark/bin/spark-class: line 76: 29516 Killed
>  "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
>
>
> On Mon, Nov 16, 2015 at 5:22 PM, Ted Yu  wrote:
>
>> Which release of Spark are you using ?
>>
>> Can you take stack trace and pastebin it ?
>>
>> Thanks
>>
>> On Mon, Nov 16, 2015 at 5:50 AM, Kayode Odeyemi 
>> wrote:
>>
>>> ./spark-submit --class com.migration.UpdateProfiles --executor-memory 8g
>>> ~/migration-profiles-0.1-SNAPSHOT.jar
>>>
>>> is stuck and outputs nothing to the console.
>>>
>>> What could be the cause of this? Current max heap size is 1.75g and it's
>>> only using 1g.
>>>
>>>
>>
>
>
> --
> Odeyemi 'Kayode O.
> http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde
>


Re: spark-submit stuck and no output in console

2015-11-16 Thread Kayode Odeyemi
Spark 1.5.1

The fact is that there's no stack trace. No output from that command at all
to the console.

This is all I get:

hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ tail -1
/tmp/spark-profile-job.log
nohup: ignoring input
/usr/local/spark/bin/spark-class: line 76: 29516 Killed
 "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"


On Mon, Nov 16, 2015 at 5:22 PM, Ted Yu  wrote:

> Which release of Spark are you using ?
>
> Can you take stack trace and pastebin it ?
>
> Thanks
>
> On Mon, Nov 16, 2015 at 5:50 AM, Kayode Odeyemi  wrote:
>
>> ./spark-submit --class com.migration.UpdateProfiles --executor-memory 8g
>> ~/migration-profiles-0.1-SNAPSHOT.jar
>>
>> is stuck and outputs nothing to the console.
>>
>> What could be the cause of this? Current max heap size is 1.75g and it's
>> only using 1g.
>>
>>
>


-- 
Odeyemi 'Kayode O.
http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde


Re: spark-submit stuck and no output in console

2015-11-16 Thread Ted Yu
Which release of Spark are you using ?

Can you take stack trace and pastebin it ?

Thanks

On Mon, Nov 16, 2015 at 5:50 AM, Kayode Odeyemi  wrote:

> ./spark-submit --class com.migration.UpdateProfiles --executor-memory 8g
> ~/migration-profiles-0.1-SNAPSHOT.jar
>
> is stuck and outputs nothing to the console.
>
> What could be the cause of this? Current max heap size is 1.75g and it's
> only using 1g.
>
>


[Spark-Avro] Question related to the Avro data generated by Spark-Avro

2015-11-16 Thread java8964
Hi,

I have one question related to Spark-Avro; not sure if this is the best place
to ask.

I have the following Scala case classes, populated with data in the Spark
application, which I try to save in Avro format to HDFS:

case class Claim ( .. )
case class Coupon ( account_id: Long, claims: List[Claim] )

As the example above shows, the Coupon case class contains a List of Claim.
The RDD holds the Coupon data, which I try to save into HDFS. I am using Spark
1.3.1 with Spark-Avro 1.0.0 (which matches Spark 1.3.x):

rdd.toDF.save("hdfs_location", "com.databricks.spark.avro")

I have no problem saving the data this way, but the problem is that I cannot
use the Avro data in Hive. Here is the schema generated by Spark-Avro for the
above data:

{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "account_id",
      "type": "long"
    },
    {
      "name": "claims",
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "claims",
              "fields": [
                ..

The claims field is generated as a union containing an array, instead of an
array of structs directly. To put it more clearly, here is the schema Hive
shows when pointing to the data generated by Spark-Avro:

desc table
OK
col_name      data_type           comment
account_id    bigint              from deserializer
...
claims        uniontype>>         from deserializer

Obviously, this causes trouble for Hive when querying this data (at least in
Hive 0.12, which we currently use), so end users cannot query it in Hive with
something like "select claims[0].account_id from table".

I wonder why Spark-Avro has to wrap the array in a union in this case, instead
of just building the array directly. Or better, is there a way to control the
Avro schema generated in this case by Spark-Avro?

Thanks,
Yong
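For what it's worth, a hedged (untested) sketch of one thing to try:
spark-avro writes a field whose Spark SQL type is nullable as an Avro union
with "null", which is what surfaces in Hive as a uniontype. If the data is
known to be non-null, rebuilding the schema with nullable = false before
saving may avoid the union wrapper.

import org.apache.spark.sql.types.StructType
import sqlContext.implicits._   // as in the original code, for rdd.toDF()

val df = rdd.toDF()
val nonNullableSchema = StructType(df.schema.fields.map(_.copy(nullable = false)))
val dfStrict = sqlContext.createDataFrame(df.rdd, nonNullableSchema)
dfStrict.save("hdfs_location", "com.databricks.spark.avro")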

[POWERED BY] Please add our organization

2015-11-16 Thread Adrien Mogenet
Name: Content Square
URL: http://www.contentsquare.com

Description:
We use Spark to regularly read raw data, convert them into Parquet, and
process them to create advanced analytics dashboards: aggregation,
sampling, statistics computations, anomaly detection, machine learning.

-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris


How 'select name,age from TBL_STUDENT where age = 37' is optimized when caching it

2015-11-16 Thread Todd
Hi,

When I cache the dataframe and run the query,
  
val df = sqlContext.sql("select  name,age from TBL_STUDENT where age = 37")
df.cache()
df.show
println(df.queryExecution)

I got the following execution plan. From the optimized logical plan, I can see
that the whole analyzed logical plan is totally replaced with the
InMemoryRelation logical plan. But when I look into the Optimizer, I didn't see
any optimizer rule that relates to the InMemoryRelation.

Could you please explain how the optimization works?




== Parsed Logical Plan ==
', argString:< [unresolvedalias(UnresolvedAttribute: 
'name),unresolvedalias(UnresolvedAttribute: 'age)]>
  ', argString:< (UnresolvedAttribute: 'age = 37)>
', argString:< [TBL_STUDENT], None>

== Analyzed Logical Plan ==
name: string, age: int
, argString:< [AttributeReference:name#1,AttributeReference:age#3]>
  , argString:< (AttributeReference:age#3 = 37)>
, argString:< TBL_STUDENT>
  , argString:< 
[AttributeReference:id#0,AttributeReference:name#1,AttributeReference:classId#2,AttributeReference:age#3],
 MapPartitionsRDD[4] at main at NativeMethodAccessorImpl.java:-2>

== Optimized Logical Plan ==
, argString:< 
[AttributeReference:name#1,AttributeReference:age#3], true, 1, 
StorageLevel(true, true, false, true, 1), (, argString:< 
[AttributeReference:name#1,AttributeReference:age#3]>), None>

== Physical Plan ==
, argString:< 
[AttributeReference:name#1,AttributeReference:age#3], (, 
argString:< [AttributeReference:name#1,AttributeReference:age#3], true, 1, 
StorageLevel(true, true, false, true, 1), (, argString:< 
[AttributeReference:name#1,AttributeReference:age#3]>), None>)>



spark-submit stuck and no output in console

2015-11-16 Thread Kayode Odeyemi
./spark-submit --class com.migration.UpdateProfiles --executor-memory 8g
~/migration-profiles-0.1-SNAPSHOT.jar

is stuck and outputs nothing to the console.

What could be the cause of this? Current max heap size is 1.75g and it's
only using 1g.


Re: Join and HashPartitioner question

2015-11-16 Thread Rishi Mishra
AFAIK, and from what I can see in the code, both of them should behave the same.
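
For what it's worth, a minimal self-contained sketch (local mode, made-up
data): as far as I can tell from the 1.x source, join(other, partNum) just
builds a HashPartitioner(partNum) internally, so for a one-off join both
forms shuffle each RDD once. Explicit partitionBy only pays off when the
partitioned RDD is persisted and reused.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("join-sketch").setMaster("local[*]"))
    val r1 = sc.parallelize(Seq("a" -> "1", "b" -> "2"))
    val r2 = sc.parallelize(Seq("a" -> "x", "c" -> "y"))

    val partNum = 8
    val partitioner = new HashPartitioner(partNum)

    // Join 1: partition both sides explicitly, then join.
    val res1 = r1.partitionBy(partitioner).join(r2.partitionBy(partitioner))
    // Join 2: pass a partition count; a HashPartitioner is created under the hood.
    val res2 = r1.join(r2, partNum)

    // Pre-partitioning helps only when the partitioned RDD is persisted and
    // reused, so later co-partitioned operations skip another shuffle.
    val r1p = r1.partitionBy(partitioner).persist()
    val res3 = r1p.join(r2.partitionBy(partitioner))

    println((res1.count(), res2.count(), res3.count()))
    sc.stop()
  }
}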

On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov 
wrote:

> Hi Everyone
>
> Is there any difference in performance btw the following two joins?
>
>
> val r1: RDD[(String, String)] = ???
> val r2: RDD[(String, String)] = ???
>
> val partNum = 80
> val partitioner = new HashPartitioner(partNum)
>
> // Join 1
> val res1 = r1.partitionBy(partitioner).join(r2.partitionBy(partitioner))
>
> // Join 2
> val res2 = r1.join(r2, partNum)
>
>
>


-- 
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra


Re: DynamoDB Connector?

2015-11-16 Thread Nick Pentreath
See this thread for some info:
http://apache-spark-user-list.1001560.n3.nabble.com/DynamoDB-input-source-td8814.html

I don't think the situation has changed that much - if you're using Spark
on EMR, then I think the InputFormat is available in a JAR (though I
haven't tested that). Otherwise you'll need to try to get the JAR and see
if you can get it to work outside of EMR.
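
If it does work, the usual pattern looks roughly like the sketch below. The
class names, property keys and the emr-ddb connector JAR it relies on are
assumptions taken from the EMR DynamoDB connector and may differ by version.

import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}

object DynamoScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dynamo-scan"))

    // Table name, region and endpoint are placeholders -- use your own values.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.input.tableName", "myTable")
    jobConf.set("dynamodb.regionid", "us-east-1")
    jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")

    // The InputFormat performs a parallel scan; each split becomes one RDD
    // partition, which is as close to data locality as DynamoDB gets.
    val rows = sc.hadoopRDD(
      jobConf,
      classOf[DynamoDBInputFormat],
      classOf[Text],
      classOf[DynamoDBItemWritable])

    // Each value wraps the attribute map of a single item.
    rows.map { case (_, item) => item.getItem.keySet() }.take(5).foreach(println)
    sc.stop()
  }
}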

I'm afraid this thread (
https://forums.aws.amazon.com/thread.jspa?threadID=168506) does not appear
encouraging, even for using Spark on EMR to read from DynamoDB using
the InputFormat! It's a pity AWS doesn't open source the InputFormat.

On Mon, Nov 16, 2015 at 5:00 AM, Charles Cobb 
wrote:

> Hi,
>
> What is the best practice for reading from DynamoDB from Spark? I know I
> can use the Java API, but this doesn't seem to take data locality into
> consideration at all.
>
> I was looking for something along the lines of the cassandra connector:
> https://github.com/datastax/spark-cassandra-connector
>
> Thanks,
> CJ
>
>


Re: Spark Expand Cluster

2015-11-16 Thread Dinesh Ranganathan
Hi Sab,

I did not specify the number of executors when I submitted the Spark
application. I was under the impression that Spark looks at the cluster and
figures out the number of executors it can use based on the cluster size
automatically; is this what you call dynamic allocation? I am a Spark
newbie, so apologies if I am missing the obvious. While the application was
running I added more core nodes by resizing my EMR cluster, and I can see
the new nodes on the resource manager, but my running application did not
pick up the machines I just added. Let me know if I am missing a step here.

Thanks,
Dan

On 16 November 2015 at 12:38, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Spark will use the number of executors you specify in spark-submit. Are
> you saying that Spark is not able to use more executors after you modify it
> in spark-submit? Are you using dynamic allocation?
>
> Regards
> Sab
>
> On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan <
> dineshranganat...@gmail.com> wrote:
>
>> I have my Spark application deployed on AWS EMR on yarn cluster mode.
>> When I
>> increase the capacity of my cluster by adding more Core instances on AWS,
>> I
>> don't see Spark picking up the new instances dynamically. Is there
>> anything
>> I can do to tell Spark to pick up the newly added boxes??
>>
>> Dan
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Expand-Cluster-tp25393.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
>
> Architect - Big Data
> Ph: +91 99805 99458
>
> Manthan Systems | *Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)*
> +++
>



-- 
Dinesh Ranganathan


Re: Spark Expand Cluster

2015-11-16 Thread Sabarish Sasidharan
Spark will use the number of executors you specify in spark-submit. Are you
saying that Spark is not able to use more executors after you modify it in
spark-submit? Are you using dynamic allocation?
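
If not, a minimal sketch of what enabling it looks like (property names as
of Spark 1.4/1.5; the min/max bounds are illustrative, and the YARN external
shuffle service also has to be configured on every NodeManager):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")   // illustrative bounds
  .set("spark.dynamicAllocation.maxExecutors", "50")
val sc = new SparkContext(conf)

With that in place, executors are requested and released based on the
pending task backlog, so a running application can grow onto newly added
nodes.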

Regards
Sab

On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan <
dineshranganat...@gmail.com> wrote:

> I have my Spark application deployed on AWS EMR on yarn cluster mode.
> When I
> increase the capacity of my cluster by adding more Core instances on AWS, I
> don't see Spark picking up the new instances dynamically. Is there anything
> I can do to tell Spark to pick up the newly added boxes??
>
> Dan
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Expand-Cluster-tp25393.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*
+++


Spark Expand Cluster

2015-11-16 Thread dineshranganathan
I have my Spark application deployed on AWS EMR on yarn cluster mode.  When I
increase the capacity of my cluster by adding more Core instances on AWS, I
don't see Spark picking up the new instances dynamically. Is there anything
I can do to tell Spark to pick up the newly added boxes??

Dan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Expand-Cluster-tp25393.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Hive on Spark orc file empty

2015-11-16 Thread Deepak Sharma
Sai,
I am a bit confused here.
How are you using write with results?
I am using Spark 1.4.1, and when I use write, it complains about write not
being a member of DataFrame:
error: value write is not a member of org.apache.spark.sql.DataFrame

Thanks
Deepak

On Mon, Nov 16, 2015 at 4:10 PM, 张炜  wrote:

> Dear all,
> I am following this article to try Hive on Spark
>
> http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/
>
> My environment:
> Hive 1.2.1
> Spark 1.5.1
>
> in a nutshell, I ran spark-shell, created a hive table
>
> hiveContext.sql("create table yahoo_orc_table (date STRING, open_price
> FLOAT, high_price FLOAT, low_price FLOAT, close_price FLOAT, volume INT,
> adj_price FLOAT) stored as orc")
>
> I also computed a dataframe and can show correct contents.
> val results = sqlContext.sql("SELECT * FROM yahoo_stocks_temp")
>
> Then I executed the save command
> results.write.format("orc").save("yahoo_stocks_orc")
>
> I can see a folder named "yahoo_stocks_orc" was created successfully and there is
> a _SUCCESS file inside it, but no ORC file at all. I repeated this many
> times with the same result.
>
> But
> results.write.format("orc").save("hdfs://*:8020/yahoo_stocks_orc") can
> successfully write contents.
>
> Please kindly help.
>
> Regards,
> Sai
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Hive on Spark orc file empty

2015-11-16 Thread 张炜
Dear all,
I am following this article to try Hive on Spark
http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/

My environment:
Hive 1.2.1
Spark 1.5.1

in a nutshell, I ran spark-shell, created a hive table

hiveContext.sql("create table yahoo_orc_table (date STRING, open_price
FLOAT, high_price FLOAT, low_price FLOAT, close_price FLOAT, volume INT,
adj_price FLOAT) stored as orc")

I also computed a dataframe and can show correct contents.
val results = sqlContext.sql("SELECT * FROM yahoo_stocks_temp")

Then I executed the save command
results.write.format("orc").save("yahoo_stocks_orc")

I can see a folder named "yahoo_stocks_orc" was created successfully and there is a
_SUCCESS file inside it, but no ORC file at all. I repeated this many
times with the same result.

But
results.write.format("orc").save("hdfs://*:8020/yahoo_stocks_orc") can
successfully write contents.

Please kindly help.

Regards,
Sai


Re: How to enable MetricsServlet sink in Spark 1.5.0?

2015-11-16 Thread rakesh rakshit
Hi Saisai,

As I mentioned, I am getting very little information using the /metrics/json
URI, and *the /metrics/master/json and /metrics/applications/json URIs do
not seem to be working*.

Please verify the following at your end.

The metrics dumped using /metrics/json are as follows:



{
   "version": "3.0.0",
   "gauges":
   {
   "DAGScheduler.job.activeJobs":
   {
   "value": 2
   },
   "DAGScheduler.job.allJobs":
   {
   "value": 8
   },
   "DAGScheduler.stage.failedStages":
   {
   "value": 0
   },
   "DAGScheduler.stage.runningStages":
   {
   "value": 2
   },
   "DAGScheduler.stage.waitingStages":
   {
   "value": 0
   },

"app-20151116150148-0001.driver.BlockManager.disk.diskSpaceUsed_MB":
   {
   "value": 0
   },
   "app-20151116150148-0001.driver.BlockManager.memory.maxMem_MB":
   {
   "value": 1590
   },
   "app-20151116150148-0001.driver.BlockManager.memory.memUsed_MB":
   {
   "value": 0
   },

"app-20151116150148-0001.driver.BlockManager.memory.remainingMem_MB":
   {
   "value": 1590
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastCompletedBatch_processingDelay":
   {
   "value": 142
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastCompletedBatch_processingEndTime":
   {
   "value": 1447666380153
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastCompletedBatch_processingStartTime":
   {
   "value": 1447666380011
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay":
   {
   "value": 0
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastCompletedBatch_submissionTime":
   {
   "value": 1447666380011
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastCompletedBatch_totalDelay":
   {
   "value": 142
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime":
   {
   "value": 1447666380153
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastReceivedBatch_processingStartTime":
   {
   "value": 1447666380011
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastReceivedBatch_records":
   {
   "value": 2
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.lastReceivedBatch_submissionTime":
   {
   "value": 1447666380011
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.receivers":
   {
   "value": 2
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.retainedCompletedBatches":
   {
   "value": 2
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.runningBatches":
   {
   "value": 0
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.totalCompletedBatches":
   {
   "value": 2
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.totalProcessedRecords":
   {
   "value": 2
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.totalReceivedRecords":
   {
   "value": 2
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.unprocessedBatches":
   {
   "value": 0
   },

"app-20151116150148-0001.driver.NetworkWordCount.StreamingMetrics.streaming.waitingBatches":
   {
   "value": 0
   }
   },
   "counters":
   {
   },
   "histograms":
   {
   },
   "meters":
   {
   },
   "timers":
   {
   "DAGScheduler.messageProcessingTime":
   {
   "count": 161,
   "max": 213.212698,
   "mean": 2.848791141751413,
   "min": 0.044305,
   "p50": 0.268713,
   "p75": 0.376208996,
   "p95": 13.734098,
   "p98": 26.988401,
   "p99": 34.595124,
   "p999": 213.212698,
   "stddev": 16.63943615966855,
   "m15_rate": 27.1366618014181,
   "m1_rate": 4.23709

RE: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Ewan Leith
How big do you expect the file to be? Spark has issues with single blocks over 
2GB (see https://issues.apache.org/jira/browse/SPARK-1476 and 
https://issues.apache.org/jira/browse/SPARK-6235 for example)

If you don’t know, try running

df.repartition(100).write.format…

to get an idea of how big it would be; I assume it's over 2 GB
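
i.e. something along these lines, with df being the DataFrame from your
snippet (the partition count and bucket path are placeholders; pick a count
that keeps each output part well under 2 GB):

// Spark 1.4.x with spark-csv; repartitioning before the write spreads the
// data over more, smaller blocks so no single block hits the 2 GB limit.
df.repartition(100)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3://your-bucket/newcars-csv")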

From: Zhang, Jingyu [mailto:jingyu.zh...@news.com.au]
Sent: 16 November 2015 10:17
To: user 
Subject: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1


I am using spark-csv to save files to S3, and it fails with a "Size exceeds
Integer.MAX_VALUE" error. Please let me know how to fix it. Thanks.

df.write()

.format("com.databricks.spark.csv")

.option("header", "true")

.save("s3://newcars.csv");

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:860)

at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)

at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)

at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)

at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)

at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)

at 
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:429)

at org.apache.spark.storage.BlockManager.get(BlockManager.scala:617)

at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)

at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:70)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)


This message and its attachments may contain legally privileged or confidential 
information. It is intended solely for the named addressee. If you are not the 
addressee indicated in this message or responsible for delivery of the message 
to the addressee, you may not copy or deliver this message or its attachments 
to anyone. Rather, you should permanently delete this message and its 
attachments and kindly notify the sender by reply e-mail. Any content of this 
message and its attachments which does not relate to the official business of 
the sending company must be taken not to have been sent or endorsed by that 
company or any of its related entities. No warranty is made that the e-mail or 
attachments are free from computer virus or other defect.


Re: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Sabarish Sasidharan
You can try increasing the number of partitions before writing it out.

Regards
Sab

On Mon, Nov 16, 2015 at 3:46 PM, Zhang, Jingyu 
wrote:

> I am using spark-csv to save files to S3, and it fails with a "Size exceeds
> Integer.MAX_VALUE" error. Please let me know how to fix it. Thanks.
>
> df.write()
> .format("com.databricks.spark.csv")
> .option("header", "true")
> .save("s3://newcars.csv");
>
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:860)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
>   at 
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)
>   at 
> org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:429)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:617)
>   at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
>
>
>
> This message and its attachments may contain legally privileged or
> confidential information. It is intended solely for the named addressee. If
> you are not the addressee indicated in this message or responsible for
> delivery of the message to the addressee, you may not copy or deliver this
> message or its attachments to anyone. Rather, you should permanently delete
> this message and its attachments and kindly notify the sender by reply
> e-mail. Any content of this message and its attachments which does not
> relate to the official business of the sending company must be taken not to
> have been sent or endorsed by that company or any of its related entities.
> No warranty is made that the e-mail or attachments are free from computer
> virus or other defect.




-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*
+++


Re: Size exceeds Integer.MAX_VALUE (SparkSQL$TreeNodeException: sort, tree) on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Zhang, Jingyu
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [net_site#50 ASC,device#6 ASC], true
 Exchange (RangePartitioning 200)
  Project 
[net_site#50,device#6,total_count#105L,adblock_count#106L,noanalytics_count#107L,unique_nk_count#109L]
   HashOuterJoin [net_site#50,device#6], [net_site#530,device#449],
LeftOuter, None
Project 
[adblock_count#106L,total_count#105L,net_site#50,noanalytics_count#107L,device#6]
 HashOuterJoin [net_site#50,device#6], [net_site#419,device#338],
LeftOuter, None
  Project [total_count#105L,device#6,adblock_count#106L,net_site#50]
   HashOuterJoin [net_site#50,device#6],
[net_site#308,device#227], LeftOuter, None
Project [total_count#105L,device#6,net_site#50]
 HashOuterJoin [net_site#50,device#6],
[net_site#197,device#116], LeftOuter, None
  Project [device#6,net_site#50]
   Aggregate false, [net_site#50,device#6], [net_site#50,device#6]
Exchange (HashPartitioning 200)
 Aggregate true, [net_site#50,device#6], [net_site#50,device#6]
  InMemoryColumnarTableScan [net_site#50,device#6],
(InMemoryRelation
[net_site#50,device#6,cbd#5,et#8,news_key#16,underscore_et#35], true,
1, StorageLevel(true, true, false, true, 1), (Project
[net_site#50,device#6,cbd#5,et#8,news_key#16,underscore_et#35]), None)
  Aggregate false, [net_site#197,device#116],
[net_site#197,device#116,Coalesce(SUM(PartialCount#705L),0) AS
total_count#105L]
   Exchange (HashPartitioning 200)
Aggregate true, [net_site#197,device#116],
[net_site#197,device#116,COUNT(device#116) AS PartialCount#705L]
 Project [net_site#197,device#116]
  Filter (IS NULL et#118 && (underscore_et#145 = view))
   InMemoryColumnarTableScan
[net_site#197,device#116,et#118,underscore_et#145], [IS NULL
et#118,(underscore_et#145 = view)], (InMemoryRelation
[net_site#197,device#116,cbd#115,et#118,news_key#126,underscore_et#145],
true, 1, StorageLevel(true, true, false, true, 1), (Project
[net_site#50,device#6,cbd#5,et#8,news_key#16,underscore_et#35]), None)
Aggregate false, [net_site#308,device#227],
[net_site#308,device#227,Coalesce(SUM(PartialCount#709L),0) AS
adblock_count#106L]
 Exchange (HashPartitioning 200)
  Aggregate true, [net_site#308,device#227],
[net_site#308,device#227,COUNT(device#227) AS PartialCount#709L]
   Project [net_site#308,device#227]
Filter (cbd#226 LIKE _1___)
 InMemoryColumnarTableScan
[net_site#308,device#227,cbd#226], [(cbd#226 LIKE _1___)],
(InMemoryRelation
[net_site#308,device#227,cbd#226,et#229,news_key#237,underscore_et#256],
true, 1, StorageLevel(true, true, false, true, 1), (Project
[net_site#50,device#6,cbd#5,et#8,news_key#16,underscore_et#35]), None)
  Aggregate false, [net_site#419,device#338],
[net_site#419,device#338,Coalesce(SUM(PartialCount#713L),0) AS
noanalytics_count#107L]
   Exchange (HashPartitioning 200)
Aggregate true, [net_site#419,device#338],
[net_site#419,device#338,COUNT(device#338) AS PartialCount#713L]
 Project [net_site#419,device#338]
  Filter ((CAST(et#340, DoubleType) = 3.0) && IS NOT NULL net_site#419)
   InMemoryColumnarTableScan [net_site#419,device#338,et#340],
[(CAST(et#340, DoubleType) = 3.0),IS NOT NULL net_site#419],
(InMemoryRelation
[net_site#419,device#338,cbd#337,et#340,news_key#348,underscore_et#367],
true, 1, StorageLevel(true, true, false, true, 1), (Project
[net_site#50,device#6,cbd#5,et#8,news_key#16,underscore_et#35]), None)
Aggregate false, [net_site#530,device#449],
[net_site#530,device#449,Coalesce(SUM(PartialCount#717L),0) AS
unique_nk_count#109L]
 Exchange (HashPartitioning 200)
  Aggregate true, [net_site#530,device#449],
[net_site#530,device#449,COUNT(device#449) AS PartialCount#717L]
   Project [net_site#530,device#449]
Filter (cnt#108L = 1)
 Aggregate false, [net_site#530,device#449,news_key#459],
[net_site#530,device#449,news_key#459,Coalesce(SUM(PartialCount#719L),0)
AS cnt#108L]
  Exchange (HashPartitioning 200)
   Aggregate true, [net_site#530,device#449,news_key#459],
[net_site#530,device#449,news_key#459,COUNT(news_key#459) AS
PartialCount#719L]
Project [net_site#530,device#449,news_key#459]
 Filter (CAST(et#451, DoubleType) = 3.0)
  InMemoryColumnarTableScan
[net_site#530,device#449,news_key#459,et#451], [(CAST(et#451,
DoubleType) = 3.0)], (InMemoryRelation
[net_site#530,device#449,cbd#448,et#451,news_key#459,underscore_et#478],
true, 1, StorageLevel(true, true, false, true, 1), (Project
[net_site#50,device#6,cbd#5,et#8,news_key#16,underscore_et#35]), None)

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [net_site#50 ASC,device#6 ASC], true
 Exchange (RangePartitioning 200)
  Project 
[net_site#50,device#6,total_count#105L,adblock_count#106L,noanalytics_

Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Zhang, Jingyu
I am using spark-csv to save files to S3, and it fails with a "Size exceeds
Integer.MAX_VALUE" error. Please let me know how to fix it. Thanks.

df.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("s3://newcars.csv");

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:860)
at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)
at 
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:429)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:617)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-- 
This message and its attachments may contain legally privileged or 
confidential information. It is intended solely for the named addressee. If 
you are not the addressee indicated in this message or responsible for 
delivery of the message to the addressee, you may not copy or deliver this 
message or its attachments to anyone. Rather, you should permanently delete 
this message and its attachments and kindly notify the sender by reply 
e-mail. Any content of this message and its attachments which does not 
relate to the official business of the sending company must be taken not to 
have been sent or endorsed by that company or any of its related entities. 
No warranty is made that the e-mail or attachments are free from computer 
virus or other defect.


Re: How to enable MetricsServlet sink in Spark 1.5.0?

2015-11-16 Thread Saisai Shao
It should work. I tested in my local environment with "curl
http://localhost:4040/metrics/json/" and metrics were dumped. For cluster
metrics, you have to change your base URL to point to the cluster manager.
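
For example (a sketch assuming the default ports, 4040 for the driver UI and
8080 for the standalone master web UI; adjust host and port for your
deployment):

curl http://localhost:4040/metrics/json/                    # driver/application metrics
curl http://<master-host>:8080/metrics/master/json/         # master instance metrics
curl http://<master-host>:8080/metrics/applications/json/   # per-application snapshot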

Thanks
Jerry

On Mon, Nov 16, 2015 at 5:42 PM, ihavethepotential <
ihavethepotent...@gmail.com> wrote:

> Hi all,
>
> I am trying to get the metrics using the MetricsServlet sink(that I guess
> is
> enabled by default) as mentioned in the Spark documentation:
>
> "5. MetricsServlet is added by default as a sink in master, worker and
> client
> # driver, you can send http request "/metrics/json" to get a snapshot of
> all
> the
> # registered metrics in json format. For master, requests
> "/metrics/master/json" and
> # "/metrics/applications/json" can be sent seperately to get metrics
> snapshot of
> # instance master and applications. MetricsServlet may not be configured by
> self."
>
> I am getting very little information using the /metrics/json URI, and the
> /metrics/master/json and /metrics/applications/json URIs do not seem to be
> working.
>
> My intent is to get the metrics for
> 1. Cluster summary
> 2. Apps running and completed
> 3. Executors summary of the apps
> 4. Job summary of the apps
> 5. DAGs metrics for the app jobs
>
> Help for the same is greatly appreciated.
>
> Thanks and Regards,
> Rakesh
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-MetricsServlet-sink-in-Spark-1-5-0-tp25392.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: No spark examples jar in maven repository after 1.1.1 ?

2015-11-16 Thread Jeff Zhang
But it may be useful for users to be able to browse the example source code
in the IDE just by adding it as a Maven dependency. Otherwise users have to
either download the source code or check it on GitHub.

On Mon, Nov 16, 2015 at 5:32 PM, Sean Owen  wrote:

> I think because they're not a library? they're example code, not
> something you build an app on.
>
> On Mon, Nov 16, 2015 at 9:27 AM, Jeff Zhang  wrote:
> > I don't find spark examples jar in maven repository after 1.1.1. Any
> reason
> > for that ?
> >
> > http://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.10
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


How to enable MetricsServlet sink in Spark 1.5.0?

2015-11-16 Thread ihavethepotential
Hi all,

I am trying to get the metrics using the MetricsServlet sink (which I guess is
enabled by default), as mentioned in the Spark documentation:

"5. MetricsServlet is added by default as a sink in master, worker and
client
# driver, you can send http request "/metrics/json" to get a snapshot of all
the
# registered metrics in json format. For master, requests
"/metrics/master/json" and
# "/metrics/applications/json" can be sent seperately to get metrics
snapshot of
# instance master and applications. MetricsServlet may not be configured by
self."

I am getting very little information using the /metrics/json URI, and the
/metrics/master/json and /metrics/applications/json URIs do not seem to be
working.

My intent is to get the metrics for 
1. Cluster summary 
2. Apps running and completed
3. Executors summary of the apps
4. Job summary of the apps
5. DAGs metrics for the app jobs

Help for the same is greatly appreciated.

Thanks and Regards,
Rakesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-MetricsServlet-sink-in-Spark-1-5-0-tp25392.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No spark examples jar in maven repository after 1.1.1 ?

2015-11-16 Thread Sean Owen
I think it's because they're not a library? They're example code, not
something you build an app on.

On Mon, Nov 16, 2015 at 9:27 AM, Jeff Zhang  wrote:
> I don't find spark examples jar in maven repository after 1.1.1. Any reason
> for that ?
>
> http://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.10
>
>
> --
> Best Regards
>
> Jeff Zhang

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



No spark examples jar in maven repository after 1.1.1 ?

2015-11-16 Thread Jeff Zhang
I don't find the spark-examples jar in the Maven repository after 1.1.1. Any
reason for that?

http://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.10


-- 
Best Regards

Jeff Zhang