Support for group aggregate pandas UDF in streaming aggregation for SPARK 3.0 python

2020-08-11 Thread Aesha Dhar Roy
Hi,

Is there any plan to remove the limitation mentioned below?

*Streaming aggregation doesn't support group aggregate pandas UDF *

We want to run our data modelling jobs in real time using Spark 3.0 and Kafka
2.4, and need support for custom aggregate pandas UDFs on stream windows.
Is there any plan for this in the upcoming releases?
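For reference, a rough sketch (not from the original message) of what the
limitation looks like in practice; the weighted_mean UDF and the rate-source
columns are illustrative:

import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-streaming-sketch").getOrCreate()

@pandas_udf("double")
def weighted_mean(v: pd.Series, w: pd.Series) -> float:
    # Series -> scalar, i.e. a group aggregate pandas UDF
    return float((v * w).sum() / w.sum())

# On a static DataFrame this works:
batch = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 3.0, 1.0), ("b", 5.0, 4.0)], ["key", "v", "w"])
batch.groupBy("key").agg(weighted_mean("v", "w").alias("wm")).show()

# The streaming equivalent over a window is what is being asked for; as of
# Spark 3.0 the analyzer rejects it with the error quoted above:
stream = (spark.readStream.format("rate").load()
          .withColumn("key", F.col("value") % 2)
          .withColumn("v", F.col("value").cast("double"))
          .withColumn("w", F.lit(1.0)))
# stream.groupBy(F.window("timestamp", "1 minute"), "key") \
#       .agg(weighted_mean("v", "w"))   # raises AnalysisException in Spark 3.0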

Regards,
Aesha


External hive metastore (remote) managed tables

2020-05-28 Thread Debajyoti Roy
Hi, does anyone know the behavior of dropping managed tables in the case of an
external Hive metastore:

Does deletion of the data (e.g. from the object store) happen from Spark SQL,
or from the external Hive metastore?

I am confused by the local-mode and remote-mode code paths.
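Not an answer — a small PySpark sketch, with made-up table names, for checking
whether a table registered in the external metastore is managed and where its
data lives before experimenting with DROP TABLE:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("managed-table-check")
         .enableHiveSupport()   # uses whatever metastore hive-site.xml points at
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS demo_managed (id INT)")          # managed: no LOCATION clause
spark.sql("DESCRIBE EXTENDED demo_managed").show(100, truncate=False)  # look for Type: MANAGED and Location
# spark.sql("DROP TABLE demo_managed")  # dropping a managed table removes the data at Location as well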


Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
Thanks Xiao, a more up-to-date publication in a conference like VLDB would
certainly turn the tide for many of us trying to defend Spark's Optimizer.

On Wed, Jan 15, 2020 at 9:39 AM Xiao Li  wrote:

> In the upcoming Spark 3.0, we introduced a new framework for Adaptive
> Query Execution in Catalyst. This can adjust the plans based on the runtime
> statistics. This is missing in Calcite based on my understanding.
>
> Catalyst is also very easy to enhance. We also use the dynamic programming
> approach in our cost-based join reordering. If needed, in the future, we
> also can improve the existing CBO and make it more general. The paper of
> Spark SQL was published 5 years ago. A lot of great contributions were made
> in the past 5 years.
>
> Cheers,
>
> Xiao
>
> Debajyoti Roy  于2020年1月15日周三 上午9:23写道:
>
>> Thanks all, and Matei.
>>
>> TL;DR of the conclusion for my particular case:
>> Qualitatively, while Catalyst[1] tries to mitigate learning curve and
>> maintenance burden, it lacks the dynamic programming approach used by
>> Calcite[2] and risks falling into local minima.
>> Quantitatively, there is no reproducible benchmark, that fairly compares
>> Optimizer frameworks, apples to apples (excluding execution).
>>
>> References:
>> [1] -
>> https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
>> [2] - https://arxiv.org/pdf/1802.10233.pdf
>>
>> On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia 
>> wrote:
>>
>>> I’m pretty sure that Catalyst was built before Calcite, or at least in
>>> parallel. Calcite 1.0 was only released in 2015. From a technical
>>> standpoint, building Catalyst in Scala also made it more concise and easier
>>> to extend than an optimizer written in Java (you can find various
>>> presentations about how Catalyst works).
>>>
>>> Matei
>>>
>>> > On Jan 13, 2020, at 8:41 AM, Michael Mior  wrote:
>>> >
>>> > It's fairly common for adapters (Calcite's abstraction of a data
>>> > source) to push down predicates. However, the API certainly looks a
>>> > lot different than Catalyst's.
>>> > --
>>> > Michael Mior
>>> > mm...@apache.org
>>> >
>>> > Le lun. 13 janv. 2020 à 09:45, Jason Nerothin
>>> >  a écrit :
>>> >>
>>> >> The implementation they chose supports push down predicates, Datasets
>>> and other features that are not available in Calcite:
>>> >>
>>> >> https://databricks.com/glossary/catalyst-optimizer
>>> >>
>>> >> On Mon, Jan 13, 2020 at 8:24 AM newroyker 
>>> wrote:
>>> >>>
>>> >>> Was there a qualitative or quantitative benchmark done before a
>>> design
>>> >>> decision was made not to use Calcite?
>>> >>>
>>> >>> Are there limitations (for heuristic based, cost based, * aware
>>> optimizer)
>>> >>> in Calcite, and frameworks built on top of Calcite? In the context
>>> of big
>>> >>> data / TCPH benchmarks.
>>> >>>
>>> >>> I was unable to dig up anything concrete from user group / Jira.
>>> Appreciate
>>> >>> if any Catalyst veteran here can give me pointers. Trying to defend
>>> >>> Spark/Catalyst.
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> Thanks,
>>> >> Jason
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >
>>>
>>>
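Not part of the thread — for readers who want to try the Adaptive Query
Execution framework mentioned above, a minimal sketch (configuration names as
of Spark 3.0; the tables are made up):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aqe-sketch")
         .config("spark.sql.adaptive.enabled", "true")                     # turn AQE on
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # coalesce shuffle partitions at runtime
         .getOrCreate())

big = spark.range(0, 1000000).withColumnRenamed("id", "k")
small = spark.range(0, 100).withColumnRenamed("id", "k")
# With AQE enabled the physical plan is wrapped in AdaptiveSparkPlan and can be
# re-optimized using runtime shuffle statistics (e.g. join strategy switching).
big.join(small, "k").groupBy("k").count().explain()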


Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
Thanks all, and Matei.

TL;DR of the conclusion for my particular case:
Qualitatively, while Catalyst[1] tries to mitigate learning curve and
maintenance burden, it lacks the dynamic programming approach used by
Calcite[2] and risks falling into local minima.
Quantitatively, there is no reproducible benchmark that fairly compares
optimizer frameworks apples to apples (excluding execution).

References:
[1] -
https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
[2] - https://arxiv.org/pdf/1802.10233.pdf

On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia 
wrote:

> I’m pretty sure that Catalyst was built before Calcite, or at least in
> parallel. Calcite 1.0 was only released in 2015. From a technical
> standpoint, building Catalyst in Scala also made it more concise and easier
> to extend than an optimizer written in Java (you can find various
> presentations about how Catalyst works).
>
> Matei
>
> > On Jan 13, 2020, at 8:41 AM, Michael Mior  wrote:
> >
> > It's fairly common for adapters (Calcite's abstraction of a data
> > source) to push down predicates. However, the API certainly looks a
> > lot different than Catalyst's.
> > --
> > Michael Mior
> > mm...@apache.org
> >
> > Le lun. 13 janv. 2020 à 09:45, Jason Nerothin
> >  a écrit :
> >>
> >> The implementation they chose supports push down predicates, Datasets
> and other features that are not available in Calcite:
> >>
> >> https://databricks.com/glossary/catalyst-optimizer
> >>
> >> On Mon, Jan 13, 2020 at 8:24 AM newroyker  wrote:
> >>>
> >>> Was there a qualitative or quantitative benchmark done before a design
> >>> decision was made not to use Calcite?
> >>>
> >>> Are there limitations (for heuristic based, cost based, * aware
> optimizer)
> >>> in Calcite, and frameworks built on top of Calcite? In the context of
> big
> >>> data / TCPH benchmarks.
> >>>
> >>> I was unable to dig up anything concrete from user group / Jira.
> Appreciate
> >>> if any Catalyst veteran here can give me pointers. Trying to defend
> >>> Spark/Catalyst.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >>>
> >>> -
> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>>
> >>
> >>
> >> --
> >> Thanks,
> >> Jason
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
>


Spark Dataset transformations for time based events

2018-12-25 Thread Debajyoti Roy
Hope everyone is enjoying their holidays.

If anyone here has run into these time-based event transformation patterns or
has a strong opinion about the approach, please let me know or reply on SO (a
rough sketch of pattern 2 follows the list):

   1. Enrich using as-of-time:
   
https://stackoverflow.com/questions/53928880/how-to-do-a-time-based-as-of-join-of-two-datasets-in-apache-spark
   2. Snapshot of state with time to state with effective start and end
   time:
   
https://stackoverflow.com/questions/53928372/given-dataset-of-state-snapshots-at-time-t-how-to-transform-it-into-dataset-with/53928400#53928400
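A rough sketch (not from the original message) of pattern 2, assuming a
snapshot dataset with columns (id, state, time); a lead() window turns
per-snapshot rows into rows with effective start/end times:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("snapshots-to-intervals").getOrCreate()

snapshots = spark.createDataFrame(
    [(1, "A", "2018-12-01"), (1, "B", "2018-12-05"), (1, "B", "2018-12-09")],
    ["id", "state", "time"]).withColumn("time", F.to_timestamp("time"))

w = Window.partitionBy("id").orderBy("time")
intervals = (snapshots
             .withColumn("end_time", F.lead("time").over(w))  # next snapshot closes this interval (null = still current)
             .withColumnRenamed("time", "start_time"))
intervals.show(truncate=False)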


Thanks in advance!
Roy


Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread Debajyoti Roy
The problem statement and an approach to solve it using windows are
described here:

https://stackoverflow.com/questions/52509498/given-events-with-start-and-end-times-how-to-count-the-number-of-simultaneous-e

Looking for more elegant/performant solutions, if they exist. TIA!
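Not from the thread — a rough sketch of one common approach: turn each
start/end into a +1/-1 marker and take a running sum in time order. Column
names are assumptions, and the unpartitioned window pulls all rows to a single
partition, which is acceptable only for a sketch:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("simultaneous-events-sketch").getOrCreate()

events = spark.createDataFrame(
    [(1, "2018-09-26 10:00:00", "2018-09-26 10:30:00"),
     (2, "2018-09-26 10:10:00", "2018-09-26 10:20:00"),
     (3, "2018-09-26 10:15:00", "2018-09-26 10:45:00")],
    ["id", "start_ts", "end_ts"])

# +1 when an event starts, -1 when it ends
markers = (events.select(F.col("start_ts").alias("ts"), F.lit(1).alias("delta"))
           .union(events.select(F.col("end_ts").alias("ts"), F.lit(-1).alias("delta"))))

w = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, Window.currentRow)
markers.withColumn("simultaneous", F.sum("delta").over(w)).orderBy("ts").show(truncate=False)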


spark-submit config via file

2017-03-24 Thread , Roy
Hi,

I am trying to deploy a Spark job using spark-submit, which has a bunch of
parameters like

spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode
cluster --executor-memory 3072m --executor-cores 4 --files streaming.conf
spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf "streaming.conf"

I was looking for a way to put all these flags in a file to pass to
spark-submit, to make my spark-submit command simpler, like this:

spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode
cluster --properties-file properties.conf --files streaming.conf
spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf "streaming.conf"

properties.conf has the following contents


spark.executor.memory 3072m

spark.executor.cores 4


But I am getting the following error


17/03/24 11:36:26 INFO Client: Use hdfs cache file as spark.yarn.archive
for HDP,
hdfsCacheFile:hdfs:///hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz

17/03/24 11:36:26 WARN AzureFileSystemThreadPoolExecutor: Disabling threads
for Delete operation as thread count 0 is <= 1

17/03/24 11:36:26 INFO AzureFileSystemThreadPoolExecutor: Time taken for
Delete operation is: 1 ms with threads: 0

17/03/24 11:36:27 INFO Client: Deleted staging directory wasb://
a...@abc.blob.core.windows.net/user/sshuser/.sparkStaging/application_1488402758319_0492

Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no
host: hdfs:///hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz

at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:154)

at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2791)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)

at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2825)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2807)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

at
org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:364)

at org.apache.spark.deploy.yarn.Client.org
$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:480)

at
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:552)

at
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)

at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)

at org.apache.spark.deploy.yarn.Client.run(Client.scala:1218)

at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1277)

at org.apache.spark.deploy.yarn.Client.main(Client.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745)

at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)

at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

17/03/24 11:36:27 INFO MetricsSystemImpl: Stopping azure-file-system
metrics system...

Anyone know if this is even possible?
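One thing worth checking, stated here only as an assumption since it is not
confirmed in the thread: when --properties-file is passed, spark-submit loads
that file instead of conf/spark-defaults.conf, so cluster defaults that
normally come from spark-defaults.conf are no longer applied. A sketch of a
fuller properties.conf along those lines (the spark.yarn.archive line is a
hypothetical placeholder for whatever your cluster's defaults contain):

spark.executor.memory   3072m
spark.executor.cores    4
# plus anything your conf/spark-defaults.conf normally provides, for example:
# spark.yarn.archive    hdfs://<active-namenode>:8020/hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz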


Thanks...

Roy


spark-itemsimilarity No FileSystem for scheme error

2016-01-05 Thread roy
Hi, we are using CDH 5.4.0 with Spark 1.5.2 (which doesn't come with CDH 5.4.0).


I am following this link
https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html to
try to test/create a new algorithm with Mahout item-similarity.

I am running the following command:

./bin/mahout spark-itemsimilarity \
--input $INPUT \
--output $OUTPUT \
--filter1 o --filter2 v \
--inDelim "\t" \
 --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1 \
 --master yarn-client \
 -D:fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem \
 -D:fs.file.impl=org.apache.hadoop.fs.LocalFileSystem

I am getting the following error:
java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at org.apache.spark.deploy.yarn.Client.cleanupStagingDir(Client.scala:143)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:129)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.(SparkContext.scala:523)
at
org.apache.mahout.sparkbindings.package$.mahoutSparkContext(package.scala:91)
at
org.apache.mahout.drivers.MahoutSparkDriver.start(MahoutSparkDriver.scala:83)
at
org.apache.mahout.drivers.ItemSimilarityDriver$.start(ItemSimilarityDriver.scala:118)
at
org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:199)
at
org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:112)
at
org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:110)
at scala.Option.map(Option.scala:145)
at
org.apache.mahout.drivers.ItemSimilarityDriver$.main(ItemSimilarityDriver.scala:110)
at
org.apache.mahout.drivers.ItemSimilarityDriver.main(ItemSimilarityDriver.scala)


I found a solution — adding the following properties into
/etc/hadoop/conf/core-site.xml on the client/gateway machine:


<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem for file: uris.</description>
</property>

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>

But is there any better way to solve this error?

Thanks






Error in load hbase on spark

2015-10-08 Thread Roy Wang

I want to load an HBase table into Spark.
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
    sc.newAPIHadoopRDD(conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);

*When calling hBaseRDD.count(), I got an error.*

Caused by: java.lang.IllegalStateException: The input format instance has
not been properly initialized. Ensure you call initializeTable either in
your constructor or initialize method
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
... 11 more

*But when the job starts, I see these logs:*
2015-10-09 09:17:00[main] WARN  TableInputFormatBase:447 - initializeTable
called multiple times. Overwriting connection and table reference;
TableInputFormatBase will not close these old references when done.

Does anyone know how this happens?

Thanks! 






python version in spark-submit

2015-10-01 Thread roy
Hi,

 We have Python 2.6 (default) on the cluster and we have also installed
Python 2.7.

I was looking for a way to set the Python version in spark-submit.

Anyone know how to do this?
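Not from the thread — a sketch of one common approach; the python2.7 path is
an assumption and the exact property names are worth verifying for your Spark
version:

# PySpark picks the interpreter from the PYSPARK_PYTHON environment variable
export PYSPARK_PYTHON=/usr/bin/python2.7
spark-submit my_job.py

# On YARN the executors and application master can be pointed at the same
# interpreter, e.g. via spark.executorEnv.PYSPARK_PYTHON and
# spark.yarn.appMasterEnv.PYSPARK_PYTHON in spark-defaults.conf.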

Thanks






how to control timeout in node failure for spark task ?

2015-09-25 Thread roy
Hi,

  We are running Spark 1.3 on CDH 5.4.1 on top of YARN. We want to know how
to control the task timeout when a node fails and the tasks running on it
should be restarted on another node. At present the job waits approximately
10 minutes to restart the tasks that were running on the failed node.

Here (http://spark.apache.org/docs/latest/configuration.html) I see many
timeout configs; I just don't know which one to override.

Any help here?






pyspark driver in cluster rather than gateway/client

2015-09-10 Thread roy
Hi,

  Is there any way to make the Spark driver run inside YARN containers
rather than on the gateway/client machine?

  At present, even with the config parameters --master yarn and --deploy-mode
cluster, the driver runs on the gateway/client machine.

We are on CDH 5.4.1 with YARN and Spark 1.3

Any help on this?

Thanks






Re: The auxService:spark_shuffle does not exist

2015-07-07 Thread roy
We tried --master yarn-client with no different result.






The auxService:spark_shuffle does not exist

2015-07-06 Thread roy
I am getting the following error for a simple Spark job.

I am running the following command:

spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode
cluster --master yarn
/opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar

but the job doesn't show any progress and just shows the following on the command line:

15/07/06 22:18:41 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:42 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:43 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:44 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:45 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:46 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:47 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:48 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:49 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)
15/07/06 22:18:50 INFO Client: Application report for
application_1436234808473_0017 (state: RUNNING)

Then I had to kill this job, and looking into the logs I found the following:

Exception in thread ContainerLauncher #4 java.lang.Error:
org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The
auxService:spark_shuffle does not exist
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The
auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at
org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
at
org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:110)
at
org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
Exception in thread ContainerLauncher #5 java.lang.Error:
org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The
auxService:spark_shuffle does not exist
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The
auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at
org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
at
org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:110)
at
org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more

I don't know why this is happening. Anyone know what's wrong here? This started
happening after the Cloudera Manager upgrade from CM 5.3.1 to CM 5.4.3.

We are on CDH 5.3.1

Thanks  
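Not part of the original thread — for reference, the spark_shuffle auxiliary
service is normally registered on every NodeManager roughly as below (a
sketch; on CDH this is usually configured through Cloudera Manager, and the
Spark YARN shuffle jar must also be on the NodeManager classpath):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>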




Yarn application ID for Spark job on Yarn

2015-06-22 Thread roy
Hi,

  Is there a way to get the YARN application ID inside a Spark application,
when running a Spark job on YARN?
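Not from the thread — a sketch assuming a Spark/PySpark version that exposes
SparkContext.applicationId; when running on YARN it returns the YARN
application ID:

from pyspark import SparkContext

sc = SparkContext(appName="app-id-sketch")
print(sc.applicationId)   # on YARN this looks like application_<clusterTimestamp>_<id>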

Thanks






Spark job fails silently

2015-06-22 Thread roy
Hi,

   Our Spark job on YARN suddenly started failing silently without showing
any error; the following is the trace.


Using properties file: /usr/lib/spark/conf/spark-defaults.conf
Adding default property:
spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property:
spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/log4j.properties
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.shuffle.service.enabled=true
Adding default property:
spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native
Adding default property:
spark.yarn.historyServer.address=http://ds-hnn002.dev.abc.com:18088
Adding default property:
spark.yarn.am.extraLibraryPath=/usr/lib/hadoop/lib/native
Adding default property: spark.ui.showConsoleProgress=true
Adding default property: spark.shuffle.service.port=7337
Adding default property: spark.master=yarn-client
Adding default property:
spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native
Adding default property:
spark.eventLog.dir=hdfs://magnetic-hadoop-dev/user/spark/applicationHistory
Adding default property:
spark.yarn.jar=local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar
Parsed arguments:
  master  yarn
  deployMode  null
  executorMemory  3G
  executorCores   null
  totalExecutorCores  null
  propertiesFile  /usr/lib/spark/conf/spark-defaults.conf
  driverMemory4G
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  /usr/lib/hadoop/lib/native
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutors30
  files   null
  pyFiles null
  archivesnull
  mainClass   null
  primaryResource
file:/home/jonathanarfa/code/updb/spark/updb2vw_testing.py
  nameupdb2vw_testing.py
  childArgs   [--date 2015-05-20]
  jarsnull
  packagesnull
  repositoriesnull
  verbose true

Spark properties used, including those specified through
 --conf and those from the properties file
/usr/lib/spark/conf/spark-defaults.conf:
  spark.executor.extraLibraryPath - /usr/lib/hadoop/lib/native
  spark.yarn.jar -
local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar
  spark.driver.extraLibraryPath - /usr/lib/hadoop/lib/native
  spark.yarn.historyServer.address - http://ds-hnn002.dev.abc.com:18088
  spark.yarn.am.extraLibraryPath - /usr/lib/hadoop/lib/native
  spark.eventLog.enabled - true
  spark.ui.showConsoleProgress - true
  spark.serializer - org.apache.spark.serializer.KryoSerializer
  spark.executor.extraJavaOptions -
-Dlog4j.configuration=file:///etc/spark/log4j.properties
  spark.shuffle.service.enabled - true
  spark.shuffle.service.port - 7337
  spark.eventLog.dir -
hdfs://magnetic-hadoop-dev/user/spark/applicationHistory
  spark.master - yarn-client


Main class:
org.apache.spark.deploy.PythonRunner
Arguments:
file:/home/jonathanarfa/code/updb/spark/updb2vw_testing.py
null
--date
2015-05-20
System properties:
spark.executor.extraLibraryPath - /usr/lib/hadoop/lib/native
spark.driver.memory - 4G
spark.executor.memory - 3G
spark.yarn.jar -
local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar
spark.driver.extraLibraryPath - /usr/lib/hadoop/lib/native
spark.executor.instances - 30
spark.yarn.historyServer.address - http://ds-hnn002.dev.abc.com:18088
spark.yarn.am.extraLibraryPath - /usr/lib/hadoop/lib/native
spark.ui.showConsoleProgress - true
spark.eventLog.enabled - true
spark.yarn.dist.files -
file:/home/jonathanarfa/code/updb/spark/updb2vw_testing.py
SPARK_SUBMIT - true
spark.serializer - org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -
-Dlog4j.configuration=file:///etc/spark/log4j.properties
spark.shuffle.service.enabled - true
spark.app.name - updb2vw_testing.py
spark.shuffle.service.port - 7337
spark.eventLog.dir -
hdfs://magnetic-hadoop-dev/user/spark/applicationHistory
spark.master - yarn-client
Classpath elements:



spark.akka.frameSize=60
spark.app.name=updb2vw_2015-05-20
spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native
spark.driver.maxResultSize=2G
spark.driver.memory=4G
spark.eventLog.dir=hdfs://magnetic-hadoop-dev/user/spark/applicationHistory
spark.eventLog.enabled=true
spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/log4j.properties
spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native
spark.executor.instances=30
spark.executor.memory=3G
spark.master=yarn-client
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.manager=hash
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.task.maxFailures=6
spark.ui.showConsoleProgress=true

spark on yarn failing silently

2015-06-22 Thread roy
Hi,

  Suddenly our Spark job on YARN started failing silently without showing
any error; the following is the trace in verbose mode.





Using properties file: /usr/lib/spark/conf/spark-defaults.conf
Adding default property:
spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property:
spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/log4j.properties
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.shuffle.service.enabled=true
Adding default property:
spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native
Adding default property:
spark.yarn.historyServer.address=http://ds-hnn002.dev.abc.com:18088
Adding default property:
spark.yarn.am.extraLibraryPath=/usr/lib/hadoop/lib/native
Adding default property: spark.ui.showConsoleProgress=true
Adding default property: spark.shuffle.service.port=7337
Adding default property: spark.master=yarn-client
Adding default property:
spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native
Adding default property:
spark.eventLog.dir=hdfs://my-hadoop-dev/user/spark/applicationHistory
Adding default property:
spark.yarn.jar=local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar
Parsed arguments:
  master  yarn
  deployMode  null
  executorMemory  3G
  executorCores   null
  totalExecutorCores  null
  propertiesFile  /usr/lib/spark/conf/spark-defaults.conf
  driverMemory4G
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  /usr/lib/hadoop/lib/native
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutors30
  files   null
  pyFiles null
  archivesnull
  mainClass   null
  primaryResource file:/home/xyz/code/updb/spark/updb2vw_testing.py
  nameupdb2vw_testing.py
  childArgs   [--date 2015-05-20]
  jarsnull
  packagesnull
  repositoriesnull
  verbose true

Spark properties used, including those specified through
 --conf and those from the properties file
/usr/lib/spark/conf/spark-defaults.conf:
  spark.executor.extraLibraryPath - /usr/lib/hadoop/lib/native
  spark.yarn.jar -
local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar
  spark.driver.extraLibraryPath - /usr/lib/hadoop/lib/native
  spark.yarn.historyServer.address - http://ds-hnn002.dev.abc.com:18088
  spark.yarn.am.extraLibraryPath - /usr/lib/hadoop/lib/native
  spark.eventLog.enabled - true
  spark.ui.showConsoleProgress - true
  spark.serializer - org.apache.spark.serializer.KryoSerializer
  spark.executor.extraJavaOptions -
-Dlog4j.configuration=file:///etc/spark/log4j.properties
  spark.shuffle.service.enabled - true
  spark.shuffle.service.port - 7337
  spark.eventLog.dir - hdfs://my-hadoop-dev/user/spark/applicationHistory
  spark.master - yarn-client


Main class:
org.apache.spark.deploy.PythonRunner
Arguments:
file:/home/xyz/code/updb/spark/updb2vw_testing.py
null
--date
2015-05-20
System properties:
spark.executor.extraLibraryPath - /usr/lib/hadoop/lib/native
spark.driver.memory - 4G
spark.executor.memory - 3G
spark.yarn.jar -
local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar
spark.driver.extraLibraryPath - /usr/lib/hadoop/lib/native
spark.executor.instances - 30
spark.yarn.historyServer.address - http://ds-hnn002.dev.abc.com:18088
spark.yarn.am.extraLibraryPath - /usr/lib/hadoop/lib/native
spark.ui.showConsoleProgress - true
spark.eventLog.enabled - true
spark.yarn.dist.files - file:/home/xyz/code/updb/spark/updb2vw_testing.py
SPARK_SUBMIT - true
spark.serializer - org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -
-Dlog4j.configuration=file:///etc/spark/log4j.properties
spark.shuffle.service.enabled - true
spark.app.name - updb2vw_testing.py
spark.shuffle.service.port - 7337
spark.eventLog.dir - hdfs://my-hadoop-dev/user/spark/applicationHistory
spark.master - yarn-client
Classpath elements:



spark.akka.frameSize=60
spark.app.name=updb2vw_2015-05-20
spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native
spark.driver.maxResultSize=2G
spark.driver.memory=4G
spark.eventLog.dir=hdfs://my-hadoop-dev/user/spark/applicationHistory
spark.eventLog.enabled=true
spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/log4j.properties
spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native
spark.executor.instances=30
spark.executor.memory=3G
spark.master=yarn-client
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.manager=hash
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.task.maxFailures=6
spark.ui.showConsoleProgress=true
spark.yarn.am.extraLibraryPath=/usr/lib/hadoop/lib/native

spark java.io.FileNotFoundException: /user/spark/applicationHistory/application

2015-05-28 Thread roy
hi,

 Suddenly Spark jobs started failing with the following error:

Exception in thread main java.io.FileNotFoundException:
/user/spark/applicationHistory/application_1432824195832_1275.inprogress (No
such file or directory)

full trace here

[21:50:04 x...@hadoop-client01.dev:~]$ spark-submit --class
org.apache.spark.examples.SparkPi --master yarn
/usr/lib/spark/lib/spark-examples.jar 10
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.4.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/05/28 21:55:21 INFO SparkContext: Running Spark version 1.3.0
15/05/28 21:55:21 WARN SparkConf: In Spark 1.0 and later spark.local.dir
will be overridden by the value set by the cluster manager (via
SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
15/05/28 21:55:21 WARN SparkConf:
SPARK_JAVA_OPTS was detected (set to ' -Dspark.local.dir=/srv/tmp/xyz ').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an
application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master
or worker)

15/05/28 21:55:21 WARN SparkConf: Setting 'spark.executor.extraJavaOptions'
to ' -Dspark.local.dir=/srv/tmp/xyz ' as a work-around.
15/05/28 21:55:21 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to
' -Dspark.local.dir=/srv/tmp/xyz ' as a work-around.
15/05/28 21:55:22 INFO SecurityManager: Changing view acls to: xyz
15/05/28 21:55:22 INFO SecurityManager: Changing modify acls to: xyz
15/05/28 21:55:22 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(xyz); users
with modify permissions: Set(xyz)
15/05/28 21:55:22 INFO Slf4jLogger: Slf4jLogger started
15/05/28 21:55:22 INFO Remoting: Starting remoting
15/05/28 21:55:22 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkdri...@hadoop-client01.abc.com:51876]
15/05/28 21:55:22 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkdri...@hadoop-client01.abc.com:51876]
15/05/28 21:55:22 INFO Utils: Successfully started service 'sparkDriver' on
port 51876.
15/05/28 21:55:22 INFO SparkEnv: Registering MapOutputTracker
15/05/28 21:55:22 INFO SparkEnv: Registering BlockManagerMaster
15/05/28 21:55:22 INFO DiskBlockManager: Created local directory at
/srv/tmp/xyz/spark-1e66e6eb-7ad6-4f62-87fc-f0cfaa631e36/blockmgr-61f866b8-6475-4a11-88b2-792d2ba22662
15/05/28 21:55:22 INFO MemoryStore: MemoryStore started with capacity 265.4
MB
15/05/28 21:55:23 INFO HttpFileServer: HTTP File server directory is
/srv/tmp/xyz/spark-2b676170-3f88-44bf-87a3-600de1b7ee24/httpd-b84f76d5-26c7-4c63-9223-f6c5aa3899f0
15/05/28 21:55:23 INFO HttpServer: Starting HTTP Server
15/05/28 21:55:23 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/28 21:55:23 INFO AbstractConnector: Started
SocketConnector@0.0.0.0:41538
15/05/28 21:55:23 INFO Utils: Successfully started service 'HTTP file
server' on port 41538.
15/05/28 21:55:23 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/28 21:55:23 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/28 21:55:23 INFO AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
15/05/28 21:55:23 INFO Utils: Successfully started service 'SparkUI' on port
4040.
15/05/28 21:55:23 INFO SparkUI: Started SparkUI at
http://hadoop-client01.abc.com:4040
15/05/28 21:55:23 INFO SparkContext: Added JAR
file:/usr/lib/spark/lib/spark-examples.jar at
http://10.0.3.62:41538/jars/spark-examples.jar with timestamp 1432850123523
15/05/28 21:55:24 INFO Client: Requesting a new application from cluster
with 16 NodeManagers
15/05/28 21:55:24 INFO Client: Verifying our application has not requested
more than the maximum memory capability of the cluster (61440 MB per
container)
15/05/28 21:55:24 INFO Client: Will allocate AM container, with 896 MB
memory including 384 MB overhead
15/05/28 21:55:24 INFO Client: Setting up container launch context for our
AM
15/05/28 21:55:24 INFO Client: Preparing resources for our AM container
15/05/28 21:55:24 INFO Client: Setting up the launch environment for our AM
container
15/05/28 21:55:24 INFO SecurityManager: Changing view acls to: xyz
15/05/28 21:55:24 INFO SecurityManager: Changing modify acls to: xyz
15/05/28 21:55:24 INFO SecurityManager: SecurityManager: 

Re: Spark HistoryServer not coming up

2015-05-21 Thread roy
This got resolved after cleaning /user/spark/applicationHistory/*






Spark HistoryServer not coming up

2015-05-21 Thread roy
Hi,

   After restarting the Spark HistoryServer, it failed to come up. I checked
the Spark HistoryServer logs and found the following messages:

 2015-05-21 11:38:03,790 WARN org.apache.spark.scheduler.ReplayListenerBus:
Log path provided contains no log files.
2015-05-21 11:38:52,319 INFO org.apache.spark.deploy.history.HistoryServer:
Registered signal handlers for [TERM, HUP, INT]
2015-05-21 11:38:52,328 WARN
org.apache.spark.deploy.history.HistoryServerArguments: Setting log
directory through the command line is deprecated as of Spark 1.1.0. Please
set this through spark.history.fs.logDirectory instead.
2015-05-21 11:38:52,461 INFO org.apache.spark.SecurityManager: Changing view
acls to: spark
2015-05-21 11:38:52,462 INFO org.apache.spark.SecurityManager: Changing
modify acls to: spark
2015-05-21 11:38:52,463 INFO org.apache.spark.SecurityManager:
SecurityManager: authentication disabled; ui acls disabled; users with view
permissions: Set(spark); users with modify permissions: Set(spark)
2015-05-21 11:41:24,893 ERROR org.apache.spark.deploy.history.HistoryServer:
RECEIVED SIGNAL 15: SIGTERM
2015-05-21 11:41:33,439 INFO org.apache.spark.deploy.history.HistoryServer:
Registered signal handlers for [TERM, HUP, INT]
2015-05-21 11:41:33,447 WARN
org.apache.spark.deploy.history.HistoryServerArguments: Setting log
directory through the command line is deprecated as of Spark 1.1.0. Please
set this through spark.history.fs.logDirectory instead.
2015-05-21 11:41:33,578 INFO org.apache.spark.SecurityManager: Changing view
acls to: spark
2015-05-21 11:41:33,579 INFO org.apache.spark.SecurityManager: Changing
modify acls to: spark
2015-05-21 11:41:33,579 INFO org.apache.spark.SecurityManager:
SecurityManager: authentication disabled; ui acls disabled; users with view
permissions: Set(spark); users with modify permissions: Set(spark)
2015-05-21 11:44:07,147 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O
error constructing remote block reader.
java.io.EOFException: Premature EOF: no length prefix available
at
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2109)
at
org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:408)
at
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:785)
at
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:663)
at
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:327)
at
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:574)
at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:797)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844)
at java.io.DataInputStream.read(DataInputStream.java:149)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
at
org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
at
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$5.apply(FsHistoryProvider.scala:175)
at
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$5.apply(FsHistoryProvider.scala:172)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at

How to process data in chronological order

2015-05-20 Thread roy
I have a key-value RDD where the key is a timestamp (femtosecond resolution, so
grouping buys me nothing) and I want to reduce it in chronological order.

How do I do that in Spark?

I am fine with reducing contiguous sections of the set separately and then
aggregating the resulting objects locally.
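A rough sketch (not from the original message) of that idea in PySpark,
assuming the values can be combined associatively within each time-ordered
slice:

from pyspark import SparkContext

sc = SparkContext(appName="chronological-reduce-sketch")

rdd = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])   # (timestamp, value)

def reduce_slice(records):
    # reduce one contiguous, time-ordered slice of the data to a partial result
    acc = None
    for _, value in records:
        acc = value if acc is None else acc + value
    if acc is not None:
        yield acc

partials = (rdd.sortByKey()                  # range-partitions and sorts by timestamp
               .mapPartitions(reduce_slice)  # reduce each contiguous slice separately
               .collect())                   # partial results return in chronological partition order
result = "".join(partials)                   # aggregate the partial results locally ("abc" here)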

Thanks






Possible to disable Spark HTTP server ?

2015-05-05 Thread roy
Hi,

  When we start a Spark job, it starts a new HTTP server for each new job.
Is it possible to disable the HTTP server for each job?


Thanks






spark.logConf with log4j.rootCategory=WARN

2015-05-01 Thread roy
Hi, 

  I have recently enabled log4j.rootCategory=WARN, console in the Spark
configuration, but after that spark.logConf=True has become ineffective.

  So I just want to confirm whether this is because of log4j.rootCategory=WARN?

Thanks






shuffle.FetchFailedException in spark on YARN job

2015-04-18 Thread roy
Hi,

 My Spark job is failing with the following error message:

org.apache.spark.shuffle.FetchFailedException:
/mnt/ephemeral12/yarn/nm/usercache/abc/appcache/application_1429353954024_1691/spark-local-20150418132335-0723/28/shuffle_3_1_0.index
(No such file or directory)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:89)
at
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException:
/mnt/ephemeral12/yarn/nm/usercache/abc/appcache/application_1429353954024_1691/spark-local-20150418132335-0723/28/shuffle_3_1_0.index
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.init(FileInputStream.java:146)
at
org.apache.spark.shuffle.IndexShuffleBlockManager.getBlockData(IndexShuffleBlockManager.scala:109)
at
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:305)
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:235)
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:268)
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.init(ShuffleBlockFetcherIterator.scala:115)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:76)
at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
... 7 more

)


My job conf was:
--num-executors 30 --driver-memory 3G --executor-memory 5G --conf
spark.yarn.executor.memoryOverhead=1000 --conf
spark.yarn.driver.memoryOverhead=1000 --conf spark.akka.frameSize=1
--executor-cores 3

Any idea why it's failing on shuffle fetch?

thanks






spark job progress-style report on console ?

2015-04-09 Thread roy
Hi,

  How do I get a Spark job progress-style report on the console?

I tried to set --conf spark.ui.showConsoleProgress=true but it 

thanks






Spark SQL Avro Library for 1.2

2015-04-08 Thread roy
How do I build the Spark SQL Avro library for Spark 1.2?

I was following this https://github.com/databricks/spark-avro and was able
to build spark-avro_2.10-1.0.0.jar by simply running sbt/sbt package from
the project root.

But we are on Spark 1.2 and need a compatible spark-avro jar.

Any idea how do I do it ?

Thanks






Spark 1.3 on CDH 5.3.1 YARN

2015-04-08 Thread roy
Hi,

  We have a cluster running CDH 5.3.2 and Spark 1.2 (which is the current
version in CDH 5.3.2), but we want to try Spark 1.3 without breaking the
existing setup. Is it possible to have Spark 1.3 on the existing setup?

Thanks






Re: can't union two rdds

2015-03-31 Thread roy
use zip






Re: Spark History Server : jobs link doesn't open

2015-03-26 Thread , Roy
In the log I found this:

2015-03-26 19:42:09,531 WARN org.eclipse.jetty.servlet.ServletHandler:
Error for /history/application_1425934191900_87572
org.spark-project.guava.common.util.concurrent.ExecutionError:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at 
org.spark-project.guava.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at 
org.spark-project.guava.common.cache.LocalCache.get(LocalCache.java:4000)


thanks

On Thu, Mar 26, 2015 at 7:27 PM, , Roy rp...@njit.edu wrote:

 We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2

 Jobs link on spark History server  doesn't open and shows following
 message :

 HTTP ERROR: 500

 Problem accessing /history/application_1425934191900_87572. Reason:

 Server Error

 --
 *Powered by Jetty://*




Spark History Server : jobs link doesn't open

2015-03-26 Thread , Roy
We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2

The Jobs link on the Spark History Server doesn't open and shows the following
message:

HTTP ERROR: 500

Problem accessing /history/application_1425934191900_87572. Reason:

Server Error

--
*Powered by Jetty://*


Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread , Roy
Yes, I do have other applications already running.

Thanks for your explanation.



On Wed, Mar 25, 2015 at 2:49 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 It means you are already having 4 applications running on 4040, 4041,
 4042, 4043. And that's why it was able to run on 4044.

 You can do a *netstat -pnat | grep 404* *And see what all processes are
 running.

 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 1:13 AM, , Roy rp...@njit.edu wrote:

 I get following message for each time I run spark job


1. 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
already in use


 full trace is here

 http://pastebin.com/xSvRN01f

 how do I fix this ?

 I am on CDH 5.3.1

 thanks
 roy






FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-24 Thread , Roy
I get the following message each time I run a Spark job:


   1. 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED
   SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
   already in use


full trace is here

http://pastebin.com/xSvRN01f

How do I fix this?

I am on CDH 5.3.1

thanks
roy


Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread , Roy
Hi,


  I am using the CDH 5.3.2 packages installation through Cloudera Manager 5.3.2.

I am trying to run one Spark job with the following command:

PYTHONPATH=~/code/utils/ spark-submit --master yarn --executor-memory 3G
--num-executors 30 --driver-memory 2G --executor-cores 2 --name=analytics
/home/abc/code/updb/spark/UPDB3analytics.py -date 2015-03-01

but I am getting the following error:

15/03/23 11:06:49 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 7,
hdp003.dev.xyz.com): java.lang.NoClassDefFoundError:
org/apache/hadoop/mapred/InputSplit
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2532)
at java.lang.Class.getDeclaredConstructors(Class.java:1901)
at java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1749)
at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:250)
at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:248)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:247)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:613)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.mapred.InputSplit
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 25 more

here is the full trace

https://gist.github.com/anonymous/3492f0ec63d7a23c47cf


What do you think about the level of resource manager and file system?

2015-02-11 Thread Fangqi (Roy)
[Two inline architecture diagram images; attachments not preserved in the archive]

Hi guys~

Comparing these two architectures, why does BDAS put YARN and Mesos under HDFS?
Is there any special consideration, or is it just an easy way to depict the
AMPLab stack?

Best regards!


unsubscribe

2014-05-05 Thread Shubhabrata Roy

unsubscribe