reduceByKey to get all associated values

2014-08-07 Thread Konstantin Kudryavtsev
Hi there,

I'm interested in whether it is possible to get the same behavior as the
reduce function from the MR framework, i.e. for each key K to get the list of
associated values List<V>.

The reduceByKey function works only on individual values V, not on the whole
list. Is there any way to get the list? I have to sort it in a particular way
and apply some business logic.
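
To illustrate, a minimal sketch of the behaviour I'm after (assuming sc is an
existing SparkContext and a simple String/Int pair RDD; groupByKey may be one
option, but I'm not sure it is the idiomatic one):

import org.apache.spark.SparkContext._   // Spark 1.x implicit conversion to PairRDDFunctions

val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2)))

// groupByKey gathers all values of a key into one Iterable, which can then
// be sorted and fed through custom business logic
val grouped = pairs.groupByKey().mapValues(vs => vs.toList.sorted)
// grouped contains ("a", List(1, 3)) and ("b", List(2))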

Thank you in advance,
Konstantin Kudryavtsev


Re: Ports required for running spark

2014-07-31 Thread Konstantin Kudryavtsev
Hi Haiyang,

you are right, YARN takes over the resource management, but I constantly
get a ConnectionRefused exception on the mentioned port. So I suppose some
Spark-internal communication goes over this port... but I don't know what
exactly, or how I can change it...
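
For reference, the only ports I have found so far that can be pinned down from
the application side are the driver port and the UI port; a hedged sketch
(property names as I understand them for this version, corrections welcome):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("port-test")
  .set("spark.driver.port", "33007")   // pin the driver's Akka port to a port known to be open
  .set("spark.ui.port", "4040")        // web UI port
val sc = new SparkContext(conf)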

Thank you,
Konstantin Kudryavtsev


On Thu, Jul 31, 2014 at 2:53 PM, Haiyang Fu haiyangfu...@gmail.com wrote:

 Hi Konstantin,
 Would you please post some more details? Error info or an exception from the
 log, and in what situation? When you run a Spark job in yarn-cluster mode, YARN
 will take over all the resource management.


 On Thu, Jul 31, 2014 at 6:17 PM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 Hi Larry,

 I'm afraid this is for standalone mode; I'm interested in YARN

 Also, I don't see the troublesome port 33007 there, which I believe is related to Akka

 Thank you,
 Konstantin Kudryavtsev


 On Thu, Jul 31, 2014 at 1:11 PM, Larry Xiao xia...@sjtu.edu.cn wrote:

  Hi Konstantin,

 I think you can find it at
 https://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security
 and you can specify port for master or worker at conf/spark-env.sh

 Larry


 On 7/31/14, 6:04 PM, Konstantin Kudryavtsev wrote:

 Hi there,

 I'm trying to run Spark on a YARN cluster and am facing an issue where some
 ports are closed, particularly port 33007 (I suppose it's used by Akka)

  Could you please provide me with a list of all ports required for
 Spark?
 Also, is it possible to set up these ports?

 Thank you in advance,
 Konstantin Kudryavtsev







Spark scheduling with Capacity scheduler

2014-07-17 Thread Konstantin Kudryavtsev
Hi all,

I'm using HDP 2.0 with YARN. I'm running both MapReduce and Spark jobs on this
cluster; is it possible to use the Capacity Scheduler for managing Spark jobs as
well as MR jobs? I mean, I'm able to send an MR job to a specific queue; can I do
the same with a Spark job?
Thank you in advance
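
For illustration, a sketch of what I would like to be able to do (assuming a
queue named "analytics" is defined in capacity-scheduler.xml; I believe
spark-submit also has a --queue flag, but I have not verified it on this version):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-spark-job")
  .set("spark.yarn.queue", "analytics")   // assumption: routes the YARN application into this queue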

Thank you,
Konstantin Kudryavtsev


Filtering data during the read

2014-07-09 Thread Konstantin Kudryavtsev
Hi all,

I wondered if you could help me to clarify the following situation:
in the classic example

val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR"))

As I understand it, the data is first read into memory, and after that the
filtering is applied. Is there any way to apply the filtering during the read
step, and not put all the objects into memory?
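
For context, this is roughly what I would hope happens (a sketch of my current
understanding, which may well be wrong):

// nothing should be read until an action runs; records should stream through the
// filter one partition at a time rather than being materialized up front
val file   = spark.textFile("hdfs://...")                 // lazy: no data read yet
val errors = file.filter(line => line.contains("ERROR"))  // still lazy
val n      = errors.count()                               // only now is the file scanned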

Thank you,
Konstantin Kudryavtsev


java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Konstantin Kudryavtsev
Hi all,

I ran into the following exception during a map step:
java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit
exceeded)
java.lang.reflect.Array.newInstance(Array.java:70)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:325)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
I'm using Spark 1.0. In the map I create a new object each time; as I understand
it, I can't reuse objects the way I would in MapReduce development? I wonder if
you could point me to how it is possible to avoid the GC overhead... thank you in
advance
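
For reference, a hedged sketch of the settings I have been looking at so far
(property names as I understand them for Spark 1.0, corrections welcome):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")                                     // more heap per executor
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // already in use, judging by the trace
  .set("spark.kryoserializer.buffer.mb", "64")                            // larger Kryo buffer for big records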

Thank you,
Konstantin Kudryavtsev


how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Konstantin Kudryavtsev
Hi all,

sorry for the silly question, but how can I get a PairRDDFunctions RDD? I'm doing
it in order to perform a leftOuterJoin afterwards.

Currently I do it this way (it seems incorrect):
val parRDD = new PairRDDFunctions( oldRdd.map(i => (i.key, i)) )

I guess this constructor is definitely wrong...
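
For the record, a sketch of what I am trying instead (if I read the Spark 1.0
docs right, an implicit conversion should make the pair functions available
without any explicit constructor; otherRdd below is just a placeholder):

import org.apache.spark.SparkContext._   // brings the RDD-to-PairRDDFunctions implicit into scope

val pairRdd = oldRdd.map(i => (i.key, i))       // RDD[(K, V)]
val joined  = pairRdd.leftOuterJoin(otherRdd)   // pair functions now available via the implicit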


Thank you,
Konstantin Kudryavtsev


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
guys, I'm not talking about running Spark on a VM, I don't have a problem with that.

What confuses me is the following:
1) Hortonworks describes the installation process as RPMs on each node
2) the Spark home page says that everything I need is YARN

And I'm stuck on understanding what I need to do to run Spark on YARN
(do I need the RPM installations, or only to build Spark on the edge node?)


Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 4:34 AM, Robert James srobertja...@gmail.com wrote:

 I can say from my experience that getting Spark to work with Hadoop 2
 is not for the beginner; after solving one problem after another
 (dependencies, scripts, etc.), I went back to Hadoop 1.

 Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
 why, but, given so, Hadoop 2 has too many bumps

 On 7/6/14, Marco Shaw marco.s...@gmail.com wrote:
  That is confusing based on the context you provided.
 
  This might take more time than I can spare to try to understand.
 
  For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
 
  Cloudera's CDH 5 express VM includes Spark, but the service isn't
 running by
  default.
 
  I can't remember for MapR...
 
  Marco
 
  On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
  kudryavtsev.konstan...@gmail.com wrote:
 
  Marco,
 
  Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
  can try
  from
 
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
   HDP 2.1 means YARN, at the same time they propose ti install rpm
 
  On other hand, http://spark.apache.org/ said 
  Integrated with Hadoop
  Spark can run on Hadoop 2's YARN cluster manager, and can read any
  existing Hadoop data.
 
  If you have a Hadoop 2 cluster, you can run Spark without any
 installation
  needed. 
 
  And this is confusing for me... do I need rpm installation on not?...
 
 
  Thank you,
  Konstantin Kudryavtsev
 
 
  On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com
  wrote:
  Can you provide links to the sections that are confusing?
 
  My understanding, the HDP1 binaries do not need YARN, while the HDP2
  binaries do.
 
  Now, you can also install Hortonworks Spark RPM...
 
  For production, in my opinion, RPMs are better for manageability.
 
  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
  kudryavtsev.konstan...@gmail.com wrote:
 
  Hello, thanks for your message... I'm confused, Hortonworhs suggest
  install spark rpm on each node, but on Spark main page said that yarn
  enough and I don't need to install it... What the difference?
 
  sent from my HTC
 
  On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:
  Konstantin,
 
  HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can
  try
  from
 
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 
  Let me know if you see issues with the tech preview.
 
  spark PI example on HDP 2.0
 
  I downloaded spark 1.0 pre-build from
  http://spark.apache.org/downloads.html
  (for HDP2)
  The run example from spark web-site:
  ./bin/spark-submit --class org.apache.spark.examples.SparkPi
  --master
  yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory
 2g
  --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
 
  I got error:
  Application application_1404470405736_0044 failed 3 times due to AM
  Container for appattempt_1404470405736_0044_03 exited with
  exitCode: 1
  due to: Exception from container-launch:
  org.apache.hadoop.util.Shell$ExitCodeException:
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
  at org.apache.hadoop.util.Shell.run(Shell.java:379)
  at
 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
  at
 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
  at
 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
  at
 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
  .Failing this attempt.. Failing the application.
 
  Unknown/unsupported param List(--executor-memory, 2048,
  --executor-cores, 1,
  --num-executors, 3)
  Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
  Options:
--jar JAR_PATH   Path to your application's JAR file (required)
--class CLASS_NAME   Name of your application's main class
  (required)
  ...bla-bla-bla
  
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
  Sent from the Apache Spark User List mailing list archive

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
thank you Krishna!

Could you please explain why I need to install Spark on each node, if the Spark
official site says: "If you have a Hadoop 2 cluster, you can run Spark
without any installation needed"?

I have HDP 2 (YARN), and that's why I hope I don't need to install Spark on
each node

Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar ksanka...@gmail.com wrote:

 Konstantin,

    1. You need to install the hadoop rpms on all nodes. If it is Hadoop
    2, the nodes would have hdfs & YARN.
    2. Then you need to install Spark on all nodes. I haven't had
    experience with HDP, but the tech preview might have installed Spark as
    well.
    3. In the end, one should have hdfs, yarn & spark installed on all the
    nodes.
    4. After installations, check the web console to make sure hdfs, yarn
    & spark are running.
    5. Then you are ready to start experimenting/developing spark
    applications.

 HTH.
 Cheers
 k/


 On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 guys, I'm not talking about running spark on VM, I don have problem with
 it.

 I confused in the next:
 1) Hortonworks describe installation process as RPMs on each node
 2) spark home page said that everything I need is YARN

 And I'm in stucj with understanding what I need to do to run spark on
 yarn (do I need RPMs installations or only build spark on edge node?)


 Thank you,
 Konstantin Kudryavtsev


 On Mon, Jul 7, 2014 at 4:34 AM, Robert James srobertja...@gmail.com
 wrote:

 I can say from my experience that getting Spark to work with Hadoop 2
 is not for the beginner; after solving one problem after another
 (dependencies, scripts, etc.), I went back to Hadoop 1.

 Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
 why, but, given so, Hadoop 2 has too many bumps

 On 7/6/14, Marco Shaw marco.s...@gmail.com wrote:
  That is confusing based on the context you provided.
 
  This might take more time than I can spare to try to understand.
 
  For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
 
  Cloudera's CDH 5 express VM includes Spark, but the service isn't
 running by
  default.
 
  I can't remember for MapR...
 
  Marco
 
  On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
  kudryavtsev.konstan...@gmail.com wrote:
 
  Marco,
 
  Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that
 you
  can try
  from
 
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
   HDP 2.1 means YARN, at the same time they propose ti install rpm
 
  On other hand, http://spark.apache.org/ said 
  Integrated with Hadoop
  Spark can run on Hadoop 2's YARN cluster manager, and can read any
  existing Hadoop data.
 
  If you have a Hadoop 2 cluster, you can run Spark without any
 installation
  needed. 
 
  And this is confusing for me... do I need rpm installation on not?...
 
 
  Thank you,
  Konstantin Kudryavtsev
 
 
  On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com
  wrote:
  Can you provide links to the sections that are confusing?
 
  My understanding, the HDP1 binaries do not need YARN, while the HDP2
  binaries do.
 
  Now, you can also install Hortonworks Spark RPM...
 
  For production, in my opinion, RPMs are better for manageability.
 
  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
  kudryavtsev.konstan...@gmail.com wrote:
 
  Hello, thanks for your message... I'm confused, Hortonworhs suggest
  install spark rpm on each node, but on Spark main page said that
 yarn
  enough and I don't need to install it... What the difference?
 
  sent from my HTC
 
  On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:
  Konstantin,
 
  HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
 can
  try
  from
 
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 
  Let me know if you see issues with the tech preview.
 
  spark PI example on HDP 2.0
 
  I downloaded spark 1.0 pre-build from
  http://spark.apache.org/downloads.html
  (for HDP2)
  The run example from spark web-site:
  ./bin/spark-submit --class org.apache.spark.examples.SparkPi
  --master
  yarn-cluster --num-executors 3 --driver-memory 2g
 --executor-memory 2g
  --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
 
  I got error:
  Application application_1404470405736_0044 failed 3 times due to AM
  Container for appattempt_1404470405736_0044_03 exited with
  exitCode: 1
  due to: Exception from container-launch:
  org.apache.hadoop.util.Shell$ExitCodeException:
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
  at org.apache.hadoop.util.Shell.run(Shell.java:379)
  at
 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
  at
 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Hi Chester,

Thank you very much, it is clear now: just two different ways to support
Spark on a cluster

Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 3:22 PM, Chester @work ches...@alpinenow.com wrote:

 In YARN cluster mode, you can either have Spark on all the cluster nodes
 or supply the Spark jar yourself. In the 2nd case, you don't need to install
 Spark on the cluster at all, as you supply the Spark assembly as well as your app
 jar together.

 I hope this make it clear

 Chester

 Sent from my iPhone

 On Jul 7, 2014, at 5:05 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 thank you Krishna!

  Could you please explain why do I need install spark on each node if
 Spark official site said: If you have a Hadoop 2 cluster, you can run
 Spark without any installation needed

 I have HDP 2 (YARN) and that's why I hope I don't need to install spark on
 each node

 Thank you,
 Konstantin Kudryavtsev


 On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar ksanka...@gmail.com
 wrote:

 Konstantin,

1. You need to install the hadoop rpms on all nodes. If it is Hadoop
2, the nodes would have hdfs  YARN.
2. Then you need to install Spark on all nodes. I haven't had
experience with HDP, but the tech preview might have installed Spark as
well.
3. In the end, one should have hdfs,yarn  spark installed on all the
nodes.
4. After installations, check the web console to make sure hdfs, yarn
 spark are running.
5. Then you are ready to start experimenting/developing spark
applications.

 HTH.
 Cheers
 k/


 On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 guys, I'm not talking about running spark on VM, I don have problem with
 it.

 I confused in the next:
 1) Hortonworks describe installation process as RPMs on each node
 2) spark home page said that everything I need is YARN

 And I'm in stucj with understanding what I need to do to run spark on
 yarn (do I need RPMs installations or only build spark on edge node?)


 Thank you,
 Konstantin Kudryavtsev


 On Mon, Jul 7, 2014 at 4:34 AM, Robert James srobertja...@gmail.com
 wrote:

 I can say from my experience that getting Spark to work with Hadoop 2
 is not for the beginner; after solving one problem after another
 (dependencies, scripts, etc.), I went back to Hadoop 1.

 Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
 why, but, given so, Hadoop 2 has too many bumps

 On 7/6/14, Marco Shaw marco.s...@gmail.com wrote:
  That is confusing based on the context you provided.
 
  This might take more time than I can spare to try to understand.
 
  For sure, you need to add Spark to run it in/on the HDP 2.1 express
 VM.
 
  Cloudera's CDH 5 express VM includes Spark, but the service isn't
 running by
  default.
 
  I can't remember for MapR...
 
  Marco
 
  On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
  kudryavtsev.konstan...@gmail.com wrote:
 
  Marco,
 
  Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that
 you
  can try
  from
 
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
   HDP 2.1 means YARN, at the same time they propose ti install rpm
 
  On other hand, http://spark.apache.org/ said 
  Integrated with Hadoop
  Spark can run on Hadoop 2's YARN cluster manager, and can read any
  existing Hadoop data.
 
  If you have a Hadoop 2 cluster, you can run Spark without any
 installation
  needed. 
 
  And this is confusing for me... do I need rpm installation on not?...
 
 
  Thank you,
  Konstantin Kudryavtsev
 
 
  On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com
  wrote:
  Can you provide links to the sections that are confusing?
 
  My understanding, the HDP1 binaries do not need YARN, while the HDP2
  binaries do.
 
  Now, you can also install Hortonworks Spark RPM...
 
  For production, in my opinion, RPMs are better for manageability.
 
  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
  kudryavtsev.konstan...@gmail.com wrote:
 
  Hello, thanks for your message... I'm confused, Hortonworhs suggest
  install spark rpm on each node, but on Spark main page said that
 yarn
  enough and I don't need to install it... What the difference?
 
  sent from my HTC
 
  On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:
  Konstantin,
 
  HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
 can
  try
  from
 
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 
  Let me know if you see issues with the tech preview.
 
  spark PI example on HDP 2.0
 
  I downloaded spark 1.0 pre-build from
  http://spark.apache.org/downloads.html
  (for HDP2)
  The run example from spark web-site:
  ./bin/spark-submit --class org.apache.spark.examples.SparkPi
  --master
  yarn-cluster --num-executors 3 --driver-memory 2g
 --executor-memory 2g
  --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
 
  I got error:
  Application

Control number of tasks per stage

2014-07-07 Thread Konstantin Kudryavtsev
Hi all,

is there any way to control the number of tasks per stage?

Currently I see a situation where only 2 tasks are created per stage and each
of them is very slow, while at the same time the cluster has a huge number of
unused nodes
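
For context, a sketch of the direction I have been looking at (assuming the data
comes from a text file; the 80 below is an arbitrary target):

val data = sc.textFile("hdfs://...", 80)   // hint: at least 80 input partitions
// or, for an existing RDD:
val spread = data.repartition(80)          // shuffle into 80 partitions, i.e. 80 tasks per stage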


Thank you,
Konstantin Kudryavtsev


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Konstantin Kudryavtsev
Hello, thanks for your message... I'm confused: Hortonworks suggests installing the
Spark RPM on each node, but the Spark main page says that YARN is enough and I
don't need to install anything... What is the difference?

sent from my HTC
On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:

 Konstantin,

 HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try
 from
 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf

 Let me know if you see issues with the tech preview.

 spark PI example on HDP 2.0

 I downloaded spark 1.0 pre-build from
 http://spark.apache.org/downloads.html
 (for HDP2)
 The run example from spark web-site:
 ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
 --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

 I got error:
 Application application_1404470405736_0044 failed 3 times due to AM
 Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
 due to: Exception from container-launch:
 org.apache.hadoop.util.Shell$ExitCodeException:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
 at org.apache.hadoop.util.Shell.run(Shell.java:379)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
 at

 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
 at

 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
 at

 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 .Failing this attempt.. Failing the application.

 Unknown/unsupported param List(--executor-memory, 2048, --executor-cores,
 1,
 --num-executors, 3)
 Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
 Options:
   --jar JAR_PATH   Path to your application's JAR file (required)
   --class CLASS_NAME   Name of your application's main class (required)
 ...bla-bla-bla
 



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Konstantin Kudryavtsev
Marco,

Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
can try from
http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
HDP 2.1 means YARN, yet at the same time they propose to install an RPM.

On the other hand, http://spark.apache.org/ says:
"Integrated with Hadoop

Spark can run on Hadoop 2's YARN cluster manager, and can read any existing
Hadoop data.
If you have a Hadoop 2 cluster, you can run Spark without any installation
needed."

And this is confusing for me... do I need the RPM installation or not?


Thank you,
Konstantin Kudryavtsev


On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote:

 Can you provide links to the sections that are confusing?

 My understanding, the HDP1 binaries do not need YARN, while the HDP2
 binaries do.

 Now, you can also install Hortonworks Spark RPM...

 For production, in my opinion, RPMs are better for manageability.

 On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 Hello, thanks for your message... I'm confused, Hortonworhs suggest
 install spark rpm on each node, but on Spark main page said that yarn
 enough and I don't need to install it... What the difference?

 sent from my HTC
 On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote:

 Konstantin,

 HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try
 from

 http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf

 Let me know if you see issues with the tech preview.

 spark PI example on HDP 2.0

 I downloaded spark 1.0 pre-build from
 http://spark.apache.org/downloads.html
 (for HDP2)
 The run example from spark web-site:
 ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
 --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

 I got error:
 Application application_1404470405736_0044 failed 3 times due to AM
 Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
 due to: Exception from container-launch:
 org.apache.hadoop.util.Shell$ExitCodeException:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
 at org.apache.hadoop.util.Shell.run(Shell.java:379)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
 at

 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
 at

 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
 at

 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 .Failing this attempt.. Failing the application.

 Unknown/unsupported param List(--executor-memory, 2048, --executor-cores,
 1,
 --num-executors, 3)
 Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
 Options:
   --jar JAR_PATH   Path to your application's JAR file (required)
   --class CLASS_NAME   Name of your application's main class (required)
 ...bla-bla-bla
 



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark 1.0 failed on HDP 2.0 with absurd exception

2014-07-05 Thread Konstantin Kudryavtsev
Hi all,

I have a cluster with HDP 2.0. I built Spark 1.0 on an edge node and am trying
to run it with the command
./bin/spark-submit --class test.etl.RunETL --master yarn-cluster
--num-executors 14 --driver-memory 3200m --executor-memory 3g
--executor-cores 2 my-etl-1.0-SNAPSHOT-hadoop2.2.0.jar

As a result I got a failed YARN application with the following stack trace:

Application application_1404481778533_0068 failed 3 times due to AM
Container for appattempt_1404481778533_0068_03 exited with exitCode: 1
due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
 .Failing this attempt.. Failing the application

Log Type: stderr

Log Length: 686

Unknown/unsupported param List(--executor-memory, 3072,
--executor-cores, 2, --num-executors, 14)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
  --jar JAR_PATH   Path to your application's JAR file (required)
  --class CLASS_NAME   Name of your application's main class (required)
  --args ARGS  Arguments to be passed to your application's main class.
   Mutliple invocations are possible, each will be
passed in order.
  --num-workers NUMNumber of workers to start (Default: 2)
  --worker-cores NUM   Number of cores for the workers (Default: 1)
  --worker-memory MEM  Memory per Worker (e.g. 1000M, 2G) (Default: 1G)


Seems like the old Spark notation... any ideas?

Thank you,
Konstantin Kudryavtsev


[no subject]

2014-07-05 Thread Konstantin Kudryavtsev
I ran into very strange behavior of a job I was running on a YARN Hadoop
cluster. One of the stages (a map function) was split into 80 tasks; 10 of them
finished successfully in ~2 min, but all the other tasks have been running for
more than 40 min and still have not finished... I suspect they hung.
Any ideas what's going on and how it can be fixed?
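
For reference, a hedged sketch of one thing I am considering (assuming these are
stragglers rather than a real hang; property names as I understand them):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // re-launch suspiciously slow tasks speculatively
  .set("spark.speculation.multiplier", "2")    // treat a task as slow at 2x the median runtime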


Thank you,
Konstantin Kudryavtsev


Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-04 Thread Konstantin Kudryavtsev
Hi all,

I'm stuck on an issue with running the Spark Pi example on HDP 2.0.

I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html
(for HDP2) and ran the example from the Spark website:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
--executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got error:
Application application_1404470405736_0044 failed 3 times due to AM
Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application.

Unknown/unsupported param List(--executor-memory, 2048,
--executor-cores, 1, --num-executors, 3)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
  --jar JAR_PATH   Path to your application's JAR file (required)
  --class CLASS_NAME   Name of your application's main class (required)

...bla-bla-bla


any ideas? how can I make it work?

Thank you,
Konstantin Kudryavtsev


Re: Run spark unit test on Windows 7

2014-07-03 Thread Konstantin Kudryavtsev
It sounds really strange...

I guess it is a bug, a critical bug, and it must be fixed... at least some flag
should be added (unable.hadoop)

I found the following workaround:
1) download the compiled winutils.exe from
http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
2) put this file into d:\winutil\bin
3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\")

after that the test runs
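
Putting the workaround together in one place, a sketch of what the test setup now
looks like (the path is of course machine-specific):

import org.apache.spark.{SparkConf, SparkContext}

// must be set before the first SparkContext / Hadoop class is touched
System.setProperty("hadoop.home.dir", "d:\\winutil\\")   // folder containing bin\winutils.exe

val sc = new SparkContext("local", "test", new SparkConf())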

Thank you,
Konstantin Kudryavtsev


On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote:

 You don't actually need it per se - its just that some of the Spark
 libraries are referencing Hadoop libraries even if they ultimately do not
 call them. When I was doing some early builds of Spark on Windows, I
 admittedly had Hadoop on Windows running as well and had not run into this
 particular issue.



 On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 No, I don't

 why do I need to have HDP installed? I don't use Hadoop at all and I'd
 like to read data from local filesystem

 On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote:

 By any chance do you have HDP 2.1 installed? you may need to install the
 utils and update the env variables per
 http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows


 On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:

 Hi Andrew,

 it's windows 7 and I doesn't set up any env variables here

 The full stack trace:

 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in
 the hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in
 the Hadoop binaries.
 at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
 at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
  at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
 at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
  at org.apache.hadoop.security.Groups.init(Groups.java:77)
 at
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
  at
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
 at
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
  at
 org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
 at
 org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
  at
 org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
 at org.apache.spark.SparkContext.init(SparkContext.scala:228)
  at org.apache.spark.SparkContext.init(SparkContext.scala:97)
 at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
  at junit.framework.TestCase.runTest(TestCase.java:168)
 at junit.framework.TestCase.runBare(TestCase.java:134)
  at junit.framework.TestResult$1.protect(TestResult.java:110)
 at junit.framework.TestResult.runProtected(TestResult.java:128)
  at junit.framework.TestResult.run(TestResult.java:113)
 at junit.framework.TestCase.run(TestCase.java:124)
  at junit.framework.TestSuite.runTest(TestSuite.java:232)
 at junit.framework.TestSuite.run(TestSuite.java:227)
  at
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
 at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
  at
 com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
 at
 com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
  at
 com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)


 Thank you,
 Konstantin Kudryavtsev


 On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or and...@databricks.com wrote:

 Hi Konstatin,

 We use hadoop as a library in a few places in Spark. I wonder why the
 path includes null though.

 Could you provide the full stack trace?

 Andrew


 2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com:

 Hi all,

 I'm trying to run some transformation on *Spark*, it works

Run spark unit test on Windows 7

2014-07-02 Thread Konstantin Kudryavtsev
Hi all,

I'm trying to run some transformations on *Spark*; it works fine on the cluster
(YARN, Linux machines). However, when I try to run it on a local machine
(*Windows 7*) under a unit test, I get the following errors:

java.io.IOException: Could not locate executable null\bin\winutils.exe
in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)


My code is following:

@Test
def testETL() = {
  val conf = new SparkConf()
  val sc = new SparkContext("local", "test", conf)
  try {
    val etl = new IxtoolsDailyAgg() // empty constructor

    val data = sc.parallelize(List("in1", "in2", "in3"))

    etl.etl(data) // rdd transformation, no access to SparkContext or Hadoop
    Assert.assertTrue(true)
  } finally {
    if (sc != null)
      sc.stop()
  }
}


Why is it trying to access Hadoop at all, and how can I fix it? Thank you
in advance

Thank you,
Konstantin Kudryavtsev


Re: Run spark unit test on Windows 7

2014-07-02 Thread Konstantin Kudryavtsev
Hi Andrew,

it's Windows 7 and I haven't set up any env variables here

The full stack trace:

14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in
the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
 at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
 at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
 at org.apache.hadoop.security.Groups.init(Groups.java:77)
at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
 at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
 at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
at
org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
 at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.init(SparkContext.scala:228)
 at org.apache.spark.SparkContext.init(SparkContext.scala:97)
at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
 at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
 at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
 at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
 at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
 at
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
 at
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
 at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)


Thank you,
Konstantin Kudryavtsev


On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or and...@databricks.com wrote:

 Hi Konstatin,

 We use hadoop as a library in a few places in Spark. I wonder why the path
 includes null though.

 Could you provide the full stack trace?

 Andrew


 2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com:

 Hi all,

 I'm trying to run some transformation on *Spark*, it works fine on
 cluster (YARN, linux machines). However, when I'm trying to run it on local
 machine (*Windows 7*) under unit test, I got errors:


 java.io.IOException: Could not locate executable null\bin\winutils.exe in 
 the Hadoop binaries.
 at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
 at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
 at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
 at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
 at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)


 My code is following:


 @Test
 def testETL() = {
 val conf = new SparkConf()
 val sc = new SparkContext(local, test, conf)
 try {
 val etl = new IxtoolsDailyAgg() // empty constructor

 val data = sc.parallelize(List(in1, in2, in3))

 etl.etl(data) // rdd transformation, no access to SparkContext or 
 Hadoop
 Assert.assertTrue(true)
 } finally {
 if(sc != null)
 sc.stop()
 }
 }


 Why is it trying to access hadoop at all? and how can I fix it? Thank you
 in advance

 Thank you,
 Konstantin Kudryavtsev





NullPointerException on ExternalAppendOnlyMap

2014-07-02 Thread Konstantin Kudryavtsev
Hi all,

I caught a very confusing exception running Spark 1.0 on HDP 2.1.
While saving an RDD as a text file I got:


14/07/02 10:11:12 WARN TaskSetManager: Loss was due to
java.lang.NullPointerException
java.lang.NullPointerException
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$getMorePairs(ExternalAppendOnlyMap.scala:254)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:237)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:236)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.init(ExternalAppendOnlyMap.scala:236)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:218)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:162)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


Do you have any idea what this is? How can I debug this issue, or perhaps
access another log?


Thank you,
Konstantin Kudryavtsev


unsibscribe

2014-05-05 Thread Konstantin Kudryavtsev
unsibscribe

Thank you,
Konstantin Kudryavtsev


Re: Pig on Spark

2014-04-10 Thread Konstantin Kudryavtsev
Hi Mayur,

I wondered if you could share your findings in some way (github, blog post,
etc). I guess your experience will be very interesting/useful for many
people

sent from Lenovo YogaTablet
On Apr 8, 2014 8:48 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 Hi Ankit,
 Thanx for all the work on Pig.
 Finally got it working. Couple of high level bugs right now:

- Getting it working on Spark 0.9.0
- Getting UDF working
- Getting generate functionality working
- Exhaustive test suite on Spark on Pig

 are you maintaining a Jira somewhere?

 I am currently trying to deploy it on 0.9.0.

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Mar 14, 2014 at 1:37 PM, Aniket Mokashi aniket...@gmail.comwrote:

 We will post fixes from our side at - https://github.com/twitter/pig.

 Top on our list are-
 1. Make it work with pig-trunk (execution engine interface) (with 0.8 or
 0.9 spark).
 2. Support for algebraic udfs (this mitigates the group by oom problems).

 Would definitely love more contribution on this.

 Thanks,
 Aniket


 On Fri, Mar 14, 2014 at 12:29 PM, Mayur Rustagi 
 mayur.rust...@gmail.comwrote:

 Dam I am off to NY for Structure Conf. Would it be possible to meet
 anytime after 28th March?
 I am really interested in making it stable  production quality.

 Regards
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Mar 14, 2014 at 11:53 AM, Julien Le Dem jul...@twitter.comwrote:

 Hi Mayur,
 Are you going to the Pig meetup this afternoon?
 http://www.meetup.com/PigUser/events/160604192/
 Aniket and I will be there.
 We would be happy to chat about Pig-on-Spark



 On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi mayur.rust...@gmail.com
  wrote:

 Hi Lin,
 We are working on getting Pig on spark functional with 0.8.0, have you
 got it working on any spark version ?
 Also what all functionality works on it?
 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.comwrote:

 Hi Sameer,

 Lin (cc'ed) could also give you some updates about Pig on Spark
 development on her side.

 Best,
 Xiangrui

 On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com
 wrote:
  Hi Mayur,
  We are planning to upgrade our distribution MR1 MR2 (YARN) and the
 goal is
  to get SPROK set up next month. I will keep you posted. Can you
 please keep
  me informed about your progress as well.
 
  
  From: mayur.rust...@gmail.com
  Date: Mon, 10 Mar 2014 11:47:56 -0700
 
  Subject: Re: Pig on Spark
  To: user@spark.apache.org
 
 
  Hi Sameer,
  Did you make any progress on this. My team is also trying it out
 would love
  to know some detail so progress.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com
 wrote:
 
  Hi Aniket,
  Many thanks! I will check this out.
 
  
  Date: Thu, 6 Mar 2014 13:46:50 -0800
  Subject: Re: Pig on Spark
  From: aniket...@gmail.com
  To: user@spark.apache.org; tgraves...@yahoo.com
 
 
  There is some work to make this work on yarn at
  https://github.com/aniket486/pig. (So, compile pig with ant
  -Dhadoopversion=23)
 
  You can look at
 https://github.com/aniket486/pig/blob/spork/pig-spark to
  find out what sort of env variables you need (sorry, I haven't been
 able to
  clean this up- in-progress). There are few known issues with this,
 I will
  work on fixing them soon.
 
  Known issues-
  1. Limit does not work (spork-fix)
  2. Foreach requires to turn off schema-tuple-backend (should be a
 pig-jira)
  3. Algebraic udfs dont work (spork-fix in-progress)
  4. Group by rework (to avoid OOMs)
  5. UDF Classloader issue (requires SPARK-1053, then you can put
  pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf
 jars)
 
  ~Aniket
 
 
 
 
  On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com
 wrote:
 
  I had asked a similar question on the dev mailing list a while back
 (Jan
  22nd).
 
  See the archives:
 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser-
  look for spork.
 
  Basically Matei said:
 
  Yup, that was it, though I believe people at Twitter picked it up
 again
  recently. I'd suggest
  asking Dmitriy if you know him. I've seen interest in this from
 several
  other groups, and
  if there's enough of it, maybe we can start another open source
 repo to
  track it. The work
  in that repo you pointed to was done over one week, and already had
 most of
  Pig's operators
  working. (I helped out with this prototype over Twitter's hack
 week.) That
  work also calls
  the Scala API directly, because it was done before we 

Re: how to save RDD partitions in different folders?

2014-04-04 Thread Konstantin Kudryavtsev
Hi Evan,

Could you please provide a code snippet? It's not clear to me: in
Hadoop you would use the addNamedOutput method, and I'm stuck on how to use
it (or an equivalent) from Spark.
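
In the meantime, the closest I have come up with on my own is the old-API
MultipleTextOutputFormat route sketched below (the class name, the sample data
and the output path are made up, so please correct me if this is not what you
meant):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext._   // PairRDDFunctions implicits (Spark 1.x)

// route each record into a sub-folder named after its key
class KeyBasedOutput extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    key + "/" + name                     // e.g. part0/part-00000
  override def generateActualKey(key: String, value: String): String =
    null                                 // keep the key out of the file contents
}

val pairs = sc.parallelize(Seq(("part0", "a"), ("part1", "b")))   // hypothetical data
pairs.saveAsHadoopFile("hdfs:///tmp/out", classOf[String], classOf[String], classOf[KeyBasedOutput])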

Thank you,
Konstantin Kudryavtsev


On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks evan.spa...@gmail.com wrote:

 Have a look at MultipleOutputs in the hadoop API. Spark can read and write
 to arbitrary hadoop formats.

  On Apr 4, 2014, at 6:01 AM, dmpour23 dmpou...@gmail.com wrote:
 
  Hi all,
  Say I have an input file which I would like to partition using
  HashPartitioner k times.
 
  Calling  rdd.saveAsTextFile(hdfs://); will save k files as part-0
  part-k
  Is there a way to save each partition in specific folders?
 
  i.e. src
   part0/part-0
   part1/part-1
   part1/part-k
 
  thanks
  Dimitri
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.