reduceByKey to get all associated values
Hi there, I'm wondering whether it is possible to get the same behavior as the reduce function in the MapReduce framework: for each key K, get the list of associated values List<V>. The reduceByKey function works only on individual values V, not on the whole list. Is there any way to get the list? I have to sort it in a particular way and apply some business logic. Thank you in advance, Konstantin Kudryavtsev
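For reference, a minimal sketch of what this could look like with groupByKey, which collects all values per key (the names and sample data below are illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

    val sc = new SparkContext("local", "group-example")
    val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2)))
    // groupByKey gathers every value for a key into one collection,
    // which can then be sorted and fed through custom business logic
    val grouped = pairs.groupByKey().mapValues(vs => vs.toList.sorted)

Note that groupByKey materializes all values for a key on one node, so it can be memory-hungry for heavily skewed keys.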
Re: Ports required for running spark
Hi Haiyang, you are right, YARN takes over the resource management, but I constantly get a ConnectionRefused exception on the mentioned port. So I suppose some Spark-internal communication is done via this port... but I don't know what exactly, or how I can change it... Thank you, Konstantin Kudryavtsev

On Thu, Jul 31, 2014 at 2:53 PM, Haiyang Fu haiyangfu...@gmail.com wrote: Hi Konstantin, Would you please post some more details? Error info or exception from the log, and in what situation? When you run a Spark job in yarn-cluster mode, YARN takes over all the resource management.

On Thu, Jul 31, 2014 at 6:17 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi Larry, I'm afraid that covers standalone mode only; I'm interested in YARN. Also, I don't see the port in trouble, 33007, which I believe is related to Akka. Thank you, Konstantin Kudryavtsev

On Thu, Jul 31, 2014 at 1:11 PM, Larry Xiao xia...@sjtu.edu.cn wrote: Hi Konstantin, I think you can find it at https://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security and you can specify the port for the master or worker in conf/spark-env.sh. Larry

On 7/31/14, 6:04 PM, Konstantin Kudryavtsev wrote: Hi there, I'm trying to run Spark on a YARN cluster and am facing an issue where some ports are closed, particularly port 33007 (I suppose it's used by Akka). Could you please provide me with a list of all ports required by Spark? Also, is it possible to set these ports? Thank you in advance, Konstantin Kudryavtsev
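For reference, some of these ports are configurable rather than randomly chosen; a minimal sketch, assuming it is the Akka driver port that is being blocked (the port value is illustrative):

    val conf = new SparkConf()
      .setAppName("port-config")
      .set("spark.driver.port", "33007")  // pin the driver's Akka port instead of a random ephemeral one

Ports that are left unset are picked randomly at startup, which is usually what trips up a locked-down firewall.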
Spark scheduling with Capacity scheduler
Hi all, I'm using HDP 2.0 with YARN. I'm running both MapReduce and Spark jobs on this cluster; is it possible to use the Capacity Scheduler for managing Spark jobs as well as MR jobs? I mean, I'm able to send an MR job to a specific queue; can I do the same with a Spark job? Thank you in advance, Konstantin Kudryavtsev
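For reference, spark-submit on YARN accepts a queue name, so a Spark job can target a Capacity Scheduler queue the same way an MR job does; a minimal sketch (class, jar and queue names are illustrative):

    ./bin/spark-submit --class my.Main --master yarn-cluster --queue etl-queue my-app.jar

YARN then applies the same capacity limits to the Spark application's containers as it would to MapReduce containers.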
Filtering data during the read
Hi all, I wondered if you could help me clarify the following situation. In the classic example

    val file = spark.textFile("hdfs://...")
    val errors = file.filter(line => line.contains("ERROR"))

as I understand it, the data is first read into memory and only afterwards is the filter applied. Is there any way to apply the filter during the read step, and not put all objects into memory? Thank you, Konstantin Kudryavtsev
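For reference, transformations in Spark are lazy, so the read and the filter are already pipelined: nothing is read until an action runs, and then lines stream through the filter one at a time rather than being materialized as a whole. A minimal sketch:

    val file = sc.textFile("hdfs://...")                      // no data read yet
    val errors = file.filter(line => line.contains("ERROR"))  // still no data read
    val n = errors.count()  // the read and the filter execute together, record by record

The full dataset is only held in memory if it is explicitly cached, e.g. with file.cache().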
java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
Hi all, I hit the following exception during a map step:

java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
java.lang.reflect.Array.newInstance(Array.java:70)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:325)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

I'm using Spark 1.0. In the map I create a new object each time; as I understand it, I can't reuse objects the way I could in MapReduce development? I wonder if you could point me at how to avoid the GC overhead... Thank you in advance. Thank you, Konstantin Kudryavtsev
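For reference, the trace shows the OOM arriving while Kryo deserializes shuffle input for a co-group, so each task is holding many live objects at once; there is indeed no MapReduce-style object reuse to lean on in Spark. A sketch of settings that often relieve GC pressure (the values are illustrative, not prescriptive):

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")          // more heap per executor
      .set("spark.storage.memoryFraction", "0.4")  // leave more heap to shuffle/deserialized objects
      .set("spark.default.parallelism", "200")     // smaller partitions mean fewer live objects per task

Raising the partition count of the co-grouped RDDs is usually the most effective lever, since it shrinks how much any single task must deserialize at once.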
how to convert RDD to PairRDDFunctions?
Hi all, sorry for the silly question, but how can I get a PairRDDFunctions from an RDD? I need it to perform a leftOuterJoin afterwards. Currently I do it this way (it seems incorrect):

    val pairRDD = new PairRDDFunctions(oldRdd.map(i => (i.key, i)))

I guess this constructor is definitely wrong... Thank you, Konstantin Kudryavtsev
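For reference, a minimal sketch: the implicit conversions in SparkContext._ turn any RDD of pairs into PairRDDFunctions automatically, so no constructor call is needed (oldRdd and otherPairRdd are illustrative names):

    import org.apache.spark.SparkContext._  // brings the RDD[(K, V)] -> PairRDDFunctions implicits into scope

    val pairRdd = oldRdd.map(i => (i.key, i))        // RDD[(K, V)]
    val joined = pairRdd.leftOuterJoin(otherPairRdd) // pair operations now resolve directly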
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
Guys, I'm not talking about running Spark on a VM, I don't have a problem with that. I'm confused about the following: 1) Hortonworks describes the installation process as RPMs on each node; 2) the Spark home page says that all I need is YARN. And I'm stuck understanding what I need to do to run Spark on YARN (do I need the RPM installations, or only to build Spark on an edge node?) Thank you, Konstantin Kudryavtsev

On Mon, Jul 7, 2014 at 4:34 AM, Robert James srobertja...@gmail.com wrote: I can say from my experience that getting Spark to work with Hadoop 2 is not for the beginner; after solving one problem after another (dependencies, scripts, etc.), I went back to Hadoop 1. Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure why, but, given so, Hadoop 2 has too many bumps.

On 7/6/14, Marco Shaw marco.s...@gmail.com wrote: That is confusing based on the context you provided. This might take more time than I can spare to try to understand. For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. Cloudera's CDH 5 express VM includes Spark, but the service isn't running by default. I can't remember for MapR... Marco

On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Marco, Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf HDP 2.1 means YARN, yet at the same time they propose to install an RPM. On the other hand, http://spark.apache.org/ says: "Integrated with Hadoop: Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data. If you have a Hadoop 2 cluster, you can run Spark without any installation needed." And this is confusing for me... do I need the RPM installation or not?... Thank you, Konstantin Kudryavtsev

On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote: Can you provide links to the sections that are confusing? My understanding: the HDP1 binaries do not need YARN, while the HDP2 binaries do. Now, you can also install the Hortonworks Spark RPM... For production, in my opinion, RPMs are better for manageability.

On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference? sent from my HTC

On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote: Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf Let me know if you see issues with the tech preview.
spark PI example on HDP 2.0: I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html (for HDP2), then ran the example from the Spark web site:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got the error:

Application application_1404470405736_0044 failed 3 times due to AM Container for appattempt_1404470405736_0044_03 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application.

Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 1, --num-executors, 3)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required)
--class CLASS_NAME Name of your application's main class (required)
...bla-bla-bla

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
Thank you, Krishna! Could you please explain why I need to install Spark on each node, when the official Spark site says: "If you have a Hadoop 2 cluster, you can run Spark without any installation needed"? I have HDP 2 (YARN), and that's why I hope I don't need to install Spark on each node. Thank you, Konstantin Kudryavtsev

On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar ksanka...@gmail.com wrote: Konstantin, 1. You need to install the Hadoop RPMs on all nodes. If it is Hadoop 2, the nodes would have HDFS and YARN. 2. Then you need to install Spark on all nodes. I haven't had experience with HDP, but the tech preview might have installed Spark as well. 3. In the end, one should have HDFS, YARN and Spark installed on all the nodes. 4. After the installations, check the web console to make sure HDFS, YARN and Spark are running. 5. Then you are ready to start experimenting with/developing Spark applications. HTH. Cheers, k/

On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Guys, I'm not talking about running Spark on a VM, I don't have a problem with that. I'm confused about the following: 1) Hortonworks describes the installation process as RPMs on each node; 2) the Spark home page says that all I need is YARN. And I'm stuck understanding what I need to do to run Spark on YARN (do I need the RPM installations, or only to build Spark on an edge node?) Thank you, Konstantin Kudryavtsev

On Mon, Jul 7, 2014 at 4:34 AM, Robert James srobertja...@gmail.com wrote: I can say from my experience that getting Spark to work with Hadoop 2 is not for the beginner; after solving one problem after another (dependencies, scripts, etc.), I went back to Hadoop 1. Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure why, but, given so, Hadoop 2 has too many bumps.

On 7/6/14, Marco Shaw marco.s...@gmail.com wrote: That is confusing based on the context you provided. This might take more time than I can spare to try to understand. For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. Cloudera's CDH 5 express VM includes Spark, but the service isn't running by default. I can't remember for MapR... Marco

On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Marco, Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf HDP 2.1 means YARN, yet at the same time they propose to install an RPM. On the other hand, http://spark.apache.org/ says: "Integrated with Hadoop: Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data. If you have a Hadoop 2 cluster, you can run Spark without any installation needed." And this is confusing for me... do I need the RPM installation or not?... Thank you, Konstantin Kudryavtsev

On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote: Can you provide links to the sections that are confusing? My understanding: the HDP1 binaries do not need YARN, while the HDP2 binaries do. Now, you can also install the Hortonworks Spark RPM... For production, in my opinion, RPMs are better for manageability.

On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference?
sent from my HTC

On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote: Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf Let me know if you see issues with the tech preview.

spark PI example on HDP 2.0: I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html (for HDP2), then ran the example from the Spark web site:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got the error:

Application application_1404470405736_0044 failed 3 times due to AM Container for appattempt_1404470405736_0044_03 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
Hi Chester, thank you very much, it is clear now - just two different ways to support Spark on a cluster. Thank you, Konstantin Kudryavtsev

On Mon, Jul 7, 2014 at 3:22 PM, Chester @work ches...@alpinenow.com wrote: In YARN cluster mode, you can either have Spark on all the cluster nodes or supply the Spark jar yourself. In the 2nd case, you don't need to install Spark on the cluster at all, as you supply the Spark assembly as well as your app jar together. I hope this makes it clear. Chester. Sent from my iPhone

On Jul 7, 2014, at 5:05 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Thank you, Krishna! Could you please explain why I need to install Spark on each node, when the official Spark site says: "If you have a Hadoop 2 cluster, you can run Spark without any installation needed"? I have HDP 2 (YARN), and that's why I hope I don't need to install Spark on each node. Thank you, Konstantin Kudryavtsev

On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar ksanka...@gmail.com wrote: Konstantin, 1. You need to install the Hadoop RPMs on all nodes. If it is Hadoop 2, the nodes would have HDFS and YARN. 2. Then you need to install Spark on all nodes. I haven't had experience with HDP, but the tech preview might have installed Spark as well. 3. In the end, one should have HDFS, YARN and Spark installed on all the nodes. 4. After the installations, check the web console to make sure HDFS, YARN and Spark are running. 5. Then you are ready to start experimenting with/developing Spark applications. HTH. Cheers, k/

On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Guys, I'm not talking about running Spark on a VM, I don't have a problem with that. I'm confused about the following: 1) Hortonworks describes the installation process as RPMs on each node; 2) the Spark home page says that all I need is YARN. And I'm stuck understanding what I need to do to run Spark on YARN (do I need the RPM installations, or only to build Spark on an edge node?) Thank you, Konstantin Kudryavtsev

On Mon, Jul 7, 2014 at 4:34 AM, Robert James srobertja...@gmail.com wrote: I can say from my experience that getting Spark to work with Hadoop 2 is not for the beginner; after solving one problem after another (dependencies, scripts, etc.), I went back to Hadoop 1. Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure why, but, given so, Hadoop 2 has too many bumps.

On 7/6/14, Marco Shaw marco.s...@gmail.com wrote: That is confusing based on the context you provided. This might take more time than I can spare to try to understand. For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. Cloudera's CDH 5 express VM includes Spark, but the service isn't running by default. I can't remember for MapR... Marco

On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Marco, Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf HDP 2.1 means YARN, yet at the same time they propose to install an RPM. On the other hand, http://spark.apache.org/ says: "Integrated with Hadoop: Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data. If you have a Hadoop 2 cluster, you can run Spark without any installation needed." And this is confusing for me... do I need the RPM installation or not?... Thank you, Konstantin Kudryavtsev

On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote: Can you provide links to the sections that are confusing?
My understanding: the HDP1 binaries do not need YARN, while the HDP2 binaries do. Now, you can also install the Hortonworks Spark RPM... For production, in my opinion, RPMs are better for manageability.

On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference? sent from my HTC

On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote: Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf Let me know if you see issues with the tech preview.

spark PI example on HDP 2.0: I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html (for HDP2), then ran the example from the Spark web site:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got error: Application
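For reference, a minimal sketch of Chester's second option (supplying the Spark assembly yourself instead of installing it per node), assuming a Spark 1.0 build for Hadoop 2.2 and illustrative paths; in that release the assembly location was conveyed via the SPARK_JAR environment variable:

    export SPARK_JAR=./lib/spark-assembly-1.0.0-hadoop2.2.0.jar
    ./bin/spark-submit --class my.Main --master yarn-cluster my-app.jar

spark-submit then ships the assembly to the YARN staging directory along with the application jar, so the worker nodes need no Spark installation of their own.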
Control number of tasks per stage
Hi all, is there any way to control the number of tasks per stage? Currently I see a situation where only 2 tasks are created per stage and each of them is very slow, while at the same time the cluster has a huge number of unused nodes. Thank you, Konstantin Kudryavtsev
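For reference, the number of tasks in a stage equals the number of partitions of the RDD it processes, so raising the partition count raises the task count; a minimal sketch (the counts and path are illustrative):

    val data = sc.textFile("hdfs://...", 64)  // hint at least 64 input partitions
    val spread = data.repartition(64)         // or explicitly reshuffle an existing RDD

For shuffle stages, the numPartitions argument of operations like reduceByKey(func, 64), or the spark.default.parallelism setting, plays the same role.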
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference? sent from my HTC

On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote: Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf Let me know if you see issues with the tech preview.

spark PI example on HDP 2.0: I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html (for HDP2), then ran the example from the Spark web site:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got the error:

Application application_1404470405736_0044 failed 3 times due to AM Container for appattempt_1404470405736_0044_03 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application.

Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 1, --num-executors, 3)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required)
--class CLASS_NAME Name of your application's main class (required)
...bla-bla-bla

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
Marco, Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf HDP 2.1 means YARN, yet at the same time they propose to install an RPM. On the other hand, http://spark.apache.org/ says: "Integrated with Hadoop: Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data. If you have a Hadoop 2 cluster, you can run Spark without any installation needed." And this is confusing for me... do I need the RPM installation or not?... Thank you, Konstantin Kudryavtsev

On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote: Can you provide links to the sections that are confusing? My understanding: the HDP1 binaries do not need YARN, while the HDP2 binaries do. Now, you can also install the Hortonworks Spark RPM... For production, in my opinion, RPMs are better for manageability.

On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference? sent from my HTC

On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote: Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf Let me know if you see issues with the tech preview.

spark PI example on HDP 2.0: I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html (for HDP2), then ran the example from the Spark web site:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got the error:

Application application_1404470405736_0044 failed 3 times due to AM Container for appattempt_1404470405736_0044_03 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application.

Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 1, --num-executors, 3)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required)
--class CLASS_NAME Name of your application's main class (required)
...bla-bla-bla

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark 1.0 failed on HDP 2.0 with absurd exception
Hi all, I have a cluster with HDP 2.0. I built Spark 1.0 on an edge node and am trying to run it with the command

./bin/spark-submit --class test.etl.RunETL --master yarn-cluster --num-executors 14 --driver-memory 3200m --executor-memory 3g --executor-cores 2 my-etl-1.0-SNAPSHOT-hadoop2.2.0.jar

As a result I got a failed YARN application with the following stack trace:

Application application_1404481778533_0068 failed 3 times due to AM Container for appattempt_1404481778533_0068_03 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application

Log Type: stderr
Log Length: 686
Unknown/unsupported param List(--executor-memory, 3072, --executor-cores, 2, --num-executors, 14)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required)
--class CLASS_NAME Name of your application's main class (required)
--args ARGS Arguments to be passed to your application's main class. Multiple invocations are possible, each will be passed in order.
--num-workers NUM Number of workers to start (Default: 2)
--worker-cores NUM Number of cores for the workers (Default: 1)
--worker-memory MEM Memory per Worker (e.g. 1000M, 2G) (Default: 1G)

This looks like the old Spark notation. Any ideas? Thank you, Konstantin Kudryavtsev
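For reference, the --num-workers/--worker-memory usage text belongs to the pre-1.0 ApplicationMaster, so one plausible reading is that an older Spark assembly already on the cluster (e.g. the 0.9.1 tech preview) is being launched instead of the 1.0 build; a sketch of pinning the 1.0 assembly, assuming an illustrative local path:

    export SPARK_JAR=/path/to/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
    ./bin/spark-submit --class test.etl.RunETL --master yarn-cluster --num-executors 14 my-etl-1.0-SNAPSHOT-hadoop2.2.0.jar

If the two versions agree, the 1.0 ApplicationMaster understands the --executor-* flags that spark-submit passes.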
[no subject]
I ran into very strange behavior in a job that I ran on a YARN Hadoop cluster. One of the stages (a map function) was split into 80 tasks; 10 of them finished successfully in ~2 min, but all the other tasks have been running for 40 min and still haven't finished... I suspect they are hung. Any ideas what's going on and how it can be fixed? Thank you, Konstantin Kudryavtsev
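For reference, one common mitigation for straggler tasks is speculative execution, which relaunches slow tasks on other nodes; a minimal sketch (the multiplier value is illustrative):

    val conf = new SparkConf()
      .set("spark.speculation", "true")
      .set("spark.speculation.multiplier", "2")  // treat tasks at 2x the median runtime as stragglers

If the slow tasks are simply processing far more data than the fast ones, the real fix is usually repartitioning to even out the skew rather than speculation.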
Unable to run Spark 1.0 SparkPi on HDP 2.0
Hi all, I'm stuck on an issue running the Spark Pi example on HDP 2.0. I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html (for HDP2), then ran the example from the Spark web site:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got the error:

Application application_1404470405736_0044 failed 3 times due to AM Container for appattempt_1404470405736_0044_03 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application.

Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 1, --num-executors, 3)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required)
--class CLASS_NAME Name of your application's main class (required)
...bla-bla-bla

Any ideas? How can I make it work? Thank you, Konstantin Kudryavtsev
Re: Run spark unit test on Windows 7
It sounds really strange... I guess it is a bug, a critical bug, and it must be fixed... at least some flag should be added (something like disable.hadoop). I found the following workaround: 1) download a compiled winutils.exe from http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight 2) put this file into d:\winutil\bin 3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\"). After that the test runs. Thank you, Konstantin Kudryavtsev

On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote: You don't actually need it per se - it's just that some of the Spark libraries reference Hadoop libraries even if they ultimately do not call them. When I was doing some early builds of Spark on Windows, I admittedly had Hadoop on Windows running as well and had not run into this particular issue.

On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev kudryavtsev.konstan...@gmail.com wrote: No, I don't. Why do I need to have HDP installed? I don't use Hadoop at all, and I'd like to read data from the local filesystem.

On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote: By any chance do you have HDP 2.1 installed? You may need to install the utils and update the env variables per http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows

On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi Andrew, it's Windows 7 and I haven't set up any env variables here. The full stack trace:

14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

Thank you, Konstantin Kudryavtsev

On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or and...@databricks.com wrote: Hi Konstantin, We use hadoop as a library in a few places in Spark. I wonder why the path includes null though. Could you provide the full stack trace? Andrew

2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com: Hi all, I'm trying to run some transformations on *Spark*; it works
Run spark unit test on Windows 7
Hi all, I'm trying to run some transformations on *Spark*; it works fine on the cluster (YARN, Linux machines). However, when I try to run it on a local machine (*Windows 7*) under a unit test, I get errors:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)

My code is the following:

@Test
def testETL() = {
  val conf = new SparkConf()
  val sc = new SparkContext("local", "test", conf)
  try {
    val etl = new IxtoolsDailyAgg() // empty constructor
    val data = sc.parallelize(List("in1", "in2", "in3"))
    etl.etl(data) // rdd transformation, no access to SparkContext or Hadoop
    Assert.assertTrue(true)
  } finally {
    if (sc != null) sc.stop()
  }
}

Why is it trying to access Hadoop at all? And how can I fix it? Thank you in advance. Thank you, Konstantin Kudryavtsev
Re: Run spark unit test on Windows 7
Hi Andrew, it's Windows 7 and I haven't set up any env variables here. The full stack trace:

14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

Thank you, Konstantin Kudryavtsev

On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or and...@databricks.com wrote: Hi Konstantin, We use hadoop as a library in a few places in Spark. I wonder why the path includes null though. Could you provide the full stack trace? Andrew

2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com: Hi all, I'm trying to run some transformations on *Spark*; it works fine on the cluster (YARN, Linux machines).
However, when I try to run it on a local machine (*Windows 7*) under a unit test, I get errors:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)

My code is the following:

@Test
def testETL() = {
  val conf = new SparkConf()
  val sc = new SparkContext("local", "test", conf)
  try {
    val etl = new IxtoolsDailyAgg() // empty constructor
    val data = sc.parallelize(List("in1", "in2", "in3"))
    etl.etl(data) // rdd transformation, no access to SparkContext or Hadoop
    Assert.assertTrue(true)
  } finally {
    if (sc != null) sc.stop()
  }
}

Why is it trying to access Hadoop at all? And how can I fix it? Thank you in advance. Thank you, Konstantin Kudryavtsev
NullPointerException on ExternalAppendOnlyMap
Hi all, I hit a very confusing exception running Spark 1.0 on HDP 2.1. While saving an RDD as a text file I got:

14/07/02 10:11:12 WARN TaskSetManager: Loss was due to java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$getMorePairs(ExternalAppendOnlyMap.scala:254)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:237)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:236)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:236)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:218)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:162)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Do you have any idea what this is? How can I debug this issue, or perhaps access another log? Thank you, Konstantin Kudryavtsev
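For reference, a sketch of one thing worth ruling out, assuming the data may contain null keys (a common trigger for NPEs inside the spill iterator, though not a confirmed diagnosis for this trace):

    val cleaned = pairs.filter { case (k, _) => k != null }  // drop null keys before the shuffle/co-group

If the keys are all non-null, trying a later 1.0.x maintenance release is worth it, since ExternalAppendOnlyMap fixes landed shortly after 1.0.0.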
unsubscribe
unsubscribe Thank you, Konstantin Kudryavtsev
Re: Pig on Spark
Hi Mayur, I wondered if you could share your findings in some way (github, blog post, etc). I guess your experience would be very interesting/useful for many people. sent from Lenovo YogaTablet

On Apr 8, 2014 8:48 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi Ankit, Thanx for all the work on Pig. Finally got it working. Couple of high-level bugs right now: - Getting it working on Spark 0.9.0 - Getting UDFs working - Getting generate functionality working - Exhaustive test suite on Spark on Pig. Are you maintaining a Jira somewhere? I am currently trying to deploy it on 0.9.0. Regards, Mayur. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Fri, Mar 14, 2014 at 1:37 PM, Aniket Mokashi aniket...@gmail.com wrote: We will post fixes from our side at https://github.com/twitter/pig. Top on our list are: 1. Make it work with pig-trunk (execution engine interface) (with 0.8 or 0.9 spark). 2. Support for algebraic udfs (this mitigates the group-by OOM problems). Would definitely love more contribution on this. Thanks, Aniket

On Fri, Mar 14, 2014 at 12:29 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Damn, I am off to NY for Structure Conf. Would it be possible to meet anytime after 28th March? I am really interested in making it stable, production quality. Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Fri, Mar 14, 2014 at 11:53 AM, Julien Le Dem jul...@twitter.com wrote: Hi Mayur, Are you going to the Pig meetup this afternoon? http://www.meetup.com/PigUser/events/160604192/ Aniket and I will be there. We would be happy to chat about Pig-on-Spark.

On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi Lin, We are working on getting Pig on Spark functional with 0.8.0; have you got it working on any Spark version? Also, what functionality works on it? Regards, Mayur. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.com wrote: Hi Sameer, Lin (cc'ed) could also give you some updates about Pig on Spark development on her side. Best, Xiangrui

On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com wrote: Hi Mayur, We are planning to upgrade our distribution from MR1 to MR2 (YARN), and the goal is to get SPROK set up next month. I will keep you posted. Can you please keep me informed about your progress as well.

From: mayur.rust...@gmail.com Date: Mon, 10 Mar 2014 11:47:56 -0700 Subject: Re: Pig on Spark To: user@spark.apache.org Hi Sameer, Did you make any progress on this? My team is also trying it out; we would love to know some details of your progress. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com wrote: Hi Aniket, Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800 Subject: Re: Pig on Spark From: aniket...@gmail.com To: user@spark.apache.org; tgraves...@yahoo.com There is some work to make this work on YARN at https://github.com/aniket486/pig. (So, compile pig with ant -Dhadoopversion=23.) You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up - in progress). There are a few known issues with this; I will work on fixing them soon. Known issues: 1. Limit does not work (spork-fix)
2. Foreach requires turning off the schema-tuple-backend (should be a pig JIRA) 3. Algebraic udfs don't work (spork-fix in progress) 4. Group-by rework (to avoid OOMs) 5. UDF classloader issue (requires SPARK-1053; then you can put pig-withouthadoop.jar as SPARK_JARS in SparkContext along with the udf jars) ~Aniket

On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote: I had asked a similar question on the dev mailing list a while back (Jan 22nd). See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look for spork. Basically Matei said: Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week, and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we
Re: how to save RDD partitions in different folders?
Hi Evan, could you please provide a code snippet? It's not clear to me: in Hadoop you need to engage the addNamedOutput method, and I'm stuck on how to use it from Spark. Thank you, Konstantin Kudryavtsev

On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks evan.spa...@gmail.com wrote: Have a look at MultipleOutputs in the hadoop API. Spark can read and write to arbitrary hadoop formats.

On Apr 4, 2014, at 6:01 AM, dmpour23 dmpou...@gmail.com wrote: Hi all, Say I have an input file which I would like to partition using HashPartitioner k times. Calling rdd.saveAsTextFile("hdfs://...") will save k files as part-0 ... part-k. Is there a way to save each partition in specific folders? i.e. src/ part0/part-0, part1/part-1, ..., partk/part-k. thanks, Dimitri

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
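For reference, a sketch of one way to do this from Spark without addNamedOutput, assuming the old Hadoop API: subclass MultipleTextOutputFormat so each record is routed to a folder named after its key (KeyBasedOutput and chooseFolder are illustrative names, not Spark API):

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext._  // for saveAsHadoopFile on pair RDDs

    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      // e.g. key "part0" sends the record to part0/part-00000
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
    }

    val pairs = rdd.map(line => (chooseFolder(line), line))  // chooseFolder: a hypothetical routing function
    pairs.saveAsHadoopFile("hdfs://.../out", classOf[String], classOf[String], classOf[KeyBasedOutput])

Each key value becomes a subfolder under the output path, with the usual part files inside it.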