reduceByKey to get all associated values

2014-08-07 Thread Konstantin Kudryavtsev
Hi there,

I'm interested in whether it is possible to get the same behavior as the
reduce function in the MapReduce framework: for each key K, get the list of
associated values List<V>.

The reduceByKey function works only on individual values V, not on the whole
list. Is there any way to get the list? I have to sort it in a particular way
and apply some business logic.

Thank you in advance,
Konstantin Kudryavtsev
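
For reference, the Spark analogue of the MapReduce reducer's Iterable<V> per key
is groupByKey. A minimal sketch, assuming an existing SparkContext `sc` and
illustrative data (the record type and ordering are assumptions):

import org.apache.spark.SparkContext._   // pair-RDD operations

val events = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2)))

val perKey = events.groupByKey()          // RDD[(String, Iterable[Int])]
val processed = perKey.mapValues { values =>
  values.toList.sorted                    // sort and apply business logic here
}
processed.collect().foreach(println)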


Re: Ports required for running spark

2014-07-31 Thread Konstantin Kudryavtsev
Hi Haiyang,

you are right, YARN takes over the resource management, but I constantly
get a ConnectionRefused exception on the mentioned port. So I suppose some
Spark-internal communication goes through this port... but I don't know what
exactly, or how I can change it...

Thank you,
Konstantin Kudryavtsev


On Thu, Jul 31, 2014 at 2:53 PM, Haiyang Fu  wrote:

> Hi Konstantin,
> Would you please post some more details? Error info or exception from the
> log on what situation?when you run spark job on yarn cluster mode ,yarn
> will take over all the resource management.
>
>
> On Thu, Jul 31, 2014 at 6:17 PM, Konstantin Kudryavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
>> Hi Larry,
>>
>> I'm afraid this is standalone mode, I'm interesting in YARN
>>
>> Also, I don't see port-in-trouble 33007   which i believe related to Akka
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>>
>> On Thu, Jul 31, 2014 at 1:11 PM, Larry Xiao  wrote:
>>
>>>  Hi Konstantin,
>>>
>>> I think you can find it at
>>> https://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security
>>> and you can specify port for master or worker at conf/spark-env.sh
>>>
>>> Larry
>>>
>>>
>>> On 7/31/14, 6:04 PM, Konstantin Kudryavtsev wrote:
>>>
>>> Hi there,
>>>
>>>  I'm trying to run Spark on YARN cluster and face with issued that some
>>> ports are closed, particularly port 33007 (I suppose it's used by Akka)
>>>
>>>  Could you please provide me with a list of all ports required for
>>> Spark?
>>> Also, is it possible to set up these ports?
>>>
>>> Thank you in advance,
>>> Konstantin Kudryavtsev
>>>
>>>
>>>
>>
>


Re: Ports required for running spark

2014-07-31 Thread Konstantin Kudryavtsev
Hi Larry,

I'm afraid that covers standalone mode; I'm interested in YARN.

Also, I don't see the problematic port 33007, which I believe is related to Akka.

Thank you,
Konstantin Kudryavtsev


On Thu, Jul 31, 2014 at 1:11 PM, Larry Xiao  wrote:

>  Hi Konstantin,
>
> I think you can find it at
> https://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security
> and you can specify port for master or worker at conf/spark-env.sh
>
> Larry
>
>
> On 7/31/14, 6:04 PM, Konstantin Kudryavtsev wrote:
>
> Hi there,
>
>  I'm trying to run Spark on YARN cluster and face with issued that some
> ports are closed, particularly port 33007 (I suppose it's used by Akka)
>
>  Could you please provide me with a list of all ports required for Spark?
> Also, is it possible to set up these ports?
>
> Thank you in advance,
> Konstantin Kudryavtsev
>
>
>


Ports required for running spark

2014-07-31 Thread Konstantin Kudryavtsev
Hi there,

I'm trying to run Spark on a YARN cluster and am facing the issue that some
ports are closed, in particular port 33007 (I suppose it's used by Akka).

Could you please provide me with a list of all ports required by Spark?
Also, is it possible to configure these ports?

Thank you in advance,
Konstantin Kudryavtsev


Spark scheduling with Capacity scheduler

2014-07-17 Thread Konstantin Kudryavtsev
Hi all,

I'm using HDP 2.0 with YARN. I'm running both MapReduce and Spark jobs on this
cluster; is it somehow possible to use the Capacity Scheduler for managing Spark
jobs as well as MR jobs? I mean, I'm able to send an MR job to a specific queue;
can I do the same with a Spark job?
Thank you in advance.

Thank you,
Konstantin Kudryavtsev
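
For what it's worth, spark-submit on YARN accepts a --queue option (and
spark.yarn.queue can be set in the configuration), so a job can be directed at a
specific Capacity Scheduler queue. A hedged sketch, reusing the class and jar
names from later in this thread; the queue name "etl" is an assumption:

./bin/spark-submit --class test.etl.RunETL --master yarn-cluster \
  --queue etl \
  --num-executors 3 --executor-memory 2g \
  my-etl-1.0-SNAPSHOT-hadoop2.2.0.jar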


Filtering data during the read

2014-07-09 Thread Konstantin Kudryavtsev
Hi all,

I wondered if you could help me clarify the following situation:
in the classic example

val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR"))

As I understand it, the data is first read into memory and the filtering is
applied afterwards. Is there any way to apply the filter during the read step,
so that not all objects are put into memory?

Thank you,
Konstantin Kudryavtsev
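
For context: textFile and filter are both lazy transformations, so nothing is
materialized until an action runs, and lines stream through the filter rather
than being loaded wholesale; only what is explicitly cached stays in memory.
A minimal sketch, assuming an existing SparkContext `sc`:

val file   = sc.textFile("hdfs://...")            // lazy: nothing is read yet
val errors = file.filter(_.contains("ERROR"))     // lazy: lines stream through the filter
errors.cache()                                    // optional: keep only the filtered lines in memory
println(errors.count())                           // action: the read + filter actually run here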


how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Konstantin Kudryavtsev
Hi all,

sorry for the silly question, but how can I get a PairRDDFunctions RDD? I need
it to perform a leftOuterJoin afterwards.

Currently I do it this way (it seems incorrect):
val pairRDD = new PairRDDFunctions( oldRdd.map(i => (i.key, i)) )

I guess calling this constructor directly is wrong...


Thank you,
Konstantin Kudryavtsev
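
For reference, PairRDDFunctions is normally obtained through an implicit
conversion rather than by calling the constructor. A minimal sketch, assuming
(as in the question) a record type with a `key` field and a second pair RDD to
join against:

import org.apache.spark.SparkContext._   // brings in the RDD-to-PairRDDFunctions implicit

val pairRdd = oldRdd.map(i => (i.key, i))          // RDD[(K, MyRecord)]
val joined  = pairRdd.leftOuterJoin(otherPairRdd)  // pair methods are available directly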


java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Konstantin Kudryavtsev
Hi all,

I hit the following exception during the map step:
java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit
exceeded)
java.lang.reflect.Array.newInstance(Array.java:70)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:325)
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
I'm using Spark 1.0. In the map I create a new object each time; as I
understand it, I can't reuse objects the way I would in MapReduce development?
I wondered if you could point me to how it is possible to avoid the GC
overhead... thank you in advance

Thank you,
Konstantin Kudryavtsev
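
One common way to cut down per-record allocation (and hence GC pressure) is to
do the work per partition instead of per element, so buffers are created once
per partition. A hedged sketch, assuming rdd is an RDD[String]:

val results = rdd.mapPartitions { records =>
  val builder = new StringBuilder()      // allocated once per partition
  records.map { rec =>
    builder.setLength(0)                 // reuse the builder for each record
    builder.append(rec)
    builder.toString                     // only the final value is retained
  }
}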


Control number of tasks per stage

2014-07-07 Thread Konstantin Kudryavtsev
Hi all,

is there any way to control the number of tasks per stage?

Currently I see a situation where only 2 tasks are created per stage and each
of them is very slow, while at the same time the cluster has a huge number of
unused nodes.


Thank you,
Konstantin Kudryavtsev
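
Two knobs that usually help here: the minimum number of partitions when reading
the input, and repartition on an existing RDD. A minimal sketch, assuming an
existing SparkContext `sc`; the numbers are assumptions to be tuned to the cluster:

// Ask for at least 64 input partitions instead of the default split count
val input = sc.textFile("hdfs://...", 64)

// Or reshuffle an existing RDD into more partitions before the slow stage
val wider = input.repartition(200)
println(wider.partitions.length)   // number of tasks the next stage will get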


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Hi Chester,

Thank you very much, it is clear now - just two different ways to deploy Spark
on a cluster.

Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 3:22 PM, Chester @work  wrote:

> In Yarn cluster mode, you can either have spark on all the cluster nodes
> or supply the spark jar yourself. In the 2nd case, you don't need install
> spark on cluster at all. As you supply the spark assembly as we as your app
> jar together.
>
> I hope this make it clear
>
> Chester
>
> Sent from my iPhone
>
> On Jul 7, 2014, at 5:05 AM, Konstantin Kudryavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> thank you Krishna!
>
>  Could you please explain why do I need install spark on each node if
> Spark official site said: If you have a Hadoop 2 cluster, you can run
> Spark without any installation needed
>
> I have HDP 2 (YARN) and that's why I hope I don't need to install spark on
> each node
>
> Thank you,
> Konstantin Kudryavtsev
>
>
> On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar 
> wrote:
>
>> Konstantin,
>>
>>1. You need to install the hadoop rpms on all nodes. If it is Hadoop
>>2, the nodes would have hdfs & YARN.
>>2. Then you need to install Spark on all nodes. I haven't had
>>experience with HDP, but the tech preview might have installed Spark as
>>well.
>>3. In the end, one should have hdfs,yarn & spark installed on all the
>>nodes.
>>4. After installations, check the web console to make sure hdfs, yarn
>>& spark are running.
>>5. Then you are ready to start experimenting/developing spark
>>applications.
>>
>> HTH.
>> Cheers
>> 
>>
>>
>> On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>>> guys, I'm not talking about running spark on VM, I don have problem with
>>> it.
>>>
>>> I confused in the next:
>>> 1) Hortonworks describe installation process as RPMs on each node
>>> 2) spark home page said that everything I need is YARN
>>>
>>> And I'm in stucj with understanding what I need to do to run spark on
>>> yarn (do I need RPMs installations or only build spark on edge node?)
>>>
>>>
>>> Thank you,
>>> Konstantin Kudryavtsev
>>>
>>>
>>> On Mon, Jul 7, 2014 at 4:34 AM, Robert James 
>>> wrote:
>>>
>>>> I can say from my experience that getting Spark to work with Hadoop 2
>>>> is not for the beginner; after solving one problem after another
>>>> (dependencies, scripts, etc.), I went back to Hadoop 1.
>>>>
>>>> Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
>>>> why, but, given so, Hadoop 2 has too many bumps
>>>>
>>>> On 7/6/14, Marco Shaw  wrote:
>>>> > That is confusing based on the context you provided.
>>>> >
>>>> > This might take more time than I can spare to try to understand.
>>>> >
>>>> > For sure, you need to add Spark to run it in/on the HDP 2.1 express
>>>> VM.
>>>> >
>>>> > Cloudera's CDH 5 express VM includes Spark, but the service isn't
>>>> running by
>>>> > default.
>>>> >
>>>> > I can't remember for MapR...
>>>> >
>>>> > Marco
>>>> >
>>>> >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
>>>> >>  wrote:
>>>> >>
>>>> >> Marco,
>>>> >>
>>>> >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that
>>>> you
>>>> >> can try
>>>> >> from
>>>> >>
>>>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>>>> >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
>>>> >>
>>>> >> On other hand, http://spark.apache.org/ said "
>>>> >> Integrated with Hadoop
>>>> >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
>>>> >> existing Hadoop data.
>>>> >>
>>>> >> If you have a Hadoop 2 cluster, you can run Spark without any
>>>> installation
>>>> >> needed. "
>>>> >>
>>>> >> And this is confusing for me... do I need rpm installation on not?

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
thank you Krishna!

Could you please explain why I need to install Spark on each node, given that
the official Spark site says: "If you have a Hadoop 2 cluster, you can run
Spark without any installation needed"?

I have HDP 2 (YARN), and that's why I hope I don't need to install Spark on
each node.

Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar  wrote:

> Konstantin,
>
>1. You need to install the hadoop rpms on all nodes. If it is Hadoop
>2, the nodes would have hdfs & YARN.
>2. Then you need to install Spark on all nodes. I haven't had
>experience with HDP, but the tech preview might have installed Spark as
>well.
>3. In the end, one should have hdfs,yarn & spark installed on all the
>nodes.
>4. After installations, check the web console to make sure hdfs, yarn
>& spark are running.
>5. Then you are ready to start experimenting/developing spark
>applications.
>
> HTH.
> Cheers
> 
>
>
> On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
>> guys, I'm not talking about running spark on VM, I don have problem with
>> it.
>>
>> I confused in the next:
>> 1) Hortonworks describe installation process as RPMs on each node
>> 2) spark home page said that everything I need is YARN
>>
>> And I'm in stucj with understanding what I need to do to run spark on
>> yarn (do I need RPMs installations or only build spark on edge node?)
>>
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>>
>> On Mon, Jul 7, 2014 at 4:34 AM, Robert James 
>> wrote:
>>
>>> I can say from my experience that getting Spark to work with Hadoop 2
>>> is not for the beginner; after solving one problem after another
>>> (dependencies, scripts, etc.), I went back to Hadoop 1.
>>>
>>> Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
>>> why, but, given so, Hadoop 2 has too many bumps
>>>
>>> On 7/6/14, Marco Shaw  wrote:
>>> > That is confusing based on the context you provided.
>>> >
>>> > This might take more time than I can spare to try to understand.
>>> >
>>> > For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
>>> >
>>> > Cloudera's CDH 5 express VM includes Spark, but the service isn't
>>> running by
>>> > default.
>>> >
>>> > I can't remember for MapR...
>>> >
>>> > Marco
>>> >
>>> >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
>>> >>  wrote:
>>> >>
>>> >> Marco,
>>> >>
>>> >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that
>>> you
>>> >> can try
>>> >> from
>>> >>
>>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>>> >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
>>> >>
>>> >> On other hand, http://spark.apache.org/ said "
>>> >> Integrated with Hadoop
>>> >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
>>> >> existing Hadoop data.
>>> >>
>>> >> If you have a Hadoop 2 cluster, you can run Spark without any
>>> installation
>>> >> needed. "
>>> >>
>>> >> And this is confusing for me... do I need rpm installation on not?...
>>> >>
>>> >>
>>> >> Thank you,
>>> >> Konstantin Kudryavtsev
>>> >>
>>> >>
>>> >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
>>> >>> wrote:
>>> >>> Can you provide links to the sections that are confusing?
>>> >>>
>>> >>> My understanding, the HDP1 binaries do not need YARN, while the HDP2
>>> >>> binaries do.
>>> >>>
>>> >>> Now, you can also install Hortonworks Spark RPM...
>>> >>>
>>> >>> For production, in my opinion, RPMs are better for manageability.
>>> >>>
>>> >>>> On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
>>> >>>>  wrote:
>>> >>>>
>>> >>>> Hello, thanks for your message... I'm confused, Hortonworhs suggest
>>> >>>> install spark rpm on each node, but on Sp

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
guys, I'm not talking about running Spark on a VM; I don't have a problem with that.

What confuses me is the following:
1) Hortonworks describes the installation process as RPMs on each node
2) the Spark home page says that everything I need is YARN

And I'm stuck on understanding what I need to do to run Spark on YARN
(do I need the RPM installations, or only to build Spark on the edge node?)


Thank you,
Konstantin Kudryavtsev


On Mon, Jul 7, 2014 at 4:34 AM, Robert James  wrote:

> I can say from my experience that getting Spark to work with Hadoop 2
> is not for the beginner; after solving one problem after another
> (dependencies, scripts, etc.), I went back to Hadoop 1.
>
> Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
> why, but, given so, Hadoop 2 has too many bumps
>
> On 7/6/14, Marco Shaw  wrote:
> > That is confusing based on the context you provided.
> >
> > This might take more time than I can spare to try to understand.
> >
> > For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
> >
> > Cloudera's CDH 5 express VM includes Spark, but the service isn't
> running by
> > default.
> >
> > I can't remember for MapR...
> >
> > Marco
> >
> >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
> >>  wrote:
> >>
> >> Marco,
> >>
> >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
> >> can try
> >> from
> >>
> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
> >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
> >>
> >> On other hand, http://spark.apache.org/ said "
> >> Integrated with Hadoop
> >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
> >> existing Hadoop data.
> >>
> >> If you have a Hadoop 2 cluster, you can run Spark without any
> installation
> >> needed. "
> >>
> >> And this is confusing for me... do I need rpm installation on not?...
> >>
> >>
> >> Thank you,
> >> Konstantin Kudryavtsev
> >>
> >>
> >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
> >>> wrote:
> >>> Can you provide links to the sections that are confusing?
> >>>
> >>> My understanding, the HDP1 binaries do not need YARN, while the HDP2
> >>> binaries do.
> >>>
> >>> Now, you can also install Hortonworks Spark RPM...
> >>>
> >>> For production, in my opinion, RPMs are better for manageability.
> >>>
> >>>> On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
> >>>>  wrote:
> >>>>
> >>>> Hello, thanks for your message... I'm confused, Hortonworhs suggest
> >>>> install spark rpm on each node, but on Spark main page said that yarn
> >>>> enough and I don't need to install it... What the difference?
> >>>>
> >>>> sent from my HTC
> >>>>
> >>>>> On Jul 6, 2014 8:34 PM, "vs"  wrote:
> >>>>> Konstantin,
> >>>>>
> >>>>> HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can
> >>>>> try
> >>>>> from
> >>>>>
> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
> >>>>>
> >>>>> Let me know if you see issues with the tech preview.
> >>>>>
> >>>>> "spark PI example on HDP 2.0
> >>>>>
> >>>>> I downloaded spark 1.0 pre-build from
> >>>>> http://spark.apache.org/downloads.html
> >>>>> (for HDP2)
> >>>>> The run example from spark web-site:
> >>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi
> >>>>> --master
> >>>>> yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory
> 2g
> >>>>> --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
> >>>>>
> >>>>> I got error:
> >>>>> Application application_1404470405736_0044 failed 3 times due to AM
> >>>>> Container for appattempt_1404470405736_0044_03 exited with
> >>>>> exitCode: 1
> >>>>> due to: Exception from container-launch:
> >>>>> org.apache.hadoop.util.Shell$ExitCodeException:
> >>>>> at org.apache.hadoop.util.Shell.runCommand(Shell.j

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Konstantin Kudryavtsev
Marco,

Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
can try
from
http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
HDP 2.1 means YARN, yet at the same time they propose to install an RPM.

On the other hand, http://spark.apache.org/ says: "
Integrated with Hadoop

Spark can run on Hadoop 2's YARN cluster manager, and can read any existing
Hadoop data.
If you have a Hadoop 2 cluster, you can run Spark without any installation
needed. "

And this is confusing for me... do I need the RPM installation or not?...


Thank you,
Konstantin Kudryavtsev


On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw  wrote:

> Can you provide links to the sections that are confusing?
>
> My understanding, the HDP1 binaries do not need YARN, while the HDP2
> binaries do.
>
> Now, you can also install Hortonworks Spark RPM...
>
> For production, in my opinion, RPMs are better for manageability.
>
> On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Hello, thanks for your message... I'm confused, Hortonworhs suggest
> install spark rpm on each node, but on Spark main page said that yarn
> enough and I don't need to install it... What the difference?
>
> sent from my HTC
> On Jul 6, 2014 8:34 PM, "vs"  wrote:
>
>> Konstantin,
>>
>> HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try
>> from
>>
>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>>
>> Let me know if you see issues with the tech preview.
>>
>> "spark PI example on HDP 2.0
>>
>> I downloaded spark 1.0 pre-build from
>> http://spark.apache.org/downloads.html
>> (for HDP2)
>> The run example from spark web-site:
>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>> yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
>> --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
>>
>> I got error:
>> Application application_1404470405736_0044 failed 3 times due to AM
>> Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
>> due to: Exception from container-launch:
>> org.apache.hadoop.util.Shell$ExitCodeException:
>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
>> at org.apache.hadoop.util.Shell.run(Shell.java:379)
>> at
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
>> at
>>
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>> at
>>
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
>> at
>>
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> .Failing this attempt.. Failing the application.
>>
>> Unknown/unsupported param List(--executor-memory, 2048, --executor-cores,
>> 1,
>> --num-executors, 3)
>> Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
>> Options:
>>   --jar JAR_PATH   Path to your application's JAR file (required)
>>   --class CLASS_NAME   Name of your application's main class (required)
>> ...bla-bla-bla
>> "
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Konstantin Kudryavtsev
Hello, thanks for your message... I'm confused: Hortonworks suggests installing
the Spark RPM on each node, but the Spark main page says that YARN is enough and
I don't need to install it... What is the difference?

sent from my HTC
On Jul 6, 2014 8:34 PM, "vs"  wrote:

> Konstantin,
>
> HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try
> from
> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>
> Let me know if you see issues with the tech preview.
>
> "spark PI example on HDP 2.0
>
> I downloaded spark 1.0 pre-build from
> http://spark.apache.org/downloads.html
> (for HDP2)
> The run example from spark web-site:
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
> --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
>
> I got error:
> Application application_1404470405736_0044 failed 3 times due to AM
> Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
> due to: Exception from container-launch:
> org.apache.hadoop.util.Shell$ExitCodeException:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at
>
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
> at
>
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
> at
>
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> .Failing this attempt.. Failing the application.
>
> Unknown/unsupported param List(--executor-memory, 2048, --executor-cores,
> 1,
> --num-executors, 3)
> Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
> Options:
>   --jar JAR_PATH   Path to your application's JAR file (required)
>   --class CLASS_NAME   Name of your application's main class (required)
> ...bla-bla-bla
> "
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


[no subject]

2014-07-05 Thread Konstantin Kudryavtsev
I faced very strange behavior in a job that I ran on a YARN Hadoop cluster.
One of the stages (a map function) was split into 80 tasks; 10 of them finished
successfully in ~2 min, but all the other tasks have been running for more than
40 min and still haven't finished... I suspect they have hung.
Any ideas what's going on and how it can be fixed?


Thank you,
Konstantin Kudryavtsev


Spark 1.0 failed on HDP 2.0 with absurd exception

2014-07-05 Thread Konstantin Kudryavtsev
Hi all,

I have a cluster with HDP 2.0. I built Spark 1.0 on the edge node and am trying
to run it with the command
./bin/spark-submit --class test.etl.RunETL --master yarn-cluster
--num-executors 14 --driver-memory 3200m --executor-memory 3g
--executor-cores 2 my-etl-1.0-SNAPSHOT-hadoop2.2.0.jar

As a result I got a failed YARN application with the following stack trace:

Application application_1404481778533_0068 failed 3 times due to AM
Container for appattempt_1404481778533_0068_03 exited with exitCode: 1
due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
 .Failing this attempt.. Failing the application

Log Type: stderr

Log Length: 686

Unknown/unsupported param List(--executor-memory, 3072,
--executor-cores, 2, --num-executors, 14)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
  --jar JAR_PATH   Path to your application's JAR file (required)
  --class CLASS_NAME   Name of your application's main class (required)
  --args ARGS  Arguments to be passed to your application's main class.
   Mutliple invocations are possible, each will be
passed in order.
  --num-workers NUMNumber of workers to start (Default: 2)
  --worker-cores NUM   Number of cores for the workers (Default: 1)
  --worker-memory MEM  Memory per Worker (e.g. 1000M, 2G) (Default: 1G)


This looks like the old Spark option notation; any ideas?

Thank you,
Konstantin Kudryavtsev


Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-04 Thread Konstantin Kudryavtsev
Hi all,

I'm stuck on an issue running the Spark Pi example on HDP 2.0.

I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html
(for HDP2) and ran the example from the Spark web site:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
--executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2

I got error:
Application application_1404470405736_0044 failed 3 times due to AM
Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application.

Unknown/unsupported param List(--executor-memory, 2048,
--executor-cores, 1, --num-executors, 3)
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
  --jar JAR_PATH   Path to your application's JAR file (required)
  --class CLASS_NAME   Name of your application's main class (required)

...bla-bla-bla


Any ideas? How can I make it work?

Thank you,
Konstantin Kudryavtsev


Re: Run spark unit test on Windows 7

2014-07-02 Thread Konstantin Kudryavtsev
It sounds really strange...

I guess it is a bug, a critical bug, and it must be fixed... at the very least
some flag should be added (unable.hadoop)

I found the following workaround:
1) download a compiled winutils.exe from
http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
2) put this file into d:\winutil\bin
3) add to my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\")

After that the test runs.

Thank you,
Konstantin Kudryavtsev
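
Putting the workaround into the test itself, a minimal sketch; the d:\winutil
path is the one from the steps above and is an assumption about the local layout:

import org.junit.Before

class EtlTestBase {
  @Before
  def setHadoopHome(): Unit = {
    // Point Hadoop at the directory that contains bin\winutils.exe (Windows only)
    System.setProperty("hadoop.home.dir", "d:\\winutil\\")
  }
}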


On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee  wrote:

> You don't actually need it per se - its just that some of the Spark
> libraries are referencing Hadoop libraries even if they ultimately do not
> call them. When I was doing some early builds of Spark on Windows, I
> admittedly had Hadoop on Windows running as well and had not run into this
> particular issue.
>
>
>
> On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
>> No, I don't
>>
>> why do I need to have HDP installed? I don't use Hadoop at all and I'd
>> like to read data from local filesystem
>>
>> On Jul 2, 2014, at 9:10 PM, Denny Lee  wrote:
>>
>> By any chance do you have HDP 2.1 installed? you may need to install the
>> utils and update the env variables per
>> http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows
>>
>>
>> On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>> Hi Andrew,
>>
>> it's windows 7 and I doesn't set up any env variables here
>>
>> The full stack trace:
>>
>> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in
>> the hadoop binary path
>> java.io.IOException: Could not locate executable null\bin\winutils.exe in
>> the Hadoop binaries.
>> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>>  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>> at org.apache.hadoop.util.Shell.(Shell.java:326)
>>  at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>> at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>>  at org.apache.hadoop.security.Groups.(Groups.java:77)
>> at
>> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>>  at
>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>> at
>> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>>  at
>> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>>  at
>> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>> at org.apache.spark.SparkContext.(SparkContext.scala:228)
>>  at org.apache.spark.SparkContext.(SparkContext.scala:97)
>> at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>  at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>  at junit.framework.TestCase.runTest(TestCase.java:168)
>> at junit.framework.TestCase.runBare(TestCase.java:134)
>>  at junit.framework.TestResult$1.protect(TestResult.java:110)
>> at junit.framework.TestResult.runProtected(TestResult.java:128)
>>  at junit.framework.TestResult.run(TestResult.java:113)
>> at junit.framework.TestCase.run(TestCase.java:124)
>>  at junit.framework.TestSuite.runTest(TestSuite.java:232)
>> at junit.framework.TestSuite.run(TestSuite.java:227)
>>  at
>> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
>> at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
>>  at
>> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
>> at
>> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
>>  at
>> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun

NullPointerException on ExternalAppendOnlyMap

2014-07-02 Thread Konstantin Kudryavtsev
Hi all,

I caught a very confusing exception running Spark 1.0 on HDP 2.1.
While saving an RDD as a text file I got:


14/07/02 10:11:12 WARN TaskSetManager: Loss was due to
java.lang.NullPointerException
java.lang.NullPointerException
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$getMorePairs(ExternalAppendOnlyMap.scala:254)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:237)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:236)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.(ExternalAppendOnlyMap.scala:236)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:218)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:162)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


Do you have any idea what this is? How can I debug this issue, or perhaps
access another log?


Thank you,
Konstantin Kudryavtsev


Re: Run spark unit test on Windows 7

2014-07-02 Thread Konstantin Kudryavtsev
Hi Andrew,

it's Windows 7 and I haven't set up any env variables here

The full stack trace:

14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in
the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
 at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.(Shell.java:326)
 at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
 at org.apache.hadoop.security.Groups.(Groups.java:77)
at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
 at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
 at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
at
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
 at org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.(SparkContext.scala:228)
 at org.apache.spark.SparkContext.(SparkContext.scala:97)
at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
 at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
 at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
 at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
 at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
 at
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
 at
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
 at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)


Thank you,
Konstantin Kudryavtsev


On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or  wrote:

> Hi Konstatin,
>
> We use hadoop as a library in a few places in Spark. I wonder why the path
> includes "null" though.
>
> Could you provide the full stack trace?
>
> Andrew
>
>
> 2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev <
> kudryavtsev.konstan...@gmail.com>:
>
> Hi all,
>>
>> I'm trying to run some transformation on *Spark*, it works fine on
>> cluster (YARN, linux machines). However, when I'm trying to run it on local
>> machine (*Windows 7*) under unit test, I got errors:
>>
>>
>> java.io.IOException: Could not locate executable null\bin\winutils.exe in 
>> the Hadoop binaries.
>> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>> at org.apache.hadoop.util.Shell.(Shell.java:326)
>> at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>> at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>>
>>
>> My code is following:
>>
>>
>> @Test
>> def testETL() = {
>> val conf = new SparkConf()
>> val sc = new SparkContext("local", "test", conf)
>> try {
>> val etl = new IxtoolsDailyAgg() // empty constructor
>>
>> val data = sc.parallelize(List("in1", "in2", "in3"))
>>
>> etl.etl(data) // rdd transformation, no access to SparkContext or 
>> Hadoop
>> Assert.assertTrue(true)
>> } finally {
>> if(sc != null)
>> sc.stop()
>> }
>> }
>>
>>
>> Why is it trying to access hadoop at all? and how can I fix it? Thank you
>> in advance
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>
>


Run spark unit test on Windows 7

2014-07-02 Thread Konstantin Kudryavtsev
Hi all,

I'm trying to run some transformations on *Spark*; they work fine on the cluster
(YARN, Linux machines). However, when I try to run them on my local machine
(*Windows 7*) in a unit test, I get errors:

java.io.IOException: Could not locate executable null\bin\winutils.exe
in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.(Shell.java:326)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)


My code is following:

@Test
def testETL() = {
  val conf = new SparkConf()
  val sc = new SparkContext("local", "test", conf)
  try {
    val etl = new IxtoolsDailyAgg() // empty constructor

    val data = sc.parallelize(List("in1", "in2", "in3"))

    etl.etl(data) // rdd transformation, no access to SparkContext or Hadoop
    Assert.assertTrue(true)
  } finally {
    if (sc != null)
      sc.stop()
  }
}


Why is it trying to access Hadoop at all, and how can I fix it? Thank you
in advance.

Thank you,
Konstantin Kudryavtsev


unsubscribe

2014-05-05 Thread Konstantin Kudryavtsev
unsubscribe

Thank you,
Konstantin Kudryavtsev


Re: Pig on Spark

2014-04-10 Thread Konstantin Kudryavtsev
Hi Mayur,

I wondered if you could share your findings in some way (GitHub, blog post,
etc.). I guess your experience would be very interesting/useful for many
people.

sent from Lenovo YogaTablet
On Apr 8, 2014 8:48 PM, "Mayur Rustagi"  wrote:

> Hi Ankit,
> Thanx for all the work on Pig.
> Finally got it working. Couple of high level bugs right now:
>
>- Getting it working on Spark 0.9.0
>- Getting UDF working
>- Getting generate functionality working
>- Exhaustive test suite on Spark on Pig
>
> are you maintaining a Jira somewhere?
>
> I am currently trying to deploy it on 0.9.0.
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Fri, Mar 14, 2014 at 1:37 PM, Aniket Mokashi wrote:
>
>> We will post fixes from our side at - https://github.com/twitter/pig.
>>
>> Top on our list are-
>> 1. Make it work with pig-trunk (execution engine interface) (with 0.8 or
>> 0.9 spark).
>> 2. Support for algebraic udfs (this mitigates the group by oom problems).
>>
>> Would definitely love more contribution on this.
>>
>> Thanks,
>> Aniket
>>
>>
>> On Fri, Mar 14, 2014 at 12:29 PM, Mayur Rustagi 
>> wrote:
>>
>>> Dam I am off to NY for Structure Conf. Would it be possible to meet
>>> anytime after 28th March?
>>> I am really interested in making it stable & production quality.
>>>
>>> Regards
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi 
>>>
>>>
>>>
>>> On Fri, Mar 14, 2014 at 11:53 AM, Julien Le Dem wrote:
>>>
 Hi Mayur,
 Are you going to the Pig meetup this afternoon?
 http://www.meetup.com/PigUser/events/160604192/
 Aniket and I will be there.
 We would be happy to chat about Pig-on-Spark



 On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi >>> > wrote:

> Hi Lin,
> We are working on getting Pig on spark functional with 0.8.0, have you
> got it working on any spark version ?
> Also what all functionality works on it?
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng wrote:
>
>> Hi Sameer,
>>
>> Lin (cc'ed) could also give you some updates about Pig on Spark
>> development on her side.
>>
>> Best,
>> Xiangrui
>>
>> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak 
>> wrote:
>> > Hi Mayur,
>> > We are planning to upgrade our distribution MR1> MR2 (YARN) and the
>> goal is
>> > to get SPROK set up next month. I will keep you posted. Can you
>> please keep
>> > me informed about your progress as well.
>> >
>> > 
>> > From: mayur.rust...@gmail.com
>> > Date: Mon, 10 Mar 2014 11:47:56 -0700
>> >
>> > Subject: Re: Pig on Spark
>> > To: user@spark.apache.org
>> >
>> >
>> > Hi Sameer,
>> > Did you make any progress on this. My team is also trying it out
>> would love
>> > to know some detail so progress.
>> >
>> > Mayur Rustagi
>> > Ph: +1 (760) 203 3257
>> > http://www.sigmoidanalytics.com
>> > @mayur_rustagi
>> >
>> >
>> >
>> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak 
>> wrote:
>> >
>> > Hi Aniket,
>> > Many thanks! I will check this out.
>> >
>> > 
>> > Date: Thu, 6 Mar 2014 13:46:50 -0800
>> > Subject: Re: Pig on Spark
>> > From: aniket...@gmail.com
>> > To: user@spark.apache.org; tgraves...@yahoo.com
>> >
>> >
>> > There is some work to make this work on yarn at
>> > https://github.com/aniket486/pig. (So, compile pig with ant
>> > -Dhadoopversion=23)
>> >
>> > You can look at
>> https://github.com/aniket486/pig/blob/spork/pig-spark to
>> > find out what sort of env variables you need (sorry, I haven't been
>> able to
>> > clean this up- in-progress). There are few known issues with this,
>> I will
>> > work on fixing them soon.
>> >
>> > Known issues-
>> > 1. Limit does not work (spork-fix)
>> > 2. Foreach requires to turn off schema-tuple-backend (should be a
>> pig-jira)
>> > 3. Algebraic udfs dont work (spork-fix in-progress)
>> > 4. Group by rework (to avoid OOMs)
>> > 5. UDF Classloader issue (requires SPARK-1053, then you can put
>> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf
>> jars)
>> >
>> > ~Aniket
>> >
>> >
>> >
>> >
>> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves 
>> wrote:
>> >
>> > I had asked a similar question on the dev mailing list a while back
>> (Jan
>> > 22nd).
>> >
>> > See the archives:
>> >
>> http://mail-arc

Re: is it possible to initiate Spark jobs from Oozie?

2014-04-10 Thread Konstantin Kudryavtsev
I believe you need to write a custom action or use the java action.
On Apr 10, 2014 12:11 AM, "Segerlind, Nathan L" <
nathan.l.segerl...@intel.com> wrote:

>  Howdy.
>
>
>
> Is it possible to initiate Spark jobs from Oozie (presumably as a java
> action)? If so, are there known limitations to this?  And would anybody
> have a pointer to an example?
>
>
>
> Thanks,
>
> Nate
>
>
>


Re: Spark output compression on HDFS

2014-04-04 Thread Konstantin Kudryavtsev
Can anybody suggest how to change the compression type (RECORD, BLOCK) for
Snappy, if it is possible at all?

thank you in advance

Thank you,
Konstantin Kudryavtsev
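
If it helps: for SequenceFiles, the record/block distinction is controlled by
the Hadoop property mapred.output.compression.type (RECORD or BLOCK), which can
be set on the SparkContext's Hadoop configuration before saving. A hedged sketch
reusing the counts/output names from this thread; whether saveAsSequenceFile
picks this up may depend on the Spark version:

import org.apache.hadoop.io.compress.SnappyCodec

// BLOCK compression usually compresses SequenceFiles better than RECORD
sc.hadoopConfiguration.set("mapred.output.compression.type", "BLOCK")

counts.saveAsSequenceFile(output, Some(classOf[SnappyCodec]))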


On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> Thanks all, it works fine now and I managed to compress output. However, I
> am still in stuck... How is it possible to set compression type for Snappy?
> I mean to set up record or block level of compression for output
>  On Apr 3, 2014 1:15 AM, "Nicholas Chammas" 
> wrote:
>
>> Thanks for pointing that out.
>>
>>
>> On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra wrote:
>>
>>> First, you shouldn't be using spark.incubator.apache.org anymore, just
>>> spark.apache.org.  Second, saveAsSequenceFile doesn't appear to exist
>>> in the Python API at this point.
>>>
>>>
>>> On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Is this a 
>>>> Scala-only<http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile>feature?
>>>>
>>>>
>>>> On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell wrote:
>>>>
>>>>> For textFile I believe we overload it and let you set a codec directly:
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>>>>>
>>>>> For saveAsSequenceFile yep, I think Mark is right, you need an option.
>>>>>
>>>>>
>>>>> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra >>>> > wrote:
>>>>>
>>>>>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>>>>>
>>>>>> The signature is 'def saveAsSequenceFile(path: String, codec:
>>>>>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a
>>>>>> Class, not an Option[Class].
>>>>>>
>>>>>> Try counts.saveAsSequenceFile(output,
>>>>>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev <
>>>>>> kudryavtsev.konstan...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>>
>>>>>>> I've started using Spark recently and evaluating possible use cases
>>>>>>> in our company.
>>>>>>>
>>>>>>> I'm trying to save RDD as compressed Sequence file. I'm able to save
>>>>>>> non-compressed file be calling:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> counts.saveAsSequenceFile(output)
>>>>>>>
>>>>>>> where counts is my RDD (IntWritable, Text). However, I didn't manage
>>>>>>> to compress output. I tried several configurations and always got 
>>>>>>> exception:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> counts.saveAsSequenceFile(output, 
>>>>>>> classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>> :21: error: type mismatch;
>>>>>>>  found   : 
>>>>>>> Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>>  required: Option[Class[_ <: 
>>>>>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>>   counts.saveAsSequenceFile(output, 
>>>>>>> classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>>
>>>>>>>  counts.saveAsSequenceFile(output, 
>>>>>>> classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>> :21: error: type mismatch;
>>>>>>>  found   : 
>>>>>>> Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>>  required: Option[Class[_ <: 
>>>>>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>>   counts.saveAsSequenceFile(output, 
>>>>>>> classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>>
>>>>>>> and it doesn't work even for Gzip:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  counts.saveAsSequenceFile(output, 
>>>>>>> classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>> :21: error: type mismatch;
>>>>>>>  found   : 
>>>>>>> Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>>  required: Option[Class[_ <: 
>>>>>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>>   counts.saveAsSequenceFile(output, 
>>>>>>> classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>>
>>>>>>> Could you please suggest solution? also, I didn't find how is it
>>>>>>> possible to specify compression parameters (i.e. compression type for
>>>>>>> Snappy). I wondered if you could share code snippets for writing/reading
>>>>>>> RDD with compression?
>>>>>>>
>>>>>>> Thank you in advance,
>>>>>>> Konstantin Kudryavtsev
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>


Re: how to save RDD partitions in different folders?

2014-04-04 Thread Konstantin Kudryavtsev
Hi Evan,

Could you please provide a code snippet? It is not clear to me: in Hadoop you
need to use the addNamedOutput method, and I'm stuck on how to use it from
Spark.

Thank you,
Konstantin Kudryavtsev
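
One pattern that avoids addNamedOutput entirely is to subclass
MultipleTextOutputFormat and derive the output sub-directory from the key.
A hedged sketch, with the key and value types assumed to be strings and the
key assumed to encode the target folder:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext._   // pair-RDD save methods

// Routes each record into a folder named after its key, e.g. part0/part-00000
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()                   // don't write the key into the file itself
}

// keyedRdd: RDD[(String, String)] where the key is the sub-folder name (assumption)
keyedRdd.saveAsHadoopFile("hdfs://.../src", classOf[String], classOf[String],
  classOf[KeyBasedOutput])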


On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks  wrote:

> Have a look at MultipleOutputs in the hadoop API. Spark can read and write
> to arbitrary hadoop formats.
>
> > On Apr 4, 2014, at 6:01 AM, dmpour23  wrote:
> >
> > Hi all,
> > Say I have an input file which I would like to partition using
> > HashPartitioner k times.
> >
> > Calling  rdd.saveAsTextFile(""hdfs://"); will save k files as part-0
> > part-k
> > Is there a way to save each partition in specific folders?
> >
> > i.e. src
> >  part0/part-0
> >  part1/part-1
> >  part1/part-k
> >
> > thanks
> > Dimitri
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark output compression on HDFS

2014-04-03 Thread Konstantin Kudryavtsev
Thanks all, it works fine now and I managed to compress the output. However, I
am still stuck... How is it possible to set the compression type for Snappy?
I mean setting record- or block-level compression for the output.
On Apr 3, 2014 1:15 AM, "Nicholas Chammas" 
wrote:

> Thanks for pointing that out.
>
>
> On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra wrote:
>
>> First, you shouldn't be using spark.incubator.apache.org anymore, just
>> spark.apache.org.  Second, saveAsSequenceFile doesn't appear to exist in
>> the Python API at this point.
>>
>>
>> On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Is this a 
>>> Scala-only<http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile>feature?
>>>
>>>
>>> On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell wrote:
>>>
>>>> For textFile I believe we overload it and let you set a codec directly:
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>>>>
>>>> For saveAsSequenceFile yep, I think Mark is right, you need an option.
>>>>
>>>>
>>>> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra 
>>>> wrote:
>>>>
>>>>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>>>>
>>>>> The signature is 'def saveAsSequenceFile(path: String, codec:
>>>>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a
>>>>> Class, not an Option[Class].
>>>>>
>>>>> Try counts.saveAsSequenceFile(output,
>>>>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev <
>>>>> kudryavtsev.konstan...@gmail.com> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>>
>>>>>> I've started using Spark recently and evaluating possible use cases
>>>>>> in our company.
>>>>>>
>>>>>> I'm trying to save RDD as compressed Sequence file. I'm able to save
>>>>>> non-compressed file be calling:
>>>>>>
>>>>>>
>>>>>>
>>>>>> counts.saveAsSequenceFile(output)
>>>>>>
>>>>>> where counts is my RDD (IntWritable, Text). However, I didn't manage
>>>>>> to compress output. I tried several configurations and always got 
>>>>>> exception:
>>>>>>
>>>>>>
>>>>>>
>>>>>> counts.saveAsSequenceFile(output, 
>>>>>> classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>> :21: error: type mismatch;
>>>>>>  found   : 
>>>>>> Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>  required: Option[Class[_ <: 
>>>>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>   counts.saveAsSequenceFile(output, 
>>>>>> classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>
>>>>>>  counts.saveAsSequenceFile(output, 
>>>>>> classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>> :21: error: type mismatch;
>>>>>>  found   : 
>>>>>> Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>  required: Option[Class[_ <: 
>>>>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>   counts.saveAsSequenceFile(output, 
>>>>>> classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>
>>>>>> and it doesn't work even for Gzip:
>>>>>>
>>>>>>
>>>>>>
>>>>>>  counts.saveAsSequenceFile(output, 
>>>>>> classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>> :21: error: type mismatch;
>>>>>>  found   : 
>>>>>> Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>  required: Option[Class[_ <: 
>>>>>> org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>   counts.saveAsSequenceFile(output, 
>>>>>> classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>
>>>>>> Could you please suggest solution? also, I didn't find how is it
>>>>>> possible to specify compression parameters (i.e. compression type for
>>>>>> Snappy). I wondered if you could share code snippets for writing/reading
>>>>>> RDD with compression?
>>>>>>
>>>>>> Thank you in advance,
>>>>>> Konstantin Kudryavtsev
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>