The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Mingyu Kim
Hi all,

Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4 
build, which is built with "mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests 
clean package", pulls in avro-mapred hadoop1, as opposed to avro-mapred 
hadoop2. This ends up in the same error as mentioned in the linked bug 
(pasted below).

The right solution would be to create a hadoop-2.0 profile that sets 
avro.mapred.classifier to hadoop2, and to build the CDH4 package with the 
"-Phadoop-2.0" option.
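A minimal sketch of what such a profile could look like in the parent POM; this assumes the avro-mapred dependency already resolves its classifier from an `avro.mapred.classifier` property (the profile id and wiring here are illustrative, not the actual patch):

```xml
<!-- Hypothetical profile: activate alongside the CDH4 hadoop.version, e.g.
     mvn -Phadoop-2.0 -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package -->
<profile>
  <id>hadoop-2.0</id>
  <properties>
    <avro.mapred.classifier>hadoop2</avro.mapred.classifier>
  </properties>
</profile>
```

The avro-mapred dependency would then declare `<classifier>${avro.mapred.classifier}</classifier>`, so the hadoop2 artifact is pulled in whenever this profile is active.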

What do people think?

Mingyu

——

java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
   at 
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
   at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:133)
   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
   at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)



OSGI bundles for spark project..

2015-02-20 Thread Niranda Perera
Hi,

I am interested in a Spark OSGI bundle.

While checking the Maven repository, I found that it has not been
implemented yet.

Can we see an OSGI bundle being released soon? Is it in the Spark Project
roadmap?

Rgds
-- 
Niranda


Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi all,
I'm running Spark 1.2.0, in standalone mode, on different cluster and
server sizes. All of my data is cached in memory.
Basically I have a mass of data, about 8 GB, with about 37k columns, and
I'm running different configs of a BinaryLogisticRegressionBFGS.
When I run Spark on 9 servers (1 master and 8 slaves), with 32 cores
each, I notice that the CPU usage varies from 20% to 50% (counting
the CPU usage of all 9 servers in the cluster).
First I tried to repartition the RDDs to the total number of client
cores (256), but that didn't help. Then I tried to set the property
spark.default.parallelism to the same number (256), but that didn't
increase the CPU usage either.
Looking at the Spark monitoring tool, I saw that some stages took 52s to
complete.
My last shot was running some tasks in parallel, but when I start
running tasks in parallel (4 tasks) the total CPU time spent to complete
them increases by about 10%, so task parallelism didn't help.
Looking at the monitoring tool I noticed that when running tasks in
parallel, the stages complete together: if I have 4 stages running in
parallel (A, B, C and D) and A, B and C finish, they wait for D before
all 4 stages are marked as completed. Is that right?
Is there any way to improve CPU usage when running on large servers?
Is spending more time when running tasks in parallel expected behaviour?

Kind Regards,
Dirceu


Re: Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Sean Owen
It sounds like your computation just isn't CPU-bound, right? Or maybe
only some stages are. It's not clear what work you are doing
beyond the core LR.

Stages don't wait on each other unless one depends on the other. You'd
have to clarify what you mean by running stages in parallel, like what
are the interdependencies.

On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho
 wrote:
> Hi all,
> I'm running Spark 1.2.0, in Stand alone mode, on different cluster and
> server sizes. All of my data is cached in memory.
> Basically I have a mass of data, about 8gb, with about 37k of columns, and
> I'm running different configs of an BinaryLogisticRegressionBFGS.
> When I put spark to run on 9 servers (1 master and 8 slaves), with 32 cores
> each. I noticed that the cpu usage was varying from 20% to 50% (counting
> the cpu usage of 9 servers in the cluster).
> First I tried to repartition the Rdds to the same number of total client
> cores (256), but that didn't help. After I've tried to change the
> property *spark.default.parallelism
> * to the same number (256) but that didn't helped to increase the cpu usage.
> Looking at the spark monitoring tool, I saw that some stages  took 52s to
> be completed.
> My last shot was trying to run some tasks in parallel, but when I start
> running tasks in parallel (4 tasks) the total cpu time spent to complete
> this has increased in about 10%, task parallelism didn't helped.
> Looking at the monitoring tool I've noticed that when running tasks in
> parallel, the stages complete together, if I have 4 stages running in
> parallel (A,B,C and D), if A, B and C finishes, they will wait for D to
> mark all this 4 stages as completed, is that right?
> Is there any way to improve the cpu usage when running on large servers?
> Spending more time when running tasks is an expected behaviour?
>
> Kind Regards,
> Dirceu

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Sean Owen
True, although a number of other little issues make me, personally,
not want to continue down this road:

- There are already a lot of build profiles to try to cover Hadoop versions
- I don't think it's quite right to have vendor-specific builds in
Spark to begin with
- We should be moving to only support Hadoop 2 soon IMHO anyway
- CDH4 is EOL in a few months I think

On Fri, Feb 20, 2015 at 8:30 AM, Mingyu Kim  wrote:
> Hi all,
>
> Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4 
> build, which is built with "mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 
> -DskipTests clean package”, pulls in avro-mapred hadoop1, as opposed to 
> avro-mapred hadoop2. This ends up in the same error as mentioned in the 
> linked bug. (pasted below).
>
> The right solution would be to create a hadoop-2.0 profile that sets 
> avro.mapred.classifier to hadoop2, and to build CDH4 build with 
> “-Phadoop-2.0” option.
>
> What do people think?
>
> Mingyu
>
> ——
>
> java.lang.IncompatibleClassChangeError: Found interface 
> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>at 
> org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:133)
>at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
>at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>at org.apache.spark.scheduler.Task.run(Task.scala:56)
>at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:745)
>




Re: Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi Sean,
I'm trying to increase the CPU usage by running logistic regression on
different datasets in parallel. They shouldn't depend on each other.
I train several logistic regression models from different column
combinations of a main dataset. I processed the combinations in a ParArray
in an attempt to increase CPU usage, but it did not help.
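An alternative to a ParArray that is sometimes easier to reason about is submitting each independent training job from its own thread (e.g. via Futures), so the scheduler can overlap them. A minimal sketch, with a hypothetical stand-in trainer since the actual MLlib call isn't shown in the thread:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Stand-in for training one model on a column subset; in the real job this
// would call the MLlib trainer on the cached dataset projected to `cols`.
// (Hypothetical helper -- the thread does not show the actual training code.)
def trainModel(cols: Seq[Int]): Double = cols.sum.toDouble

val columnCombinations = Seq(Seq(0, 1), Seq(1, 2), Seq(2, 3), Seq(0, 3))

// Submit each independent job from its own thread; jobs submitted
// concurrently can overlap on the cluster instead of queueing serially.
val futures = columnCombinations.map(c => Future(trainModel(c)))
val results = Await.result(Future.sequence(futures), 1.minute)
```

Whether this raises CPU usage still depends on each job being CPU-bound; with the default FIFO scheduler concurrent jobs may still contend, so setting spark.scheduler.mode=FAIR is worth trying as well.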



2015-02-20 8:17 GMT-02:00 Sean Owen :

> It sounds like your computation just isn't CPU bound, right? or maybe
> that only some stages are. It's not clear what work you are doing
> beyond the core LR.
>
> Stages don't wait on each other unless one depends on the other. You'd
> have to clarify what you mean by running stages in parallel, like what
> are the interdependencies.
>
> On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho
>  wrote:
> > Hi all,
> > I'm running Spark 1.2.0, in Stand alone mode, on different cluster and
> > server sizes. All of my data is cached in memory.
> > Basically I have a mass of data, about 8gb, with about 37k of columns,
> and
> > I'm running different configs of an BinaryLogisticRegressionBFGS.
> > When I put spark to run on 9 servers (1 master and 8 slaves), with 32
> cores
> > each. I noticed that the cpu usage was varying from 20% to 50% (counting
> > the cpu usage of 9 servers in the cluster).
> > First I tried to repartition the Rdds to the same number of total client
> > cores (256), but that didn't help. After I've tried to change the
> > property *spark.default.parallelism
> > * to the same number (256) but that didn't helped to increase the cpu
> usage.
> > Looking at the spark monitoring tool, I saw that some stages  took 52s to
> > be completed.
> > My last shot was trying to run some tasks in parallel, but when I start
> > running tasks in parallel (4 tasks) the total cpu time spent to complete
> > this has increased in about 10%, task parallelism didn't helped.
> > Looking at the monitoring tool I've noticed that when running tasks in
> > parallel, the stages complete together, if I have 4 stages running in
> > parallel (A,B,C and D), if A, B and C finishes, they will wait for D to
> > mark all this 4 stages as completed, is that right?
> > Is there any way to improve the cpu usage when running on large servers?
> > Spending more time when running tasks is an expected behaviour?
> >
> > Kind Regards,
> > Dirceu
>


Re: Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Sean Owen
Yes, that makes sense, but it doesn't make the jobs CPU-bound. What is
the bottleneck? The model building, or other stages? I would think you
can get the model building to be CPU-bound, unless you have chopped it
up into really small partitions. I think it's best to look further
into which stages are slow, and what they seem to be spending time on:
GC? I/O?

On Fri, Feb 20, 2015 at 12:18 PM, Dirceu Semighini Filho
 wrote:
> Hi Sean,
> I'm trying to increase the cpu usage by running logistic regression in
> different datasets in parallel. They shouldn't depend on each other.
> I train several  logistic regression models from different column
> combinations of a main dataset. I processed the combinations in a ParArray
> in an attempt to increase cpu usage but id did not help.
>
>
>
> 2015-02-20 8:17 GMT-02:00 Sean Owen :
>
>> It sounds like your computation just isn't CPU bound, right? or maybe
>> that only some stages are. It's not clear what work you are doing
>> beyond the core LR.
>>
>> Stages don't wait on each other unless one depends on the other. You'd
>> have to clarify what you mean by running stages in parallel, like what
>> are the interdependencies.
>>
>> On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho
>>  wrote:
>> > Hi all,
>> > I'm running Spark 1.2.0, in Stand alone mode, on different cluster and
>> > server sizes. All of my data is cached in memory.
>> > Basically I have a mass of data, about 8gb, with about 37k of columns,
>> > and
>> > I'm running different configs of an BinaryLogisticRegressionBFGS.
>> > When I put spark to run on 9 servers (1 master and 8 slaves), with 32
>> > cores
>> > each. I noticed that the cpu usage was varying from 20% to 50% (counting
>> > the cpu usage of 9 servers in the cluster).
>> > First I tried to repartition the Rdds to the same number of total client
>> > cores (256), but that didn't help. After I've tried to change the
>> > property *spark.default.parallelism
>> > * to the same number (256) but that didn't helped to increase the cpu
>> > usage.
>> > Looking at the spark monitoring tool, I saw that some stages  took 52s
>> > to
>> > be completed.
>> > My last shot was trying to run some tasks in parallel, but when I start
>> > running tasks in parallel (4 tasks) the total cpu time spent to complete
>> > this has increased in about 10%, task parallelism didn't helped.
>> > Looking at the monitoring tool I've noticed that when running tasks in
>> > parallel, the stages complete together, if I have 4 stages running in
>> > parallel (A,B,C and D), if A, B and C finishes, they will wait for D to
>> > mark all this 4 stages as completed, is that right?
>> > Is there any way to improve the cpu usage when running on large servers?
>> > Spending more time when running tasks is an expected behaviour?
>> >
>> > Kind Regards,
>> > Dirceu
>
>




Re: Spark SQL, Hive & Parquet data types

2015-02-20 Thread Cheng Lian
For the second question, we do plan to support Hive 0.14, possibly in 
Spark 1.4.0.


For the first question:

1. In Spark 1.2.0, the Parquet support code doesn’t support timestamp
   type, so you can’t.
2. In Spark 1.3.0, timestamp support was added, also Spark SQL uses its
   own Parquet support to handle both read path and write path when
   dealing with Parquet tables declared in Hive metastore, as long as
   you’re not writing to a partitioned table. So yes, you can.

The Parquet version bundled with Spark 1.3.0 is 1.6.0rc3, which supports 
timestamp type natively. However, the Parquet versions bundled with Hive 
0.13.1 and Hive 0.14.0 are 1.3.2 and 1.5.0 respectively. Neither of them 
supports timestamp type. Hive 0.14.0 “supports” reading/writing timestamps 
from/to Parquet by converting timestamps from/to Parquet binaries. 
Similarly, Impala converts timestamps into Parquet INT96. This can be 
annoying for Spark SQL, because we must interpret Parquet files in 
different ways according to the original writer of the file. As Parquet 
matures, recent Parquet versions support more and more standard data 
types. Mappings from complex nested types to Parquet types are also 
being standardized.


On 2/20/15 6:50 AM, The Watcher wrote:


Still trying to get my head around Spark SQL & Hive.

1) Let's assume I *only* use Spark SQL to create and insert data into HIVE
tables, declared in a Hive meta-store.

Does it matter at all whether Hive supports the data types I need with Parquet,
or is all that matters what Catalyst and Spark's Parquet relation support?

Case in point : timestamps & Parquet
* Parquet now supports them as per
https://github.com/Parquet/parquet-mr/issues/218
* Hive only supports them in 0.14
So would I be able to read/write timestamps natively in Spark 1.2? Spark
1.3?

I have found this thread
http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html
which seems to indicate that the data types supported by Hive would matter
to Spark SQL.
If so, why is that? Doesn't the read path go through Spark SQL to read the
Parquet file?

2) Is there planned support for Hive 0.14 ?

Thanks




Re: Spark SQL, Hive & Parquet data types

2015-02-20 Thread The Watcher
>
>
>1. In Spark 1.3.0, timestamp support was added, also Spark SQL uses
>its own Parquet support to handle both read path and write path when
>dealing with Parquet tables declared in Hive metastore, as long as you’re
>not writing to a partitioned table. So yes, you can.
>
Ah, I had missed the part about being partitioned or not. Is this related
to the work being done on ParquetRelation2?

We will indeed write to a partitioned table: does neither the read nor the
write path go through Spark SQL's Parquet support in that case? Is there a
JIRA/PR I can monitor to see when this will change?

Thanks


Spark 1.3 RC1 Generate schema based on string of schema

2015-02-20 Thread Denny Lee
In the Spark SQL 1.2 Programmers Guide, we can generate the schema based on
the string of schema via

val schema =
  StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName,
StringType, true)))

But when running this on Spark 1.3.0 (RC1), I get the error:

val schema =  StructType(schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))

&lt;console&gt;:26: error: not found: value StringType

   val schema =  StructType(schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))

I'm looking through the various datatypes within
org.apache.spark.sql.types.DataType
but thought I'd ask to see if I was missing something obvious here.

Thanks!
Denny


Re: Spark SQL, Hive & Parquet data types

2015-02-20 Thread yash datta
For the old Parquet path (available in 1.2.1), I made a few changes to be
able to read/write a table partitioned on a timestamp-type column:
https://github.com/apache/spark/pull/4469


On Fri, Feb 20, 2015 at 8:28 PM, The Watcher  wrote:

> >
> >
> >1. In Spark 1.3.0, timestamp support was added, also Spark SQL uses
> >its own Parquet support to handle both read path and write path when
> >dealing with Parquet tables declared in Hive metastore, as long as
> you’re
> >not writing to a partitioned table. So yes, you can.
> >
> > Ah, I had missed the part about being partitioned or not. Is this related
> to the work being done on ParquetRelation2 ?
>
> We will indeed write to a partitioned table : do neither the read nor the
> write path go through Spark SQL's parquet support in that case ? Is there a
> JIRA/PR I can monitor to see when this would change ?
>
> Thanks
>



-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.


Re: Spark 1.3 RC1 Generate schema based on string of schema

2015-02-20 Thread Denny Lee
Oh, I just realized that I never imported all of sql._. My bad!
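For anyone hitting the same error later: in 1.3 the type classes moved into the `org.apache.spark.sql.types` package, so the snippet compiles once that package is imported. A sketch (assumes spark-sql 1.3 on the classpath; the schema string is illustrative):

```scala
// The import the 1.2-era snippet is missing on Spark 1.3:
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val schemaString = "name age city"  // illustrative field names
val schema = StructType(
  schemaString.split(" ").map(fieldName =>
    StructField(fieldName, StringType, nullable = true)))
```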


On Fri Feb 20 2015 at 7:51:32 AM Denny Lee  wrote:

> In the Spark SQL 1.2 Programmers Guide, we can generate the schema based
> on the string of schema via
>
> val schema =
>   StructType(
> schemaString.split(" ").map(fieldName => StructField(fieldName,
> StringType, true)))
>
> But when running this on Spark 1.3.0 (RC1), I get the error:
>
> val schema =  StructType(schemaString.split(" ").map(fieldName =>
> StructField(fieldName, StringType, true)))
>
> &lt;console&gt;:26: error: not found: value StringType
>
>val schema =  StructType(schemaString.split(" ").map(fieldName =>
> StructField(fieldName, StringType, true)))
>
> I'm looking through the various datatypes within 
> org.apache.spark.sql.types.DataType
> but thought I'd ask to see if I was missing something obvious here.
>
> Thanks!
>
>
> Denny
>


Re: OSGI bundles for spark project..

2015-02-20 Thread Sean Owen
No, you usually run Spark apps via the spark-submit script, and the
Spark machinery is already deployed on a cluster. Although it's
possible to embed the driver and get it working that way, it's not
supported.
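For completeness, the usual launch path looks like this (the class name, master URL, and jar name are placeholders; `--class` and `--master` are standard spark-submit flags):

```shell
# Submit a pre-built application jar to an existing standalone cluster.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  my-app-assembly.jar
```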

On Fri, Feb 20, 2015 at 4:48 PM, Niranda Perera
 wrote:
> Hi Sean,
>
> does it mean that Spark is not encouraged to be embedded on other products?
>
> On Fri, Feb 20, 2015 at 3:29 PM, Sean Owen  wrote:
>>
>> I don't think an OSGI bundle makes sense for Spark. It's part JAR,
>> part lifecycle manager. Spark has its own lifecycle  management and is
>> not generally embeddable. Packaging is generally 'out of scope' for
>> the core project beyond the standard Maven and assembly releases.
>>
>> On Fri, Feb 20, 2015 at 8:33 AM, Niranda Perera
>>  wrote:
>> > Hi,
>> >
>> > I am interested in a Spark OSGI bundle.
>> >
>> > While checking the maven repository I found out that it is still not
>> > being
>> > implemented.
>> >
>> > Can we see an OSGI bundle being released soon? Is it in the Spark
>> > Project
>> > roadmap?
>> >
>> > Rgds
>> > --
>> > Niranda
>
>
>
>
> --
> Niranda




Re: OSGI bundles for spark project..

2015-02-20 Thread Niranda Perera
Hi Sean,

does it mean that Spark is not encouraged to be embedded on other products?

On Fri, Feb 20, 2015 at 3:29 PM, Sean Owen  wrote:

> I don't think an OSGI bundle makes sense for Spark. It's part JAR,
> part lifecycle manager. Spark has its own lifecycle  management and is
> not generally embeddable. Packaging is generally 'out of scope' for
> the core project beyond the standard Maven and assembly releases.
>
> On Fri, Feb 20, 2015 at 8:33 AM, Niranda Perera
>  wrote:
> > Hi,
> >
> > I am interested in a Spark OSGI bundle.
> >
> > While checking the maven repository I found out that it is still not
> being
> > implemented.
> >
> > Can we see an OSGI bundle being released soon? Is it in the Spark Project
> > roadmap?
> >
> > Rgds
> > --
> > Niranda
>



-- 
Niranda


Re: The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Mingyu Kim
Thanks for the explanation.

To be clear, I meant to speak for any Hadoop 2 release before 2.2 which
has a profile in Spark. I referred to CDH4 since that's the only Hadoop
2.0/2.1 version Spark ships a prebuilt package for.

I understand the hesitation to make a code change if Spark doesn't plan
to support Hadoop 2.0/2.1 in general. (Please note, this is not specific
to CDH4.) If so, can I propose alternative options until Spark moves to
supporting only hadoop2?

- Build the CDH4 package with "-Davro.mapred.classifier=hadoop2", and
update http://spark.apache.org/docs/latest/building-spark.html for all
"2.0.*" examples.
- Build the CDH4 package as is, but note the known issues clearly on the
"download" page.
- Simply do not ship a CDH4 prebuilt package, and let people figure it out
themselves. Preferably, note in the documentation that
"-Davro.mapred.classifier=hadoop2" should be used for all Hadoop "2.0.*"
builds.

Please let me know what you think!

Mingyu





On 2/20/15, 2:34 AM, "Sean Owen"  wrote:

>True, although a number of other little issues make me, personally,
>not want to continue down this road:
>
>- There are already a lot of build profiles to try to cover Hadoop
>versions
>- I don't think it's quite right to have vendor-specific builds in
>Spark to begin with
>- We should be moving to only support Hadoop 2 soon IMHO anyway
>- CDH4 is EOL in a few months I think
>
>On Fri, Feb 20, 2015 at 8:30 AM, Mingyu Kim  wrote:
>> Hi all,
>>
>> Related to https://issues.apache.org/jira/browse/SPARK-3039, the default
>>CDH4 build, which is built with "mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0
>>-DskipTests clean package", pulls in avro-mapred hadoop1, as opposed to
>>avro-mapred hadoop2. This ends up in the same error as mentioned in the
>>linked bug (pasted below).
>>
>> The right solution would be to create a hadoop-2.0 profile that sets
>>avro.mapred.classifier to hadoop2, and to build the CDH4 package with the
>>"-Phadoop-2.0" option.
>>
>> What do people think?
>>
>> Mingyu
>>
>> ——
>>
>> java.lang.IncompatibleClassChangeError: Found interface
>>org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>>at 
>>org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyIn
>>putFormat.java:47)
>>at 
>>org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:133)
>>at 
>>org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
>>at 
>>org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
>>at 
>>org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>at 
>>org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>at 
>>org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>>at 
>>org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>at 
>>org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>at 
>>org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>>at 
>>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java
>>:1145)
>>at 
>>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.jav
>>a:615)
>>at java.lang.Thread.run(Thread.java:745)
>>





Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-20 Thread Tom Graves
Trying to run pyspark on yarn in client mode with the basic wordcount
example, I see the following error when doing the collect:

Error from python worker:
  /usr/bin/python: No module named sql
PYTHONPATH was:
  /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jar
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)

Any ideas on this?
Tom

On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell wrote:

 Please vote on releasing the following candidate as Apache Spark version 1.3.0!

The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1069/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.3.0!

The vote is open until Saturday, February 21, at 08:03 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by
taking a Spark 1.2 workload and running on this release candidate,
then reporting any regressions.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.3 QA period,
so -1 votes should only occur for significant regressions from 1.2.1.
Bugs already present in 1.2.X, minor regressions, or bugs related
to new features will not block this release.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org