Trying to run PySpark on YARN in client mode with the basic wordcount
example, I see the following error when doing the collect:

Error from python worker: /usr/bin/python: No module named sql
PYTHONPATH was:
/grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411
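For what it's worth, a stdlib-only sketch of the mechanism involved: the worker imports the pyspark package straight out of the assembly jar via zipimport, so "No module named sql" usually means the jar on PYTHONPATH is missing those entries or is unreadable as a zip from Python's side. The jar name and module contents below are stand-ins, not the real assembly.

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny stand-in "assembly" jar (hypothetical name) containing a
# pyspark.sql package, then import it the way the worker does when
# PYTHONPATH points at the assembly jar.
tmpdir = tempfile.mkdtemp()
jar = os.path.join(tmpdir, "fake-spark-assembly.jar")
with zipfile.ZipFile(jar, "w") as zf:
    zf.writestr("pyspark/__init__.py", "")
    zf.writestr("pyspark/sql/__init__.py", "VERSION = 'stub'\n")

sys.path.insert(0, jar)  # equivalent to putting the jar on PYTHONPATH
import pyspark.sql       # would raise ImportError if the entry were absent

print(pyspark.sql.VERSION)  # prints: stub
```

If the real jar's pyspark/sql entries are present but the import still fails, the jar format itself (rather than the path) is the thing to check.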
Thanks for the explanation.
To be clear, I meant to speak for any hadoop 2 releases before 2.2, which
have profiles in Spark. I referred to CDH4, since that's the only Hadoop
2.0/2.1 version Spark ships a prebuilt package for.
I understand the hesitation of making a code change if Spark doesn't p
Hi Sean,
does it mean that Spark is not encouraged to be embedded in other products?
On Fri, Feb 20, 2015 at 3:29 PM, Sean Owen wrote:
> I don't think an OSGi bundle makes sense for Spark. It's part JAR,
> part lifecycle manager. Spark has its own lifecycle management and is
> not generally embedded.
No, you usually run Spark apps via the spark-submit script, and the
Spark machinery is already deployed on a cluster. Although it's
possible to embed the driver and get it working that way, it's not
supported.
On Fri, Feb 20, 2015 at 4:48 PM, Niranda Perera
wrote:
> Hi Sean,
>
> does it mean that Spark is not encouraged to be embedded in other products?
Oh, I just realized that I never imported all of sql._ . My bad!
On Fri Feb 20 2015 at 7:51:32 AM Denny Lee wrote:
> In the Spark SQL 1.2 Programmers Guide, we can generate the schema based
> on the string of schema via
>
> val schema =
> StructType(
> schemaString.split(" ").map(fieldName => StructField(fieldName,
> StringType, true)))
For the old Parquet path (available in 1.2.1), I made a few changes to
allow reading/writing a table partitioned on a timestamp-type column:
https://github.com/apache/spark/pull/4469
On Fri, Feb 20, 2015 at 8:28 PM, The Watcher wrote:
> >
> >
> >1. In Spark 1.3.0, timestamp support was added
In the Spark SQL 1.2 Programmers Guide, we can generate the schema based on
the string of schema via
val schema =
  StructType(
    schemaString.split(" ").map(fieldName =>
      StructField(fieldName, StringType, true)))
But when running this on Spark 1.3.0 (RC1), I get the error:
val schema = Struc
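The quoted pattern (one nullable string field per whitespace-separated token) can be mimicked with a stdlib-only Python sketch; StructField here is a stand-in namedtuple, not the real Spark SQL class:

```python
from collections import namedtuple

# Stand-in for Spark SQL's StructField: this only mirrors the shape of the
# Scala snippet above (name -> nullable string field), not the real API.
StructField = namedtuple("StructField", ["name", "dataType", "nullable"])

def schema_from_string(schema_string):
    # One whitespace-separated token per field, all typed as "string".
    return [StructField(name, "string", True)
            for name in schema_string.split(" ")]

schema = schema_from_string("author title year")
print(schema[0])  # StructField(name='author', dataType='string', nullable=True)
```

The real PySpark equivalent builds a StructType from the same list comprehension over pyspark.sql.types.StructField.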
>
>
>1. In Spark 1.3.0, timestamp support was added, also Spark SQL uses
>its own Parquet support to handle both read path and write path when
>dealing with Parquet tables declared in Hive metastore, as long as you’re
>not writing to a partitioned table. So yes, you can.
>
> Ah, I h
For the second question, we do plan to support Hive 0.14, possibly in
Spark 1.4.0.
For the first question:
1. In Spark 1.2.0, the Parquet support code doesn’t support timestamp
type, so you can’t.
2. In Spark 1.3.0, timestamp support was added; also, Spark SQL uses its
own Parquet support to handle both read path and write path when dealing
with Parquet tables declared in Hive metastore, as long as you're not
writing to a partitioned table. So yes, you can.
Yes, that makes sense, but it doesn't make the jobs CPU-bound. What is
the bottleneck: the model building, or other stages? I would think you
can get the model building to be CPU-bound, unless you have chopped it
up into really small partitions. I think it's best to look further
into what stages are
Hi Sean,
I'm trying to increase CPU usage by running logistic regression on
different datasets in parallel. They shouldn't depend on each other.
I train several logistic regression models from different column
combinations of a main dataset. I processed the combinations in a ParArray
in an att
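The setup described above (independent models, one per column combination, submitted concurrently a la Scala's ParArray) can be sketched in plain Python; train_one is a hypothetical stand-in for the blocking call that fits one model:

```python
from concurrent.futures import ThreadPoolExecutor

def train_one(columns):
    # Stand-in for fitting one logistic regression on a column combination;
    # in real Spark this call would submit a job and block until it finishes.
    return {"columns": columns, "model": "fitted(%d cols)" % len(columns)}

combinations = [("a",), ("a", "b"), ("b", "c"), ("a", "b", "c")]

# Each submission is independent, so the jobs can run concurrently. This is
# the same idea as ParArray: parallelism across models, not within one model.
with ThreadPoolExecutor(max_workers=4) as pool:
    models = list(pool.map(train_one, combinations))

print(len(models))  # prints: 4
```

With real Spark jobs in flight concurrently, whether they actually overlap on the cluster depends on the scheduler; the fair scheduler is usually suggested for this pattern.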
True, although a number of other little issues make me, personally,
not want to continue down this road:
- There are already a lot of build profiles to try to cover Hadoop versions
- I don't think it's quite right to have vendor-specific builds in
Spark to begin with
- We should be moving to only
It sounds like your computation just isn't CPU-bound, right? Or maybe
only some stages are. It's not clear what work you are doing
beyond the core LR.
Stages don't wait on each other unless one depends on the other. You'd
have to clarify what you mean by running stages in parallel, like what
Hi all,
I'm running Spark 1.2.0 in standalone mode, on clusters and servers of
different sizes. All of my data is cached in memory.
Basically I have a mass of data, about 8 GB with about 37k columns, and
I'm running different configs of a BinaryLogisticRegressionBFGS.
When I put spark to run on 9
Hi,
I am interested in a Spark OSGi bundle.
While checking the Maven repository, I found that one has not been
published yet.
Can we expect an OSGi bundle to be released soon? Is it on the Spark
project roadmap?
Rgds
--
Niranda
Hi all,
Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4
build, which is built with "mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests
clean package", pulls in avro-mapred hadoop1, as opposed to avro-mapred
hadoop2. This ends up with the same error as mentioned in t
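If it helps, one possible workaround (an assumption on my side, not tested against this particular build) is to pin avro-mapred to its hadoop2 classifier explicitly in the POM; the version below is a guess at what that Spark release used:

```xml
<!-- avro-mapred is published under two classifiers; "hadoop2" selects the
     new-API (mapreduce) build instead of the default hadoop1 artifact. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.6</version>
  <classifier>hadoop2</classifier>
</dependency>
```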