Well, I don't know about Postgres, but you can limit the number of columns
and rows fetched via JDBC at the source rather than loading and filtering them
in Spark:
val c = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM
On Tue, May 10, 2016 at 2:24 PM, Abi wrote:
> 1. How come pyspark does not provide the localvalue function like Scala?
>
> 2. Why is pyspark more restrictive than Scala?
On Tue, May 10, 2016 at 2:20 PM, Abi wrote:
> Is there any example of this? I want to see how you write the
> iterable example
I am trying to load some rows from a big SQL table. Here is my code:
===
jdbcDF = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://...",
    dbtable="mytable",
    partitionColumn="t",
    lowerBound=1451577600,
    upperBound=1454256000,
    numPartitions=1).load()
Hi All,
I am trying to get spark 1.4.1 (Java) work with Kafka 0.8.2 in Kerberos enabled
cluster. HDP 2.3.2
Is there any document I can refer to?
Thanks,
Pradeep
Folks,
I was curious to find out if anybody has ever considered/attempted to support
golang with Spark.
-Thanks
Sourav
def kernel(arg):
    input = broadcast_var.value + 1
    # some processing with input

def foo():
    broadcast_var = sc.broadcast(var)
    rdd.foreach(kernel)

def main():
    # something
In this code, I get the following error:
NameError: global name 'broadcast_var' is not defined
Thank you. Looking at the source code helped :)
I set spark.testing.memory to 512 MB and it worked :)
private def getMaxMemory(conf: SparkConf): Long = {
  val systemMemory = conf.getLong("spark.testing.memory",
    Runtime.getRuntime.maxMemory)
  val reservedMemory =
Hi Ted
It seems really strange. It seems like in the version where I used 2 data
frames, Spark added "as(tag)". (Which is really nice.)
Odd that I got different behavior.
Is this a bug?
Kind regards
Andy
From: Ted Yu
Date: Friday, May 13, 2016 at 12:38 PM
To:
Basically, I want to run the following query:
select 'a\'b', case(null as Array)
However, neither HiveContext nor SQLContext can execute it without an
exception.
I have tried
sql(select 'a\'b', case(null as Array))
and
df.selectExpr("'a\'b'", "case(null as Array)")
Neither of them works.
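One possible factor (not confirmed in the thread): in a Scala string literal, \' is
just a plain single quote, so the SQL text Spark actually receives is
select 'a'b', which cannot parse. A minimal sketch that keeps the backslash in the
SQL text, assuming a HiveContext whose parser accepts backslash escapes in string
literals:

// In the Scala literal "'a\\'b'" the backslash survives, so Spark SQL sees 'a\'b'.
val df = sqlContext.sql("select 'a\\'b' AS s")
df.show()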
Here is related code:
val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
if (executorMemory < minSystemMemory) {
  throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
On Fri, May 13, 2016 at 12:47 PM, satish saley
Hi,
I have my application jar sitting in HDFS; it defines a long-running Spark
Streaming job, and I am using a checkpoint dir that is also in HDFS. Every time
I make changes to the job, I delete that jar and upload a new one.
Now, if I upload a new jar and delete the checkpoint directory, it works fine.
Hello,
I am running
https://github.com/apache/spark/blob/branch-1.6/examples/src/main/python/pi.py
example,
but I am facing the following exception.
What is the unit of memory pointed out in the error?
The configs are as follows:
--master
local[*]
Ok, so that worked flawlessly after I upped the number of partitions to 400
from 40.
Thanks!
On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung
wrote:
> I'll try that, as of now I have a small number of partitions in the order
> of 20~40.
>
> It would be great if
In the structure shown, tag is under element.
I wonder if that was a factor.
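If that is the case, a minimal sketch of addressing the nested field with dot
notation (the schema below is an assumption for illustration, not the actual JSON):

// Hypothetical schema in which tag sits under element.
val df = sqlContext.read.json(sc.parallelize(Seq("""{"element": {"tag": "x"}}""")))
// Refer to the nested field as element.tag rather than the bare name tag.
df.select(df("element.tag").as("tag")).show()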
On Fri, May 13, 2016 at 11:49 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:
> I am using spark-1.6.1.
>
> I create a data frame from a very complicated JSON file. I would assume
> that the query planner would
I'll try that, as of now I have a small number of partitions in the order
of 20~40.
It would be great if there were some documentation on the memory requirement
with respect to the number of keys and the number of partitions per executor (i.e.,
Spark's internal memory requirement outside of the user space).
Have you taken a look at SPARK-11293 ?
Consider using repartition to increase the number of partitions.
FYI
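A minimal sketch of that suggestion for a reduceByKey job (the pair RDD and the
partition count of 400 are illustrative):

import org.apache.spark.rdd.RDD

// Stand-in for the real pair RDD with tens of millions of keys.
val pairs: RDD[(Long, Int)] = sc.parallelize(0L until 10000000L).map(k => (k % 75000, 1))

// Increase parallelism so each reduce task handles fewer keys:
// either repartition before the shuffle ...
val counts = pairs.repartition(400).reduceByKey(_ + _)
// ... or pass the partition count directly to reduceByKey.
val counts2 = pairs.reduceByKey(_ + _, 400)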
On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung
wrote:
> Hello,
>
> I'm using Spark version 1.6.0 and have trouble with memory when trying to
> do
Hello,
I'm using Spark version 1.6.0 and have trouble with memory when trying to
do reduceByKey on a dataset with as many as 75 million keys; i.e., I get the
following exception when I run the task.
There are 20 workers in the cluster. It is running under the standalone
mode with 12 GB assigned
Hi,
The problem is that every time a job fails or performs poorly at certain stages, you
need to study your data distribution just before THAT stage. An overall look
at the input data set doesn't help very much if you have so many transformations
going on in the DAG. I always end up writing complicated typed code to run
I am using spark-1.6.1.
I create a data frame from a very complicated JSON file. I would assume that the
query planner would treat both versions of my transformation chains the same
way.
// org.apache.spark.sql.AnalysisException: Cannot resolve column name "tag"
among (actor, body, generator, pip,
On 5/13/2016 10:39 AM, Anthony May wrote:
It looks like it might only be available via REST,
http://spark.apache.org/docs/latest/monitoring.html#rest-api
Nice, thanks!
On Fri, 13 May 2016 at 11:24 Dood@ODDO wrote:
On 5/13/2016
It looks like it might only be available via REST,
http://spark.apache.org/docs/latest/monitoring.html#rest-api
On Fri, 13 May 2016 at 11:24 Dood@ODDO wrote:
> On 5/13/2016 10:16 AM, Anthony May wrote:
> >
>
On 5/13/2016 10:16 AM, Anthony May wrote:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker
Might be useful
How do you use it? You cannot instantiate the class - is the constructor
private? Thanks!
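For what it's worth, a SparkContext exposes a ready-made instance, so you never
construct it yourself. A minimal sketch of polling it (output format is
illustrative):

val tracker = sc.statusTracker
for (jobId <- tracker.getActiveJobIds) {
  for (jobInfo <- tracker.getJobInfo(jobId); stageId <- jobInfo.stageIds) {
    tracker.getStageInfo(stageId).foreach { s =>
      println(s"job $jobId stage $stageId: ${s.numCompletedTasks}/${s.numTasks} tasks done")
    }
  }
}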
On Fri, 13 May 2016 at 11:11 Ted Yu
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker
Might be useful
On Fri, 13 May 2016 at 11:11 Ted Yu wrote:
> Have you looked
> at core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ?
>
> Cheers
>
> On Fri,
Have you looked
at core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ?
Cheers
On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO wrote:
> I provide a RESTful API interface from scalatra for launching Spark jobs -
> part of the functionality is tracking these
I provide a RESTful API interface from Scalatra for launching Spark jobs
- part of the functionality is tracking these jobs. What API is
available to track the progress of a particular Spark application? How
about estimating how far along in its total progress the job is?
Thanks!
The pandas dataframe is broadcast successfully, but I am getting errors in the
function called kernel that runs on the worker nodes.
Code:
dataframe_broadcast = sc.broadcast(dataframe)

def kernel():
    df_v = dataframe_broadcast.value
Error:
I get this error when I try accessing the value member of the broadcast
variable.
I am running my cluster on Ubuntu 14.04
Regards
Prateek
Thank you. That's exactly what I was looking for.
Regards
Prashant
On Fri, May 13, 2016 at 9:38 PM, Xinh Huynh wrote:
> Hi Prashant,
>
> You can create struct columns using the struct() function in
> org.apache.spark.sql.functions --
>
>
Is it possible to come up with a code snippet which reproduces the following?
Thanks
On Fri, May 13, 2016 at 8:13 AM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:
> I am able to run my application after I compiled Spark source in the
> following way
>
> ./dev/change-scala-version.sh
Hi Prashant,
You can create struct columns using the struct() function in
org.apache.spark.sql.functions --
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
val df = sc.parallelize(List(("a", "b", "c"))).toDF("A", "B", "C")
import
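Continuing that example, a minimal sketch of wrapping flat columns into a nested
struct column (the grouping into "AB" is just an illustration):

import sqlContext.implicits._
import org.apache.spark.sql.functions.struct

val df = sc.parallelize(List(("a", "b", "c"))).toDF("A", "B", "C")
// Wrap columns A and B into a single nested struct column named "AB".
val nested = df.select(struct(df("A"), df("B")).as("AB"), df("C"))
nested.printSchema()  // AB is now a struct with fields A and B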
Hi Cyril,
In the case where there are no documents, it looks like there is a typo in
"addresses" (check the number of "d"s):
| scala> df.select(explode(df("addresses.id")).as("aid"), df("id")) <==
addresses
| org.apache.spark.sql.AnalysisException: Cannot resolve column name "id"
among
I am able to run my application after I compiled Spark source in the
following way
./dev/change-scala-version.sh 2.11
./dev/make-distribution.sh --name spark-2.0.0-snapshot-bin-hadoop2.6 --tgz
-Phadoop-2.6 -DskipTests
But while the application is running I get the following exception, which I
Makes sense. thank you cody.
Regards,
Chandan
On Fri, May 13, 2016 at 8:10 PM, Cody Koeninger wrote:
> No, I wouldn't expect it to, once the stream is defined (at least for
> the direct stream integration for kafka 0.8), the topicpartitions are
> fixed.
>
> My answer to any
I'm trying to save a table using this code in pyspark with 1.6.1:
prices = sqlContext.sql("SELECT AVG(amount) AS mean_price, country FROM src
GROUP BY country")
prices.collect()
prices.write.saveAsTable('prices', format='parquet', mode='overwrite',
path='/mnt/bigdisk/tables')
but I'm getting
No, I wouldn't expect it to, once the stream is defined (at least for
the direct stream integration for kafka 0.8), the topicpartitions are
fixed.
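For context, a minimal sketch of the 0.8 direct stream definition (broker and topic
names are illustrative); the topic set, and hence the TopicAndPartitions resolved
from it, is captured at this point and reused on a checkpoint restore:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
// The set of topics (and the partitions resolved from it) is fixed here, when the
// stream is defined.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))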
My answer to any question about "but what if checkpoints don't let me
do this" is always going to be "well, don't rely on checkpoints."
If you want
I do not know Postgres, but that sounds like a system table, much like Oracle's
v$instance. Why should running a Hive schema script against a Hive schema/DB in
Postgres impact the system schema?
Mine is Oracle:
s...@mydb12.mich.LOCAL> SELECT version FROM v$instance;
VERSION
-
12.1.0.2.0
On 5/12/2016 10:01 PM, Holden Karau wrote:
This is not the expected behavior, can you maybe post
the code where you are running into this?
Hello, thanks for replying!
Below is the function I took out from the code.
def converter(rdd:
Hello,
I am trying to write a basic analyzer, by extending the catalyst analyzer
with a few extra rules.
I am getting the following error:
*""" trait CatalystConf in package catalyst cannot be accessed in package
org.apache.spark.sql.catalyst """*
In my attempt I am doing the following:
class
Taking it to a more basic level, I compared a simple transformation
with RDDs and with Datasets. This is far simpler than Renato's use case, and
it brings up two good questions:
1. Is the time it takes to "spin up" a standalone instance of Spark(SQL) just
an additional one-time overhead
I corrected the type to RDD, but it's still giving
me the error.
I believe I have found the reason though. The vals variable is created
using the map procedure on some other RDD. Although it is declared as a
JavaRDD, the classTag it returns is Object. I think
that because
Hi
Let's say I have a flat dataframe with 6 columns like:
{
  "a": "somevalue",
  "b": "somevalue",
  "c": "somevalue",
  "d": "somevalue",
  "e": "somevalue",
  "f": "somevalue"
}
Now I want to convert this dataframe to contain nested columns like:
{
  "nested_obj1": {
    "a": "somevalue",
    "b": "somevalue"
  },
Mayank,
Assuming ANOVA is not present in MLlib, can you not use the ANOVA from SparkR? I
am enquiring, not making a factual statement.
Thanks
On May 13, 2016, at 15:54, mayankshete wrote:
> Is ANOVA present in Spark Mllib if not then, when will be this feature be
>
The Java docs won't help since they only show "Object", yes. Have a
look at the Scala docs:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions
An RDD of T produces an RDD of T[].
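A minimal sketch in Scala of what that looks like (window size and input values are
illustrative):

import org.apache.spark.mllib.rdd.RDDFunctions
import org.apache.spark.rdd.RDD

val values: RDD[Double] = sc.parallelize((1 to 100).map(_.toDouble))
// sliding(20, 1): each element of the result is one Array of 20 consecutive values.
val windows: RDD[Array[Double]] = RDDFunctions.fromRDD(values).sliding(20, 1)
windows.take(2).foreach(w => println(w.mkString(", ")))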
On Fri, May 13, 2016 at 12:10 PM, Tom Godden
I assumed the "fixed size blocks" mentioned in the documentation
(https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/mllib/rdd/RDDFunctions.html#sliding%28int,%20int%29)
were RDDs, but I guess they're arrays? Even when I change the RDD to
arrays (so it looks like RDD), it
Have you used awaitTermination() on your ssc? --> Yes, I have used that.
Also try setting the deployment mode to yarn-client. --> Is this not
supported in yarn-cluster mode? I am trying to find the root cause for
yarn-cluster mode.
Have you tested graceful shutdown in yarn-cluster mode?
On Fri, May
I'm not sure what you're trying there. The return type is an RDD of
arrays, not of RDDs or of ArrayLists. There may be another catch but
that is not it.
On Fri, May 13, 2016 at 11:50 AM, Tom Godden wrote:
> I believe it's an illegal cast. This is the line of code:
>>
I believe it's an illegal cast. This is the line of code:
> RDD> windowed =
> RDDFunctions.fromRDD(vals.rdd(), vals.classTag()).sliding(20, 1);
with vals being a JavaRDD. Explicitly casting
doesn't work either:
> RDD> windowed = (RDD>)
>
Thank you for the response.
I used the following command to build from source
build/mvn -Dhadoop.version=2.6.4 -Phadoop-2.6 -DskipTests clean package
Would this put the required jars in .ivy2 during the build process? If
so, how can I make the Spark distribution runnable, so that I can use
Is ANOVA present in Spark MLlib? If not, when will this feature be
available in Spark?
On 12 May 2016, at 18:35, Aaron Jackson wrote:
I'm using Spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's in
S3. I've done this previously with Spark 1.5 with no issue. Attempting to
load and count a single file as follows:
Follow-up question:
If Spark Streaming is using checkpointing (/tmp/checkpointDir) for
at-least-once semantics and the number of topics and/or partitions has increased,
will gracefully shutting down and restarting from the checkpoint pick up the
new topics and/or partitions?
If the answer is NO
No, I have set master to yarn-cluster.
When SparkPi is running, the result of free -t is as follows:
[running]mqq@10.205.3.29:/data/home/hive/conf$ free -t
total used free shared buffers cached
Mem: 32740732 32105684 635048 0 683332
Hi all,
I use PostgreSQL to store the hive metadata.
First, I imported a sql script to metastore database as follows:
psql -U postgres -d metastore -h 192.168.50.30 -f
hive-schema-1.2.0.postgres.sql
Then, when I started $SPARK_HOME/bin/spark-sql, the PostgreSQL gave the
following
The problem is there's no Java-friendly version of this, and the Scala
API return type actually has no analog in Java (an array of any type,
not just of objects) so it becomes Object. You can just cast it to the
type you know it will be -- RDD or RDD or whatever.
On Fri, May 13,
Hello,
We're trying to use PrefixSpan on sequential data, by passing a sliding
window over it. Spark Streaming is not an option.
RDDFunctions.sliding() returns an item of class RDD,
regardless of the original type of the RDD. Because of this, the
returned item seems to be pretty much worthless.
Spill-overs are a common issue for in-memory computing systems; after all,
memory is limited. In Spark, where RDDs are immutable, if an RDD is created
with a size greater than half of a node's RAM, then a transformation generating
the consequent RDD' can potentially fill all the node's memory, which can
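As one general mitigation for cached data (a standard Spark facility, not something
proposed in the message above), an RDD can be persisted with a storage level that
spills to local disk; a minimal sketch:

import org.apache.spark.storage.StorageLevel

val bigRdd = sc.parallelize(0 until 100000000)  // stand-in for a large RDD
// Partitions that do not fit in memory are written to local disk instead of being
// dropped and recomputed.
bigRdd.persist(StorageLevel.MEMORY_AND_DISK)
bigRdd.count()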
Rakesh
Have you used awaitTermination() on your ssc?
If not, add this and see if it changes the behavior.
I am guessing this issue may be related to the YARN deployment mode.
Also try setting the deployment mode to yarn-client.
Thanks
Deepak
On Fri, May 13, 2016 at 10:17 AM, Rakesh H (Marketing