Use mapToPair; at that point you can break out the key.
On 8 June 2015 at 23:27, Bill Q bill.q@gmail.com wrote:
Hi,
I have a rdd with the following structure:
row1: key: Seq[a, b]; value: value 1
row2: key: Seq[a, c, f]; value: value 2
Is there an efficient way to flatten the rows into the following?
row1: key:
I think it works in Python
```
df = sqlContext.createDataFrame([(1, {'a': 1})])
df.printSchema()
root
 |-- _1: long (nullable = true)
 |-- _2: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
df.select(df._2.getField('a')).show()
+-----+
|_2[a]|
+-----+
```
It turns out there is a bug in the code which makes an infinite loop some
time after start. :)
Sorry for the delay.
The files (called .bed files) have format like -
Chromosome  start   end     feature  score  strand
chr1        713776  714375  peak.1   599    +
chr1        752401  753000  peak.2   599    +
The mandatory fields are
1. chrom - The name of the chromosome (e.g. chr3, chrY,
Hi Roni,
We have a full suite of genomic feature parsers that can read BED, narrowPeak,
GATK interval lists, and GTF/GFF into Spark RDDs in ADAM. Additionally, we have
support for efficient overlap joins (query 3 in your email below). You can load
the genomic features with
Hi,
I have a rdd with the following structure:
row1: key: Seq[a, b]; value: value 1
row2: key: Seq[a, c, f]; value: value 2
Is there an efficient way to flatten the rows into the following?
row1: key: a; value: value1
row2: key: a; value: value2
row3: key: b; value: value1
row4: key: c; value: value2
row5:
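One way to do this, as a minimal sketch: flatMap emits one (key, value) pair per element of the Seq. The sample data below mirrors the structure above and assumes sc is the usual shell SparkContext.

```scala
// hypothetical input matching the structure described above
val rdd = sc.parallelize(Seq(
  (Seq("a", "b"), "value 1"),
  (Seq("a", "c", "f"), "value 2")))

// flatMap breaks each Seq key into one row per element
val flattened = rdd.flatMap { case (keys, value) => keys.map(k => (k, value)) }
// result: (a,value 1), (b,value 1), (a,value 2), (c,value 2), (f,value 2)
```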
Hello,
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
would Feature Extraction and Transformation work in a streaming context?
I want to extract text features and build K-means clusters in a streaming
context to detect anomalies on a continuous text stream.
Would it be possible?
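It should be possible in principle: MLlib has a StreamingKMeans estimator, and HashingTF can be applied per batch. A rough sketch follows (Spark 1.3-era APIs; the socket source, feature size, and k are assumptions, not a tested pipeline):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical text source

// turn each text line into a term-frequency vector
val tf = new HashingTF(1000)                           // 1000 features, illustrative
val vectors = lines.map(line => tf.transform(line.split(" ").toSeq))

val model = new StreamingKMeans()
  .setK(5)
  .setDecayFactor(1.0)
  .setRandomCenters(1000, 0.0)

model.trainOn(vectors)              // update cluster centers on every batch
model.predictOn(vectors).print()    // cluster assignments; outliers could flag anomalies

ssc.start()
ssc.awaitTermination()
```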
It appears this may be related.
https://issues.apache.org/jira/browse/SPARK-1403
Granted, the NPE is in MapR's code, but having Spark switch its behavior (if
that's what it is doing; seemingly so, I am not an expert here, just basing it
off the comments) probably isn't good either. I guess the level
Hi,
The problem I am looking at is as follows:
- I read in a log file of multiple users as an RDD
- I'd like to group the above RDD into *multiple RDDs* by userId (the key)
- my processEachUser() function then takes in the RDD mapped to each
individual user, and calls RDD.map or
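There is no built-in way to split one RDD into many RDDs by key; a common alternative, sketched below with hypothetical parsing and a processEachUser that takes an Iterable rather than an RDD, is groupByKey (or one filter per userId when there are only a handful of users):

```scala
// hypothetical log parsing: each line becomes (userId, logLine)
val byUser = sc.textFile("/logs/users.log")
  .map(line => (line.split(",")(0), line))

// Option 1: one group per user, processed as a local Iterable on an executor
val processed = byUser.groupByKey().mapValues(events => processEachUser(events))

// Option 2: a separate RDD per user (only sensible for a small, known set of ids)
val userIds = Seq("u1", "u2")
val perUserRdds = userIds.map(id => byUser.filter(_._1 == id).values)
```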
Hi Snehal
Are you running the latest Kafka consumer from github/spark-packages? If
not, can you take the latest changes? This low-level receiver will keep
retrying if the underlying BlockManager gives an error. Do you see
those retry cycles in the log? If yes, then there is an issue writing
If I read the code correctly, in RDD.scala, each RDD keeps track of its own
dependencies (from Dependency.scala), and has methods to access its
/ancestors'/ dependencies, thus being able to recompute the lineage (see
getNarrowAncestors() or getDependencies() in some RDD like UnionRDD).
So it
All,
I am using the Kafka Spark Consumer
https://github.com/dibbhatt/kafka-spark-consumer in a Spark Streaming job.
After the Spark Streaming job runs for a few hours, all executors exit and I
still see the status of the application on the Spark UI as running.
Does anyone know the cause of this exception and how to fix
Event log is enabled in my spark streaming app. My code runs in standalone mode
and the spark version is 1.3.1. I periodically stop and restart the streaming
context by calling ssc.stop(). However, from the web UI, when clicking on a
past job, it says the job is still in progress and does not
Hello,
Running the Spark examples fails on one machine, but succeeds in a Virtual
Machine with the exact same Spark/Java version installed.
The weird part is that it fails on the physical machine but runs successfully on the VM.
Did anyone face the same problem? Any solution tips?
Thanks in advance.
*Spark version*:
Could someone help explain what happens that leads to the Task not serializable
issue?
Thanks.
bit1...@163.com
From: bit1...@163.com
Date: 2015-06-08 19:08
To: user
Subject: Wired Problem: Task not serializable[Spark Streaming]
Hi,
With the following simple code, I got an exception that
Note that in Scala, return is a non-local return:
https://tpolecat.github.io/2014/05/09/return.html
So that return is *NOT* returning from the anonymous function, but attempting
to return from the enclosing method, i.e., main. Which is running on the
driver, not on the workers. So on the workers,
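A minimal sketch of the pitfall (names and the local master are illustrative only):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReturnPitfall {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("return-pitfall").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10)

    // Problematic: `return` here attempts a non-local return from main(), which
    // only exists on the driver; inside a task it does not do what it looks like
    // and typically fails (reported in this thread as "Task not serializable").
    rdd.foreach { n => if (n > 5) return }

    // Better: express the logic as a value instead of a control jump.
    val capped = rdd.map(n => if (n > 5) 0 else n)
    capped.collect()
    sc.stop()
  }
}
```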
Thanks for the help!
I am actually trying Spark SQL to run queries against tables that I've
defined in Hive.
I follow these steps:
- I start hiveserver2 and in Spark, I start Spark's Thrift server by:
$SPARK_HOME/sbin/start-thriftserver.sh --master
spark://spark-master-node-ip:7077
- and I
On 6/9/15 8:42 AM, James Pirz wrote:
Thanks for the help!
I am actually trying Spark SQL to run queries against tables that I've
defined in Hive.
I follow these steps:
- I start hiveserver2 and in Spark, I start Spark's Thrift server by:
$SPARK_HOME/sbin/start-thriftserver.sh --master
Seems to be related to this JIRA :
https://issues.apache.org/jira/browse/SPARK-3612 ?
On Tue, Jun 9, 2015 at 7:39 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi Snehal
Are you running the latest kafka consumer from github/spark-packages ? If
not can you take the latest
Hi,
I am trying to run a SQL query from a JDBC driver using Spark's Thrift Server.
I'm doing a join between a Hive table of size around 100 GB and another
Hive table of 10 KB, with a filter on a particular column.
The query takes more than 45 minutes and then I get ExecutorLostFailure.
That is
Hi Dear Spark Users
I am very new to Spark/Scala.
I am using DataStax (4.7/Spark 1.2.1) and struggling with the following
error/issue.
Already tried options like import org.apache.spark.SparkContext._ or
explicit import org.apache.spark.SparkContext.rddToPairRDDFunctions.
But the error is not resolved.
join is an operation on DataFrame.
You can call sc.createDataFrame(myRDD) to obtain a DataFrame, where sc is the
sqlContext.
Cheers
On Mon, Jun 8, 2015 at 9:44 PM, amit tewari amittewar...@gmail.com wrote:
Hi Dear Spark Users
I am very new to Spark/Scala.
Am using Datastax (4.7/Spark 1.2.1) and
Maybe someone has asked this question before. I have this compilation issue
when compiling Spark SQL. I found a couple of posts on Stack Overflow, but
they didn't work for me. Does anyone have experience with this? Thanks.
http://stackoverflow.com/questions/26788367/quasiquotes-in-intellij-14
Correct; and PairRDDFunctions#join does still exist in versions of Spark
that do have DataFrame, so you don't necessarily have to use DataFrame to
do this even then (although there are advantages to using the DataFrame
approach.)
Your basic problem is that you have an RDD of tuples, where each
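A minimal sketch of a plain pair-RDD join, which also works on Spark 1.2 with the extra import (the sample data is made up and sc is the shell SparkContext):

```scala
import org.apache.spark.SparkContext._   // needed on Spark 1.2.x for rddToPairRDDFunctions

val left  = sc.parallelize(Seq(("k1", "a"), ("k2", "b")))
val right = sc.parallelize(Seq(("k1", 1), ("k3", 3)))

// join on the first element of the tuples; result contains ("k1", ("a", 1))
val joined = left.join(right)
```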
Hi all,
I'm storing an RDD as a sequence file with the following content:
key = filename (string); value = Python str from numpy.savez (not unicode)
In order to make sure the whole numpy array gets stored, I have to first
serialize it with:
def serialize_numpy_array(numpy_array):
output = io.BytesIO()
Hi, spark-users:
I'm using spark-submit to submit multiple jars and files (all in HDFS) to run a
job, with the following command:
spark-submit
--class myClass
--master spark://localhost:7077/
--deploy-mode cluster
--jars hdfs://localhost/1.jar, hdfs://localhost/2.jar
--files
Update: Using bytearray before storing to RDD is not a solution either.
This happens when trying to read the RDD when the value was stored as
python bytearray:
Traceback (most recent call last):
  File "/vagrant/python/kmeans.py", line 24, in <module>
    features =
Thanks, but Spark 1.2 doesn't yet have DataFrame, I guess?
Regards
Amit
On Tue, Jun 9, 2015 at 10:25 AM, Ted Yu yuzhih...@gmail.com wrote:
join is operation of DataFrame
You can call sc.createDataFrame(myRDD) to obtain DataFrame where sc is
sqlContext
Cheers
On Mon, Jun 8, 2015 at 9:44
Thanks a lot Mohammed, Gerard and Yana.
I can write to the table, but I get an exception. It says *Exception in
thread "main" java.io.IOException: Failed to open thrift connection to
Cassandra at 127.0.0.1:9160*
In yaml file :
rpc_address: localhost
rpc_port: 9160
And at project
Which Spark release are you using ?
Can you pastebin the stack trace w.r.t. ExecutorLostFailure ?
Thanks
On Mon, Jun 8, 2015 at 8:52 PM, Sourav Mazumder sourav.mazumde...@gmail.com
wrote:
Hi,
I am trying to run a SQL query from a JDBC driver using Spark's Thrift Server.
I'm doing a join
Update: I've done a workaround to use saveAsPickleFile instead which
handles everything correctly. It stays in byte format. I noticed Python gets
messy with str and bytes being the same type in Python 2.7; I wonder whether
using Python 3 would have the same problem.
I would still like to use a cross
Cheng, thanks for the response.
Yes, I was using HiveContext.setConf() to set dfs.replication.
However, I cannot change the value in Hadoop core-site.xml because that
will change every HDFS file.
I only want to change the replication factor of some specific files.
-Original Message-
I was wondering if there were any consultants in high standing in the
community. We are considering using Spark, and we'd love to have someone
with a lot of experience help us get up to speed and implement a preexisting
data pipeline to use Spark (and perhaps first help answer the question of
Hi,
I am writing some code inside an update function for updateStateByKey that
flushes data to a remote system using akka-http.
For the akka-http request I need an ActorSystem and an ActorFlowMaterializer.
Can anyone share a pattern or insights that address the following questions:
- Where and
James,
As far as I can see, there are three distinct parts to your program:
- for loop
- synchronized block
- final outputFrame.save statement
Can you do a separate timing measurement by putting a simple
System.currentTimeMillis() around these blocks to know how much they are
taking and then
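For instance, a small helper along these lines (illustrative only):

```scala
// crude wall-clock timing around a block of code
def timed[T](label: String)(block: => T): T = {
  val start = System.currentTimeMillis()
  val result = block
  println(s"$label took ${System.currentTimeMillis() - start} ms")
  result
}

// wrap each of the three parts separately, e.g.:
// timed("for loop") { /* ... */ }
// timed("synchronized block") { /* ... */ }
// timed("save") { /* outputFrame.save(...) */ }
```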
Can you look in your worker logs for a more detailed stack trace? If it's
about winutils.exe you can look at these links to get it resolved.
- http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7
- https://issues.apache.org/jira/browse/SPARK-2356
Thanks
Best Regards
On Mon, Jun 8,
Thanks so much, Akhil!
It turns out HADOOP_HOME was not set.
Dong Lei
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, June 8, 2015 3:12 PM
To: Dong Lei
Cc: user@spark.apache.org
Subject: Re: Driver crash at the end with InvocationTargetException when
running SparkPi
Can you
? = ip address of your cassandra host
On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com wrote:
Hi ,
How can I find spark.cassandra.connection.host? And what should I change ?
Should I change cassandra.yaml ?
Error says me *Exception in thread main java.io.IOException:
Hi,
Thanks for your guidelines. I will try it out.
Btw, how do you know HiveContext.sql (and also
DataFrame.registerTempTable) is only expected to be invoked on the driver
side? Where can I find the documentation?
BR,
Patcharee
On 07. juni 2015 16:40, Cheng Lian wrote:
Spark SQL supports Hive dynamic
Hi ,
How can I find spark.cassandra.connection.host? And what should I change?
Should I change cassandra.yaml?
The error says *Exception in thread "main" java.io.IOException: Failed to
open native connection to Cassandra at {127.0.1.1}:9042*
What should I add? *SparkConf sparkConf = new
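For the spark-cassandra-connector the host is passed on the SparkConf; a minimal sketch, where the host value is whatever address your Cassandra node binds to (its rpc_address/listen_address) and the app name is made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-example")                      // hypothetical app name
  .set("spark.cassandra.connection.host", "127.0.0.1")  // your Cassandra node's address
val sc = new SparkContext(conf)
```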
Hi
You are looking for the explode method (in the DataFrame API starting with 1.3,
I believe)
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1002
Ram
On Sun, Jun 7, 2015 at 9:22 PM, Dimp Bhat dimp201...@gmail.com wrote:
Hi,
I'm trying to write
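A rough sketch of the explode suggestion above, with a made-up schema and assuming the spark-shell sc/sqlContext:

```scala
import sqlContext.implicits._   // spark-shell sqlContext

case class Record(id: Int, tags: Seq[String])            // hypothetical schema
val df = sc.parallelize(Seq(Record(1, Seq("a", "b")))).toDF()

// one output row per element of the "tags" array, emitted into a new "tag" column
val exploded = df.explode("tags", "tag") { tags: Seq[String] => tags }
exploded.show()
```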
Strange, you can manually start it by logging in to the Worker machine and
then issuing this command:
sbin/start-slave.sh 1 spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT
Thanks
Best Regards
On Mon, Jun 8, 2015 at 3:44 PM, James King jakwebin...@gmail.com wrote:
Thanks Akhil, yes that works fine
Two simple suggestions:
1. No need to call zipWithIndex twice. Use the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId which does not trigger a spark
job
Below your code with the above changes:
var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt =
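Put together, the suggested version might look like this (the CSV path is from the snippet above; everything else is an assumption):

```scala
val dataRDD = sc.textFile("/test.csv").map(_.split(","))
// zipWithUniqueId assigns ids without triggering a separate Spark job
val dt = dataRDD.zipWithUniqueId()
```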
Hello,
I submit my Spark job with the following parameters:
./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \
--class mgm.tp.bigdata.ma_spark.SparkMain \
--master spark://quickstart.cloudera:7077 \
ma-spark.jar \
1000
and get the following exception:
java.io.IOException: Mkdirs failed to
The HDFS path should be something like:
hdfs://127.0.0.1:8020/user/cloudera/inputs/
On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com
wrote:
hello,
i submit my spark job with the following parameters:
./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \
--class
When are you loading these Parquet files?
Can you please share the code where you are passing the Parquet file to Spark?
On 8 June 2015 at 16:39, Cheng Lian lian.cs@gmail.com wrote:
Are you appending the joined DataFrame whose PolicyType is string to an
existing Parquet file whose
Hi,
I am playing with MLlib (Spark 1.3.1) and my auto-completion suggestions
don't correspond to the official API.
Here are my dependencies:
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.0.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" excludeAll(
Can you do *ssh -v 192.168.1.16* from the Master machine and make sure it's
able to log in without a password?
Thanks
Best Regards
On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin...@gmail.com wrote:
I have two hosts 192.168.1.15 (Master) and 192.168.1.16 (Worker)
These two hosts have
On 6/8/15 4:02 PM, patcharee wrote:
Hi,
Thanks for your guidelines. I will try it out.
Btw how do you know HiveContext.sql (and also
DataFrame.registerTempTable) is only expected to be invoked on driver
side? Where can I find document?
I'm afraid we don't state this explicitly on the SQL
Thanks Akhil, yes that works fine it just lets me straight in.
On Mon, Jun 8, 2015 at 11:58 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Can you do *ssh -v 192.168.1.16* from the Master machine and make sure
its able to login without password?
Thanks
Best Regards
On Mon, Jun 8, 2015 at
For DataFrame, there are also transformations and actions. And
transformations are also lazily evaluated. However, DataFrame
transformations like filter(), select(), agg() return a DataFrame rather
than an RDD. Other methods like show() and collect() are actions.
Cheng
On 6/8/15 1:33 PM,
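For example (a sketch assuming a DataFrame `people` with `name` and `age` columns already in scope):

```scala
// transformations: only build up a logical plan, nothing is executed yet
val adults = people.filter(people("age") >= 18).select("name")

// actions: trigger actual execution
adults.show()
val rows = adults.collect()
```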
Are you appending the joined DataFrame whose PolicyType is string to an
existing Parquet file whose PolicyType is int? The exception indicates
that Parquet found a column with conflicting data types.
Cheng
On 6/8/15 5:29 PM, bipin wrote:
Hi I get this error message when saving a table:
Hi,
I run my project locally. How can I find the IP address of my Cassandra host?
From cassandra.yaml, or somewhere else?
yasemin
2015-06-08 11:27 GMT+03:00 Gerard Maas gerard.m...@gmail.com:
? = ip address of your cassandra host
On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com wrote:
Hi
I'm sometimes getting errors like the one below:
Spark 1.3.1
history enabled to HDFS
I've found a few JIRAs, but they seem to be resolved, e.g.
https://issues.apache.org/jira/browse/SPARK-1475
any ideas?
2015-06-08 08:33:06.426 ERROR LiveListenerBus: Listener EventLoggingListener
threw an exception
I have two hosts 192.168.1.15 (Master) and 192.168.1.16 (Worker)
These two hosts have exchanged public keys so they have free access to each
other.
But when I run sbin/start-all.sh from the Spark home on 192.168.1.15 I still get
192.168.1.16: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Two simple suggestions:
1. No need to call zipWithIndex twice. Use the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId which does not trigger a spark
job
Below your code with the above changes:
var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt =
Hi I get this error message when saving a table:
parquet.io.ParquetDecodingException: The requested schema is not compatible
with the file schema. incompatible types: optional binary PolicyType (UTF8)
!= optional int32 PolicyType
at
Then one possible workaround is to set dfs.replication in
sc.hadoopConfiguration.
However, this configuration is shared by all Spark jobs issued within
the same application. Since different Spark jobs can be issued from
different threads, you need to pay attention to synchronization.
Cheng
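That is, something along these lines, run before the jobs whose output should use the different replication factor (the value is illustrative):

```scala
// shared by every job issued from this SparkContext afterwards, so guard it
// if jobs are submitted concurrently from multiple threads
sc.hadoopConfiguration.set("dfs.replication", "2")
```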
Your HDFS path passed to the Spark job is incorrect.
On 8 June 2015 at 16:24, Nirmal Fernando nir...@wso2.com wrote:
The HDFS path should be something like:
hdfs://127.0.0.1:8020/user/cloudera/inputs/
On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com
wrote:
hello,
i submit my spark
Thanks, did that and now I am getting an out-of-memory error. But I am not
sure where this occurs. It can't be on the Spark executor as I have 28GB
allocated to it. It is not the driver because I run this locally and
monitor it via jvisualvm. Unfortunately I can't jmx-monitor hadoop.
From the
I would think DF = RDD + schema + some additional methods. In fact, a DF object
has a DF.rdd in it so you can (if needed) convert DF => RDD really easily.
On Mon, Jun 8, 2015 at 5:41 PM, kiran lonikar loni...@gmail.com wrote:
Thanks. Can you point me to a place in the documentation of SQL
programming
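For example, a round-trip sketch with a made-up case class, assuming the spark-shell sc/sqlContext:

```scala
case class Person(name: String, age: Int)               // hypothetical schema

val rdd  = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
val df   = sqlContext.createDataFrame(rdd)              // RDD[Person] -> DataFrame
val back = df.rdd                                       // DataFrame -> RDD[Row]
```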
Thanks. Can you point me to a place in the documentation of SQL programming
guide or DataFrame scaladoc where this transformation and actions are
grouped like in the case of RDD?
Also if you can tell me if sqlContext.load and unionAll are transformations
or actions...
I answered a question on
Hi there, I am trying to decrease my app's running time on the worker node. I
checked the log and found the most time-consuming part is below:
15/06/08 16:14:23 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory
(estimated size 2.1 KB, free 353.3 MB)
15/06/08 16:14:42 INFO
Hi,
Is it possible to query a data structure that is a dictionary within a
dictionary?
I have a parquet file with a structure:
test
 |-- key1: {key_string: val_int}
 |-- key2: {key_string: val_int}
If I try to do:
parquetFile.test
-- Column<test>
parquetFile.test.key2
-- AttributeError:
I suggest you include your code and the error message! It's not even
immediately clear what programming language you mean to ask about.
On Mon, Jun 8, 2015 at 2:50 PM, elbehery elbeherymust...@gmail.com wrote:
Hi,
I have two datasets of customer types, and I would like to apply coGrouping
on
I am reading millions of XML files via
val xmls = sc.binaryFiles(xmlDir)
The operation runs fine locally but on YARN it fails with:
client token: N/A
diagnostics: Application application_1433491939773_0012 failed 2 times due
to ApplicationMaster for attempt
Try putting a * on the end of xmlDir, i.e.
xmlDir = hdfs:///abc/def/*
rather than
xmlDir = hdfs://abc/def
and see what happens. I don't know why, but that appears to be more reliable
for me with S3 as the filesystem.
I'm also using binaryFiles, but I've tried running the same command while
It looks like saveAsTextFiles doesn't support the compression parameter of
RDD.saveAsTextFile. Is there a way to add the functionality in my client
code without patching Spark? I tried making my own saveFunc function and
calling DStream.foreachRDD but ran into trouble with invoking rddToFileName
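One workaround that avoids rddToFileName entirely is to build the per-batch path yourself inside foreachRDD; a sketch, where `stream`, the prefix and the codec choice are placeholders:

```scala
import org.apache.hadoop.io.compress.GzipCodec

val prefix = "hdfs:///output/batches"                   // hypothetical output location
stream.foreachRDD { (rdd, time) =>
  // construct the per-batch path by hand instead of calling rddToFileName
  rdd.saveAsTextFile(s"$prefix-${time.milliseconds}", classOf[GzipCodec])
}
```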
No luck I am afraid. After giving the namenode 16GB of RAM, I am still
getting an out-of-memory exception, a somewhat different one:
15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw
exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
Hi Jeetendra, Cheng
I am using the following code for joining:
val Bookings = sqlContext.load("/home/administrator/stageddata/Bookings")
val Customerdetails =
sqlContext.load("/home/administrator/stageddata/Customerdetails")
val CD = Customerdetails.
where($CreatedOn 2015-04-01 00:00:00.0).
For the sake of history: DON'T call System.exit within Spark code.
Hi,
I have two datasets of customer types, and I would like to apply coGrouping
on them.
I could not find an example for that on the website, and I have tried to apply
this in my code, but the compiler complained.
Any suggestions?
I am learning more about Spark (and in this case Spark Streaming) and am
getting that functions like dstream.map() take a function, apply it to each
element of the RDD, and in turn return a new RDD based on the original.
That's cool for the simple map functions in the
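For instance (a sketch assuming `lines` is a DStream[String] from any input source):

```scala
// applied to every element of each batch's RDD; the result is a new DStream
val lengths = lines.map(line => line.length)
lengths.print()
```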
unsubscribe
Ricardo Goncalves da Silva
Lead Data Scientist | Business Intelligence Systems Development Section -
Innovation Projects | IDPB02
Av. Eng. Luis Carlos Berrini, 1.376 - 7º - 04571-000 - SP
Hi,
we've been seeing occasional issues in production with the FileOutputCommitter
reaching a deadlock situation.
We are writing our data to S3 and currently have speculation enabled. What
we see is that Spark gets a file-not-found error trying to access a
temporary part file that it wrote
Can you do a simple
sc.binaryFiles("hdfs:///path/to/files/*").count()
in the spark-shell and verify that part works?
Ewan
-Original Message-
From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com]
Sent: 08 June 2015 15:40
To: Ewan Leith; user@spark.apache.org
Subject: Re:
Send email to user-unsubscr...@spark.apache.org
Cheers
2015-06-08 7:50 GMT-07:00 Ricardo Goncalves da Silva
ricardog.si...@telefonica.com:
unsubscribe
*Ricardo Goncalves da Silva*
Lead Data Scientist *|*
Yes, whatever you put for listen_address in cassandra.yaml. Also, you
should try to connect to your Cassandra cluster via bin/cqlsh to make sure
you have connectivity before you try to make a connection via Spark.
On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya godo...@gmail.com wrote:
Hi,
I
You may refer to DataFrame Scaladoc
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Methods listed in "Language Integrated Queries" and "RDD Operations" can be
viewed as transformations, and those listed in "Actions" are, of
course, actions. As for
It was giving the same error, which made me figure out it is the driver
but the driver running on hadoop - not the local one. So I did
--conf spark.driver.memory=8g
and now it is processing the files!
Cheers
On 08/06/15 15:52, Ewan Leith wrote:
Can you do a simple
I suspect that Bookings and Customerdetails both have a PolicyType
field; one is a string and the other is an int.
Cheng
On 6/8/15 9:15 PM, Bipin Nag wrote:
Hi Jeetendra, Cheng
I am using following code for joining
val Bookings = sqlContext.load(/home/administrator/stageddata/Bookings)
val
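If that is the case, one hedged workaround is to cast one side to a common type before joining/saving; the frame and column names come from the snippet above, while the remaining columns are placeholders:

```scala
// rebuild Bookings with PolicyType cast to int so it matches the schema of the
// Parquet data it is appended to
val bookingsFixed = Bookings.select(
  Bookings("PolicyType").cast("int").as("PolicyType"),
  Bookings("BookingId"))   // hypothetical: list the remaining Bookings columns you need
```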
Hi Cheng, Ayan,
Thanks for the answers. I like the rule of thumb. I cursorily went through
the DataFrame, SQLContext and sql.execution.basicOperators.scala code. It
is apparent that these functions are lazily evaluated. The SQLContext.load
functions are similar to SparkContext.textFile kind of
It turns out my assumption on load and unionAll being blocking is not
correct. They are transformations. So instead of just running only the load
and unionAll in the run() methods, I think you will have to save the
intermediate dfInput[i] to temp (parquet) files (possibly to in memory DFS
like
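In other words, something along these lines, where dfInput (a Seq[DataFrame]) and the temp paths are placeholders:

```scala
// force each intermediate DataFrame to materialize by writing it out,
// then read the results back for the next stage
dfInput.zipWithIndex.foreach { case (df, i) =>
  df.saveAsParquetFile(s"/tmp/intermediate_$i.parquet")   // hypothetical temp path
}
val materialized = dfInput.indices.map(i =>
  sqlContext.parquetFile(s"/tmp/intermediate_$i.parquet"))
```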