I started an AWS cluster (1 master + 3 core nodes) and downloaded the prebuilt Spark
binary. I downloaded the latest Anaconda Python and started an IPython
notebook server by running the command below:
ipython notebook --port --profile nbserver --no-browser
Then, I tried to develop a Spark
when I run the Spark application for a long time.
15/05/11 12:50:47 INFO spark.SparkContext: Created broadcast 8 from broadcast
at DAGScheduler.scala:839
15/05/11 12:50:47 INFO storage.BlockManagerInfo: Removed input-0-1431345102800
on dn2.vi:45688 in memory (size: 27.6 KB, free: 524.2 MB)
15/05/11
Sorry, I was using Spark 1.3.x.
I cannot reproduce it in master.
But should I still open a JIRA so that I can request it to be backported
to the 1.3.x branch? Thanks again!
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Saturday, May 09,
Ran into this same issue. The only solution seems to be to coerce the DataFrame's
schema back into the right state. It looks like you have to convert the DF to
an RDD, which has some overhead, but otherwise this worked for me:
val newDF = sqlContext.createDataFrame(origDF.rdd, new
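A fuller sketch of the same coercion (the schema fields below are placeholders, not taken from the original message):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical corrected schema: substitute the fields your DataFrame should have.
val correctedSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// Rebuild the DataFrame from its underlying RDD[Row] with the desired schema.
val newDF = sqlContext.createDataFrame(origDF.rdd, correctedSchema)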
Hi,
I am using Hadoop 2.5.2. My code is listed below. Besides, I also ran
some further tests. I found the following interesting result:
1. I hit those exceptions when I set the key class to NullWritable,
LongWritable, or IntWritable and used the
Looking for help with this. Thank you!
Hi all,
I'd like to monitor Akka using Kamon, which needs the
akka.extensions setting to be a list like this, in Typesafe config format:
akka {
extensions = [kamon.system.SystemMetrics, kamon.statsd.StatsD]
}
But I cannot find a way to do this. I have tried these:
1.
As far as I know, that is not possible. If the file is too big to load onto one
node, what I would do is use the RDD API instead: load the
file into distributed memory and then filter the lines that are relevant to
me.
I am not sure how to just read part of a single file. Sorry I'm
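A minimal sketch of the load-then-filter approach described above (the path and the filter condition are placeholders):

// Load the file as a distributed RDD of lines, one partition per HDFS block.
val lines = sc.textFile("hdfs:///logs/big-file.txt")
// Keep only the lines that are relevant.
val relevant = lines.filter(line => line.contains("ERROR"))
relevant.take(10).foreach(println)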
And one other suggestion in relation to the connection pool line of enquiry:
check whether your Cassandra service is configured to allow only one session
per e.g. user.
I think the error is generated inside the connection pool when it tries to
initialize a connection after the first one.
And in case you are running in local mode, try giving more cores to Spark with
e.g. local[5]; a low number could be interfering with the tuning params, which you can
try to play with as well. All this is in the context of how those params
interact with the connection pool and what that pool is doing.
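For reference, giving more cores in local mode is just a matter of the master string (sketch; the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// local[5] runs the driver with 5 worker threads instead of a single thread.
val conf = new SparkConf().setAppName("cassandra-writer-test").setMaster("local[5]")
val sc = new SparkContext(conf)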
Assume a web server access log is to be analyzed and the target of the computation
is CSV files per time period, e.g. one per day containing the
minute statistics and one per month containing the hour statistics. Incoming
statistics are computed as discretized streams using a Spark streaming
context.
http://stackoverflow.com/questions/30149868/generate-events-tuples-using-csv-file-with-timestamps
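A rough sketch of that setup (the input path, batch interval, and line parsing below are assumptions, not taken from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("access-log-stats")
val ssc = new StreamingContext(conf, Seconds(60))  // one batch per minute

// New log files dropped into this directory become the input stream.
val logLines = ssc.textFileStream("hdfs:///incoming/access-logs")

// Count requests per minute bucket; the substring key is a stand-in for real timestamp parsing.
val perMinute = logLines.map(line => (line.substring(0, 16), 1)).reduceByKey(_ + _)

// Emit one CSV-style line per minute bucket for each batch; daily and monthly
// roll-ups can be computed from these outputs.
perMinute.map { case (minute, count) => s"$minute,$count" }.saveAsTextFiles("hdfs:///stats/minute")

ssc.start()
ssc.awaitTermination()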
--
Thanks Regards,
Anshu Shukla
I'm familiar with the TableWriter code and that log only appears if the
write actually succeeded. (See
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/writer/TableWriter.scala
)
Thinking infrastructure, we see
It successfully writes some data and fails afterwards, like the host or
connection goes down. Weird.
Maybe you should post this question on the Spark-Cassandra connector group:
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
-kr, Gerard.
On Sun, May 10, 2015
I think the message that it has written 2 rows is misleading.
If you look further down you will see that it could not initialize a connection
pool for Cassandra (presumably while trying to write the previously mentioned 2
rows).
Another confirmation of this hypothesis is the phrase “error
This should work
base1 = sc.parallelize(setupRow(10), 1)
base2 = sc.parallelize(setupRow(10), 1)
df1 = ssc.createDataFrame(base1)
df2 = ssc.createDataFrame(base2)
df1.show()
df2.show()
# table names passed to registerTempTable must be strings
df1.registerTempTable("df1")
df2.registerTempTable("df2")
j = ssc.sql("select df1.k1
Hmm, there is also a connection pool involved, and such things (especially while
still rough around the edges) may behave erratically in a distributed multithreaded
environment.
Can you try foreachPartition and foreach together? This will create a
slightly different multithreading execution.
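Something along these lines (a sketch only; openSession and writeRow are placeholders for however the connection and write are set up, not connector API):

rdd.foreachPartition { partition =>
  // One connection per partition, shared by all of its rows.
  val session = openSession()        // placeholder for your connection setup
  partition.foreach { row =>
    writeRow(session, row)           // placeholder for the actual write
  }
  session.close()
}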
Hi
Thanks for the quick response.
No, I'm not using Streaming. Each DataFrame represents tabular data read
from a CSV file. They have the same schema.
There is also the option of appending each DF to the parquet file, but then
I can't maintain them as separate DFs when reading back in without
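One option that keeps them separate (a sketch assuming the Spark 1.3-era API from this thread; dataFrames and the output paths are placeholders):

// Write each DataFrame to its own directory so it stays addressable on its own.
dataFrames.zipWithIndex.foreach { case (df, i) =>
  df.saveAsParquetFile(s"hdfs:///output/parquet/df_$i")
}

// Reading one of them back later as a separate DataFrame:
val df0 = sqlContext.parquetFile("hdfs:///output/parquet/df_0")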
Looking
at ./core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala :
* Load an RDD saved as a SequenceFile containing serialized objects,
with NullWritable keys and
* BytesWritable values that contain a serialized partition. This is
still an experimental storage
...
def
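For reference, a small sketch of the object-file round trip that this doc comment describes (the path is a placeholder):

// Save an RDD as partitions of serialized objects...
val data = sc.parallelize(1 to 100)
data.saveAsObjectFile("hdfs:///tmp/ints-as-objects")

// ...and load it back; the element type has to be supplied.
val restored = sc.objectFile[Int]("hdfs:///tmp/ints-as-objects")
println(restored.count())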
How did you end up with thousands of DFs? Are you using streaming? In that
case you can do foreachRDD and keep merging incoming RDDs into a single RDD and
then save it through your own checkpoint mechanism.
If not, please share your use case.
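A rough sketch of the foreachRDD merge-and-save idea (the stream, element type, and output path are assumptions):

import org.apache.spark.rdd.RDD

var merged: RDD[String] = sc.emptyRDD[String]

stream.foreachRDD { rdd =>
  // Fold each incoming batch into the running RDD, then persist the result.
  merged = merged.union(rdd)
  merged.saveAsTextFile(s"hdfs:///checkpoints/merged-${System.currentTimeMillis}")
}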
On 11 May 2015 00:38, Peter Aberline
Hi
In that case, read the entire folder as an RDD and give it some reasonable number of
partitions.
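Something along these lines (sketch; the path and partition count are placeholders):

// Read every file in the folder as one RDD with an explicit minimum partition count.
val all = sc.textFile("hdfs:///data/csv-folder/*", minPartitions = 200)
println(all.partitions.length)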
Best
Ayan
On 11 May 2015 01:35, Peter Aberline peter.aberl...@gmail.com wrote:
Hi
Thanks for the quick response.
No I'm not using Streaming. Each DataFrame represents tabular data read
from a CSV