Hi, thanks. Void works. I use the same custom format in Hive and it works with
Void as the key. Please share an example, if you have one, of creating a
DataFrame using a custom Hadoop format.
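For reference, a minimal Scala sketch of one way this could look. MyRecord, the
accessors on MyRecordWritable, and NullWritable as the key type are assumptions
for illustration, not the actual classes:

import org.apache.hadoop.io.NullWritable
import org.apache.spark.sql.SQLContext

// Hypothetical plain record extracted from MyRecordWritable
case class MyRecord(field1: String, field2: Long)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Load the file with the custom (old mapred API) InputFormat
val pairs = sc.hadoopFile(
  "hdfs:///tmp/data/myformat.xyz",
  classOf[MyInputFormat],
  classOf[NullWritable],
  classOf[MyRecordWritable])

// Copy each Writable value into a serializable case class, then convert to a DataFrame
val df = pairs.values
  .map(w => MyRecord(w.getField1, w.getField2))  // assumed accessors on MyRecordWritable
  .toDF()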
On Aug 1, 2015 2:07 AM, Ted Yu yuzhih...@gmail.com wrote:
I don't think using Void class is the right choice - it is not even a
Hi,
I tried to develop a RandomForest model for my data in PySpark as follows:
rf_model = RandomForest.trainClassifier(train_idf, 2, {},
numTrees=15, seed=144)
print "RF: Num trees = %d, Num nodes = %d\n" % (rf_model.numTrees(),
rf_model.totalNumNodes())
pred_label =
Do you notice how you are making a list of Ints?
input: org.apache.spark.rdd.RDD[*Int*] = ParallelCollectionRDD[0] at
parallelize at console:21
And these are also being mapped to more Ints:
result: org.apache.spark.rdd.RDD[*Int*] = MapPartitionsRDD[1] at map at
console:23
Generally,
Typesafe (http://typesafe.com). We provide commercial support for Spark on
Mesos and Mesosphere DCOS. We contribute to Spark's Mesos integration and
Spark Streaming enhancements.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Found an approach:
sc._jsc.hadoopConfiguration().set(key, value)
Thanks for the quick reply. I will be unable to collect more data until
Monday, but I will update the thread accordingly.
I am using Spark 1.4.0. Were there any related issues reported? I wasn't
able to find any, but I may have overlooked something. I have also updated
the original
Can you call collect() and log the output to get more of a clue about what is left?
Which Spark release are you using?
Cheers
On Fri, Jul 31, 2015 at 9:01 AM, Warfish sebastian.ka...@gmail.com wrote:
Hi everyone,
I have been working with Spark for a little while now and have encountered a strange
problem that
Does anybody have any idea how to solve this problem?
Ningjun
From: Wang, Ningjun (LNG-NPV)
Sent: Thursday, July 30, 2015 11:06 AM
To: user@spark.apache.org
Subject: How to register array class with Kryo in spark-defaults.conf
I register my class with Kryo in spark-defaults.conf as follows:
Hi everyone,
I have been working with Spark for a little while now and have encountered a strange
problem that gives me headaches; it has to do with the JavaRDD.subtract
method. Consider the following piece of code:
public static void main(String[] args) {
//context is of type
Hi,
I had a job that got stuck on yarn due to
https://issues.apache.org/jira/browse/SPARK-6954
It never exited properly.
Is there a way to set a timeout for a stage or all stages?
Nope, the output stream of that subprocess should be spark.getInputStream().
According to the Oracle doc
https://docs.oracle.com/javase/8/docs/api/java/lang/Process.html#getOutputStream--
:
public abstract InputStream
https://docs.oracle.com/javase/8/docs/api/java/io/InputStream.html
getInputStream()
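A rough Scala sketch of draining both subprocess streams on separate threads
before waitFor(), as suggested above (the paths and class names are illustrative):

import java.io.{BufferedReader, InputStream, InputStreamReader}
import org.apache.spark.launcher.SparkLauncher

// Drain a stream on its own thread so the subprocess cannot block on a full buffer
def drain(in: InputStream, prefix: String): Unit = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      val reader = new BufferedReader(new InputStreamReader(in))
      var line = reader.readLine()
      while (line != null) {
        println(prefix + " " + line)
        line = reader.readLine()
      }
    }
  })
  t.setDaemon(true)
  t.start()
}

val spark: Process = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // illustrative values
  .setMainClass("com.example.Main")
  .setMaster("yarn-cluster")
  .launch()

drain(spark.getInputStream, "[stdout]")  // subprocess stdout
drain(spark.getErrorStream, "[stderr]")  // subprocess stderr
val exitCode = spark.waitFor()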
Hi, thanks for the response. It looks like the YARN container is getting killed,
but I don't know why I see the shuffle metadata fetch exception mentioned in the
following SO link. I have enough memory: 8 nodes, 8 cores and 30 GB of memory each.
Because of this exception, YARN is killing the container running
file can be a directory (look at all children) or even a glob
(/path/*.ext, for example).
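For example, something along these lines should work (the Writable types below
are placeholders for whatever your sequence files actually contain):

import org.apache.hadoop.io.{LongWritable, Text}

// The path can be a directory, a glob such as "hdfs:///data/seq/*.seq",
// or a comma-separated list of paths as below
val paths = "hdfs:///data/seq/2015-07-30,hdfs:///data/seq/2015-07-31"
val data = sc.sequenceFile(paths, classOf[Text], classOf[LongWritable])
  .map { case (k, v) => (k.toString, v.get) }  // copy out of the reused Writables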
On Fri, Jul 31, 2015 at 11:35 AM, swetha swethakasire...@gmail.com wrote:
Hi,
How do I add multiple sequence files from HDFS to a Spark context for batch
processing? I have something like the following
Can you pastebin the complete stack trace?
If you can show the skeleton of MyInputFormat and MyRecordWritable, that would
provide additional information as well.
Cheers
On Fri, Jul 31, 2015 at 11:24 AM, unk1102 umesh.ka...@gmail.com wrote:
Hi, I have my own custom Hadoop InputFormat which I
Hi Tomasz,
*Answer to your 1st question*:
Clear/read the error (spark.getErrorStream()) and output
(spark.getInputStream()) stream buffers before you call spark.waitFor(); it
would be better to clear/read them with two different threads. Then it should
work fine.
As the Spark job is launched as
If you've set the checkpoint dir, it seems like indeed the intent is
to use a default checkpoint interval in DStream:
private[streaming] def initialize(time: Time) {
...
// Set the checkpoint interval to be slideDuration or 10 seconds,
which ever is larger
if (mustCheckpoint
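If you want something other than that derived default, you can set the interval
explicitly; a minimal sketch (the two-minute value is just an example, and
myStream stands for whichever DStream you are checkpointing):

import org.apache.spark.streaming.Minutes

// Checkpoint this DStream's RDDs every 2 minutes instead of the derived default
myStream.checkpoint(Minutes(2))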
It looks like there's an issue with the 'Parameters' POJO I'm using within
my driver program. For some reason it needs to be serializable, which is
odd.
java.io.NotSerializableException: com.kona.consumer.kafka.spark.Parameters
Giving it another whirl, though I have to make it serializable.
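For what it's worth, a common way to avoid making the whole POJO serializable
(a Scala sketch with illustrative names, not the actual Parameters class) is to
copy just the needed fields into local vals, so the closure captures those
instead of the object itself:

// 'params' stands in for the non-serializable Parameters object on the driver
val topic = params.getTopic          // plain String, serializable
val threshold = params.getThreshold  // plain Int, serializable

// The closure now captures only 'topic' and 'threshold', not 'params'
val filtered = records.filter { r =>
  r.topic == topic && r.score > threshold
}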
Hi Ted, thanks very much for the reply. I can't share code on a public forum. I have
created the custom format by extending the Hadoop mapred InputFormat class, and
likewise the RecordReader class. If you can help me with how to use it to build a
DataFrame, it would be very helpful.
On Sat, Aug 1, 2015 at 12:12 AM, Ted Yu
I've instrumented checkpointing per the programming guide and I can tell
that Spark Streaming is creating the checkpoint directories but I'm not
seeing any content being created in those directories nor am I seeing the
effects I'd expect from checkpointing. I'd expect any data that comes into
Here is the definition of EsDoc:
case class EsDoc(id: Long, isExample: Boolean, docSetIds: Array[String],
randomId: Double, vector: String) extends Serializable
Note that it is not EsDoc that has the problem with registration; it is EsDoc[]
(the array class of EsDoc) that has the problem with
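One way to register the array class as well is to do it programmatically on the
SparkConf (a minimal Scala sketch; the app name is arbitrary):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("es-indexing")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registers both the case class and its array class with Kryo
  .registerKryoClasses(Array(classOf[EsDoc], classOf[Array[EsDoc]]))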
I'll check the log info message..
Meanwhile, the code is basically
public class KafkaSparkStreamingDriver implements Serializable {
..
SparkConf sparkConf = createSparkConf(appName, kahunaEnv);
JavaStreamingContext jssc = params.isCheckpointed() ?
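In case it helps, the equivalent Scala pattern for a checkpoint-aware context
(a sketch assuming a SparkConf named sparkConf and a placeholder checkpoint
directory) looks like this:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"  // placeholder path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... set up DStreams and output operations here ...
  ssc
}

// Rebuilds the context from checkpoint data if present, otherwise creates a new one
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()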
Hi, I have my own custom Hadoop InputFormat which I need to use in
creating a DataFrame. I tried the following:
JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd =
jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritable.class);
JavaRDD<MyRecordWritable>
Hi,
How do I add multiple sequence files from HDFS to a Spark context for batch
processing? I have something like the following in my code. Do I have to pass a
comma-separated list of sequence file paths to the Spark context?
val data = if(args.length > 0 && args(0) != null)
sc.sequenceFile(file,
Show us the relevant code
On Fri, Jul 31, 2015 at 12:16 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
I've instrumented checkpointing per the programming guide and I can tell
that Spark Streaming is creating the checkpoint directories but I'm not
seeing any content being created in
Dear Michael, dear all,
a minimal example is listed below.
After some further analysis I was able to figure out that the problem is related
to the *leftOuterJoinWithRemovalOfEqualColumn* method, as I use columns of
both the left and right DataFrames when doing the select on the joined table.
/**
*
Hello,
I am trying to run a Spark job that hits an external webservice to get back
some information. The cluster is 1 master + 4 workers, each worker has 60GB
RAM and 4 CPUs. The external webservice is a standalone Solr server, and is
accessed using code similar to that shown below.
def
I don't think using Void class is the right choice - it is not even a
Writable.
BTW in the future, capture text output instead of image.
Thanks
On Fri, Jul 31, 2015 at 12:35 PM, Umesh Kacha umesh.ka...@gmail.com wrote:
Hi Ted, thanks. My key is always Void because my custom format file is non
Since one input DStream creates one receiver, and one receiver uses one
executor / node:
What happens if you create more DStreams than nodes in the cluster?
Say I have 30 DStreams on a 15-node cluster.
Will ~10 streams get assigned to ~10 executors / nodes, while the other ~20
streams will be
Looking closer at the code you posted, the error was likely caused by the
3rd parameter: Void.class
It is supposed to be the class of the key.
FYI
On Fri, Jul 31, 2015 at 11:24 AM, unk1102 umesh.ka...@gmail.com wrote:
Hi, I have my own custom Hadoop InputFormat which I need to use in
Tathagata,
Could the bottleneck possibly be the number of executor nodes in our
cluster? Since we are creating 500 DStreams based on 500 text file
directories, do we need at least 500 executors / nodes to act as receivers,
one for each of the streams?
On Tue, Jul 28, 2015 at 6:09 PM, Tathagata Das
Hi Guys,
I have struggled for a while with this seemingly simple thing:
I have a sequence of timestamps and want to create a DataFrame with one column.
Seq[java.sql.Timestamp]
//import collection.breakOut
var seqTimestamp = scala.collection.Seq(listTs:_*)
seqTimestamp: Seq[java.sql.Timestamp] =
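One Scala sketch that should work (assuming seqTimestamp is the
Seq[java.sql.Timestamp] above; the column name is arbitrary) is to wrap each
timestamp in a Tuple1 and call toDF:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// A Seq of Tuple1s is a Seq of Products, so toDF can build a one-column DataFrame
val df = seqTimestamp.map(Tuple1(_)).toDF("ts")
df.printSchema()  // expect a single "ts" column of timestamp type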
The referenced bug has been fixed in 1.4.0; are you able to upgrade?
Cheers
On Fri, Jul 31, 2015 at 10:01 AM, William Kinney william.kin...@gmail.com
wrote:
Hi,
I had a job that got stuck on yarn due to
https://issues.apache.org/jira/browse/SPARK-6954
It never exited properly.
Is there
Tomasz:
Please take a look at the Redirector class inside:
./launcher/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java
FYI
On Fri, Jul 31, 2015 at 10:02 AM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
Hi Tomasz,
*Answer to your 1st question*:
Clear/read the error
Hi folks,
I would like to do RL in Spark, as the legacy approach using a blocking query is
not working now that the data is getting huge. Please provide the necessary links
and information for doing so.
Thanks
@Brandon Each node can host multiple executors. For example, in a 15-node
cluster, if your NodeManager (in YARN) or its equivalent (Mesos or
standalone) runs on each node, and each node has enough resources
to host, say, 5 executors, then in total you can have 15*5 executors, and each
of these
Thanks a lot @Das @Cody. I moved from a receiver-based stream to a direct stream,
and I can get the topics from the offsets!!
On Fri, Jul 31, 2015 at 4:41 PM, Brandon White bwwintheho...@gmail.com
wrote:
Tathagata,
Could the bottleneck possibly be the number of executor nodes in our
cluster? Since we
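For anyone searching the archives later, a rough Scala sketch of reading
topic/partition/offset information from a direct stream (the broker, the topic,
and the ssc StreamingContext are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
val topics = Set("my-topic")                                     // placeholder topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Each RDD from the direct stream carries its Kafka offset ranges
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic} partition ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
  }
}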
That doesn't make sense to me, because the other cluster processes all the
data in less than a second.
Anyway, I'm going to set that parameter.
2015-07-31 0:36 GMT+02:00 Tathagata Das t...@databricks.com:
Yes, and that is indeed the problem. It is trying to process all the data
in Kafka, and
Hi,
I have submitted a Spark job with the options jars, class, and master as *local*, but
I am getting the error below:
*dse spark-submit spark error exception in thread main java.io.ioexception:
Invalid Request Exception(Why you have not logged in)*
*Note: submitting on a DataStax Spark node*
Please let me
@Brandon, the file streams do not use receivers, so the bottleneck is not
about executors per se. But there could be a couple of bottlenecks:
1. Every batch interval, the 500 DStreams are going to get directory
listings from 500 directories, SEQUENTIALLY. So preparing the batch's RDDs
and jobs can
Please take a look at stringToTimestamp() in
./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
Representing timestamp with long should work.
Cheers
On Fri, Jul 31, 2015 at 2:50 PM, Joanne Contact joannenetw...@gmail.com
wrote:
Hi Guys,
I have struggled for
Hi,
I am currently working with the latest version of Apache Spark (1.4.1), the
pre-built package for Hadoop 2.6+.
Is there any feature in Spark/Hadoop to encrypt RDDs or the in-memory cache
(something similar is Altibase's HDB:
http://altibase.com/in-memory-database-computing-solutions/security/)
I found the error. The final step is to index data in Elasticsearch, and the
Elasticsearch in one of the clusters is overwhelmed and doesn't work
correctly.
I pointed the cluster that wasn't working at another ES and no longer get any
delay.
Sorry, it wasn't related to Spark!
2015-07-31
Hi,
I am getting a buffer overflow exception while using Spark via the Thrift
server. May I know how to overcome this?
Code:
HqlConnection con = new HqlConnection("localhost", 10001,
HiveServer.HiveServer2);
con.Open();
HqlCommand createCommand = new HqlCommand(tablequery,
Dear list,
Hi, I am new to Spark and GraphX, and I have some experience with Scala. I
want to use GraphX to calculate some basic statistics on linked open data,
which is basically a graph.
Suppose the graph only contains one type of edge, directed from individuals to
concepts, and the edge
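In case a concrete starting point helps, a small Scala sketch of computing
per-concept in-degrees with GraphX (it assumes the edges are already available
as pairs of numeric vertex ids, e.g. after mapping URIs to Longs; the
tab-separated file is just an example):

import org.apache.spark.graphx.{Edge, Graph}

// Parse (individualId, conceptId) pairs into GraphX edges
val edges = sc.textFile("hdfs:///lod/edges.tsv").map { line =>
  val Array(src, dst) = line.split("\t")
  Edge(src.toLong, dst.toLong, 1)
}

val graph = Graph.fromEdges(edges, 0)

// In-degree of each concept vertex = number of individuals pointing at it
val conceptDegrees = graph.inDegrees
conceptDegrees.map { case (vid, deg) => (deg, vid) }.top(10).foreach(println)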