Raymond,
Thank you.
But I read in another thread
http://apache-spark-user-list.1001560.n3.nabble.com/When-does-Spark-switch-from-PROCESS-LOCAL-to-NODE-LOCAL-or-RACK-LOCAL-td7091.html
that PROCESS_LOCAL means the data is in the same JVM as the code that is
running. When data is in the same JVM
When the data’s source host is not one of the registered executors, it will
also be marked as PROCESS_LOCAL, though it should have a different NAME for
this. I don’t know whether someone changed this name recently, but for 0.9 that
is the case.
When I say satisfy, yes, if the executors
I am new to Spark; I am using Spark Streaming with Kafka.
My streaming batch duration is 1s.
Assume I get 100 records in second 1, 120 records in second 2, and 80 records in second 3:
-- {sec 1: 1,2,...,100} -- {sec 2: 1,2,...,120} -- {sec 3: 1,2,...,80}
I apply my logic in sec 1 and have a result = result1.
I want to use
The Mahout context does not include _all_ possible transitive dependencies.
It would not be lightning fast if it took all the legacy etc. dependencies.
There's an ignored unit test that asserts context path correctness. You
can un-ignore it and run it to verify it still works as expected. The reason
it is set to
This is my error:
14/10/17 10:24:56 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
14/10/17 10:24:56 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS
However, it seems to work. What does it mean?
Thanks for your reply!
Sorry, I forgot to mention that the Spark version I'm using is *1.0.2*
instead of 1.1.0.
Also, the 1.0.2 documentation does not seem to be the same as 1.1.0's:
http://spark.apache.org/docs/1.0.2/running-on-yarn.html
And I tried your suggestion (upload ) but it did not work:
*1. set my
New open-access research published in the journal Parallel Computing
demonstrates a novel approach to engineering analytics for deployment in
streaming and batch contexts.
Increasing numbers of users are extracting real value from their data using
tools like IBM InfoSphere Streams for
If you don't need part-xxx files in the output but just 1 file, then you should
repartition (or coalesce) the RDD into 1 partition (this will be a bottleneck since you
are disabling the parallelism - it's like giving everything to 1 machine to
process). You are better off merging those part-xxx files afterwards.
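A minimal Scala sketch of the two options (rdd stands in for your RDD; paths are hypothetical):
// Option 1: force a single output file (the final write runs through one task)
rdd.coalesce(1).saveAsTextFile("hdfs:///output/single")
// Option 2: keep the parallel write and merge the part-xxx files afterwards,
// e.g. with `hadoop fs -getmerge /output/parts merged.txt` on the command line
rdd.saveAsTextFile("hdfs:///output/parts")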
The exception drives me crazy because it occurs randomly.
I don't know which line of my code causes this exception.
I don't even understand what KryoException:
java.lang.NegativeArraySizeException means, or even implies.
14/10/20 15:59:01 WARN scheduler.TaskSetManager: Lost task 32.2 in stage
It's a known bug with JDK7 and OS X's naming convention; here's how to resolve
it:
1. Get the Snappy jar file from
http://central.maven.org/maven2/org/xerial/snappy/snappy-java/
2. Copy the appropriate one to your project's class path.
Thanks
Best Regards
On Sun, Oct 19, 2014 at 10:18 PM, bdev
What is the application that you are running, and what is the cluster setup
that you have? Given the logs, it looks like the master is dead for
some reason.
Thanks
Best Regards
On Sun, Oct 19, 2014 at 2:48 PM, randylu randyl...@gmail.com wrote:
In addition, the driver receives several
One approach would be to convert those XML files into JSON files and use
jsonRDD; another approach would be to convert the XML files into Parquet
files and use parquetFile.
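A minimal sketch of the first approach, assuming one XML record per line and a placeholder toJson conversion (the path and table name are hypothetical):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
def toJson(xmlRecord: String): String = ???   // plug in your XML-to-JSON conversion here
val jsonStrings = sc.textFile("hdfs:///data/xml").map(record => toJson(record))
val table = sqlContext.jsonRDD(jsonStrings)   // schema is inferred from the JSON
table.registerTempTable("records")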
Thanks
Best Regards
On Sun, Oct 19, 2014 at 9:38 AM, gtinside gtins...@gmail.com wrote:
Hi,
I have a bunch of XML files
I used to hit this issue when my data size was too large and the number of
partitions was too large (> 1200). I got rid of it by:
- Reducing the number of partitions
- Setting the following while creating the SparkContext:
.set("spark.rdd.compress", "true")
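For reference, a minimal sketch of setting that property when building the context (the app name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")                 // placeholder app name
  .set("spark.rdd.compress", "true")    // compress serialized RDD partitions
val sc = new SparkContext(conf)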
Pinging TD -- I'm sure you know :-)
-kr, Gerard.
On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas gerard.m...@gmail.com wrote:
Hi,
We have been implementing several Spark Streaming jobs that are basically
processing data and inserting it into Cassandra, sorting it among different
keyspaces.
Dear Akhil Das-2,
My application runs in standalone mode, with 50 machines. It's okay if the
input file is small, but if I increase the input to 8GB, the application
runs just several iterations and then prints the following error logs:
14/10/20 17:15:28 WARN AppClient$ClientActor: Connection to
Hi,
I will write the code in Python:
{code:title=test.py}
from operator import add

data = sc.textFile(...).map(...)  ## Please make sure that the RDD looks
                                  ## like [[id, c1, c2, c3], [id, c1, c2, c3], ...]
keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3])))
keypair = keypair.reduceByKey(add)
out = keypair.map(lambda l:
Thank you, Guillaume. My dataset is not that large; it's only ~2GB in total.
2014-10-20 16:58 GMT+08:00 Guillaume Pitel guillaume.pi...@exensa.com:
Hi,
It happened to me with blocks which take more than 1 or 2 GB once
serialized
I think the problem was that during serialization, a Byte Array is
Thanks Akhil Das-2: actually I tried setting spark.default.parallelism but it had no
effect :-/
I am running standalone and performing a mix of map/filter/foreachRDD.
I had to force parallelism with repartition to get both workers to process
tasks, but I do not think this should be required (and I am
I'm getting the same warning on my Mac, accompanied by what appears to be
pretty low CPU usage
(http://apache-spark-user-list.1001560.n3.nabble.com/mlib-model-build-and-low-CPU-usage-td16777.html);
I wonder if they are connected?
I've used jblas on a Mac several times, and it always just works.
Hi, I have an RDD which I want to register as multiple tables based on a key.
val context = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(context)
import sqlContext.createSchemaRDD
case class KV(key:String,id:String,value:String)
val logsRDD =
1) Yes, a single node can have multiple workers. SPARK_WORKER_INSTANCES (in
conf/spark-env.sh) is used to set number of worker instances to run on each
machine (default is 1). If you do set this, make sure to also set
SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each
worker
Hi,
I’m using Spark 1.1.0 and I’m having some issues setting up memory options.
I get “Requested array size exceeds VM limit” and I’m probably missing
something regarding memory configuration
(https://spark.apache.org/docs/1.1.0/configuration.html).
My server has 30G of memory and these are my
Hi Arian,
You will get this exception because you are trying to create an array that
is larger than the maximum contiguous block of memory in your Java VM's heap.
Here, since you are setting the worker memory to *5GB* and you are exporting the
*_OPTS as *8GB*, your application actually thinks it has
1. All RDD operations are executed in workers. So reading a text file or
executing val x = 1 will happen on a worker. (link:
http://stackoverflow.com/questions/24637312/spark-driver-in-apache-spark)
2.
a. Without broadcast: Let's say you have 'n' nodes. You can set Hadoop's
replication factor to n
Hi Akhil,
thanks for your help
but I was originally running without the -Xmx option. With it I was just
trying to push the limit of my heap size, but I was obviously doing it wrong.
Arian Pasquali
http://about.me/arianpasquali
2014-10-20 12:24 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com:
Hi
Try setting SPARK_EXECUTOR_MEMORY=5g (not sure how many workers you
have). You can also set the executor memory on the SparkConf used to create the
SparkContext (like *sparkConf.set("spark.executor.memory", "5g")*).
Thanks
Best Regards
On Mon, Oct 20, 2014 at 5:01 PM, Arian Pasquali ar...@arianpasquali.com
Hi,
The array size you (or the serializer) try to allocate is just too big
for the JVM. No configuration can help:
https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit
The only option is to split your problem further by increasing parallelism.
Guillaume
Hi,
I’m using
Well, reading your logs, here is what happens:
You do a combineByKey (so you probably have a join somewhere), which
spills to disk because it's too big. To spill to disk it serializes, and
the blocks are over 2GB.
From a 2GB dataset, it's easy to expand to several TB.
Increase parallelism, make
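As a rough sketch of what increasing parallelism can look like here (the RDD names and the partition count of 400 are just examples):
// Ask for more, smaller partitions so each serialized block stays well under 2 GB
val combined = pairs.reduceByKey(_ + _, 400)
// or, for an already-built RDD:
val finer = bigRdd.repartition(400)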
In spark-shell, I do the following:
val input = sc.textFile("hdfs://192.168.1.10/people/testinput/")
input.cache()
In the web UI, I cannot see any RDD in the Storage tab. Can anyone tell me how to show the RDD
size? Thank you.
I believe it won't show up there until you trigger an action that causes
the RDD to actually be cached. Remember that certain operations in Spark
are *lazy*, and caching is one of them.
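A minimal illustration, continuing the snippet above (the HDFS path is from the original question):
val input = sc.textFile("hdfs://192.168.1.10/people/testinput/")
input.cache()   // only marks the RDD for caching; nothing is materialized yet
input.count()   // an action: computes the RDD, populates the cache,
                // and only then does it show up in the web UI's Storage tab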
Nick
On Mon, Oct 20, 2014 at 9:19 AM, marylucy qaz163wsx_...@hotmail.com wrote:
in spark-shell,I do in
What about:
http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAF_KkPwk7iiQVD2JzOwVVhQ_U2p3bPVM=-bka18v4s-5-lp...@mail.gmail.com%3E
Regards
Hi Andrew,
The behavior that I see now is that under the hood it tries to reconnect
endlessly. While this lasts, the thread that tries to fire a new task is
blocked at JobWaiter.awaitResult() and never gets released.
The full stacktrace for spark-1.0.2 is:
jmsContainer-7 prio=10
Thank you for your reply!
Is the unpersist operation lazy? If yes, how can I decrease memory usage as quickly as
possible?
On Oct 20, 2014, at 21:26, Nicholas Chammas nicholas.cham...@gmail.com wrote:
I believe it won't show up there until you trigger an action that causes the
RDD to actually be cached.
No, I believe unpersist acts immediately.
On Mon, Oct 20, 2014 at 10:13 AM, marylucy qaz163wsx_...@hotmail.com
wrote:
Thank you for your reply!
Is the unpersist operation lazy? If yes, how can I decrease memory usage as quickly
as possible?
On Oct 20, 2014, at 21:26, Nicholas Chammas
MLlib relies on breeze for much of its linear algebra, which in turn relies on
netlib-java. netlib-java will attempt to load a native BLAS at runtime and then
attempt to load its own precompiled version. Failing that, it will default
back to a Java version that it has built in. The Java
http://spark.apache.org/docs/latest/streaming-programming-guide.html
foreachRDD is executed on the driver….
mn
On Oct 20, 2014, at 3:07 AM, Gerard Maas gerard.m...@gmail.com wrote:
Pinging TD -- I'm sure you know :-)
One detail: even when forcing partitions (repartition), Spark is still holding back
some tasks; if I increase the load on the system (increasing
spark.streaming.receiver.maxRate), even if all workers are used, the one
with the receiver gets twice as many tasks compared with the other workers.
Total
Hi Yin,
Sorry for the delay. I’ll try the code change when I get a chance, but
Michael’s initial response did solve my problem. In the meantime, I’m hitting
another issue with SparkSQL, about which I will probably post another message if I
can’t figure out a workaround.
Thanks,
-Terry
From: Yin
I ran into the same issue when the dataset was very big.
Marcelo from Cloudera found that it may be caused by SPARK-2711, so their
Spark 1.1 release reverted SPARK-2711, and the issue is gone. See
https://issues.apache.org/jira/browse/SPARK-3633 for detail.
You can check out Cloudera's version
Hi,
I'm working on the problem of remotely submitting apps to the spark
master. I'm trying to use the spark-jobserver project
(https://github.com/ooyala/spark-jobserver) for that purpose.
For Scala apps it looks like things are working smoothly, but for Java
apps I have an issue with implementing
Hi all,
Spark Streaming occasionally (not always) hangs indefinitely on my program
right after the first batch has been processed.
As you can see in the following screenshots of the Spark Streaming
monitoring UI, it hangs on the map stages that correspond (I assume) to the
second batch that is
Hello,
I am facing a problem with implementing this - my mapper should emit
multiple keys for the same value - for every input (k, v) it should emit
(k, v), (k+1, v), (k+2, v), ..., (k+n, v).
In MapReduce this was pretty straightforward - I used a for loop and
performed a context.write within that.
flatMap should help; it returns a Seq for every input.
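A minimal sketch, assuming the keys are numeric and pairs is your input RDD of (k, v):
val n = 2   // emit (k, v), (k+1, v), ..., (k+n, v) for every input pair
val expanded = pairs.flatMap { case (k, v) => (0 to n).map(i => (k + i, v)) }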
On Mon, Oct 20, 2014 at 12:31 PM, HARIPRIYA AYYALASOMAYAJULA
aharipriy...@gmail.com wrote:
Hello,
I am facing a problem with implementing this - My mapper should emit
multiple keys for the same value - for every input (k, v) it should
You can do this using flatMap, which returns a Seq of (key, value) pairs.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Mon, Oct 20, 2014 at 9:31 AM, HARIPRIYA AYYALASOMAYAJULA
You also could use Spark SQL:
from pyspark.sql import Row, SQLContext
row = Row('id', 'C1', 'C2', 'C3')
# convert each
data = sc.textFile("test.csv").map(lambda line: line.split(','))
sqlContext = SQLContext(sc)
rows = data.map(lambda r: row(*r))
Hi all,
I’m getting a TreeNodeException for unresolved attributes when I do a simple
select from a schemaRDD generated by a join in Spark 1.1.0. A little background
first. I am using a HiveContext (against Hive 0.12) to grab two tables, join
them, and then perform multiple INSERT-SELECT with
Hi,
*In a reduce operation I am trying to accumulate a list of SparseVectors.
The code is given below:*
val WNode = trainingData.reduce{ (node1: Node, node2: Node) =>
val wNode = new Node(num1,num2)
wNode.WhatList ++= (node1.WList)
Thanks for the pointers
I will look into the oryx2 design and see whether we need a spray/akka-http
based backend... I feel we will, especially when we have a model database for
a number of scenarios (say 100 scenarios, each building a different ALS model).
I am not sure if we really need a full-blown
I have been trying to implement a CustomReceiverStream using Akka actors. I
have a feeder actor which produces messages and a Receiver actor (I give this
to the Spark Streaming context to get an actor stream) which subscribes to the
feeder.
When I create the feeder actor within the scope of the
Have you tried this on master? There were several problems with resolution
of complex queries that were registered as tables in the 1.1.0 release.
On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu terry@smartfocus.com
wrote:
Hi all,
I’m getting a TreeNodeException for unresolved attributes
Hi Michael,
Thanks again for the reply. I was hoping it was something I was doing wrong in
1.1.0, but I’ll try master.
Thanks,
-Terry
From: Michael Armbrust mich...@databricks.com
Date: Monday, October 20, 2014 at 12:11 PM
To: Terry Siu
Hi Prem,
I am experiencing the same problem on Spark 1.0.2 and Job Server 0.4.0
Did you find a solution for this problem?
Thank you,
Hung
I am trying to convert some JSON logs to Parquet and save them on S3.
In principle this is just:
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
data.registerAsTable("data")
data.saveAsParquetFile("s3n://target/path")
This
Hi, everyone,
According to Xiangrui Meng (I think he is the author of ALS), this
problem is caused by Kryo serialization:
In PySpark 1.1, we switched to Kryo serialization by default. However, ALS
code requires special registration with Kryo in order to work. The error
happens when there
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that this configures Spark to use the available
resources.
I can see that Spark will use the available memory on larger instance types.
However, I have never seen Spark running at more than 400% (using 100% on 4
cores)
on
How are you launching the cluster, and how are you submitting the job to
it? Can you list any Spark configuration parameters you provide?
On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler dmah...@gmail.com wrote:
I am launching EC2 clusters using the spark-ec2 scripts.
My understanding is that
Perhaps your RDD is not partitioned enough to utilize all the cores in your
system.
Could you post a simple code snippet and explain what kind of parallelism
you are seeing for it? And can you report on how many partitions your RDDs
have?
On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler
I created an EC2 cluster using the spark-ec2 command. If I run the pi.py example in
the cluster without using the example.jar, it works. But if I add the
example.jar as the driver class (something like the following), it fails with an
exception. Could anyone help with this -- what is the cause of the problem?
I usually run interactively from the spark-shell.
My data definitely has more than enough partitions to keep all the workers
busy.
When I first launch the cluster, I do:
cat <<EOF >> ~/spark/conf/spark-defaults.conf
spark.serializer
I launch the cluster using vanilla spark-ec2 scripts.
I just specify the number of slaves and instance type
On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote:
I usually run interactively from the spark-shell.
My data definitely has more than enough partitions to keep all
Hi All,
I'm trying to save an RDD as a SequenceFile and am not able to use a compression type
(BLOCK or RECORD).
Can anyone let me know how we can use the compression type?
Here is the code I'm using:
RDD.saveAsSequenceFile(target,Some(classOf[org.apache.hadoop.io.compress.GzipCodec]))
Thanks
At the end of a set of computations I have a JavaRDD<String>. I want a
single file where each string is printed in order. The data is small enough
that it is acceptable to handle the printout on a single processor. It may
be large enough that using collect to generate a list might be unacceptable.
Hi,
I have a question about the worker_instances setting and the worker_cores
setting in an AWS EC2 cluster. I understand it is a cluster and the default
setting in the cluster is
*SPARK_WORKER_CORES = 8
SPARK_WORKER_INSTANCES = 1*
However after I changed it to
*SPARK_WORKER_CORES = 8
This was covered a few days ago:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
The multiple output files are actually essential for parallelism, and
certainly not a bad idea. You don't want 100 distributed workers
writing to 1
Are you dealing with gzipped files by any chance? Does explicitly
repartitioning your RDD to match the number of cores in your cluster help
at all? How about if you don't specify the configs you listed and just go
with defaults all around?
On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler
Thanks Akhil.
On Mon, Oct 20, 2014 at 1:57 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
It's a known bug with JDK7 and OS X's naming convention; here's how to resolve
it:
1. Get the Snappy jar file from
http://central.maven.org/maven2/org/xerial/snappy/snappy-java/
2. Copy the appropriate
Sorry I missed the discussion - although it did not answer the question -
in my case (and I suspect the asker's) the 100 slaves are doing a lot of
useful work, but the generated output is small enough to be handled by a
single process.
Many of the large data problems I have worked on process a lot of
Fixed by recompiling. Thanks.
Hello Experts,
After repeated attempts I am unable to run a query on a mapped JSON date string. I
tried two approaches:
*** Approach 1 *** I created a Bean class with a timestamp field. When I try to
run it I get scala.MatchError: class java.sql.Timestamp (of class
java.lang.Class). Here is the code:
import
The biggest danger with gzipped files is this:
>>> raw = sc.textFile('/path/to/file.gz', 8)
>>> raw.getNumPartitions()
1
You think you’re telling Spark to parallelize the reads on the input, but
Spark cannot parallelize reads against gzipped files. So 1 gzipped file
gets assigned to 1 partition.
It might
I am using globs though:
raw = sc.textFile('/path/to/dir/*/*')
and I have tons of files, so 1 file per partition should not be a problem.
On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
The biggest danger with gzipped files is this:
raw =
I think you are running into a bug that will be fixed by this PR:
https://github.com/apache/spark/pull/2850
On Mon, Oct 20, 2014 at 4:34 PM, tridib tridib.sama...@live.com wrote:
Hello Experts,
After repeated attempt I am unable to run query on map json date string. I
tried two approaches:
Hi:
I am using Spark 1.1 and want to add external jars to spark-shell. I
dug around and found others are doing it in two ways.
*Method 1*
bin/spark-shell --jars path-to-jars --master ...
*Method 2*
ADD_JARS=path-to-jars SPARK_CLASSPATH=path-to-jars bin/spark-shell
--master ...
What is
Hi Anny, SPARK_WORKER_INSTANCES is the number of copies of the Spark worker
running on a single box. If you change the number, you change how the
hardware you have is split up (useful for breaking large servers into 32GB
heaps each, which perform better), but it doesn't change the amount of hardware
you
Thanks a lot Andrew! Yeah I actually realized that later. I made a silly
mistake here.
On Mon, Oct 20, 2014 at 6:03 PM, Andrew Ash and...@andrewash.com wrote:
Hi Anny, SPARK_WORKER_INSTANCES is the number of copies of spark workers
running on a single box. If you change the number you change
My observation is the opposite. When my job runs under the default
spark.shuffle.manager, I don't see this exception. However, when it runs
with the sort-based one, I start seeing this error. How would that be possible?
I am running my job on YARN, and I noticed that the YARN process limits
(cat
Hi Song,
From what I know about sort-based shuffle:
normally the number of files opened in parallel for sort-based shuffle is much smaller
than for hash-based shuffle.
In hash-based shuffle, the number of files opened in parallel is C * R (where C is the
number of cores used and R is the number of reducers); as you can see, the file
--jars (ADD_JARS) uses Spark's special class loading, while
--driver-class-path (SPARK_CLASSPATH) is captured by the startup scripts and
appended to the classpath settings that are used to start the JVM running the
driver.
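For illustration, a hypothetical invocation passing the same jar through both mechanisms (master URL and paths are placeholders):
bin/spark-shell --master spark://master:7077 \
  --jars /path/to/extra.jar \
  --driver-class-path /path/to/extra.jar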
You can reference
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
Hi Tridib,
For the second approach, can you attach the complete stack trace?
Thanks,
Yin
On Mon, Oct 20, 2014 at 8:24 PM, Michael Armbrust mich...@databricks.com
wrote:
I think you are running into a bug that will be fixed by this PR:
https://github.com/apache/spark/pull/2850
On Mon, Oct
The cluster also runs other applications every hour as normal, so the master
is always running. No matter how many cores I use or how much input data there
is (as long as it is big enough), the application just fails about 1.1 hours later.
I think this has something to do with my recent work at
https://issues.apache.org/jira/browse/SPARK-4003
You can check PR https://github.com/apache/spark/pull/2850 .
Thanks,
Daoyuan
From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Tuesday, October 21, 2014 10:00 AM
To: Michael Armbrust
Cc:
Hi,
I am trying to use Spark to perform some basic text processing on news
articles.
Recently I have been facing issues with code that ran perfectly well on the same
data before.
I am pasting the last few lines, including the exception message. I am
using Python.
Can anybody suggest a remedy?
Thanks,
It seems the second approach does not go through applySchema. So I was
wondering if there is an issue related to our JSON APIs in Java.
On Mon, Oct 20, 2014 at 10:04 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
I think this has something to do with my recent work at
I got that, it is in JsonRDD.java of `typeOfPrimitiveValues`. I’ll fix that
together.
Thanks,
Daoyuan
From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Tuesday, October 21, 2014 10:13 AM
To: Wang, Daoyuan
Cc: Michael Armbrust; tridib; u...@spark.incubator.apache.org
Subject: Re: spark sql:
Seems I made a mistake…
From: Wang, Daoyuan
Sent: Tuesday, October 21, 2014 10:35 AM
To: 'Yin Huai'
Cc: Michael Armbrust; tridib; u...@spark.incubator.apache.org
Subject: RE: spark sql: timestamp in json - fails
I got that, it is in JsonRDD.java of `typeOfPrimitiveValues`. I’ll fix that
Hi Spark SQL team,
I am trying to explore automatic schema detection for JSON documents. I have a few
questions:
1. What should the date format be for fields to be detected as date type?
2. Is automatic schema inference slower than applying a specific schema?
3. At this moment I am parsing JSON myself using map
Hi, All
Is there any way to convert an Iterable to an RDD?
Thanks,
Kevin.
Sounds more like a use case for using collect... and writing out the file
in your program?
On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis lordjoe2...@gmail.com wrote:
Sorry I missed the discussion - although it did not answer the question -
In my case (and I suspect the askers) the 100 slaves
Stack trace for my second case:
2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0]
executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in
stage 0.0 (TID 0)
scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
at
Hey Steve - the way to do this is to use the coalesce() function to
coalesce your RDD into a single partition. Then you can do a saveAsTextFile
and you'll wind up with outputDir/part-00000 containing all the data.
-Ilya Ganelin
On Mon, Oct 20, 2014 at 11:01 PM, jay vyas
That's weird; I think we have that pattern match in enforceCorrectType. What
version of Spark are you using?
Thanks,
Daoyuan
-Original Message-
From: tridib [mailto:tridib.sama...@live.com]
Sent: Tuesday, October 21, 2014 11:03 AM
To: u...@spark.incubator.apache.org
Subject: Re: spark
In addition, how can I convert an Iterable[Iterable[T]] to an RDD[T]?
Thanks,
Kevin.
From: Dai, Kevin [mailto:yun...@ebay.com]
Sent: October 21, 2014 10:58
To: user@spark.apache.org
Subject: Convert Iterable to RDD
Hi, All
Is there any way to convert iterable to RDD?
Thanks,
Kevin.
Spark 1.1.0
The exception from the second approach has been resolved by SPARK-3853.
Thanks,
Daoyuan
-Original Message-
From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Tuesday, October 21, 2014 11:06 AM
To: tridib; u...@spark.incubator.apache.org
Subject: RE: spark sql: timestamp in json -
Yes, SPARK-3853 just got merged 11 days ago. It should be OK in 1.2.0. And for
the first approach, it would be OK after SPARK-4003 is merged.
-Original Message-
From: tridib [mailto:tridib.sama...@live.com]
Sent: Tuesday, October 21, 2014 11:09 AM
To: u...@spark.incubator.apache.org
Could you show your Spark version?
And the value of `spark.default.parallelism` you are setting?
Best Regards,
Yi Tian
tianyi.asiai...@gmail.com
On Oct 20, 2014, at 12:38, Kevin Jung itsjb.j...@samsung.com wrote:
Hi,
I usually use a file on HDFS to make a PairRDD and analyze it by using
I think you could use `repartition` to make sure there are no empty
partitions.
You could also try `coalesce` to combine partitions, but it can't guarantee
that there will be no empty partitions.
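A minimal sketch of the two options (rdd stands in for your RDD; the target count of 64 is just an example):
val evened = rdd.repartition(64)   // full shuffle; evens out partition sizes, so no empty partitions
val merged = rdd.coalesce(64)      // avoids a shuffle, but may still leave uneven or empty partitions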
Best Regards,
Yi Tian
tianyi.asiai...@gmail.com
On Oct 18, 2014, at 20:30,
Hello Experts,
I have two tables built using jsonFile(). I can successfully run a join query
on these tables. But once I cacheTable(), all join queries fail.
Here is the stack trace:
java.lang.NullPointerException
at
I use Spark 1.1.0 and set these options to spark-defaults.conf
spark.scheduler.mode FAIR
spark.cores.max 48
spark.default.parallelism 72
Thanks,
Kevin
You could try setting spark.akka.frameSize while creating the
SparkContext, but it's strange that the message shown says your
master is dead; usually it's the other way around (the executor dies). Can you also
explain the behavior of your application (what exactly are you doing over
the 8GB of data)?