Hello,
According to Spark Documentation at
https://spark.apache.org/docs/1.2.1/submitting-applications.html :
--conf: Arbitrary Spark configuration property in key=value format. For
values that contain spaces wrap “key=value” in quotes (as shown).
And indeed, when I use that parameter, in my
You can use the coalesce method to reduce the number of partitions. You can
reduce to one if the data is not too big. Then write the output.
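For example, a minimal sketch (the RDD name and paths below are just placeholders):

val results = sc.textFile("hdfs:///input/path")
// Collapse to a single partition so a single output file is written;
// only do this if the data fits comfortably in one task.
results.coalesce(1).saveAsTextFile("hdfs:///output/single-file")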
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
Hi,
I am querying HBase via Spark SQL with the Java APIs.
Step 1:
creating JavaPairRDD, then JavaRDD, then JavaSchemaRDD.applySchema objects.
Step 2:
sqlContext.sql(sql query).
If I am updating my HBase database between these two steps (via the hbase shell in some
other console), the query in step two is not
In 1.2 it's a member of SchemaRDD and it becomes available on RDD (through
the type class mechanism) when you add a SQLContext, like so.
val sqlContext = new SQLContext(sc)
import sqlContext._
In 1.3, the method has moved to the new DataFrame type.
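A minimal sketch of the 1.2-style usage, assuming a hypothetical case class and RDD:

import org.apache.spark.sql.SQLContext

case class Record(id: Int, name: String)   // made-up schema for illustration

val sqlContext = new SQLContext(sc)
import sqlContext._   // brings in the implicits that add registerTempTable to RDDs of case classes

val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
rdd.registerTempTable("MyTable")           // 1.2: works via the implicit SchemaRDD conversion
sqlContext.sql("SELECT * FROM MyTable").collect()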
Dean Wampler, Ph.D.
Author: Programming Scala,
Hi,
I am running Spark; when I use sc.version I get 1.2, but when I call
registerTempTable(MyTable) I get an error saying registerTempTable is not a
member of RDD.
Why?
--
Eran | CTO
Hi,
I am using a CDH 5.3.2 package installation through Cloudera Manager 5.3.2.
I am trying to run one Spark job with the following command:
PYTHONPATH=~/code/utils/ spark-submit --master yarn --executor-memory 3G
--num-executors 30 --driver-memory 2G --executor-cores 2 --name=analytics
From: Williams, Ken Williams
ken.willi...@windlogics.com
Date: Thursday, March 19, 2015 at 10:59 AM
To: Spark list user@spark.apache.org
Subject: JAVA_HOME problem with upgrade to 1.3.0
[…]
Finally, I go and check the YARN
You can include * and a column alias in the same select clause
var df1 = sqlContext.sql("select *, column_id AS table1_id from table1")
FYI, this does not ultimately work as the * still includes column_id and
you cannot have two columns of that name in the joined DataFrame. So I
ended up
Hi,
I have received three replies to my question at my personal e-mail; why
don't they also show up on the mailing list? I would like to reply to the 3
users within a thread.
Thanks,
Maria
That's a Hadoop version incompatibility issue; you need to make sure
everything runs on the same version.
Thanks
Best Regards
On Sat, Mar 21, 2015 at 1:24 AM, morfious902002 anubha...@gmail.com wrote:
Hi,
I created a cluster using the spark-ec2 script, but it installs HDFS version
1.0. I would
Is there a way to tunnel the Spark UI?
I tried to tunnel client-node:4040 but my browser was redirected from
localhost to some domain name visible only inside the cluster.
Maybe there is some startup option to make the Spark UI fully
accessible through a single endpoint (address:port)?
Serg.
--
I caught exceptions in the Python UDF code, flushed the exceptions into a single
file, and made sure the column count of the output lines is the same as in the
sql schema.
Something interesting is that my output line from the UDF code has just 10 columns,
and the exception above is
Hi,
I have a simple question about creating RDDs. Whenever an RDD is created in
Spark Streaming for a particular time window, where does the RDD get
stored?
1. Does it get stored on the driver machine, or does it get stored on all the
machines in the cluster?
2. Does the data get stored in memory
Hey Abhi,
many of StreamingContext's methods to create input streams take a
StorageLevel parameter to configure this behavior. RDD partitions are
generally stored in the in-memory cache of worker nodes I think. You can
also configure replication and spilling to disk if needed.
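For example, a rough sketch (host and port are placeholders):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// Keep received blocks serialized in memory, spill to disk if needed,
// and replicate to a second worker for fault tolerance.
val lines = ssc.socketTextStream("stream-host", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)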
Regards,
Jeff
Michael, thank you for the workaround and for letting me know of the
upcoming enhancements, both of which sound appealing.
On Sun, Mar 22, 2015 at 1:25 PM, Michael Armbrust mich...@databricks.com
wrote:
You can include * and a column alias in the same select clause
var df1 =
Thanks.
I am new to the environment and am running Cloudera CDH 5.3 with Spark in it.
Apparently, when running this command in spark-shell: val sqlContext = new
SQLContext(sc)
it fails with "not found: type SQLContext".
Any idea why?
On Mon, Mar 23, 2015 at 3:05 PM, Dean Wampler
Have you tried adding the following ?
import org.apache.spark.sql.SQLContext
Cheers
On Mon, Mar 23, 2015 at 6:45 AM, IT CTO goi@gmail.com wrote:
Thanks.
I am new to the environment and running cloudera CDH5.3 with spark in it.
apparently when running in spark-shell this command val
Hi,
We have a JavaRDD mapped to an HBase table, and when we query the HBase table
using the Spark SQL API we can access the data. However, when we update the HBase table
while the SparkSQL SparkConf is initialised, we cannot see the updated data. Is
there any way we can have the RDD mapped to HBase
This is a very interesting issue; the root cause of the lower performance
is probably that, for a Scala UDF, Spark SQL converts the data from its internal
representation to the Scala representation via Scala reflection, recursively.
Can you create a JIRA issue for tracking this? I can start to work on
In this thread:
http://search-hadoop.com/m/JW1q5DM69G
I only saw two replies. Maybe some people forgot to use 'Reply to All' ?
Cheers
On Mon, Mar 23, 2015 at 8:19 AM, mrm ma...@skimlinks.com wrote:
Hi,
I have received three replies to my question on my personal e-mail, why
don't they also
Is yarn-standalone mode deprecated in Spark now? The reason I am asking is
that while I can find it in the 0.9.0
documentation (https://spark.apache.org/docs/0.9.0/running-on-yarn.html), I
am not able to find it in 1.2.0.
I am using this mode to run the Spark jobs from Oozie as a java action.
Data is not (necessarily) sorted when read from disk, no. A file might
have many blocks even, and while a block yields a partition in
general, the order in which those partitions appear in the RDD is not
defined. This is why you'd sort if you need the data sorted.
I think you could conceivably
A basic change needed by Spark is setting the executor memory, which
defaults to 512 MB.
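For example (master URL, memory value and jar name are only placeholders):

spark-submit --master spark://master:7077 --executor-memory 2G your-app.jar
# or, when starting the thrift server:
./sbin/start-thriftserver.sh --master spark://master:7077 --executor-memory 2G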
On Mon, Mar 23, 2015 at 10:16 AM, Denny Lee denny.g@gmail.com wrote:
How are you running your spark instance out of curiosity? Via YARN or
standalone mode? When connecting Spark thriftserver
OK, I found what the problem is: it couldn't work with mysql-connector-5.0.8.
I updated the connector version to 5.1.34 and it worked.
Thanks Ted, I'll try; hope there are no transitive dependencies on 3.2.10.
On Tue, Mar 24, 2015 at 4:21 AM, Ted Yu yuzhih...@gmail.com wrote:
Looking at core/pom.xml :
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_${scala.binary.version}</artifactId>
Hi, Yin
But our data is in a customized sequence file format which can be read by our
customized load function in Pig,
and I want to use Spark to reuse these load functions to read the data and
transform it into RDDs.
Best Regards,
Kevin.
From: Yin Huai [mailto:yh...@databricks.com]
Sent: March 24, 2015, 11:53
To: Dai,
Thanks Marcelo, this option solved the problem (I'm using 1.3.0), but it
works only if I remove extends Logging from the object; with extends
Logging it returns:
Exception in thread main java.lang.LinkageError: loader constraint
violation in interface itable initialization: when resolving method
The mode is not deprecated, but the name yarn-standalone is now
deprecated. It's now referred to as yarn-cluster.
-Sandy
On Mon, Mar 23, 2015 at 11:49 AM, nitinkak001 nitinkak...@gmail.com wrote:
Is yarn-standalone mode deprecated in Spark now. The reason I am asking is
because while I can
Hi Yin,
Yes, I have set spark.executor.memory to 8g and the worker memory to 16g
without any success.
I cannot figure out how to increase the number of mapPartitions tasks.
Thanks a lot
On 20 March 2015 at 18:44, Yin Huai yh...@databricks.com wrote:
spark.sql.shuffle.partitions only control
InputSplit is in the hadoop-mapreduce-client-core jar.
Please check that the jar is in your classpath.
Cheers
On Mon, Mar 23, 2015 at 8:10 AM, Roy rp...@njit.edu wrote:
Hi,
I am using CDH 5.3.2 packages installation through Cloudera Manager 5.3.2
I am trying to run one spark job with
In case anyone wants to learn about my solution for this:
groupByKey is highly inefficient due to the shuffling of elements between the
different partitions, as well as requiring enough memory in each worker to
handle the elements for each group. So instead of using groupByKey, I ended
up taking the
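For reference, a minimal sketch of the usual replacement, combining values per key with reduceByKey instead of materializing whole groups (the pair RDD is made up):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// groupByKey would ship every value for a key to a single worker:
// val sums = pairs.groupByKey().mapValues(_.sum)
// reduceByKey combines values map-side first, so far less data is shuffled:
val sums = pairs.reduceByKey(_ + _)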
The former is deprecated. However, the latter is functionally equivalent
to it. Both launch an app in what is now called yarn-cluster mode.
Oozie now also has a native Spark action, though I'm not familiar on the
specifics.
-Sandy
On Mon, Mar 23, 2015 at 1:01 PM, Nitin kak
Hey all,
I'd like to use the Scalaz library in some of my Spark jobs, but am running
into issues where some stuff I use from Scalaz is not serializable. For
instance, in Scalaz there is a trait
/** In Scalaz */
trait Applicative[F[_]] {
def apply2[A, B, C](fa: F[A], fb: F[B])(f: (A, B) => C):
bq. is to modify compute_classpath.sh on all worker nodes to include your
driver JARs.
Please follow the above advice.
Cheers
On Mon, Mar 23, 2015 at 12:34 PM, Jack Arenas j...@ckarenas.com wrote:
Hi Team,
I’m trying to create a DF using jdbc as detailed here
Yes each application can use its own log4j.properties but I am not sure how
to configure log4j so that the driver and executor write to file. This is
because if we set the spark.executor.extraJavaOptions it will read from a
file and that is not what I need.
How do I configure log4j from the app so
I think I didn't explain myself properly :) What I meant to say was that
generally a Spark worker runs either on HDFS data nodes or on Cassandra
nodes, which are typically in a private network (protected). When a
condition is matched it's difficult to send out the alerts directly from
the worker
Hi,
I am having an issue starting the spark-thriftserver. I'm running Spark 1.3 with
Hadoop 2.4.0. I would like to be able to change its port too, so I can have the
hive-thriftserver as well as the spark-thriftserver running at the same time.
Starting sparkthrift server:-
sudo ./start-thriftserver.sh --master
That's a very old page, try this instead:
http://spark.apache.org/docs/latest/running-on-mesos.html
When you run your Spark job on Mesos, tasks will be started on the slave
nodes as needed, since fine-grained mode is the default.
For a job like your example, very few tasks will be needed.
Have you tried instantiating the instance inside the closure, rather than
outside of it?
If that works, you may need to switch to use mapPartition /
foreachPartition for efficiency reasons.
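A minimal sketch of that second suggestion, creating the (non-serializable) instance once per partition instead of capturing it in the closure (ExpensiveClient is a made-up stand-in):

class ExpensiveClient { def lookup(x: Int): Int = x * 2 }   // hypothetical non-serializable helper

val rdd = sc.parallelize(1 to 100)
val result = rdd.mapPartitions { iter =>
  val client = new ExpensiveClient()   // constructed on the worker, never serialized
  iter.map(client.lookup)
}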
On Mon, Mar 23, 2015 at 3:03 PM, Adelbert Chang adelbe...@gmail.com wrote:
Is there no way to pull out
I think this is a bad example since testData is not deterministic at
all. I thought we had fixed this or similar examples in the past? As
in https://github.com/apache/spark/pull/1250/files
Hm, anyone see a reason that shouldn't be changed too?
On Mon, Mar 23, 2015 at 7:00 PM, Ofer Mendelevitch
Thanks Sean,
Sorting definitely solves it, but I was hoping it could be avoided :)
In the documentation for Classification in ML-Lib for example, zip() is used to
create labelsAndPredictions:
-
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and
Is it safe to access SparkEnv.get inside, say, mapPartitions?
I need to get a Serializer (so SparkEnv.get.serializer).
thanks
Well, it's complaining about trait OptionInstances which is defined in
Option.scala in the std package. Use scalap or javap on the scalaz library
to find out which member of the trait is the problem, but since it says
$$anon$1, I suspect it's the first value member, implicit val
optionInstance,
Hi Team,
I’m trying to create a DF using jdbc as detailed here – I’m currently using DB2
v9.7.0.6 and I’ve tried to use the db2jcc.jar and db2jcc_license_cu.jar combo,
and while it works in --master local using the command below, I get some
strange behavior in --master yarn-client. Here is
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do
this: https://github.com/sigmoidanalytics/spork
On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin yun...@ebay.com wrote:
Hi, all
Can spark use pig’s load function to load data?
Best Regards,
Kevin.
Is there no way to pull out the bits of the instance I want before I send
it through the closure for aggregate? I did try pulling things out, along
the lines of
def foo[G[_], B](blah: Blah)(implicit G: Applicative[G]) = {
val lift: B => G[RDD[B]] = b => G.point(sparkContext.parallelize(List(b)))
In the comments on SparkContext.objectFile it says:
It will also be pretty slow if you use the default serializer (Java
serialization)
This suggests the spark.serializer is used, which means I can switch to the
much faster Kryo serializer. However, when I look at the code it uses
There isn't any automated way. Note that as the DataFrame implementation
improves, it will probably do a better job with query optimization than
hand-rolled Scala code. I don't know if that's true yet, though.
For now, there are a few examples at the beginning of the DataFrame
scaladocs
There is not an interface to this at this time, and in general I'm hesitant
to open up interfaces where the user could make a mistake where they think
something is going to improve performance but will actually impact
correctness. Since, as you say, we are picking the partitioner
automatically in
One approach would be to repartition the whole dataset into 1 partition (a costly
operation, but it will give you a single file). Also, you could try
using zipWithIndex before writing it out.
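A rough sketch of both suggestions (paths are placeholders):

val data = sc.textFile("hdfs:///input/path")
// Option 1: shuffle everything into one partition so a single output file is written.
data.repartition(1).saveAsTextFile("hdfs:///output/single-file")
// Option 2: attach a global index to each element before writing.
data.zipWithIndex().map { case (line, idx) => s"$idx\t$line" }
  .saveAsTextFile("hdfs:///output/indexed")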
Thanks
Best Regards
On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert
m_albert...@yahoo.com.invalid wrote:
Hi all, I tried to move some Hive jobs to Spark SQL. When I ran a SQL
job with a Python UDF I got an exception:
java.lang.ArrayIndexOutOfBoundsException: 9
at
org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
at
Akhil,
that's what I did.
The problem is that the web server probably tried to forward my request to another
address accessible locally only.
On 23 March 2015 at 11:12, Akhil Das ak...@sigmoidanalytics.com wrote:
Did you try ssh -L 4040:127.0.0.1:4040 user@host
Thanks
Best Regards
Dear Taotao,
Yes, I tried sparkCSV.
Thanks,
Nawwar
On Mon, Mar 23, 2015 at 12:20 PM, Taotao.Li taotao...@datayes.com wrote:
can it load successfully if the format is invalid?
--
From: Ahmed Nawar ahmed.na...@gmail.com
To: user@spark.apache.org
Sent:
Dear Raunak,
The source system provided logs with some errors. I need to make sure each
row is in the correct format (the number of columns/attributes and the data types
are correct) and move incorrect rows to a separate list.
Of course I can write my own logic, but I need to make sure there is no direct way.
You can try playing with spark.streaming.blockInterval so that it won't
consume a lot of data; the default value is 200 ms.
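For example, a sketch of setting it through SparkConf (the value is only illustrative; in Spark 1.x the property is interpreted as milliseconds):

val conf = new org.apache.spark.SparkConf()
  .setAppName("streaming-app")
  .set("spark.streaming.blockInterval", "500")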
Thanks
Best Regards
On Fri, Mar 20, 2015 at 8:49 PM, jamborta jambo...@gmail.com wrote:
Hi all,
We are designing a workflow where we try to stream local files to a Socket
Hi,
I'm working on a system which has to deal with time series data. I've been
happy using Cassandra for time series and Spark looks promising as a
computational platform.
I consider chunking time series in Cassandra necessary, e.g. by 3 weeks as
kairosdb does it. This allows an 8 byte chunk
Hi,
I executed a task on Spark in YARN and it failed.
I see just an executor lost message from YARNClientScheduler, no further
details.
(I read this error can be connected to the spark.yarn.executor.memoryOverhead
setting and I have already played with this param.)
How can I dig more deeply into the details in the log files
can it load successfully if the format is invalid?
- Original Message -
From: Ahmed Nawar ahmed.na...@gmail.com
To: user@spark.apache.org
Sent: Monday, March 23, 2015, 4:48:54 PM
Subject: Data/File structure Validation
Dears,
Is there any way to validate the CSV, Json ... Files while loading to
Could you elaborate on the UDF code?
On 3/23/15 3:43 PM, lonely Feb wrote:
Hi all, I tried to transfer some hive jobs into spark-sql. When i ran
a sql job with python udf i got a exception:
java.lang.ArrayIndexOutOfBoundsException: 9
at
What exactly do you mean by alerts?
Something specific to your data, or general events of the Spark cluster? For
the first, something like Akhil suggested should work. For the latter, I would
suggest having a log consolidation system like Logstash in place and using
it to generate alerts.
Regards,
Jeff
Dears,
Is there any way to validate CSV, JSON, ... files while loading them into a
DataFrame?
I need to ignore corrupted rows (rows not matching the
schema).
Thanks,
Ahmed Nawwar
What do you mean you can't send it directly from spark workers? Here's a
simple approach which you could do:
val data = ssc.textFileStream("sigmoid/")
val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd =>
alert("Errors : " + rdd.count()))
And the alert() function could be anything
Was a solution ever found for this? I am trying to run some test cases with sbt
test which use Spark SQL, and with the Spark 1.3.0 release and Scala 2.11.6 I get
this error. Setting fork := true in sbt seems to work, but it's a less than
ideal workaround.
On Tue, Mar 17, 2015 at 9:37 PM, Eric Charles
ok i'll try asap
2015-03-23 17:00 GMT+08:00 Cheng Lian lian.cs@gmail.com:
I suspect there is a malformed row in your input dataset. Could you try
something like this to confirm:
sql("SELECT * FROM your-table").foreach(println)
If there does exist a malformed line, you should see similar
sql("SELECT * FROM your-table").foreach(println)
can be executed successfully. So the problem may still be in the UDF code. How
can I print the line with the ArrayIndexOutOfBoundsException in Catalyst?
2015-03-23 17:04 GMT+08:00 lonely Feb lonely8...@gmail.com:
ok i'll try asap
2015-03-23 17:00
It seems your driver is getting flooded by that many executors and hence
it times out. There are some configuration options like
spark.akka.timeout etc.; you could try playing with those. More information
is available here:
http://spark.apache.org/docs/latest/configuration.html
Thanks
I did not build my own Spark. I got the binary version online. If it can
load the native libs from the IDE, it should also be able to load them when
running with --master local.
On Mon, 23 Mar 2015 07:15 Burak Yavuz brk...@gmail.com wrote:
Did you build Spark with: -Pnetlib-lgpl?
Ref:
I suspect there is a malformed row in your input dataset. Could you try
something like this to confirm:
sql("SELECT * FROM your-table").foreach(println)
If there does exist a malformed line, you should see a similar exception.
And you can catch it with the help of the output. Notice that the
My test env:
1. Spark version is 1.3.0
2. 3 nodes, each with 80G memory / 20 cores
3. read 250G of parquet files from HDFS
Test case:
1. register a floor func with the command:
sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), then run
with sql: select chan, floor(ts) as tt, sum(size) from qlogbase3 group by
chan,
Akhil
You are right in your answer to what Mohit wrote. However, what Mohit seems to
be alluding to, but did not write properly, might be different.
Mohit,
You are wrong in saying that streaming generally works with HDFS and Cassandra.
Streaming typically works with a streaming or queuing source like Kafka,
Oh, in that case you could try adding the hostname to your /etc/hosts under
your localhost entry. Also check whether a request is going to another host by
inspecting the network calls:
Thanks
Best Regards
On Mon, Mar 23, 2015 at 1:55 PM, Sergey Gerasimov ser...@gmail.com
Hello Sergun,
Generally you can use
yarn application -list
to see the applicationIDs of applications and then you can see the logs
of finished applications using:
yarn logs -applicationId applicationID
Hope this helps.
--
Emre Sevinç
http://www.bigindustries.be/
On Mon, Mar 23, 2015
It is actually the number of cores. If your processor has hyperthreading then
it will be more (the number of processors your OS sees).
On Sun, 22 Mar 2015 at 4:51 PM, Ted Yu yuzhih...@gmail.com
wrote:
I assume spark.default.parallelism is 4 in the VM Ashish was using.
Cheers
It seems that node is not getting allocated enough tasks; try
increasing your level of parallelism or do a manual repartition so that
every node gets an even share of tasks to operate on.
Thanks
Best Regards
On Fri, Mar 20, 2015 at 8:05 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:
Hi all,
I have 6
Hi Nikos,
We experienced something similar in our setting, where the Spark app was
supposed to write the final state changes to a Redis instance. Over time
the delay caused by re-writing the entire dataset in each iteration
exceeded the Spark Streaming batch size.
In our case the solution was
Did you try ssh -L 4040:127.0.0.1:4040 user@host
Thanks
Best Regards
On Mon, Mar 23, 2015 at 1:12 PM, sergunok ser...@gmail.com wrote:
Is it a way to tunnel Spark UI?
I tried to tunnel client-node:4040 but my browser was redirected from
localhost to some cluster locally visible domain
Hi, all
Can spark use pig's load function to load data?
Best Regards,
Kevin.
Maybe implement a very simple function that uses the Hadoop API to read in
based on file names (i.e. parts)?
On Mon, Mar 23, 2015 at 10:55 AM, Koert Kuipers ko...@tresata.com wrote:
there is a way to reinstate the partitioner, but that requires
sc.objectFile to read exactly what i wrote, which
Instantiating the instance? The actual instance it's complaining about is:
https://github.com/scalaz/scalaz/blob/16838556c9309225013f917e577072476f46dc14/core/src/main/scala/scalaz/std/Option.scala#L10-11
The specific import where it's picking up the instance is:
Spark has a dependency on json4s 3.2.10, but this version has several bugs
and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to
build.sbt and everything compiled fine. But when I spark-submit my JAR it
provides me with 3.2.10.
build.sbt
import sbt.Keys._
name := "sparkapp"
On Mon, Mar 23, 2015 at 2:15 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Found the issue above error - the setting for spark_shuffle was incomplete.
Now it is able to ask and get additional executors. The issue is once they
are released, it is not able to proceed with next query.
That
For me, it's only working if I set --driver-class-path to the MySQL library.
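For example (jar path and connector version are placeholders):

spark-submit \
  --driver-class-path /path/to/mysql-connector-java-5.1.34.jar \
  your-app.jar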
On Sun, Mar 22, 2015 at 11:29 PM, gavin zhang gavin@gmail.com wrote:
OK,I found what the problem is: It couldn't work with
mysql-connector-5.0.8.
I updated the connector version to 5.1.34 and it worked.
Hello,
I am running TeraSort https://github.com/ehiggs/spark-terasort on 100GB
of data. The final metrics I am getting on Shuffle Spill are:
Shuffle Spill(Memory): 122.5 GB
Shuffle Spill(Disk): 3.4 GB
What's the difference and the relation between these two metrics? Do these
mean 122.5 GB was
Currently it's pretty hard to control the Hadoop Input/Output formats used
in Spark. The convention seems to be to add extra parameters to all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into
settings on the
Log shows stack traces that seem to match the assert in JIRA so it seems I
am hitting the issue. Thanks for the heads up ...
15/03/23 20:29:50 ERROR actor.OneForOneStrategy: assertion failed:
Allocator killed more executors than are allocated!
java.lang.AssertionError: assertion failed: Allocator
+1 - I currently am doing what Marcelo is suggesting as I have a CDH 5.2
cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in
my cluster.
On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin van...@cloudera.com wrote:
Since you're using YARN, you should be able to download a
Found the issue with the above error - the setting for spark_shuffle was incomplete.
Now it is able to ask for and get additional executors. The issue is that once they
are released, it is not able to proceed with the next query.
The environment is CDH 5.3.2 (Hadoop 2.5) with Kerberos and Spark 1.3.
After idle time, the
Is there a way to take advantage of the underlying datasource partitions
when generating a DataFrame/SchemaRDD via catalyst? It seems from the sql
module that the only options are RangePartitioner and HashPartitioner - and
further that those are selected automatically by the code. It was not
Hi Emre,
The --conf property is meant to work with yarn-cluster mode.
System.getProperty(key) isn't guaranteed, but new SparkConf().get(key)
should. Does it not?
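A minimal sketch of reading such a property from the driver (the key name is made up):

import org.apache.spark.SparkConf

// Picks up properties passed on the command line, e.g. --conf spark.myapp.someKey=someValue
val value = new SparkConf().get("spark.myapp.someKey")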
-Sandy
On Mon, Mar 23, 2015 at 8:39 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
According to Spark Documentation at
Hi All,
Suppose I have a parquet file of 100 MB in HDFS and my HDFS block size is 64 MB, so
I have 2 blocks of data.
When I do sqlContext.parquetFile(path) followed by an action, two
tasks are started on two partitions.
My intent is to read these 2 blocks into more partitions to fully utilize my
cluster
Have you tried to repartition() your original data to make more partitions
before you aggregate?
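i.e. something along these lines (the key extraction and partition count are only illustrative):

val pairs = sc.textFile("hdfs:///input/path")
  .map(line => (line.split(",")(0), 1L))
val aggregated = pairs
  .repartition(200)        // spread the records over more, smaller tasks first
  .reduceByKey(_ + _)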
--
Martin Goodson | VP Data Science
(0)20 3397 1240
On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:
Hi Yin,
Yes, I have set
I just realized the major limitation is that I lose the partitioning info...
On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote:
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote:
so finally i can resort to:
rdd.saveAsObjectFile(...)
sc.objectFile(...)
Hey Akriti23,
pyspark gives you a saveAsParquetFile() API to save your RDD as parquet.
You will, however, need to infer the schema or describe it manually before
you can do so. Here are some docs about that (v1.2.1; you can search for the
others, they're relatively similar from 1.1 and up):
I have a Mesos cluster, which I deploy Spark to using the instructions at
http://spark.apache.org/docs/0.7.2/running-on-mesos.html
After that the spark shell starts up fine.
Then I try the following in the shell:
val data = 1 to 1
val distData = sc.parallelize(data)
distData.filter(_
Hi Sean,
Thanks a ton for your reply.
The particular situation I have is case (3) that you have mentioned. The
class that I am using from commons-net is FTPClient(). This class is
present in both the 2.2 version and the 3.3 version. However, in the 3.3
version there are two additional methods
I think the explanation is that the join does not guarantee any order,
since it causes a shuffle in general, and it is computed twice in the
first example, resulting in a difference for d1 and d2.
You can persist() the result of the join and in practice I believe
you'd find it behaves as
There is a way to reinstate the partitioner, but that requires
sc.objectFile to read exactly what I wrote, which means sc.objectFile
should never split files on reading (a feature of the Hadoop file InputFormat
that gets in the way here).
On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers
I think it's spark.yarn.user.classpath.first in 1.2, and
spark.{driver,executor}.extraClassPath in 1.3. Obviously the first one applies
only if you are using YARN.
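For example, something like this in 1.3 (paths are placeholders):

spark-submit \
  --conf spark.driver.extraClassPath=/path/to/commons-net-3.3.jar \
  --conf spark.executor.extraClassPath=/path/to/commons-net-3.3.jar \
  your-app.jar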
On Mon, Mar 23, 2015 at 5:41 PM, Jacob Abraham abe.jac...@gmail.com wrote:
Hi Sean,
Thanks a ton for you reply.
The
Thanks for the information! (to all who responded)
The code below *seems* to work. Any hidden gotchas that anyone sees?
And still, in terasort, how did they check that the data was actually sorted?
:-)
-Mike
class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T]{
override def
I have a complex SparkSQL query of the nature
select a.a, b.b, c.c from a,b,c where a.x = b.x and b.y = c.y
How do I convert this efficiently into a Scala query of
a.join(b,..,..)
and so on. Can anyone help me with this? If my question needs more
clarification, please let me know.
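A rough sketch of the equivalent DataFrame chain, assuming a, b and c are DataFrames for those tables (Spark 1.3 syntax):

// select a.a, b.b, c.c from a, b, c where a.x = b.x and b.y = c.y
val joined = a.join(b, a("x") === b("x"))
  .join(c, b("y") === c("y"))
  .select(a("a"), b("b"), c("c"))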