Before calling saveAsParquetFile, you can call repartition with a sensible number of partitions; that number determines how many output files are generated.
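A minimal sketch of that idea (Spark 1.1-era API; schemaRdd, the partition count, and the output path are placeholders, and the schema is re-applied in case repartition hands back a plain RDD[Row]):

    val numFiles = 10  // each partition becomes one Parquet part-file
    val repartitioned = sqlContext.applySchema(schemaRdd.repartition(numFiles), schemaRdd.schema)
    repartitioned.saveAsParquetFile("hdfs:///out/events.parquet")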
Thanks
Best Regards
On Mon, Nov 3, 2014 at 1:12 PM, ag007 agre...@mac.com wrote:
Hi there,
I have a pySpark job that is simply taking a
Hi All,
I'm trying to understand the example program linked below. When I run it, I get a *java.lang.NullPointerException* at the highlighted line.
https://gist.github.com/ankurdave/4a17596669b36be06100
val updatedDists =
Hi Michael,
Thanks for the response. I did test with the query you sent me, and it runs much faster:
Old queries stats by phases:
3.2min
17s
Your query stats by phases:
0.3 s
16 s
20 s
But will this improvement also apply when you want to count distinct on 2
or more fields:
SELECT COUNT(f1),
Hi All,
I'm new to GraphX and am exploring Triangle Count use cases. I'm able to get the number of triangles in a graph, but I'm not able to collect the vertex details for each triangle.
*For example*: I'm playing with the vertices and edges of one of the GraphX example graphs:
val vertexArray = Array(
(1L,
Thanks Akhil,
Am I right in saying that repartition will spread the data randomly, so I lose chronological order?
I really just want the CSV converted to Parquet format in the same order it came in.
If I set repartition to 1, will this not be random?
cheers,
Ag
I don't think .aggregateByKey() can be used directly, because we also need the count per key (for the average). Maybe we can use .countByKey(), which returns a map, together with .foldByKey(0)(_+_) (or aggregateByKey()), which gives the sum of values per key. I'm not sure myself how to proceed.
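For what it's worth, a one-pass per-key average is possible with aggregateByKey alone by carrying a (sum, count) pair as the accumulator. A rough sketch, with pairs standing in for an assumed RDD[(String, Double)]:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x)

    // accumulator is (runningSum, runningCount)
    val sumCount = pairs.aggregateByKey((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),   // fold one value into a partition-local accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))    // merge accumulators across partitions
    val avgByKey = sumCount.mapValues { case (sum, count) => sum / count }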
Regards
On Fri, Oct 31,
Hi,
I just wonder how the number of partitions affects performance in Spark.
Is it just the parallelism (more partitions, more parallel sub-tasks) that improves performance, or are there other considerations?
In my case, I run a couple of map/reduce jobs on the same dataset two times with
two
The result was no different with saveAsHadoopFile. In both cases, I can see that I've misinterpreted the API docs. I'll explore the APIs a bit further for ways to save the iterable as chunks rather than one large text/binary. It might also help to clarify this aspect in the API docs. For those
I also realized from your description of saveAsText that the API is indeed behaving as expected, i.e. it is appropriate (though not optimal) for the API to construct a single string out of the value. If the value turns out to be large, the user of the API needs to reconsider the implementation
Yes, that's the same thing really. You're still writing a huge value
as part of one single (key,value) record. The value exists in memory
in order to be written to storage. Although there aren't hard limits,
in general, keys and values aren't intended to be huge, like, hundreds
of megabytes.
You
I did some simple experiments with Impala and Spark, and Impala came out ahead.
But it’s also less flexible: it couldn’t handle irregular schemas, didn’t support JSON, and so on.
On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote:
I agree. My personal experience with Spark
Hi
We are currently using the MapR distribution.
To read files from the file system we specify the path as follows:
test = sc.textFile("mapr/mycluster/user/mapr/test.csv")
This works fine from the SparkContext.
But ...
Currently we are trying to create a table in Hive using the HiveContext from Spark.
Hi,
I have a SchemaRDD created like below:
schemaTransactions = sqlContext.applySchema(transactions, schema)
When I try to save the SchemaRDD as a table using
schemaTransactions.saveAsTable("transactions"), I get the error below:
Py4JJavaError: An error occurred while calling
Yes partitions matter. Usually you can use the default, which will
make a partition per input split, and that's usually good, to let one
task process one block of data, which will all be on one machine.
Reasons I could imagine why 9 partitions is faster than 7:
Probably: Your cluster can execute
Thanks, Sean, for the very useful comments. I now understand better what could be the reasons my evaluations are messed up.
best,
/Shahab
On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen so...@cloudera.com wrote:
Yes partitions matter. Usually you can use the default, which will
make a partition per
hi
Thanks
Best Regards
On Mon, Nov 3, 2014 at 5:53 PM, Karthikeyan Arcot Kuppusamy
karthikeyan...@zanec.com wrote:
hi
Hello,
I am trying to load a very large graph to run a GraphX algorithm, and the graph does not fit in memory.
I found that if I use the DISK_ONLY or MEMORY_AND_DISK_SER storage level, the program hits an OOM, but if I use MEMORY_ONLY_SER, it does not.
Thus I want to know what kind of
Hi all,
I can't seem to find a clear answer in the documentation.
Does the standalone cluster support dynamic assignment of the number of allocated cores to an application once another app stops?
I'm aware that we can have core sharing between active applications if we use Mesos, depending on the number of
Great! Thanks for the information. I will try it out.
-
Novice Big Data Programmer
Hi Guys,
Could anyone explain to me how to use Kafka with Spark? I am using JavaKafkaWordCount.java as a test, and the command line is:
./run-example org.apache.spark.streaming.examples.JavaKafkaWordCount
spark://192.168.0.13:7077 computer49:2181 test-consumer-group unibs.it 3
and like a
Hello,
I have a Spark 1.1.0 standalone cluster, with several nodes, and several
jobs (applications) being scheduled at the same time.
By default, each Spark job takes up all available CPUs.
This way, when more than one job is scheduled, all but the first are stuck
in WAITING.
On the other hand,
Have a look at scheduling pools
https://spark.apache.org/docs/latest/job-scheduling.html. If you want more sophisticated resource allocation, then you are better off using cluster managers like Mesos or YARN.
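As a side note (not mentioned in the post above, just a suggestion): on a standalone cluster you can also cap how many cores each application claims with spark.cores.max, so a second application is not starved while the first runs. A rough sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("job-a")
      .set("spark.cores.max", "4")   // this app claims at most 4 cores cluster-wide
    val sc = new SparkContext(conf)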
Thanks
Best Regards
On Mon, Nov 3, 2014 at 9:10 PM, Romi Kuntsman r...@totango.com
So, as said there, static partitioning is used in Spark’s standalone and
YARN modes, as well as the coarse-grained Mesos mode.
That leaves us only with the fine-grained Mesos mode, where there is *dynamic sharing* of CPU cores.
It says when the application is not running tasks on a machine, other
applications may run
Let me know if you are interested in participating in a meet up in Cincinnati,
OH to discuss Apache Spark.
We currently have 4-5 different companies expressing interest but would like a
few more.
Darin.
Hi,
Is there a nice or optimal method to randomly shuffle spark streaming RDDs?
Thanks,
Josh
I didn't notice your message and asked the same question in the thread titled "Spark job resource allocation best practices".
Adding specific case to your example:
1 - There are 12 cores available in the cluster
2 - I start app B with all cores - gets 12
3 - I start app A - it needs
Hi,
I'm a newbie in Spark and face the following use case:
val data = Array(("A", "1;2;3"))
val rdd = sc.parallelize(data)
// Something here to produce an RDD of (key, value) pairs:
// ("A", "1"), ("A", "2"), ("A", "3")
Does anybody know how to do this?
Thanks
Very straightforward:
You want to use cartesian.
If you have two RDDs, RDD_1("A") and RDD_2(1, 2, 3),
RDD_1.cartesian(RDD_2) will generate the cross product between the two
RDDs and you will have
RDD_3(("A", 1), ("A", 2), ("A", 3))
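A runnable sketch of that suggestion, with the values hard-coded for illustration:

    val rdd1 = sc.parallelize(Seq("A"))
    val rdd2 = sc.parallelize(Seq(1, 2, 3))
    val crossed = rdd1.cartesian(rdd2)     // every pairing of elements from the two RDDs
    crossed.collect().foreach(println)     // (A,1), (A,2), (A,3)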
On 11/3/14, 11:38 AM, david david...@free.fr wrote:
Hi,
I'm a
I think the answer will be the same in streaming as in the core. You
want a random permutation of an RDD? In general, RDDs don't have an
ordering at all -- except when you sort, for example -- so a
permutation doesn't make sense. Do you just want a well-defined but
random ordering of the data? Do
When I'm outputting the RDDs to an external source, I would like them to be output in a random shuffle so that even the order is random. So far, what I understand is that RDDs do have a kind of order, in that the order for Spark Streaming RDDs would be the order in which Spark Streaming
Hi, I'm reading the O'Reilly book Learning Spark and I have a question: does the accumulators' fault-tolerance guarantee still apply only to accumulators updated in actions? Is this behaviour also expected if we use accumulables?
Thanks in advance
Jorge López-Malla Matute
Big Data Developer
Avenida de Europa, 26.
A use case would be helpful.
Batches of RDDs from streams are going to have a temporal ordering in terms of when they are processed in a typical application ..., but maybe you could shuffle the way batch iterations work.
On Nov 3, 2014, at 11:59 AM, Josh J joshjd...@gmail.com wrote:
When
Does anyone have experience or advice for fixing this problem? Highly appreciated!
Before saveAsParquetFile(), you can call coalesce(N); then you will have N files, and it will keep the order as before (repartition() will not).
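A small sketch of that, assuming a SchemaRDD built from the CSV input (the names and path are placeholders). coalesce merges partitions without a shuffle, which is why the original row order survives; the schema is re-applied in case coalesce returns a plain RDD[Row]:

    // one partition => one output file, rows kept in their original order
    val ordered = sqlContext.applySchema(csvSchemaRdd.coalesce(1), csvSchemaRdd.schema)
    ordered.saveAsParquetFile("hdfs:///out/ordered.parquet")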
On Mon, Nov 3, 2014 at 1:16 AM, ag007 agre...@mac.com wrote:
Thanks Akhil,
Am I right in saying that the repartition will spread the data randomly so I
On Sun, Nov 2, 2014 at 1:35 AM, jan.zi...@centrum.cz wrote:
Hi,
I am using Spark on YARN, specifically Spark in Python. I am trying to run:
myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json")
How many files do you have? And what is the average size of each file?
myrdd.getNumPartitions()
If you iterated over an RDD's partitions, I'm not sure that in
practice you would find the order matches the order they were
received. The receiver is replicating data to another node or nodes as it goes, and I don't know how much is guaranteed about that.
If you want to permute an RDD, how about a
I just built the 1.2 snapshot current as of commit 76386e1a23c using:
$ ./make-distribution.sh --tgz --name my-spark --skip-java-test -DskipTests
-Phadoop-2.4 -Phive -Phive-0.13.1 -Pyarn
I drop in my Hive configuration files into the conf directory, launch
spark-shell, and then create my
Hi,
I noticed a bug in the sample java code in MLlib - Naive Bayes docs page:
http://spark.apache.org/docs/1.1.0/mllib-naive-bayes.html
In the filter:
double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
@Override public Boolean
Is there any reason why StringType is not a supported type for the GT, GTE, LT, and LTE operations? I was previously able to have a predicate where my column type was a string and execute a filter with one of the above operators in Spark SQL without any problems. However, I synced up to the latest code this
Hi Terry,
I think the issue you mentioned will be resolved by the following PR:
https://github.com/apache/spark/pull/3072
- Kousuke
(2014/11/03 10:42), Terry Siu wrote:
I just built the 1.2 snapshot current as of commit 76386e1a23c using:
$ ./make-distribution.sh --tgz --name my-spark
Yes, good catch. I also think the 1.0 * is suboptimal as a cast to
double. I searched for similar issues and didn't see any. Open a PR --
I'm not even sure this is enough to warrant a JIRA? but feel free to
as well.
On Mon, Nov 3, 2014 at 6:46 PM, Dariusz Kobylarz
darek.kobyl...@gmail.com wrote:
I have 3 datasets; across all of them the average file size is 10-12 KB.
I am able to run my code on the dataset with 70K files, but I am not able to run it on the datasets with 1.1M and 3.8M files.
__
On Sun, Nov 2, 2014 at 1:35 AM,
Thanks, Kousuke. I’ll wait till this pull request makes it into the master
branch.
-Terry
From: Kousuke Saruta
saru...@oss.nttdata.co.jp
Date: Monday, November 3, 2014 at 11:11 AM
To: Terry Siu terry@smartfocus.com,
Thanks Akhil.
I realized that earlier, and I thought mvn -Phive should have captured and included all these dependencies.
In any case, I proceeded with that, included other such dependencies that were missing, and finally hit the Guava version mismatch issue (Spark with Guava 14 vs Hadoop/Hive
If you want to permute an RDD, how about a sortBy() on a good hash
function of each value plus some salt? (Haven't thought this through
much but sounds about right.)
This sounds promising. Where can I read more about the space (memory and
network overhead) and time complexity of sortBy?
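A rough sketch of that salted-hash idea (rdd here is an assumed RDD[String]; any stable hash function would do):

    import scala.util.Random
    import scala.util.hashing.MurmurHash3

    val salt = Random.nextInt().toString
    // sorting by a salted hash imposes a well-defined but effectively random order
    val permuted = rdd.sortBy(x => MurmurHash3.stringHash(x + salt))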
On
Hi All,
I have been using the LinearRegression model of MLlib and am very pleased with its scalability and robustness. Right now, we are just calculating the MSE of our model. We would like to characterize the performance of our model. I was wondering about adding support for computing things such as Confidence
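For reference, the usual MSE computation with MLlib looks roughly like this (a sketch; parsedData and model are assumed to be an RDD[LabeledPoint] and a fitted LinearRegressionModel):

    import org.apache.spark.SparkContext._   // implicits for mean() on RDD[Double] (Spark 1.x)
    import org.apache.spark.mllib.regression.LabeledPoint

    val valuesAndPreds = parsedData.map { point =>
      (point.label, model.predict(point.features))   // (actual, predicted)
    }
    val mse = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()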
Hi All,
I have spent the last two years on Hadoop but am new to Spark.
I am planning to move one of my existing systems to Spark to get some enhanced features.
My question is:
If I try to do a map-side join (something similar to the replicated keyword in Pig), how can I do it? Is there any way to declare a
no luck :(! Still observing the same behavior!
Hi,
I am planning to run Spark on EMR, and my application might take a lot of memory. On EMR, I know there is a hard limit of 16 GB physical memory for an individual mapper/reducer (otherwise I get an exception; this is confirmed by the AWS EMR team, at least as the spec at this moment).
I have a Spark Streaming program that works fine if I execute it via
sbt runMain com.cray.examples.spark.streaming.cyber.StatefulDhcpServerHisto
-f /Users/spr/Documents/.../tmp/ -t 10
but if I start it via
$S/bin/spark-submit --master local[12] --class StatefulNewDhcpServers
You need to use broadcast followed by flatMap or mapPartitions to do map-side
joins (in your map function, you can look at the hash table you broadcast and
see what records match it). Spark SQL also does it by default for tables
smaller than the spark.sql.autoBroadcastJoinThreshold setting (by
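A bare-bones sketch of the broadcast variant described above (small and large are assumed pair RDDs, with small comfortably fitting in driver memory):

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x)

    // ship the small table to every executor once
    val smallMap = sc.broadcast(small.collectAsMap())

    // inner join done entirely on the map side: keep only keys present in the small table
    val joined = large.mapPartitions { iter =>
      iter.flatMap { case (k, v) =>
        smallMap.value.get(k).map(sv => (k, (v, sv)))
      }
    }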
Hi,
Well, I couldn't find the original documentation, but according to
http://qnalist.com/questions/2791828/about-the-cpu-cores-and-cpu-usage,
vcores refers to virtual cores, not physical CPU cores.
And I used the top command
That sounds like a regression. Could you open a JIRA with steps to
reproduce (https://issues.apache.org/jira/browse/SPARK)? We'll want to fix
this before the 1.2 release.
On Mon, Nov 3, 2014 at 11:04 AM, Terry Siu terry@smartfocus.com wrote:
Is there any reason why StringType is not a
On Mon, Nov 3, 2014 at 12:45 AM, Bojan Kostic blood9ra...@gmail.com wrote:
But will this improvement also apply when you want to count distinct on 2
or more fields:
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT
f4)
FROM parquetFile
Unfortunately I think this
I am running in local (client) mode. My VM has 16 CPUs / 108 GB RAM. My configuration is as follows:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+UseCompressedOops
-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+DisableExplicitGC
-XX:MaxPermSize=1024m
spark.daemon.memory=20g
It is merged!
On Mon, Nov 3, 2014 at 12:06 PM, Terry Siu terry@smartfocus.com wrote:
Thanks, Kousuke. I’ll wait till this pull request makes it into the
master branch.
-Terry
From: Kousuke Saruta saru...@oss.nttdata.co.jp
Date: Monday, November 3, 2014 at 11:11 AM
To: Terry Siu
Hi all,
I have a Spark Streaming job that consumes data from Kafka and performs some simple operations on the data. This job runs on an EMR cluster with 10 nodes. The batch size I use is 1 minute, and it takes around 10 seconds to generate the results that are inserted into a MySQL database.
I'm struggling with implementing a few algorithms with Spark and hope to get help from the community.
Most of the machine learning algorithms today are sequential, while Spark is all about parallelism. It seems to me that using Spark doesn't actually help much, because in most cases you can't
Done.
https://issues.apache.org/jira/browse/SPARK-4213
Thanks,
-Terry
From: Michael Armbrust mich...@databricks.com
Date: Monday, November 3, 2014 at 1:37 PM
To: Terry Siu terry@smartfocus.com
Cc:
David, that's exactly what I was after :) Awesome, thanks.
Hi,
I've written a short Scala app to perform word counts on a text file, and I am getting the following exception as the program completes (after it prints out all of the word counts):
Exception in thread delete Spark temp dir
The NullPointerException seems to be because edge.dstAttr is null, which
might be due to SPARK-3936
https://issues.apache.org/jira/browse/SPARK-3936. Until that's fixed, I
edited the Gist with a workaround. Does that fix the problem?
Ankur http://www.ankurdave.com/
On Mon, Nov 3, 2014 at 12:23
Team,
We are running a build of Spark 1.1.1 for Hadoop 2.2. We can't get the code to read LZO or Snappy files on YARN; it fails to find the native libs. I have tried many different ways of defining the lib path: LD_LIBRARY_PATH,
--driver-class-path, spark.executor.extraLibraryPath in
Hi all,
I was just reading this nice documentation here:
http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
I got to the end of it, which says:
Note that there are more efficient ways to get the top 10 hashtags. For
example, instead of sorting the entire of
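The excerpt cuts off there, but one cheaper approach (a sketch of my own, not the tutorial's code) is to take the top 10 directly instead of sorting everything; hashtagCounts is an assumed RDD[(String, Int)] of (hashtag, count) pairs:

    // top() keeps only 10 candidates per partition, avoiding a full sort and shuffle
    val top10 = hashtagCounts.top(10)(Ordering.by[(String, Int), Int](_._2))
    top10.foreach(println)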
Hello there,
I just set up an EC2 cluster with no HDFS, Hadoop, or HBase whatsoever. I installed Spark only to read/process data from an HBase instance in a different cluster. Spark was built against the HBase/Hadoop versions in the remote (EC2) HBase cluster, which are 0.98.1 and 2.3.0 respectively.
But I
Hi,
On Mon, Nov 3, 2014 at 1:29 PM, Amey Chaugule ambr...@gmail.com wrote:
I thought that only applied when you're trying to run a job using
spark-submit or in the shell...
And how are you starting your Yarn job, if not via spark-submit?
Tobias
Hi,
On Fri, Oct 31, 2014 at 4:31 PM, lieyan lie...@yahoo.com wrote:
The code is here: LogReg.scala
http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/LogReg.scala
Then I click the Run button in IDEA, and I get the following error message:
errlog.txt
I am running Python 2.7.3 and IPython Notebook 2.1.0. I installed Spark in my home directory.
'/home/felix/spark-1.1.0/python/lib/py4j-0.8.1-src.zip',
'/home/felix/spark-1.1.0/python', '', '/opt/bluekai/python/src/bk',
'/usr/local/lib/python2.7/dist-packages/setuptools-6.1-py2.7.egg',
From: Tobias Pfeiffer t...@preferred.jp
Am I right that you are actually executing two different classes here?
Yes, I realized after I posted that I was calling 2 different classes, though
they are in the same JAR. I went back and tried it again with the same class
Yes, I am using Spark 1.1.0 and have used rdd.registerTempTable().
I tried adding sqlContext.cacheTable(), but it took 59 seconds (more than before).
I also tried changing the schema to use the Long data type in some fields, but it seems the conversion takes more time.
Is there any way to specify an index?
I have a Spark application that serializes an object 'Seq[Page]' and saves it to HDFS/S3, where it is read and deserialized by another worker to be used elsewhere. The serialization and deserialization use the same serializer as Spark itself (read from SparkEnv.get.serializer.newInstance()).
However, I sporadically get this
Sorry, it's a timeout duplicate; please remove it.
You are right. You pointed out the very cause of my problem. Thanks.
I have to specify the path to my jar file.
The solution can be found in an earlier post.
http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-with-simple-Spark-job-on-cluster-td932.html
Greetings!
I'm trying to use Avro and Parquet with the following schema:
{
  "name": "TestStruct",
  "namespace": "bughunt",
  "type": "record",
  "fields": [
    {
      "name": "string_array",
      "type": { "type": "array", "items": "string" }
    }
  ]
}
The writing
I am trying to convert terabytes of JSON log files into Parquet files, but I need to clean the data a little first.
I end up doing the following:
val txt = sc.textFile(inpath).coalesce(800)
val json = (for {
line <- txt
JObject(child) = parse(line)
child2 = (for {
Hi,
I'm using Spark Streaming 1.1, and I have the following log that keeps growing:
/opt/spark-1.1.0-bin-cdh4/work/app-20141029175309-0005/2/stderr
I think it is the executor log, so I set up the following options in spark-defaults.conf:
spark.executor.logs.rolling.strategy time
Many ML algorithms are sequential because they were not designed to be
parallel. However, ML is not driven by algorithms in practice, but by
data and applications. As datasets get bigger and bigger, some algorithms have been revised to work in parallel, like SGD and matrix factorization. MLlib tries
Thank you Ankur for your help and support!!!
On Tue, Nov 4, 2014 at 5:24 AM, Ankur Dave ankurd...@gmail.com wrote:
The NullPointerException seems to be because edge.dstAttr is null, which
might be due to SPARK-3936
https://issues.apache.org/jira/browse/SPARK-3936. Until that's fixed, I
Hi,
I tried hard to get a version of netty into my jar file created with sbt
assembly that works with all my libraries. Now I managed that and was
really happy, but it seems like spark-submit puts an older version of netty
on the classpath when submitting to a cluster, such that my code ends up
If you are using the capacity scheduler in YARN: by default, the YARN capacity scheduler uses DefaultResourceCalculator, which considers only memory when allocating containers.
You can use DominantResourceCalculator instead; it considers both memory and CPU.
In capacity-scheduler.xml set
Yes, I believe Mesos is the right choice for you.
http://mesos.apache.org/documentation/latest/mesos-architecture/
Thanks
Best Regards
On Mon, Nov 3, 2014 at 9:33 PM, Romi Kuntsman r...@totango.com wrote:
So, as said there, static partitioning is used in Spark’s standalone and
YARN modes, as
Hi all,
I have a Spark job that I build with sbt and can run without any problem via sbt run. But when I run it inside IntelliJ IDEA I get the following error:
*Exception encountered when invoking run on a nested suite - class
javax.servlet.FilterRegistration's signer information does not
Not quite sure, but moving the Guava 11 jar to the first position on the classpath may solve this issue.
Thanks
Best Regards
On Tue, Nov 4, 2014 at 1:47 AM, Pala M Muthaia mchett...@rocketfuelinc.com
wrote:
Thanks Akhil.
I realized that earlier, and I thought mvn -Phive should have captured and
Hi,
But I've only one RDD. Here is a more complete example:
my RDD is something like ("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1")
and I expect the following result:
("A", "1"), ("A", "2"), ("A", "3"), ("B", "2"), ("B", "5"), ("B", "6"), ("C", "3"), ("C", "2"), ("C", "1")
Any idea how I can achieve this?
Thanks
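One way to get exactly that shape, as a sketch under the assumption that the values are semicolon-separated strings: split each value and flatMap the pieces back out with their key.

    val rdd = sc.parallelize(Seq(("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1")))
    val exploded = rdd.flatMap { case (key, values) =>
      values.split(";").map(v => (key, v))   // ("A","1"), ("A","2"), ("A","3"), ...
    }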