Liquan, yes, for full outer join, one hash table on both sides is more
efficient.
For the left/right outer join, it looks like one hash table should be enough.
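As a plain-Scala illustration (not Spark's actual implementation) of why hashing only the build side suffices for a left outer join:

def leftOuterJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] = {
  // Hash only the right (build) side, then stream the left side through it.
  val buildTable: Map[K, Seq[B]] = right.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
  left.flatMap { case (k, a) =>
    buildTable.get(k) match {
      case Some(bs) => bs.map(b => (k, (a, Some(b): Option[B])))
      case None     => Seq((k, (a, None: Option[B])))
    }
  }
}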
From: Liquan Pei [mailto:liquan...@gmail.com]
Sent: September 30, 2014, 18:34
To: Haopu Wang
Cc:
Hi Michael,
I think you mean batch interval instead of windowing. It can be
helpful for cases when you do not want to process very small batch sizes.
HDFS sink in Flume has the concept of rolling files based on time, number
of events or size.
Yes, I meant batch interval. Thanks for clarifying.
Cheers,
Michael
On Oct 7, 2014, at 11:14 PM, jayant [via Apache Spark User List]
ml-node+s1001560n15904...@n3.nabble.com wrote:
Hi Michael,
I think you mean batch interval instead of windowing. It can be
helpful for cases when
Thank you, this seems to be the way to go, but unfortunately, when I try
to use sc.wholeTextFiles() on a file stored on Amazon S3, I get the
following error:
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat:
Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged
into master?
I cannot find spark.sql.hints.broadcastTables in the latest master, but it's in
the following patch.
https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
Jianshi
On Mon, Sep 29,
Hi Sean,
Do I need to specify the number of executors when submitting the job? I
suppose the number of executors will be determined by the number of regions
of the table. Just like in a MapReduce job, you needn't specify the number of
map tasks when reading from an HBase table.
The script to
I'm pretty sure inner joins on Spark SQL already build only one of the sides.
Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer
joins do both, and it seems like we could optimize it for those that are not
full.
Matei
On Oct 7, 2014, at 11:04 PM, Haopu Wang
You do need to specify the number of executor cores to use. Executors are
not like mappers. After all, they may do much more in their lifetime than
just read splits from HBase, so it would not make sense to determine it by
something that the first line of the program does.
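For reference, a hedged sketch of where those knobs live in the 1.1-era tooling (the application jar and values here are made up): on YARN you pass the executor count and cores per executor to spark-submit, e.g.

spark-submit --master yarn --num-executors 8 --executor-cores 4 --executor-memory 4g myapp.jar

while on a standalone cluster you cap the total cores with the spark.cores.max property instead.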
On Oct 8, 2014 8:00 AM, Tao
I am working on a PR to leverage the HashJoin trait code to optimize the
Left/Right outer join. It's already been tested locally, and I will send out
the PR soon after some cleanup.
Thanks,
Liquan
On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
I'm pretty sure inner
Hi,
Has anyone tried the Mosek solver (http://www.mosek.com/) in Spark?
I am getting weird serialization errors. I came to know that Mosek uses shared
libraries, which may not be serializable.
Is this the reason for the errors, or is it working for anyone?
--
Regards,
Raghuveer Chanda
4th
Looks like an OOM issue? Have you tried persisting your RDDs to allow
disk writes?
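A minimal sketch of what that looks like (the RDD names here are placeholders):

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills partitions that don't fit in memory to disk
// instead of failing or recomputing them under memory pressure.
val joined = left.join(right).persist(StorageLevel.MEMORY_AND_DISK)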
I've seen a lot of similar crashes in a Spark app that reads from HDFS
and does joins. I.e. I've seen java.io.IOException: Filesystem
closed, Executor lost, FetchFailed, etc etc with
non-deterministic crashes.
OK, currently there's cost-based optimization; however, Parquet statistics are
not implemented...
What's a good way to join a big fact table with several tiny
dimension tables in Spark SQL (1.1)?
I wish we could allow user hints for the join.
Jianshi
On Wed, Oct 8, 2014 at 2:18 PM,
Hi, All
We need an interactive interface tool for Spark in which we can run Spark jobs
and plot graphs to explore the data interactively.
IPython Notebook is good, but it only supports Python (we want one supporting
Scala)...
BR,
Kevin.
OK, let me rephrase my question once again. Python-wise, I prefer
.predict_proba(X) to .decision_function(X), since it is easier for
me to interpret the results. As far as I can see, the latter functionality
is already implemented in Spark (well, in version 0.9.2, for example, I have
to
Hello,
I have been developing a Spark Streaming application using Kafka, which runs
successfully on my MacBook. I am now trying to run it on an AWS Ubuntu Spark
cluster... and I receive a ClassNotFoundException.
Kafka 0.8.1.1
Spark 1.1.0
I am submitting the job like this:
Hi all,
I'd like to use an octree data structure in order to simplify several
computations in a big data set. I've been wondering if Spark has any
built-in options for such structures (the only thing I could find is the
DecisionTree), especially if they make use of RDDs.
I've also been exploring
My additional question is whether this problem could be caused by the fact
that my file is bigger than the total RAM across the whole cluster.
Hi
I'm trying to use sc.wholeTextFiles() on a file stored on Amazon S3, and I'm
getting
Hi, I'm trying to read from Kafka. I was able to do it correctly with this
method.
def createStream(
ssc: StreamingContext,
zkQuorum: String,
groupId: String,
topics: Map[String, Int],
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
):
Hi Xiangrui,
Changing the default step size to 0.01 made a huge difference. The
results make sense when I use A + B + C + D. MSE is ~0.07 and the outcome
matches the domain knowledge.
I was wondering whether there is any documentation on the parameters and
when/how to vary them.
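For reference, a hedged sketch of passing the step size explicitly to one of MLlib's SGD-based trainers (assuming linear regression here; the thread doesn't name the exact algorithm, and the values are illustrative):

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

def fit(training: RDD[LabeledPoint]) = {
  val numIterations = 100
  val stepSize = 0.01
  // train(data, numIterations, stepSize) uses the given step size instead of the default 1.0
  LinearRegressionWithSGD.train(training, numIterations, stepSize)
}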
Date: Tue, 7 Oct
Hi,
I finally found a solution after reading this post:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-RDD-by-key-and-save-to-different-path-td11887.html#a11983
Hi Andy,
This sounds awesome. Please keep us posted. Meanwhile, can you share a link to
your project? I wasn't able to find it.
Cheers,
Michael
On Oct 8, 2014, at 3:38 AM, andy petrella andy.petre...@gmail.com wrote:
Heya
You can check Zeppelin or my fork of the Scala Notebook.
I'm
Hi,
Were you able to figure out how to choose a specific version? I'm having the
same issue.
Thanks.
Increasing the driver memory resolved this issue. Thanks to Nick for the
hint. Here is how I am starting the shell: spark-shell --driver-memory 4g
--driver-cores 4 --master local
One more update: I've realized that this problem is not only Python-related. I've
also tried it in Scala, but I'm still getting the same error. My Scala code: val file =
sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()
My
Take this as a bit of a guess, since I don't use S3 much and am only a
bit aware of the Hadoop+S3 integration issues. But I know that S3's
lack of proper directories causes a few issues when used with Hadoop,
which wants to list directories.
According to
Ummm... what's helium? Link, plz?
On Oct 8, 2014, at 9:13 AM, Stephen Boesch java...@gmail.com wrote:
@kevin, Michael,
Second that: interested in seeing the zeppelin. pls use helium though ..
2014-10-08 7:57 GMT-07:00 Michael Allman mich...@videoamp.com:
Hi Andy,
This sounds awesome.
They reverted to a previous version of the spark-ec2 script and things are
working again!
Sure! I'll post updates as well in the ML :-)
I'm doing it on twitter for now (until doc is ready).
The repo is there (branch spark) :
https://github.com/andypetrella/scala-notebook/tree/spark
Some tweets:
* very first working stuff:
https://twitter.com/noootsab/status/508758335982927872/photo/1
Yup, though to be clear, Josh reverted a change to a hosted script that
spark-ec2 references. The spark-ec2 script y’all are running locally hasn’t
changed, obviously.
On Wed, Oct 8, 2014 at 12:20 PM, mrm ma...@skimlinks.com wrote:
They reverted to a previous version of the spark-ec2 script
Thanks for the explanation, I was going to ask exactly about this :)
On Wed, Oct 8, 2014 at 6:23 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Yup, though to be clear, Josh reverted a change to a hosted script that
spark-ec2 references. The spark-ec2 script y’all are running locally
The proper step size partially depends on the Lipschitz constant of
the objective. You should let the machine try different combinations
of parameters and select the best. We are working with people from
AMPLab to make hyperparameter tuning easier in MLlib 1.2. For the
theory, Nesterov's book
We have our analytics infra built on Spark and Parquet.
We are trying to replace some of our queries based on the direct Spark RDD
API with SQL based on either Spark SQL or HiveQL.
Our motivation was to take advantage of the transparent projection and
predicate pushdown that's offered by Spark SQL and
Hi all,
We’re organizing a meetup October 30-31st in downtown SF that might be of
interest to the Spark community. The focus is on large-scale data analysis and
its role in neuroscience. It will feature several active Spark developers and
users, including Xiangrui Meng, Josh Rosen, Reza Zadeh,
I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2 -
I repeatedly see
the following in my logs.
I believe this happens in combineByKey
14/10/08 09:36:30 INFO executor.Executor: Running task 3.0 in stage 0.0
(TID 3)
14/10/08 09:36:30 INFO broadcast.TorrentBroadcast:
We are working to improve the integration here, but I can recommend the
following when running Spark 1.1: create an external table and
set spark.sql.hive.convertMetastoreParquet=true
Note that even with a HiveContext we don't support window functions yet.
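A minimal sketch, assuming a Spark 1.1 HiveContext (the table name is hypothetical, and the external-table DDL depends on your Hive version's Parquet SerDe):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Let Spark SQL read metastore Parquet tables with its native Parquet support.
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
// ...create/point the external table at the Parquet data via your usual DDL, then:
val counts = hiveContext.sql("SELECT COUNT(*) FROM my_parquet_table")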
On Wed, Oct 8, 2014 at 10:41 AM, Anand
Hi Lewis,
For debugging purposes, can you try using HttpBroadcast to see if the error
remains? You can enable HttpBroadcast by setting spark.broadcast.factory
to org.apache.spark.broadcast.HttpBroadcastFactory in your Spark conf.
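For example, something along these lines when building the conf (the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("broadcast-debug")
  // Switch from the default TorrentBroadcast to HttpBroadcast for debugging.
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
val sc = new SparkContext(conf)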
Thanks,
Liquan
On Wed, Oct 8, 2014 at 11:21 AM, Steve Lewis
Thanks for the input. We purposefully made sure that the config option did
not make it into a release as it is not something that we are willing to
support long term. That said we'll try and make this easier in the future
either through hints or better support for statistics.
In this particular
Using a var for RDDs in this way is not going to work. In this example,
tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after
that, you change what tx2 means, so you would end up having a circular
dependency.
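For readers skimming the thread, a minimal sketch of the kind of pattern being discussed (the tx1/tx2 names come from the thread; the bodies are hypothetical):

var tx1 = sc.parallelize(1 to 100)
var tx2 = sc.parallelize(101 to 200)
// tx1.zip(tx2) builds an RDD whose lineage includes the *current* tx2;
// rebinding tx2 to that result each iteration makes later lineage refer
// back to whatever tx2 used to point at, which is easy to get wrong.
for (i <- 1 to 5) {
  tx2 = tx1.zip(tx2).map { case (a, b) => a + b }
}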
On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung
So, I think this is a bug, but I wanted to get some feedback before I reported
it as such. On Spark on YARN, 1.1.0, if you specify the --driver-memory value
to be higher than the memory available on the client machine, Spark errors out
due to failing to allocate enough memory. This happens
There is a toDebugString method on RDD that will print a description of
the RDD and its recursive dependencies for debugging.
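For example (a trivial sketch; any RDD works, and the path is hypothetical):

val words = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
// Prints this RDD and its recursive dependencies (the lineage).
println(counts.toDebugString)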
Thanks,
Liquan
On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
My job is not being fault-tolerant (e.g., when there's a fetch failure
That's a good question. I'm not sure if that will work. I will note that we
are hoping to do some upgrades of our parquet support in the near future.
On Tue, Oct 7, 2014 at 10:33 PM, Michael Allman mich...@videoamp.com
wrote:
Hello,
I was interested in testing Parquet V2 with Spark SQL, but
There is no circular dependency. It's simply dropping references to previous RDDs
because there is no need for them.
I wonder if that messes things up internally for Spark, though, due to losing
references to intermediate RDDs.
On Oct 8, 2014, at 12:13 PM, Akshat Aranya aara...@gmail.com wrote:
Revert the script to an older version.
https://github.com/apache/spark/tree/branch-1.1/ec2
Thanks
Best Regards
On Wed, Oct 8, 2014 at 9:57 PM, Jan Warchoł jan.warc...@codilime.com
wrote:
Thanks for the explanation, I was going to ask exactly about this :)
On Wed, Oct 8, 2014 at 6:23 PM, Nicholas
I need to do deduplication processing in Spark. The current plan is to generate
a tuple where the key is the dedup criteria and the value is the original input. I am
thinking of using reduceByKey to discard duplicate values. If I do that, can I
simply return the first argument, or should I return a copy
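A minimal sketch of that approach (dedupKey is a hypothetical function extracting the dedup criteria); returning the first argument as-is is fine as long as records are not mutated in place:

val deduped = input
  .map(rec => (dedupKey(rec), rec))       // key = dedup criteria, value = original record
  .reduceByKey((first, second) => first)  // keep the first value seen for each key
  .values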
Multiple values may be different, yet still be considered duplicates
depending on how the dedup criteria is selected. Is that correct? Do you
care in that case what value you select for a given key?
On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote:
I need to do deduplication
Hi,
If using Ganglia is not an absolute requirement, check out SPM
http://sematext.com/spm/ for Spark --
http://blog.sematext.com/2014/10/07/apache-spark-monitoring/
It monitors all Spark metrics (i.e. you don't need to figure out what you
need to monitor, how to get it, how to graph it, etc.)
Hi Greg,
It does seem like a bug. What is the particular exception message that you
see?
Andrew
2014-10-08 12:12 GMT-07:00 Greg Hill greg.h...@rackspace.com:
So, I think this is a bug, but I wanted to get some feedback before I
reported it as such. On Spark on YARN, 1.1.0, if you specify
Theodore, did you ever get this resolved? I just ran into the same thing.
Before digging, I figured I'd ask.
Hi Russell and Theodore,
This usually means your Master / Workers / client machine are running
different versions of Spark. On a local machine, you may want to restart
your master and workers (sbin/stop-all.sh, then sbin/start-all.sh). On a
real cluster, you want to make sure that every node
Hi Jamborta,
It could be that your executors are requesting too much memory. I'm not
sure why it works in client mode but not in cluster mode, however. Have you
checked the RM logs for messages that complain about container memory
requested being too high? How much memory is each of your
Hi:
I tried to build Spark (1.1.0) with Hadoop 2.4.0 and ran a simple
word count example using spark-shell on Mesos. When I ran my application, I
got the following error, which looks related to a mismatch of protobuf versions
between the Hadoop cluster (protobuf 2.5) and Spark (protobuf 4.1). I ran mvn
Hi Andrew,
Thanks for the reply, I tried to tune the memory, changed it as low as
possible, no luck.
My guess is that this issue is related to what is discussed here
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-td11668.html
that is the
I have faced this in the past and I had to put a profile -Phadoop-2.3...
mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install
On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu liuchuan...@gmail.com wrote:
Hi:
I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple
The build instructions for pyspark appear to be:
sbt/sbt assembly
Given that Maven has been the preferred build tool since July 1, presumably I
have overlooked the instructions for building via Maven? Could anyone please
point them out? Thanks.
Maybe you could implement something like this (I don't know if something
similar already exists in Spark):
http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
Best,
Flavio
On Oct 8, 2014 9:58 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Multiple values may be different, yet
Looking more closely, inside core/pom.xml there are a few references to
the Python build.
This question mostly has to do with my limited knowledge of the Python
environment. I will look up how to set up a Python module. It appears a
hack is to add
I am planning to try upgrading Spark SQL to a newer version of Parquet, too.
I'll let you know if I make progress.
Thanks,
Michael
On Oct 8, 2014, at 12:17 PM, Michael Armbrust mich...@databricks.com wrote:
Thats a good question, I'm not sure if that will work. I will note that we
are
Hi,
Please check Zeppelin, too.
http://zeppelin-project.org
https://github.com/nflabs/zeppelin
Which is similar to scala notebook.
Best,
moon
On Thursday, October 9, 2014, andy petrella andy.petre...@gmail.com wrote:
Sure! I'll post updates as well in the ML :-)
I'm doing it on twitter for now
Have you looked at
http://spark.apache.org/docs/latest/building-with-maven.html ?
Especially
http://spark.apache.org/docs/latest/building-with-maven.html#building-for-pyspark-on-yarn
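For example, the 1.1-era docs suggest something along these lines (profiles and versions depend on your Hadoop distribution):

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package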
Cheers
On Wed, Oct 8, 2014 at 2:01 PM, Stephen Boesch java...@gmail.com wrote:
The build instructions for
That converts the error to the following
14/10/08 13:27:40 INFO executor.Executor: Running task 3.0 in stage 0.0
(TID 3)
14/10/08 13:27:40 INFO broadcast.HttpBroadcast: Started reading broadcast
variable 0
14/10/08 13:27:40 ERROR executor.Executor: Exception in task 1.0 in stage
0.0 (TID 1)
One thing I didn't mention is that we actually do a data.repartition beforehand,
with a shuffle.
I found that this can actually introduce randomness into lineage steps,
because data gets shuffled to different partitions, leading to inconsistent
behavior if your algorithm depends on the order at
I noticed that repartition will result in non-deterministic lineage, because
it changes the order of rows.
So for instance, if you do things like:
val data = read(...)
val k = data.repartition(5)
val h = k.repartition(5)
It seems that this results in different ordering of rows for 'k'
Hi Andy,
It sounds great! Quick questions: I have been using IPython + PySpark. I
crunch the data with PySpark and then visualize it with Python libraries
like matplotlib and basemap. Could I still use these Python libraries in
the Scala Notebook? If not, what are the suggested approaches for
Hi
I am in the process of migrating some logic in Pig scripts to Spark SQL. As
part of this process, I am creating a few Select...Group By queries and
registering them as tables using the SchemaRDD.registerAsTable feature.
When using such a registered table in a subsequent Select...Group By
query,
Using SUM on a string should automatically cast the column. Also, you can
use CAST to change the data type
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
.
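For example, something like this (borrowing the table and column names that appear later in this thread):

val aggRdd = sql("select c1, c2, sum(cast(c3 as double)) from sourceRDD group by c1, c2")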
What version of Spark are you running? This could be
Sean,
I am having a similar issue, but I have a lot of data for a group and cannot
materialize the iterable into a List or Seq in memory. [I tried; it runs
into OOM.] Is there any other way to do this?
I also tried a secondary sort, with the key being group::time, but
the problem with that
Thanks, Michael. Should the cast be done in the source RDD or while doing
the SUM?
To give a better picture, here is the code sequence:
val sourceRdd = sql("select ... from source-hive-table")
sourceRdd.registerAsTable("sourceRDD")
val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2")
Which version of Spark are you running?
On Wed, Oct 8, 2014 at 4:18 PM, Ranga sra...@gmail.com wrote:
Thanks Michael. Should the cast be done in the source RDD or while doing
the SUM?
To give a better picture here is the code sequence:
val sourceRdd = sql("select ... from source-hive-table")
Sorry, it's 1.1.0.
After digging a bit more into this, it seems like the OpenCSV Deserializer
converts all the columns to a String type. This may be throwing the
execution off. I am planning to create a class and map the rows to this custom
class. I will keep this thread updated.
On Wed, Oct 8, 2014 at
I built Spark 1.2.0 successfully, but was unable to build my Spark program
under 1.2.0 with sbt assembly and my build.sbt file. It contains:
I tried:
"org.apache.spark" %% "spark-sql" % "1.2.0",
"org.apache.spark" %% "spark-core" % "1.2.0",
and
"org.apache.spark" %% "spark-sql" % "1.2.0-SNAPSHOT",
Hi, I think you have to change the code like this to specify the type
info:
val kafkaStream: ReceiverInputDStream[(String, String)] =
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
ssc,
kafkaParams,
topicMap,
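For reference, a hedged sketch of the full call (the fragment above is cut off by the digest; connection values here are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zk-host:2181",
  "group.id" -> "my-consumer-group")
val topicMap = Map("my-topic" -> 1)
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER_2)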
Hi Pat,
Couple of points:
1) I must have done something naive like:
git clone git://github.com/apache/spark.git -b branch-1.2.0
because git branch is telling me I'm on the master branch, and I see
that branch-1.2.0 doesn't exist (https://github.com/apache/spark).
Nevertheless, when I compiled
Sean,
I did specify the number of cores to use as follows:
... ...
val sparkConf = new SparkConf()
.setAppName("Reading HBase")
.set("spark.cores.max", "32")
val sc = new SparkContext(sparkConf)
... ...
But that does not solve the problem --- only 2 workers are allocated.
I'm
Hi -
When I run the following Spark SQL query in spark-shell (version 1.1.0):
val rdd = sqlContext.sql("SELECT a FROM x WHERE ts >= '2012-01-01T00:00:00' AND
ts <= '2012-03-31T23:59:59'")
it gives the following error:
rdd: org.apache.spark.sql.SchemaRDD =
SchemaRDD[294] at RDD at
Hi,
I am looking at the documentation for the Java API for streams. The Scala
library has an option to save files locally, but the Java version doesn't seem
to. The only option I see is saveAsHadoopFiles.
Is there a reason why this option was left out of the Java API?
This is also happening to me on a regular basis, when the job is large with
relatively large serialized objects used in each RDD lineage. A bad thing
about this is that this exception always stops the whole job.
On Fri, Sep 26, 2014 at 11:17 AM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
This is a bit strange. When I print the schema for the RDD, it reflects the
correct data type for each column. But doing any kind of mathematical
calculation seems to result in a ClassCastException. Here is a sample that
results in the exception:
select c1, c2
...
cast (c18 as int) * cast (c21 as
What is your data like? Are you looking for exact matching, or are you
interested in nearly identical records? Do you need to merge similar records to
get a canonical value?
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Thu, Oct 9, 2014 at 2:31
IIRC - the random is seeded with the index, so it will always produce
the same result for the same index. Maybe I don't totally follow
though. Could you give a small example of how this might change the
RDD ordering in a way that you don't expect? In general repartition()
will not preserve the
Spark will need to connect both to the Hive metastore and to all HDFS
nodes (NN and DNs). If that is all in place, then it should work. In
this case it looks like maybe it can't connect to a datanode in HDFS
to get the raw data. Keep in mind that the performance might not be
very good if you are
There is not yet a 1.2.0 branch; there is no 1.2.0 release. master is
1.2.0-SNAPSHOT, not 1.2.0. Your final command is correct, but it's
redundant to 'package' and then throw that away with another 'clean'.
Just the final command with '... clean install' is needed.
On Thu, Oct 9, 2014 at 2:12 AM,