Hello,
I am getting this unclear error message when I read a Parquet file. It
seems something is wrong with the data, but what? I googled a lot but did
not find any clue. I hope some Spark experts can help me with this.
best,
Shahab
Traceback (most recent call last):
File "/usr/lib/
After grouping by, can you perform a join between
this map and your other dataset, rather than trying to fit the map in memory?
Regards,
Shahab
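To make the suggestion concrete, here is the idea sketched on plain Scala collections (all names here are made up for illustration; on RDDs the same shape uses reduceByKey and join on pair RDDs instead of collecting a Map to the driver):

```scala
// Sketch of the join-instead-of-collect idea on plain Scala collections;
// in Spark you would use reduceByKey and join on pair RDDs instead.
// All names (records, other) are made up for illustration.
val records = Seq(("a", 1), ("a", 2), ("b", 3))
val counts  = records.groupBy(_._1).map { case (k, vs) => (k, vs.size) }

val other  = Seq(("a", "x"), ("b", "y"), ("c", "z"))
// Inner join: keep only rows whose key appears in counts.
val joined = other.flatMap { case (k, v) => counts.get(k).map(c => (k, v, c)) }
// joined: Seq(("a","x",2), ("b","y",1))
```

The point is that the counts never have to fit in driver memory when the join stays distributed.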
On Mon, May 13, 2019 at 3:58 PM Kumar sp wrote:
> I have a use case where i am using collect().toMap (Group by certain
> column and finding count ,cr
Hi there.
I have a Hive external table (storage format is ORC, data stored on S3,
partitioned on one bigint-type column) that I am trying to query through
the pyspark (or spark-shell) shell. df.count() fails even with low values
of the LIMIT clause, with the following exception (seen in the Spark UI). df.show()
Could be a tangential idea, but it might help: why not use the
queryExecution and logicalPlan objects that are available when you execute
a query using SparkSession and get a DataFrame back? The JSON
representation contains almost all the info that you need, and you don't
need to go to Hive to get this.
>
> Original DF -> Iterate -> Pass every element to a function that takes the
> element of the original DF and returns a new dataframe including all the
> matching terms
>
>
>
>
>
> *From:* Andrew Melo
> *Sent:* Friday, December 28, 2018 8:48 PM
> *To:
Can you have a dataframe with a column which stores json (type string)? Or
you can also have a column of array type in which you store all cities
matching your query.
On Fri, Dec 28, 2018 at 2:48 AM wrote:
> Hi community ,
>
>
>
> As shown in other answers online, Spark does not support the
-conditionally
On Tue, Dec 18, 2018 at 9:55 AM Shahab Yunus wrote:
> Have you tried using withColumn? You can add a boolean column based on
> whether the age exists or not and then drop the older age column. You
> wouldn't need union of dataframes then
>
> On Tue, Dec 18, 2018 at 8
Have you tried using withColumn? You can add a boolean column based on
whether the age exists or not and then drop the older age column. You
wouldn't need union of dataframes then
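To make that concrete, a sketch on plain case classes (all names are assumptions; in DataFrame terms this corresponds to something like df.withColumn("hasAge", col("age").isNotNull).drop("age")):

```scala
// The withColumn idea sketched on plain case classes; column and class
// names are made up. In Spark this would be roughly
//   df.withColumn("hasAge", col("age").isNotNull).drop("age")
case class Employee(name: String, age: Option[Int])
case class EmployeeFlag(name: String, hasAge: Boolean)

val rows    = Seq(Employee("a", Some(30)), Employee("b", None))
val flagged = rows.map(e => EmployeeFlag(e.name, e.age.isDefined))
// flagged: Seq(EmployeeFlag("a", true), EmployeeFlag("b", false))
```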
On Tue, Dec 18, 2018 at 8:58 AM Devender Yadav
wrote:
> Hi All,
>
>
> useful code:
>
> public class EmployeeBean
Hi James.
--num-executors is used to control the number of parallel tasks (one per
executor) running for your application. For reading and writing data in
parallel, data partitioning is employed. You can look here for a quick
intro to how data partitioning works:
Curious why you think this is not smart code?
On Mon, Dec 3, 2018 at 8:04 AM James Starks
wrote:
> By taking with your advice flatMap, now I can convert result from
> RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]. Basically just to perform
> flatMap in the end before starting to convert RDD
Hi there. Have you seen this link?
https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393
It shows you multiple ways to manually create a dataframe.
Hope it helps.
Regards,
Shahab
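For instance, one of the simplest patterns from that kind of write-up, sketched here (it assumes a SparkSession named spark is in scope, so it is not runnable standalone; column names and data are made up):

```scala
// Minimal sketch: build a DataFrame from a local Seq via toDF.
import spark.implicits._
val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
df.show()
```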
On Wed, Sep 26, 2018 at 8:02 AM Kuttaiah Robin wrote:
> Hello,
>
> Current
Unsubscribe
as well?
Regards,
Shahab
On Tue, Apr 10, 2018 at 10:05 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:
> Also check out FeatureHasher in Spark 2.3.0 which is designed to handle
> this use case in a more natural way than HashingTF (and handles multiple
> columns at once).
>
>
needs to be indexed is huge and the columns to be indexed
are high-cardinality (with lots of categories), and more than one such
column needs to be indexed? Meaning it wouldn't fit in memory.
Thanks.
Regards,
Shahab
Software:
Scala version 2.11.8
(OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Spark 2.0.2
Hadoop 2.7.3-amzn-0
Thanks & Regards,
Shahab
Thanks & Regards,
Shahab
Hi,
I am trying to use Spark 1.5 with MLlib, but I keep getting
"sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming_2.10;1.5.0: not found".
It is weird that this happens, and I could not find any solution for it.
Has anyone faced the same issue?
best,
/Sh
spark-submit.
Does anyone know the solution to this?
best,
/Shahab
It works using yarn-client, but I want to make it run on the cluster. Is
there any way to do so?
best,
/Shahab
On Fri, Sep 18, 2015 at 12:54 PM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:
> Can you try yarn-client mode?
>
> On Fri, Sep 18, 2015, 3:38 PM shahab <shaha
. But I will try your solution as well.
@Neil: I think something is wrong with my fat jar file; I am probably
missing some dependencies in it!
Again thank you all
/Shahab
On Wed, Sep 9, 2015 at 11:28 PM, Dean Wampler <deanwamp...@gmail.com> wrote:
> If you log into the cluste
-mode cluster --class
mypack.MyMainClass --master yarn-cluster s3://mybucket/MySparkApp.jar
Is there anyone who has a similar problem with EMR?
best,
/Shahab
with EMR .
best,
/Shahab
Sorry, I misunderstood.
best,
/Shahab
On Wed, Jul 8, 2015 at 9:52 AM, spark user spark_u...@yahoo.com wrote:
Hi, I am looking at how to load data into Redshift.
Thanks
On Wednesday, July 8, 2015 12:47 AM, shahab shahab.mok...@gmail.com
wrote:
Hi,
I did some experiment with loading
Hi,
Apparently, the sc.parallelize(..) operation is performed in the driver
program, not in the workers! Is it possible to do this in the worker
processes for the sake of scalability?
best
/Shahab
Thanks Akhil, it solved the problem.
best
/Shahab
On Fri, Jun 12, 2015 at 8:50 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Looks like your spark is not able to pick up the HADOOP_CONF. To fix this,
you can actually add jets3t-0.9.0.jar to the classpath
(sc.addJar(/path/to/jets3t-0.9.0
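For reference, a hedged sketch of the two pieces usually involved when reading from s3n:// in Spark 1.x: adding the jets3t jar to the classpath and setting the classic s3n credential keys. The path, bucket, and key values below are placeholders, not from the original thread:

```scala
// Sketch only; assumes an existing SparkContext named sc and the
// Spark 1.x-era s3n filesystem. All values below are placeholders.
sc.addJar("/path/to/jets3t-0.9.0.jar")
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val lines = sc.textFile("s3n://your-bucket/your-file.csv")
```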
Hi,
I tried to read a CSV file from Amazon S3, but I get the following
exception, which I have no clue how to solve. I tried both Spark 1.3.1
and 1.2.1, but no success. Any idea how to solve this is appreciated.
best,
/Shahab
the code:
val hadoopConf=sc.hadoopConfiguration
Hi George,
I have the same issue; did you manage to find a solution?
best,
/Shahab
On Wed, May 13, 2015 at 9:21 PM, George Adams g.w.adams...@gmail.com
wrote:
Hey all, I seem to be having an issue with PostgreSQL JDBC jar on my
classpath. I’ve outlined the issue on Stack Overflow (
http
Thanks Tristan for sharing this. Actually this happens when I am reading a
csv file of 3.5 GB.
best,
/Shahab
On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers tris...@blackfrog.org
wrote:
Hi Shahab,
I’ve seen exceptions very similar to this (it also manifests as negative
array size
Hi,
Is there any way to force Spark to partition cached data across all
worker nodes, so that the data is not cached on only one of them?
best,
/Shahab
Hi,
I am getting a "No space left on device" exception when repartitioning
approx. 285 MB of data while there is still 2 GB of space left??
Does it mean that repartitioning needs more space (more than 2 GB) for
285 MB of data??
best,
/Shahab
java.io.IOException
Hi,
I am using Spark 1.2.0 with Kryo serialization, but I get the
following exception.
java.io.IOException: com.esotericsoftware.kryo.KryoException:
java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1
I would appreciate it if anyone could tell me how I can resolve this.
best,
/Shahab
stand-alone cluster with two nodes, A and B, where node A accommodates
Cassandra, the Spark Master, and a Worker, and node B contains the second
Spark worker.
best,
/Shahab
Thanks Alex, but 482 MB was just an example size; I am looking for a
generic approach to doing this without broadcasting.
Any idea?
best,
/Shahab
On Thu, Apr 30, 2015 at 4:21 PM, Alex lxv...@gmail.com wrote:
482 MB should be small enough to be distributed as a set of broadcast
variables
in memory?
best,
/Shahab
of files. If you want to avoid disk writes, you can mount a ramdisk and
configure spark.local.dir to point to it. Shuffle output will then be
written to the memory-based FS and will not introduce disk IO.
Thanks
Jerry
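A sketch of that ramdisk setup (the mount point and size below are assumptions; the spark.local.dir property itself is a standard Spark setting):

```shell
# Mount a tmpfs ramdisk (path and size are placeholders), then point
# spark.local.dir at it so shuffle files land in memory-backed storage.
sudo mkdir -p /mnt/spark-ram
sudo mount -t tmpfs -o size=4g tmpfs /mnt/spark-ram
# then in conf/spark-defaults.conf:
#   spark.local.dir  /mnt/spark-ram
```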
2015-03-30 17:15 GMT+08:00 shahab shahab.mok...@gmail.com:
Hi,
I was looking
this would be true for *any* transformation which causes a shuffle.
It would not be true if you're combining RDDs with union, since that
doesn't cause a shuffle.
On Thu, Mar 12, 2015 at 11:04 AM, shahab shahab.mok...@gmail.com wrote:
Hi
, and
then deploy the JAR file to SparkSQL .
But is there any way to avoid deploying the jar file and register it
programmatically?
best,
/Shahab
Hi,
Does anyone know how to deploy a custom UDAF jar file in SparkSQL? Where
should I put the jar file so SparkSQL can pick it up and make it accessible
for SparkSQL applications?
I do not use spark-shell; instead, I want to use it in a Spark application.
best,
/Shahab
Thanks Hao,
But my question concerns UDAFs (user-defined aggregation functions), not
UDTFs (user-defined table-generating functions).
I appreciate if you could point me to some starting point on UDAF
development in Spark.
Thanks
Shahab
On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote
version are you using
Shahab?
*From:* yana [mailto:yana.kadiy...@gmail.com]
*Sent:* Wednesday, March 4, 2015 8:47 PM
*To:* shahab; user@spark.apache.org
*Subject:* RE: Does SparkSQL support having count(fieldname) in SQL
statements?
I think the problem is that you are using
(('cnt 2), BooleanType), tree:
I couldn't find anywhere in the documentation whether the HAVING keyword
is not supported?
If this is the case, what would be the workaround? Using two nested SELECT
statements?
best,
/Shahab
HiveContext, but it seems I can not do this!
@Yes, it is added in Hive 0.12, but do you mean It is not supported by
HiveContext in Spark
Thanks,
/Shahab
On Tue, Mar 3, 2015 at 5:23 PM, Yin Huai yh...@databricks.com wrote:
Regarding current_date, I think it is not in either Hive 0.12.0
for supporting them?
2-It does not support Spark 1.1 and 1.2. Any plan for new release?
best,
/Shahab
On Tue, Mar 3, 2015 at 5:41 PM, Rohit Rai ro...@tuplejump.com wrote:
Hello Shahab,
I think CassandraAwareHiveContext
https://github.com/tuplejump/calliope/blob/develop/sql/hive/src/main/scala/org
table using the SQL context. Is this a normal case?
best,
/Shahab
On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao hao.ch...@intel.com wrote:
Hive UDFs are only applicable for HiveContext and its subclass instances.
Is the CassandraAwareSQLContext a direct subclass of HiveContext or
SQLContext
(
RetryingHMSHandler.java:103)
best,
/Shahab
@Yin: sorry for my mistake, you are right it was added in 1.2, not 0.12.0 ,
my bad!
On Tue, Mar 3, 2015 at 6:47 PM, shahab shahab.mok...@gmail.com wrote:
Thanks Rohit, yes my mistake, it does work with 1.1 ( I am actually
running it on spark 1.1)
But do you mean that even HiveContext
Thanks Michael. I understand now.
best,
/Shahab
On Tue, Mar 3, 2015 at 9:38 PM, Michael Armbrust mich...@databricks.com
wrote:
As it says in the API docs
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD,
tables created with registerTempTable are local
Thanks Rohit, yes my mistake, it does work with 1.1 ( I am actually running
it on spark 1.1)
But do you mean that even the HiveContext of Spark (not Calliope's
CassandraAwareHiveContext) does not support Hive 0.12??
best,
/Shahab
On Tue, Mar 3, 2015 at 5:55 PM, Rohit Rai ro...@tuplejump.com wrote
(RetryingHMSHandler.java:103)
best,
/Shahab
...@intel.com wrote:
Can you provide the detailed failure call stack?
*From:* shahab [mailto:shahab.mok...@gmail.com]
*Sent:* Tuesday, March 3, 2015 3:52 PM
*To:* user@spark.apache.org
*Subject:* Supporting Hive features in Spark SQL Thrift JDBC server
Hi,
According to Spark SQL
. There are
couple of other UDFs which cause similar error.
Am I missing something in my JDBC server ?
/Shahab
Thanks Vijay, but the setup requirements for GML were not straightforward
for me at all, so I put it aside for a while.
best,
/Shahab
On Sun, Mar 1, 2015 at 9:34 AM, Vijay Saraswat vi...@saraswat.org wrote:
GML is a fast, distributed, in-memory sparse (and dense) matrix
library.
It does
Thanks Joseph for the comments; I think I need to do some benchmarking.
best,
/Shahab
On Sun, Mar 1, 2015 at 1:25 AM, Joseph Bradley jos...@databricks.com
wrote:
Hi Shahab,
There are actually a few distributed Matrix types which support sparse
representations: RowMatrix, IndexedRowMatrix
Thanks a lot Vijay, let me see how it performs.
Best
Shahab
On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote:
Available in GML --
http://x10-lang.org/x10-community/applications/global-matrix-library.html
We are exploring how to make it available within Spark. Any ideas
wrote:
Yes, it's called CoordinateMatrix (
http://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix);
you need to fill it with elements of type MatrixEntry ((Long, Long,
Double)).
Thanks,
Peter Rudenko
On 2015-02-27 14:01, shahab wrote:
Hi,
I just wonder if there is any
Hi,
I just wonder if there is any Sparse Matrix implementation available in
Spark, so it can be used in spark application?
best,
/Shahab
Thanks Imran, but I would appreciate it if you could explain what this
means and what reasons make it happen. I do need it.
If there is any documentation somewhere, you can simply direct me there so
I can try to understand it myself.
best,
/Shahab
On Sat, Feb 21, 2015 at 12:26 AM, Imran Rashid
if you plan for near real-time response from Spark ?!
best,
/Shahab
Thank you all. Just changing the RDD to a Map structure saved me approx. 1
second.
Yes, I will check out IndexedRDD to see if it has better performance.
best,
/Shahab
On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz brk...@gmail.com wrote:
If your dataset is large, there is a Spark Package called
to lazy evaluation?
best,
/Shahab
, like
a HashMap to keep the data, look it up there, and use Broadcast to send a
copy to all machines?
best,
/Shahab
Thanks Francois for the comment and useful link. I understand the problem
better now.
best,
/Shahab
On Wed, Feb 18, 2015 at 10:36 AM, francois.garil...@typesafe.com wrote:
In a nutshell : because it’s moving all of your data, compared to other
operations (e.g. reduce) that summarize
Hi,
Based on what I could see in the Spark UI, I noticed that the groupBy
transformation is quite slow (taking a lot of time) compared to other
operations.
Is there any reason that groupBy is slow?
shahab
else!
Do you have any other idea where I should look for the cause?
best,
/Shahab
On Wed, Feb 18, 2015 at 4:22 PM, Sean Owen so...@cloudera.com wrote:
The most likely explanation is that you wanted to put all the
partitions in memory and they don't all fit. Unless you asked to
persist
Hi,
I just wonder if there is any way to unregister/re-register a TempTable in
Spark?
best,
/Shahab
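For reference, Spark does have an API for this: SQLContext.dropTempTable (added around Spark 1.2, to the best of my knowledge). A sketch, assuming a SQLContext named sqlContext and DataFrames df1/df2 (all names are assumptions; not runnable standalone):

```scala
// Unregister and re-register a temp table under the same name.
df1.registerTempTable("myTable")
sqlContext.dropTempTable("myTable")   // unregister
df2.registerTempTable("myTable")      // re-register
```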
?
best,
/Shahab
)
?
best,
/shahab
Hi,
I just wonder if Cassandra-Spark connector supports executing HiveQL on
Cassandra tables?
best,
/Shahab
Hello,
For some (unknown) reason, some of my tasks that fetch data from Cassandra
fail quite often, and apparently the master removes a task which fails
more than 4 times (in my case).
Is there any way to increase the number of retries?
best,
/Shahab
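For reference, the "4 failures" behaviour matches Spark's spark.task.maxFailures setting (default 4), which can be raised, for example:

```shell
# Raise the per-task retry limit before the job is failed (example value).
spark-submit --conf spark.task.maxFailures=8 ...
```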
this to perform some benchmarking, and I need to
separate RDD caching time from RDD transformation/action processing time.
best,
/Shahab
Daniel and Paolo, thanks for the comments.
best,
/Shahab
On Wed, Dec 3, 2014 at 3:12 PM, Paolo Platter paolo.plat...@agilelab.it
wrote:
Yes,
otherwise you can try:
rdd.cache().count()
and then run your benchmark
Paolo
*From:* Daniel Darabos daniel.dara...@lynxanalytics.com
),
"org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
best,
/Shahab
Hi,
I just wonder if there is any implementation for Item-based Collaborative
Filtering in Spark?
best,
/Shahab
Thanks a lot, both solutions work.
best,
/Shahab
On Tue, Nov 18, 2014 at 5:28 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to
increment them like so:
val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1
I faced the same problem, and a workaround solution is here:
https://github.com/datastax/spark-cassandra-connector/issues/292
best,
/Shahab
On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab as...@live.com wrote:
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using
sbt-assembly
rows can
have the same string key.
In the Spark context, how can I map each row into (Numeric_Key,
OriginalRow) via map/reduce tasks such that rows with the same original
string key get the same consecutive numeric key?
Any hints?
best,
/Shahab
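One common answer (also suggested elsewhere in this archive) is zipWithIndex over the distinct keys. Sketched here on plain Scala collections with made-up data; on RDDs the same shape works with distinct(), zipWithIndex(), and a join:

```scala
// Assign each distinct string key a consecutive numeric id, then tag
// every row with its key's id; rows sharing a key share the id.
// (Plain-collection sketch; data and names are made up.)
val rows = Seq(("alpha", "row1"), ("beta", "row2"), ("alpha", "row3"))
val ids  = rows.map(_._1).distinct.zipWithIndex.toMap  // alpha -> 0, beta -> 1
val keyed = rows.map { case (k, row) => (ids(k), row) }
// keyed: Seq((0,"row1"), (1,"row2"), (0,"row3"))
```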
% log4j % 1.2.17,
"com.google.guava" % "guava" % "16.0"
)
best,
/Shahab
And this is the exception I get:
Exception in thread main
com.google.common.util.concurrent.ExecutionError:
java.lang.NoSuchMethodError:
com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set
it mean that
choosing the right number of partitions is the key factor in Spark
performance?
best,
/Shahab
Thanks Sean for the very useful comments. I now understand better what
could be the reasons that my evaluations are messed up.
best,
/Shahab
On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen so...@cloudera.com wrote:
Yes partitions matter. Usually you can use the default, which will
make a partition per
idea?
best,
/Shahab
Here is what my SBT looks like:
libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"
    withSources() withJavadoc(),
  "org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(),
  "org.apache.cassandra" % "cassandra-thrift" % "2.0.9"
spark.app.name : SomethingElse
spark.fileserver.uri : http://192.168.1.111:51463
spark.driver.port : 51461
spark.master : local
Does it have anything to do with the version of Apache Cassandra that I
use?? I use apache-cassandra-2.1.0
best,
/Shahab
The shortened SBT :
com.datastax.spark
OK, I created an issue. Hopefully it will be resolved soon.
Again thanks,
best,
/Shahab
On Fri, Oct 31, 2014 at 7:05 PM, Helena Edelson helena.edel...@datastax.com
wrote:
Hi Shahab,
The apache cassandra version looks great.
I think that doing
cc.setKeyspace(mydb)
cc.sql(SELECT * FROM
repartition on the RDD, and then I did the map/reduce
functions.
But the main problem is that repartition takes so much time (almost 2
min), which is not acceptable in my use case. Is there any better way to
do repartitioning?
best,
/Shahab
Hi Helena,
Well... I am just running a toy example. I have one Cassandra node
co-located with the Spark Master and one of the Spark Workers, all on one
machine. I have another node which runs the second Spark worker.
/Shahab,
On Thu, Oct 30, 2014 at 6:12 PM, Helena Edelson helena.edel
Hi,
I noticed that the count (of an RDD) in many of my queries is the most
time-consuming operation, as it runs in the driver process rather than
being done by the parallel worker nodes.
Is there any way to perform count in parallel, or at least parallelize
it as much as possible?
best,
/Shahab
Thanks Helena, very useful comment,
But is spark.cassandra.input.split.size only effective in a cluster, not
on a single node?
best,
/Shahab
On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson helena.edel...@datastax.com
wrote:
Shahab,
Regardless, WRT cassandra and spark when using the spark
I am missing in my settings, or... ?
thanks,
/Shahab
cache()? By itself it does nothing, but once an action
requires it to be computed, it should become cached.
On Oct 28, 2014 8:19 AM, shahab shahab.mok...@gmail.com wrote:
Hi,
I have a standalone Spark setup where the executor is set to have 6.3 GB
of memory, and as I am using two workers, in total
,...) ?!
best,
/Shahab
Thanks for the useful comment. But I guess this setting applies only when
I use SparkSQL, right? Is there any similar setting for Spark?
best,
/Shahab
On Tue, Oct 28, 2014 at 2:38 PM, Wanda Hawk wanda_haw...@yahoo.com wrote:
Is this what are you looking for ?
In Shark, default reducer
java.nio.channels.CancelledKeyException at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
Any idea where should I look for the cause?
best,
/shahab
This following is the part
of the
underlying DAG and associated tasks, it is hard to find what I am looking
for.
best,
/Shahab
Hi,
I just wonder if SparkSQL supports Hive built-in functions (e.g.
from_unixtime) or any of the functions pointed out here : (
https://cwiki.apache.org/confluence/display/Hive/Tutorial)
best,
/Shahab
,
/Shahab
: struct (containsNull = false)
 |    |-- id: string (nullable = true)
 |    |-- data: string (nullable = true)
best,
/Shahab
,
for example counting number of attributes (considering that attributes in
schema is presented as array), or any other type of aggregation?
best,
/Shahab
On Mon, Oct 13, 2014 at 4:01 PM, Yin Huai huaiyin@gmail.com wrote:
Hi Shahab,
Can you try to use HiveContext? Its should work in 1.1
)
anotherPeople.registerTempTable("people")
val query_people = sqlContext.sql("select attributes[0].collectApp from people")
query_people.foreach(println)
But instead of getting "Web" as the printout, I am getting the following:
[[web,[woman],1409064792512, Economy]]
thanks,
/shahab
to any document or resource.
Thanks a lot.
Regards,
Shahab