What version of Hive are you using? And did you compile Spark against the right
version of Hive?
BTW, spark-avro works great in our experience, but still, some non-tech people
just want to use it as a SQL shell in Spark, like the Hive CLI.
Yong
From: mich...@databricks.com
Date: Wed,
Dears,
I need to commit a DB transaction for each partition, not for each row.
The below didn't work for me.
rdd.mapPartitions(partitionOfRecords => {
DBConnectionInit()
val results = partitionOfRecords.map(..)
DBConnection.commit()
})
Best regards,
Ahmed Atef Nawwar
Data Management
Hi all,
What would be a good way to filter rows in a dataframe where any value in a
row is null? I wouldn't want to go through each column manually.
Thanks,
Saif
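One way to do this without enumerating columns, assuming a DataFrame named df on Spark 1.3+, is the DataFrameNaFunctions API (a sketch, not necessarily the only approach):

// Sketch: drop any row that contains a null in any column; df is a placeholder name.
val cleaned = df.na.drop()
// Equivalent explicit form: "any" drops rows with at least one null, "all" only fully-null rows.
val alsoCleaned = df.na.drop("any")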
Hello,
My streaming application needs to allow consuming new Kafka topics at
arbitrary times. I know I can stop and start the streaming context when I
need to introduce a new stream, but that seems quite disruptive. I am
wondering if other people have this situation and if there is a more
elegant
This is something we have been needing for a while too. We are restarting the
streaming context to handle new topic subscriptions/unsubscriptions, which
affects the latency of update handling. I think this is something that needs to be
addressed in core Spark Streaming (I can't think of any
If you all can say a little more about what your requirements are, maybe we
can get a jira together.
I think the easiest way to deal with this currently is to start a new job
before stopping the old one, which should prevent latency problems.
On Thu, Aug 27, 2015 at 9:24 AM, Sudarshan Kadambi
Map is lazy. You need an actual action, or nothing will happen. Use
foreachPartition, or do an empty foreach after the map.
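For illustration, a minimal sketch of the foreachPartition variant; DBConnectionInit, commit and processRow are placeholder helpers from this thread, not real APIs:

// Sketch: one connection and one commit per partition, inside an action
// (foreachPartition) so the work actually runs.
rdd.foreachPartition { partitionOfRecords =>
  val conn = DBConnectionInit()                                    // placeholder helper
  partitionOfRecords.foreach(record => processRow(conn, record))   // placeholder helper
  conn.commit()
}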
On Thu, Aug 27, 2015 at 8:53 AM, Ahmed Nawar ahmed.na...@gmail.com wrote:
Dears,
I need to commit a DB transaction for each partition, not for each row.
below
I have a Java program that does this (using Spark 1.3.1): create a command
string that uses spark-submit (with my class file etc.), and
store this string in a temp file somewhere as a shell script. Using
Runtime.exec, I execute this script and wait for its completion, using
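A rough sketch of that pattern, assuming hypothetical paths and class names (not the poster's actual code):

import java.io.{File, PrintWriter}

// Sketch: write a spark-submit command to a temp shell script, run it, and wait.
val cmd = "spark-submit --class com.example.MyJob --master yarn-cluster /path/to/my.jar"
val script = File.createTempFile("submit-", ".sh")
new PrintWriter(script) { write(s"#!/bin/sh\n$cmd\n"); close() }
script.setExecutable(true)

val proc = Runtime.getRuntime.exec(Array("/bin/sh", script.getAbsolutePath))
val exitCode = proc.waitFor()   // block until spark-submit finishes
println(s"spark-submit finished with exit code $exitCode")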
Hello, thank you for the response.
I found a blog where a guy explains that it is not possible to join columns
from different data frames.
I was trying to modify one column's information, so I selected it and then
tried to replace the original dataframe column. Found another way,
Thanks
Saif
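In case it is useful to others, a common way to replace one column's values without touching the rest of the DataFrame is withColumn (a sketch with illustrative column names, not necessarily the approach Saif ended up using):

import org.apache.spark.sql.functions._

// Sketch: overwrite the existing "price" column with a transformed version of itself.
val updated = df.withColumn("price", col("price") * lit(1.1))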
I suggest taking a heap dump of driver process using jmap. Then open that
dump in a tool like Visual VM to see which object(s) are taking up heap
space. It is easy to do. We did this and found out that in our case it was
the data structure that stores info about stages, jobs and tasks. There can
Hi ,
I am getting this sbt error when trying to compile a TeraSort sbt
file with the following contents. Please suggest
what can be tried. (TeraSort is written in Scala for our benchmark.)
root@:~# more teraSort/tera.sbt
name := "IBM ARL TeraSort"
version := "1.0"
As it stands currently, no.
If you're already overriding the dstream, it would be pretty
straightforward to change the kafka parameters used when creating the rdd
for the next batch though
On Wed, Aug 26, 2015 at 11:41 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Can I change this param
Dears,
I need to commit a DB transaction for each partition, not for each row.
The below didn't work for me.
rdd.mapPartitions(partitionOfRecords => {
DBConnectionInit()
val results = partitionOfRecords.map(..)
DBConnection.commit()
results
})
Best regards,
Ahmed Atef Nawwar
Data
Hello,
I was wondering if there's a way to kill stages in spark streaming via a
StreamingListener. Sometimes stages will hang, and simply killing them via the
UI:
http://apache-spark-user-list.1001560.n3.nabble.com/file/n24478/Screen_Shot_2015-08-27_at_11.png
is enough to let the streaming job
Kill the job in the middle of a batch, look at the worker logs to see which
offsets were being processed, verify the messages for those offsets are
read when you start the job back up
On Thu, Aug 27, 2015 at 10:14 AM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi!
I have enabled check
I was using a different build of Spark compiled with a different version of
Hive before.
The error which I see now:
org.apache.hadoop.hive.serde2.avro.BadSchemaException
at
org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:195)
at
What's the canonical way to find out the number of physical machines in a
cluster at runtime in Spark? I believe SparkContext.defaultParallelism will
give me the number of cores, but I'm interested in the number of NICs.
I'm writing a Spark streaming application to ingest from Kafka with the
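One hedged workaround (there is no dedicated API for NIC counting that I know of) is to count the distinct executor hosts reported by the SparkContext:

// Sketch: approximate the number of machines by counting distinct executor hosts.
// getExecutorMemoryStatus keys are "host:port" strings and include the driver.
val numHosts = sc.getExecutorMemoryStatus.keys
  .map(_.split(":")(0))
  .toSet
  .size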
I put up https://issues.apache.org/jira/browse/SPARK-10320. Let's continue this
conversation there.
From: c...@koeninger.org At: Aug 27 2015 10:30:12
To: Sudarshan Kadambi (BLOOMBERG/ 731 LEX)
Cc: user@spark.apache.org
Subject: Re: Adding Kafka topics to a running streaming context
If you all
You can run a Hive query in spark-avro, but you cannot query a Hive view in
spark-avro, as the view is stored in the Hive metadata.
What do you mean by the right version of Spark? Is the "can't determine table
schema" problem fixed then? I faced this problem before, and my guess is the Hive
On Thu, Aug 27, 2015 at 2:59 PM, Shreeharsha G Neelakantachar
shreeharsh...@in.ibm.com wrote:
root@:~# sbt package
Getting org.scala-sbt sbt 0.13.9 ...
:: problems summary ::
WARNINGS
module not found: org.scala-sbt#sbt;0.13.9
...
ERRORS
Server access
Hello
I am new to the Spark world and started to explore it recently in standalone mode.
It would be great if I could get clarification on the doubts below:
1. Driver locality - It is mentioned in the documentation that client
deploy-mode is not good if the machine running spark-submit is not co-located
with the workers
Can we run Hive queries using spark-avro?
In our case it's not just reading the Avro file; we have a view in Hive which
is based on multiple tables.
On Thu, Aug 27, 2015 at 9:41 AM, Giri P gpatc...@gmail.com wrote:
We are using Hive 1.1.
I was able to fix the below error when I used the right version
Hi!
I have enabled checkpointing in Spark Streaming with Kafka. I can see that
Spark Streaming is checkpointing to the mentioned directory in HDFS. How can
I test that it works fine and recovers with no data loss?
Thanks
Hi,
Any way to store/load RDDs keeping their original objects instead of strings?
I am having trouble with Parquet (there is always some error at class
conversion), and I don't use Hadoop. Looking for alternatives.
Thanks in advance
Saif
Thanks for this tip.
I ran it in yarn-client mode with driver-memory = 4G and took a dump once the
heap got close to 4G.
 num     #instances         #bytes  class name
 ----------------------------------------------
   1:        446169     3661137256  [J
   2:       2032795      222636720
We are using Hive 1.1.
I was able to fix the error below when I used the right version of Spark:
15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered AvroSerdeException
determining schema. Returning signal schema to indicate problem
org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither
Does anyone have a similar problem or thoughts on this?
On Wed, Aug 26, 2015 at 10:37 AM, Chen Song chen.song...@gmail.com wrote:
When running a long-lived job on YARN like Spark Streaming, I found that
container logs are gone after days on executor nodes, although the job itself
is still running.
I am
Thanks for the foreach idea. But once I used it I got an empty RDD. I think it's
because results is an iterator.
Yes, I know map is lazy, but I expected there is a solution to force an action.
I cannot use foreachPartition because I need to reuse the new RDD after some
maps.
On Thu, Aug 27, 2015 at 5:11 PM, Cody
BTW, spark-avro works great in our experience, but still, some non-tech
people just want to use it as a SQL shell in Spark, like the Hive CLI.
To clarify: you can still use the spark-avro library with pure SQL. Just
use the CREATE TABLE ... USING com.databricks.spark.avro OPTIONS (path
'...')
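A minimal sketch of that pure-SQL usage from the Scala shell; the table name and path are placeholders:

// Sketch: register an Avro-backed table through spark-avro using only SQL,
// then query it like any other table.
sqlContext.sql(
  """CREATE TEMPORARY TABLE events
    |USING com.databricks.spark.avro
    |OPTIONS (path '/data/events.avro')""".stripMargin)

sqlContext.sql("SELECT count(*) FROM events").show()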
Hi Swapnil,
Let me try to answer some of the questions. Answers inline. Hope it helps.
On Thursday, August 27, 2015, Swapnil Shinde swapnilushi...@gmail.com
wrote:
Hello
I am new to spark world and started to explore recently in standalone
mode. It would be great if I get clarifications on
You need to return an iterator from the closure you provide to mapPartitions
On Thu, Aug 27, 2015 at 1:42 PM, Ahmed Nawar ahmed.na...@gmail.com wrote:
Thanks for the foreach idea. But once I used it I got an empty RDD. I think it's
because results is an iterator.
Yes, I know map is lazy, but I expected
Yes, of course, I am doing that. But once I added results.foreach(row => {})
I got an empty RDD.
rdd.mapPartitions(partitionOfRecords => {
DBConnectionInit()
val results = partitionOfRecords.map(..)
DBConnection.commit()
results.foreach(row => {})
results
})
On Thu, Aug 27, 2015 at 10:18
This job contains a spark output action, and is what I originally meant:
rdd.mapPartitions {
result
}.foreach {
}
This job is just a transformation, and won't do anything unless you have
another output action. Not to mention, it will exhaust the iterator, as
you noticed:
rdd.mapPartitions
On Thu, Aug 27, 2015 at 5:40 PM, Jacek Laskowski ja...@japila.pl wrote:
Server access Error: java.lang.RuntimeException: Unexpected error:
java.security.InvalidAlgorithmParameterException: the trustAnchors parameter
must be non-empty
Hi all ,
Can we create a data frame from an Excel sheet or CSV file? In the below example it
seems they support only JSON.
DataFrame df =
sqlContext.read().json("examples/src/main/resources/people.json");
Check out spark-csv: http://spark-packages.org/package/databricks/spark-csv
On Thu, Aug 27, 2015 at 11:48 AM, spark user spark_u...@yahoo.com.invalid
wrote:
Hi all ,
Can we create a data frame from an Excel sheet or CSV file? In the below example
it seems they support only JSON.
DataFrame df =
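For example, a sketch of loading a CSV with the spark-csv package (Scala API here; the path and options are illustrative):

// Sketch: read a CSV file into a DataFrame with spark-csv (Spark 1.4+ reader API).
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line contains column names
  .option("inferSchema", "true")   // guess column types
  .load("examples/src/main/resources/people.csv")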
Thanks Rishitesh !!
1. I get that the driver doesn't need to be on the master, but there is a lot of
communication between the driver and the cluster. That's why a co-located gateway was
recommended. How much is the impact of the driver not being co-located with the
cluster?
4. How does an HDFS split get assigned to a worker node
Yes, any Java serializable object. It's important to note that since it's
saving serialized objects, it is as brittle as Java serialization when it
comes to version changes, so if you can make your data fit in something
like sequence files, Parquet, Avro, or similar it can be not only more
space
I am working on code which uses executor service to parallelize tasks
(think machine learning computations done over small dataset over and over
again).
My goal is to execute some code as fast as possible, multiple times and
store the result somewhere (total executions will be on the order of 100M
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test
with an RDD[Array[String]], but when I tried to read back the result with
sc.objectFile(path).take(5).foreach(println), I got a non-promising output
looking like:
[Ljava.lang.String;@46123a
[Ljava.lang.String;@76123b
Ah, yes, that did the trick.
So more generally, can this handle any serializable object?
On Thu, Aug 27, 2015 at 2:11 PM, Jonathan Coveney jcove...@gmail.com
wrote:
Array[String] doesn't pretty print by default. Use .mkString(",") for
example.
On Thursday, August 27, 2015, Arun Luthra
This is a question on general usage/best practice/best transformation
method to use for
a sentiment analysis on tweets...
Input:
Tweets (e.g., "@xyz, sorry but this movie is poorly scripted
http://t.co/uyser876") - large data set, i.e. 1 billion tweets
Sentiment dictionary (e.g.,
Thanks a lot for your support. It is working now.
I wrote it like below
val newRDD = rdd.mapPartitions { partition => {
val result = partition.map(...)
result
}
}
newRDD.foreach {
}
On Thu, Aug 27, 2015 at 10:34 PM, Cody Koeninger c...@koeninger.org wrote:
This job contains a spark
I see the following error from time to time when trying to start slaves on spark
1.4.0
[hadoop@ip-10-0-27-240 apps]$ pwd
/mnt/var/log/apps
[hadoop@ip-10-0-27-240 apps]$ cat
spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ip-10-0-27-240.ec2.internal.out
Spark Command: /usr/java/latest/bin/java -cp
Array[String] doesn't pretty print by default. Use .mkString(",") for
example.
On Thursday, August 27, 2015, Arun Luthra arun.lut...@gmail.com
wrote:
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test
with an RDD[Array[String]], but when I tried to read back the
So println of any array of strings will look like that. The
java.util.Arrays class has some options to print arrays nicely.
On Thu, Aug 27, 2015 at 2:08 PM, Arun Luthra arun.lut...@gmail.com wrote:
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test
with an
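Putting the two suggestions together, a small sketch of round-tripping an RDD[Array[String]] and printing it readably (the path is a placeholder):

// Sketch: save, read back, and pretty-print the arrays with mkString.
val data = sc.parallelize(Seq(Array("a", "b"), Array("c", "d")))
data.saveAsObjectFile("/tmp/arrays-out")

sc.objectFile[Array[String]]("/tmp/arrays-out")
  .take(5)
  .foreach(arr => println(arr.mkString(",")))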
Hi Team,
Say I have a test.json file: {"c1":[1,2,3]}
I can create a parquet file like:
var df = sqlContext.load("/tmp/test.json", "json")
var df_c = df.repartition(1)
df_c.select("*").save("/tmp/testjson_spark", "parquet")
The output parquet file’s schema is like:
c1: OPTIONAL F:1
.bag:
Hi everyone,
I am trying to run the KDD data set - basically chapter 5 of the Advanced
Analytics with Spark book. The data set is 789 MB, but Spark is taking
some 3 to 4 hours. Is this normal behaviour, or is some tuning required?
The server RAM is 32 GB, but we can only give 4 GB RAM on 64 bit
I'm getting errors like "Removing executor with no recent heartbeats" and
"Missing an output location for shuffle" for a large Spark SQL join
(1bn rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to
configure the job to avoid them.
The initial stage completes fine with some 30k tasks on
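For anyone hitting the same errors, these are settings commonly adjusted for very large shuffles - the values below are illustrative placeholders, not tuned recommendations for this particular job:

// Sketch: more (smaller) shuffle partitions and a longer network timeout can help
// when executors pause for GC during a big SparkSQL join. The SparkConf settings
// must be applied before the context is created (or via spark-submit --conf).
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

val conf = new org.apache.spark.SparkConf()
  .set("spark.network.timeout", "600s")
  .set("spark.shuffle.io.maxRetries", "10")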
Hi James,
It's a good idea. A JSON format is more convenient for visualization though a
little inconvenient to read. How about a toJson() method? It might make the MLlib
API inconsistent across models though.
You should probably create a JIRA for this.
CC: dev list
-Manish
On Aug 26, 2015,
You can create a custom mesos framework for your requirement, to get you
started you can check this out
http://mesos.apache.org/documentation/latest/app-framework-development-guide/
Thanks
Best Regards
On Mon, Aug 24, 2015 at 12:11 PM, Romi Kuntsman r...@totango.com wrote:
Hi,
I have a spark
Just did some tests.
I have 6000 files; each has 14K records, with a total file size of 900 MB. In Spark
SQL, it would take one task roughly 1 min to parse.
On the local machine, using the same Jackson lib that ships inside the Spark lib, I
just parse it:
FileInputStream fstream = new FileInputStream(testfile);
Hey
I am using the Json4s-Jackson parser coming with Spark and parsing roughly 80m
records with a total size of 900 MB.
But the speed is slow. It took my 50 nodes (16-core CPU, 100 GB memory) roughly
30 mins to parse the JSON for use with Spark SQL.
Jackson has benchmarks saying parsing should be at the millisecond level.
It seems like this might be better suited to a broadcasted hash map since
200k entries isn't that big. You can then map over the tweets and look up
each word in the broadcasted map.
On Thursday, August 27, 2015, Jesse F Chen jfc...@us.ibm.com wrote:
This is a question on general usage/best
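A rough sketch of that broadcast-lookup pattern; the dictionary contents, input path and tokenization are placeholders:

// Sketch: broadcast a small sentiment dictionary and score each tweet by
// summing the scores of its words.
val tweets = sc.textFile("/data/tweets.txt")                          // placeholder input
val sentiment: Map[String, Int] = Map("poorly" -> -1, "great" -> 1)   // placeholder dictionary
val sentimentBc = sc.broadcast(sentiment)

val scores = tweets.map { tweet =>
  val score = tweet.toLowerCase.split("\\s+")
    .map(word => sentimentBc.value.getOrElse(word, 0))
    .sum
  (tweet, score)
}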
On Wed, Aug 26, 2015 at 11:02 PM, Joanne Contact
joannenetw...@gmail.com wrote:
Hi, I have an Ubuntu box with 4 GB memory and dual cores. Do you think it
won't be enough to run Spark Streaming and Kafka? I am trying to install
standalone-mode Spark and Kafka so I can debug them in an IDE. Do I need to
install
Thank you all for these links. I'll check them.
On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack charles.t.h...@gmail.com
wrote:
+1 to all of the above esp. Dimensionality reduction and locality
sensitive hashing / min hashing.
There's also an algorithm implemented in MLlib called DIMSUM which
Can you select something from this table using Hive? And also, could you post
the Spark code which leads to this exception?
Check permissions for the user which runs spark-shell.
"(Permission denied)" means that you do not have permission to /tmp.
Can you paste the stack trace? Is it complaining about the directory already
existing?
Thanks
Best Regards
On Thu, Aug 20, 2015 at 11:23 AM, Jeetendra Gangele gangele...@gmail.com
wrote:
HI All,
I have data in HDFS partitioned by Year/month/data/event_type, and I am
creating hive tables with
Are you reading the column data from a SQL database? If so, have a look at
the JdbcRDD. http://www.infoobjects.com/spark-sql-jdbcrdd/
Thanks
Best Regards
On Fri, Aug 21, 2015 at 1:25 AM, SAHA, DEBOBROTA ds3...@att.com wrote:
Hi ,
Can anyone help me in loading a column that may or may not
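For reference, a small JdbcRDD sketch; the driver URL, credentials, query and bounds are placeholders:

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// Sketch: read rows from a SQL table in parallel. The query must contain two '?'
// placeholders that JdbcRDD fills with the partition bounds.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/db", "user", "pass"),
  "SELECT id, some_col FROM my_table WHERE id >= ? AND id <= ?",
  lowerBound = 1L, upperBound = 1000000L, numPartitions = 10,
  mapRow = rs => (rs.getLong(1), rs.getString(2)))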
Thanks Akhil, I will have a look.
I have a doubt regarding spark streaming and fileStream. If spark
streaming crashes and, while spark was down, new files are created in the input
folder, then when spark streaming is launched again, how can I process these
files?
Thanks.
Regards.
Miguel.
On Thu, Aug
You can use the driver UI and click on Jobs -> Stages to see the number
of partitions being created for that job. If you want to increase the
partitions, then you could do a .repartition too. With the directStream API I
guess the # of partitions in spark will be equal to the number of partitions
in
Have a look at Spark Streaming. You can make use of ssc.fileStream.
Eg:
val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](input)
You can also specify a filter function
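For example, a sketch of the fileStream variant that takes a path filter (same Avro imports as the snippet above; the filter logic and newFilesOnly flag are illustrative):

import org.apache.hadoop.fs.Path

// Sketch: only pick up .avro files, and also process files already present in the
// directory when the stream starts (newFilesOnly = false).
val filteredStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]](
    input,
    (path: Path) => path.getName.endsWith(".avro"),
    newFilesOnly = false)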
How many mesos slaves do you have? And how many cores do you have in
total?
sparkConf.set("spark.mesos.coarse", "true")
sparkConf.set("spark.cores.max", "128")
These two configurations are sufficient. Now regarding the active tasks,
how many partitions are you seeing for that job? You
Did you try with s3a? Also make sure your key does not have any wildcard
chars in it.
Thanks
Best Regards
On Fri, Aug 21, 2015 at 2:03 AM, Shuai Zheng szheng.c...@gmail.com wrote:
Hi All,
I am trying to access an S3 file in Hadoop file format:
Below is my code:
I used to hit "GC: Overhead limit exceeded" and sometimes a GC heap error too.
Thanks
Best Regards
On Sat, Aug 22, 2015 at 2:44 AM, java8964 java8...@hotmail.com wrote:
In the test job I am running in Spark 1.3.1 in our staging cluster, I can
see the following information on the application stage
If you have enabled checkpointing, Spark will handle that for you.
Thanks
Best Regards
On Thu, Aug 27, 2015 at 4:21 PM, Masf masfwo...@gmail.com wrote:
Thanks Akhil, I will have a look.
I have a doubt regarding spark streaming and fileStream. If spark
streaming crashes and while spark
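For completeness, a sketch of the usual checkpoint-based recovery setup; the directory, app name and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: create-or-recover a StreamingContext from a checkpoint directory. On a
// restart after a crash, getOrCreate rebuilds the context and pending batches from
// the checkpoint instead of calling the creation function again.
val checkpointDir = "hdfs:///user/spark/checkpoints/myapp"

def createContext(): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // define input streams (fileStream, Kafka, ...) and transformations here
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()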
I have a spark v.1.4.1 on YARN job where the first stage has ~149,000 tasks
(it’s reading a few TB of data). The job itself is fairly simple - it’s just
getting a list of distinct values:
val days = spark
.sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
Are you using the Kryo serializer? If not, have a look at it, it can save a lot
of memory during shuffles
https://spark.apache.org/docs/latest/tuning.html
I did a similar task and had various issues with the volume of data being
parsed in one go, but that helped a lot. It looks like the main
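A minimal sketch of enabling Kryo and registering classes; KeyClass and ValueClass are the placeholders used earlier in this thread:

import org.apache.spark.SparkConf

// Sketch: switch the serializer to Kryo and register the classes that are shuffled most.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[KeyClass], classOf[ValueClass]))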
I should have mentioned: yes I am using Kryo and have registered KeyClass and
ValueClass.
I guess it’s not clear to me what is actually taking up space on the driver
heap - I can’t see how it can be data with the code that I have.
On 27/08/2015 12:09, Ewan Leith ewan.le...@realitymine.com
Hello,
I'm trying to query a nested data record of the form:
root
 |-- userid: string (nullable = true)
 |-- datarecords: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- system: boolean (nullable = true)
 |    |
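One way to query such nested arrays, assuming the record is loaded into a DataFrame df on Spark 1.4+ (a sketch using the explode function):

import org.apache.spark.sql.functions.explode

// Sketch: turn each element of the datarecords array into its own row, then select
// the nested struct fields. Column names follow the schema printed above.
val flattened = df
  .select(df("userid"), explode(df("datarecords")).as("rec"))
  .select("userid", "rec.name", "rec.system")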
Hi,
Is this a known issue of building Spark from today's sources using
Scala 2.11? I did the following:
➜ spark git:(master) ✗ git rev-parse HEAD
de0278286cf6db8df53b0b68918ea114f2c77f1f
➜ spark git:(master) ✗ ./dev/change-scala-version.sh 2.11
➜ spark git:(master) ✗ ./build/mvn -Pyarn
Hi,
I'm trying to nail it down myself, too. Is there anything relevant to
help on my side?
Regards,
Jacek
--
Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl
Follow me at https://twitter.com/jaceklaskowski
Upvote at
Hi,
Sean helped me offline and I sent
https://github.com/apache/spark/pull/8479 for review. That's the only
breaking place for the build I could find. Tested with 2.10 and 2.11.
Regards,
Jacek
--
Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl
Follow me at
Hm, if anything that would be related to my change at
https://issues.apache.org/jira/browse/SPARK-9613. Let me investigate.
There's some chance there is some Scala 2.11-specific problem here.
On Thu, Aug 27, 2015 at 8:51 AM, Jacek Laskowski ja...@japila.pl wrote:
Hi,
Is this a known issue of