On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:
> Hi Folks,
>
>
>
> I am exploring spark for streaming from two sources (a) Kinesis and (b)
> HDFS for some of our use-cases. Since we maintain state gathered over last
> x hours in spark streaming, we
Hi,
Can anybody help me by providing an example of how we can read the schema of a
data set from a file?
Thanks,
Divya
Your logs are getting archived in your logs bucket in S3.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
Regards
Sab
On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR
wrote:
> Hi
>
> I am using an EMR cluster for running my
Hi Folks,
I am exploring spark for streaming from two sources (a) Kinesis and (b)
HDFS for some of our use-cases. Since we maintain state gathered over last
x hours in spark streaming, we would like to replay the data from last x
hours as batches during deployment. I have gone through the Spark
Hi
I am using an EMR cluster for running my spark jobs, but after the job
finishes the logs disappear.
I have added a log4j.properties in my jar, but all the logs still redirect
to the EMR resource manager, which vanishes after the job completes. Is there
a way I could redirect the logs to a location in
Thanks Gourav, Eduardo
I tried http://localhost:8080 and http://OAhtvJ5MCA:8080/ . In both
cases Firefox just hangs.
I also tried with the lynx text-based browser. I get the message "HTTP
request sent; waiting for response." and it hangs as well.
Is there a way to enable debug logs in
On the line preceding the one that the compiler is complaining about (which
doesn't actually have a problem in itself), you declare df as
"df"+fileName, making it a string. Then you try to assign a DataFrame to
df, but it's already a string. I don't quite understand your intent with
that previous
Compaction should have been triggered automatically, as the following
properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION*
property has not been set for these tables.
hive.compactor.initiator.on = true
hive.compactor.worker.threads = 1
The maximum number of cores per executor can be controlled using
spark.executor.cores, and the maximum number of executors on a single worker
can be set via the environment variable SPARK_WORKER_INSTANCES.
However, to ensure that all available cores are used, you will have to take
care of how the stream
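As a sketch, the settings above can be combined on submission to pin down, say, two cores per executor capped at twelve cores total on a standalone cluster (master URL, jar name, and values are illustrative, not from the thread):

```shell
# Standalone-mode sketch: 2 cores per executor, at most 12 cores in total,
# i.e. at most 2 cores used on each of 6 workers.
spark-submit \
  --master spark://master:7077 \
  --conf spark.executor.cores=2 \
  --conf spark.cores.max=12 \
  my-streaming-app.jar
```

spark.cores.max is what caps the application's total core usage in standalone mode; without it the app grabs all available cores.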
Hi,
I am trying to dynamically create Dataframe by reading subdirectories under
parent directory
My code looks like
> import org.apache.spark._
> import org.apache.spark.sql._
> val hadoopConf = new org.apache.hadoop.conf.Configuration()
> val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new
>
Yes, I was burned by this issue a couple of weeks back. This also means
that after every insert job, compaction should be run so that Spark can access
the new rows. Sad that this issue is not documented / mentioned anywhere.
On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh
Hi Varadharajan,
Thanks for your response.
Yes, it is a transactional table; see *show create table* below.
The table has hardly 3 records, and after triggering minor compaction on the
tables, it started showing results in Spark SQL.
> *ALTER TABLE hivespark COMPACT 'major';*
> *show create table
Hi,
Is the transaction attribute set on your table? I observed that the Hive
transactional storage structure does not work with Spark yet. You can confirm
this by looking at the transactional attribute in the output of "desc
extended " in the Hive console.
If you need to access a transactional table,
Hi,
I have observed that Spark SQL is not returning records for Hive bucketed
ORC tables on HDP.
In Spark SQL, I am able to list all tables, but queries on Hive bucketed
tables are not returning records.
I have also tried the same for non-bucketed Hive tables; it works fine.
Same
Hi there,
I had a similar problem in Java with the standalone cluster on Linux but got
it working by passing the following option
-Dspark.jars=file:/path/to/sparkapp.jar
sparkapp.jar has the launch application
Hope that helps.
Regards,
Patrick
-Original Message-
From: Arko Provo
Well my version of Spark is 1.5.2
On 21/02/2016 23:54, Jacky Wang wrote:
> df.write.saveAsTable("db_name.tbl_name") // works, spark-shell, latest spark
> version 1.6.0
> df.write.saveAsTable("db_name.tbl_name") // NOT work, spark-shell, old spark
> version 1.4
>
> --
>
> Jacky Wang
>
>
Yeah, I tried with reduceByKey and was able to do it. Thanks Ayan and Jatin.
For reference, the final solution:
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("HBaseStream")
  val sc = new SparkContext(conf)
  // create a StreamingContext, the main entry point for
df.write.saveAsTable("db_name.tbl_name") // works, spark-shell, latest spark
version 1.6.0
df.write.saveAsTable("db_name.tbl_name") // NOT work, spark-shell, old spark
version 1.4
--
Jacky Wang
At 2016-02-21 17:35:53, "Mich Talebzadeh" wrote:
I looked at doc on
I believe the best way would be to use reduceByKey operation.
On Mon, Feb 22, 2016 at 5:10 AM, Jatin Kumar <
jku...@rocketfuelinc.com.invalid> wrote:
> You will need to do a collect and update a global map if you want to.
>
> myDStream.map(record => (record._2, (record._3, record._4, record._5)))
Sorry,
please include the following questions to the list above:
- the Spark version?
- whether you are using RDDs or DataFrames?
- is the code run locally, in Spark cluster mode, or on AWS EMR?
Regards,
Gourav Sengupta
On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta
Hi Tamara,
few basic questions first.
How many executors are you using?
Is the data getting all cached into the same executor?
How many partitions do you have of the data?
How many fields are you trying to use in the join?
If you need any help in finding answers to these questions please let me
w.r.t. the new mapWithState(), there have been some bug fixes since the
release of 1.6.0
e.g.
SPARK-13121 java mapWithState mishandles scala Option
Looks like 1.6.1 RC should come out next week.
FYI
On Sun, Feb 21, 2016 at 10:47 AM, Chris Fregly wrote:
> good catch on the
good catch on the cleaner.ttl
@jatin- when you say "memory-persisted RDD", what do you mean exactly? and
how much data are you talking about? remember that spark can evict these
memory-persisted RDDs at any time. they can be recovered from Kafka, but this
is not a good situation to be in.
Hi Ted,
Thanks a lot for your reply.
I tried your code in spark-shell on my laptop and it works well.
But when I tried it on another computer with Spark installed I got an error
scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B",
"C", "num")
:11: error: value toDF is not a
You will need to do a collect and update a global map if you want to.
myDStream.map(record => (record._2, (record._3, record._4, record._5)))
  .reduceByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2, r1._3 + r2._3))
  .foreachRDD(rdd => {
    rdd.collect().foreach((fileName,
Never mind; it seems an executor-level mutable map is not recommended, as
stated in
http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/
On Sun, Feb 21, 2016 at 9:54 AM, Vinti Maheshwari
wrote:
> Thanks for your reply Jatin. I changed my
Thanks for your reply Jatin. I changed my parsing logic to what you
suggested:
def parseCoverageLine(str: String) = {
  val arr = str.split(",")
  ...
  ...
  (arr(0), arr(1) :: count.toList) // (test, [file, 1, 1, 2])
}
Then in the grouping, can I use a global hash map
Hello Vinti,
One way to get this done is to split your input line into a key/value
tuple and then simply use groupByKey and handle the values the way you
want. For example:
Assuming you have already split the values into a 5 tuple:
myDStream.map(record => (record._2, (record._3,
Hi,
I'm running a spark streaming application on a Spark cluster that spans 6
machines/workers. I'm using Spark standalone cluster mode. Each machine has
8 cores. Is there any way to specify that I want to run my application on
all 6 machines and use just 2 cores on each machine?
Thanks
Thanks a lot!
Best Regards,
Weiwei
On Sat, Feb 20, 2016 at 11:53 PM, Hemant Bhanawat
wrote:
> toDF internally calls sqlcontext.createDataFrame which transforms the RDD
> to RDD[InternalRow]. This RDD[InternalRow] is then mapped to a dataframe.
>
> Type conversions (from
w.r.t. cleaner TTL, please see:
[SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
FYI
On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas wrote:
>
> It sounds like another window operation on top of the 30-min window will
> achieve the desired objective.
> Just
Hello,
I have input lines like below
*Input*
t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2
and I want to achieve the output below, which is a vertical
addition of the corresponding numbers.
*Output*
“file1” : [ 1+1+5, 1+2+5, 1+3+5
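The per-file, element-wise addition being asked for (what the thread converges on with reduceByKey) can be sketched outside Spark in plain Python, using the sample rows from this post; the merge step below is what would run per key on the cluster:

```python
from collections import defaultdict

# Sample input lines from the post: tag, file name, then numeric columns.
lines = [
    "t1, file1, 1, 1, 1",
    "t1, file1, 1, 2, 3",
    "t1, file2, 2, 2, 2, 2",
    "t2, file1, 5, 5, 5",
    "t2, file2, 1, 1, 2, 2",
]

# Group by file name and add the numeric columns element-wise.
totals = defaultdict(list)
for line in lines:
    parts = [p.strip() for p in line.split(",")]
    key, counts = parts[1], [int(x) for x in parts[2:]]
    acc = totals[key]
    for i, v in enumerate(counts):
        if i < len(acc):
            acc[i] += v
        else:
            acc.append(v)

# totals["file1"] -> [7, 8, 9]   (1+1+5, 1+2+5, 1+3+5)
# totals["file2"] -> [3, 3, 4, 4]
```

In Spark the same merge would be the binary function passed to reduceByKey, applied pairwise to the per-file count lists.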
Make sure the xml input file is well formed (check your end tags).
Sent from my iPhone
> On Feb 21, 2016, at 8:14 AM, Prathamesh Dharangutte
> wrote:
>
> This is the code I am using for parsing xml file:
>
>
>
> import org.apache.spark.{SparkConf,SparkContext}
>
I tried the following in spark-shell:
scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B",
"C", "num")
df0: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more
fields]
scala> val idList = List("1", "2", "3")
idList: List[String] = List(1, 2, 3)
scala> val
No because you didn't say that explicitly. Can you share a sample file too?
On Sun, 21 Feb 2016, 14:34 Prathamesh Dharangutte
wrote:
> I am using spark 1.4.0 with scala 2.10.4 and 0.3.2 of spark-xml
> Orderid is empty for some books and multiple entries of it for other
I am using spark 1.4.0 with scala 2.10.4 and 0.3.2 of spark-xml. Orderid is
empty for some books and there are multiple entries of it for other books;
did you include that in your xml file?
Just ran that code and it works fine, here is the output:
What version are you using?
val ctx = SQLContext.getOrCreate(sc)
val df = ctx.read.format("com.databricks.spark.xml").option("rowTag",
"book").load("file:///tmp/sample.xml")
df.printSchema()
root
|-- name: long (nullable = true)
|--
This is the code I am using for parsing xml file:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import com.databricks.spark.xml
object XmlProcessing {
  def main(args: Array[String]) = {
    val conf = new SparkConf()
Can you paste the code you are using?
On Sun, 21 Feb 2016, 13:19 Prathamesh Dharangutte
wrote:
> I am trying to parse xml file using spark-xml. But for some reason when i
> print schema it only shows root instead of the hierarchy. I am using
> sqlcontext to read the
I am trying to parse an xml file using spark-xml, but for some reason when I
print the schema it only shows root instead of the hierarchy. I am using
sqlContext to read the data. I am proceeding according to this video:
https://www.youtube.com/watch?v=NemEp53yGbI
The structure of xml file is somewhat
It sounds like another window operation on top of the 30-min window will
achieve the desired objective.
Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl)
to a long enough value and you will require enough resources (mem & disk)
to keep the required data.
-kr, Gerard.
Hello Spark users,
I have to aggregate messages from Kafka and, at some fixed interval (say
every half hour), update a memory-persisted RDD and run some computation.
This computation uses the last one day's data. The steps are:
- Read from realtime Kafka topic X in spark streaming batches of 5 seconds
-
I have a DataFrame that has a Python dict() as one of the columns. I'd like
to filter the DataFrame for those Rows where the dict() contains a
specific value, e.g. something like this:
DF2 = DF1.filter('name' in DF1.params)
but that gives me this error
ValueError: Cannot convert column
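Outside Spark, the row-level test being attempted is just Python's `in` on a dict; a minimal plain-Python sketch over a list of rows (data and field names illustrative) shows the intended logic. In PySpark itself, a row-level predicate like this generally has to be wrapped in a UDF, because plain Python `in` cannot be applied to a Column expression:

```python
# Plain-Python sketch of the intended filter:
# keep rows whose params dict contains the key "name".
rows = [
    {"id": 1, "params": {"name": "x", "retries": 3}},
    {"id": 2, "params": {"retries": 1}},
]
kept = [r for r in rows if "name" in r["params"]]
# kept contains only the row with id 1
```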
That's very useful information.
The reason for the weird problem is the non-determinism of the RDD
before applying randomSplit.
By caching the RDD, we can make it deterministic, and so the problem is
solved.
Thank you for your help.
2016-02-21 11:12 GMT+07:00 Ted Yu :
>
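The effect described above can be imitated outside Spark: if the parent data is regenerated differently on each evaluation, two splits that each re-evaluate the parent no longer partition one dataset, whereas materializing (caching) first restores a clean partition. A plain-Python sketch of that idea (all names and mechanics illustrative, not Spark internals):

```python
import random

calls = {"n": 0}

def compute():
    # Simulates recomputing a non-deterministic RDD: every evaluation
    # produces different data (here, shifted by the call count).
    calls["n"] += 1
    return [i + calls["n"] * 100 for i in range(10)]

def random_split(evaluate, weights, seed=7):
    # Mimics randomSplit: each split independently re-evaluates the
    # parent and keeps rows whose per-row draw lands in its range.
    total, acc, bounds = sum(weights), 0.0, []
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits, lo = [], 0.0
    for hi in bounds:
        data = evaluate()           # re-evaluation happens here
        rng = random.Random(seed)   # same seed per split -> same draws
        splits.append([x for x in data if lo <= rng.random() < hi])
        lo = hi
    return splits

# Uncached: each split sees *different* data (100s, then 200s),
# so the two splits do not partition any single dataset.
a, b = random_split(compute, [0.5, 0.5])

# "Cached": materialize once; both splits filter the same data,
# giving an exact partition of it.
cached = compute()
c, d = random_split(lambda: cached, [0.5, 0.5])
```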
I looked at the doc on this. It is not clear what goes on behind the scenes;
there is very little documentation on it.
First, in Hive a database has to exist before it can be used, so sql(“use
mytable”) will not create a database for you.
Also, you cannot call your table mytable in database mytable!