Hi,
What happens if the master node fails in the case of Spark Streaming? Would
the data be lost?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Regarding-master-node-failure-tp23701.html
Sent from the Apache Spark User List mailing list
Hi,
How to set the number of executors and tasks in a Spark Streaming job in
Mesos? I have the following settings, but my job still shows me 11 active
tasks and 11 executors. Any idea as to why this is happening?
sparkConf.set("spark.mesos.coarse", "true")
sparkConf.set("spark.cores.max", "128")
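For reference, this is a sketch of how we are setting things (the values are just
what we tried; the property names are standard SparkConf keys passed as strings, and
spark.default.parallelism is only a guess at what might influence the task count):

val sparkConf = new org.apache.spark.SparkConf()
  .set("spark.mesos.coarse", "true")        // coarse-grained Mesos mode
  .set("spark.cores.max", "128")            // cap on total cores acquired across the cluster
  .set("spark.default.parallelism", "128")  // hypothetical; affects the default number of partitions/tasks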
Hi,
I see the following error in my Spark job even after using around 100 cores
and 16G memory. Did any of you experience the same problem earlier?
15/08/18 21:51:23 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block
input-0-1439959114400, and will not retry (0 retries)
the performance
be impacted?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Json-file-groupby-function-tp9618p24041.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
keys
with different return values?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-UpdateStateByKey-Functions-in-the-same-job-tp24119.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
, classOf[LongWritable], classOf[Text]).
map { case (x, y) => (x.toString, y.toString) }
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-multiple-sequence-files-from-HDFS-to-a-Spark-Context-to-do-Batch-processing-tp24102.html
Hi,
What is the optimal approach to do a secondary sort in Spark? I have to first
sort by an id in the key and further sort by a timestamp which is present
in the value.
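For illustration, here is a minimal sketch of the shape I have in mind (the record
layout and IdPartitioner are hypothetical); it partitions only by the id and lets
Spark sort the composite (id, timestamp) key within each partition:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

class IdPartitioner(override val numPartitions: Int) extends Partitioner {
  // partition only by the id part of the composite key
  def getPartition(key: Any): Int = key match {
    case (id: String, _) => math.abs(id.hashCode) % numPartitions
  }
}

val events: RDD[(String, (Long, String))] = sc.parallelize(Seq(
  ("id1", (1439959114400L, "a")), ("id1", (1439959114100L, "b"))))

val sorted = events
  .map { case (id, (ts, payload)) => ((id, ts), payload) }       // composite key (id, timestamp)
  .repartitionAndSortWithinPartitions(new IdPartitioner(64))     // sorts by (id, ts) inside each partition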
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-optimal
Scala?
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).
map { case (x, y) => (x.toString,
  Utils.toJsonObject(y.toString).get("request").getAsJsonObject().get("queryString").toString) }
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560
with
active sessions are still available for joining with those in the current
job. So, what do we need to do to keep the data in memory between two batch
jobs? Can we use Tachyon?
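For illustration, the two variants I am considering (the property names and the
tachyon:// URI are assumptions on my part, not something I have verified):

import org.apache.spark.storage.StorageLevel

// within one application: off-heap persistence (backed by Tachyon in Spark 1.x,
// configured via spark.externalBlockStore.* / spark.tachyonStore.* depending on version)
activeSessions.persist(StorageLevel.OFF_HEAP)        // activeSessions is a hypothetical RDD

// across applications: write to a Tachyon path and read it back in the next job
activeSessions.saveAsObjectFile("tachyon://tachyonmaster:19998/activeSessions")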
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RDDs
Streaming?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-maintain-Windows-of-data-along-with-maintaining-session-state-using-UpdateStateByKey-tp23986.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Hi,
Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark
Streaming to do lookups/updates/deletes in RDDs using keys by storing them
as key/value pairs.
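For illustration, this is the kind of usage we have in mind (a sketch based on the
IndexedRDD README; the data is hypothetical):

import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

val pairs = sc.parallelize((1L to 1000000L).map(id => (id, "session-" + id)))
val indexed = IndexedRDD(pairs).cache()               // build the index once

indexed.get(42L)                                      // point lookup by key
val updated = indexed.put(42L, "updated-session")     // returns a new IndexedRDD
val pruned  = updated.delete(Array(1L, 2L))           // delete by key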
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Is-IndexedRDD
Hi Ankur,
Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark
Streaming to do lookups/updates/deletes in RDDs using keys by storing them
as key/value pairs.
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com
have
to use any code like ssc.checkpoint(checkpointDir)? Also, how is the
performance if I use both DStream Checkpointing for maintaining the state
and use Kafka Direct approach for exactly once semantics?
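For reference, a sketch of what I think the recovery-friendly setup looks like
(checkpointDir and sparkConf are placeholders; the stream construction is elided):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myStreamingJob"   // hypothetical path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(30))
  ssc.checkpoint(checkpointDir)        // required for stateful ops like updateStateByKey
  // ... create the Kafka direct stream and the rest of the pipeline here ...
  ssc
}

// rebuilds the same context from the checkpoint on restart
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)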
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560
?
SparkContext.addFile()
SparkFiles.get(fileName)
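For illustration, this is the pattern I mean (the path and file name are hypothetical):

import org.apache.spark.SparkFiles

sc.addFile("hdfs:///config/lookup.txt")            // shipped once to every executor

val filtered = lines.mapPartitions { iter =>       // lines: RDD[String], hypothetical
  val localPath = SparkFiles.get("lookup.txt")     // local copy on the executor
  val lookup = scala.io.Source.fromFile(localPath).getLines().toSet
  iter.filter(lookup.contains)
}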
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Distributed-caching-of-a-file-in-SPark-Streaming-tp25157.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Hi,
We currently use reduceByKey to reduce by a particular metric name in our
Streaming/Batch job. It seems to be doing a lot of shuffles, and that has an
impact on performance. Does using a custom partitioner before calling
reduceByKey improve performance?
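For illustration, a sketch of what I mean by passing a partitioner (metrics and
otherByMetric are hypothetical RDD[(String, Long)]); my understanding is that
reduceByKey already combines map-side, so the partitioner mainly helps when later
stages reuse the same partitioning:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(128)
val reduced = metrics.reduceByKey(partitioner, _ + _)
val joined  = reduced.join(otherByMetric.partitionBy(partitioner))  // no re-shuffle of reduced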
Thanks,
Swetha
--
View this message
Hi,
I see a lot of unwanted SysOuts when I try to save an RDD as a Parquet file.
Following are the code and the SysOuts. Any idea as to how to avoid the
unwanted SysOuts?
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job,
Hi ,
What is the appropriate dependency to include for Spark IndexedRDD? I get a
compilation error if I include 0.3 as the version as shown below:
<dependency>
  <groupId>amplab</groupId>
  <artifactId>spark-indexedrdd</artifactId>
  <version>0.3</version>
</dependency>
Thanks,
Swetha
--
View this message in context:
http://apache
Hi,
I see an unwanted warning when I try to save a Parquet file to HDFS in Spark.
Please find below the code and the warning message. Any idea as to how to
avoid the unwanted warning message?
activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
classOf[ActiveSession],
Hi,
How to unpersist a DStream in Spark Streaming? I know that we can persist
using dStream.persist() or dStream.cache(). But I don't see any method to
unpersist.
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-unpersist-a-DStream
formats would be of great help.
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-approach-to-store-an-RDD-as-a-file-in-HDFS-and-read-it-back-as-an-RDD-tp25279.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Hi,
What is the efficient way to join two RDDs? Would converting both RDDs
to IndexedRDDs be of any help?
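For illustration, the alternative I am comparing against is pre-partitioning both
sides with the same partitioner (leftRdd and rightRdd are hypothetical pair RDDs):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(200)
val left   = leftRdd.partitionBy(p).persist()
val right  = rightRdd.partitionBy(p).persist()
val joined = left.join(right)   // both sides share the partitioner, so the join avoids an extra shuffle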
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-efficient-way-to-Join-two-RDDs-tp25310.html
Sent from the Apache Spark
Hi,
I have a requirement wherein I have to load data from HDFS, build an RDD,
then look up by key to do some updates to the value, and then save it back to
HDFS. How to look up a value using a key in an RDD?
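For illustration, a sketch of the load, look up, update, and save flow I have in
mind (paths, the key, and update() are hypothetical):

def update(key: String, value: String): String = value     // stand-in for the real update logic

val pairs = sc.sequenceFile[String, String]("hdfs:///data/sessions/input")
val valuesForKey: Seq[String] = pairs.lookup("some-session-id")   // all values for one key

pairs.map { case (k, v) => (k, update(k, v)) }
  .saveAsSequenceFile("hdfs:///data/sessions/output")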
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list
to do shuffles?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Job-splling-to-disk-and-memory-in-Spark-Streaming-tp25149.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
(this.trackerClass)
Some(newCount)
}
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-have-Single-refernce-of-a-class-in-Spark-Streaming-tp25103.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Hi,
I have the following functions that I am using for my job in Scala. If you
see the getSessionId function, I am returning null sometimes. If I return
null, the only way that I can avoid processing those records is by filtering
out the null records. I wanted to avoid having another pass for filtering.
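One shape I am considering, so the filtering happens in the same pass (field names
are hypothetical): return Option instead of null and flatMap, so None records simply
disappear:

def getSessionId(fields: Map[String, String]): Option[String] =
  fields.get("sessionId").filter(_.nonEmpty)

val sessions = records.flatMap { r =>       // records: RDD[Map[String, String]], hypothetical
  getSessionId(r).map(id => (id, r))        // None vanishes; Some(id) yields one pair
}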
Hi,
I see a java.lang.NoClassDefFoundError after changing the Streaming job
version to 1.5.2. Any idea as to why this is happening? Following are my
dependencies and the error that I get.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${sparkVersion}</version>
</dependency>
Hi,
How is the ContextCleaner different from spark.cleaner.ttl? Is
spark.cleaner.ttl still needed when there is a ContextCleaner in the Streaming job?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/broadcast-variable-get-cleaned-by-ContextCleaner
Sep 17 18:52 shuffle_23103_6_0.data
-rw-r--r-- 1 371932812 Sep 17 18:52 shuffle_23125_6_0.data
-rw-r--r-- 1 19857974 Sep 17 18:53 shuffle_23291_19_0.data
-rw-r--r-- 1 55342005 Sep 17 18:53 shuffle_23305_8_0.data
-rw-r--r-- 1 92920590 Sep 17 18:53 shuffle_23303_4_0.data
Thanks,
Swetha
12. You either provided an
invalid fromOffset, or the Kafka topic has been damaged
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-leader-exception-in-Kafka-Direct-for-Streaming-tp24891.html
Sent from the Apache Spark User List mailing
val groupedSessions = sessions.groupByKey();
val sortedSessions = groupedSessions.mapValues[(List[(Long,
String)])](iter => iter.toList.sortBy(_._1))
Does the use of transform for code reuse affect groupByKey performance?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-use
variable when submitting a job in Spark?
-Dcom.w1.p1.config.runOnEnv=dev
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-System-environment-variables-in-Spark-tp24875.html
Sent from the Apache Spark User List mailing list archive
Hi,
How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of
keys for which I need to compute a sum and an average inside updateStateByKey
by joining with the old state. How do I accomplish that?
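For illustration, one shape I am considering (metricStream is a hypothetical
DStream[(String, Double)]): reduce per batch into (sum, count), then let the state
carry the running (sum, count) so the average can be derived at any time:

val perBatch = metricStream.mapValues(v => (v, 1L))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

val updateFn = (newValues: Seq[(Double, Long)], state: Option[(Double, Long)]) => {
  val (oldSum, oldCount) = state.getOrElse((0.0, 0L))
  val (batchSum, batchCount) = newValues.foldLeft((0.0, 0L)) {
    case ((s, c), (bs, bc)) => (s + bs, c + bc)
  }
  Some((oldSum + batchSum, oldCount + batchCount))
}

val withState = perBatch.updateStateByKey(updateFn)
val averages  = withState.mapValues { case (sum, count) => if (count == 0) 0.0 else sum / count }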
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560
Hi,
How to make Group By more efficient? Is it recommended to use a custom
partitioner and then do a Group By? And can we use a custom partitioner and
then use a reduceByKey for optimization?
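For illustration, the two variants I am weighing (Event and events are hypothetical):

case class Event(userId: String, ts: Long)

// reduceByKey / aggregateByKey combine on the map side before shuffling
val counts = events.map(e => (e.userId, 1L)).reduceByKey(_ + _)

// when a per-key collection really is needed, aggregateByKey still gets map-side combining
val perUser = events.map(e => (e.userId, e))
  .aggregateByKey(List.empty[Event])((acc, e) => e :: acc, _ ::: _)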
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3
Hi,
When I try to recover my Spark Streaming job from a checkpoint directory, I
get a StackOverflowError as shown below. Any idea as to why this is
happening?
15/09/18 09:02:20 ERROR streaming.StreamingContext: Error starting the
context, marking it as stopped
java.lang.StackOverflowError
Hi,
How to obtain the current key in updateStateByKey?
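For illustration, I am wondering whether the Iterator-based overload is the way to
go, since it passes the key into the update function (pairDStream is a hypothetical
DStream[(String, Int)]):

import org.apache.spark.HashPartitioner

val updateWithKey = (iter: Iterator[(String, Seq[Int], Option[Long])]) =>
  iter.map { case (key, newValues, state) =>
    // the key is visible here, e.g. for per-key logic or logging
    (key, state.getOrElse(0L) + newValues.sum)
  }

val stateStream = pairDStream.updateStateByKey(
  updateWithKey, new HashPartitioner(8), true)   // last flag: remember the partitioner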
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-obtain-the-key-in-updateStateByKey-tp24792.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Hi,
How to return an RDD of key/value pairs from an RDD that has
foreachPartition applied? I have my code something like the following. It
looks like an RDD that has foreachPartition applied can only have Unit as the
return type. How do I apply foreachPartition, do a save, and at the same time
return a pair?
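For illustration, the alternative I am considering is mapPartitions, which keeps the
per-partition batching but still yields pairs (Record, recordsRdd and saveToStore are
hypothetical):

import org.apache.spark.rdd.RDD

case class Record(sessionId: String, payload: String)
def saveToStore(batch: Seq[Record]): Unit = ()    // stand-in for the real per-partition save

val pairs: RDD[(String, String)] = recordsRdd.mapPartitions { iter =>
  val batch = iter.toList
  saveToStore(batch)                              // side effect per partition
  batch.map(r => (r.sessionId, r.payload)).iterator
}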
Hi,
We have a lot of temp files that get created due to shuffles caused by
group by. How to clear the files that get created due to intermediate
operations in group by?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-clear
Hi,
Has anybody used the fastutil equivalent of HashSet for Strings in Spark? Any
example would be of great help.
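For illustration, the kind of usage I have in mind (events is a hypothetical
RDD[(String, String)]); my understanding is that for String elements the gain over
java.util.HashSet is mostly lower memory overhead:

import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet

val ids = new ObjectOpenHashSet[String]()
ids.add("session-1")
ids.add("session-2")

val broadcastIds = sc.broadcast(ids)     // fastutil collections are Serializable
val filtered = events.filter { case (id, _) => broadcastIds.value.contains(id) }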
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/FastUtil-DataStructures-in-Spark-tp25429.html
Sent from the Apache Spark User List
Hi,
We see a bunch of issues like the following in our Spark Kafka Direct job. Any
idea as to how to make the Kafka Direct consumers show up in Kafka consumer
reporting to debug this issue?
Job aborted due to stage failure: Task 47 in stage 336.0 failed 4 times,
most recent failure: Lost task 47.3 in
):
java.lang.AssertionError: assertion failed: Ran out of messages before
reaching ending offset 221572238 for topic hubble_stream partition 88 start
221563725. This should not happen, and indicates that messages may have been
lost
Thanks,
Swetha
--
View this message in context:
http://apache
Hi,
Does the use of a custom partitioner in Streaming affect performance?
On Mon, Oct 5, 2015 at 1:06 PM, Adrian Tanase wrote:
> Great article, especially the use of a custom partitioner.
>
> Also, sorting by multiple fields by creating a tuple out of them is an
> awesome,
ow it could affect performance.
>
> Used correctly it should improve performance as you can better control
> placement of data and avoid shuffling…
>
> -adrian
>
> From: swetha kasireddy
> Date: Monday, October 26, 2015 at 6:56 AM
> To: Adrian Tanase
> Cc: Bill Bejeck, "us
me other scheme other than hash-based) then you need to
> implement a custom partitioner. It can be used to improve data skews, etc.
> which ultimately improves performance.
>
> On Tue, Oct 27, 2015 at 1:20 PM, swetha <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>
unt memory allocated for shuffles by changing
> the configuration spark.shuffle.memoryFraction . More fraction would cause
> less spilling.
>
>
> On Tue, Oct 27, 2015 at 6:35 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> So, Wouldn't
I am using the following:
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>
On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo...@everstring.com>
wrote:
> Which Spark version used?
>
> It was fixed in Parquet-1.7x, so Spark-1.5.x will be work.
>
>
>
>
> > O
heckpoint directory is a good way to restart
> the streaming job, you should stop the spark context or at the very least
> kill the driver process, then restart.
>
> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
&g
t;c...@koeninger.org> wrote:
> Without knowing more about what's being stored in your checkpoint
> directory / what the log output is, it's hard to say. But either way, just
> deleting the checkpoint directory probably isn't sufficient to restart the
> job...
>
> On Mon, Nov 9,
Other than setting the following.
sparkConf.set("spark.streaming.unpersist", "true")
sparkConf.set("spark.cleaner.ttl", "7200s")
On Wed, Nov 4, 2015 at 5:03 PM, swetha <swethakasire...@gmail.com> wrote:
> Hi,
>
> How to unpersist a DStream
It's just that, in the same thread, for a particular RDD, I need to uncache it
every 2 minutes to clear out the data that is present in a Map inside it.
On Wed, Nov 4, 2015 at 11:54 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
> Hi Swetha,
>
> Would you mind elaborating your
test/sql-programming-guide.html#parquet-files>
> .
>
> On Thu, Nov 5, 2015 at 12:09 AM, swetha <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> What is the efficient approach to save an RDD as a file in HDFS and
>> retrieve
>> it back? I was thinkin
> On Fri, Nov 6, 2015 at 3:21 PM, swetha <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> What is the efficient way to join two RDDs? Would converting both the RDDs
>> to IndexedRDDs be of any help?
>>
>> Thanks,
>> Swetha
>>
>>
>
't be a problem imho.
> in general hdfs is pretty fast, s3 is less so
> the issue with storing data is that you will lose your partitioner (even
> though the rdd has it) at loading moment. There is a PR that tries to solve this.
>
>
> On 5 November 2015 at 01:09, swetha <sweth
Hi,
Is IndexedRDD released yet?
Thanks,
Swetha
On Sun, Nov 1, 2015 at 1:21 AM, Gylfi <gy...@berkeley.edu> wrote:
> Hi.
>
> You may want to look into Indexed RDDs
> https://github.com/amplab/spark-indexedrdd
>
> Regards,
> Gylfi.
>
>
>
>
>
>
t[T]])
sc.newAPIHadoopFile(
parquetFile,
classOf[ParquetInputFormat[T]],
classOf[Void],
tag.runtimeClass.asInstanceOf[Class[T]],
jobConf)
.map(_._2.asInstanceOf[T])
}
On Thu, Nov 5, 2015 at 2:14 PM, swetha kasireddy <swethakasire...@gmail.com>
wrote:
> No scala. Sup
ng from java - toJavaRDD
> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>*
> ()
>
> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com>
> wrote:
>
>> How to convert a parquet file that is saved
I read about IndexedRDD. Is a join between an IndexedRDD and another RDD that
is not an IndexedRDD efficient?
On Mon, Nov 2, 2015 at 9:56 PM, Deenar Toraskar <deenar.toras...@gmail.com>
wrote:
> Swetha
>
> Currently IndexedRDD is an external library and not part of Spark Core
an IndexedRDD for a certain set of data
and then get those keys that are present in the IndexedRDD but not present
in some other RDD.
How would an IndexedRDD support such a use case in an efficient manner?
Thanks,
Swetha
On Wed, Jul 15, 2015 at 2:46 AM, Jem Tucker <jem.tuc...@gmail.com>
ncache the
> previous one, and cache a new one.
>
> TD
>
> On Fri, Oct 16, 2015 at 12:02 PM, swetha <swethakasire...@gmail.com>
> wrote:
>
>> Hi,
>>
>> How to put a changing object in Cache for ever in Streaming. I know that
>> we
>> can do
e manually screwed up a topic, or ... ?
>
> If you really want to just blindly "recover" from this situation (even
> though something is probably wrong with your data), the most
> straightforward thing to do is monitor and restart your job.
>
>
>
>
> On W
I see this error locally.
On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das <t...@databricks.com> wrote:
> Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster?
>
> On Tue, Nov 17, 2015 at 5:34 PM, swetha <swethakasire...@gmail.com> wrote:
>
emp files. They are not
> necessary for checkpointing and only stored in your local temp directory.
> They will be stored in "/tmp" by default. You can use `spark.local.dir` to
> set the path if you find your "/tmp" doesn't have enough space.
>
> Best Regards,
> Shixiong Zhu
>
Hi Cody,
How to look at Option 2 (see the following)? Which portion of the code in
Spark Kafka Direct should we look at to handle this issue specific to our
requirements?
2. Catch that exception and somehow force things to "reset" for that
partition. And how would it handle the offsets already calculated
at 3:40 PM, Cody Koeninger <c...@koeninger.org> wrote:
> KafkaRDD.scala , handleFetchErr
>
> On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> How to look at Option 2(see the following)? Which
Following is the Option 2 that I was talking about:
2. Catch that exception and somehow force things to "reset" for that
partition. And how would it handle the offsets already calculated in the
backlog (if there is one)?
On Tue, Dec 1, 2015 at 1:39 PM, swetha kasireddy <swethakasire
Any documentation/sample code on how to use Ganglia with Spark?
On Sat, Dec 5, 2015 at 10:29 PM, manasdebashiskar
wrote:
> spark has capability to report to ganglia, graphite or jmx.
> If none of that works for you you can register your own spark extra
> listener
> that
Hi,
How to verify whether the GangliaSink directory got created?
Thanks,
Swetha
On Mon, Dec 15, 2014 at 11:29 AM, danilopds <danilob...@gmail.com> wrote:
> Thanks tsingfu,
>
> I used this configuration based in your post: (with ganglia unicast mode)
> # Enable GangliaSink
g Ganglia? What is the
> command for the same?
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-build-Spark-with-Ganglia-to-enable-monitoring-using-Ganglia-tp25625.html
> Sent from the Apache Sp
OK. What should the table be? Suppose I have a bunch of parquet files, do I
just specify the directory as the table?
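For reference, what I was planning to try (the path and names are hypothetical): read
the Parquet directory into a DataFrame and register it as a temp table, then query it:

val df = hiveContext.read.parquet("hdfs:///warehouse/users_parquet")
df.registerTempTable("users")
val one = hiveContext.sql("SELECT * FROM users WHERE id = '12345'")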
On Fri, Jan 1, 2016 at 11:32 PM, UMESH CHAUDHARY
wrote:
> Ok, so whats wrong in using :
>
> var df=HiveContext.sql("Select * from table where id = ")
>
next stage/exits. Basically it happens when it has
>> mapPartition/foreachPartition in a stage. Any idea as to why this is
>> happening?
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-lis
; I am submitting my Spark job with supervise option as shown below. When I
>> kill the driver and the app from UI, the driver does not restart
>> automatically. This is in a cluster mode. Any suggestion on how to make
>> Automatic Driver Restart work would be of great help.
>
code.
>
>
>
>
> On Wed, Nov 25, 2015 at 12:57 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> I am killing my Streaming job using the UI. What error code does the UI
>> provide if the job is killed from there?
>>
>> On Wed, Nov 25, 2015 at 11:
:31 AM, Cody Koeninger <c...@koeninger.org> wrote:
> No, the direct stream only communicates with Kafka brokers, not Zookeeper
> directly. It asks the leader for each topicpartition what the highest
> available offsets are, using the Kafka offset api.
>
> On Mon, Nov 23, 201
re the partitions it's failing for all on the same leader?
> Have there been any leader rebalances?
> Do you have enough log retention?
> If you log the offset for each message as it's processed, when do you see
> the problem?
>
> On Tue, Nov 24, 2015 at 10:28 AM, swetha kasired
s? Can
>> someone manually delete folders from the checkpoint folder to help the job
>> recover? E.g. Go 2 steps back, hoping that kafka has those offsets.
>>
>> -adrian
>>
>> From: swetha kasireddy
>> Date: Monday, November 9, 2015 at 10:40 PM
>> To: Cody Koeni
your situation.
>
> The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can
> try adjusting that to get a longer sleep before retrying the task.
>
> On Mon, Nov 30, 2015 at 1:50 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
&g
Looks like I can use mapPartitions but can it be done using
foreachPartition?
On Tue, Nov 17, 2015 at 11:51 PM, swetha <swethakasire...@gmail.com> wrote:
> Hi,
>
> How to return an RDD of key/value pairs from an RDD that has
> foreachPartition applied. I have my code something
It works fine after some changes.
-Thanks,
Swetha
On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das <t...@databricks.com> wrote:
> Can you verify that the cluster is running the correct version of Spark.
> 1.5.2.
>
> On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy <
> s
That was actually an issue with our Mesos.
On Wed, Nov 18, 2015 at 5:29 PM, Tathagata Das <t...@databricks.com> wrote:
> If possible, could you give us the root cause and solution for future
> readers of this thread.
>
> On Wed, Nov 18, 2015 at 6:37 AM, swetha kasiredd
TD's comment at the end.
>
> Cheers
>
> On Wed, Nov 18, 2015 at 7:28 PM, swetha <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> We have a lot of temp files that gets created due to shuffles caused by
>> group by. How to clear the files tha
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
...
}
On Mon, Nov 23, 2015 at 6:31 PM, swetha kasireddy <swethakasire...@gmail.com
> wrote:
> Also, does Kafka direct query the offsets from the zookeeper directly?
> From where does it get the offset
led, the kafka leader
> reported the ending offset was 221572238, but during processing, kafka
> stopped returning messages before reaching that ending offset.
>
> That probably means something got screwed up with Kafka - e.g. you lost a
> leader and lost messages in the proces
you mean by kafka consumer reporting?
>
> I'd log the offsets in your spark job and try running
>
> kafka-simple-consumer-shell.sh --partition $yourbadpartition
> --print-offsets
>
> at the same time your spark job is running
>
> On Mon, Nov 23, 2015 at 7:37 PM, swet
Also, does Kafka Direct query the offsets from ZooKeeper directly? From
where does it get the offsets? There is data at those offsets, but somehow
Kafka Direct does not seem to pick it up.
On Mon, Nov 23, 2015 at 6:18 PM, swetha kasireddy <swethakasire...@gmail.com
> wrote:
> I mea
No, I am reading the data from HDFS, transforming it, registering the data
in a temp table using registerTempTable, and then doing an insert overwrite
using Spark SQL's hiveContext.
On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh
wrote:
> how are you doing the insert?
400 cores are assigned to this job.
On Thu, Jun 9, 2016 at 1:16 PM, Stephen Boesch wrote:
> How many workers (/cpu cores) are assigned to this job?
>
> 2016-06-09 13:01 GMT-07:00 SRK :
>
>> Hi,
>>
>> How to insert data into 2000
erRecord,
ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner
""".stripMargin)
On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> Hi Bijay,
>
> If I am hitting this issue,
> https://issues.apache.or
Hi Bijay,
If I am hitting this issue,
https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done? Is
upgrading to a higher version of Hive the only solution?
Thanks!
On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> Hi,
>
e.
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn *
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
Hi Bijay,
This approach might not work for me as I have to do partial
inserts/overwrites in a given table and data_frame.write.partitionBy will
overwrite the entire table.
Thanks,
Swetha
On Mon, Jun 13, 2016 at 9:25 PM, Bijay Pathak <bijay.pat...@cloudwick.com>
wrote:
> Hi Swetha
Hi Mich,
No, I have not tried that. My requirement is to insert the data from an hourly
Spark batch job. How is it different from inserting with the Hive CLI or
Beeline?
Thanks,
Swetha
On Tue, Jun 14, 2016 at 10:44 AM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:
> Hi Swetha,
>
sampleMap is populated from inside a method that is getting called from
updateStateByKey
On Thu, Jun 23, 2016 at 1:13 PM, Ted Yu wrote:
> Can you illustrate how sampleMap is populated ?
>
> Thanks
>
> On Thu, Jun 23, 2016 at 12:34 PM, SRK wrote:
uding
> the --supervise option?
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Driver-not-able-to-restart-the-job-automatically-after-the-application-of-Streaming-with-Kafka-Direcn-tp26155.h
Hi,
I want to edit/delete a message posted to the Spark User List. How do I do that?
Thanks!
It seems to be failing when I do something like the following in both
sqlContext and hiveContext:
sqlContext.sql("SELECT ssd.savedDate from saveSessionDatesRecs ssd
where ssd.partitioner in (SELECT sr1.partitioner from
sparkSessionRecords1 sr1)")
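I am also wondering whether rewriting the IN (subquery) as a LEFT SEMI JOIN would
work, since my understanding is that Spark SQL did not support IN subqueries before
2.0; a sketch of what I mean:

val result = sqlContext.sql(
  """SELECT ssd.savedDate
    |FROM saveSessionDatesRecs ssd
    |LEFT SEMI JOIN sparkSessionRecords1 sr1
    |  ON ssd.partitioner = sr1.partitioner""".stripMargin)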
On Tue, Feb 23, 2016 at 5:57 PM, swetha
These tables are stored in HDFS as Parquet. Can sqlContext be used for the
subqueries?
On Tue, Feb 23, 2016 at 5:31 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> Assuming these are all in Hive, you can either use spark-sql or
> spark-shell.
>
> HiveContext has
OK. Would it only query for the records that I want in Hive as per the filter,
or just load the entire table? My user table will have millions of records,
and I do not want to cause OOM errors by loading the entire table into memory.
On Mon, Feb 15, 2016 at 12:51 AM, Mich Talebzadeh
ENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ) rs
>
> LIMIT 10
>
>
>
> [1998-01,Direct Sales,1823005210]
>
> [1998-01,Internet,248172522]
>
> [1998-01,Partners,474646900]
>
> [1998-02,Direct Sales,1819659036]
>
> [1998-02,Internet,298586496]
>
> [1998
How to use a custom partitioner hashed by userId inside saveAsTable using a
dataframe?
On Mon, Feb 15, 2016 at 11:24 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> How about saving the dataframe as a table partitioned by userId? My User
> records have userId, number of ses
a saveAsTable in a dataframe.
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-d
of
a number of small files and also to be able to scan faster.
Something like ...df.write.format("parquet").partitionBy("userIdHash",
"userId").mode(SaveMode.Append).save("userRecords");
On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <swethakasire