Hello, I have a question. We are seeing the below exceptions, and at the moment
we are enabling a JVM profiler to look into GC activity on the workers. If you
have any other suggestions, please do let us know. We don't just want to
increase the RPC timeout (from 120 to, say, 600 sec); we want to get to the
reason why the workers time out.
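A minimal sketch of one way to enable GC logging on the executors before resorting to a larger timeout (assuming the 120s value refers to spark.network.timeout; the flags and app name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: surface GC pauses on the executors instead of only raising the timeout.
val conf = new SparkConf()
  .setAppName("gc-profiling")   // illustrative app name
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
// .set("spark.network.timeout", "600s")   // the knob under discussion, deliberately left alone
val sc = new SparkContext(conf)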
I looked back into this today. I made some changes last week to the
application to allow for not only compatibility with Spark 1.5.2, but also
backwards compatibility with Spark 1.4.1 (the version our current deployment
uses). The changes mostly involved changing dependencies from compile to
Apologies in advance if someone has already asked and addressed this
question.
In Spark Streaming, how can I programmatically get the batch statistics
like scheduling delay, total delay and processing time (they are shown in the
Streaming tab of the job UI)? I need such information to raise alerts in some
Hi Hemant, thanks very much. Can we use SnappyData on YARN? My Spark jobs run
in yarn-client mode. Please guide.
On Mon, Feb 8, 2016 at 9:46 AM, Hemant Bhanawat
wrote:
> You may want to have a look at spark druid project already in progress:
>
It's interesting to see what spark dev people will say.
Corey, do you have the presentation available online?
On 8 February 2016 at 05:16, Corey Nolet wrote:
> Charles,
>
> Thank you for chiming in and I'm glad someone else is experiencing this
> too and not just me. I know very
Thanks Luciano, now it looks like I'm the only guy who has this issue. My
options have narrowed down to upgrading my Spark to 1.6.0, to see if this issue
goes away.
—
Cheers,
Todd Leo
On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende wrote:
> I tried in both 1.5.0, 1.6.0 and
I've found the trigger of my issue: if I start my spark-shell or submit via
spark-submit with --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame
content comes out wrong, as I described earlier.
On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu wrote:
>
Hi all,
I have asked this question here on StackOverflow:
http://stackoverflow.com/questions/35222365/spark-sql-hivethriftserver2-get-liststring-from-cassandra-in-squirrelsql
But I'm hoping for more luck from this group. When I write a Java SparkSQL
application to query a
Hi
I have a DataFrame df and I use df.describe() to get the stats values, but I'm
not able to parse and extract all the individual information. Please help.
--
Thanks and Regards
Arun
In Python, concatenating two lists can be done simply using the + operator.
I'm assuming the RDD you're mapping over consists of tuples:
map(lambda x: x[0] + x[1])
Now, using DirectStream I am able to process 2 million messages from a
20-partition topic in a batch interval of 2000ms.
Finally figured out that the Kafka producer from a source system is sending the
same topic name instead of the key in KeyedMessage. It could put messages
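For reference, a minimal sketch of the old Kafka 0.8 producer API showing where the key belongs in KeyedMessage (broker address, topic, key and payload are all illustrative):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
props.put("metadata.broker.list", "localhost:9092")             // illustrative broker
props.put("serializer.class", "kafka.serializer.StringEncoder")
val producer = new Producer[String, String](new ProducerConfig(props))
// KeyedMessage(topic, key, message): the second argument is the partitioning key,
// not a repeat of the topic name.
producer.send(new KeyedMessage[String, String]("myTopic", "someKey", "payload"))
producer.close()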
Hi
I'm using a SQL query to find the percentile value. Are there any predefined
functions for percentile calculation?
--
Thanks and Regards
Arun
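One possible approach, sketched under the assumption that a HiveContext is available (percentile_approx is a Hive UDAF; the table and column names are made up):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc from the spark-shell
// approximate 95th percentile of a numeric column
hiveContext.sql("SELECT percentile_approx(value, 0.95) AS p95 FROM events").show()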
Hi All,
A long-running Spark job on YARN throws the below exception after running
for a few days.
yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row.
org.apache.hadoop.yarn.exceptions.YarnException: *No AMRMToken found* for
user prabhu at
SnappyData's deployment is different from how Spark is deployed. See
http://snappydatainc.github.io/snappydata/deployment/ and
http://snappydatainc.github.io/snappydata/jobs/.
For further questions, you can join us on stackoverflow
http://stackoverflow.com/questions/tagged/snappydata.
Hemant
Hi,
Please use DataFrame#repartition.
On Tue, Feb 9, 2016 at 7:30 AM, Cesar Flores wrote:
>
> I have a data frame which I sort using orderBy function. This operation
> causes my data frame to go to a single partition. After using those
> results, I would like to re-partition to
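A short sketch of what that could look like at the DataFrame level (column name and partition count are illustrative):

// sort, then spread the result back out without dropping to the RDD API
val sorted = df.orderBy("someColumn")
val repartitioned = sorted.repartition(100)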
Finally I figured out the problem and fixed it.
There was some inconsistency between my .ivy2 and .m2 repositories. Spark
resolves the dependencies using metadata in ivy2/cache and does not verify
their real location. That was why Spark resolved jackson-core-asl from the
local-m2-cache. But when Spark tried to
Hi Dhimant,
As I had indicated in my next mail, my problem was due to the disk getting full
with log messages (these were dumped onto the slaves) and did not have
anything to do with the content pushed into S3. So it looks like this error
message is very generic and is thrown for various reasons. You
I guess the problem is:
dummy.df <- withColumn(dataframe,
                       paste0(colnames(cat.column), j),
                       ifelse(column[[1]] == levels(as.factor(unlist(cat.column)))[j], 1, 0))
dataframe <- dummy.df
Once dataframe is re-assigned to reference a new DataFrame in each iteration,
the column variable has to be
+ Spark-Dev
On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph
wrote:
> Hi All,
>
> A long running Spark job on YARN throws below exception after running
> for few days.
>
> yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row.
>
In the "new" ALS intermediate RDDs (including the ratings input RDD after
transforming to block-partitioned ratings) is cached using
intermediateRDDStorageLevel, and you can select the final RDD storage level
(for user and item factors) using finalRDDStorageLevel.
The old MLLIB API now calls the
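A rough sketch of how those storage levels could be set on the builder-style MLlib ALS (ratings, rank and iteration count are placeholders):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

// ratings: RDD[Rating] prepared elsewhere
val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
  .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
  .run(ratings)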
I have a Spark Streaming service where I am processing and detecting
anomalies on the basis of an offline-generated model. I feed data into
this service from a log file, which is streamed using the following command:
tail -f | nc -lk
Here the Spark Streaming service is taking data from
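A minimal sketch of the receiving side, assuming nc is listening on port 9999 (the port was omitted in the command above) and sc already exists:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)   // assumed host/port
lines.foreachRDD(rdd => println(s"received ${rdd.count()} lines"))
ssc.start()
ssc.awaitTermination()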
Hi Arunkumar,
From the Scala documentation, it's recommended to use the agg function for
performing any actual statistics programmatically on your data.
df.describe() is meant only for data exploration.
See Aggregator here:
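For illustration, a small sketch of pulling individual statistics with agg instead of parsing describe() output (the column name is assumed):

import org.apache.spark.sql.functions.{max, mean, min}

val stats = df.agg(mean("value").as("mean"), min("value").as("min"), max("value").as("max"))
stats.show()
val meanValue = stats.first().getDouble(0)   // individual values are now directly accessible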
I sure do! [1] And yes, I'm really hoping they will chime in; otherwise I
may dig a little deeper myself and start posting some JIRA tickets.
[1] http://www.slideshare.net/cjnolet
On Mon, Feb 8, 2016 at 3:02 AM, Igor Berman wrote:
> It's interesting to see what spark dev
Sorry, same expected results with trunk and Kryo serializer
On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu wrote:
> I’ve found the trigger of my issue: if I start my spark-shell or submit
> by spark-submit with --conf
>
Hi All,
How do I change the log level for a running Spark Streaming job? Any help
will be appreciated.
Thanks,
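One small sketch of how this could be done from inside the application itself (assumes Spark 1.4+, where SparkContext#setLogLevel is available, and ssc is the StreamingContext):

// switch the log level at runtime without editing log4j.properties
ssc.sparkContext.setLogLevel("WARN")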
Are there any examples of how to implement the onEnvironmentUpdate method for
a custom listener?
Thanks,
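A bare-bones sketch of a custom listener overriding onEnvironmentUpdate (what it does with the details is purely illustrative):

import org.apache.spark.scheduler.{SparkListener, SparkListenerEnvironmentUpdate}

class EnvironmentUpdateListener extends SparkListener {
  override def onEnvironmentUpdate(update: SparkListenerEnvironmentUpdate): Unit = {
    // environmentDetails groups properties into sections such as "JVM Information"
    update.environmentDetails.getOrElse("JVM Information", Seq.empty).foreach {
      case (key, value) => println(s"$key -> $value")
    }
  }
}
// registration: sc.addSparkListener(new EnvironmentUpdateListener())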
Hello all,
Could someone please help me figure out what's wrong with my query that
I'm running over Parquet tables? The query has the following form:
weird_query = "SELECT a._example.com/aa/1.1/aa_,
b._example.com/bb/1.2/bb_ FROM www$aa@aa a LEFT JOIN www$bb@bb b ON
From within a Spark job you can use a periodic StreamingListener:
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
ssc.addStreamingListener(new PeriodicStatisticsListener(Seconds(60)))
class PeriodicStatisticsListener(timePeriod: Duration) extends StreamingListener {
  private val logger = LoggerFactory.getLogger("Application")
  // completing the truncated override as a sketch: log batch statistics when each batch finishes
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    logger.info(s"scheduling delay: ${info.schedulingDelay}, " +
      s"processing time: ${info.processingDelay}, total delay: ${info.totalDelay}")
  }
}
I am using the Multilayer Perceptron Classifier. In each training instance
there are multiple 1.0 values in the output vector of the Multilayer Perceptron
Classifier. This is necessary. With a small number of training data I am
getting the following error:
*ERROR LBFGS: Failure again! Giving up and returning.
I had similar problems with multi-part uploads. In my case the real error
was something else which was being masked by this issue
https://issues.apache.org/jira/browse/SPARK-6560. In the end this bad
digest exception was a side effect and not the original issue. For me it
was some library version
Spark Summit East is just 10 days away and we are almost sold out! One of
the highlights this year will focus on how Spark is being used across
businesses to solve both big and small data needs. Check out the full
agenda here: https://spark-summit.org/east-2016/schedule/
Use "ApacheList" for 30%
I have a data frame which I sort using orderBy function. This operation
causes my data frame to go to a single partition. After using those
results, I would like to re-partition to a larger number of partitions.
Currently I am just doing:
val rdd = df.rdd.coalesce(100, true) //df is a dataframe
I am storing a model in s3 in this path:
"bucket_name/p1/models/lr/20160204_0410PM/ser" and the structure of the
saved dir looks like this:
1. bucket_name/p1/models/lr/20160204_0410PM/ser/data -> _SUCCESS,
_metadata, _common_metadata
and
When using ALS from mllib, would it be better/recommended to cache the ratings
RDD?
I'm asking because when predicting products for users (for example) it is
recommended to cache product/user matrices.
Thank you,
At least it works for me, though; I temporarily disabled the Kryo serializer
until we upgrade to 1.6.0. Appreciate your update. :)
Luciano Resende wrote on Tue, Feb 9, 2016 at 02:37:
> Sorry, same expected results with trunk and Kryo serializer
>
> On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu