I had downloaded the prebuilt package labeled "Spark 2.1.1 prebuilt with
Hadoop 2.7 or later" from the direct download link on spark.apache.org.
However, I am seeing compatibility errors running against a deployed HDFS
2.7.3. (See my earlier message about Flume DStream producing 0 records
after
I have already seen one example where data is generated using Spark; no reason
to think it's a bad idea, as far as I know.
You can check the code here; I'm not very sure, but I think there is something
there which generates data for the TPCDS benchmark and you can provide how much
data you want in
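As a sketch of the general idea (not the TPC-DS kit itself; table and column names here are made up for illustration), parallel data generation in Spark can be as simple as distributing a row range and mapping it to synthetic records:

```scala
import org.apache.spark.sql.SparkSession

object GenerateData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("datagen").getOrCreate()
    import spark.implicits._

    // Scale the generated volume by changing numRows; Spark distributes
    // the range across partitions, so generation runs in parallel.
    val numRows = 1000000L
    val df = spark.range(numRows)
      .map(i => (i, s"customer_$i", (i % 100).toDouble))
      .toDF("id", "name", "amount")

    df.write.parquet("/tmp/generated")
    spark.stop()
  }
}
```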
Thanks & Best Regards,
Engr. Palash Gupta
Consultant, OSS/CEM/Big Data
Skype: palash2494
https://www.linkedin.com/in/enggpalashgupta
You should make HBase a data source (it seems we already have an HBase
connector?), create a DataFrame from HBase, and do the join in Spark SQL.
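A rough sketch of that approach, assuming an HBase connector (e.g. the shc connector) is on the classpath; the catalog JSON, table names, and columns below are all illustrative:

```scala
// Sketch only: the catalog maps HBase column families/qualifiers
// to DataFrame columns (shc-style catalog format).
val catalog = s"""{
  "table": {"namespace": "default", "name": "users"},
  "rowkey": "key",
  "columns": {
    "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
    "name": {"cf": "info",   "col": "name", "type": "string"}
  }
}"""

val hbaseDf = spark.read
  .options(Map("catalog" -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Register the HBase-backed DataFrame and join it in Spark SQL.
hbaseDf.createOrReplaceTempView("hbase_users")
val joined = spark.sql(
  "SELECT h.id, h.name, e.event FROM hbase_users h JOIN events e ON h.id = e.user_id")
```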
> On 21 Jun 2017, at 10:17 AM, sunerhan1...@sina.com wrote:
>
> Hello,
> My scenario is like this:
> 1. val df = hivecontext/carboncontext.sql("sql")
>
After investigation, it looks like my Spark 2.1.1 jars got corrupted during
download - all good now... ;)
> On Jun 20, 2017, at 4:14 PM, Jean Georges Perrin wrote:
>
> Hey all,
>
> I was giving a run to 2.1.1 and got an error on one of my test programs:
>
> package
OK, some more info about this issue, to see if someone can shed some light on
what could be going on. I turned on debug logging for
org.apache.spark.streaming.scheduler in the driver process and this is what
gets thrown in the logs and keeps throwing it even after the downed HDFS
node is restarted.
Never mind!
I had a space at the end of my data which was not showing up in manual testing.
thanks
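For anyone hitting the same thing: a trailing space makes an anchored regex fail even though the value looks identical when printed. A minimal illustration in plain Scala (the pattern is hypothetical):

```scala
import scala.util.matching.Regex

val filter: Regex = "https?://example\\.com/\\w+".r

val clean = "http://example.com/page"
val dirty = "http://example.com/page "  // trailing space, invisible when printed

// matches() anchors the pattern to the whole string, so the space matters.
println(filter.pattern.matcher(clean).matches())  // true
println(filter.pattern.matcher(dirty).matches())  // false

// Trimming the input before matching avoids the surprise.
println(filter.pattern.matcher(dirty.trim).matches())  // true
```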
From: jeff saremi
Sent: Tuesday, June 20, 2017 2:48:06 PM
To: user@spark.apache.org
Subject: Bizarre diff in behavior between Scala REPL
I have this function which does regex matching in Scala. When I test it in the
REPL, I get the expected results.
When I use it as a UDF in Spark SQL, I get completely incorrect results.
Function:
import scala.util.matching.Regex
class UrlFilter(filters: Seq[String]) extends Serializable {
  val regexFilters = filters.map(new Regex(_))
  def accept(url: String): Boolean = regexFilters.exists(_.findFirstIn(url).isDefined)
}
It's in the spark-catalyst_2.11-2.1.1.jar since the logical query plans and
optimization also need to know about types.
On Tue, Jun 20, 2017 at 1:14 PM, Jean Georges Perrin wrote:
> Hey all,
>
> I was giving a run to 2.1.1 and got an error on one of my test programs:
>
> package
Hi,
How do we bootstrap the streaming job with the previous state when we do a
code change and redeploy? We use updateStateByKey to maintain the state and
store session objects and LinkedHashMaps in the checkpoint.
Thanks,
Swetha
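One common pattern for carrying state across a redeploy (since a code change usually invalidates the old checkpoint) is to snapshot the state to durable storage and feed it back in via the initialRDD overload of updateStateByKey. A hedged sketch, with made-up paths and types:

```scala
import org.apache.spark.HashPartitioner

// Periodically snapshot the state DStream (the one returned by
// updateStateByKey) so it survives a code change:
// stateStream.foreachRDD { rdd => rdd.saveAsObjectFile(s"/state/${System.currentTimeMillis}") }

// On restart with new code, seed the state from the last snapshot
// instead of relying on the (now incompatible) checkpoint:
val initialState = ssc.sparkContext.objectFile[(String, Long)]("/state/latest")

val updated = keyedStream.updateStateByKey(
  updateFunc,  // your existing update function
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialRDD = initialState)
```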
Thanks Vadim & Jörn... I will look into those.
jg
> On Jun 20, 2017, at 2:12 PM, Vadim Semenov
> wrote:
>
> You can launch one permanent spark context and then execute your jobs within
> the context. And since they'll be running in the same context, they can
Hey all,
I was giving a run to 2.1.1 and got an error on one of my test programs:
package net.jgp.labs.spark.l000_ingestion;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import
You can launch one permanent spark context and then execute your jobs
within the context. And since they'll be running in the same context, they
can share data easily.
These two projects provide the functionality that you need:
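A minimal sketch of the shared-context idea itself (a single long-lived SparkSession, with "jobs" as method calls exchanging cached DataFrames; all names are illustrative):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SharedContext {
  val spark: SparkSession = SparkSession.builder
    .appName("permanent-context")
    .getOrCreate()
  import spark.implicits._

  // "Job" A and B each produce a result and cache it in the shared context.
  def jobA(): DataFrame = spark.range(100).toDF("a").cache()
  def jobB(): DataFrame = spark.range(100).toDF("b").cache()

  def main(args: Array[String]): Unit = {
    val a = jobA()
    val b = jobB()
    // Job C combines A's and B's results without any intermediate disk dump.
    val c = a.join(b, a("a") === b("b"))
    println(c.count())
    spark.stop()
  }
}
```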
You could express it all in one program; alternatively, use the Ignite in-memory
file system or the Ignite shared RDD (not sure if DataFrame is supported).
> On 20. Jun 2017, at 19:46, Jean Georges Perrin wrote:
>
> Hey,
>
> Here is my need: program A does something on a set of data and
Hi Assaf,
Thanks for the suggestion on checkpointing - I'll need to read up more on
that.
My current implementation seems to be crashing with a GC overhead limit
exceeded error if I'm keeping multiple persist calls for a large number of
files.
Thus, I was also thinking about the constant calls to
Hey,
Here is my need: program A does something on a set of data and produces
results, program B does that on another set, and finally, program C combines
the data of A and B. Of course, the easy way is to dump all on disk after A and
B are done, but I wanted to avoid this.
I was thinking of
BTW, this is running on Spark 2.1.1.
I have been trying to debug this issue, and what I have found till now is
that it is somehow related to the Spark WAL. The directory named
/receivedBlockMetadata seems to stop getting
written to after an HDFS node is killed and restarted. I
have
And we will be having a webinar on July 27 going into some more details. Stay
tuned.
Cheers
Jules
Sent from my iPhone
Pardon the dumb thumb typos :)
> On Jun 20, 2017, at 7:00 AM, Michael Mior wrote:
>
> It's still in the early stages, but check out Deep Learning
It is fine, but you have to design it so that generated rows are written in
large blocks for optimal performance.
The trickiest part of data generation is the conceptual part, such as the
probabilistic distributions, etc.
You also have to check that you use a good random generator; for some cases
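One way to handle the random-generator concern in a distributed setting is to seed each partition deterministically, so runs are reproducible and partitions don't repeat the same random stream. A sketch (partition counts, sample sizes, and the Gaussian choice are illustrative):

```scala
import java.util.Random
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("simgen").getOrCreate()

// Each partition gets its own deterministically seeded generator.
val samples = spark.sparkContext
  .parallelize(0 until 8, 8)
  .mapPartitionsWithIndex { (partId, _) =>
    val rng = new Random(42L + partId)
    // Draw from whatever distribution the simulation needs, e.g. Gaussian:
    Iterator.fill(1000000)(rng.nextGaussian())
  }
```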
Hi
Spark is a data analyzer, but would it be possible to use Spark as a data
generator or simulator?
My simulation can be very large, and I think a parallelized simulation using
Spark (in the cloud) could work.
Is that a good or a bad idea?
Regards
Esa Heikkinen
Hi,
I have seen that Databricks has higher-order functions
(https://docs.databricks.com/_static/notebooks/higher-order-functions.html,
https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html)
which basically allow you to do generic
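For reference, those notebooks use SQL lambdas over array columns, which at the time were Databricks-only and later landed in open-source Spark (2.4+). A small sketch of the syntax, with an illustrative table:

```scala
// Higher-order functions operate on array columns directly in SQL,
// without exploding and re-aggregating.
import spark.implicits._

val df = Seq((1, Seq(1, 2, 3))).toDF("id", "values")
df.createOrReplaceTempView("nested_data")

spark.sql("SELECT id, transform(values, x -> x + 1) AS incremented FROM nested_data").show()
spark.sql("SELECT id, filter(values, x -> x % 2 = 0) AS evens FROM nested_data").show()
```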
It's still in the early stages, but check out Deep Learning Pipelines from
Databricks
https://github.com/databricks/spark-deep-learning
--
Michael Mior
mm...@apache.org
2017-06-20 0:36 GMT-04:00 Gaurav1809 :
> Hi All,
>
> Similar to how we have machine learning library
Correction.
On Tue, Jun 20, 2017 at 5:27 PM, sujeet jog wrote:
> Below is the query; from the physical plan, it looks like the query is the
> same as that of cqlsh:
>
> val query = s"""(select * from model_data
> where TimeStamp > \'$timeStamp+\' and TimeStamp <=
>
Below is the query; from the physical plan, it looks like the query is the
same as that of cqlsh:
val query = s"""(select * from model_data
where TimeStamp > \'$timeStamp+\' and TimeStamp <=
\'$startTS+\'
and MetricID = $metricID)"""
println("Model query" + query)
val df
Hi,
Personally, I would inspect how dates are managed. What does your Spark code
look like? What does the explain say? Does TimeStamp get parsed the same
way?
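Concretely, the kinds of checks meant here could look like this (column names taken from the thread, the rest illustrative):

```scala
import org.apache.spark.sql.functions._

// Look at which predicates actually get pushed down to the source.
df.explain(true)

// Compare how the timestamp column is parsed on the Spark side versus
// the string literal interpolated into the query.
df.select(col("TimeStamp"),
          col("TimeStamp").cast("string"),
          unix_timestamp(col("TimeStamp")))
  .show(5, truncate = false)
```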
Best,
On Tue, Jun 20, 2017 at 12:52 PM, sujeet jog wrote:
> Hello,
>
> I have a table as below
>
> CREATE TABLE
Hello,
I have a table as below
CREATE TABLE analytics_db.ml_forecast_tbl (
"MetricID" int,
"TimeStamp" timestamp,
"ResourceID" timeuuid,
"Value" double,
PRIMARY KEY ("MetricID", "TimeStamp", "ResourceID")
)
select * from ml_forecast_tbl where "MetricID" = 1 and "TimeStamp" >
Hi Edwin,
I have faced a similar issue as well, and this behaviour is very abrupt. I
even created a question on Stack Overflow, but no solution yet.
https://stackoverflow.com/questions/43496205/spark-job-processing-time-increases-to-4s-without-explanation
For us, we sometimes had this constant
Hi all,
https://issues.apache.org/jira/browse/SPARK-19680
Is there any way to patch this issue? I have met the same problem.
2017-06-20
lk_spark
Note that depending on the number of iterations, the query plan for the
dataframe can become long and this can cause slowdowns (or even crashes).
A possible solution would be to checkpoint (or simply save and reload the
dataframe) every once in a while. When reloading from disk, the newly loaded
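A sketch of both options for truncating the plan every few iterations (the step function, interval, and paths are illustrative):

```scala
var df = initialDf
for (i <- 1 to numIterations) {
  df = step(df)  // whatever transformation each iteration applies
  if (i % 10 == 0) {
    // Option 1: checkpoint truncates the lineage in place
    // (requires spark.sparkContext.setCheckpointDir to have been called).
    df = df.checkpoint()

    // Option 2: save and reload, which also resets the plan:
    // df.write.mode("overwrite").parquet(s"/tmp/iter_$i")
    // df = spark.read.parquet(s"/tmp/iter_$i")
  }
}
```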