Valid functions can be written for reduce and merge when the zero is null,
so not being able to provide null as the initial value is somewhat
troublesome.
I guess the proper way to do this is to use Option and have None be the
zero, which is what I assumed you did?
Unfortunately last time I
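(A hedged sketch, not from the thread: the Option-as-zero pattern with Spark 2.0's Aggregator. The max logic and the names here are illustrative.)

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// None stands in for the "null" zero; reduce/merge handle the empty case explicitly
object MaxAgg extends Aggregator[Long, Option[Long], Long] {
  def zero: Option[Long] = None
  def reduce(b: Option[Long], a: Long): Option[Long] =
    Some(b.fold(a)(math.max(_, a)))
  def merge(b1: Option[Long], b2: Option[Long]): Option[Long] = (b1, b2) match {
    case (Some(x), Some(y)) => Some(math.max(x, y))
    case _                  => b1.orElse(b2)
  }
  def finish(r: Option[Long]): Long = r.getOrElse(Long.MinValue)
  def bufferEncoder: Encoder[Option[Long]] = Encoders.kryo[Option[Long]]
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}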
I found the JIRA for the issue. Will there be a fix in the future, or no fix?
https://issues.apache.org/jira/browse/SPARK-6221
Hi Neelesh,
I told you in my emails it's not a Spark-Scala application; I am working on just
Spark SQL.
I am launching the spark-sql shell and running my Hive code inside the Spark SQL shell.
The Spark SQL shell accepts functions which relate to Spark SQL; it doesn't accept
functions like coalesce, which is
Hi Sri,
Thanks for the question.
You can simply start by doing this in the initial stage:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val customerList = sqlContext.read.json(args(0)).coalesce(20) // using a JSON example here
where the argument is the path to the file(s). This will reduce the
partitions.
Hi All,
I am running Hive in spark-sql in YARN client mode; the SQL is pretty simple:
load dynamic partitions into a target Parquet table.
I used Hive configuration parameters such as (set
hive.merge.smallfiles.avgsize=25600; set
hive.merge.size.per.task=256000;), which usually merge small
Hi,
In order to force a broadcast hash join, we can set
the spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce a
shuffle hash join in Spark SQL?
Thanks,
Lalitha
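(A hedged aside, not from the thread: one common way to keep the planner from
choosing a broadcast join is to disable the threshold entirely, so it falls back
to a shuffle-based join; -1 is the documented "off" value.)

sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")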
I'm having the same difficulty porting
https://github.com/metamx/druid-spark-batch/tree/spark2 over to Spark 2.x,
where I have to go track down who is pulling in bad Jackson versions.
On Fri, Jul 1, 2016 at 11:59 AM Sean Owen wrote:
> Are you just asking why you can't use
Try using the coalesce function to repartition to the desired number of
partition files; to merge already-written output files, use Hive and INSERT
OVERWRITE TABLE with the options below.
set hive.merge.smallfiles.avgsize=256;
set hive.merge.size.per.task=256;
set
I am not sure, but you can use the coalesce function to reduce the number of
output files.
Thanks
Sri
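(A hedged sketch of what that could look like; the DataFrame name, partition
count, and path are illustrative.)

df.coalesce(10).write.parquet("/tmp/merged-output") // at most 10 part-files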
I guess you extended some InputFormat to provide locality information.
Can you share a code snippet?
Which non-distributed file system are you using?
Thanks
On Fri, Jul 1, 2016 at 2:46 PM, Raajen wrote:
> I would like to use Spark on a non-distributed file system
I would like to use Spark on a non-distributed file system but am having
trouble getting the driver to assign tasks to the workers that are local to
the files. I have extended InputSplits to create my own version of
FileSplits, so that each worker gets a bit more information than the default
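(A hedged sketch, not the poster's code: a Hadoop FileSplit already carries the
hosts you construct it with, and Spark's preferred-locations logic reads them
back through InputSplit.getLocations; hostsFor is a hypothetical lookup of where
each file lives.)

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.FileSplit

// pin a file's split to the worker(s) that hold the file locally
def splitFor(path: Path, length: Long, hosts: Array[String]): InputSplit =
  new FileSplit(path, 0, length, hosts)

// e.g. splitFor(new Path("file:///data/part-0"), len, hostsFor("/data/part-0"))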
Thanks for pointing that out, Koert!
I understand now why zero() and not init(a: IN), though I still don't see a
good reason to skip the aggregation if zero returns null.
If the user did that, it's on him to take care of null cases in reduce/merge,
but it opens up the possibility to use the input to
What about yarn-cluster mode?
2016-07-01 11:24 GMT-07:00 Egor Pahomov :
> Separate bad users with bad queries from good users with good queries. Spark
> does not provide any scope separation out of the box.
>
> 2016-07-01 11:12 GMT-07:00 Jeff Zhang :
>
>> I
Hi Nick,
Thanks for the answer. Do you think an implementation like the one in this
article is infeasible in production for, say, hundreds of queries per
minute?
https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2.
The article uses Flask to
(The more core JPMML libs are Apache 2; OpenScoring is AGPL. We use
JPMML in Spark and couldn't otherwise because the Affero license is
not Apache compatible.)
On Fri, Jul 1, 2016 at 8:16 PM, Nick Pentreath wrote:
> I believe open-scoring is one of the well-known PMML
I believe open-scoring is one of the well-known PMML serving frameworks in
Java land (https://github.com/jpmml/openscoring). One can also use the raw
https://github.com/jpmml/jpmml-evaluator for embedding in apps.
(Note the license on both of these is AGPL - the older version of JPMML
used to be
Are you just asking why you can't use 2.5.3 in your app? Because
Jackson isn't shaded, which is sort of the bad news. But ideally, just use
2.6.5 too. I don't know where 2.6.1 is coming from, but Spark
doesn't use it.
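(A hedged sbt sketch of "just use 2.6.5 too": pin the Jackson artifacts to the
version Spark ships. The module list is illustrative; adjust it to what your
build actually pulls in.)

dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-core" % "2.6.5",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.5",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.5"
)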
On Fri, Jul 1, 2016 at 5:48 PM, wrote:
> In my project I
Separate bad users with bad queries from good users with good queries. Spark
does not provide any scope separation out of the box.
2016-07-01 11:12 GMT-07:00 Jeff Zhang :
> I think so. Any reason you want to deploy multiple thrift servers on one
> machine?
>
> On Fri, Jul 1, 2016 at
Hi Nick,
Thanks a lot for the exhaustive and prompt response! (In the meantime
I watched a video about PMML to get a better understanding of the
topic).
What are the tools that could "consume" PMML exports (after running
JPMML)? What tools would be the endpoint to deliver low-latency
predictions
I think so. Any reason you want to deploy multiple thrift servers on one
machine?
On Fri, Jul 1, 2016 at 10:59 AM, Egor Pahomov
wrote:
> Takeshi, of course I used a different HIVE_SERVER2_THRIFT_PORT
> Jeff, thanks, I'll try, but from your answer I'm getting the
Takeshi, of course I used a different HIVE_SERVER2_THRIFT_PORT.
Jeff, thanks, I'll try, but from your answer I'm getting the feeling
that I'm trying some very rare case?
2016-07-01 10:54 GMT-07:00 Jeff Zhang :
> This is not a bug, because these 2 processes use the same
This is not a bug: these two processes use the same SPARK_PID_DIR,
which is /tmp by default. Although you can resolve this by using
different SPARK_PID_DIR values, I suspect you would still have other issues like
port conflicts. I would suggest you deploy one Spark thrift server per
machine for
As said earlier, how about changing a bound port by using env
`HIVE_SERVER2_THRIFT_PORT`?
// maropu
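(A hedged sketch of launching a second instance with both knobs changed; the
port and directory values are illustrative.)

HIVE_SERVER2_THRIFT_PORT=10001 SPARK_PID_DIR=/tmp/spark-pid-2 \
  sbin/start-thriftserver.sh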
On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov
wrote:
> I get
>
> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
> process 28989. Stop it first."
>
I get
"org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
process 28989. Stop it first."
Is it a bug?
2016-07-01 10:10 GMT-07:00 Jeff Zhang :
> I don't think the one-instance-per-machine limit is true. As long as you
> resolve the conflict issues such as port
Generally there are 2 ways to use a trained pipeline model - (offline)
batch scoring, and real-time online scoring.
For batch (or even "mini-batch", e.g. on Spark Streaming data), then yes,
certainly loading the model back into Spark and feeding new data through the
pipeline for prediction works just
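(A hedged sketch of that batch-scoring path; the model path and DataFrame name
are made up.)

import org.apache.spark.ml.PipelineModel

// reload a previously saved pipeline and score new data in batch
val model = PipelineModel.load("/models/my-pipeline")
val scored = model.transform(newData) // newData: DataFrame with the training schema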
Hi Rishabh,
I've just today had a similar conversation about how to do an ML Pipeline
deployment and couldn't really answer this question and more, because I
don't really understand the use case.
What would you expect from ML Pipeline model deployment? You can save
your model to a file by
You mean `spark.sql.shuffle.partitions`?
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
// maropu
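(A hedged one-liner; the value 64 is illustrative.)

sqlContext.setConf("spark.sql.shuffle.partitions", "64")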
On Fri, Jul 1, 2016 at 8:42 AM, emiretsk wrote:
> It seems like threads are created by SQLExecution.withExecutionId, which is
I don't think the one-instance-per-machine limit is true. As long as you resolve
the conflict issues such as port conflicts, the PID file, the log file, etc., you
can run multiple instances of the Spark thrift server.
On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov wrote:
> Hi, I'm using
In my project I found a library that brings in Jackson core 2.6.5, and it is
used in conjunction with the Jackson Scala module 2.5.3 requested by
the Spark 2.0.0 preview. At runtime this is the cause of an exception.
For now I have excluded 2.6.5 using sbt, but it could be dangerous for the other
Hello, I've got a Spark stand-alone cluster using EC2 instances. I can submit
jobs using "--deploy-mode client", however using "--deploy-mode cluster" is
proving to be a challenge. I've tried this:
spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
Hi, I'm using the Spark Thrift JDBC server and 2 limitations really bother
me -
1) One instance per machine
2) YARN client only (not yarn-cluster)
Are there any architectural reasons for such limitations? About yarn-client
I might understand in theory - the master is the same process as the server, so
Hi Bryan.
Thanks for your continued help.
Here is the code shown in a Jupyter notebook. I figured this was easier
than cutting and pasting the code into an email. If you would like me to
send you the code in a different format, let me know. The necessary data is
all downloaded within the
It seems like threads are created by SQLExecution.withExecutionId, which is
called inside BroadcastExchangeExec.scala.
When does the plan executor execute a BroadcastExchange, and is there a way
to control the number of threads? We have a job that writes DataFrames to an
external DB, and it seems
Hi Rishabh,
My colleague, Richard Garris from Databricks, actually just gave a talk last
night at the Bay Area Spark Meetup on ML model deployment. The slides and
recording should be up soon, you should be able to find a link here:
http://www.meetup.com/spark-users/events/231574440/
Thanks,
Hi,
In general, if your ORC table is not bucketed, it is not going to do much.
The idea is that using predicate pushdown you will only get the data from
the partition concerned and avoid expensive table scans!
ORC provides what is known as a storage index at file, stripe and row-group
levels (default
Hi,
Developing a custom receiver, up to today I used Spark version "2.0.0-SNAPSHOT"
and Scala version 2.11.7.
With these versions all tests work fine.
I have just switched to "2.0.0-preview" as the Spark version, but now I have the
following error:
An exception or error caused a run to abort: class
Hi Rishabh,
I have a similar use-case and have struggled to find the best solution. As
I understand it, 1.6 provides pipeline persistence in Scala, and that will
be expanded in 2.x. This project https://github.com/jpmml/jpmml-sparkml
claims to support about a dozen pipeline transformers, and 6 or
On 7/1/2016 6:42 AM, Akhil Das wrote:
case class Holder(str: String, js: JsValue)
Hello,
Thanks!
I tried that before posting the question to the list, but I keep getting
an error such as this even after the map() operation to convert
(String, JsValue) -> Holder and then toDF().
I am simply
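(A hedged guess at the cause, since the error itself is cut off above: toDF()
needs a Spark SQL schema for every case-class field, and a raw JsValue isn't a
type it knows how to map. Storing the JSON as a String sidesteps that, with
import sqlContext.implicits._ in scope as in Akhil's snippet.)

case class Holder(str: String, js: String) // String instead of JsValue
yourRDD.map(x => Holder(x._1, x._2.toString)).toDF()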
Hi All,
I am looking for ways to deploy an ML Pipeline model in production.
Spark has already proved to be one of the best frameworks for model
training and creation, but once the ML pipeline model is ready, how can I
deploy it outside the Spark context?
The MLlib model has a toPMML method, but today
http://why-not-learn-something.blogspot.in/2016/05/apache-spark-streaming-how-to-do.html
On Fri, Jul 1, 2016 at 1:42 PM, manoop wrote:
> I have a Spark job and I just want to stop it on some condition. Once the
> condition is met, I am calling JavaStreamingContext.stop(),
Can you try the Cassandra connector 1.5? It is also compatible with Spark
1.6 according to their documentation
https://github.com/datastax/spark-cassandra-connector#version-compatibility
You can also crosspost it over here
You can use this: https://github.com/wurstmeister/kafka-docker to spin up a
Kafka cluster and then point your Spark Streaming job at it to consume from it.
On Fri, Jul 1, 2016 at 1:19 AM, SRK wrote:
> Hi,
>
> I need to do integration tests using Spark Streaming. My idea is
Hi Akhil
I am using:
Cassandra: 3.0.5
Spark: 1.6.1
Scala 2.10
Spark-cassandra connector: 1.6.0
From: Akhil Das [mailto:ak...@hacked.work]
Sent: 01 July 2016 11:38
To: Joaquin Alzola
Cc: user@spark.apache.org
Subject: Re: Remote RPC client disassociated
This looks
Something like this?
import sqlContext.implicits._
case class Holder(str: String, js: JsValue)
yourRDD.map(x => Holder(x._1, x._2)).toDF()
On Fri, Jul 1, 2016 at 3:36 AM, Dood@ODDO wrote:
> Hello,
>
> I have an RDD[(String,JsValue)] that I want to convert into a
This looks like a version conflict; which version of Spark are you using?
The Cassandra connector you are using is for Scala 2.10.x and Spark
1.6.
On Thu, Jun 30, 2016 at 6:34 PM, Joaquin Alzola
wrote:
> HI List,
>
>
>
> I am launching this spark-submit job:
>
Hi,
Using sparkHiveContext, we read all rows where age was between 0 and
100, even though we requested rows where age was less than 15. Such full
table scanning is an expensive operation.
ORC avoids this type of overhead by using predicate push-down with three
levels of built-in indexes
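(A hedged aside: in Spark SQL this ORC pushdown is controlled by a config flag,
off by default; the query below is illustrative.)

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("SELECT * FROM people WHERE age < 15")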
Let us take this for a ride.
Simple code: it reads from an existing table of 22 million rows stored as ORC
and saves it as Parquet.
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("use oraclehadoop")
val s = HiveContext.table("sales2")
val sorted =
I have a Spark job and I just want to stop it on some condition. Once the
condition is met, I am calling JavaStreamingContext.stop(), but it just
hangs. It does not move on to the next line, which is just a debug line. I
expect it to come out.
I already tried different variants of stop, that is,
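(A hedged guess, not a confirmed diagnosis: if stop() is invoked from inside a
batch's own processing - e.g. within foreachRDD - the stop can end up waiting on
the very batch that called it. Stopping from a separate thread is one common
workaround; jssc is the assumed name of your JavaStreamingContext.)

new Thread(new Runnable {
  override def run(): Unit = jssc.stop(true, false) // stopSparkContext, not graceful
}).start()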
Hey guys, I am using Apache Spark 1.5.2, and I am running SQL queries using
the SQLContext; when I run an insert query it saves the data in
partitions (as expected).
I am just curious and want to know how these partitions are made and how the
permissions on these partitions are assigned. Can