Hi Eugene,
AFAIK, the current implementation of MultilayerPerceptronClassifier has
some scalability problems if the model is very large (such as >10M),
although I think many use cases already fit within that limitation.
Yanbo
2015-12-16 6:00 GMT+08:00 Joseph Bradley :
> Hi
Have you tried specifying the format of your output? Perhaps Parquet is the
default format:
df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path");
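Since the original question was about partitioning json data by a column, here is a minimal Scala sketch of the same idea. It assumes Spark 1.5+ (where partitionBy works with the built-in json source); the path and column name are placeholders:

import org.apache.spark.sql.SaveMode

// Writes one sub-directory per distinct value of "someColumn",
// each containing json part files.
df.write
  .format("json")
  .mode(SaveMode.Overwrite)
  .partitionBy("someColumn")
  .save("/tmp/path")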
On 27 December 2015 at 15:18, Նարեկ Գալստեան wrote:
> Hey all!
> I am willing to partition *json *data by a column
Have you tried declaring RDD[ChildTypeOne] and writing separate functions
for each sub-type ?
Cheers
On Sun, Dec 27, 2015 at 10:08 AM, pkhamutou wrote:
> Hello,
>
> I have a such situation:
>
> abstract class SuperType {...}
> case class ChildTypeOne(x: String) extends
Given a SQLContext (or HiveContext), is it possible to pass parameters into a
query? There are several reasons why this makes sense, including loss of
data type during conversion to string, SQL injection, etc.
But currently, it appears that SQLContext.sql() only takes a single
parameter, which is
Hello,
I have the following situation:
abstract class SuperType {...}
case class ChildTypeOne(x: String) extends SuperType {...}
case class ChildTypeTwo(x: String) extends SuperType {}
then I have:
val rdd1: RDD[SuperType] = sc./*some code*/.map(r => ChildTypeOne(r))
val rdd2: RDD[SuperType] =
Is upgrading to 1.5.x a possibility for you?
Cheers
On Sun, Dec 27, 2015 at 9:28 AM, Նարեկ Գալստեան
wrote:
>
> http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
> I did try but it all was in vain.
> It is also explicitly
http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
I did try, but it was all in vain.
It is also explicitly written in the API docs that it only supports Parquet.
Narek Galstyan
Նարեկ Գալստյան
On 27 December 2015 at 17:52, Igor Berman
But the idea is to keep it as RDD[SuperType], since I have an
implicit conversion to add custom functionality to the RDD, like here:
http://blog.madhukaraphatak.com/extending-spark-api/
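A rough sketch of that pattern, just to illustrate the shape; the implicit class and method names below are made up for illustration, not from the original code:

import org.apache.spark.rdd.RDD

abstract class SuperType
case class ChildTypeOne(x: String) extends SuperType
case class ChildTypeTwo(x: String) extends SuperType

// Implicit class adding custom functionality to any RDD[SuperType],
// in the spirit of the linked "extending spark api" post.
implicit class SuperTypeFunctions(rdd: RDD[SuperType]) {
  def onlyChildTypeOne: RDD[SuperType] = rdd.filter {
    case _: ChildTypeOne => true
    case _ => false
  }
}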
Cheers
On 27 December 2015 at 19:13, Ted Yu wrote:
> Have you tried declaring RDD[ChildTypeOne]
You can do it using Scala string interpolation:
http://docs.scala-lang.org/overviews/core/string-interpolation.html
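A minimal sketch of what that looks like; note that this is plain string substitution, so it does not guard against SQL injection, and the table and column names are illustrative:

// Assumes a registered temp table "people" and an existing sqlContext.
val minAge = 21
val result = sqlContext.sql(s"SELECT name FROM people WHERE age >= $minAge")
result.show()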
On Mon, Dec 28, 2015 at 5:11 AM, Ajaxx wrote:
> Given a SQLContext (or HiveContext) is it possible to pass in parameters
> to a
> query. There are several
Thanks a lot Jim, looking forward to picking up some bugs.
On Mon, Dec 28, 2015 at 8:42 AM, jiml [via Apache Spark User List] <
ml-node+s1001560n25813...@n3.nabble.com> wrote:
> You probably want to start on the dev list:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> I have
Yes.
Sharing the execution flow:
15/12/28 00:19:15 INFO SessionState: No Tez session required at this point.
hive.execution.engine=mr.
15/12/28 00:19:15 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
scala> import
See
http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
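For reference, a hedged sketch of the settings that page describes (check the linked docs for the authoritative list and defaults):

import org.apache.spark.SparkConf

// Dynamic allocation requires the external shuffle service on each NodeManager.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")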
On Mon, Dec 28, 2015 at 2:00 PM, 顾亮亮 wrote:
> Hi all,
>
>
>
> SPARK-3174 (https://issues.apache.org/jira/browse/SPARK-3174) is a useful
> feature to save resources on yarn.
>
> We
Hi,
I am trying to join two dataframes and am able to display the results in the
console after the join. I am saving the joined data in CSV format using the
spark-csv API, but it is saving just the column names, not the data at all.
Below is the sample code for reference:
spark-shell
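The snippet above is truncated in this digest; a rough sketch of the kind of spark-csv write that seems to be involved (joineddf, the output path, and the header option are placeholders):

// Assumes the spark-csv package is on the classpath.
joineddf.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/tmp/joined_output")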
Shouldn't df.select take just the column names, and sqlC.sql take the select
statement?
Therefore perhaps we could use: df.select("COLUMN1", "COLUMN2") and
sqlC.sql("select COLUMN1, COLUMN2 from tablename")
Why would someone want to do a select on a dataframe after registering it
as a table? I
My driver is running OOM with my 4T data set... I don't collect any data to
the driver. All the program does is map - reduce - saveAsTextFile. But the
number of partitions to be shuffled is quite large - 20K+.
The symptom I'm seeing is a timeout when fetching GetMapOutputStatuses from
the driver.
15/12/24 02:04:21
Thank you Silvio for the update.
On Sat, Dec 26, 2015 at 1:14 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:
> Skipped stages result from existing shuffle output of a stage when
> re-running a transformation. The executors will have the output of the
> stage in their local dirs and
Hi,
I am a newbie to Spark and a bit confused about RDDs and DataFrames in Spark.
Can somebody explain, with use cases, which one to use when?
Would really appreciate the clarification.
Thanks,
Divya
The only way to do this for SQL is through the JDBC driver.
However, you can use literal values without lossy/unsafe string conversions
by using the DataFrame API. For example, to filter:
import org.apache.spark.sql.functions._
import sqlContext.implicits._  // needed for the $"columnName" syntax (assumes a sqlContext in scope)
df.filter($"columnName" === lit(value))
On Sun, Dec 27, 2015 at
Hi,
I noticed inconsistent behavior when using rdd.randomSplit when the
source RDD is repartitioned, but only in YARN mode. It works fine in local
mode, though.
*Code:*
val rdd = sc.parallelize(1 to 100)
val rdd2 = rdd.repartition(64)
rdd.partitions.size
rdd2.partitions.size
val
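The snippet is cut off here; a hypothetical continuation, just to make the scenario concrete (the split weights and seed below are illustrative, not from the original post):

val parts = rdd2.randomSplit(Array(0.7, 0.3), seed = 42L)
// Compare the per-split counts across runs and deploy modes.
parts.map(_.count()).foreach(println)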
Scenario1:
val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10
Scenario2:
val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) =>
Not sure what you're trying to do, but the result is correct.
Scenario 2:
Partition 1 ("12", "23")
("","12") => "0"
("0","23") => "1"
Partition 2 ("","345")
("","") => "0"
("0","345") => "1"
Final merge:
("1","1") => "11"
On Mon, Dec 28, 2015 at 7:14 AM, Prem Spark
Can you confirm that file1df("COLUMN2") and file2df("COLUMN10") appeared in
the output of joineddf.collect.foreach(println)?
Thanks
On Sun, Dec 27, 2015 at 6:32 PM, Divya Gehlot
wrote:
> Hi,
> I am trying to join two dataframes and able to display the results in the
Hi all,
SPARK-3174 (https://issues.apache.org/jira/browse/SPARK-3174) is a useful
feature to save resources on YARN.
We want to enable this feature on our YARN cluster.
I have a question about the version of the shuffle service.
I'm now using spark-1.5.1 (shuffle service).
If I want to upgrade to
The external shuffle service is backward compatible, so if you deploy the 1.6
shuffle service on the NodeManager, it can serve both 1.5 and 1.6 Spark applications.
Thanks
Saisai
On Mon, Dec 28, 2015 at 2:33 PM, 顾亮亮 wrote:
> Is it possible to support both spark-1.5.1 and spark-1.6.0 on
Is it possible to support both spark-1.5.1 and spark-1.6.0 on one yarn cluster?
From: Saisai Shao [mailto:sai.sai.s...@gmail.com]
Sent: Monday, December 28, 2015 2:29 PM
To: Jeff Zhang
Cc: 顾亮亮; user@spark.apache.org; 刘骋昺
Subject: Re: Opening Dynamic Scaling Executors on Yarn
Replace all the
Thank you for your response!
But this approach did not help.
This one works:
import scala.reflect.{ClassTag, classTag}

def check[T: ClassTag](rdd: List[T]) = rdd match {
  case rdd: List[Int] if classTag[T] == classTag[Int] => println("Int")
  case rdd: List[String] if classTag[T] == classTag[String] => println("String")
}
Finally able to resolve the issue.
For a sample example with a small dataset, it is creating some 200 files. I
was just doing a random file check in the output directory and, alas, was
getting files containing only the column names.
Attaching the output files now.
Now another question arises: why so many (200) output files?
Replacing all the shuffle jars and restarting the NodeManager is enough; there
is no need to restart the NN.
On Mon, Dec 28, 2015 at 2:05 PM, Jeff Zhang wrote:
> See
> http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
>
>
> On Mon, Dec 28, 2015 at 2:00 PM,
Hey all!
I would like to partition *json* data by a column name and store the result
as a collection of json files to be loaded into another database.
I could use Spark's built-in *partitionBy* function, but it only outputs in
Parquet format, which is not desirable for me.
Could you suggest a way