Hi Arunkumar,
If you want to create a vector from multiple columns of a DataFrame, Spark ML
provides VectorAssembler to help with this.
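A minimal sketch (the input column names x1, x2, x3 are placeholders for the
columns of your DataFrame, dataDF in your snippet):

import org.apache.spark.ml.feature.VectorAssembler

// Combine the independent-variable columns into a single "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")
val assembled = assembler.transform(dataDF)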
Yanbo
2015-12-21 13:44 GMT+08:00 Arunkumar Pillai :
> Hi
>
>
> I'm trying to use Linear Regression from ml library
>
> but the problem is the independent variable should be a vector.
Hi Friends,
I have created a Hive external table with partitions. I want to alter the
Hive table partition through Spark with Java code.
alter table table1
add if not exists
partition(datetime='2015-12-01')
location 'hdfs://localhost:54310/spark/twitter/datetime=2015-12-01/'
The above query
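One hedged way to run such a statement from Spark (Scala shown; the Java API is
the same apart from syntax, e.g. new HiveContext(jsc).sql(...)):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Runs the DDL against the Hive metastore, adding the partition if absent.
hiveContext.sql(
  """ALTER TABLE table1 ADD IF NOT EXISTS
    |PARTITION (datetime='2015-12-01')
    |LOCATION 'hdfs://localhost:54310/spark/twitter/datetime=2015-12-01/'""".stripMargin)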
Problem resolved, syntax issue )-:
On Mon, 21 Dec 2015 at 06:09 Jeff Zhang wrote:
> If it does not return a column you expect, then what does this return? Will
> you have 2 columns with the same column name?
>
> On Sun, Dec 20, 2015 at 7:40 PM, Eran Witkon wrote:
>
>> Hi,
>>
>> I am a bit confused with dataframe operations.
Once I removed the CR LF from the file it worked ok.
eran
On Mon, 21 Dec 2015 at 06:29 Yin Huai wrote:
> Hi Eran,
>
> Can you try 1.6? With the change in
> https://github.com/apache/spark/pull/10288, JSON data source will not
> throw a runtime exception if there is any record that it cannot parse.
> Instead, it will put the entire record into the "_corrupt_record" column.
Hi
I'm using the ml.LinearRegression package.
How do I get the estimates and standard errors for the coefficients?
PFB the code snippet
val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setFitIntercept(true)
val model = lr.fit(test)
val estimates = model.summary
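A hedged sketch, assuming Spark 1.6+: the training summary exposes standard
errors, but only when the model is fit with the normal-equation solver.

val lrNormal = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setFitIntercept(true)
  .setSolver("normal")  // coefficientStandardErrors requires the "normal" solver

val fitted = lrNormal.fit(test)
println(fitted.coefficients)  // point estimates
println(fitted.summary.coefficientStandardErrors.mkString(", "))  // standard errors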
Hi
I'm trying to use Linear Regression from ml library
but the problem is the independent variable should be a vector.
My code snippet is as follows:
var dataDF = sqlContext.emptyDataFrame
dataDF = sqlContext.sql("SELECT "+
dependentVariable+","+independentVariables +" FROM " + tab
Hi Eran,
Can you try 1.6? With the change in
https://github.com/apache/spark/pull/10288, JSON data source will not throw
a runtime exception if there is any record that it cannot parse. Instead,
it will put the entire record into the "_corrupt_record" column.
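A minimal sketch of working with this (the path is hypothetical; note the
_corrupt_record column only appears when some records actually fail to parse):

val df = sqlContext.read.json("/path/to/data")
// Records that failed to parse have a non-null _corrupt_record value.
val bad  = df.filter(df("_corrupt_record").isNotNull)
val good = df.filter(df("_corrupt_record").isNull)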
Thanks,
Yin
On Sun, Dec 20, 2015 a
If it does not return a column you expect, then what does this return? Will
you have 2 columns with the same column name?
On Sun, Dec 20, 2015 at 7:40 PM, Eran Witkon wrote:
> Hi,
>
> I am a bit confused with dataframe operations.
> I have a function which takes a string and returns a string
>>> Would the driver not wait till all the stuff related to test1 is
completed before calling test2 as test2 is dependent on test1?
>>> val test1 =RDD1.mapPartitions.()
>>> val test2 = test1.mapPartititions()
On the driver side, these 2 lines of code will actually be executed, but the
real computation is deferred until an action is triggered.
Normally there will be one RDD in each batch.
You could refer to the implementation of DStream#getOrCompute.
On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel
wrote:
> It may be a simple question... but I am struggling to understand this
>
> DStream is a sequence of RDDs created in a batch window
I see this happens when there is a deadlock situation. The RDD test1 has a
Couchbase call and it seems to be having threads hanging there. Even though
all the connections are closed I see the threads related to Couchbase
causing the job to hang for sometime before it gets cleared up.
Would the driver not wait till all the stuff related to test1 is completed
before calling test2 as test2 is dependent on test1?
It may be a simple question... but I am struggling to understand this
DStream is a sequence of RDDs created in a batch window. So, how do I know
how many RDDs are created in a batch?
I am clear about the number of partitions created, which is:
Number of Partitions = (Batch Interval / spark.streaming.blockInterval)
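For example (assuming the receiver-based model and the default
spark.streaming.blockInterval of 200 ms): a 2 s batch interval yields
2000 ms / 200 ms = 10 blocks, hence 10 partitions per batch.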
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.hive.orc._

scala> import org.apache.spark.sql.types.{StructType, StructField, StringType}
I have a large Map that is assembled in the driver and broadcast to each node.
My question is how best to allocate memory for this. The Driver has to have
enough memory for the Maps, but only one copy is serialized to each node. What
type of memory should I size to match the Maps? Is the broadc
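For reference, a minimal broadcast sketch (data and names hypothetical): the
driver holds the original map, and each executor receives one serialized copy
that all its tasks share read-only, so executor memory needs one copy of the
map, not one per task.

val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)  // stand-in for the large map
val bcLookup = sc.broadcast(lookup)

val rdd = sc.parallelize(Seq("a", "b", "c"))
// Tasks read the shared broadcast value instead of capturing the map in the closure.
val mapped = rdd.map(k => bcLookup.value.getOrElse(k, -1))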
Hi,
can anyone please help me troubleshoot this problem: I have a streaming pyspark
application (spark 1.5.2 on yarn-client) which keeps crashing after a few hours.
It doesn't seem to be running out of memory on either the driver or the executors.
driver error:
py4j.protocol.Py4JJavaError: An error occurred whi
I have a directory that contains a set of parquet files that have been
partitioned. As such, there are subdirectories all of the form MyKey=XXX.
The structure is relatively large but I have a query that wants to use only
the data from the maximum value for MyKey. I tried the trivial approach
(i.
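A hedged sketch of the partition-pruning approach (path hypothetical): partition
discovery exposes MyKey as a column, and an equality filter on it prunes
directories instead of scanning everything, though finding the max still
requires one pass.

import org.apache.spark.sql.functions.max

val df = sqlContext.read.parquet("/path/to/root")
val maxKey = df.agg(max("MyKey")).first().get(0)
val latest = df.filter(df("MyKey") === maxKey)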
@Chris,
There is a 1-1 mapping b/w spark partitions & kafka partitions out of
the box. One can break it by repartitioning of course and add more
parallelism, but that has its own issues around consumer offset management:
when do I commit the offsets, for example. While it's trivial to increase
The spark.sql.autoBroadcastJoinThreshold default value in 1.5.2 is 10MB.
According to the output in the console Spark is doing a broadcast, but a query
like the following does not perform well:
select
big_t.*,
small_t.name range_name
from big_t
join small_t on (1=1)
where small_t.min <= big_t.v an
I am running Spark programs on a large cluster (for which I do not have
administrative privileges). numpy is not installed on the worker nodes.
Hence, I bundled numpy with my program, but I get the following error:
Traceback (most recent call last):
File "/home/user/spark-script.py", line 12, i
I have a similar observation with 1.4.1, where the 3rd stage running
mapPartitionsWithIndex at Word2Vec.scala:312 seems to run with a single
thread (which takes forever for a reasonably large corpus). Can anyone help
explain whether this is an algorithm limitation or whether model parameters
can be effec
Thanks for this!
This was the problem...
On Sun, 20 Dec 2015 at 18:49 Chris Fregly wrote:
> hey Eran, I run into this all the time with Json.
>
> the problem is likely that your Json is "too pretty" and extends beyond
> a single line, which trips up the Json reader.
>
> my solution is usually to de-pretty the Json - either manually or through
> an ETL step - by stripping all white space before parsing.
this type of broadcast should be handled by Spark SQL/DataFrames automatically.
this is the primary cost-based, physical-plan query optimization that the Spark
SQL Catalyst optimizer supports.
in Spark 1.5 and before, you can trigger this optimization by properly setting
the spark.sql.autoBroadcastJoinThreshold property.
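A minimal sketch (the 50 MB figure is just an example): raise the threshold,
in bytes, so the small table qualifies for a broadcast join.

sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)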
Thanks Chris, will give it a go and report back.
Bizarrely, if I start the pyspark shell I don't see any issues.
Kr
Marco
On 20 Dec 2015 5:02 pm, "Chris Fregly" wrote:
> hopping on a plane, but check the hive-site.xml that's in your spark/conf
> directory (or should be, anyway). I believe you can change the root path
> thru this mechanism.
hopping on a plane, but check the hive-site.xml that's in your spark/conf
directory (or should be, anyway). I believe you can change the root path thru
this mechanism.
if not, this should give you more info to google on.
let me know as this comes up a fair amount.
> On Dec 19, 2015, at 4:58 PM,
how does Spark SQL/DataFrame know that train_users_2.csv has a field named
"id" or anything else domain-specific? is there a header? if so, does
sc.textFile() know about this header?
I'd suggest using the Databricks spark-csv package for reading CSV data. There
is an option in there to specify whether the file has a header.
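A hedged sketch (the package version is an assumption; add it via
--packages com.databricks:spark-csv_2.10:1.3.0 when launching the shell):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // use the first line as column names
  .option("inferSchema", "true")  // optionally guess column types
  .load("train_users_2.csv")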
hey Eran, I run into this all the time with Json.
the problem is likely that your Json is "too pretty" and extends beyond a
single line, which trips up the Json reader.
my solution is usually to de-pretty the Json - either manually or through an
ETL step - by stripping all white space before parsing.
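One hedged way to do that de-prettifying in Spark itself (path hypothetical,
and assuming one JSON document per file):

val oneLine = sc.wholeTextFiles("/path/to/pretty-json")
  .map { case (_, content) => content.replaceAll("\n", " ") }  // collapse to one line
val df = sqlContext.read.json(oneLine)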
Got it to work, thanks
On Sun, 20 Dec 2015 at 17:01 Eran Witkon wrote:
> I might be missing you point but I don't get it.
> My understanding is that I need a RDD containing Rows but how do I get it?
>
> I started with a DataFrame
> run a map on it and got the RDD [string,string,string,string]; now
separating out your code into separate streaming jobs - especially when there
are no dependencies between the jobs - is almost always the best route. it's
easier to combine atoms (fusion) than to split them (fission).
I recommend splitting out jobs along batch window, stream window, and
state-tr
I might be missing your point but I don't get it.
My understanding is that I need an RDD containing Rows, but how do I get it?
I started with a DataFrame,
ran a map on it and got an RDD [string,string,string,string]; now I want to
convert it back to a DataFrame and am failing.
Why?
On Sun, Dec 20, 2
See the comment for createDataFrame(rowRDD: RDD[Row], schema: StructType)
method:
* Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using the
given schema.
* It is important to make sure that the structure of every [[Row]] of
the provided RDD matches
* the provided schema. Otherwise, there will be a runtime exception.
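For instance, a sketch for the 4-tuple RDD in question (column names taken
from the thread; non-nullability is an assumption):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(
  StructField("cty", StringType, false) ::
  StructField("hse", StringType, false) ::
  StructField("nm",  StringType, false) ::
  StructField("yrs", StringType, false) :: Nil)

// Field order in each Row must match the schema exactly.
val rowRDD = jsonGzip.map { case (cty, hse, nm, yrs) => Row(cty, hse, nm, yrs) }
val df = sqlContext.createDataFrame(rowRDD, schema)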
Hi,
I have an RDD
jsonGzip
res3: org.apache.spark.rdd.RDD[(String, String, String, String)] =
MapPartitionsRDD[8] at map at <console>:65
which I want to convert to a DataFrame with schema
so I created a schema:
val schema =
StructType(
StructField("cty", StringType, false) ::
StructField("hse"
Was there stack trace following the error ?
Which Spark release are you using ?
Cheers
> On Dec 19, 2015, at 10:43 PM, Sree Eedupuganti wrote:
>
> i had 9 rows in my Mysql table
>
>
> options.put("dbtable", "(select * from employee");
>options.put("lowerBound", "1");
>options
Yes, this works...
Thanks
On Sun, Dec 20, 2015 at 3:57 PM Peter Zhang wrote:
> Hi Eran,
>
> Missing import package.
>
> import org.apache.spark.sql.types._
>
> will work. please try.
>
> Peter Zhang
> --
> Google
> Sent with Airmail
>
> On December 20, 2015 at 21:43:42, Eran Witkon (eranwit...@g
Hi Eran,
Missing import package.
import org.apache.spark.sql.types._
will work. please try.
Peter Zhang
--
Google
Sent with Airmail
On December 20, 2015 at 21:43:42, Eran Witkon (eranwit...@gmail.com) wrote:
Hi,
I am using spark-shell with version 1.5.2.
scala> sc.ver
Hi,
I am using spark-shell with version 1.5.2.
scala> sc.version
res17: String = 1.5.2
but when trying to use StructType I am getting error:
val struct =
  StructType(
    StructField("a", IntegerType, true) ::
    StructField("b", LongType, false) ::
    StructField("c", BooleanType, false) :: Nil)
On 19 Dec 2015, at 13:34, Steve Loughran <ste...@hortonworks.com> wrote:
On 18 Dec 2015, at 21:39, Andrew Or <and...@databricks.com> wrote:
Hi Roy,
I believe Spark just gets its application ID from YARN, so you can just do
`sc.applicationId`.
If you listen for a spark start e
Hi,
I am a bit confused with dataframe operations.
I have a function which takes a string and returns a string.
I want to apply this function to all rows of a single column in my
dataframe.
I was thinking of the following:
jsonData.withColumn("computedField",computeString(jsonData("hse")))
BUT js
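A hedged sketch of one fix: withColumn expects a Column expression, so wrap the
String => String function in a UDF first (computeString assumed defined as above):

import org.apache.spark.sql.functions.udf

val computeStringUdf = udf((s: String) => computeString(s))
val result = jsonData.withColumn("computedField", computeStringUdf(jsonData("hse")))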
disregard my last question - my mistake.
I accessed it as a column, not as a row:
jsonData.first.getAs[String]("cty")
Eran
On Sun, Dec 20, 2015 at 11:42 AM Eran Witkon wrote:
> Thanks, that works.
> One other thing -
> I have the following code:
>
> val jsonData = sqlContext.read.json("/home/era
Thanks, that works.
One other thing -
I have the following code:
val jsonData = sqlContext.read.json("/home/eranw/Workspace/JSON/sample")
jsonData.show()
+----+----+---+----+
| cty| hse| nm| yrs|
+----+----+---+----+
Yes it is. You can actually use the java.util.zip.GZIPInputStream in your
case.
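A minimal sketch, assuming the compressed payload is in a byte array:

import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

// Wrap the bytes in a GZIPInputStream and read the decompressed text.
def gunzip(bytes: Array[Byte]): String =
  Source.fromInputStream(new GZIPInputStream(new ByteArrayInputStream(bytes))).mkString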
Thanks
Best Regards
On Sun, Dec 20, 2015 at 3:23 AM, Eran Witkon wrote:
> Thanks, since it is just a snippet, do you mean that Inflater is coming
> from ZLIB?
> Eran
>
> On Fri, Dec 18, 2015 at 11:37 AM Akhil Das
> wrote:
Flume could be interesting for you.
> On 19 Dec 2015, at 00:27, SRK wrote:
>
> Hi,
>
> How to run multiple Spark jobs that take Spark Streaming data as the
> input, as a workflow in Oozie? We have to run our Streaming job first and
> then have a workflow of Spark Batch jobs to process the data
Just point the loader to the folder; you do not need the *.
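i.e., something like:

val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample")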
On Dec 19, 2015 11:21 PM, "Eran Witkon" wrote:
> Hi,
> Can I combine multiple JSON files to one DataFrame?
>
> I tried
> val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*")
> but I get an empty DF
> Eran
>
Use Spark job server https://github.com/spark-jobserver/spark-jobserver
Additionally:
1. You can also write your own job server with spray (a Scala REST
framework).
2. Create a Thrift server and pass the state of each job (Thrift objects)
between your different jobs.
-
Software Developer