This seems like a deployment or dependency issue. Please check the
following:
1. That the unmodified Spark jars are not on the classpath (e.g. already present
on the cluster or pulled in by other packages).
2. That the modified jars were indeed deployed to both master and slave nodes.
On Tue, Jul 5, 2016 at
Yes, it can.
On Fri, Jul 8, 2016 at 3:03 PM, Ashok Kumar wrote:
> thanks so basically Spark Thrift Server runs on a port much like beeline
> that uses JDBC to connect to Hive?
>
> Can Spark thrift server access Hive tables?
>
> regards
>
>
> On Friday, 8 July 2016, 5:27,
Thanks, so basically Spark Thrift Server runs on a port, much like beeline, which
uses JDBC to connect to Hive?
Can Spark thrift server access Hive tables?
regards
On Friday, 8 July 2016, 5:27, ayan guha wrote:
Spark Thrift Server works as a JDBC server. You can
Yes, absolutely. You need to "save" the table using the saveAsTable function.
The underlying storage is HDFS or any other storage, and you are basically using
Spark's embedded Hive services (when you do not have Hadoop in the setup).
Think of STS as a JDBC server in front of your datastore. STS runs as a
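A minimal sketch of what that "save" step might look like (the path and table name here are hypothetical; assumes a HiveContext or the embedded metastore):

val df = sqlContext.read.parquet("/data/events")   // hypothetical source
df.write.saveAsTable("events")                     // persists via the (embedded) Hive metastore
// a JDBC client connected to STS can now run: SELECT count(*) FROM events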
Tableau has its own Tableau Server, which stores prepared reports, not data.
It does not cache data. What users do is access Tableau Server from
their Tableau client and use the reports. You still need to get the data out of
the persistent store. I have not heard of Tableau having its own storage
layer.
Hi Manoj,
I have a Spark meetup talk that explains the issues with DIMSUM when you
have to calculate row similarities. You can still use the PR since it has
all the code you need, but I have not had time to refactor it for the merge.
I believe a few kernels are supported as well.
Thanks.
Deb
On
Hi Ayan,
Thanks for replying. That sounds great. Let me check.
One thing that confuses me: is there any way to share things between the two? I mean
Zeppelin and the Thrift Server. For example: I update or insert data into a table from
Zeppelin, and an external tool connected through STS can see it.
Thanks & regards,
Chanh
Hi Mich,
Actually, technical users may also write some fairly complex machine learning
jobs in the future, which is why Zeppelin is promising.
> Those business users. Do they use Oracle BI (OBI) to connect to a DW like Oracle
> now?
Yes, they do. Our data is still stored in Oracle, but it's
Spark Thrift Server works as a JDBC server. You can connect to it from
any JDBC tool, such as SQuirreL.
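As a hedged illustration (not from this thread), a plain JDBC connection from code would look roughly like this, assuming STS on its default port 10000 and the Hive JDBC driver on the classpath; the host and table names are made up:

import java.sql.DriverManager
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://sts-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("SELECT count(*) FROM events")
while (rs.next()) println(rs.getLong(1))
conn.close()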
On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar
wrote:
> Hello gurus,
>
> We are storing data externally on Amazon S3
>
> What is the optimum or best way to use Spark
Hi
Spark Thrift does not need Hive/Hadoop. STS should be your first choice if
you are planning to integrate BI tools with Spark. It works with Zeppelin
as well. We do all our development using Zeppelin and STS.
One thing to note: many BI tools like QlikSense and Tableau (not sure about the
Oracle BI tool)
Interesting Chanh
Those business users. Do they use Oracle BI (OBI) to connect to a DW like Oracle
now?
Certainly power users can use Zeppelin to write code that will be executed
through Spark, but I much doubt Zeppelin can do what the OBI tool provides.
What you need is to investigate whether the OBI tool can
Yea, I totally agree with Yong.
Anyway, this might not be a great idea, but you might want to take a look at
this:
http://pivotal-field-engineering.github.io/pmr-common/pmr/apidocs/com/gopivotal/mapreduce/lib/input/JsonInputFormat.html
This does not recognise nested structure but I assume you might
Hi Mich,
Thanks for replying. Currently we think we need to separate users into 2 groups:
1. Technical: can write SQL.
2. Business: can drag and drop fields or metrics and see the result.
Our stack uses Zeppelin and Spark SQL to query data from Alluxio. Our data is
currently stored in Parquet files.
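Roughly, a query in this setup looks like the sketch below (the Alluxio URI and paths are made up for illustration; assumes the Alluxio client jar is on the Spark classpath):

val df = sqlContext.read.parquet("alluxio://alluxio-master:19998/datasets/events")
df.registerTempTable("events")
sqlContext.sql("SELECT network_id, count(*) FROM events GROUP BY network_id").show()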
Hi,
I have not used Alluxio, but it is a distributed file system, much like an
in-memory database (IMDB), say Oracle TimesTen. Spark is your query tool and Zeppelin
is the GUI interface to your Spark, which basically lets you build graphs from Spark
queries.
You mentioned Hive, so I assume your persistent storage is Hive?
Hi everyone,
Currently we use Zeppelin to analyse our data, and because it is SQL-based it is
hard to roll out to users. But users are using some kind of Oracle BI
tool for analytics, because it supports drag and drop and lets us set
permissions for each user.
Our
Hi Debasish, All,
I see the status of SPARK-4823 [0] is "in-progress" still. I couldn't
gather from the relevant pull request [1] if part of it is already in 1.6.0
(it's closed now). We are facing the same problem of computing pairwise
distances between vectors where rows are > 5M and columns in
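For reference, what MLlib exposes out of the box is column similarities via DIMSUM (RowMatrix.columnSimilarities); row similarities, as I understand it, still need the code from the PR. A minimal sketch of the existing API, with toy data:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0, 3.0), Vectors.dense(4.0, 5.0, 6.0)))
val mat = new RowMatrix(rows)
// DIMSUM with a similarity threshold; returns an upper-triangular CoordinateMatrix of cosine similarities
val sims = mat.columnSimilarities(0.5)
sims.entries.take(5).foreach(println)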
The N is much bigger than 1 in my case.
Here is an example that describes my issue.
"select column1, stddev_samp(column2) from table1 group by column1" gives NaN
"select column1, cast(stddev_samp(column2) as decimal(16,3)) from
table1 group by column1" gives numeric values. e.g. 234.234
"select
Kafka has an interesting model that might be applicable.
You can think of Kafka as enabling a queue system. Writers are called
producers, and readers are called consumers. The server is called a broker.
A "topic" is like a named queue.
Producers are independent. They can write to a "topic" at
We are planning to address this issue in the future.
At a high level, we'll have to add a delta mode so that updates can be
communicated from one operator to the next.
On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly
wrote:
> Indeed. But nested aggregation does not work
I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using kafka
direct stream approach. I am running into performance problems. My
processing time is greater than my window size. Changing window sizes, adding
cores and executor memory does not change performance. I am having a lot of
Do any of the Spark SQL 1.x or 2.0 APIs allow reading from a REST endpoint
consuming JSON or XML?
For example, in the 2.0 context I've tried the following attempts, with varying
errors:
val conf = new SparkConf().setAppName("http test").setMaster("local[2]")
val builder =
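One pattern I am also considering (a sketch only, with a made-up URL and no error handling): fetch the response on the driver and hand it to the JSON reader, which should be fine for small payloads:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("http test").master("local[2]").getOrCreate()
val payload = scala.io.Source.fromURL("http://example.com/api/items.json").mkString
val df = spark.read.json(spark.sparkContext.parallelize(Seq(payload)))
df.printSchema()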
Thanks.
Will this presentation be recorded as well?
Regards
On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh
wrote:
Dear forum members
I will be presenting on the topic of "Running Spark on Hive or Hive on Spark,
your mileage varies" in Future of Data: London
Thanks Prajwal.
I tried these options and they make no difference.
On Thu, Jul 7, 2016 at 12:20 PM Prajwal Tuladhar wrote:
> You can try to play with experimental flags [1]
> `spark.executor.userClassPathFirst`
> and `spark.driver.userClassPathFirst`. But this can also
Thanks Marco
The code snippet has something like below.
ClassLoader cl = Thread.currentThread().getContextClassLoader();
String packagePath = "com.xxx.xxx";
final Enumeration resources = cl.getResources(packagePath);
So resources collection is always empty, indicating no classes are loaded.
As
Hello gurus,
We are storing data externally on Amazon S3
What is the optimum or best way to use Spark as SQL engine to access data on S3?
Any info/write up will be greatly appreciated.
Regards
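For context, the basic pattern we have in mind is roughly the sketch below (bucket and paths are hypothetical; assumes the s3a connector and AWS credentials are configured):

val df = sqlContext.read.parquet("s3a://my-bucket/events/")
df.registerTempTable("events")
sqlContext.sql("SELECT count(*) FROM events").show()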
The problem is for the Hadoop InputFormat to identify the record delimiter. If the
whole JSON record is on one line, then the natural record delimiter will be the
new line character.
Keep in mind that in a distributed file system, the file split position most likely is
NOT on the record delimiter. The
Hi, there,
Thank you all for your input. @Hyukjin, as a matter of fact, I had read
the blog link you posted before asking the question on the forum. As you
pointed out, the link uses wholeTextFiles(), which is bad in my case,
because my JSON file can be as large as 20G+ and OOM might occur. I am
You can try to play with the experimental flags [1]
`spark.executor.userClassPathFirst`
and `spark.driver.userClassPathFirst`. But this can also potentially break
other things (for example, dependencies that the Spark master requires at
initialization being overridden by the Spark app, and so on), so you will need to verify.
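For reference, a sketch of setting these flags programmatically (they can equally be passed with --conf to spark-submit):

val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")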
Hi Chen
pls post
1 . snippet code
2. exception
any particular reason why you need to load classes in other jars
programmatically?
Have you tried to build a fat jar with all the dependencies ?
hth
marco
On Thu, Jul 7, 2016 at 5:05 PM, Chen Song wrote:
> Sorry to spam
Sorry to spam people who are not interested. Greatly appreciate it if
anyone who is familiar with this can share some insights.
On Wed, Jul 6, 2016 at 2:28 PM Chen Song wrote:
> Hi
>
> I ran into problems to use class loader in Spark. In my code (run within
> executor),
Indeed. But nested aggregation does not work with Structured Streaming,
that's the point. I would like to know if there is workaround, or what's
the plan regarding this feature which seems to me quite useful. If the
implementation is not overly complex and it is just a matter of manpower,
I am
Can you try running the example like this
./bin/run-example sql.RDDRelation
I know there are some jars in the example folders, and running them this
way adds them to the classpath
On Jul 7, 2016 3:47 AM, "kevin" wrote:
> hi,all:
> I build spark use:
>
>
Hi,
Can anyone please look into this issue? I would really love some
assistance here.
Thanks a lot.
Thanks & Regards
Biplob Biswas
On Tue, Jul 5, 2016 at 1:00 PM, Biplob Biswas
wrote:
>
> Hi,
>
> I implemented the streamingKmeans example provided in the
Hey all,
I hit a pretty nasty bug on 1.6.2 that I can't reproduce on 2.0. Here is
the code/logical plan http://pastebin.com/ULnHd1b6. I have filterPushdown
disabled, so when I call collect here it hits the Exception in my UDF
before doing a null check on the input.
I believe it is a symptom of
That is what I was thinking of doing.
Do you have any idea about this? Or any documentations?
Many thanks.
2016-07-07 17:07 GMT+02:00 Koert Kuipers :
> i dont see any easy way to extend the plans, beyond creating a custom
> version of spark.
>
> On Thu, Jul 7, 2016 at 9:31 AM,
Thank you for your answer.
Since Spark 1.6.0, it is possible to partition a dataframe using hash
partitioning with repartition:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
I have also sorted a dataframe, and it uses range partitioning in the
i dont see any easy way to extend the plans, beyond creating a custom
version of spark.
On Thu, Jul 7, 2016 at 9:31 AM, tan shai wrote:
> Hi,
>
> I need to add new operations to the dataframe API.
> Can any one explain to me how to extend the plans of query execution?
>
>
since dataframes represent more or less a plan of execution, they do not
have partition information as such i think?
you could however do dataFrame.rdd, to force it to create a physical plan
that results in an actual rdd, and then query the rdd for partition info.
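For example, something along these lines (df here is any DataFrame; the names are illustrative):

val rdd = df.rdd                       // forces a physical plan and materializes an RDD of Rows
val numPartitions = rdd.partitions.length
val partitioner = rdd.partitioner      // typically None for an RDD derived from a DataFrame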
On Thu, Jul 7, 2016 at 4:24 AM,
You have to distribute the files in some distributed file system like HDFS,
or else copy the files to all executors' local file systems and make sure to
mention the file scheme in the URI explicitly.
Thanks
Deepak
On Thu, Jul 7, 2016 at 7:13 PM, Balachandar R.A.
wrote:
Hi
Thanks for the code snippet. If the executable inside the map process needs
to access directories and files present in the local file system, is that
possible? I know the tasks run on a slave node in a temporary working
directory and I can think about the distributed cache. But still would like to
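For example, is something along the lines of the sketch below (paths are made up) the right direction? SparkContext.addFile plays the role of the distributed cache here:

sc.addFile("hdfs:///configs/lookup.txt")              // shipped to every executor
val results = sc.parallelize(1 to 10).map { i =>
  val localPath = org.apache.spark.SparkFiles.get("lookup.txt")  // local path on the executor
  // the external executable could read localPath here
  localPath
}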
Hi,
I need to add new operations to the dataframe API.
Can anyone explain to me how to extend the plans of query execution?
Many thanks.
Hi Puneet,
Have you tried appending
--jars $SPARK_HOME/lib/spark-examples-*.jar
to the execution command?
Ram
On Thu, Jul 7, 2016 at 5:19 PM, Puneet Tripathi <
puneet.tripa...@dunnhumby.com> wrote:
> Guys, Please can anyone help on the issue below?
>
>
>
> Puneet
>
>
>
> *From:* Puneet
Hi,
How can I use this option in Random Forest?
When I transform my vector (100 features), I have 20 categorical features
included.
If I understand categoricalFeaturesInfo, I should pass the positions of my 20
categorical features inside the vector containing 100, with a map {
position of feature
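In other words, is the usage roughly like the sketch below (feature indices and category counts are made up; assumes trainingData is an RDD[LabeledPoint])?

import org.apache.spark.mllib.tree.RandomForest
// map of featureIndex (position in the 100-element vector) -> number of categories for that feature
val categoricalFeaturesInfo = Map(3 -> 4, 17 -> 2, 42 -> 5)
val model = RandomForest.trainClassifier(trainingData, 2 /* numClasses */, categoricalFeaturesInfo,
  50 /* numTrees */, "auto", "gini", 5 /* maxDepth */, 32 /* maxBins */)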
Hi,
Your statement below:
sortedDF.registerTempTable("sortedDF")
val top50 = hc.sql("select id,location from sortedDF where
rowNumber<=50")
Is on a Hive table. Try
sortedDF.registerTempTable("sortedDF")
val top50 = hc.sql("select id,location from sortedDF LIMIT 50")
rowNumber
Hi,
I can try to guess what is wrong, but I might be incorrect.
You should be careful with window frames (you define them via the
rowsBetween() method).
In my understanding, all window functions can be divided into 2 groups:
- functions defined by the
Thank you for the answer.
One of the optimizations of Dataframes/Datasets (beyond the Catalyst) are
the Encoders (Project Tungsten), which translate domain objects into
Spark's internal format (binary). By using encoders, the data is not
managed by the Java Virtual Machine anymore (which increase
Hi Michal,
Will an example help?
import scala.util.parsing.json._ // Requires scala-parser-combinators
because it is no longer part of core scala
val wbJSON = JSON.parseFull(weatherBox) //wbJSON is a JSON object now
//Depending on the structure, now traverse through the object
val
Hi guys
I'm trying to extract a Map[String, Any] from a json string. This works well
in any scala repl I tried, both scala 2.11 and 2.10, and using both the
json4s and liftweb-json libraries, but if I try to do the same thing in
spark-shell I'm always getting "No information known about type..."
Arnaud,
You could aggregate the first table and then merge it with the second table
(assuming that they are similarly structured) and then carry out the second
aggregation. Unless the data is very large, I don’t see why you should persist
it to disk. IMO, nested aggregation is more elegant
Yes, finally it will be converted to an RDD internally. However, DataFrame
queries are passed through Catalyst, which provides several optimizations,
e.g. code generation, intelligent shuffle etc., which is not the case for
pure RDDs.
Regards,
Rishitesh Mishra,
SnappyData .
It's aggregation at multiple levels in a query: first do some aggregation
on one table, then join with another table and do a second aggregation. I
could probably rewrite the query in such a way that it does aggregation in
one pass but that would obfuscate the purpose of the various stages.
Le 7
Guys, Please can anyone help on the issue below?
Puneet
From: Puneet Tripathi [mailto:puneet.tripa...@dunnhumby.com]
Sent: Thursday, July 07, 2016 12:42 PM
To: user@spark.apache.org
Subject: Spark with HBase Error - Py4JJavaError
Hi,
We are running Hbase in fully distributed mode. I tried to
How can you verify that it is loading only the partitions matching the time and
network in the filter?
2016-07-07 11:58 GMT+02:00 Chanh Le :
> Hi Tan,
> It depends on how data organise and what your filter is.
> For example in my case: I store data by partition by field time and
> network_id.
Yes it is operating on the sorted column
2016-07-07 11:43 GMT+02:00 Ted Yu :
> Does the filter under consideration operate on sorted column(s) ?
>
> Cheers
>
> > On Jul 7, 2016, at 2:25 AM, tan shai wrote:
> >
> > Hi,
> >
> > I have a sorted
Hi Anton: I checked the docs you mentioned and wrote code accordingly;
however, I met an exception like "org.apache.spark.sql.AnalysisException: Window
function row_number does not take a frame specification.;" It seems that
the row_number API is giving global row numbers for every row
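For reference, the pattern I understand should work is to define the window with only partitionBy/orderBy and no rowsBetween (a sketch; df and the column names are hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.partitionBy("id").orderBy("score")   // no frame specification for row_number
val top50 = df.withColumn("rowNumber", row_number().over(w)).filter("rowNumber <= 50")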
Dear guys,
I'm investigating the differences between RDDs and Dataframes/Datasets. I
couldn't find the answer to this question: do Dataframes act as a new layer
in the Spark stack? I mean, during execution is there a conversion to RDD?
For example, if I create a Dataframe and perform a query, in
I have tried this already. It does not work.
What version of Python is needed for this package?
On Wed, Jul 6, 2016 at 12:45 AM, Felix Cheung
wrote:
> This could be the workaround:
>
> http://stackoverflow.com/a/36419857
>
>
>
>
> On Tue, Jul 5, 2016 at 5:37 AM
In that case, I suspect that MQTT is not getting data while you are
submitting in yarn-cluster mode.
Can you please try dumping the data to a text file instead of printing it while
submitting in yarn-cluster mode?
On Jul 7, 2016 12:46 PM, "Yu Wei" wrote:
> Yes. Thanks for your
Hi Arnauld,
Sorry for the doubt, but what exactly is multiple aggregation? What is the use
case?
Regards,
Sivakumaran
> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly wrote:
>
> Hello,
>
> I understand multiple aggregations over streaming dataframes is not currently
>
It's not required.
Simplified Parallelism: No need to create multiple input Kafka streams
and union them. With directStream, Spark Streaming will create as many RDD
partitions as there are Kafka partitions to consume, which will all read
data from Kafka in parallel. So there is a one-to-one
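A minimal sketch of the direct stream for reference (Spark 1.x Kafka integration; the broker list and topic name are hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))   // one RDD partition per Kafka partition of "events"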
hi,all:
I build spark use:
./make-distribution.sh --name "hadoop2.7.1" --tgz
"-Pyarn,hadoop-2.6,parquet-provided,hive,hive-thriftserver" -DskipTests
-Dhadoop.version=2.7.1
I can run example :
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://master1:7077 \
Hello,
I understand multiple aggregations over streaming dataframes is not
currently supported in Spark 2.0. Is there a workaround? Out of the top of
my head I could think of having a two stage approach:
- first query writes output to disk/memory using "complete" mode
- second query reads from
Hi,
I changed auto.offset.reset to largest. The result is 30, 50, 40, 40, 35, 30
seconds... instead of 10 seconds. It looks like an attempt to react to
backpressure, but very slowly. In any case it is far from any real-time tasks,
including soft real time. And OK, I agreed with Spark usage with data
Hi Tan,
It depends on how the data is organised and what your filter is.
For example, in my case I store data partitioned by the fields time and network_id.
If I filter by time or network_id (or both) together with other fields, Spark only loads
the partitions matching the time and network in the filter, then filters the rest.
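Roughly, the layout is like the sketch below (paths and values are made up): write partitioned by the filter columns, then filters on those columns only touch the matching directories:

df.write.partitionBy("time", "network_id").parquet("/warehouse/events")
val pruned = sqlContext.read.parquet("/warehouse/events")
  .filter("time = '2016-07-07' AND network_id = 123")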
> On Jul 7,
Does the filter under consideration operate on sorted column(s) ?
Cheers
> On Jul 7, 2016, at 2:25 AM, tan shai wrote:
>
> Hi,
>
> I have a sorted dataframe, I need to optimize the filter operations.
> How does Spark performs filter operations on sorted dataframe?
>
Hi Team,
Is there a way we can consume from Kafka using spark Streaming direct API
using multiple consumers (belonging to same consumer group)
Regards,
Sam
Hi,
I have a sorted dataframe, I need to optimize the filter operations.
How does Spark performs filter operations on sorted dataframe?
It is scanning all the data?
Many thanks.
Sample standard deviation can't be defined in the case of N=1, because
it has N-1 in the denominator. My guess is that this is the case
you're seeing. A population of N=1 still has a standard deviation of
course (which is 0).
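A tiny illustration of that case (made-up data; assumes import sqlContext.implicits._ for toDF):

import sqlContext.implicits._
val df = Seq(("a", 1.0), ("b", 2.0), ("b", 4.0)).toDF("column1", "column2")
df.registerTempTable("table1")
sqlContext.sql("select column1, stddev_samp(column2) from table1 group by column1").show()
// group "a" has a single row, so stddev_samp is NaN; group "b" gives ~1.414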
On Thu, Jul 7, 2016 at 9:51 AM, Mungeol Heo
The OP is not calling stddev though, so I still don't see that this is
the question at hand.
But while we're off on the topic -- while I certainly agree that
stddev is mapped to the sample standard deviation in DBs, it doesn't
actually make much sense as a default.
What you get back is not the
I know stddev_samp and stddev_pop gives different values, because they
have different definition. What I want to know is why stddev_samp
gives "NaN", and not a numeric value.
On Thu, Jul 7, 2016 at 5:39 PM, Sean Owen wrote:
> I don't think that's relevant here. The question
stddev is mapped to stddev_samp. That is the general use case or rather
common use of standard deviation.
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
I don't think that's relevant here. The question is why would samp
give a different result to pop, not the result of "stddev". Neither
one is a 'correct' definition of standard deviation in the abstract;
one or the other is correct depending on what standard deviation you
are trying to measure.
The correct STDDEV function to use is STDDEV_SAMP, not STDDEV_POP.
You can actually work that one out yourself.
BTW Hive also gives a wrong value. This is what I reported back in April
about Hive giving an incorrect value.
Both Oracle and Sybase point STDDEV to STDDEV_SAMP
No, because these are different values defined differently. If you
have 1 data point, the sample stdev is undefined while population
stdev is defined. Refer to their definition.
On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo wrote:
> Hello,
>
> As I mentioned at the title,
Using partitioning with dataframes, how can we retrieve information about
partitions? Partition bounds, for example.
Thanks,
Shaira
2016-07-07 6:30 GMT+02:00 Koert Kuipers :
> spark does keep some information on the partitions of an RDD, namely the
> partitioning/partitioner.
Hello,
As I mentioned in the title, the stddev_samp function gives a NaN while
stddev_pop gives a numeric value on the same data.
The stddev_samp function will give a numeric value, if I cast it to decimal.
E.g. cast(stddev_samp(column_name) as decimal(16,3))
Is it a bug?
Thanks
- mungeol
Hi,
I am also interested in having JOIN support streams on both sides. I
understand Spark SQL's JOIN currently support having a stream on a single
side only. Could you please provide some more details on why this is the
case? What are the technical limitations that make it harder to implement
We will look into streaming-streaming joins in future release of Spark,
though no promises on any timeline yet. We are currently fighting to get
Spark 2.0 out of the door.
There isnt a JIRA for this right now. However, you can track the Structured
Streaming Epic JIRA to track what's going on. I try
Yes. Thanks for your clarification.
The problem I encountered is that in yarn-cluster mode, there is no output from
"DStream.print()" in the yarn logs.
In the Spark implementation org/apache/spark/streaming/dstream/DStream.scala, the
logs related to "Time" were printed out. However, other information for
Hi,
We are running Hbase in fully distributed mode. I tried to connect to Hbase via
pyspark and then write to hbase using saveAsNewAPIHadoopDataset , but it failed
the error says:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
:
Hi,
This jira https://issues.apache.org/jira/browse/SPARK-8813 is fixed in Spark
2.0, but the resolution is not mentioned there.
In our use case, there are big as well as many small parquet files which are
being queried using Spark SQL. Can someone please explain what the fix is and
how I can use it
The link uses the wholeTextFiles() API, which treats each file as a single record.
2016-07-07 15:42 GMT+09:00 Jörn Franke :
> This does not need necessarily the case if you look at the Hadoop
> FileInputFormat architecture then you can even split large multi line Jsons
> without
This does not necessarily need to be the case: if you look at the Hadoop
FileInputFormat architecture, you can even split large multi-line JSONs
without issues. I would need to have a look at it, but one large file does not
mean one executor, independent of the underlying format.
> On 07 Jul 2016,
Hi, all:
As recorded in https://issues.apache.org/jira/browse/SPARK-16408, when
using Spark-sql to execute sql like:
add file hdfs://xxx/user/test;
If the HDFS path( hdfs://xxx/user/test) is a directory, then we will get
an exception like:
org.apache.spark.SparkException: Added file
Hi Sarath,
By any chance have you resolved this issue ?
Thanks,
Padma CH
On Tue, Apr 28, 2015 at 11:20 PM, sarath [via Apache Spark User List] <
ml-node+s1001560n22694...@n3.nabble.com> wrote:
>
> I am trying to train a large dataset consisting of 8 million data points
> and 20 million
There is a good link for this here,
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
If there are a lot of small files, then it would work pretty okay in a
distributed manner, but I am worried if it is a single large file.
In this case, this would only work in
do you want id1, id2, id3 to be processed similarly?
The Java code I use is:
df = df.withColumn(K.NAME, df.col("fields.premise_name"));
the original structure is something like {"fields":{"premise_name":"ccc"}}
hope it helps
> On Jul 7, 2016, at 1:48 AM, Lan Jiang