Hey,
Recently, we found in our cluster that when we kill a Spark streaming
app, the whole cluster cannot respond for 10 minutes.
We investigated the master node and found that the master process
consumes 100% CPU when we kill the Spark streaming app.
How could this happen? Did
Hi,
In our project, we use a standalone dual-master setup + ZooKeeper for
HA of the Spark master.
Now the problem is, how do we know which master is the currently alive
master?
We tried to read the info that the master stores in ZooKeeper, but we
found there is no information to
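One workaround (an assumption on my part, not something read from ZooKeeper): each standalone master also serves a JSON status page on its web UI port, and the active one reports itself as ALIVE while the other stays STANDBY. A rough sketch, with hypothetical host names:

    import scala.io.Source

    // Hypothetical master hosts; 8080 is the default standalone web UI port.
    val masters = Seq("http://master1:8080/json", "http://master2:8080/json")

    // Loose check: the active master's JSON status contains "ALIVE",
    // the standby reports "STANDBY"; a dead master simply doesn't answer.
    val alive = masters.find { url =>
      try Source.fromURL(url).mkString.contains("ALIVE")
      catch { case _: Exception => false }
    }
    println(alive.getOrElse("no ALIVE master found"))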
Hi,
I'm trying to migrate some Hive scripts to Spark SQL. However, I
found some statements are incompatible with Spark SQL.
Here is my SQL; the same SQL works fine in the Hive environment.
SELECT
if(ad_user_id > 1000, 1000, ad_user_id) as user_id
FROM
, DEVAN M.S. msdeva...@gmail.com wrote:
Which context are you using, HiveContext or SQLContext? Can you try with
HiveContext?
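For reference, a minimal sketch of trying the same statement through a HiveContext; the table name ad_table and the > comparison are assumptions on my part:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-migration-test"))
    // HiveContext accepts the full HiveQL dialect, which the plain SQLContext parser may not.
    val hiveContext = new HiveContext(sc)

    // "ad_table" is a hypothetical table name standing in for the real source table.
    val result = hiveContext.sql(
      "SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_table")
    result.take(10).foreach(println)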
Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
VIDYAPEETHAM | Amritapuri | Cell +919946535290 |
On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao
Hi, I'm using Spark 1.2
On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
Hi Xuelin,
What version of Spark are you using?
Thanks,
Daoyuan
From: Xuelin Cao [mailto:xuelincao2...@gmail.com]
Sent: Tuesday, January 20, 2015 5:22 PM
To: User
12, 2015 at 9:50 PM, Xuelin Cao xuelincao2...@gmail.com
wrote:
Hi,
I'd like to create a transform function that converts RDD[String] to
RDD[Int].
Occasionally, the input RDD could be an empty RDD. I just want to
directly create an empty RDD[Int] if the input RDD is empty. And, I
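A minimal sketch of one way to handle the empty case, assuming the conversion itself is a plain map (note that RDD.isEmpty() only appeared in Spark 1.3, so take(1) is used here):

    import org.apache.spark.rdd.RDD

    def toInts(input: RDD[String]): RDD[Int] = {
      // take(1) is cheap for non-empty RDDs and returns an empty array for empty ones.
      if (input.take(1).isEmpty) input.sparkContext.emptyRDD[Int]
      else input.map(_.toInt)
    }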
Why not increase the tasks per core?
Best regards
On 9 Jan 2015 at 06:46, Xuelin Cao xuelincao2...@gmail.com wrote:
Hi,
I'm wondering whether it is a good idea to overcommit CPU cores on
the Spark cluster.
For example, in our testing cluster, each worker machine has 24
Hi,
I'm wondering whether it is a good idea to overcommit CPU cores on
the Spark cluster.
For example, in our testing cluster, each worker machine has 24
physical CPU cores. However, we are allowed to set the CPU core number to
48 or more in the Spark configuration file. As a result,
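If the goal is just to keep the cores busier, one alternative to overcommitting worker cores is to raise the parallelism so each physical core runs more, smaller tasks. A rough sketch; the numbers and path are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parallelism-example")
      // e.g. 2-4x the total number of physical cores in the cluster
      .set("spark.default.parallelism", "96")
    val sc = new SparkContext(conf)

    // Or set the partition count explicitly when reading the data.
    val lines = sc.textFile("hdfs:///data/input", 96)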
of the input data for each task is also 1212.5MB
On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian lian.cs@gmail.com wrote:
Hey Xuelin, which data item in the Web UI did you check?
On 1/7/15 5:37 PM, Xuelin Cao wrote:
Hi,
Curious and curious. I'm puzzled by the Spark SQL cached table
multiple times to generate a larger file.
Cheng
On 1/8/15 7:43 PM, Xuelin Cao wrote:
Hi, Cheng
I checked the Input data for each stage. For example, in my
attached screen snapshot, the input data is 1212.5MB, which is the total
size of the whole table.
[image: Inline image 1]
Hi,
Curious and curious. I'm puzzled by the Spark SQL cached table.
Theoretically, the cached table should be a columnar table, and only the
columns included in my SQL should be scanned.
However, in my test, I always see the whole table being scanned even though
I only select one column in
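For what it's worth, a minimal sketch of the scenario being measured, assuming a registered table "tb" with a "budget" column (hypothetical names):

    // Cache the table in the in-memory columnar format, then select a single column.
    // If column pruning applies to the cached data, only the "budget" column bytes
    // should be read, not the full size of the table.
    sqlContext.cacheTable("tb")
    sqlContext.sql("SELECT budget FROM tb").count()   // run twice: the first run populates the cache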
https://issues.apache.org/jira/browse/SPARK-4258
You can turn it on if you want:
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
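A minimal sketch of turning it on from code (it is off by default in 1.2 because of the bug tracked in SPARK-4258):

    // Either through the configuration API...
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // ...or through a SQL statement.
    sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")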
Daniel
On 7 Jan 2015, at 08:18, Xuelin Cao xuelin...@yahoo.com.INVALID
wrote:
Hi,
I'm testing parquet file format
Hi,
Currently, we are building up a middle-scale Spark cluster (100 nodes)
in our company. One thing bothering us is how Spark manages the
resources (CPU, memory).
I know there are 3 resource management modes: standalone, Mesos, YARN.
In the standalone mode, the cluster
applications you want to be running besides Spark in the same
cluster and also your use cases, to see what resource management fits your
need.
Tim
On Wed, Jan 7, 2015 at 10:55 PM, Xuelin Cao xuelincao2...@gmail.com
wrote:
Hi,
Currently, we are building up a middle-scale Spark cluster (100
, 2015 at 11:19 PM, Xuelin Cao xuelincao2...@gmail.com
wrote:
Hi,
Thanks for the information.
One more thing I want to clarify: when do Mesos or YARN allocate
and release the resources? That is, what is the resource lifetime?
For example, in the standalone mode, the resource
Hi,
I'm testing the Parquet file format, and the predicate pushdown is a very
useful feature for us.
However, it looks like the predicate pushdown doesn't work after I set
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true"). Here
is my SQL:
Hi,
In the Spark SQL help document, it says "Some of these (such as indexes) are
less important due to Spark SQL’s in-memory computational model. Others are
slotted for future releases of Spark SQL."
- Block level bitmap indexes and virtual columns (used to build indexes)
For our
looked at partitioned table support? That would only scan data where
the predicate matches the partition. Depending on the cardinality of the
customerId column that could be a good option for you.
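A rough sketch of that suggestion, assuming a HiveContext and hypothetical table/column names; queries with a predicate on the partition column only read the matching partition's files:

    hiveContext.sql(
      """CREATE TABLE events_partitioned (ts BIGINT, payload STRING)
        |PARTITIONED BY (customerId INT)
        |STORED AS PARQUET""".stripMargin)

    // Only the customerId=42 partition directory is scanned for this query.
    hiveContext.sql("SELECT count(*) FROM events_partitioned WHERE customerId = 42")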
On Wed, Dec 17, 2014 at 2:25 AM, Xuelin Cao xuelin...@yahoo.com.invalid wrote:
Hi,
In Spark
Hi,
I tried to create a function that converts a Unix timestamp to the
hour of the day.
It works if the code is like this:
sqlContext.registerFunction("toHour", (x: Long) => new java.util.Date(x * 1000).getHours)
But, if I do it like this, it doesn't work:
def toHour
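Presumably the second form defines a named method; a minimal sketch of how such a method can be registered (an assumption about the intent), using the eta-expansion toHour _ to turn the method into the function value registerFunction expects:

    def toHour(x: Long): Int = new java.util.Date(x * 1000).getHours

    // Register the method as a UDF by converting it to a function value.
    sqlContext.registerFunction("toHour", toHour _)

    // Hypothetical usage; "logs" and "ts" are placeholder names.
    sqlContext.sql("SELECT toHour(ts) FROM logs")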
Hi,
I'm wondering whether there is an efficient way to continuously append
new data to a registered Spark SQL table.
This is what I want: I want to build an ad-hoc query service over a
JSON-formatted system log. Naturally, the system log is continuously generated.
I will use
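One rough approach with the 1.2-era API (an assumption on my part, not an official pattern): keep a running SchemaRDD, union each new batch of JSON into it, and re-register the temp table under the same name. Paths and the table name are hypothetical, and each batch is assumed to infer the same schema:

    // Initial batch of the JSON-formatted system log.
    var logTable = sqlContext.jsonFile("hdfs:///logs/batch-000.json")
    logTable.registerTempTable("syslog")

    def appendBatch(path: String): Unit = {
      val batch = sqlContext.jsonFile(path)
      logTable = logTable.unionAll(batch)
      // Re-register so that new ad-hoc queries see the appended data.
      logTable.registerTempTable("syslog")
    }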
Hi,
I'm generating a Spark SQL table from an offline JSON file.
The difficulty is that the original JSON file has a hierarchical
structure, and as a result, this is what I get:
scala> tb.printSchema
root
 |-- budget: double (nullable = true)
 |-- filterIp: array (nullable = true)
 |
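In case it is useful, a small sketch of one way to query into such a structure under a HiveContext (assuming tb was loaded through a HiveContext, e.g. hiveContext.jsonFile; nested struct fields are reachable with dotted names, and LATERAL VIEW explode unrolls array columns such as filterIp):

    tb.registerTempTable("tb")

    // One row per element of the filterIp array, alongside the top-level budget field.
    hiveContext.sql(
      "SELECT budget, ip FROM tb LATERAL VIEW explode(filterIp) ips AS ip")
      .take(5).foreach(println)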
Hi,
I'd like to perform an operation on an RDD that ONLY changes the values of
some items, without making a full copy or a full scan of the data.
It is useful when I need to handle a large RDD, and each time I only need
to change a small fraction of the data, keeping the other data
Hi,
I'm going to debug some Spark applications on our testing platform, and it
would be helpful if we could see the eventLog on the worker node.
I've tried to turn on spark.eventLog.enabled and set spark.eventLog.dir
parameters on the worker node. However, it doesn't work.
I do
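For what it's worth, a minimal sketch of where these settings usually live: they are application (driver-side) properties read when the SparkContext starts, so setting them on the worker alone would have no effect (my assumption about the cause). The directory path is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("event-log-example")
      .set("spark.eventLog.enabled", "true")
      // A shared location (e.g. HDFS) so the logs can be read after the app finishes.
      .set("spark.eventLog.dir", "hdfs:///spark-event-logs")
    val sc = new SparkContext(conf)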