[jira] [Updated] (HUDI-1551) Support Partition with BigDecimal/Integer field

2021-04-07 Thread Chanh Le (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chanh Le updated HUDI-1551:
---
Description: 
In my data the time indicator field is a BigDecimal/Integer; because it is 
trading-related data, the records need more precision than usual.

I would like to add support for partitioning on this field type in 
TimestampBasedKeyGenerator.

 

  was:
In my data the time indicator field is a BigDecimal; because it is 
trading-related data, the records need more precision than usual.

I would like to add support for partitioning on this field type in 
TimestampBasedKeyGenerator.

 


> Support Partition with BigDecimal/Integer field
> ---
>
> Key: HUDI-1551
> URL: https://issues.apache.org/jira/browse/HUDI-1551
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: newbie
>Reporter: Chanh Le
>Priority: Trivial
>
> In my data the time indicator field is a BigDecimal/Integer; because it is 
> trading-related data, the records need more precision than usual.
> I would like to add support for partitioning on this field type in 
> TimestampBasedKeyGenerator.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1551) Support Partition with BigDecimal/Integer field

2021-04-07 Thread Chanh Le (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chanh Le updated HUDI-1551:
---
Fix Version/s: (was: 0.7.0)

> Support Partition with BigDecimal/Integer field
> ---
>
> Key: HUDI-1551
> URL: https://issues.apache.org/jira/browse/HUDI-1551
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: newbie
>    Reporter: Chanh Le
>Priority: Trivial
>
> In my data the time indicator field is a BigDecimal; because it is 
> trading-related data, the records need more precision than usual.
> I would like to add support for partitioning on this field type in 
> TimestampBasedKeyGenerator.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1551) Support Partition with BigDecimal/Integer field

2021-04-07 Thread Chanh Le (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chanh Le updated HUDI-1551:
---
Summary: Support Partition with BigDecimal/Integer field  (was: Support 
Partition with BigDecimal field)

> Support Partition with BigDecimal/Integer field
> ---
>
> Key: HUDI-1551
> URL: https://issues.apache.org/jira/browse/HUDI-1551
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: newbie
>    Reporter: Chanh Le
>Priority: Trivial
> Fix For: 0.7.0
>
>
> In my data the time indicator field is a BigDecimal; because it is 
> trading-related data, the records need more precision than usual.
> I would like to add support for partitioning on this field type in 
> TimestampBasedKeyGenerator.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1551) Support Partition with BigDecimal field

2021-01-25 Thread Chanh Le (Jira)
Chanh Le created HUDI-1551:
--

 Summary: Support Partition with BigDecimal field
 Key: HUDI-1551
 URL: https://issues.apache.org/jira/browse/HUDI-1551
 Project: Apache Hudi
  Issue Type: New Feature
  Components: newbie
Reporter: Chanh Le
 Fix For: 0.7.0


In my data the time indicator field is a BigDecimal; because it is 
trading-related data, the records need more precision than usual.

I would like to add support for partitioning on this field type in 
TimestampBasedKeyGenerator.
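
For context, a rough sketch of how such a field would be wired through 
TimestampBasedKeyGenerator from Spark. The option keys follow the Hudi 
documentation of this era and the column names are made up, so treat all of it 
as an assumption rather than the final shape of this feature:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-ts-keygen-sketch").getOrCreate()
import spark.implicits._

// Hypothetical trading records whose time indicator is a numeric epoch value.
val trades = Seq((1L, 1611532800123L, 101.25), (2L, 1611619200456L, 99.80))
  .toDF("trade_id", "event_time", "price")

// Requires the hudi-spark bundle on the classpath.
trades.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "trades")
  .option("hoodie.datasource.write.recordkey.field", "trade_id")
  .option("hoodie.datasource.write.precombine.field", "event_time")
  .option("hoodie.datasource.write.partitionpath.field", "event_time")
  .option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.TimestampBasedKeyGenerator")
  // Today the generator expects Long/String input; accepting BigDecimal/Integer
  // is exactly what this issue asks for.
  .option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
  .option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd")
  .mode("append")
  .save("/tmp/hudi/trades")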

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: SPARK environment settings issue when deploying a custom distribution

2017-06-12 Thread Chanh Le
Just adding more information on how I build the custom distribution.
I clone the Spark repo, switch to branch-2.2, and then make the distribution as
follows.

λ ~/workspace/big_data/spark/ branch-2.2*
λ ~/workspace/big_data/spark/ ./dev/make-distribution.sh --name custom
--tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -Phive-thriftserver
-Pmesos -Pyarn



On Mon, Jun 12, 2017 at 6:14 PM Chanh Le <giaosu...@gmail.com> wrote:

> Hi everyone,
>
> Recently I discovered an issue when processing csv of spark. So I decided
> to fix it following this https://issues.apache.org/jira/browse/SPARK-21024 I
> built a custom distribution for internal uses. I built it in my local
> machine then upload the distribution to server.
>
> server's *~/.bashrc*
>
> # added by Anaconda2 4.3.1 installer
> export PATH="/opt/etl/anaconda/anaconda2/bin:$PATH"
> export SPARK_HOME="/opt/etl/spark-2.1.0-bin-hadoop2.7"
> export
> PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
>
> What I did on server was:
> export SPARK_HOME=/home/etladmin/spark-2.2.1-SNAPSHOT-bin-custom
>
> $SPARK_HOME/bin/spark-submit --version
> It print out version *2.1.1* which* is not* the version I built (2.2.1)
>
>
> I did set *SPARK_HOME* in my local machine (MACOS) for this distribution
> and it's working well, print out the version *2.2.1*
>
> I need the way to investigate the invisible environment variable.
>
> Do you have any suggestions?
> Thank in advance.
>
> Regards,
> Chanh
>
> --
> Regards,
> Chanh
>
-- 
Regards,
Chanh


SPARK environment settings issue when deploying a custom distribution

2017-06-12 Thread Chanh Le
Hi everyone,

Recently I discovered an issue when processing CSV in Spark, so I decided
to fix it following https://issues.apache.org/jira/browse/SPARK-21024 and
built a custom distribution for internal use. I built it on my local
machine and then uploaded the distribution to the server.

server's *~/.bashrc*

# added by Anaconda2 4.3.1 installer
export PATH="/opt/etl/anaconda/anaconda2/bin:$PATH"
export SPARK_HOME="/opt/etl/spark-2.1.0-bin-hadoop2.7"
export
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH

What I did on server was:
export SPARK_HOME=/home/etladmin/spark-2.2.1-SNAPSHOT-bin-custom

$SPARK_HOME/bin/spark-submit --version
It prints out version *2.1.1*, which *is not* the version I built (2.2.1).


I did set *SPARK_HOME* on my local machine (macOS) for this distribution
and it works well, printing out version *2.2.1*.

I need a way to investigate which hidden environment setting is being picked up.
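
A quick way to see what the chosen build actually reports about itself is to ask
it from spark-shell; this is a generic diagnostic sketch, nothing specific to this
setup:

// Run inside <custom distribution>/bin/spark-shell.
// Prints the version compiled into the jars, the SPARK_HOME the JVM sees, and
// where spark-core was actually loaded from, which separates a wrong SPARK_HOME
// from wrong jars on the classpath.
println(s"Spark version from jars: ${org.apache.spark.SPARK_VERSION}")
println(s"SPARK_HOME seen by JVM : ${sys.env.getOrElse("SPARK_HOME", "<unset>")}")
val coreJar = classOf[org.apache.spark.SparkConf]
  .getProtectionDomain.getCodeSource.getLocation
println(s"spark-core loaded from : $coreJar")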

Do you have any suggestions?
Thanks in advance.

Regards,
Chanh

-- 
Regards,
Chanh


Re: [CSV] If the number of columns of one row is bigger than maxColumns it stops the whole parsing process.

2017-06-09 Thread Chanh Le
Hi Takeshi,

Thank you very much.

Regards,
Chanh


On Thu, Jun 8, 2017 at 11:05 PM Takeshi Yamamuro <linguin@gmail.com>
wrote:

> I filed a jira about this issue:
> https://issues.apache.org/jira/browse/SPARK-21024
>
> On Thu, Jun 8, 2017 at 1:27 AM, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Can you recommend one?
>>
>> Thanks.
>>
>> On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> You can change the CSV parser library
>>>
>>> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>>
>>> I did add mode -> DROPMALFORMED but it still couldn't ignore it because
>>> the error raise from the CSV library that Spark are using.
>>>
>>>
>>> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> The CSV data source allows you to skip invalid lines - this should also
>>>> include lines that have more than maxColumns. Choose mode "DROPMALFORMED"
>>>>
>>>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>> Hi Takeshi, Jörn Franke,
>>>>
>>>> The problem is even I increase the maxColumns it still have some lines
>>>> have larger columns than the one I set and it will cost a lot of memory.
>>>> So I just wanna skip the line has larger columns than the maxColumns I
>>>> set.
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>>
>>>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin@gmail.com>
>>>> wrote:
>>>>
>>>>> Is it not enough to set `maxColumns` in CSV options?
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>>>
>>>>> // maropu
>>>>>
>>>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Spark CSV data source should be able
>>>>>>
>>>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>> I am using Spark 2.1.1 to read csv files and convert to avro files.
>>>>>> One problem that I am facing is if one row of csv file has more
>>>>>> columns than maxColumns (default is 20480). The process of parsing
>>>>>> was stop.
>>>>>>
>>>>>> Internal state when error was thrown: line=1, column=3, record=0,
>>>>>> charIndex=12
>>>>>> com.univocity.parsers.common.TextParsingException:
>>>>>> java.lang.ArrayIndexOutOfBoundsException - 2
>>>>>> Hint: Number of columns processed may have exceeded limit of 2
>>>>>> columns. Use settings.setMaxColumns(int) to define the maximum number of
>>>>>> columns your input can have
>>>>>> Ensure your configuration is correct, with delimiters, quotes and
>>>>>> escape sequences that match the input format you are trying to parse
>>>>>> Parser Configuration: CsvParserSettings:
>>>>>>
>>>>>>
>>>>>> I did some investigation in univocity
>>>>>> <https://github.com/uniVocity/univocity-parsers> library but the way
>>>>>> it handle is throw error that why spark stop the process.
>>>>>>
>>>>>> How to skip the invalid row and just continue to parse next valid one?
>>>>>> Any libs can replace univocity in that job?
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Chanh
>>>>>> --
>>>>>> Regards,
>>>>>> Chanh
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>> --
>>>> Regards,
>>>> Chanh
>>>>
>>>> --
>>> Regards,
>>> Chanh
>>>
>>> --
>> Regards,
>> Chanh
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
-- 
Regards,
Chanh


Re: [CSV] If the number of columns of one row is bigger than maxColumns it stops the whole parsing process.

2017-06-08 Thread Chanh Le
Can you recommend one?

Thanks.

On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke <jornfra...@gmail.com> wrote:

> You can change the CSV parser library
>
> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>
>
> I did add mode -> DROPMALFORMED but it still couldn't ignore it because
> the error raise from the CSV library that Spark are using.
>
>
> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:
>
>> The CSV data source allows you to skip invalid lines - this should also
>> include lines that have more than maxColumns. Choose mode "DROPMALFORMED"
>>
>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>
>> Hi Takeshi, Jörn Franke,
>>
>> The problem is even I increase the maxColumns it still have some lines
>> have larger columns than the one I set and it will cost a lot of memory.
>> So I just wanna skip the line has larger columns than the maxColumns I
>> set.
>>
>> Regards,
>> Chanh
>>
>>
>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Is it not enough to set `maxColumns` in CSV options?
>>>
>>>
>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>
>>> // maropu
>>>
>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Spark CSV data source should be able
>>>>
>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>> Hi everyone,
>>>> I am using Spark 2.1.1 to read csv files and convert to avro files.
>>>> One problem that I am facing is if one row of csv file has more columns
>>>> than maxColumns (default is 20480). The process of parsing was stop.
>>>>
>>>> Internal state when error was thrown: line=1, column=3, record=0,
>>>> charIndex=12
>>>> com.univocity.parsers.common.TextParsingException:
>>>> java.lang.ArrayIndexOutOfBoundsException - 2
>>>> Hint: Number of columns processed may have exceeded limit of 2 columns.
>>>> Use settings.setMaxColumns(int) to define the maximum number of columns
>>>> your input can have
>>>> Ensure your configuration is correct, with delimiters, quotes and
>>>> escape sequences that match the input format you are trying to parse
>>>> Parser Configuration: CsvParserSettings:
>>>>
>>>>
>>>> I did some investigation in univocity
>>>> <https://github.com/uniVocity/univocity-parsers> library but the way
>>>> it handle is throw error that why spark stop the process.
>>>>
>>>> How to skip the invalid row and just continue to parse next valid one?
>>>> Any libs can replace univocity in that job?
>>>>
>>>> Thanks & regards,
>>>> Chanh
>>>> --
>>>> Regards,
>>>> Chanh
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>> --
>> Regards,
>> Chanh
>>
>> --
> Regards,
> Chanh
>
> --
Regards,
Chanh


Re: [CSV] If the number of columns of one row is bigger than maxColumns it stops the whole parsing process.

2017-06-08 Thread Chanh Le
I did add mode -> DROPMALFORMED, but it still couldn't ignore it because the
error is raised from the CSV library that Spark is using.


On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:

> The CSV data source allows you to skip invalid lines - this should also
> include lines that have more than maxColumns. Choose mode "DROPMALFORMED"
>
> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>
> Hi Takeshi, Jörn Franke,
>
> The problem is even I increase the maxColumns it still have some lines
> have larger columns than the one I set and it will cost a lot of memory.
> So I just wanna skip the line has larger columns than the maxColumns I set.
>
> Regards,
> Chanh
>
>
> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Is it not enough to set `maxColumns` in CSV options?
>>
>>
>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>
>> // maropu
>>
>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Spark CSV data source should be able
>>>
>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>> I am using Spark 2.1.1 to read csv files and convert to avro files.
>>> One problem that I am facing is if one row of csv file has more columns
>>> than maxColumns (default is 20480). The process of parsing was stop.
>>>
>>> Internal state when error was thrown: line=1, column=3, record=0,
>>> charIndex=12
>>> com.univocity.parsers.common.TextParsingException:
>>> java.lang.ArrayIndexOutOfBoundsException - 2
>>> Hint: Number of columns processed may have exceeded limit of 2 columns.
>>> Use settings.setMaxColumns(int) to define the maximum number of columns
>>> your input can have
>>> Ensure your configuration is correct, with delimiters, quotes and escape
>>> sequences that match the input format you are trying to parse
>>> Parser Configuration: CsvParserSettings:
>>>
>>>
>>> I did some investigation in univocity
>>> <https://github.com/uniVocity/univocity-parsers> library but the way it
>>> handle is throw error that why spark stop the process.
>>>
>>> How to skip the invalid row and just continue to parse next valid one?
>>> Any libs can replace univocity in that job?
>>>
>>> Thanks & regards,
>>> Chanh
>>> --
>>> Regards,
>>> Chanh
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
> --
> Regards,
> Chanh
>
> --
Regards,
Chanh


Re: [CSV] If the number of columns of one row is bigger than maxColumns it stops the whole parsing process.

2017-06-07 Thread Chanh Le
Hi Takeshi, Jörn Franke,

The problem is that even if I increase maxColumns, some lines still have more
columns than the limit I set, and that will cost a lot of memory.
So I just want to skip any line that has more columns than the maxColumns I set.

Regards,
Chanh


On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Is it not enough to set `maxColumns` in CSV options?
>
>
> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>
> // maropu
>
> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Spark CSV data source should be able
>>
>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>
>> Hi everyone,
>> I am using Spark 2.1.1 to read csv files and convert to avro files.
>> One problem that I am facing is if one row of csv file has more columns
>> than maxColumns (default is 20480). The process of parsing was stop.
>>
>> Internal state when error was thrown: line=1, column=3, record=0,
>> charIndex=12
>> com.univocity.parsers.common.TextParsingException:
>> java.lang.ArrayIndexOutOfBoundsException - 2
>> Hint: Number of columns processed may have exceeded limit of 2 columns.
>> Use settings.setMaxColumns(int) to define the maximum number of columns
>> your input can have
>> Ensure your configuration is correct, with delimiters, quotes and escape
>> sequences that match the input format you are trying to parse
>> Parser Configuration: CsvParserSettings:
>>
>>
>> I did some investigation in univocity
>> <https://github.com/uniVocity/univocity-parsers> library but the way it
>> handle is throw error that why spark stop the process.
>>
>> How to skip the invalid row and just continue to parse next valid one?
>> Any libs can replace univocity in that job?
>>
>> Thanks & regards,
>> Chanh
>> --
>> Regards,
>> Chanh
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
-- 
Regards,
Chanh


[CSV] If the number of columns of one row is bigger than maxColumns it stops the whole parsing process.

2017-06-07 Thread Chanh Le
Hi everyone,
I am using Spark 2.1.1 to read CSV files and convert them to Avro files.
One problem I am facing is that if one row of a CSV file has more columns
than maxColumns (default is 20480), the whole parsing process stops.

Internal state when error was thrown: line=1, column=3, record=0,
charIndex=12
com.univocity.parsers.common.TextParsingException:
java.lang.ArrayIndexOutOfBoundsException - 2
Hint: Number of columns processed may have exceeded limit of 2 columns. Use
settings.setMaxColumns(int) to define the maximum number of columns your
input can have
Ensure your configuration is correct, with delimiters, quotes and escape
sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:


I did some investigation in the univocity
<https://github.com/uniVocity/univocity-parsers> library, but the way it
handles this is to throw an error, which is why Spark stops the process.

How can I skip the invalid row and just continue parsing the next valid one?
Are there any libraries that can replace univocity for this job?
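
For reference, here is a sketch of the reader options involved; maxColumns and
mode are documented Spark 2.x CSV options, and whether DROPMALFORMED actually
swallows the univocity error is exactly the open question in this thread (the
Avro package coordinates below are an assumption):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

// Raise the univocity column limit and ask Spark to drop rows it cannot parse.
// The catch: a row exceeding maxColumns makes univocity throw before
// DROPMALFORMED gets a chance to drop it (see SPARK-21024).
val df = spark.read
  .option("header", "true")
  .option("maxColumns", "40000")    // library default is 20480
  .option("mode", "DROPMALFORMED")  // drop malformed rows instead of failing
  .csv("/path/to/input/*.csv")

// Assumes spark-avro is on the classpath, e.g.
// --packages com.databricks:spark-avro_2.11:3.2.0
df.write.format("com.databricks.spark.avro").save("/path/to/output/avro")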

Thanks & regards,
Chanh
-- 
Regards,
Chanh


Re: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Chanh Le
Thank you YZ,
Now I understand why it causes high CPU usage on the driver side.

Thank you Ayan,
> First thing i would do is to add distinct, both inner and outer queries

I believe that would reduce the number of records to join.

Regards,
Chanh

Hi everyone,

I am working on a dataset like this
user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_id whose url does not contain "sell", which means I need 
to query all user_id that do contain "sell", put them into a set, and then run 
another query to find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id NOT IN (SELECT user_id FROM data WHERE url LIKE '%sell%');

My data is about 20 million records and it's growing. When I tried this in 
Zeppelin I needed to set spark.sql.crossJoin.enabled = true.
Then I ran the query, the driver reached extremely high CPU usage, the process 
got stuck, and I had to kill it.
I am running in client mode, submitting to a Mesos cluster.

I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.

Any advice for me in this situation?

Thank you in advance!

Regards,
Chanh





> On Feb 22, 2017, at 8:52 AM, Yong Zhang <java8...@hotmail.com> wrote:
> 
> If you read the source code of SparkStrategies
> 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106
>  
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106>
> 
> If there is no joining keys, Join implementations are chosen with the 
> following precedence:
> BroadcastNestedLoopJoin: if one side of the join could be broadcasted
> CartesianProduct: for Inner join
> BroadcastNestedLoopJoin
> 
> So your case will use BroadcastNestedLoopJoin, as there is no joining keys.
> 
> In this case, if there are lots of userId where url not like '%sell%', then 
> Spark has to retrieve them back to Driver (to be broadcast), that explains 
> why the high CPU usage on the driver side. 
> 
> So if there are lots of userId where url not like '%sell%', then you can just 
> try left semi join, which Spark will use SortMerge join in this case, I guess.
> 
> Yong
> 
> From: Yong Zhang <java8...@hotmail.com <mailto:java8...@hotmail.com>>
> Sent: Tuesday, February 21, 2017 1:17 PM
> To: Sidney Feiner; Chanh Le; user @spark
> Subject: Re: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Sorry, didn't pay attention to the originally requirement.
> 
> Did you try the left outer join, or left semi join?
> 
> What is the explain plan when you use "not in"? Is it leading to a 
> broadcastNestedLoopJoin?
> 
> spark.sql("select user_id from data where user_id not in (select user_id from 
> data where url like '%sell%')").explain(true)
> 
> Yong
> 
> 
> From: Sidney Feiner <sidney.fei...@startapp.com 
> <mailto:sidney.fei...@startapp.com>>
> Sent: Tuesday, February 21, 2017 10:46 AM
> To: Yong Zhang; Chanh Le; user @spark
> Subject: RE: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Chanh wants to return user_id's that don't have any record with a url 
> containing "sell". Without a subquery/join, it can only filter per record 
> without knowing about the rest of the user_id's record
>  
> Sidney Feiner   /  SW Developer
> M: +972.528197720  /  Skype: sidney.feiner.startapp
>  
>  <http://www.startapp.com/>
>  
> From: Yong Zhang [mailto:java8...@hotmail.com <mailto:java8...@hotmail.com>] 
> Sent: Tuesday, February 21, 2017 4:10 PM
> To: Chanh Le <giaosu...@gmail.com <mailto:giaosu...@gmail.com>>; user @spark 
> <user@spark.apache.org <mailto:user@spark.apache.org>>
> Subject: Re: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Not sure if I misunderstand your question, but what's wrong doing it this way?
>  
> scala> spark.version
> res6: String = 2.0.2
> scala> val df = Seq((1,"lao.com/sell <http://lao.com/sell>"), (2, 
> "lao.com/buy <http://lao.com/buy>")).toDF("user_id", "url")
> df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]
>  
> scala> df.registerTempTable("data")
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
>  
> scala> spark.sql(&q

Re: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Chanh Le
I tried a new way using a JOIN:

select user_id from data a
left join (select user_id from data where url like '%sell%') b
on a.user_id = b.user_id
where b.user_id is NULL

It's faster; it seems that Spark optimizes a JOIN better than a subquery.


Regards,
Chanh


> On Feb 21, 2017, at 4:56 PM, Chanh Le <giaosu...@gmail.com> wrote:
> 
> Hi everyone,
> 
> I am working on a dataset like this
> user_id url 
> 1  lao.com/buy <http://lao.com/buy>
> 2  bao.com/sell <http://bao.com/sell>
> 2  cao.com/market <http://cao.com/market>
> 1  lao.com/sell <http://lao.com/sell>
> 3  vui.com/sell <http://vui.com/sell>
> 
> I have to find all user_id with url not contain sell. Which means I need to 
> query all user_id contains sell and put it into a set then do another query 
> to find all user_id not in that set.
> SELECT user_id 
> FROM data
> WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;
> 
> My data is about 20 million records and it’s growing. When I tried in 
> zeppelin I need to set spark.sql.crossJoin.enabled = true
> Then I ran the query and the driver got extremely high CPU percentage and the 
> process get stuck and I need to kill it.
> I am running at client mode that submit to a Mesos cluster.
> 
> I am using Spark 2.0.2 and my data store in HDFS with parquet format.
> 
> Any advices for me in this situation?
> 
> Thank you in advance!.
> 
> Regards,
> Chanh



How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Chanh Le
Hi everyone,

I am working on a dataset like this
user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_id whose url does not contain "sell", which means I need 
to query all user_id that do contain "sell", put them into a set, and then run 
another query to find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id NOT IN (SELECT user_id FROM data WHERE url LIKE '%sell%');

My data is about 20 million records and it's growing. When I tried this in 
Zeppelin I needed to set spark.sql.crossJoin.enabled = true.
Then I ran the query, the driver reached extremely high CPU usage, the process 
got stuck, and I had to kill it.
I am running in client mode, submitting to a Mesos cluster.

I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.

Any advice for me in this situation?

Thank you in advance!

Regards,
Chanh

How to configure the ZooKeeper quorum in the sqlline command?

2017-02-15 Thread Chanh Le
Hi everybody,
I am a newbie who started using Phoenix a few days ago. After doing some research 
about configuring the ZooKeeper quorum and still being stuck, I finally want to 
ask the community directly.

My current ZK quorum is a little odd: "hbase.zookeeper.quorum" = 
"zoo1:2182,zoo1:2183,zoo2:2182"
I edited env.sh and added HBASE_PATH=/build/etl/hbase-1.2.4
So I tried to run sqlline with 
 ./sqlline.py zk://zoo1:2182,zoo1:2183,zoo2:2182/hbase 

 ./sqlline.py zoo1:2182,zoo1:2183,zoo2:2182:/hbase

Both not working.

So I tried ./queryserver.py start
and used sqlline-thin.py and got this error:
Caused by: java.sql.SQLException: ERROR 102 (08001): Malformed connection url. 
:zoo1:2182,zoo1:2183,zoo2:2182:2181:/hbase;
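
For comparison, here is a sketch of the URL shape the thick Phoenix JDBC driver
normally expects: a comma-separated host list sharing a single ZooKeeper client
port, plus the znode parent. The mixed-port quorum above is probably what trips
the parser, but treat that diagnosis and the trimmed-down URL below as
assumptions:

import java.sql.DriverManager

// Assumes the phoenix-client jar is on the classpath. URL layout:
// jdbc:phoenix:<comma-separated hosts>:<single ZK client port>:<znode parent>
val url = "jdbc:phoenix:zoo1,zoo2:2182:/hbase"
val conn = DriverManager.getConnection(url)
val rs = conn.createStatement().executeQuery(
  "SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 5")
while (rs.next()) println(rs.getString(1))
conn.close()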


Thank you in advance.
Chanh

How to set the classpath for a job submitted to a Mesos cluster

2016-12-13 Thread Chanh Le
Hi everyone,
I have a job that read segment data from druid then convert to csv.
When I run it in local mode it works fine.

/home/airflow/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --driver-memory 1g 
--master "local[4]" --files /home/airflow/spark-jobs/forecast_jobs/prod.conf 
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf --conf 
'spark.driver.extraJavaOptions=-Dconfig.file=/home/airflow/spark-jobs/forecast_jobs/prod.conf'
 --class com.ants.druid.spark.GenerateForecastData --driver-class-path 
/home/airflow/spark-jobs/forecast_jobs/classmate-0.5.4.jar:/home/airflow/spark-jobs/forecast_jobs/jboss-logging-3.3.0.Final.jar:/home/airflow/spark-jobs/forecast_jobs/hibernate-validator-5.1.3.Final.jar'
 --jars 
/home/airflow/spark-jobs/forecast_jobs/generate_forecast_data-assembly-1.0-deps.jar
 /home/airflow/spark-jobs/forecast_jobs/generate_forecast_data.jar 2016-12-01-02



But when I switch to submitting to Mesos I get this error:

16/12/13 15:08:45 INFO Guice: An exception was caught and reported. Message: 
javax.validation.ValidationException: Unable to create a Configuration, because 
no Bean Validation provider could be found. Add a provider like Hibernate 
Validator (RI) to your classpath.
javax.validation.ValidationException: Unable to create a Configuration, because 
no Bean Validation provider could be found. Add a provider like Hibernate 
Validator (RI) to your classpath.
at 
javax.validation.Validation$GenericBootstrapImpl.configure(Validation.java:271)
at 
javax.validation.Validation.buildDefaultValidatorFactory(Validation.java:110)
at io.druid.guice.ConfigModule.configure(ConfigModule.java:39)
at 
com.google.inject.spi.Elements$RecordingBinder.install(Elements.java:223)
at com.google.inject.spi.Elements.getElements(Elements.java:101)
at 
com.google.inject.internal.InjectorShell$Builder.build(InjectorShell.java:133)
at 
com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:103)
at com.google.inject.Guice.createInjector(Guice.java:95)
at com.google.inject.Guice.createInjector(Guice.java:72)
at 
io.druid.guice.GuiceInjectors.makeStartupInjector(GuiceInjectors.java:59)
at 
io.druid.indexer.HadoopDruidIndexerConfig.(HadoopDruidIndexerConfig.java:99)
at 
io.druid.indexer.hadoop.DatasourceInputSplit.readFields(DatasourceInputSplit.java:87)
at 
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
at 
org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
at 
org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:45)
at 
org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply(SerializableWritable.scala:41)
at 
org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply(SerializableWritable.scala:41)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
at 
org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I tried 
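
What the stack trace suggests, and this is an assumption rather than a verified
fix, is that the Hibernate Validator jars reach the driver via --driver-class-path
but never reach the Mesos executors, where the Druid input split is deserialized.
A sketch of pushing the same jars to the executor side:

import org.apache.spark.sql.SparkSession

// Paths mirror the spark-submit command above and must exist on every Mesos
// agent; otherwise ship the jars with --jars / spark.jars instead.
val validatorJars = Seq(
  "/home/airflow/spark-jobs/forecast_jobs/classmate-0.5.4.jar",
  "/home/airflow/spark-jobs/forecast_jobs/jboss-logging-3.3.0.Final.jar",
  "/home/airflow/spark-jobs/forecast_jobs/hibernate-validator-5.1.3.Final.jar"
).mkString(":")

val spark = SparkSession.builder()
  .appName("GenerateForecastData")
  // Executors deserialize DatasourceInputSplit, so they need the validator too;
  // the driver side is already covered by --driver-class-path on the submit.
  .config("spark.executor.extraClassPath", validatorJars)
  .getOrCreate()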

[jira] [Created] (ZEPPELIN-1723) Math formula support library path error

2016-11-28 Thread Chanh Le (JIRA)
Chanh Le created ZEPPELIN-1723:
--

 Summary: Math formula support library path error
 Key: ZEPPELIN-1723
 URL: https://issues.apache.org/jira/browse/ZEPPELIN-1723
 Project: Zeppelin
  Issue Type: Bug
  Components: front-end
Affects Versions: 0.7.0
Reporter: Chanh Le


I set ZEPPELIN_SERVER_CONTEXT_PATH to /zeppelin/

and this is what happens after I do that:
!https://camo.githubusercontent.com/586205cd96d380676754968157d0fe78fafdc78b/687474703a2f2f692e696d6775722e636f6d2f444531556769782e6a7067!

It works without setting ZEPPELIN_SERVER_CONTEXT_PATH.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Sharing RDDS across applications and users

2016-10-28 Thread Chanh Le
> Can you elaborate on how to implement "shared sparkcontext and fair 
> scheduling" option? 


It just reuses one SparkContext by not letting it stop when the application is 
done. You should check Livy and spark-jobserver.
FAIR scheduling (https://spark.apache.org/docs/1.2.0/job-scheduling.html) is just 
how you schedule your jobs within the pool, but FAIR helps you run jobs in 
parallel, versus FIFO (the default), which runs one job at a time.
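
A minimal sketch of what that looks like in code; these are standard Spark APIs
and the pool names are made up for the example:

import org.apache.spark.{SparkConf, SparkContext}

// One long-lived context with FAIR scheduling so concurrent jobs share the
// executors instead of queueing behind each other FIFO-style.
val conf = new SparkConf()
  .setAppName("shared-context")
  .set("spark.scheduler.mode", "FAIR")
  // Optionally: .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Each user or thread submits its jobs into its own pool.
sc.setLocalProperty("spark.scheduler.pool", "adhoc")
sc.parallelize(1 to 1000000).sum()

sc.setLocalProperty("spark.scheduler.pool", "etl")
sc.parallelize(1 to 1000000).count()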


> My approach was to use  sparkSession.getOrCreate() method and register temp 
> table in one application. However, I was not able to access this tempTable in 
> another application. 


Storing metadata in Hive may help, but I am not sure about this.
I use the Spark Thrift Server to create tables and then let Zeppelin query them.

Regards,
Chanh





> On Oct 27, 2016, at 9:01 PM, Victor Shafran <victor.shaf...@equalum.io> wrote:
> 
> Hi Vincent,
> Can you elaborate on how to implement "shared sparkcontext and fair 
> scheduling" option? 
> 
> My approach was to use  sparkSession.getOrCreate() method and register temp 
> table in one application. However, I was not able to access this tempTable in 
> another application. 
> You help is highly appreciated 
> Victor
> 
> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang <gene.p...@gmail.com 
> <mailto:gene.p...@gmail.com>> wrote:
> Hi Mich,
> 
> Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames 
> among different applications and contexts. The data typically stays in 
> memory, but with Alluxio's tiered storage, the "colder" data can be evicted 
> out to other medium, like SSDs and HDDs. Here is a blog post discussing Spark 
> RDDs and Alluxio: 
> https://www.alluxio.com/blog/effective-spark-rdds-with-alluxio 
> <https://www.alluxio.com/blog/effective-spark-rdds-with-alluxio>
> 
> Also, Alluxio also has the concept of an "Under filesystem", which can help 
> you access your existing data across different storage systems. Here is more 
> information about the unified namespace abilities: 
> http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html 
> <http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html>
> 
> Hope that helps,
> Gene
> 
> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> Thanks Chanh,
> 
> Can it share RDDs.
> 
> Personally I have not used either Alluxio or Ignite.
> 
> Are there major differences between these two
> Have you tried Alluxio for sharing Spark RDDs and if so do you have any 
> experience you can kindly share
> Regards
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Mich,
> Alluxio is the good option to go. 
> 
> Regards,
> Chanh
> 
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> 
>> There was a mention of using Zeppelin to share RDDs with many users. From 
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure 
>> how easy it is going to be changing the result set with different users 
>> modifying say sql queries.
>> 
>> There is also the idea of caching RDDs with something like Apache Ignite. 
>> Has anyone really tried this. Will that work with multiple applications?
>> 
>> It looks feasible as RDDs are immutable and so are registered tempTables etc.
>> 
>> Thanks
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
> 
> 
> 
> 
> 
> 
> -- 
> Victor Shafran
> 
> VP R| Equalum
> 
> 
> Mobile: +972-523854883 <tel:%2B972-523854883> | Email: 
> victor.shaf...@equalum.io <mailto:victor.shaf...@equalum.io>


Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich,
I have only tried Alluxio, so I can't give you a comparison.
In my experience, I use Alluxio for big data sets (50 GB - 100 GB) that are the 
input of pipeline jobs, so you can reuse the result of a previous job.
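
A sketch of that hand-off; the host name and paths are placeholders and 19998 is
just the stock Alluxio master port:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alluxio-handoff").getOrCreate()

// Job 1 writes its result through Alluxio so later jobs can reuse it without
// recomputation; this needs the Alluxio client jar on the Spark classpath.
val result = spark.read.parquet("hdfs:///warehouse/events")
  .filter("event_date >= '2016-10-01'")
result.write.mode("overwrite")
  .parquet("alluxio://alluxio-master:19998/shared/events_oct")

// Job 2 (a different application) picks the shared data up from Alluxio.
val reused = spark.read.parquet("alluxio://alluxio-master:19998/shared/events_oct")
println(reused.count())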


> On Oct 27, 2016, at 5:39 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Thanks Chanh,
> 
> Can it share RDDs.
> 
> Personally I have not used either Alluxio or Ignite.
> 
> Are there major differences between these two
> Have you tried Alluxio for sharing Spark RDDs and if so do you have any 
> experience you can kindly share
> Regards
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 27 October 2016 at 11:29, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Mich,
> Alluxio is the good option to go. 
> 
> Regards,
> Chanh
> 
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> 
>> There was a mention of using Zeppelin to share RDDs with many users. From 
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure 
>> how easy it is going to be changing the result set with different users 
>> modifying say sql queries.
>> 
>> There is also the idea of caching RDDs with something like Apache Ignite. 
>> Has anyone really tried this. Will that work with multiple applications?
>> 
>> It looks feasible as RDDs are immutable and so are registered tempTables etc.
>> 
>> Thanks
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
> 
> 



Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich,
Alluxio is a good option to go with. 

Regards,
Chanh

> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh  
> wrote:
> 
> 
> There was a mention of using Zeppelin to share RDDs with many users. From the 
> notes on Zeppelin it appears that this is sharing UI and I am not sure how 
> easy it is going to be changing the result set with different users modifying 
> say sql queries.
> 
> There is also the idea of caching RDDs with something like Apache Ignite. Has 
> anyone really tried this. Will that work with multiple applications?
> 
> It looks feasible as RDDs are immutable and so are registered tempTables etc.
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  



Re: How to make Mesos Cluster Dispatcher of Spark 1.6.1 load my config files?

2016-10-19 Thread Chanh Le
Thank you Daniel,
Actually I tried this before, but this way is still not flexible if you are 
running multiple jobs at a time that may have different dependencies in each 
job's configuration, so I gave up.

Another simple solution is to set up the command below as a service, and that is 
what I am using.

> /build/analytics/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
> --files /build/analytics/kafkajobs/prod.conf \
> --conf 'spark.executor.extraJavaOptions=-Dconfig.fuction.conf' \
> --conf 
> 'spark.driver.extraJavaOptions=-Dconfig.file=/build/analytics/kafkajobs/prod.conf'
>  \
> --conf 
> 'spark.driver.extraClassPath=/build/analytics/spark-1.6.1-bin-hadoop2.6/lib/postgresql-9.3-1102.jdbc41.jar'
>  \
> --conf 
> 'spark.executor.extraClassPath=/build/analytics/spark-1.6.1-bin-hadoop2.6/lib/postgresql-9.3-1102.jdbc41.jar'
>  \
> --class com.ants.util.kafka.PersistenceData \
> --master mesos://10.199.0.19:5050 <> \
> --executor-memory 5G \
> --driver-memory 2G \
> --total-executor-cores 4 \
> --jars /build/analytics/kafkajobs/spark-streaming-kafka_2.10-1.6.2.jar \
> /build/analytics/kafkajobs/kafkajobs-prod.jar


[Unit]
Description=Mesos Cluster Dispatcher

[Service]
ExecStart=/build/analytics/kafkajobs/persist-job.sh
PIDFile=/var/run/spark-persist.pid
[Install]
WantedBy=multi-user.target


Regards,
Chanh

> On Oct 19, 2016, at 2:15 PM, Daniel Carroza <dcarr...@stratio.com> wrote:
> 
> Hi Chanh,
> 
> I found a workaround that works to me:
> http://stackoverflow.com/questions/29552799/spark-unable-to-find-jdbc-driver/40114125#40114125
>  
> <http://stackoverflow.com/questions/29552799/spark-unable-to-find-jdbc-driver/40114125#40114125>
> 
> Regards,
> Daniel
> 
> El jue., 6 oct. 2016 a las 6:26, Chanh Le (<giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>>) escribió:
> Hi everyone,
> I have the same config in both mode and I really want to change config 
> whenever I run so I created a config file and run my application with it.
> My problem is: 
> It’s works with these config without using Mesos Cluster Dispatcher.
> 
> /build/analytics/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
> --files /build/analytics/kafkajobs/prod.conf \
> --conf 'spark.executor.extraJavaOptions=-Dconfig.fuction.conf' \
> --conf 
> 'spark.driver.extraJavaOptions=-Dconfig.file=/build/analytics/kafkajobs/prod.conf'
>  \
> --conf 
> 'spark.driver.extraClassPath=/build/analytics/spark-1.6.1-bin-hadoop2.6/lib/postgresql-9.3-1102.jdbc41.jar'
>  \
> --conf 
> 'spark.executor.extraClassPath=/build/analytics/spark-1.6.1-bin-hadoop2.6/lib/postgresql-9.3-1102.jdbc41.jar'
>  \
> --class com.ants.util.kafka.PersistenceData \
> --master mesos://10.199.0.19:5050 <> \
> --executor-memory 5G \
> --driver-memory 2G \
> --total-executor-cores 4 \
> --jars /build/analytics/kafkajobs/spark-streaming-kafka_2.10-1.6.2.jar \
> /build/analytics/kafkajobs/kafkajobs-prod.jar
> 
> 
> And it’s didn't work with these:
> 
> /build/analytics/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
> --files /build/analytics/kafkajobs/prod.conf \
> --conf 'spark.executor.extraJavaOptions=-Dconfig.fuction.conf' \
> --conf 
> 'spark.driver.extraJavaOptions=-Dconfig.file=/build/analytics/kafkajobs/prod.conf'
>  \
> --conf 
> 'spark.driver.extraClassPath=/build/analytics/spark-1.6.1-bin-hadoop2.6/lib/postgresql-9.3-1102.jdbc41.jar'
>  \
> --conf 
> 'spark.executor.extraClassPath=/build/analytics/spark-1.6.1-bin-hadoop2.6/lib/postgresql-9.3-1102.jdbc41.jar'
>  \
> --class com.ants.util.kafka.PersistenceData \
> --master mesos://10.199.0.19:7077 <> \
> --deploy-mode cluster \
> --supervise \
> --executor-memory 5G \
> --driver-memory 2G \
> --total-executor-cores 4 \
> --jars /build/analytics/kafkajobs/spark-streaming-kafka_2.10-1.6.2.jar \
> /build/analytics/kafkajobs/kafkajobs-prod.jar
> 
> It threw me an error: Exception in thread "main" java.sql.SQLException: No 
> suitable driver found for jdbc:postgresql://psqlhost:5432/kafkajobs <>
> which means my —conf didn’t work and those config I put in 
> /build/analytics/kafkajobs/prod.conf wasn’t loaded. It only loaded thing I 
> put in application.conf (default config).
> 
> How to make MCD load my config?
> 
> Regards,
> Chanh
> 
> -- 
> Daniel Carroza Santana
> 
> Vía de las Dos Castillas, 33, Ática 4, 3ª Planta.
> 28224 Pozuelo de Alarcón. Madrid.
> Tel: +34 91 828 64 73 <> // @stratiobd <https://twitter.com/StratioBD>


Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Chanh Le
The difference between streaming and micro-batch is about the ordering of messages:
> Spark Streaming guarantees ordered processing of RDDs in one DStream. Since 
> each RDD is processed in parallel, there is not order guaranteed within the 
> RDD. This is a tradeoff design Spark made. If you want to process the 
> messages in order within the RDD, you have to process them in one thread, 
> which does not have the benefit of parallelism.

More about that 
http://samza.apache.org/learn/documentation/0.10/comparisons/spark-streaming.html
 






> On Sep 27, 2016, at 2:12 PM, kant kodali  wrote:
> 
> What is the difference between mini-batch vs real time streaming in practice 
> (not theory)? In theory, I understand mini batch is something that batches in 
> the given time frame whereas real time streaming is more like do something as 
> the data arrives but my biggest question is why not have mini batch with 
> epsilon time frame (say one millisecond) or I would like to understand reason 
> why one would be an effective solution than other?
> I recently came across one example where mini-batch (Apache Spark) is used 
> for Fraud detection and real time streaming (Apache Flink) used for Fraud 
> Prevention. Someone also commented saying mini-batches would not be an 
> effective solution for fraud prevention (since the goal is to prevent the 
> transaction from occurring as it happened) Now I wonder why this wouldn't be 
> so effective with mini batch (Spark) ? Why is it not effective to run 
> mini-batch with 1 millisecond latency? Batching is a technique used 
> everywhere including the OS and the Kernel TCP/IP stack where the data to the 
> disk or network are indeed buffered so what is the convincing factor here to 
> say one is more effective than other?
> Thanks,
> kant
> 



Re: Using Zeppelin with Spark FP

2016-09-15 Thread Chanh Le
Yes, I used to experience that issue; even when I set a 1-hour cron for a job to 
update the Spark cache, it crashed.
I haven't checked the recent release, but I think the Zeppelin cron is not stable.





> On Sep 15, 2016, at 2:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Thanks Chanh,
> 
> I noticed one thing. If you put on a cron refresh say every 30 seconds after 
> a whilt the job crashes with OOM error.
> 
> Then I stop and restart Zeppelin daemon and it works again!
> 
> Have you come across it?
> 
> cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 15 September 2016 at 08:27, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi,
> I am using Zeppelin 0.7 snapshot and it works well both Spark 2.0 and STS of 
> Spark 2.0.
> 
> 
> 
> 
>> On Sep 12, 2016, at 4:38 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Hi Sachin,
>> 
>> Downloaded Zeppelin 0.6.1
>> 
>> Now I can see the plot in a tabular format and graph. it looks good. Many 
>> thanks
>> 
>> 
>> 
>> And a plot
>> 
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 12 September 2016 at 08:24, Sachin Janani <sjan...@snappydata.io 
>> <mailto:sjan...@snappydata.io>> wrote:
>> Zeppelin imports ZeppelinContext object in spark interpreter using which you 
>> can plot dataframe,dataset and even rdd.To do so you just need to use 
>> "z.show(df)" in the paragraph (here df is the Dataframe which you want to 
>> plot)
>> 
>> 
>> Regards,
>> Sachin Janani
>> 
>> On Mon, Sep 12, 2016 at 11:20 AM, andy petrella <andy.petre...@gmail.com 
>> <mailto:andy.petre...@gmail.com>> wrote:
>> Heya, probably worth giving the Spark Notebook 
>> <https://github.com/andypetrella/spark-notebook/> a go then.
>> It can plot any scala data (collection, rdd, df, ds, custom, ...), all are 
>> reactive so they can deal with any sort of incoming data. You can ask on the 
>> gitter <https://gitter.im/andypetrella/spark-notebook> if you like.
>> 
>> hth
>> cheers
>> 
>> On Sun, Sep 11, 2016 at 11:12 PM Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> Hi,
>> 
>> Zeppelin is getting better.
>> 
>> In its description it says:
>> 
>> 
>> 
>> So far so good. One feature that I have not managed to work out is creating 
>> plots with Spark functional programming. I can get SQL going by connecting 
>> to Spark thrift server and you can plot the results
>> 
>> 
>> 
>> However, if I wrote that using functional programming I won't be able to 
>> plot it. the plot feature is not available.
>> 
>> Is this correct or I am missing something?
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> -- 
>> andy
>> 
>> 
> 
> 



Re: Using Zeppelin with Spark FP

2016-09-15 Thread Chanh Le
Hi,
I am using a Zeppelin 0.7 snapshot and it works well with both Spark 2.0 and the 
STS of Spark 2.0.




> On Sep 12, 2016, at 4:38 PM, Mich Talebzadeh  
> wrote:
> 
> Hi Sachin,
> 
> Downloaded Zeppelin 0.6.1
> 
> Now I can see the plot in a tabular format and graph. it looks good. Many 
> thanks
> 
> 
> 
> And a plot
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 12 September 2016 at 08:24, Sachin Janani  > wrote:
> Zeppelin imports ZeppelinContext object in spark interpreter using which you 
> can plot dataframe,dataset and even rdd.To do so you just need to use 
> "z.show(df)" in the paragraph (here df is the Dataframe which you want to 
> plot)
> 
> 
> Regards,
> Sachin Janani
> 
> On Mon, Sep 12, 2016 at 11:20 AM, andy petrella  > wrote:
> Heya, probably worth giving the Spark Notebook 
>  a go then.
> It can plot any scala data (collection, rdd, df, ds, custom, ...), all are 
> reactive so they can deal with any sort of incoming data. You can ask on the 
> gitter  if you like.
> 
> hth
> cheers
> 
> On Sun, Sep 11, 2016 at 11:12 PM Mich Talebzadeh  > wrote:
> Hi,
> 
> Zeppelin is getting better.
> 
> In its description it says:
> 
> 
> 
> So far so good. One feature that I have not managed to work out is creating 
> plots with Spark functional programming. I can get SQL going by connecting to 
> Spark thrift server and you can plot the results
> 
> 
> 
> However, if I wrote that using functional programming I won't be able to plot 
> it. the plot feature is not available.
> 
> Is this correct or I am missing something?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> -- 
> andy
> 
> 



Re: Zeppelin patterns with the streaming data

2016-09-13 Thread Chanh Le
Hi Mich,
I think it can; Zeppelin's cron uses Quartz cron expressions, which include a 
seconds field:
http://www.quartz-scheduler.org/documentation/quartz-2.1.x/tutorials/crontrigger
 





> On Sep 13, 2016, at 1:57 PM, Mich Talebzadeh  
> wrote:
> 
> Thanks Sachin.
> 
> The cron gives the granularity of 1 min. On normal one can use wait 10 and 
> loop in the cron to run the job every 10 seconds. I am not sure that is 
> possible with Zeppelin?
> 
> cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 13 September 2016 at 05:05, Sachin Janani  > wrote:
> Good to see that you are enjoying zeppelin.You can schedule the paragraph 
> running after every x seconds.You can find this option on top of the notebook 
> just beside delete notebook button.
> 
> 
> Regards,
> Sachin Janani
> http://snappydatainc.github.io/snappydata/ 
> 
> 
> 
> On Tue, Sep 13, 2016 at 3:13 AM, Mich Talebzadeh  > wrote:
> The latest version of Zeppelin 0.6.1 looks pretty impressive with Spark 2 and 
> also with Spark Thrift server (it runs on Hive Thrift server) and uses Hive 
> execution engine. Make sure that you do not use MapReduce as Hive's execution 
> engine.
> 
> Now for streaming data (in this case some test data generated via a Kafka 
> topic), I store it as text files on HDFS. To my surprise, text files (all 
> created every two seconds under an HDFS directory, with a timestamp 
> added to the file name) seem to be pretty fast. I am sceptical about what benefit 
> one gets if I decide to store them as HBase files instead. Can anyone shed some 
> light on it?
> 
> Also, I use Zeppelin to do some plots on the stored security prices (the data 
> belonging to these securities is all fictitious and randomly generated). Now, I 
> don't think there is any way one can automate Zeppelin to run the same code, say, 
> every x seconds? Thinking out loud, the other alternative is to add new data as it 
> comes in and tail off the old data? Has anyone done any work of this type?
> 
> Anyway, I show a sample graph below. I appreciate any ideas. My view is that 
> Zeppelin is not designed for a real-time dashboard, but it is looking increasingly 
> good, and maybe with some changes one can use it near real time?
> 
>   
> 
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 



Re: Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-06 Thread Chanh Le
Did anyone use the STS of Spark 2.0 in production?
For me, I am still waiting for compatibility with parquet files created by Spark 
1.6.1.


> On Sep 6, 2016, at 2:46 PM, Campagnola, Francesco 
>  wrote:
> 
> I mean I have installed Spark 2.0 in the same environment where Spark 1.6 
> thrift server was running,
> then stopped the Spark 1.6 thrift server and started the Spark 2.0 one.
>  
> If I’m not mistaken, Spark 2.0 should be still compatible with Hive 1.2.1 and 
> no upgrade procedures are required.
> The spark-defaults.conf file has not been changed.
>  
> The following commands issued to the Spark 2.0 thrift server work:
> create database test;
> use test;
> create table tb_1 (id int);
> insert into table tb_1 select t.id from (select 1 as id) t;
>  
> While all of these commands return the same error:
> show databases;
> show tables;
> show partitions tb_1;
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 62.0 failed 10 times, most recent failure: Lost task 0.9 in 
> stage 62.0 (TID 540, vertica204): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>  
>  
>  
>  
> From: Jeff Zhang [mailto:zjf...@gmail.com ] 
> Sent: martedì 6 settembre 2016 02:50
> To: Campagnola, Francesco  >
> Cc: user@spark.apache.org 
> Subject: Re: Spark 2.0.0 Thrift Server problem with Hive metastore
>  
> How do you upgrade to spark 2.0 ? 
>  
> On Mon, Sep 5, 2016 at 11:25 PM, Campagnola, Francesco 
> > 
> wrote:
> Hi,
>  
> in an already working Spark - Hive environment with Spark 1.6 and Hive 1.2.1, 
> with Hive metastore configured on Postgres DB, I have upgraded Spark to the 
> 2.0.0.
>  
> I have started the thrift server on YARN, then tried to execute from the 
> beeline cli or a jdbc client the following command:
> SHOW DATABASES;
> It always gives this error on Spark server side:
>  
> spark@spark-test[spark] /home/spark> beeline -u 
> jdbc:hive2://$(hostname):1 -n spark
>  
> Connecting to jdbc:hive2://spark-test:1
> 16/09/05 17:41:43 INFO jdbc.Utils: Supplied authorities: spark-test:1
> 16/09/05 17:41:43 INFO jdbc.Utils: Resolved authority: spark-test:1
> 16/09/05 17:41:43 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2:// spark-test:1
> Connected to: Spark SQL (version 2.0.0)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1.spark2 by Apache Hive
>  
> 0: jdbc:hive2:// spark-test:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> at java.lang.Throwable.initCause(Throwable.java:457)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
> at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
> at 
> org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108)
> at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
> at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
> at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42)
> at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
> at org.apache.hive.beeline.Commands.execute(Commands.java:860)
> at org.apache.hive.beeline.Commands.sql(Commands.java:713)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
> at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 3.0 failed 10 times, most recent failure: Lost task 0.9 in 
> stage 3.0 (TID 12, vertica204): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>  

Re: Design patterns involving Spark

2016-08-29 Thread Chanh Le
Hi everyone,

It seems a lot of people are using Druid for real-time dashboards.
I'm just wondering about using Druid as the main storage engine, because Druid can 
store the raw data and can (theoretically) integrate with Spark as well.
In that case, do we need two separate storage layers: Druid (which stores its segments 
in HDFS) and HDFS itself?
BTW, did anyone try this one: https://github.com/SparklineData/spark-druid-olap 
?


Regards,
Chanh


> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh  
> wrote:
> 
> Thanks Bhaarat and everyone.
> 
> This is an updated version of the same diagram
> 
> 
> ​​​
> The frequency of recent data is defined by the window length in Spark 
> Streaming. It can vary between 0.5 seconds and an hour (I don't think we can 
> push Spark's granularity below 0.5 seconds in anger) for applications 
> like credit card transactions and fraud detection. Data is stored in real time 
> by Spark in HBase tables. HBase tables will be on HDFS as well. The same 
> Spark Streaming job will write asynchronously to HDFS Hive tables.
> One school of thought is to never write to Hive from Spark: write straight to 
> HBase and then read the HBase tables into Hive periodically?
> 
> Now the third component in this layer is the serving layer, which can combine data 
> from the current store (HBase) and the historical store (Hive tables) to give the 
> user visual analytics. That visual analytics can be a real-time dashboard on top 
> of the serving layer. The serving layer could be an in-memory NoSQL offering, or 
> data from HBase (red box) combined with Hive tables.
> 
> I am not aware of any industrial-strength real-time dashboard. The idea is 
> that one uses such a dashboard in real time: in this sense a dashboard is a 
> general-purpose API to a data store of some type, like the serving layer, to 
> provide visual analytics in real time on demand, combining real-time data and 
> aggregate views. As usual the devil is in the detail.
> 
> 
> 
> Let me know your thoughts. Anyway this is first cut pattern.
> 
> ​​
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 29 August 2016 at 18:53, Bhaarat Sharma  > wrote:
> Hi Mich
> 
> This is really helpful. I'm trying to wrap my head around the last diagram 
> you shared (the one with kafka). In this diagram spark streaming is pushing 
> data to HDFS and NoSql. However, I'm confused by the "Real Time Queries, 
> Dashboards" annotation. Based on this diagram, will real time queries be 
> running on Spark or HBase?
> 
> PS: My intention was not to steer the conversation away from what Ashok asked 
> but I found the diagrams shared by Mich very insightful. 
> 
> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh  > wrote:
> Hi,
> 
> In terms of positioning, Spark is really the first Big Data platform to 
> integrate batch, streaming and interactive computations in a unified 
> framework. What this boils down to is the fact that whichever way one looks at 
> it, there is somewhere that Spark can make a contribution. In general, 
> there are a few design patterns common to Big Data.
>  
> ETL & Batch
> The first one is the most common one, with established tools like Sqoop and 
> Talend for ETL and HDFS for storage of some kind. Spark can be used as the 
> execution engine for Hive at the storage level, which actually makes it a true 
> vendor-independent processing engine (BTW, Impala, Tez and LLAP are offered by 
> vendors). Personally I use Spark at the ETL layer by extracting 
> data from sources through plug-ins (JDBC and others) and storing it on HDFS 
> in some form.
>  
> Batch, real time plus Analytics
> In this pattern you have data coming in real time and you want to query it in 
> real time through a real-time dashboard. HDFS is not ideal for updating data in 
> real time, nor for random access of data. The source could be all sorts of 
> web servers, and you would need a Flume agent. At the storage layer we are 
> probably looking at something like HBase. The crucial point is that saved 
> data needs to be ready for queries immediately. The dashboards require HBase 
> APIs. The analytics can be done through Hive, again running on the Spark engine. 
> Again, note here that ideally we should process batch and real time 
> separately.
>  
> Real time / 

Re: Best practises to storing data in Parquet files

2016-08-28 Thread Chanh Le
> Does parquet file has limit in size ( 1TB ) ? 
I didn't see any problem, but 1 TB is too big to operate on; you need to divide it 
into smaller pieces.
> Should we use SaveMode.APPEND for long running streaming app ?
Yes, but you need to partition it by time so it is easy to maintain, e.g. to update 
or delete a specific time range without rebuilding everything (see the sketch below).
> How should we store in HDFS (directory structure, ... )?
You should partition the data into small, time-based pieces.
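A minimal Spark 2.0-style sketch of that layout; the SparkSession setup, the input path 
and the "time" column are illustrative assumptions, not taken from Kevin's job:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.to_date

val spark = SparkSession.builder().appName("parquet-layout-sketch").getOrCreate()
val batch = spark.read.json("hdfs:///landing/events/latest")   // hypothetical input with a "time" column

batch
  .withColumn("dt", to_date(batch("time")))    // derive a day-level partition key
  .write
  .mode(SaveMode.Append)                       // a long-running job keeps appending new batches
  .partitionBy("dt")                           // .../dt=2016-08-28/part-*.parquet
  .parquet("hdfs:///warehouse/events")         // one directory per day; update or drop a day independently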


> On Aug 28, 2016, at 9:43 PM, Kevin Tran  wrote:
> 
> Hi,
> Does anyone know what is the best practises to store data to parquet file?
> Does parquet file has limit in size ( 1TB ) ? 
> Should we use SaveMode.APPEND for long running streaming app ?
> How should we store in HDFS (directory structure, ... )?
> 
> Thanks,
> Kevin.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-16 Thread Chanh Le
Hi Michael,

You should use Alluxio instead.
http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html 

It should be easier.
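A minimal sketch of what that looks like, assuming the Alluxio client jar is on the 
classpath (as in the Thrift Server start script elsewhere in this archive) and an 
Alluxio master at master2:19998; the dataset and path below are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alluxio-cache-sketch").getOrCreate()
val df = spark.range(0, 1000).toDF("id")   // stand-in for the real dataset

// Write the dataset once to Alluxio, which keeps it in memory outside the executors...
df.write.parquet("alluxio://master2:19998/cache/my_dataset")

// ...so any job can re-read it, and an executor crash does not lose the cached copy.
val cached = spark.read.parquet("alluxio://master2:19998/cache/my_dataset")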


Regards,
Chanh



> On Aug 17, 2016, at 5:45 AM, Michael Allman  wrote:
> 
> Hello,
> 
> A coworker was having a problem with a big Spark job failing after several 
> hours when one of the executors would segfault. That problem aside, I 
> speculated that her job would be more robust against these kinds of executor 
> crashes if she used replicated RDD storage. She's using off heap storage (for 
> good reason), so I asked her to try running her job with the following 
> storage level: `StorageLevel(useDisk = true, useMemory = true, useOffHeap = 
> true, deserialized = false, replication = 2)`. The job would immediately fail 
> with a rather suspicious looking exception. For example:
> 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 
> or
> 
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> 

Re: Does Spark SQL support indexes?

2016-08-13 Thread Chanh Le
Hi Taotao,

Spark SQL doesn't support indexes :).
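A common workaround for the "small piece of a very large table" case is to lay the data 
out so Spark can skip most of it: partition the files on the column you filter by, and 
the parquet reader will prune partitions and push the remaining predicates down. A 
minimal sketch, with illustrative paths and a hypothetical filter column event_date:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-pruning-sketch").getOrCreate()

// One-time layout: write the large table partitioned by the column used in filters.
spark.read.parquet("hdfs:///raw/big_table")
  .write.partitionBy("event_date")
  .parquet("hdfs:///warehouse/big_table")

// Queries filtering on event_date now read only the matching directories
// (partition pruning) instead of scanning the whole table.
val slice = spark.read.parquet("hdfs:///warehouse/big_table")
  .where("event_date = '2016-08-14'")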




> On Aug 14, 2016, at 10:03 AM, Taotao.Li  wrote:
> 
> 
> hi, guys, does Spark SQL support indexes?  if so, how can I create an index 
> on my temp table? if not, how can I handle some specific queries on a very 
> large table? it would iterate all the table even though all I want is just a 
> small piece of that table.
> 
> great thanks, 
> 
> 
> ___
> Quant | Engineer | Boy
> ___
> blog:http://litaotao.github.io 
> 
> github: www.github.com/litaotao 
> 
> 



Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-08-10 Thread Chanh Le
Hi Gene,
It's a Spark 2.0 issue.
I switched to Spark 1.6.1 and it's OK now.

Thanks.

On Thursday, July 28, 2016 at 4:25:48 PM UTC+7, Chanh Le wrote:
>
> Hi everyone,
>
> I have a problem when I create an external table in Spark Thrift Server (STS) 
> and query the data.
>
> Scenario:
> *Spark 2.0*
> *Alluxio 1.2.0 *
> *Zeppelin 0.7.0*
> STS start script 
> */home/spark/spark-2.0.0-bin-hadoop2.6/sbin/start-thriftserver.sh --master 
> mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf 
> spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars 
> /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar
>  
> --total-executor-cores 35 spark-internal --hiveconf 
> hive.server2.thrift.port=1 --hiveconf 
> hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
> hive.metastore.metadb.dir=/user/hive/metadb --conf 
> spark.sql.shuffle.partitions=20*
>
> I have a file stored in Alluxio at *alluxio://master2:19998/etl_info/TOPIC*
>
> then I create a table in STS by 
> CREATE EXTERNAL TABLE topic (topic_id int, topic_name_vn String, 
> topic_name_en String, parent_id int, full_parent String, level_id int)
> STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/TOPIC';
>
> To compare STS with Spark, I create a temp table with the name topics:
> spark.sqlContext.read.parquet("alluxio://master2:19998/etl_info/TOPIC").registerTempTable("topics")
>
> Then I do query and compare.
>
>
> As you can see the result is different.
> Is that a bug, or did I do something wrong?
>
> Regards,
> Chanh
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Chanh Le
Thank you Gourav,

> Moving files from _temp folders to main folders is an additional overhead 
> when you are working on S3 as there is no move operation.

Good catch. Is GCS the same?

> I generally have a set of Data Quality checks after each job to ascertain 
> whether everything went fine, the results are stored so that it can be 
> published in a graph for monitoring, thus solving two purposes.


So that means that after the job is done you query the data to check, right?



> On Aug 8, 2016, at 1:46 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> 
> But you have to be careful, that is the default setting. There is a way you 
> can overwrite it so that the writing to _temp folder does not take place and 
> you write directly to the main folder. 
> 
> Moving files from _temp folders to main folders is an additional overhead 
> when you are working on S3 as there is no move operation. 
> 
> I generally have a set of Data Quality checks after each job to ascertain 
> whether everything went fine, the results are stored so that it can be 
> published in a graph for monitoring, thus solving two purposes.
> 
> 
> Regards,
> Gourav Sengupta
> 
> On Mon, Aug 8, 2016 at 7:41 AM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> It's out of the box in Spark. 
> When you write data into HDFS or any storage, Spark only creates the final parquet 
> folder properly if the job succeeded; otherwise there is only a _temporary folder 
> inside, to mark that it did not succeed (Spark was killed), or nothing inside (the 
> Spark job failed).
> 
> 
> 
> 
> 
>> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in 
>> <mailto:sumit.kha...@askme.in>> wrote:
>> 
>> Hello,
>> 
>> the use case is as follows : 
>> 
>> say I am inserting 200K rows as dataframe.write.format("parquet") etc etc 
>> (like a basic write to hdfs  command), but say due to some reason or rhyme 
>> my job got killed, when the run was in the mid of it, meaning lets say I was 
>> only able to insert 100K rows when my job got killed.
>> 
>> twist is that I might actually be upserting, and even in append only cases, 
>> my delta change data that is being inserted / written in this run might 
>> actually be spanning across various partitions.
>> 
>> Now what I am looking for is something to roll the changes back, like the 
>> batch insertion should be all or nothing, and even if it is partitioned, it 
>> must be atomic to each row / unit of insertion.
>> 
>> Kindly help.
>> 
>> Thanks,
>> Sumit
> 
> 



Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Chanh Le
It's out of the box in Spark. 
When you write data into HDFS or any storage, Spark only creates the final parquet 
folder properly if the job succeeded; otherwise there is only a _temporary folder 
inside, to mark that it did not succeed (Spark was killed), or nothing inside (the 
Spark job failed).
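A minimal sketch of how a downstream consumer can rely on this, assuming the default 
FileOutputCommitter behaviour (which also drops a _SUCCESS marker file into the output 
directory on successful completion); the path below is illustrative:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("success-marker-check").getOrCreate()

val out = new Path("hdfs:///warehouse/my_table/dt=2016-08-08")   // hypothetical output dir
val fs: FileSystem = out.getFileSystem(spark.sparkContext.hadoopConfiguration)

if (fs.exists(new Path(out, "_SUCCESS"))) {
  // The write committed: safe to consume the whole directory.
  val df = spark.read.parquet(out.toString)
  println(df.count())
} else {
  // The job died mid-write: only _temporary (or nothing) is present, so skip or clean up.
  println(s"Skipping uncommitted output at $out")
}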





> On Aug 8, 2016, at 1:35 PM, Sumit Khanna  wrote:
> 
> Hello,
> 
> the use case is as follows : 
> 
> say I am inserting 200K rows as dataframe.write.format("parquet") etc etc 
> (like a basic write to hdfs  command), but say due to some reason or rhyme my 
> job got killed, when the run was in the mid of it, meaning lets say I was 
> only able to insert 100K rows when my job got killed.
> 
> twist is that I might actually be upserting, and even in append only cases, 
> my delta change data that is being inserted / written in this run might 
> actually be spanning across various partitions.
> 
> Now what I am looking for is something to roll the changes back, like the 
> batch insertion should be all or nothing, and even if it is partitioned, it 
> must be atomic to each row / unit of insertion.
> 
> Kindly help.
> 
> Thanks,
> Sumit



Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Chanh Le
You should use df.where(conditionExpr), which is more convenient for expressing a 
simple condition as SQL.
 

/**
 * Filters rows using the given SQL expression.
 * {{{
 *   peopleDf.where("age > 15")
 * }}}
 * @group dfops
 * @since 1.5.0
 */
def where(conditionExpr: String): DataFrame = {
  filter(Column(SqlParser.parseExpression(conditionExpr)))
}
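For example, the exclusion Mich writes below could be expressed as a single SQL string 
(the column name is taken from his example; the source path is hypothetical, and 
sqlContext is the one predefined in the Spark 1.6 shell):

val df = sqlContext.read.parquet("hdfs:///path/to/transactions")   // hypothetical source

df.where("transactiontype <> 'DEB' AND transactiontype <> 'BGC'")
  .select("transactiontype")
  .distinct()
  .show()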




> On Aug 7, 2016, at 10:58 PM, Mich Talebzadeh  
> wrote:
> 
> although the logic should be col1 <> a && col(1) <> b
> 
> to exclude both
> 
> Like
> 
> df.filter('transactiontype > " ").filter(not('transactiontype ==="DEB") && 
> not('transactiontype 
> ==="BGC")).select('transactiontype).distinct.collect.foreach(println)
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 7 August 2016 at 16:53, Mich Talebzadeh  > wrote:
> try similar to this
> 
> df.filter(not('transactiontype ==="DEB") || not('transactiontype ==="CRE"))
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 7 August 2016 at 15:43, Divya Gehlot  > wrote:
> Hi,
> I have a use case where I need to use the or [||] operator in a filter condition.
> It seems it's not working: it's taking the condition before the operator and 
> ignoring the other filter condition after the or operator.
> Has anybody faced a similar issue?
> 
> Psuedo code :
> df.filter(col("colName").notEqual("no_value") || col("colName").notEqual(""))
> 
> Am I missing something.
> Would really appreciate the help.
> 
> 
> Thanks,
> Divya 
> 
> 



Re: [Spark 2.0] Problem with Spark Thrift Server show NULL instead of showing BIGINT value

2016-08-04 Thread Chanh Le
I checked with Spark 1.6.1 and it still works fine.
I also checked out the latest source code on the Spark 2.0 branch, built it, and got 
the same issue.

I think it is because of the API change to Dataset in Spark 2.0?



Regards,
Chanh


> On Aug 5, 2016, at 9:44 AM, Chanh Le <giaosu...@gmail.com> wrote:
> 
> Hi Nicholas,
> Thanks for the information. 
> How did you solve the issue? 
> Did you change the parquet file by renaming the column name? 
> I tried changing the column name when I create the table in Hive, without 
> changing the parquet file, but it's still showing NULL.
> My parquet files are quite big, so anything I can do without rewriting 
> the parquet would be better.
> 
> 
> Regards,
> Chanh.
> 
> 
>> On Aug 5, 2016, at 2:24 AM, Nicholas Hakobian 
>> <nicholas.hakob...@rallyhealth.com 
>> <mailto:nicholas.hakob...@rallyhealth.com>> wrote:
>> 
>> Its due to the casing of the 'I' in userId. Your schema (from printSchema) 
>> names the field "userId", while your external table definition has it as 
>> "userid".
>> 
>> We've run into similar issues with external Parquet tables defined in Hive 
>> defined with lowercase only and accessing through HiveContext. You should 
>> check out this documentation as it describes how Spark handles column 
>> definitions:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
>>  
>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion>
>> 
>> 
>> Nicholas Szandor Hakobian, Ph.D.
>> Data Scientist
>> Rally Health
>> nicholas.hakob...@rallyhealth.com <mailto:nicholas.hakob...@rallyhealth.com>
>> 
>> 
>> On Thu, Aug 4, 2016 at 4:53 AM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Takeshi, 
>> I already have changed the colum type into INT and String but it got the 
>> same Null values. 
>> it only happens in userid that why it so annoying.
>> 
>> thanks and regards, 
>> Chanh
>> 
>> 
>> On Aug 4, 2016 5:59 PM, "Takeshi Yamamuro" <linguin@gmail.com 
>> <mailto:linguin@gmail.com>> wrote:
>> Hi,
>> 
>> When changing the long type into int one, does the issue also happen?
>> And also, could you show more simple query to reproduce the issue?
>> 
>> // maropu
>> 
>> On Thu, Aug 4, 2016 at 7:35 PM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> 
>> Hi everyone,
>> 
>> I have a parquet file and it has data but when I use Spark Thrift Server to 
>> query it shows NULL for userid.
>> As you can see I can get data by Spark Scala but STS is not.
>> 
>> 
>> 
>> The file schema
>> root
>>  |-- time: string (nullable = true)
>>  |-- topic_id: integer (nullable = true)
>>  |-- interest_id: integer (nullable = true)
>>  |-- inmarket_id: integer (nullable = true)
>>  |-- os_id: integer (nullable = true)
>>  |-- browser_id: integer (nullable = true)
>>  |-- device_type: integer (nullable = true)
>>  |-- device_id: integer (nullable = true)
>>  |-- location_id: integer (nullable = true)
>>  |-- age_id: integer (nullable = true)
>>  |-- gender_id: integer (nullable = true)
>>  |-- website_id: integer (nullable = true)
>>  |-- channel_id: integer (nullable = true)
>>  |-- section_id: integer (nullable = true)
>>  |-- zone_id: integer (nullable = true)
>>  |-- placement_id: integer (nullable = true)
>>  |-- advertiser_id: integer (nullable = true)
>>  |-- campaign_id: integer (nullable = true)
>>  |-- payment_id: integer (nullable = true)
>>  |-- creative_id: integer (nullable = true)
>>  |-- audience_id: integer (nullable = true)
>>  |-- merchant_cate: integer (nullable = true)
>>  |-- ad_default: integer (nullable = true)
>>  |-- userId: long (nullable = true)
>>  |-- impression: integer (nullable = true)
>>  |-- viewable: integer (nullable = true)
>>  |-- click: integer (nullable = true)
>>  |-- click_fraud: integer (nullable = true)
>>  |-- revenue: double (nullable = true)
>>  |-- proceeds: double (nullable = true)
>>  |-- spent: double (nullable = true)
>>  |-- network_id: integer (nullable = true)
>> 
>> 
>> I create a table in Spark Thrift Server by.
>> 
>> CREATE EXTERNAL TABLE ad_cookie_report (time String, advertiser_id int, 
>> campaign_id int, payment_id int, creative_id int, website_id int, channel_id 
>> int, section_id in

Re: [Spark 2.0] Problem with Spark Thrift Server show NULL instead of showing BIGINT value

2016-08-04 Thread Chanh Le
Hi Nicholas,
Thanks for the information. 
How did you solve the issue? 
Did you change the parquet file by renaming the column name? 
I tried changing the column name when I create the table in Hive, without changing 
the parquet file, but it's still showing NULL.
My parquet files are quite big, so anything I can do without rewriting the 
parquet would be better.
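For reference, if a one-time rewrite were acceptable, the rename discussed above would 
look roughly like this hedged sketch; the source path is the one from this thread, and 
the target path is illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rename-userid-sketch").getOrCreate()

spark.read.parquet("alluxio://master2:19998/AD_COOKIE_REPORT")
  .withColumnRenamed("userId", "userid")                    // match the lowercase name in the Hive DDL
  .write.mode("overwrite")
  .parquet("alluxio://master2:19998/AD_COOKIE_REPORT_LC")   // hypothetical new location for the table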


Regards,
Chanh.


> On Aug 5, 2016, at 2:24 AM, Nicholas Hakobian 
> <nicholas.hakob...@rallyhealth.com> wrote:
> 
> Its due to the casing of the 'I' in userId. Your schema (from printSchema) 
> names the field "userId", while your external table definition has it as 
> "userid".
> 
> We've run into similar issues with external Parquet tables defined in Hive 
> defined with lowercase only and accessing through HiveContext. You should 
> check out this documentation as it describes how Spark handles column 
> definitions:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
>  
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion>
> 
> 
> Nicholas Szandor Hakobian, Ph.D.
> Data Scientist
> Rally Health
> nicholas.hakob...@rallyhealth.com <mailto:nicholas.hakob...@rallyhealth.com>
> 
> 
> On Thu, Aug 4, 2016 at 4:53 AM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Takeshi, 
> I have already changed the column type to INT and String, but it got the same 
> NULL values. 
> It only happens with userid, which is why it is so annoying.
> 
> thanks and regards, 
> Chanh
> 
> 
> On Aug 4, 2016 5:59 PM, "Takeshi Yamamuro" <linguin@gmail.com 
> <mailto:linguin@gmail.com>> wrote:
> Hi,
> 
> When changing the long type into int one, does the issue also happen?
> And also, could you show more simple query to reproduce the issue?
> 
> // maropu
> 
> On Thu, Aug 4, 2016 at 7:35 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> 
> Hi everyone,
> 
> I have a parquet file and it has data but when I use Spark Thrift Server to 
> query it shows NULL for userid.
> As you can see I can get data by Spark Scala but STS is not.
> 
> 
> 
> The file schema
> root
>  |-- time: string (nullable = true)
>  |-- topic_id: integer (nullable = true)
>  |-- interest_id: integer (nullable = true)
>  |-- inmarket_id: integer (nullable = true)
>  |-- os_id: integer (nullable = true)
>  |-- browser_id: integer (nullable = true)
>  |-- device_type: integer (nullable = true)
>  |-- device_id: integer (nullable = true)
>  |-- location_id: integer (nullable = true)
>  |-- age_id: integer (nullable = true)
>  |-- gender_id: integer (nullable = true)
>  |-- website_id: integer (nullable = true)
>  |-- channel_id: integer (nullable = true)
>  |-- section_id: integer (nullable = true)
>  |-- zone_id: integer (nullable = true)
>  |-- placement_id: integer (nullable = true)
>  |-- advertiser_id: integer (nullable = true)
>  |-- campaign_id: integer (nullable = true)
>  |-- payment_id: integer (nullable = true)
>  |-- creative_id: integer (nullable = true)
>  |-- audience_id: integer (nullable = true)
>  |-- merchant_cate: integer (nullable = true)
>  |-- ad_default: integer (nullable = true)
>  |-- userId: long (nullable = true)
>  |-- impression: integer (nullable = true)
>  |-- viewable: integer (nullable = true)
>  |-- click: integer (nullable = true)
>  |-- click_fraud: integer (nullable = true)
>  |-- revenue: double (nullable = true)
>  |-- proceeds: double (nullable = true)
>  |-- spent: double (nullable = true)
>  |-- network_id: integer (nullable = true)
> 
> 
> I create a table in Spark Thrift Server by.
> 
> CREATE EXTERNAL TABLE ad_cookie_report (time String, advertiser_id int, 
> campaign_id int, payment_id int, creative_id int, website_id int, channel_id 
> int, section_id int, zone_id int, ad_default int, placment_id int, topic_id 
> int, interest_id int, inmarket_id int, audience_id int, os_id int, browser_id 
> int, device_type int, device_id int, location_id int, age_id int, gender_id 
> int, merchant_cate int, userid bigint, impression int, viewable int, click 
> int, click_fraud int, revenue double, proceeds double, spent double, 
> network_id integer)
> STORED AS PARQUET LOCATION 'alluxio://master2:19998/AD_COOKIE_REPORT' <>;
> 
> But when I query it got all in  NULL values.
> 
> 0: jdbc:hive2://master1:1> select userid from ad_cookie_report limit 10;
> +-+--+
> | userid  |
> +-+--+
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> | NULL|
> +-+--+
> 10 rows selected (3.507 seconds)
> 
> How to solve the problem? Is that related to field with Uppercase?
> How to change the field name in this situation.
> 
> 
> Regards,
> Chanh
> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 



Re: [Thriftserver2] Controlling number of tasks

2016-08-03 Thread Chanh Le
I believe there is no way to reduce the number of tasks from the Hive side using 
coalesce, because when it comes to Hive it just reads the files, and the task count 
depends on the number of files you put in. So the way I did it was to coalesce at the 
ETL layer: produce as few files as possible, which also reduces the IO time for 
reading them (see the sketch below).
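A minimal sketch of that compaction step, with illustrative paths and an arbitrary 
target file count:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

spark.read.parquet("hdfs:///etl/landing/2016-08-03")    // the many half-hourly files
  .coalesce(8)                                          // fewer, larger files
  .write.mode("overwrite")
  .parquet("hdfs:///etl/compacted/2016-08-03")          // what the Thrift Server table points at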


> On Aug 3, 2016, at 7:03 PM, Yana Kadiyska  wrote:
> 
> Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When spark 
> reads these files, I end up with 315K tasks for a dataframe reading a few 
> days worth of data.
> 
> I now with a regular Spark job, I can use coalesce to come to a lower number 
> of tasks. Is there a way to tell HiveThriftserver2 to coalsce? I have a line 
> in hive-conf that says to use CombinedInputFormat but I'm not sure it's 
> working.
> 
> (Obviously haivng fewer large files is better but I don't control the file 
> generation side of this)
> 
> Tips much appreciated


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
Hi Ayan, 
You mean 
common.max_count = 1000
Max number of SQL result to display to prevent the browser overload. This is 
common properties for all connections




It is already set by default in Zeppelin, but I think it doesn't work with Hive.


DOC: http://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/interpreter/jdbc.html 
<http://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/interpreter/jdbc.html>


> On Aug 2, 2016, at 6:03 PM, ayan guha <guha.a...@gmail.com> wrote:
> 
> Zeppelin already has a param for jdbc
> 
> On 2 Aug 2016 19:50, "Mich Talebzadeh" <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> Ok I have already set up mine
> 
> 
> hive.limit.optimize.fetch.max
> 5
> 
>   Maximum number of rows allowed for a smaller subset of data for simple 
> LIMIT, if it is a fetch query.
>   Insert queries are not restricted by this limit.
> 
>   
> 
> I am surprised that yours was missing. What did you set it up to?
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 August 2016 at 10:18, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> I tried and it works perfectly.
> 
> Regards,
> Chanh
> 
> 
>> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> OK
>> 
>> Try that
>> 
>> Another tedious way is to create views in Hive based on tables and use limit 
>> on those views.
>> 
>> But try that parameter first if it does anything.
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 August 2016 at 09:13, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Mich,
>> I use Spark Thrift Server basically it acts like Hive.
>> 
>> I see that there is property in Hive.
>> 
>>> hive.limit.optimize.fetch.max
>>> Default Value: 5
>>> Added In: Hive 0.8.0
>>> Maximum number of rows allowed for a smaller subset of data for simple 
>>> LIMIT, if it is a fetch query. Insert queries are not restricted by this 
>>> limit.
>> 
>> Is that related to the problem?
>> 
>> 
>> 
>> 
>>> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> This is a classic problem on any RDBMS
>>> 
>>> Set the limit on the number of rows returned like maximum of 50K rows 
>>> through JDBC
>>> 
>>> What is your JDBC connection going to? Meaning which RDBMS if any?
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>> On 2 August 2016 at 08:41, Chanh Le <giaosu...@gmail.com 
>>> <mailto:giaosu...@gmail.com>> wrote:
>>> Hi everyone,
>>> I setup STS and use Zeppelin to query data through JDBC connection.
>>> A problem we are facing is users usually forget to put limit in the query 
>>> so it causes hang the cluster.
>>> 
>>> SELECT * FROM tableA;
>>> 
>>> Is there anyway to config the limit by default ?
>>> 
>>> 
>>> Regards,
>>> Chanh
>>> 
>> 
>> 
> 
> 



Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
I just added a param to the Spark Thrift Server start command: --hiveconf 
hive.limit.optimize.fetch.max=1000 




> On Aug 2, 2016, at 4:50 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Ok I have already set up mine
> 
> 
> hive.limit.optimize.fetch.max
> 5
> 
>   Maximum number of rows allowed for a smaller subset of data for simple 
> LIMIT, if it is a fetch query.
>   Insert queries are not restricted by this limit.
> 
>   
> 
> I am surprised that yours was missing. What did you set it up to?
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 August 2016 at 10:18, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> I tried and it works perfectly.
> 
> Regards,
> Chanh
> 
> 
>> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> OK
>> 
>> Try that
>> 
>> Another tedious way is to create views in Hive based on tables and use limit 
>> on those views.
>> 
>> But try that parameter first if it does anything.
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 August 2016 at 09:13, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Mich,
>> I use Spark Thrift Server basically it acts like Hive.
>> 
>> I see that there is property in Hive.
>> 
>>> hive.limit.optimize.fetch.max
>>> Default Value: 5
>>> Added In: Hive 0.8.0
>>> Maximum number of rows allowed for a smaller subset of data for simple 
>>> LIMIT, if it is a fetch query. Insert queries are not restricted by this 
>>> limit.
>> 
>> Is that related to the problem?
>> 
>> 
>> 
>> 
>>> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> This is a classic problem on any RDBMS
>>> 
>>> Set the limit on the number of rows returned like maximum of 50K rows 
>>> through JDBC
>>> 
>>> What is your JDBC connection going to? Meaning which RDBMS if any?
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>> On 2 August 2016 at 08:41, Chanh Le <giaosu...@gmail.com 
>>> <mailto:giaosu...@gmail.com>> wrote:
>>> Hi everyone,
>>> I setup STS and use Zeppelin to query data through JDBC connection.
>>> A problem we are facing is users usually forget to put limit in the query 
>>> so it causes hang the cluster.
>>> 
>>> SELECT * FROM tableA;
>>> 
>>> Is there anyway to config the limit by default ?
>>> 
>>> 
>>> Regards,
>>> Chanh
>>> 
>> 
>> 
> 
> 



Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
I tried and it works perfectly.

Regards,
Chanh


> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> OK
> 
> Try that
> 
> Another tedious way is to create views in Hive based on tables and use limit 
> on those views.
> 
> But try that parameter first if it does anything.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 August 2016 at 09:13, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Mich,
> I use Spark Thrift Server basically it acts like Hive.
> 
> I see that there is property in Hive.
> 
>> hive.limit.optimize.fetch.max
>> Default Value: 5
>> Added In: Hive 0.8.0
>> Maximum number of rows allowed for a smaller subset of data for simple 
>> LIMIT, if it is a fetch query. Insert queries are not restricted by this 
>> limit.
> 
> Is that related to the problem?
> 
> 
> 
> 
>> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> This is a classic problem on any RDBMS
>> 
>> Set the limit on the number of rows returned like maximum of 50K rows 
>> through JDBC
>> 
>> What is your JDBC connection going to? Meaning which RDBMS if any?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 August 2016 at 08:41, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi everyone,
>> I setup STS and use Zeppelin to query data through JDBC connection.
>> A problem we are facing is users usually forget to put limit in the query so 
>> it causes hang the cluster.
>> 
>> SELECT * FROM tableA;
>> 
>> Is there anyway to config the limit by default ?
>> 
>> 
>> Regards,
>> Chanh
>> 
> 
> 



Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
Hi Mich,
I use Spark Thrift Server; basically it acts like Hive.

I see that there is a property in Hive:

> hive.limit.optimize.fetch.max
> Default Value: 5
> Added In: Hive 0.8.0
> Maximum number of rows allowed for a smaller subset of data for simple LIMIT, 
> if it is a fetch query. Insert queries are not restricted by this limit.

Is that related to the problem?




> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> This is a classic problem on any RDBMS
> 
> Set the limit on the number of rows returned like maximum of 50K rows through 
> JDBC
> 
> What is your JDBC connection going to? Meaning which RDBMS if any?
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 August 2016 at 08:41, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi everyone,
> I setup STS and use Zeppelin to query data through JDBC connection.
> A problem we are facing is users usually forget to put limit in the query so 
> it causes hang the cluster.
> 
> SELECT * FROM tableA;
> 
> Is there anyway to config the limit by default ?
> 
> 
> Regards,
> Chanh
> 



Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
Hi everyone,
I set up STS and use Zeppelin to query data through a JDBC connection.
A problem we are facing is that users usually forget to put a limit in the query, so it 
hangs the cluster. 

SELECT * FROM tableA;

Is there any way to configure a limit by default?


Regards,
Chanh
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Chanh Le
If I have a column stored in a parquet file as INT and I create a table 
with the same column but change the type from int to bigint,

in Spark 2.0 it shows this error:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0 in stage 259.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
259.0 (TID 22958, slave2): java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


So I think this error still happens in Spark 2.0.
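A hedged workaround while the table DDL and the file schema disagree: read the files 
with their stored schema and cast explicitly, instead of declaring the wider type in 
the external table. The path is taken from the stack trace in this thread; the column 
name is illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("widen-int-column").getOrCreate()

val df = spark.read.parquet("file:/data/etl-report/parquet/AD_COOKIE_REPORT")
val widened = df.withColumn("userId", col("userId").cast("bigint"))   // INT in the files, BIGINT for consumers
widened.createOrReplaceTempView("ad_cookie_report")                   // query it as BIGINT from here on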




> On Aug 1, 2016, at 9:21 AM, Chanh Le <giaosu...@gmail.com> wrote:
> 
> Sorry my bad, I ran in Spark 1.6.1 but what about this error?
> Why Int cannot be cast to Long?
> 
> 
> Thanks.
> 
> 
>> On Aug 1, 2016, at 2:44 AM, Michael Armbrust <mich...@databricks.com 
>> <mailto:mich...@databricks.com>> wrote:
>> 
>> Are you sure you are running Spark 2.0?
>> 
>> In your stack trace I see SqlNewHadoopRDD, which was removed in #12354 
>> <https://github.com/apache/spark/pull/12354>.
>> 
>> On Sun, Jul 31, 2016 at 2:12 AM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi everyone,
>> Why MutableInt cannot be cast to MutableLong?
>> It’s really weird and seems Spark 2.0 has a lot of error with parquet about 
>> format.
>> 
>> org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to 
>> org.apache.spark.sql.catalyst.expressions.MutableLong
>> 
>> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read 
>> value at 0 in block 0 in file 
>> file:/data/etl-report/parquet/AD_COOKIE_REPORT/time=2016-07-
>> 25-16/network_id=31713/part-r-0-9adbef89-f2f4-4836-a50c-a2e7b381d558.snappy.parquet
>> at 
>> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>> at 
>> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>> at 
>> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>> at 
>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(

Re: [Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Chanh Le
Sorry, my bad, I ran it in Spark 1.6.1; but what about this error?
Why can't Int be cast to Long?


Thanks.


> On Aug 1, 2016, at 2:44 AM, Michael Armbrust <mich...@databricks.com> wrote:
> 
> Are you sure you are running Spark 2.0?
> 
> In your stack trace I see SqlNewHadoopRDD, which was removed in #12354 
> <https://github.com/apache/spark/pull/12354>.
> 
> On Sun, Jul 31, 2016 at 2:12 AM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi everyone,
> Why MutableInt cannot be cast to MutableLong?
> It’s really weird and seems Spark 2.0 has a lot of error with parquet about 
> format.
> 
> org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableLong
> 
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block 0 in file 
> file:/data/etl-report/parquet/AD_COOKIE_REPORT/time=2016-07-
> 25-16/network_id=31713/part-r-0-9adbef89-f2f4-4836-a50c-a2e7b381d558.snappy.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableLong
> at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:295)
> at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter$RowUpdater.setLong(CatalystRowConverter.scala:161)
> at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter.addLong(CatalystRowConverter.scala:85)
> at 
> org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:269)
> at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
> at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
> ... 20 more
> 



[Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Chanh Le
Hi everyone,
Why can't MutableInt be cast to MutableLong?
It's really weird, and Spark 2.0 seems to have a lot of errors related to the Parquet format.

org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
at 0 in block 0 in file 
file:/data/etl-report/parquet/AD_COOKIE_REPORT/time=2016-07-
25-16/network_id=31713/part-r-0-9adbef89-f2f4-4836-a50c-a2e7b381d558.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:295)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter$RowUpdater.setLong(CatalystRowConverter.scala:161)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter.addLong(CatalystRowConverter.scala:85)
at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:269)
at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 20 more
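
Not part of the original thread, but a minimal workaround sketch for this kind of ClassCastException, assuming the declared table schema says bigint (LongType) for a column that the Parquet files physically store as int32: read with a schema that matches the files, then widen explicitly. The column name is a placeholder, the path comes from the stack trace, and sqlContext is assumed to be in scope.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

// Hypothetical physical schema matching what the Parquet writer actually produced (int32).
val physicalSchema = StructType(Seq(
  StructField("impressions", IntegerType, nullable = true)))

// Read with the matching schema so the row converter uses setInt rather than setLong ...
val raw = sqlContext.read
  .schema(physicalSchema)
  .parquet("file:/data/etl-report/parquet/AD_COOKIE_REPORT")

// ... then cast explicitly to the type the downstream code expects.
val widened = raw.withColumn("impressions", col("impressions").cast(LongType))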

[jira] [Commented] (SPARK-16518) Schema Compatibility of Parquet Data Source

2016-07-30 Thread Chanh Le (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400755#comment-15400755
 ] 

Chanh Le commented on SPARK-16518:
--

Did we have a patch for that?
Right now I have this error too.


> Schema Compatibility of Parquet Data Source
> ---
>
> Key: SPARK-16518
> URL: https://issues.apache.org/jira/browse/SPARK-16518
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we are not checking the schema compatibility. Different file 
> formats behave differently. This JIRA just summarizes what I observed for 
> parquet data source tables.
> *Scenario 1 Data type mismatch*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int, col2 int)}}
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the error we 
> got:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most 
> recent failure:
>  Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62)
> {noformat}
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the error we 
> got:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): 
> org.apache.spark.SparkException:
>  Failed merging schema of file 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-0-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet:
> root
>  |-- a: integer (nullable = false)
>  |-- b: string (nullable = true)
> {noformat}
> *Scenario 2 More columns in append dataset*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int, col2 string, col3 int)}}
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the schema 
> of the resultset is {{(col1 int, col2 string)}}.
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema of 
> the resultset is {{(col1 int, col2 string, col3 int)}}.
> *Scenario 3 Less columns in append dataset*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int)}}
>*Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the 
> schema of the resultset is {{(col1 int, col2 string)}}.
>*Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema 
> of the resultset is {{(col1 int)}}.
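
(Not part of the JIRA, just an illustrative sketch.) The behaviour in the scenarios above is driven by spark.sql.parquet.mergeSchema, which can also be toggled per read; the path below is a placeholder and sqlContext is assumed to be in scope.

// mergeSchema defaults to false; the per-read option overrides the global setting.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/data/append_target")
merged.printSchema()   // shows the union of the column sets written by the different appends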



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Chanh Le
Hi Mich, there is something different between your log:
> On Jul 30, 2016, at 6:58 PM, Mich Talebzadeh  
> wrote:
> 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)

vs mine
> Jul 30, 2016 6:32:27 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) using 
> format: (.+) version ((.*) )?\(build ?(.*)\)

So the parquet-mr versions are different, right?




Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Chanh Le
/07/30 18:32:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
16/07/30 18:32:27 INFO DAGScheduler: ResultStage 0 (run at 
AccessController.java:-2) finished in 0.509 s
16/07/30 18:32:27 INFO DAGScheduler: Job 0 finished: run at 
AccessController.java:-2, took 0.625829 s
16/07/30 18:32:27 INFO CodeGenerator: Code generated in 18.418581 ms
16/07/30 18:32:28 INFO SparkExecuteStatementOperation: Result Schema: 
List(topic_id#0, topic_name_en#1, parent_id#2, full_parent#3, level_id#4)

> On Jul 30, 2016, at 6:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Actually Hive SQL is a superset of Spark SQL. Data type may not be an issue.
> 
> If I create the table after DataFrame creation as explicitly a Hive parquet 
> table through Spark, Hive sees it and you can see it in Spark thrift server 
> with data in it (basically you are using Hive Thrift server under the bonnet).
> 
> If I let Spark create table with 
> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
> 
> Then Hive does not seem to see the data when an external Hive table is 
> created on it!
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 30 July 2016 at 11:52, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> I agree with you. Maybe some change on data type in Spark that Hive still not 
> support or not competitive so that why It shows NULL.
> 
> 
>> On Jul 30, 2016, at 5:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> I think it is still a Hive problem because Spark thrift server is basically 
>> a Hive thrift server.
>> 
>> An ACID test would be to log in to Hive CLI or Hive thrift server (you are 
>> actually using Hive thrift server on port 1 when using Spark thrift 
>> server) and see whether you see data
>> 
>> When you use Spark it should work.
>> 
>> I still believe it is a bug in Hive
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 30 July 2016 at 11:43, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Mich,
>> Thanks for supporting. Here some of my thoughts.
>> 
>>> BTW can you log in to thrift server and do select * from  limit 10
>>> 
>>> Do you see the rows?
>> 
>> Yes I can see the row but all the fields value NULL.
>> 
>>> Works OK for me
>> 
>> You just test the number of row. In my case I check and it shows 117 rows 
>> but the problem is about the data is NULL in all fields.
>> 
>> 
>>> AS I see it the issue is that Hive table created as external on Parquet 
>>> table somehow does not see data. Rows are all nulls.
>>> 
>>> I don't think this is specific to thrift server. Just log in to Hive and 
>>> see you can read the data from your table topic created as external.
>>> 
>>> I noticed the same issue
>> 
>> I don’t think it’s a Hive issue. Right now I am using Spark and Zeppelin.
>> 
>> 
>> And the point is why with the same parquet file ( I convert from CSV to 
>> parquet) it can be read in Spark but not in STS.
>> 
>> One more thing is with the same file and method to create table in STS in 
>> Sp

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Chanh Le
I agree with you. Maybe Spark changed some data types that Hive does not yet support or handle compatibly, and that is why it shows NULL.


> On Jul 30, 2016, at 5:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> I think it is still a Hive problem because Spark thrift server is basically a 
> Hive thrift server.
> 
> An ACID test would be to log in to Hive CLI or Hive thrift server (you are 
> actually using Hive thrift server on port 1 when using Spark thrift 
> server) and see whether you see data
> 
> When you use Spark it should work.
> 
> I still believe it is a bug in Hive
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 30 July 2016 at 11:43, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Mich,
> Thanks for supporting. Here some of my thoughts.
> 
>> BTW can you log in to thrift server and do select * from  limit 10
>> 
>> Do you see the rows?
> 
> Yes I can see the row but all the fields value NULL.
> 
>> Works OK for me
> 
> You just test the number of row. In my case I check and it shows 117 rows but 
> the problem is about the data is NULL in all fields.
> 
> 
>> AS I see it the issue is that Hive table created as external on Parquet 
>> table somehow does not see data. Rows are all nulls.
>> 
>> I don't think this is specific to thrift server. Just log in to Hive and see 
>> you can read the data from your table topic created as external.
>> 
>> I noticed the same issue
> 
> I don’t think it’s a Hive issue. Right now I am using Spark and Zeppelin.
> 
> 
> And the point is why with the same parquet file ( I convert from CSV to 
> parquet) it can be read in Spark but not in STS.
> 
> One more thing is with the same file and method to create table in STS in 
> Spark 1.6.1 it works fine.
> 
> 
> Regards,
> Chanh
> 
> 
> 
>> On Jul 30, 2016, at 2:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> BTW can you log in to thrift server and do select * from  limit 10
>> 
>> Do you see the rows?
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 30 July 2016 at 07:20, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> Works OK for me
>> 
>> scala> val df = 
>> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", 
>> "true").option("header", 
>> "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>> df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, 
>> C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>> scala> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
>> scala> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")count
>> res2: Long = 3651
>> scala> val ff = sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")
>> ff: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, 
>> C3: string, C4: string, C5: string, C6: string, C7: string, C8: string]
>> scala> ff.take(5)
>> res3: Array[org.apache.spark.sql.Row] = Array([Transaction Date,Transaction

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-07-30 Thread Chanh Le
Hi Mich,
Thanks for the support. Here are some of my thoughts.

> BTW can you log in to thrift server and do select * from  limit 10
> 
> Do you see the rows?

Yes, I can see the rows, but all the field values are NULL.

> Works OK for me

You only tested the number of rows. In my case the count also shows 117 rows, but the problem is that the data is NULL in all fields.


> AS I see it the issue is that Hive table created as external on Parquet table 
> somehow does not see data. Rows are all nulls.
> 
> I don't think this is specific to thrift server. Just log in to Hive and see 
> you can read the data from your table topic created as external.
> 
> I noticed the same issue

I don’t think it’s a Hive issue. Right now I am using Spark and Zeppelin.


And the point is: why can the same Parquet file (which I converted from CSV to Parquet) be read in Spark but not in STS?

One more thing: with the same file and the same method of creating the table in STS on Spark 1.6.1, it works fine.
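
As a hedged illustration only (not something from this thread; paths and the table name are placeholders, sqlContext is assumed in scope): the two ways of exposing the data differ in whether the Hive metastore ever learns the schema Spark wrote, which is one plausible source of all-NULL rows behind an external table.

// Option 1: write plain Parquet files; a Hive external table created on top later
// relies entirely on the Hive-side DDL lining up with what the files contain.
val df = sqlContext.read.parquet("/etl_info/WEBSITE_csv_converted")
df.write.mode("overwrite").parquet("/etl_info/WEBSITE")

// Option 2: register the data through the metastore so STS/Hive sees the schema Spark wrote.
df.write.mode("overwrite").saveAsTable("website")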


Regards,
Chanh



> On Jul 30, 2016, at 2:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> BTW can you log in to thrift server and do select * from  limit 10
> 
> Do you see the rows?
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 30 July 2016 at 07:20, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Works OK for me
> 
> scala> val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", 
> "true").option("header", 
> "false").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
> df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: 
> string, C4: string, C5: string, C6: string, C7: string, C8: string]
> scala> df.write.mode("overwrite").parquet("/user/hduser/ll_18740868.parquet")
> scala> sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")count
> res2: Long = 3651
> scala> val ff = sqlContext.read.parquet("/user/hduser/ll_18740868.parquet")
> ff: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: 
> string, C4: string, C5: string, C6: string, C7: string, C8: string]
> scala> ff.take(5)
> res3: Array[org.apache.spark.sql.Row] = Array([Transaction Date,Transaction 
> Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit 
> Amount,Balance,], [31/12/2009,CPT,'30-64-72,18740868,LTSB STH KENSINGTO CD 
> 5710 31DEC09 ,90.00,,400.00,null], [31/12/2009,CPT,'30-64-72,18740868,LTSB 
> CHELSEA (3091 CD 5710 31DEC09 ,10.00,,490.00,null], 
> [31/12/2009,DEP,'30-64-72,18740868,CHELSEA ,,500.00,500.00,null], 
> [Transaction Date,Transaction Type,Sort Code,Account Number,Transaction 
> Description,Debit Amount,Credit Amount,Balance,])
> 
> Now in Zeppelin create an external table and read it
> 
> 
> 
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 29 July 2016 at 09:04, Chanh Le <giaosu...@gmail.com> wrote:
> I continue to debug
> 
> 16/07/29 13:57:35 INFO FileScanRDD: Reading File path: 
> file:///Users/giaosudau/Documents/Topics.parquet/part-r-0-8997050f-e063-427e-b53c-f0a61739706f.gz.parquet,
>  range: 0-3118, partition values: [empty row]
> vs OK one
> 16/07/29 15:02:47 INFO FileScanRDD: Reading File path: 
> file:///Users/giaosudau/data_example/FACT_ADMIN_HOURLY/time=2016-07-24-18/network_id=30206/part-r-0-c5f5e18d-c8a1-4831-8903-3c60b02bdfe8.snappy.parquet,
>  range: 0-6050, partition values: [2016-07-24-18,30206]
> 
> I attached 2 files.
> 
> 
> 
> 
> 
> 
>> On Jul 29, 2016, at 9:44 AM, Chanh Le <giaosu...@gmail.com> wrote:
>> 
>> Hi everyone,
>> 
>> For more investigation I attached the file that I conver

Re: Spark Standalone Cluster: Having a master and worker on the same node

2016-07-28 Thread Chanh Le
Hi Jestin,
I have seen that most setups run a master and a worker on the same node.
I think this is because the master does not do as much work as a worker does, and since resources are expensive we should make use of them.
BTW, my own setup also runs masters and workers side by side:
I have 5 nodes, and on 3 of them a master and a worker run alongside each other.
Hope it helps!


Regards,
Chanh





On Thu, Jul 28, 2016 at 12:19 AM, Jestin Ma 
wrote:

> Hi, I'm doing performance testing and currently have 1 master node and 4
> worker nodes and am submitting in client mode from a 6th cluster node.
>
> I know we can have a master and worker on the same node. Speaking in terms
> of performance and practicality, is it possible/suggested to have another
> working running on either the 6th node or the master node?
>
> Thank you!
>
>


Re: Spark Thrift Server 2.0 set spark.sql.shuffle.partitions not working when query

2016-07-28 Thread Chanh Le
Thank you Takeshi it works fine now.

Regards,
Chanh


> On Jul 28, 2016, at 2:03 PM, Takeshi Yamamuro <linguin@gmail.com> wrote:
> 
> Hi,
> 
> you need to set the value when you just start the server.
> 
> // maropu
> 
> On Thu, Jul 28, 2016 at 3:59 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi everyone,
> 
> I set spark.sql.shuffle.partitions=10 after started Spark Thrift Server but I 
> seems not working.
> 
> ./sbin/start-thriftserver.sh --master 
> mesos://zk://master1:2181,master2:2181,master3:2181/mesos <> --conf 
> spark.driver.memory=5G --conf spark.scheduler.mode=FAIR --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --jars 
> /home/spark/spark-2.0.0-bin-hadoop2.6/jars/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar
>  --total-executor-cores 35 spark-internal --hiveconf 
> hive.server2.thrift.port=1 --hiveconf 
> hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
> hive.metastore.metadb.dir=/user/hive/metadb
> 
> Did anyone has the same with me?
> 
> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro



Spark 2.0 just released

2016-07-26 Thread Chanh Le
Its official now http://spark.apache.org/releases/spark-release-2-0-0.html 

Everyone should check it out.




Re: Spark Web UI port 4040 not working

2016-07-26 Thread Chanh Le
You’re running in StandAlone Mode?
Usually inside active task it will show the address of current job.
or you can check in master node by using netstat -apn | grep 4040



> On Jul 26, 2016, at 8:21 AM, Jestin Ma  wrote:
> 
> Hello, when running spark jobs, I can access the master UI (port 8080 one) no 
> problem. However, I'm confused as to how to access the web UI to see 
> jobs/tasks/stages/etc.
> 
> I can access the master UI at http://:8080. But port 4040 gives 
> me a -connection cannot be reached-.
> 
> Is the web UI http:// with a port of 4040?
> 
> I'm running my Spark job on a cluster machine and submitting it to a master 
> node part of the cluster. I heard of ssh tunneling; is that relevant here?
> 
> Thank you!


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Chanh Le
Hi Ken,

blacklistDF -> just a DataFrame.
Spark is lazy: until you call an action such as collect, take, or write, it does not execute the whole process (the map or filter steps you applied before the collect).
That means that until you call an action, Spark does nothing, so your df does not hold any materialised data yet -> foreach on the driver gives you nothing useful.
Calling collect executes the process -> gets the data back to the driver -> then foreach is OK.
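
A minimal sketch of the difference (names and the path are placeholders, not taken from Kevin's code; sqlContext is assumed in scope):

val blacklistDF = sqlContext.read.parquet("/data/blacklist")   // lazy: nothing is read yet
val filtered = blacklistDF.filter("score > 0.5")               // still lazy: no job has run

// foreach is itself an action, but it runs on the executors, so any println output
// ends up in the executor logs rather than on the driver console.
filtered.foreach(row => println(row))

// collect() is an action that brings the rows back to the driver first,
// so the println output appears locally.
filtered.collect().foreach(row => println(row))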


> On Jul 26, 2016, at 2:30 PM, kevin  wrote:
> 
>  blacklistDF.collect()



Re: Optimize filter operations with sorted data

2016-07-21 Thread Chanh Le
You can check this in the Spark UI or in the output of the Spark application:
how many stages and tasks there are before you partition the data and after.
Also compare the run times.
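
A minimal sketch of the layout and pruning discussed in this thread, assuming data partitioned by time and network_id; paths and filter values are placeholders and sqlContext is assumed in scope.

// Write the data partitioned by the columns that are usually filtered on.
val events = sqlContext.read.parquet("/data/raw_events")
events.write
  .mode("overwrite")
  .partitionBy("time", "network_id")
  .parquet("/data/FACT_ADMIN_HOURLY")

// A filter on the partition columns only touches the matching directories,
// so far fewer files are read and fewer tasks are created than for a full scan.
val pruned = sqlContext.read.parquet("/data/FACT_ADMIN_HOURLY")
  .filter("time = '2016-07-24-18' and network_id = 30206")
pruned.explain()   // compare the plan and the task counts with and without the filter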

Regards,
Chanh

On Thu, Jul 7, 2016 at 6:40 PM, tan shai <tan.shai...@gmail.com> wrote:

> How can you verify that it is loading only the part of time and network in
> filter ?
>
> 2016-07-07 11:58 GMT+02:00 Chanh Le <giaosu...@gmail.com>:
>
>> Hi Tan,
>> It depends on how data organise and what your filter is.
>> For example in my case: I store data by partition by field time and
>> network_id. If I filter by time or network_id or both and with other field
>> Spark only load part of time and network in filter then filter the rest.
>>
>>
>>
>> > On Jul 7, 2016, at 4:43 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >
>> > Does the filter under consideration operate on sorted column(s) ?
>> >
>> > Cheers
>> >
>> >> On Jul 7, 2016, at 2:25 AM, tan shai <tan.shai...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have a sorted dataframe, I need to optimize the filter operations.
>> >> How does Spark performs filter operations on sorted dataframe?
>> >>
>> >> It is scanning all the data?
>> >>
>> >> Many thanks.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>>
>


Re: run spark apps in linux crontab

2016-07-21 Thread Chanh Le
If you use >, only the output you print yourself (print or println) goes to the log file; the other log messages (INFO, WARN, ERROR) go to the console and, I believe, are not written to the log file.
But tee can handle that: piping the command through tee writes the output both to the screen (stdout) and to the file.

Thanks.



> On Jul 21, 2016, at 1:48 PM, <luohui20...@sina.com> <luohui20...@sina.com> 
> wrote:
> 
> 
> got it.
> 
> difference:
> > : all messages goes to the log file, leaving no messages in STDOUT
> tee: all message goes to the log file and STDOUT at the same time.
> 
> 
>  
> ThanksBest regards!
> San.Luo
> 
> - 原始邮件 -
> 发件人:Chanh Le <giaosu...@gmail.com>
> 收件人:luohui20...@sina.com
> 抄送人:focus <focushe...@qq.com>, user <user@spark.apache.org>
> 主题:Re: run spark apps in linux crontab
> 日期:2016年07月21日 11点38分
> 
> you should you use command.sh | tee file.log
> 
>> On Jul 21, 2016, at 10:36 AM, <luohui20...@sina.com 
>> <mailto:luohui20...@sina.com>> <luohui20...@sina.com 
>> <mailto:luohui20...@sina.com>> wrote:
>> 
>> 
>> thank you focus, and all.
>> this problem solved by adding a line ". /etc/profile" in my shell.
>> 
>> 
>> 
>>  
>> ThanksBest regards!
>> San.Luo
>> 
>> - 原始邮件 -
>> 发件人:"focus" <focushe...@qq.com <mailto:focushe...@qq.com>>
>> 收件人:"luohui20001" <luohui20...@sina.com <mailto:luohui20...@sina.com>>, 
>> "user@spark.apache.org <mailto:user@spark.apache.org>" 
>> <user@spark.apache.org <mailto:user@spark.apache.org>>
>> 主题:Re:run spark apps in linux crontab
>> 日期:2016年07月20日 18点11分
>> 
>> Hi, I just meet this problem, too! The reason is crontab runtime doesn't 
>> have the variables you defined, such as $SPARK_HOME.
>> I defined the $SPARK_HOME and other variables in /etc/profile like this:
>> 
>> export $MYSCRIPTS=/opt/myscripts
>> export $SPARK_HOME=/opt/spark
>> 
>> then, in my crontab job script daily_job.sh
>> 
>> #!/bin/sh
>> 
>> . /etc/profile
>> 
>> $SPARK_HOME/bin/spark-submit $MYSCRIPTS/fix_fh_yesterday.py
>> 
>> then, in crontab -e
>> 
>> 0 8 * * * /home/user/daily_job.sh
>> 
>> hope this helps~
>> 
>> 
>> 
>> 
>> -- Original --
>> From: "luohui20001"<luohui20...@sina.com <mailto:luohui20...@sina.com>>;
>> Date: 2016年7月20日(星期三) 晚上6:00
>> To: "user@spark.apache.org 
>> <mailto:user@spark.apache.org>"<user@spark.apache.org 
>> <mailto:user@spark.apache.org>>;
>> Subject: run spark apps in linux crontab
>> 
>> hi guys:
>>   I add a spark-submit job into my Linux crontab list by the means below 
>> ,however none of them works. If I change it to a normal shell script, it is 
>> ok. I don't quite understand why. I checked the 8080 web ui of my spark 
>> cluster, no job submitted, and there is not messages in /home/hadoop/log.
>>   Any idea is welcome.
>> 
>> [hadoop@master ~]$ crontab -e
>> 1.
>> 22 21 * * * sh /home/hadoop/shellscripts/run4.sh > /home/hadoop/log
>> 
>> and in run4.sh,it wrote:
>> $SPARK_HOME/bin/spark-submit --class com.abc.myclass --total-executor-cores 
>> 10 --jars $SPARK_HOME/lib/MyDep.jar $SPARK_HOME/MyJar.jar  > /home/hadoop/log
>> 
>> 2.
>> 22 21 * * * $SPARK_HOME/bin/spark-submit --class com.abc.myclass 
>> --total-executor-cores 10 --jars $SPARK_HOME/lib/MyDep.jar 
>> $SPARK_HOME/MyJar.jar  > /home/hadoop/log
>> 
>> 3.
>> 22 21 * * * /usr/lib/spark/bin/spark-submit --class com.abc.myclass 
>> --total-executor-cores 10 --jars /usr/lib/spark/lib/MyDep.jar 
>> /usr/lib/spark/MyJar.jar  > /home/hadoop/log
>> 
>> 4.
>> 22 21 * * * hadoop /usr/lib/spark/bin/spark-submit --class com.abc.myclass 
>> --total-executor-cores 10 --jars /usr/lib/spark/lib/MyDep.jar 
>> /usr/lib/spark/MyJar.jar  > /home/hadoop/log
>> 
>> 
>>  
>> ThanksBest regards!
>> San.Luo
> 



Re: run spark apps in linux crontab

2016-07-20 Thread Chanh Le
you should you use command.sh | tee file.log

> On Jul 21, 2016, at 10:36 AM,   
> wrote:
> 
> 
> thank you focus, and all.
> this problem solved by adding a line ". /etc/profile" in my shell.
> 
> 
> 
>  
> ThanksBest regards!
> San.Luo
> 
> - 原始邮件 -
> 发件人:"focus" 
> 收件人:"luohui20001" , "user@spark.apache.org" 
> 
> 主题:Re:run spark apps in linux crontab
> 日期:2016年07月20日 18点11分
> 
> Hi, I just meet this problem, too! The reason is crontab runtime doesn't have 
> the variables you defined, such as $SPARK_HOME.
> I defined the $SPARK_HOME and other variables in /etc/profile like this:
> 
> export $MYSCRIPTS=/opt/myscripts
> export $SPARK_HOME=/opt/spark
> 
> then, in my crontab job script daily_job.sh
> 
> #!/bin/sh
> 
> . /etc/profile
> 
> $SPARK_HOME/bin/spark-submit $MYSCRIPTS/fix_fh_yesterday.py
> 
> then, in crontab -e
> 
> 0 8 * * * /home/user/daily_job.sh
> 
> hope this helps~
> 
> 
> 
> 
> -- Original --
> From: "luohui20001";
> Date: 2016年7月20日(星期三) 晚上6:00
> To: "user@spark.apache.org";
> Subject: run spark apps in linux crontab
> 
> hi guys:
>   I add a spark-submit job into my Linux crontab list by the means below 
> ,however none of them works. If I change it to a normal shell script, it is 
> ok. I don't quite understand why. I checked the 8080 web ui of my spark 
> cluster, no job submitted, and there is not messages in /home/hadoop/log.
>   Any idea is welcome.
> 
> [hadoop@master ~]$ crontab -e
> 1.
> 22 21 * * * sh /home/hadoop/shellscripts/run4.sh > /home/hadoop/log
> 
> and in run4.sh,it wrote:
> $SPARK_HOME/bin/spark-submit --class com.abc.myclass --total-executor-cores 
> 10 --jars $SPARK_HOME/lib/MyDep.jar $SPARK_HOME/MyJar.jar  > /home/hadoop/log
> 
> 2.
> 22 21 * * * $SPARK_HOME/bin/spark-submit --class com.abc.myclass 
> --total-executor-cores 10 --jars $SPARK_HOME/lib/MyDep.jar 
> $SPARK_HOME/MyJar.jar  > /home/hadoop/log
> 
> 3.
> 22 21 * * * /usr/lib/spark/bin/spark-submit --class com.abc.myclass 
> --total-executor-cores 10 --jars /usr/lib/spark/lib/MyDep.jar 
> /usr/lib/spark/MyJar.jar  > /home/hadoop/log
> 
> 4.
> 22 21 * * * hadoop /usr/lib/spark/bin/spark-submit --class com.abc.myclass 
> --total-executor-cores 10 --jars /usr/lib/spark/lib/MyDep.jar 
> /usr/lib/spark/MyJar.jar  > /home/hadoop/log
> 
> 
>  
> ThanksBest regards!
> San.Luo



Attribute name "sum(proceeds)" contains invalid character(s) among " ,;{}()\n\t="

2016-07-20 Thread Chanh Le
Hi everybody,
I got an error because a column name does not follow the allowed naming rule.
Please tell me how to fix it.
Here is my code.

metricFields here is a Seq of the metric columns: spent, proceeds, click, impression.
 sqlContext
  .sql(s"select * from hourly where time between '$dateStr-00' and 
'$dateStr-23' ")
  .groupBy("time", dimensions.filter(!"time".contains(_)): _*)
  .agg(metricFields.map(a => a -> "sum").toMap)
Error message was:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Attribute 
name "sum(proceeds)" contains invalid character(s) among " ,;{}()\n\t=". Please 
use alias to rename it.;
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$.checkConversionRequirement(CatalystSchemaConverter.scala:556)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$.checkFieldName(CatalystSchemaConverter.scala:542)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport$$anonfun$setSchema$2.apply(CatalystWriteSupport.scala:430)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport$$anonfun$setSchema$2.apply(CatalystWriteSupport.scala:430)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport$.setSchema(CatalystWriteSupport.scala:430)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.prepareJobForWrite(ParquetRelation.scala:258)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.driverSideSetup(WriterContainer.scala:103)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:147)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
at jobs.DailyJob$delayedInit$body.apply(DailyJob.scala:46)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at jobs.DailyJob$.main(DailyJob.scala:12)
at jobs.DailyJob.main(DailyJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
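
A possible workaround, not taken from a reply in this thread: alias each aggregate so the resulting column is named, for example, proceeds instead of sum(proceeds), which Parquet refuses to store. The values below are placeholders standing in for the real dateStr, dimensions and metricFields, and the hourly table is assumed to be registered as in the original code.

import org.apache.spark.sql.functions.sum

val dateStr = "2016-07-20"                                        // placeholder
val dimensions = Seq("time", "network_id")                        // placeholder dimension columns
val metricFields = Seq("spent", "proceeds", "click", "impression")

val aggCols = metricFields.map(m => sum(m).alias(m))              // "sum(proceeds)" becomes "proceeds"
val result = sqlContext
  .sql(s"select * from hourly where time between '$dateStr-00' and '$dateStr-23'")
  .groupBy("time", dimensions.filterNot(_ == "time"): _*)
  .agg(aggCols.head, aggCols.tail: _*)

result.write.mode("overwrite").parquet("/tmp/hourly_agg")         // now writes without the naming error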

[jira] [Created] (MESOS-5868) Task is running but not show in UI

2016-07-19 Thread Chanh Le (JIRA)
Chanh Le created MESOS-5868:
---

 Summary: Task is running but not show in UI
 Key: MESOS-5868
 URL: https://issues.apache.org/jira/browse/MESOS-5868
 Project: Mesos
  Issue Type: Bug
  Components: webui
Affects Versions: 0.28.1
 Environment: Centos 6.7
Reporter: Chanh Le


This happen when I try to restart the masters node without downing any slaves.
As you can see 6 tasks are running and in Active Tasks show nothing.
!http://imgur.com/a/jmmak| Tasks are running!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5868) Task is running but not show in UI

2016-07-19 Thread Chanh Le (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chanh Le updated MESOS-5868:

Description: 
This happen when I try to restart the masters node without downing any slaves.
As you can see 6 tasks are running and in Active Tasks show nothing.
!http://i.imgur.com/UaYqDN1.png| Tasks are running!

  was:
This happen when I try to restart the masters node without downing any slaves.
As you can see 6 tasks are running and in Active Tasks show nothing.
!http://imgur.com/a/jmmak| Tasks are running!


> Task is running but not show in UI
> --
>
> Key: MESOS-5868
> URL: https://issues.apache.org/jira/browse/MESOS-5868
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 0.28.1
> Environment: Centos 6.7
>Reporter: Chanh Le
>  Labels: easyfix
>
> This happen when I try to restart the masters node without downing any slaves.
> As you can see 6 tasks are running and in Active Tasks show nothing.
> !http://i.imgur.com/UaYqDN1.png| Tasks are running!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: the spark job is so slow - almost frozen

2016-07-18 Thread Chanh Le
Hi,
What about the network bandwidth between Hive and Spark?
Did the job run in Hive before you moved it to Spark?
Because the SQL is complex, you can use something like the EXPLAIN command to show what is going on.
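
A minimal sketch of that suggestion (the query and table names are placeholders; sqlContext is assumed to be in scope):

// explain(true) prints the parsed, analyzed, optimized and physical plans,
// so you can spot unexpected shuffles, cartesian joins or full scans before running the job.
val report = sqlContext.sql(
  "SELECT a.id, count(*) AS cnt FROM table_a a JOIN table_b b ON a.id = b.id GROUP BY a.id")
report.explain(true)

// The same information is available from SQL directly:
sqlContext.sql("EXPLAIN EXTENDED SELECT count(*) FROM table_a").show(100, false)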




 
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu  wrote:
> 
> the sql logic in the program is very much complex , so do not describe the 
> detailed codes   here . 
> 
> 
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu  
> wrote:
> 
> 
> Hi All,  
> 
> Here we have one application, it needs to extract different columns from 6 
> hive tables, and then does some easy calculation, there is around 100,000 
> number of rows in each table,
> finally need to output another table or file (with format of consistent 
> columns) .
> 
>  However, after lots of days trying, the spark hive job is unthinkably slow - 
> sometimes almost frozen. There is 5 nodes for spark cluster. 
>  
> Could anyone offer some help, some idea or clue is also good. 
> 
> Thanks in advance~
> 
> Zhiliang 
> 
> 



Re: Inode for STS

2016-07-18 Thread Chanh Le
Hi Ayan,
It seems you are referring to this:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.start.cleanup.scratchdir

It is set to false by default.





> On Jul 13, 2016, at 5:01 PM, ayan guha  wrote:
> 
> Thanks Christophe. Any comment from Spark dev community member would really 
> helpful on the Jira.
> 
> What I saw today is shutting down the thrift server process lead to a clean 
> up. Also, we started removing any empty folders from /tmp. Is there any other 
> or better method? 
> 
> On Wed, Jul 13, 2016 at 5:25 PM, Christophe Préaud 
> > wrote:
> Hi Ayan,
> 
> I have opened a JIRA about this issues, but there are no answer so far: 
> SPARK-15401 
> 
> Regards,
> Christophe.
> 
> 
> On 13/07/16 05:54, ayan guha wrote:
>> Hi
>> 
>> We are running Spark Thrift Server as a long running application. However,  
>> it looks like it is filling up /tmp/hive folder with lots of small files and 
>> directories with no file in them, blowing out inode limit and preventing any 
>> connection with "No Space Left in Device" issue. 
>> 
>> What is the best way to clean up those small files periodically? 
>> 
>> -- 
>> Best Regards,
>> Ayan Guha
> 
> 
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 158 Ter Rue du Temple 75003 Paris
> 425 093 069 RCS Paris
> 
> Ce message et les pièces jointes sont confidentiels et établis à l'attention 
> exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
> message, merci de le détruire et d'en avertir l'expéditeur.
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-17 Thread Chanh Le
Hi Ayan,
I succeeded with Tableau, but I still can't import metadata from Hive into Oracle BI.

Does that mean Oracle BI still can't connect to STS?

Regards,
Chanh




> On Jul 15, 2016, at 11:44 AM, ayan guha <guha.a...@gmail.com> wrote:
> 
> Its possible that transfar protocols are not matching, thats what Simba is 
> complaining about. Try to change the protocol to SASL?
> 
> On Fri, Jul 15, 2016 at 1:20 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Ayan,
> Thanks. I got it.
> Did you have any problem when connecting Oracle BI with STS?
> 
> I have some error 
> 
> 
> If I use Tableau 
> 
> 
> 
>> On Jul 15, 2016, at 10:03 AM, ayan guha <guha.a...@gmail.com 
>> <mailto:guha.a...@gmail.com>> wrote:
>> 
>> This looks like a Spark code. I am not sure if this is what you intended to 
>> use with STS? I think it should be a insert overwrite command (SQL)
>> 
>> On Fri, Jul 15, 2016 at 12:22 PM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Ayan,
>> 
>> Spark Code:
>> df.write.mode(SaveMode.Overwrite).
>>   parquet(dataPath)
>> I overwrite current data frame every hour to update data.
>> dataPath is /etl_info/WEBSITE 
>> So that mean the part-xxx file will change every hour as well.
>> This also happen on Spark if I register as tempTable and I need to drop and 
>> recreate every time data was changed.
>> 
>> Regards,
>> Chanh
>> 
>> 
>> 
>>> On Jul 14, 2016, at 10:39 PM, ayan guha <guha.a...@gmail.com 
>>> <mailto:guha.a...@gmail.com>> wrote:
>>> 
>>> Can you kindly share essential parts your spark code?
>>> 
>>> On 14 Jul 2016 21:12, "Chanh Le" <giaosu...@gmail.com 
>>> <mailto:giaosu...@gmail.com>> wrote:
>>> Hi Ayan,
>>> Thank you for your suggestion. I switch to use STS for primary and Zeppelin 
>>> just call through JDBC.
>>> But my data is a bit update data so Every hour I need to rewrite of some 
>>> data and therefore it has a issue with parquet file. 
>>> Scenario is: I have parquet file NETWORK and each one hour I need overwrite 
>>> that to update a new one and that gonna happened. 
>>> Caused by: alluxio.exception.FileDoesNotExistException: Path 
>>> /etl_info/WEBSITE/part-r-1-a26015c3-6de5-4a10-bda8-d985d7241953.snappy.parquet
>>>  does not exist.
>>> So I need to drop old table and create a new one. 
>>> 
>>> Is there anyway to work around this?
>>> 
>>> Regards,
>>> Chanh
>>> 
>>> 
>>>> On Jul 14, 2016, at 11:36 AM, ayan guha <guha.a...@gmail.com 
>>>> <mailto:guha.a...@gmail.com>> wrote:
>>>> 
>>>> Hi
>>>> 
>>>> Thanks for the information. However, I still strongly believe you should 
>>>> be able to set URI in STS hive site xml and then create table through 
>>>> JDBC. 
>>>> 
>>>> On Thu, Jul 14, 2016 at 1:49 PM, Chanh Le <giaosu...@gmail.com 
>>>> <mailto:giaosu...@gmail.com>> wrote:
>>>> Hi Ayan,
>>>> I found that Zeppelin 0.6.0 can’t set hive.metastore.warehouse.dir 
>>>> property in somehow. 
>>>> Because I change the hive-site.xml inside $ZEP_DIR/conf/hive-site.xml for 
>>>> other config it works for instance I changed hive.metastore.metadb.dir 
>>>> it’s ok but only for hive.metastore.warehouse.dir didn’t work.
>>>> 
>>>> This is wired.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Jul 13, 2016, at 4:53 PM, ayan guha <guha.a...@gmail.com 
>>>>> <mailto:guha.a...@gmail.com>> wrote:
>>>>> 
>>>>> I would suggest you to restart zeppelin and STS. 
>>>>> 
>>>>> On Wed, Jul 13, 2016 at 6:35 PM, Chanh Le <giaosu...@gmail.com 
>>>>> <mailto:giaosu...@gmail.com>> wrote:
>>>>> Hi Ayan,
>>>>> 
>>>>> I don’t know I did something wrong but still couldn’t set 
>>>>> hive.metastore.warehouse.dir property.
>>>>> 
>>>>> I set 3 hive-site.xml files in spark location, zeppelin, hive as well but 
>>>>> still didn’t work.
>>>>> 
>>>>> zeppeline/conf/hive-site.xml 
>>>>> spark/conf/hive-site.xml
>>>>> hive/conf/hive-site.xml
>>>>> 
>>>>> 
>>>>>

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Chanh Le
Can you show me the Spark UI -> executors tab and storage tab?
They will show us how many executors were launched and how much memory is used for caching.

 


> On Jul 14, 2016, at 9:49 AM, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> I use it as a standalone cluster.
> 
> I run it through start-master, then start-slave. I only have one slave now, 
> but I will probably have a few soon.
> 
> The "application" is run on a separate box.
> 
> When everything was running on my mac, i was in local mode, but i never setup 
> anything in local mode. Going "production" was a little more complex that I 
> thought.
> 
>> On Jul 13, 2016, at 10:35 PM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> 
>> Hi Jean,
>> How do you run your Spark Application? Local Mode, Cluster Mode? 
>> If you run in local mode did you use —driver-memory and —executor-memory 
>> because in local mode your setting about executor and driver didn’t work 
>> that you expected.
>> 
>> 
>> 
>> 
>>> On Jul 14, 2016, at 8:43 AM, Jean Georges Perrin <j...@jgp.net 
>>> <mailto:j...@jgp.net>> wrote:
>>> 
>>> Looks like replacing the setExecutorEnv() by set() did the trick... let's 
>>> see how fast it'll process my 50x 10ˆ15 data points...
>>> 
>>>> On Jul 13, 2016, at 9:24 PM, Jean Georges Perrin <j...@jgp.net 
>>>> <mailto:j...@jgp.net>> wrote:
>>>> 
>>>> I have added:
>>>> 
>>>>SparkConf conf = new 
>>>> SparkConf().setAppName("app").setExecutorEnv("spark.executor.memory", "8g")
>>>>.setMaster("spark://10.0.100.120:7077 
>>>> ");
>>>> 
>>>> but it did not change a thing
>>>> 
>>>>> On Jul 13, 2016, at 9:14 PM, Jean Georges Perrin <j...@jgp.net 
>>>>> <mailto:j...@jgp.net>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have a Java memory issue with Spark. The same application working on my 
>>>>> 8GB Mac crashes on my 72GB Ubuntu server...
>>>>> 
>>>>> I have changed things in the conf file, but it looks like Spark does not 
>>>>> care, so I wonder if my issues are with the driver or executor.
>>>>> 
>>>>> I set:
>>>>> 
>>>>> spark.driver.memory 20g
>>>>> spark.executor.memory   20g
>>>>> And, whatever I do, the crash is always at the same spot in the app, 
>>>>> which makes me think that it is a driver problem.
>>>>> 
>>>>> The exception I get is:
>>>>> 
>>>>> 16/07/13 20:36:30 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 
>>>>> 208, micha.nc.rr.com): java.lang.OutOfMemoryError: Java heap space
>>>>> at java.nio.HeapCharBuffer.(HeapCharBuffer.java:57)
>>>>> at java.nio.CharBuffer.allocate(CharBuffer.java:335)
>>>>> at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:810)
>>>>> at org.apache.hadoop.io.Text.decode(Text.java:412)
>>>>> at org.apache.hadoop.io.Text.decode(Text.java:389)
>>>>> at org.apache.hadoop.io.Text.toString(Text.java:280)
>>>>> at 
>>>>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
>>>>> at 
>>>>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
>>>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>>> at 
>>>>> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>>>> at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>>>> at 
>>>>> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>>>> at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>>>> at 
>>&

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Chanh Le
Hi Jean,
How do you run your Spark application: local mode or cluster mode?
If you run in local mode, did you pass --driver-memory and --executor-memory to spark-submit? In local mode, the executor and driver settings you make inside the application do not work the way you expect.
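
A minimal sketch of the distinction (the master URL and memory sizes are just examples): setExecutorEnv() only sets environment variables on the executors, while memory limits are Spark configuration and belong in set(...) or on the spark-submit command line.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("app")
  .setMaster("spark://10.0.100.120:7077")
  .set("spark.executor.memory", "8g")   // use set(...), not setExecutorEnv("spark.executor.memory", "8g")

// spark.driver.memory has to be known before the driver JVM starts, so in local mode
// pass it on the command line instead of setting it from inside the application, e.g.:
//   spark-submit --driver-memory 8g --executor-memory 8g ...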




> On Jul 14, 2016, at 8:43 AM, Jean Georges Perrin  wrote:
> 
> Looks like replacing the setExecutorEnv() by set() did the trick... let's see 
> how fast it'll process my 50x 10ˆ15 data points...
> 
>> On Jul 13, 2016, at 9:24 PM, Jean Georges Perrin > > wrote:
>> 
>> I have added:
>> 
>>  SparkConf conf = new 
>> SparkConf().setAppName("app").setExecutorEnv("spark.executor.memory", "8g")
>>  .setMaster("spark://10.0.100.120:7077 
>> ");
>> 
>> but it did not change a thing
>> 
>>> On Jul 13, 2016, at 9:14 PM, Jean Georges Perrin >> > wrote:
>>> 
>>> Hi,
>>> 
>>> I have a Java memory issue with Spark. The same application working on my 
>>> 8GB Mac crashes on my 72GB Ubuntu server...
>>> 
>>> I have changed things in the conf file, but it looks like Spark does not 
>>> care, so I wonder if my issues are with the driver or executor.
>>> 
>>> I set:
>>> 
>>> spark.driver.memory 20g
>>> spark.executor.memory   20g
>>> And, whatever I do, the crash is always at the same spot in the app, which 
>>> makes me think that it is a driver problem.
>>> 
>>> The exception I get is:
>>> 
>>> 16/07/13 20:36:30 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 208, 
>>> micha.nc.rr.com): java.lang.OutOfMemoryError: Java heap space
>>> at java.nio.HeapCharBuffer.(HeapCharBuffer.java:57)
>>> at java.nio.CharBuffer.allocate(CharBuffer.java:335)
>>> at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:810)
>>> at org.apache.hadoop.io.Text.decode(Text.java:412)
>>> at org.apache.hadoop.io.Text.decode(Text.java:389)
>>> at org.apache.hadoop.io.Text.toString(Text.java:280)
>>> at 
>>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
>>> at 
>>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> at 
>>> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>> at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>> at 
>>> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>> at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>> at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> 
>>> I have set a small memory "dumper" in my app. At the beginning, it says:
>>> 
>>> **  Free . 1,413,566
>>> **  Allocated  1,705,984
>>> **  Max .. 16,495,104
>>> **> Total free ... 16,202,686
>>> Just before the crash, it says:
>>> 
>>> **  Free . 1,461,633
>>> **  Allocated  1,786,880
>>> **  Max .. 16,495,104
>>> **> Total free ... 16,169,857
>>> 
>>> 
>>> 
>>> 
>> 
> 



Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-13 Thread Chanh Le
Hi Ayan,

I don't know if I did something wrong, but I still couldn't set the hive.metastore.warehouse.dir property.

I placed hive-site.xml in three locations (Spark, Zeppelin, and Hive) but it still didn't work.

zeppeline/conf/hive-site.xml 
spark/conf/hive-site.xml
hive/conf/hive-site.xml



My hive-site.xml



<property>
  <name>hive.metastore.metadb.dir</name>
  <value>alluxio://master1:19998/metadb</value>
  <description>
    Required by metastore server or if the uris argument below is not supplied
  </description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>alluxio://master1:19998/warehouse</value>
  <description>
    Required by metastore server or if the uris argument below is not supplied
  </description>
</property>


Is there anything I can do?

Regards,
Chanh



> On Jul 13, 2016, at 12:43 PM, ayan guha <guha.a...@gmail.com> wrote:
> 
> Hi
> 
> Create table always happens through Hive. In Hive, when you create a 
> database, the default metadata storage location is driven by 
> hive.metastore.metadb.dir and data storage is driven by 
> hive.metastore.warehouse.dir property (set in hive site xml). So, you do not 
> need to set this property in Zeppelin. 
> 
> What you can do:
>a. Modify  hive-site.xml to include those properties, if they are 
> not already set.  use the same hive site.xml to run STS. Then connect through 
> JDBC, create table and you should find metadata & data in your desired 
> location. 
> b. I think you can set these properties (same way you'd do in hive cli)
> c. You can create tables/databases with a LOCATION clause,  in case you need 
> to use non-standard path. 
> 
> Best
> Ayan
> 
> On Wed, Jul 13, 2016 at 3:20 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Ayan,
> Thank you for replying. 
> But I wanna create a table in Zeppelin and store the metadata in Alluxio like 
> I tried to do set hive.metastore.warehouse.dir=alluxio://master1:19998/metadb 
> <>   <>So I can share data with STS.
> 
> The way you’ve mentioned through JDBC I already did and it works but I can’t 
> create table in Spark way easily.
> 
> Regards,
> Chanh
> 
> 
>> On Jul 13, 2016, at 12:06 PM, ayan guha <guha.a...@gmail.com 
>> <mailto:guha.a...@gmail.com>> wrote:
>> 
>> HI
>> 
>> I quickly tried with available hive interpreter 
>> 
>> 
>> 
>> Please try similarly. 
>> 
>> I will try with jdbc interpreter but I need to add it to zeppelin :)
>> 
>> Best
>> Ayan
>> 
>> On Wed, Jul 13, 2016 at 1:53 PM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Ayan,
>> How to set hive metastore in Zeppelin. I tried but not success.
>> The way I do I add into Spark Interpreter
>> 
>> 
>> 
>> And also try in a notebook by 
>> %sql
>> set hive.metastore.metadb.dir=alluxio://master1:19998/metadb
>> 
>> %sql 
>> set hive.metastore.warehouse.dir=alluxio://master1:19998/metadb
>> 
>> %spark
>> sqlContext.setConf("hive.metastore.warehouse.dir", "alluxio://master1:19998/metadb")
>> sqlContext.setConf("hive.metastore.metadb.dir", "alluxio://master1:19998/metadb")
>> sqlContext.read.parquet("alluxio://master1:19998/etl_info/WEBSITE").saveAsTable("tests_5")
>> 
>> But It’s 
>> 
>>> On Jul 11, 2016, at 1:26 PM, ayan guha <guha.a...@gmail.com 
>>> <mailto:guha.a...@gmail.com>> wrote:
>>> 
>>> Hi
>>> 
>>> When you say "Zeppelin and STS", I am assuming you mean "Spark Interpreter" 
>>> and "JDBC interpreter" respectively. 
>>> 
>>> Through Zeppelin, you can either run your own spark application (by using 
>>> Zeppelin's own spark context) using spark interpreter OR you can access 
>>> STS, which  is a spark application ie separate Spark Context using JDBC 
>>> interpreter. There should not be any need for these 2 contexts to coexist. 
>>> 
>>> If you want to share data, save it to hive from either context, and you 
>>> should be able to see the data from other context. 
>>> 
>>> Best
>>> Ayan
>>> 
>>> 
>>> 
>>> On Mon, Jul 11, 2016 at 3:00 PM, Chanh Le <giaosu...@gmail.com 
>>> <mailto:giaosu...@gmail.com>> wrote:
>>> Hi Ayan,
>>> I tested It works fine but one more confuse is If my (technical) users want 
>>> to write some code in zeppelin to apply thing into Hive table? 

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-12 Thread Chanh Le
Hi Ayan,
Thank you for replying. 
But I want to create a table in Zeppelin and store the metadata in Alluxio, as I 
tried to do with set hive.metastore.warehouse.dir=alluxio://master1:19998/metadb, 
so I can share data with STS.

The JDBC approach you mentioned I have already tried and it works, but I can't 
create tables the Spark way as easily.

Regards,
Chanh


> On Jul 13, 2016, at 12:06 PM, ayan guha <guha.a...@gmail.com> wrote:
> 
> HI
> 
> I quickly tried with available hive interpreter 
> 
> 
> 
> Please try similarly. 
> 
> I will try with jdbc interpreter but I need to add it to zeppelin :)
> 
> Best
> Ayan
> 
> On Wed, Jul 13, 2016 at 1:53 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Ayan,
> How to set hive metastore in Zeppelin. I tried but not success.
> The way I do I add into Spark Interpreter
> 
> 
> 
> And also try in a notebook by 
> %sql
> set hive.metastore.metadb.dir=alluxio://master1:19998/metadb
> 
> %sql 
> set hive.metastore.warehouse.dir=alluxio://master1:19998/metadb
> 
> %spark
> sqlContext.setConf("hive.metastore.warehouse.dir", "alluxio://master1:19998/metadb")
> sqlContext.setConf("hive.metastore.metadb.dir", "alluxio://master1:19998/metadb")
> sqlContext.read.parquet("alluxio://master1:19998/etl_info/WEBSITE").saveAsTable("tests_5")
> 
> But It’s 
> 
>> On Jul 11, 2016, at 1:26 PM, ayan guha <guha.a...@gmail.com 
>> <mailto:guha.a...@gmail.com>> wrote:
>> 
>> Hi
>> 
>> When you say "Zeppelin and STS", I am assuming you mean "Spark Interpreter" 
>> and "JDBC interpreter" respectively. 
>> 
>> Through Zeppelin, you can either run your own spark application (by using 
>> Zeppelin's own spark context) using spark interpreter OR you can access STS, 
>> which  is a spark application ie separate Spark Context using JDBC 
>> interpreter. There should not be any need for these 2 contexts to coexist. 
>> 
>> If you want to share data, save it to hive from either context, and you 
>> should be able to see the data from other context. 
>> 
>> Best
>> Ayan
>> 
>> 
>> 
>> On Mon, Jul 11, 2016 at 3:00 PM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Ayan,
>> I tested It works fine but one more confuse is If my (technical) users want 
>> to write some code in zeppelin to apply thing into Hive table? 
>> Zeppelin and STS can’t share Spark Context that mean we need separated 
>> process? Is there anyway to use the same Spark Context of STS?
>> 
>> Regards,
>> Chanh
>> 
>> 
>>> On Jul 11, 2016, at 10:05 AM, Takeshi Yamamuro <linguin@gmail.com 
>>> <mailto:linguin@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> ISTM multiple sparkcontexts are not recommended in spark.
>>> See: https://issues.apache.org/jira/browse/SPARK-2243 
>>> <https://issues.apache.org/jira/browse/SPARK-2243>
>>> 
>>> // maropu
>>> 
>>> 
>>> On Mon, Jul 11, 2016 at 12:01 PM, ayan guha <guha.a...@gmail.com 
>>> <mailto:guha.a...@gmail.com>> wrote:
>>> Hi
>>> 
>>> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on 
>>> YARN for few months now without much issue. 
>>> 
>>> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le <giaosu...@gmail.com 
>>> <mailto:giaosu...@gmail.com>> wrote:
>>> Hi everybody,
>>> We are using Spark to query big data and currently we’re using Zeppelin to 
>>> provide a UI for technical users.
>>> Now we also need to provide a UI for business users so we use Oracle BI 
>>> tools and set up a Spark Thrift Server (STS) for it.
>>> 
>>> When I run both Zeppelin and STS throw error:
>>> 
>>> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4} 
>>> SchedulerFactory.java[jobStarted]:131) - Job 
>>> remoteInterpretJob_1468204821905 started by scheduler 
>>> org.apache.zeppelin.spark.SparkInterpreter835015739
>>>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4} 
>>> Logging.scala[logInfo]:58) - Changing view acls to: giaosudau
>>>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} 
>>> Logging.scala[logInfo]:58) - Changing modify acls to: giaosudau
>>>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} 
>>> Logging.scala[logInfo]:58) - SecurityManager: 

Re: Spark cache behaviour when the source table is modified

2016-07-12 Thread Chanh Le
Hi Anjali,
The cached table is immutable; you can't update data in it. 
The way to refresh the cache is to re-create it (uncache the table, then cache it again).
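
A minimal sketch of what re-creating the cache can look like with a HiveContext in 
Spark 1.x (the table name follows the example in the quoted question below):

  // Drop the stale cached copy, pick up any new files/partitions, then cache again.
  sqlContext.uncacheTable("person")
  sqlContext.refreshTable("person")   // HiveContext: refreshes cached metadata for the table
  sqlContext.cacheTable("person")     // caching is lazy; it materializes on the next action
  sqlContext.sql("SELECT COUNT(*) FROM person").show()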


> On Jun 16, 2016, at 4:24 PM, Anjali Chadha  wrote:
> 
> Hi all,
> 
> I am having a hard time understanding the caching concepts in Spark.
> 
> I have a hive table("person"), which is cached in Spark.
> 
> sqlContext.sql("create table person (name string, age int)") //Create a new 
> table
> //Add some values to the table
> ...
> ...
> //Cache the table in Spark
> sqlContext.cacheTable("person") 
> sqlContext.isCached("person") //Returns true
> sqlContext.sql("insert into table person values ("Foo", 25)") // Insert some 
> other value in the table
> 
> //Check caching status again
> sqlContext.isCached("person") //Returns true
> sqlContext is HiveContext.
> 
> Will the entries inserted after cacheTable("person") statement be cached? In 
> other words, ("Foo", 25) entry is cached in Spark or not?
> 
> If not, how can I cache only the entries inserted later? I don't want to 
> first uncache and then again cache the whole table.
> 
> Any relevant web link or information will be appreciated.
> 
> - Anjali Chadha
> 



Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Chanh Le
Hi Mich,

If I have a stored procedure in Oracle written like this:
SP get Info: 
PKG_ETL.GET_OBJECTS_INFO( 
p_LAST_UPDATED VARCHAR2, 
p_OBJECT_TYPE VARCHAR2, 
p_TABLE OUT SYS_REFCURSOR); 
how can I call it from Spark, given that the output is a cursor (p_TABLE OUT SYS_REFCURSOR)?


Thanks.
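
Spark's JDBC data source only reads tables or SELECT queries, so a REF CURSOR typically 
has to be opened with plain JDBC on the driver first. A hedged sketch follows: the 
procedure signature comes from the snippet above, but the connection details, parameter 
values, and cursor columns are all illustrative.

  import java.sql.{DriverManager, ResultSet}
  import oracle.jdbc.OracleTypes

  val conn = DriverManager.getConnection("jdbc:oracle:thin:@rhes564:1521:mydb", "sh", "xxx")
  val stmt = conn.prepareCall("{ call PKG_ETL.GET_OBJECTS_INFO(?, ?, ?) }")
  stmt.setString(1, "2016-07-11")                  // p_LAST_UPDATED (illustrative value)
  stmt.setString(2, "TABLE")                       // p_OBJECT_TYPE  (illustrative value)
  stmt.registerOutParameter(3, OracleTypes.CURSOR) // p_TABLE OUT SYS_REFCURSOR
  stmt.execute()

  val rs = stmt.getObject(3).asInstanceOf[ResultSet]
  // Assume the cursor returns two string columns; adjust to the real shape.
  val rows = Iterator.continually(rs).takeWhile(_.next())
    .map(r => (r.getString(1), r.getString(2))).toList
  rs.close(); stmt.close(); conn.close()

  // Parallelize the collected rows into a DataFrame for further Spark processing.
  val objectsDF = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("col1", "col2")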


> On Jul 11, 2016, at 4:18 PM, Mark Vervuurt  wrote:
> 
> Thanks Mich,
> 
> we have got it working using the example here under ;)
> 
> Mark
> 
>> On 11 Jul 2016, at 09:45, Mich Talebzadeh > > wrote:
>> 
>> Hi Mark,
>> 
>> Hm. It should work. This is Spark 1.6.1 on Oracle 12c
>>  
>>  
>> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> HiveContext: org.apache.spark.sql.hive.HiveContext = 
>> org.apache.spark.sql.hive.HiveContext@70f446c
>>  
>> scala> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
>>  
>> scala> var _username : String = "sh"
>> _username: String = sh
>>  
>> scala> var _password : String = ""
>> _password: String = sh
>>  
>> scala> val c = HiveContext.load("jdbc",
>>  | Map("url" -> _ORACLEserver,
>>  | "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC 
>> FROM sh.channels)",
>>  | "user" -> _username,
>>  | "password" -> _password))
>> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
>> details
>> c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: string, CHANNEL_DESC: 
>> string]
>>  
>> scala> c.registerTempTable("t_c")
>>  
>> scala> c.count
>> res2: Long = 5
>>  
>> scala> HiveContext.sql("select * from t_c").collect.foreach(println)
>> [3,Direct Sales]
>> [9,Tele Sales]
>> [5,Catalog]
>> [4,Internet]
>> [2,Partners]
>>  
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 11 July 2016 at 08:25, Mark Vervuurt > > wrote:
>> Hi Mich,
>> 
>> sorry for bothering did you manage to solve your problem? We have a similar 
>> problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an 
>> Oracle Database.
>> 
>> Thanks,
>> Mark
>> 
>>> On 12 Feb 2016, at 11:45, Mich Talebzadeh >> > wrote:
>>> 
>>> Hi,
>>>  
>>> I use the following to connect to Oracle DB from Spark shell 1.5.2
>>>  
>>> spark-shell --master spark://50.140.197.217:7077 <> --driver-class-path 
>>> /home/hduser/jars/ojdbc6.jar
>>>  
>>> in Scala I do
>>>  
>>> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>> sqlContext: org.apache.spark.sql.SQLContext = 
>>> org.apache.spark.sql.SQLContext@f9d4387
>>>  
>>> scala> val channels = sqlContext.read.format("jdbc").options(
>>>  |  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>>>  |  "dbtable" -> "(select * from sh.channels where channel_id = 
>>> 14)",
>>>  |  "user" -> "sh",
>>>  |   "password" -> "xxx")).load
>>> channels: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(0,-127), 
>>> CHANNEL_DESC: string, CHANNEL_CLASS: string, CHANNEL_CLASS_ID: 
>>> decimal(0,-127), CHANNEL_TOTAL: string, CHANNEL_TOTAL_ID: decimal(0,-127)]
>>>  
>>> scala> channels.count()
>>>  
>>> But the latter command keeps hanging?
>>>  
>>> Any ideas appreciated
>>>  
>>> Thanks,
>>>  
>>> Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> NOTE: The information in this email is proprietary and confidential. This 
>>> message is for the designated recipient only, if you are not the intended 
>>> recipient, you should destroy it immediately. Any information in this 
>>> message shall not be understood as given or endorsed by Peridale Technology 
>>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>>> the responsibility of the recipient to ensure that this email is virus 
>>> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>>> employees accept any responsibility.
>> 
>> Met vriendelijke groet | Best regards,
>> 

Re: Zeppelin Spark with Dynamic Allocation

2016-07-11 Thread Chanh Le
Hi Tamas,
I am using Spark 1.6.1.
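
For context, a hedged sketch of the settings dynamic allocation needs in Spark 1.6 
(values are illustrative); they can go in spark-defaults.conf or the Zeppelin spark 
interpreter properties, and on Mesos the external shuffle service also has to be 
running on each agent:

  spark.dynamicAllocation.enabled              true
  spark.shuffle.service.enabled                true
  spark.dynamicAllocation.minExecutors         1
  spark.dynamicAllocation.maxExecutors         10
  spark.dynamicAllocation.executorIdleTimeout  60s
  # On Mesos, start the external shuffle service on every agent first,
  # e.g. with the sbin/start-mesos-shuffle-service.sh script shipped with Spark.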





> On Jul 11, 2016, at 3:24 PM, Tamas Szuromi <tamas.szur...@odigeo.com> wrote:
> 
> Hello,
> 
> What spark version do you use? I have the same issue with Spark 1.6.1 and 
> there is a ticket somewhere.
> 
> cheers,
> 
> 
> 
> 
> Tamas Szuromi
> Data Analyst
> Skype: tromika
> E-mail: tamas.szur...@odigeo.com <mailto:n...@odigeo.com>
> 
> ODIGEO Hungary Kft.
> 1066 Budapest
> Weiner Leó u. 16.
> www.liligo.com  <http://www.liligo.com/>
> check out our newest video  <http://www.youtube.com/user/liligo>
> 
> 
> 
> On 11 July 2016 at 10:09, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi everybody,
> I am testing zeppelin with dynamic allocation but seem it’s not working.
> 
> 
> 
> 
> 
> 
> 
> Logs I received I saw that Spark Context was created successfully and task 
> was running but after that was terminated.
> Any ideas on that?
> Thanks.
> 
> 
> 
>  INFO [2016-07-11 15:03:40,096] ({Thread-0} 
> RemoteInterpreterServer.java[run]:81) - Starting remote interpreter server on 
> port 24994
>  INFO [2016-07-11 15:03:40,471] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkInterpreter
>  INFO [2016-07-11 15:03:40,521] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.PySparkInterpreter
>  INFO [2016-07-11 15:03:40,526] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkRInterpreter
>  INFO [2016-07-11 15:03:40,528] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
>  INFO [2016-07-11 15:03:40,531] ({pool-1-thread-2} 
> RemoteInterpreterServer.java[createInterpreter]:169) - Instantiate 
> interpreter org.apache.zeppelin.spark.DepInterpreter
>  INFO [2016-07-11 15:03:40,563] ({pool-2-thread-5} 
> SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468224220562 
> started by scheduler org.apache.zeppelin.spark.SparkInterpreter998491254
>  WARN [2016-07-11 15:03:41,559] ({pool-2-thread-5} 
> NativeCodeLoader.java[]:62) - Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
>  INFO [2016-07-11 15:03:41,703] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing view acls to: root
>  INFO [2016-07-11 15:03:41,704] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing modify acls to: root
>  INFO [2016-07-11 15:03:41,708] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(root); users with modify permissions: Set(root)
>  INFO [2016-07-11 15:03:41,977] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Starting HTTP Server
>  INFO [2016-07-11 15:03:42,029] ({pool-2-thread-5} Server.java[doStart]:272) 
> - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 15:03:42,047] ({pool-2-thread-5} 
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0 
> <mailto:SocketConnector@0.0.0.0>:53313
>  INFO [2016-07-11 15:03:42,048] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Successfully started service 'HTTP class server' on port 53313.
>  INFO [2016-07-11 15:03:43,978] ({pool-2-thread-5} 
> SparkInterpreter.java[createSparkContext]:233) - -- Create new 
> SparkContext mesos://zk://master1:2181,master2:2181,master3:2181/mesos <> 
> ---
>  INFO [2016-07-11 15:03:44,003] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Running Spark version 1.6.1
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing view acls to: root
>  INFO [2016-07-11 15:03:44,036] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Changing modify acls to: root
>  INFO [2016-07-11 15:03:44,037] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(root); users with modify permissions: Set(root)
>  INFO [2016-07-11 15:03:44,231] ({pool-2-thread-5} Logging.scala[logInfo]:58) 
> - Successfully started service 'sparkDriver' on port 33913.
>  INFO [2016-07-11 15:03:44,552] 
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4} 
> Slf4jLogger.scala[applyOrElse]:80) - Slf4jLogger started
>  INFO [2016-07-11 15:03:44,597] 
> ({sparkDriverActorSystem-akka.actor.default-dispatcher-4} 
> Slf4jLogger.scala[apply$mcV$sp]:74) - Starting remoting
>  INFO [2016-07-11 15:03:44,754] 

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Chanh Le
Hi Ayan,
I tested it and it works fine, but one thing still confuses me: what if my (technical) users want to 
write some code in Zeppelin that applies changes to a Hive table? 
Zeppelin and STS can't share a SparkContext, so does that mean we need separate processes? 
Is there any way to use the same SparkContext as STS?

Regards,
Chanh
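
A minimal sketch of the sharing pattern Ayan describes in the quoted reply (two separate 
contexts, one shared Hive metastore); the table name is illustrative and both processes 
must point at the same hive-site.xml:

  // In Zeppelin's %spark interpreter (Zeppelin's own SparkContext/HiveContext):
  val df = sqlContext.read.parquet("alluxio://master1:19998/etl_info/WEBSITE")
  df.write.mode("overwrite").saveAsTable("shared_website")   // lands in the shared metastore

  // The table is then visible to STS (a different SparkContext) through any JDBC client:
  //   beeline -u jdbc:hive2://sts-host:10000 -e "SELECT COUNT(*) FROM shared_website"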


> On Jul 11, 2016, at 10:05 AM, Takeshi Yamamuro <linguin@gmail.com> wrote:
> 
> Hi,
> 
> ISTM multiple sparkcontexts are not recommended in spark.
> See: https://issues.apache.org/jira/browse/SPARK-2243 
> <https://issues.apache.org/jira/browse/SPARK-2243>
> 
> // maropu
> 
> 
> On Mon, Jul 11, 2016 at 12:01 PM, ayan guha <guha.a...@gmail.com 
> <mailto:guha.a...@gmail.com>> wrote:
> Hi
> 
> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on 
> YARN for few months now without much issue. 
> 
> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi everybody,
> We are using Spark to query big data and currently we’re using Zeppelin to 
> provide a UI for technical users.
> Now we also need to provide a UI for business users so we use Oracle BI tools 
> and set up a Spark Thrift Server (STS) for it.
> 
> When I run both Zeppelin and STS throw error:
> 
> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4} 
> SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468204821905 
> started by scheduler org.apache.zeppelin.spark.SparkInterpreter835015739
>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing view acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing modify acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(giaosudau); users with modify permissions: Set(giaosudau)
>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Starting HTTP Server
>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4} Server.java[doStart]:272) 
> - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4} 
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:54818 
> <http://SocketConnector@0.0.0.0:54818/>
>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Successfully started service 'HTTP class server' on port 54818.
>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4} 
> SparkInterpreter.java[createSparkContext]:233) - -- Create new 
> SparkContext local[*] ---
>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4} 
> Logging.scala[logWarning]:70) - Another SparkContext is being constructed (or 
> threw an exception in its constructor).  This may indicate an error, since 
> only one SparkContext may be running in this JVM (see SPARK-2243). The other 
> SparkContext was created at:
> 
> Is that mean I need to setup allow multiple context? Because It’s only test 
> in local with local mode If I deploy on mesos cluster what would happened?
> 
> Need you guys suggests some solutions for that. Thanks.
> 
> Chanh
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro



Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Chanh Le
Hi Ayan,

That is a brilliant idea. Thank you very much. I will try it that way.

Regards,
Chanh


> On Jul 11, 2016, at 10:01 AM, ayan guha <guha.a...@gmail.com> wrote:
> 
> Hi
> 
> Can you try using JDBC interpreter with STS? We are using Zeppelin+STS on 
> YARN for few months now without much issue. 
> 
> On Mon, Jul 11, 2016 at 12:48 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi everybody,
> We are using Spark to query big data and currently we’re using Zeppelin to 
> provide a UI for technical users.
> Now we also need to provide a UI for business users so we use Oracle BI tools 
> and set up a Spark Thrift Server (STS) for it.
> 
> When I run both Zeppelin and STS throw error:
> 
> INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4} 
> SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468204821905 
> started by scheduler org.apache.zeppelin.spark.SparkInterpreter835015739
>  INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing view acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Changing modify acls to: giaosudau
>  INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - SecurityManager: authentication disabled; ui acls disabled; users with view 
> permissions: Set(giaosudau); users with modify permissions: Set(giaosudau)
>  INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Starting HTTP Server
>  INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4} Server.java[doStart]:272) 
> - jetty-8.y.z-SNAPSHOT
>  INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4} 
> AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:54818 
> <http://SocketConnector@0.0.0.0:54818/>
>  INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4} Logging.scala[logInfo]:58) 
> - Successfully started service 'HTTP class server' on port 54818.
>  INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4} 
> SparkInterpreter.java[createSparkContext]:233) - -- Create new 
> SparkContext local[*] ---
>  WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4} 
> Logging.scala[logWarning]:70) - Another SparkContext is being constructed (or 
> threw an exception in its constructor).  This may indicate an error, since 
> only one SparkContext may be running in this JVM (see SPARK-2243). The other 
> SparkContext was created at:
> 
> Is that mean I need to setup allow multiple context? Because It’s only test 
> in local with local mode If I deploy on mesos cluster what would happened?
> 
> Need you guys suggests some solutions for that. Thanks.
> 
> Chanh
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Chanh Le
Hi everybody,
We are using Spark to query big data and currently we’re using Zeppelin to 
provide a UI for technical users.
Now we also need to provide a UI for business users so we use Oracle BI tools 
and set up a Spark Thrift Server (STS) for it.

When I run both Zeppelin and STS, it throws this error:

INFO [2016-07-11 09:40:21,905] ({pool-2-thread-4} 
SchedulerFactory.java[jobStarted]:131) - Job remoteInterpretJob_1468204821905 
started by scheduler org.apache.zeppelin.spark.SparkInterpreter835015739
 INFO [2016-07-11 09:40:21,911] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
Changing view acls to: giaosudau
 INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
Changing modify acls to: giaosudau
 INFO [2016-07-11 09:40:21,912] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
SecurityManager: authentication disabled; ui acls disabled; users with view 
permissions: Set(giaosudau); users with modify permissions: Set(giaosudau)
 INFO [2016-07-11 09:40:21,918] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
Starting HTTP Server
 INFO [2016-07-11 09:40:21,919] ({pool-2-thread-4} Server.java[doStart]:272) - 
jetty-8.y.z-SNAPSHOT
 INFO [2016-07-11 09:40:21,920] ({pool-2-thread-4} 
AbstractConnector.java[doStart]:338) - Started SocketConnector@0.0.0.0:54818
 INFO [2016-07-11 09:40:21,922] ({pool-2-thread-4} Logging.scala[logInfo]:58) - 
Successfully started service 'HTTP class server' on port 54818.
 INFO [2016-07-11 09:40:22,408] ({pool-2-thread-4} 
SparkInterpreter.java[createSparkContext]:233) - -- Create new SparkContext 
local[*] ---
 WARN [2016-07-11 09:40:22,411] ({pool-2-thread-4} 
Logging.scala[logWarning]:70) - Another SparkContext is being constructed (or 
threw an exception in its constructor).  This may indicate an error, since only 
one SparkContext may be running in this JVM (see SPARK-2243). The other 
SparkContext was created at:

Does that mean I need to allow multiple contexts? This was only a test locally in 
local mode; if I deploy on a Mesos cluster, what would happen?

Could you suggest some solutions for this? Thanks.

Chanh
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: problem making Zeppelin 0.6 work with Spark 1.6.1, throwing jackson.databind.JsonMappingException exception

2016-07-09 Thread Chanh Le
Hi,
This is weird, because I have been using Zeppelin since version 0.5.6 and upgraded to 
0.6.0 a couple of days ago, and both work fine with Spark 1.6.1.

For 0.6.0 I am using zeppelin-0.6.0-bin-netinst.
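
For what it's worth, one common setup with the -netinst build is to point Zeppelin at 
an external Spark installation via conf/zeppelin-env.sh, which keeps Zeppelin on that 
Spark's own jars (including Jackson). A hedged sketch, with the paths and master URL 
purely illustrative:

  # conf/zeppelin-env.sh
  export SPARK_HOME=/opt/spark-1.6.1-bin-hadoop2.6
  export MASTER=mesos://zk://master1:2181,master2:2181,master3:2181/mesos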



> On Jul 9, 2016, at 9:25 PM, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I just installed the latest Zeppelin 0.6 as follows:
> 
> Source: zeppelin-0.6.0-bin-all
> 
> With Spark 1.6.1
> 
> 
> Now I am getting this issue with jackson. I did some search  that suggested 
> this is caused by the classpath providing you with a different version of 
> Jackson than the one Spark is expecting. However, no luck yet. With Spark 
> 1.5.2 and the previous version of Zeppelin namely 0.5.6-incubating  it used 
> to work without problem.
> 
> Any ideas will be appreciated
> 
> com.fasterxml.jackson.databind.JsonMappingException: Could not find creator 
> property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
>  at [Source: {"id":"14","name":"ExecutedCommand"}; line: 1, column: 1]
>   at 
> com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148)
>   at 
> com.fasterxml.jackson.databind.DeserializationContext.mappingException(DeserializationContext.java:843)
>   at 
> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.addBeanProps(BeanDeserializerFactory.java:533)
>   at 
> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:220)
>   at 
> com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:143)
>   at 
> com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:409)
>   at 
> com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:358)
>   at 
> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:265)
>   at 
> com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:245)
>   at 
> com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:143)
>   at 
> com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:439)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3666)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3558)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2578)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.fromJson(RDDOperationScope.scala:85)
>   at 
> org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:136)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
>   at org.apache.spark.SparkContext.parallelize(SparkContext.scala:728)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>   at $iwC$$iwC$$iwC.<init>(<console>:47)
>   at $iwC$$iwC.<init>(<console>:49)
>   at $iwC.<init>(<console>:51)
>   at <init>(<console>:53)
>   at .<init>(<console>:57)
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or 

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-08 Thread Chanh Le
Hi Gene,
Thanks for your support. I agree with you that it is because of the number of executors, but many 
parquet files hurt read performance, so I need a way to improve that. The 
way I work around it is:
  df.coalesce(1)
  .write.mode(SaveMode.Overwrite).partitionBy("network_id")
  .parquet(s"$alluxioURL/$outFolderName/time=${dailyFormat.print(jobRunTime)}")
I know this is not ideal because it creates a shuffle and costs time, but reads 
improve a lot. Right now, I am using that method to partition my data.

Regards,
Chanh
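
As a possible middle ground (a sketch only, reusing the column and variable names from 
the snippet above): repartitioning by the partition column before the write keeps the 
write parallel while still giving roughly one file per network_id directory, instead of 
funnelling everything through the single task that coalesce(1) creates.

  import org.apache.spark.sql.SaveMode
  import org.apache.spark.sql.functions.col

  df.repartition(col("network_id"))   // one shuffle; all rows for a network_id land in one task
    .write.mode(SaveMode.Overwrite)
    .partitionBy("network_id")
    .parquet(s"$alluxioURL/$outFolderName/time=${dailyFormat.print(jobRunTime)}")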


> On Jul 8, 2016, at 8:33 PM, Gene Pang <gene.p...@gmail.com> wrote:
> 
> Hi Chanh,
> 
> You should be able to set the Alluxio block size with:
> 
> sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb")
> 
> I think you have many parquet files because you have many Spark executors 
> writing out their partition of the files.
> 
> Hope that helps,
> Gene
> 
> On Sun, Jul 3, 2016 at 8:02 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Gene,
> Could you give some suggestions on that?
> 
> 
> 
>> On Jul 1, 2016, at 5:31 PM, Ted Yu <yuzhih...@gmail.com 
>> <mailto:yuzhih...@gmail.com>> wrote:
>> 
>> The comment from zhangxiongfei was from a year ago.
>> 
>> Maybe something changed since them ?
>> 
>> On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Ted,
>> I set sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
>> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>> but It seems not working.
>> 
>> 
>> 
>> 
>>> On Jul 1, 2016, at 11:38 AM, Ted Yu <yuzhih...@gmail.com 
>>> <mailto:yuzhih...@gmail.com>> wrote:
>>> 
>>> Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is 
>>> in use.
>>> 
>>> FYI
>>> 
>>> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma <deepakmc...@gmail.com 
>>> <mailto:deepakmc...@gmail.com>> wrote:
>>> Ok.
>>> I came across this issue.
>>> Not sure if you already assessed this:
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921 
>>> <https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921>
>>> The workaround mentioned may work for you .
>>> 
>>> Thanks
>>> Deepak
>>> 
>>> On 1 Jul 2016 9:34 am, "Chanh Le" <giaosu...@gmail.com 
>>> <mailto:giaosu...@gmail.com>> wrote:
>>> Hi Deepark,
>>> Thank for replying. The way to write into alluxio is 
>>> df.write.mode(SaveMode.Append).partitionBy("network_id", 
>>> "time").parquet("alluxio://master1:1/FACT_ADMIN_HOURLY <>”)
>>> 
>>> 
>>> I partition by 2 columns and store. I just want when I write it automatic 
>>> write a size properly for what I already set in Alluxio 512MB per block.
>>> 
>>> 
>>>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmc...@gmail.com 
>>>> <mailto:deepakmc...@gmail.com>> wrote:
>>>> 
>>>> Before writing coalesing your rdd to 1 .
>>>> It will create only 1 output file .
>>>> Multiple part file happens as all your executors will be writing their 
>>>> partitions to separate part files.
>>>> 
>>>> Thanks
>>>> Deepak
>>>> 
>>>> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosu...@gmail.com 
>>>> <mailto:giaosu...@gmail.com>> wrote:
>>>> Hi everyone,
>>>> I am using Alluxio for storage. But I am little bit confuse why I am do 
>>>> set block size of alluxio is 512MB and my file part only few KB and too 
>>>> many part.
>>>> Is that normal? Because I want to read it fast? Is that many part effect 
>>>> the read operation?
>>>> How to set the size of file part?
>>>> 
>>>> Thanks.
>>>> Chanh
>>>> 
>>>> 
>>>> 
>>>>  
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 



Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Chanh Le
Hi Ayan,

Thanks for replying. That sounds great. Let me check. 
One thing that confuses me: is there any way to share things between the two, I mean Zeppelin 
and the Thrift Server? For example, I update or insert data into a table from Zeppelin, and an 
external tool connected through STS can see it.

Thanks & regards,
Chanh
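
For reference, a hedged sketch of how the Thrift Server is usually started and queried 
(host, port, and master URL are illustrative); it picks up whatever hive-site.xml sits 
in spark/conf, which is what makes tables saved from Zeppelin visible to it:

  # Start STS from the same Spark installation / hive-site.xml that Zeppelin uses
  ./sbin/start-thriftserver.sh \
    --master mesos://zk://master1:2181,master2:2181,master3:2181/mesos

  # Connect from beeline (or any JDBC/ODBC BI tool) and query the shared tables
  ./bin/beeline -u jdbc:hive2://localhost:10000 -e "SHOW TABLES"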

> On Jul 8, 2016, at 11:21 AM, ayan guha <guha.a...@gmail.com> wrote:
> 
> Hi
> 
> Spark Thrift does not need Hive/hadoop. STS should be your first choice if 
> you are planning to integrate BI tools with Spark. It works with Zeppelin as 
> well. We do all our development using Zeppelin and STS. 
> 
> One thing to note: many BI tools like Qliksense, Tablaue (not sure of oracle 
> Bi Tool) queires and the caches data on client side. This works really well 
> in real life. 
> 
> 
> On Fri, Jul 8, 2016 at 1:58 PM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Mich,
> Thanks for replying. Currently we think we need to separate 2 groups of user. 
> 1. Technical: Can write SQL 
> 2. Business: Can drag and drop fields or metrics and see the result.
> Our stack using Zeppeline, Spark SQL to query data from Alluxio. Our data 
> current store in parquet files. Zeppelin is using HiveContext but we haven’t 
> set up Hive and Hadoop yet. 
> 
> I am little bit confuse in Spark Thift Server because Thift Server in Spark 
> can allow external tools connect but is that require to set up Hive and 
> Hadoop?
> 
> Thanks and regards,
> Chanh
> 
> 
> 
>> On Jul 8, 2016, at 10:49 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> hi,
>> 
>> I have not used Alluxio but it is a distributed file system much like an 
>> IMDB say Oracle TimesTen. Spark is your query tool and Zeppelin is the GUI 
>> interface to your Spark which basically allows you graphs with Spark queries.
>> 
>> You mentioned Hive so I assume your persistent storage is Hive?
>> 
>> Your business are using Oracle BI tool. It is like Tableau. I assume Oracle 
>> BI tool accesses a database of some sort say Oracle DW using native 
>> connectivity and it may also have ODBC and JDBC connections to Hive etc.
>> 
>> The issue I see here is your GUI tool Zeppelin which does the same thing as 
>> Oracle BI tool. Can you please clarify below:
>> 
>> you use Hive as your database/persistent storage and use Alluxio on top of 
>> Hive?
>> are users accessing Hive or a Data Warehouse like Oracle
>> Oracle BI tools are pretty mature. Zeppelin is not in the same league so you 
>> have to decide which technology stack to follow
>> Spark should work with Oracle BI tool as well (need to check this) as a fast 
>> query tool. In that case the users can use Oracle BI tool with Spark as well.
>> It seems to me that the issue is that users don't want to move from Oracle 
>> BI tool. We had the same issue with Tableau. So you really need to make that 
>> Oracle BI tool use Spark and Alluxio and leave Zeppelin at one side.
>> 
>> Zeppelin as I used it a while back may not do what Oracle BI tool does. So 
>> the presentation layer has to be Oracle BI tool.
>> 
>> HTH
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 8 July 2016 at 04:19, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi everyone,
>> Currently we use Zeppelin to analytics our data and because of using SQL 
>> it’s hard to distribute for users use. But users are using some kind of 
>> Oracle BI tools to analytic because it support some kinds of drag and drop 
>> and we can do some kind of permitted for each user.
>> Our architecture is Spark, Alluxio, Zeppelin. Because We want to share what 
>> we have done in Zeppelin to business users.
>> 
>> Is there any way to do that?
>> 
>> Thanks.
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> 
>> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Chanh Le

Hi Mich,

Actually, the technical users may also write some fairly complex machine learning 
code in the future, which is why Zeppelin is promising.

> Those business users. Do they Oracle BI (OBI) to connect to DW like Oracle 
> now?
Yes, they do. Our data is still stored in Oracle, but it is becoming bigger and 
bigger every day and some queries can no longer execute in Oracle, so we are moving to 
another storage layer and using Spark to query it.

> I have also Hive running on Spark engine that makes such a solution easier by 
> allowing users to connect to Hive and execute their queries. You want to 
> provide a fast retrieval system for your users. Your case is interesting as 
> you have two parallel stack here.


So does that mean I still need to set up Hive and Hadoop? Our resources are 
limited; we need to dedicate almost all memory and CPU to Spark and Alluxio.

Thanks & regards,
Chanh




> On Jul 8, 2016, at 11:18 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Interesting  Chanh
> 
> Those business users. Do they Oracle BI (OBI) to connect to DW like Oracle 
> now?
> 
> Certainly power users can use Zeppelin to write code that will be executed 
> through Spark but much doubt Zeppelin can do what OBI tool provides.
> 
> What you need is to investigate if OBI tool can connect to Spark Thrift 
> Server to use Spark to access your parquet files. Your parquet files are 
> already on HDFS (part of Hadoop).
> 
>  Hive has ODBC interfaces to Tableau and sure it can also work with OBI.
> 
> I have also Hive running on Spark engine that makes such a solution easier by 
> allowing users to connect to Hive and execute their queries. You want to 
> provide a fast retrieval system for your users. Your case is interesting as 
> you have two parallel stack here.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 8 July 2016 at 04:58, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Mich,
> Thanks for replying. Currently we think we need to separate 2 groups of user. 
> 1. Technical: Can write SQL 
> 2. Business: Can drag and drop fields or metrics and see the result.
> Our stack using Zeppeline, Spark SQL to query data from Alluxio. Our data 
> current store in parquet files. Zeppelin is using HiveContext but we haven’t 
> set up Hive and Hadoop yet. 
> 
> I am little bit confuse in Spark Thift Server because Thift Server in Spark 
> can allow external tools connect but is that require to set up Hive and 
> Hadoop?
> 
> Thanks and regards,
> Chanh
> 
> 
> 
>> On Jul 8, 2016, at 10:49 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> hi,
>> 
>> I have not used Alluxio but it is a distributed file system much like an 
>> IMDB say Oracle TimesTen. Spark is your query tool and Zeppelin is the GUI 
>> interface to your Spark which basically allows you graphs with Spark queries.
>> 
>> You mentioned Hive so I assume your persistent storage is Hive?
>> 
>> Your business are using Oracle BI tool. It is like Tableau. I assume Oracle 
>> BI tool accesses a database of some sort say Oracle DW using native 
>> connectivity and it may also have ODBC and JDBC connections to Hive etc.
>> 
>> The issue I see here is your GUI tool Zeppelin which does the same thing as 
>> Oracle BI tool. Can you please clarify below:
>> 
>> you use Hive as your database/persistent storage and use Alluxio on top of 
>> Hive?
>> are users accessing Hive or a Data Warehouse like Oracle
>> Oracle BI tools are pretty mature. Zeppelin is not in the same league so you 
>> have to decide which technology stack to follow
>> Spark should work with Oracle BI tool as well (need to check this) as a fast 
>> query tool. In that case the users can use Oracle BI tool with Spark as well.
>> It seems to me that the issue is that users don't want to move from Oracle 
>> BI tool. We had the same issue with Tableau. So you really need to make that 
>> Oracle BI tool use Spark and 

Re: Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Chanh Le
Hi Mich,
Thanks for replying. Currently we think we need to separate two groups of users: 
1. Technical: can write SQL. 
2. Business: can drag and drop fields or metrics and see the result.
Our stack uses Zeppelin and Spark SQL to query data from Alluxio. Our data is 
currently stored in parquet files. Zeppelin is using HiveContext, but we haven't 
set up Hive and Hadoop yet. 

I am a little confused about the Spark Thrift Server: it allows 
external tools to connect, but does that require setting up Hive and Hadoop?

Thanks and regards,
Chanh



> On Jul 8, 2016, at 10:49 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> hi,
> 
> I have not used Alluxio but it is a distributed file system much like an IMDB 
> say Oracle TimesTen. Spark is your query tool and Zeppelin is the GUI 
> interface to your Spark which basically allows you graphs with Spark queries.
> 
> You mentioned Hive so I assume your persistent storage is Hive?
> 
> Your business are using Oracle BI tool. It is like Tableau. I assume Oracle 
> BI tool accesses a database of some sort say Oracle DW using native 
> connectivity and it may also have ODBC and JDBC connections to Hive etc.
> 
> The issue I see here is your GUI tool Zeppelin which does the same thing as 
> Oracle BI tool. Can you please clarify below:
> 
> you use Hive as your database/persistent storage and use Alluxio on top of 
> Hive?
> are users accessing Hive or a Data Warehouse like Oracle
> Oracle BI tools are pretty mature. Zeppelin is not in the same league so you 
> have to decide which technology stack to follow
> Spark should work with Oracle BI tool as well (need to check this) as a fast 
> query tool. In that case the users can use Oracle BI tool with Spark as well.
> It seems to me that the issue is that users don't want to move from Oracle BI 
> tool. We had the same issue with Tableau. So you really need to make that 
> Oracle BI tool use Spark and Alluxio and leave Zeppelin at one side.
> 
> Zeppelin as I used it a while back may not do what Oracle BI tool does. So 
> the presentation layer has to be Oracle BI tool.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 8 July 2016 at 04:19, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi everyone,
> Currently we use Zeppelin to analytics our data and because of using SQL it’s 
> hard to distribute for users use. But users are using some kind of Oracle BI 
> tools to analytic because it support some kinds of drag and drop and we can 
> do some kind of permitted for each user.
> Our architecture is Spark, Alluxio, Zeppelin. Because We want to share what 
> we have done in Zeppelin to business users.
> 
> Is there any way to do that?
> 
> Thanks.
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



Any ways to connect BI tool to Spark without Hive

2016-07-07 Thread Chanh Le
Hi everyone,
Currently we use Zeppelin to analyze our data, but because it relies on SQL it is 
hard to roll out to users. Our users prefer an Oracle BI 
tool for analytics because it supports drag and drop and lets us set 
permissions for each user.
Our architecture is Spark, Alluxio, and Zeppelin, and we want to share what we 
have done in Zeppelin with business users. 

Is there any way to do that?

Thanks.
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Optimize filter operations with sorted data

2016-07-07 Thread Chanh Le
Hi Tan,
It depends on how the data is organized and what your filter is.
For example, in my case I store data partitioned by the fields time and network_id. 
If I filter by time or network_id (or both) together with other fields, Spark only loads 
the time/network_id partitions matched by the filter and then applies the remaining predicates.
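
A minimal sketch of that layout and filter pattern (the path, values, and the extra 
column are illustrative):

  import org.apache.spark.sql.functions.col

  // Data written with partitionBy("network_id", "time") lives under
  // .../FACT_ADMIN_HOURLY/network_id=.../time=.../part-*.parquet
  val df = sqlContext.read.parquet("alluxio://master1:19998/FACT_ADMIN_HOURLY")

  // Predicates on the partition columns prune directories (only matching
  // network_id/time folders are listed and read); the remaining predicate
  // is applied to the rows that are actually loaded.
  val hits = df.filter(col("network_id") === 100 &&
                       col("time") === "2016-07-07" &&
                       col("clicks") > 0)
  hits.count()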



> On Jul 7, 2016, at 4:43 PM, Ted Yu  wrote:
> 
> Does the filter under consideration operate on sorted column(s) ?
> 
> Cheers
> 
>> On Jul 7, 2016, at 2:25 AM, tan shai  wrote:
>> 
>> Hi, 
>> 
>> I have a sorted dataframe, I need to optimize the filter operations.
>> How does Spark performs filter operations on sorted dataframe? 
>> 
>> It is scanning all the data? 
>> 
>> Many thanks. 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-03 Thread Chanh Le
Hi Gene,
Could you give some suggestions on that?



> On Jul 1, 2016, at 5:31 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> The comment from zhangxiongfei was from a year ago.
> 
> Maybe something changed since them ?
> 
> On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi Ted,
> I set sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
> but It seems not working.
> 
> 
> 
> 
>> On Jul 1, 2016, at 11:38 AM, Ted Yu <yuzhih...@gmail.com 
>> <mailto:yuzhih...@gmail.com>> wrote:
>> 
>> Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is 
>> in use.
>> 
>> FYI
>> 
>> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma <deepakmc...@gmail.com 
>> <mailto:deepakmc...@gmail.com>> wrote:
>> Ok.
>> I came across this issue.
>> Not sure if you already assessed this:
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921 
>> <https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921>
>> The workaround mentioned may work for you .
>> 
>> Thanks
>> Deepak
>> 
>> On 1 Jul 2016 9:34 am, "Chanh Le" <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> Hi Deepark,
>> Thank for replying. The way to write into alluxio is 
>> df.write.mode(SaveMode.Append).partitionBy("network_id", 
>> "time").parquet("alluxio://master1:1/FACT_ADMIN_HOURLY <>”)
>> 
>> 
>> I partition by 2 columns and store. I just want when I write it automatic 
>> write a size properly for what I already set in Alluxio 512MB per block.
>> 
>> 
>>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmc...@gmail.com 
>>> <mailto:deepakmc...@gmail.com>> wrote:
>>> 
>>> Before writing coalesing your rdd to 1 .
>>> It will create only 1 output file .
>>> Multiple part file happens as all your executors will be writing their 
>>> partitions to separate part files.
>>> 
>>> Thanks
>>> Deepak
>>> 
>>> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosu...@gmail.com 
>>> <mailto:giaosu...@gmail.com>> wrote:
>>> Hi everyone,
>>> I am using Alluxio for storage. But I am little bit confuse why I am do set 
>>> block size of alluxio is 512MB and my file part only few KB and too many 
>>> part.
>>> Is that normal? Because I want to read it fast? Is that many part effect 
>>> the read operation?
>>> How to set the size of file part?
>>> 
>>> Thanks.
>>> Chanh
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>> 
>> 
> 
> 



Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Chanh Le
Hi Deepak,
Thanks for replying. The way I write into Alluxio is 
df.write.mode(SaveMode.Append).partitionBy("network_id", 
"time").parquet("alluxio://master1:19998/FACT_ADMIN_HOURLY")


I partition by two columns and store the result. I just want the write to automatically produce 
files of a proper size for what I already set in Alluxio (512MB per block).


> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
> 
> Before writing coalesing your rdd to 1 .
> It will create only 1 output file .
> Multiple part file happens as all your executors will be writing their 
> partitions to separate part files.
> 
> Thanks
> Deepak
> 
> On 1 Jul 2016 8:01 am, "Chanh Le" <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> Hi everyone,
> I am using Alluxio for storage. But I am little bit confuse why I am do set 
> block size of alluxio is 512MB and my file part only few KB and too many part.
> Is that normal? Because I want to read it fast? Is that many part effect the 
> read operation?
> How to set the size of file part?
> 
> Thanks.
> Chanh
> 
> 
> 
>  
> 
> 



Re: Looking for help about stackoverflow in spark

2016-06-30 Thread Chanh Le
Hi John,
I think it relates to driver memory more than the other things you mentioned.

Can you simply give the driver more memory?
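
A hedged sketch of how that is usually done (the values and application jar are 
illustrative); driver memory has to be set before the driver starts, i.e. on the 
command line or in spark-defaults.conf, not on the SparkConf inside the job:

  # spark-submit / spark-shell
  spark-submit --driver-memory 8g your-app.jar

  # or in conf/spark-defaults.conf
  spark.driver.memory   8g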




> On Jul 1, 2016, at 9:03 AM, johnzeng  wrote:
> 
> I am trying to load a 1 TB collection into spark cluster from mongo. But I am
> keep getting stack overflow error  after running for a while.
> 
> I have posted a question in stackoverflow.com, and tried all advies they
> have provide, nothing works...
> 
> how to load large database into spark
> 
>   
> 
> I have tried:
> 1, use persist to make it MemoryAndDisk,  same error after running same
> time.
> 2, add more instance,  same error after running same time.
> 3, run this script on another collection which is much smaller, everything
> is good, so I think my codes are all right.
> 4, remove the reduce process, same error after running same time.
> 5, remove the map process,  same error after running same time.
> 6, change the sql I used, it's faster, but  same error after running shorter
> time.
> 7,retrieve "_id" instead of "u_at" and "c_at",  same error after running
> same time.
> 
> Anyone knows how many resources do I need to handle this 1TB database? I
> only retrieve two fields form it, and this field is only 1% of a
> document(because we have an array containing about 90+ embedded documents in
> it.)
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Looking-for-help-about-stackoverflow-in-spark-tp27255.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Best practice for handing tables between pipeline components

2016-06-29 Thread Chanh Le
Hi Everett,
We have been using Alluxio for the last two months. We use Alluxio to share 
data between Spark jobs, isolating Spark as the processing layer and Alluxio as the 
storage layer.
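
A minimal sketch of that handoff (the paths and DataFrame names are illustrative): one 
job persists its output to Alluxio, and a later job, possibly a completely separate 
Spark application, reads it back without recomputation:

  // Job 1: write intermediate results to the Alluxio storage layer
  stage1DF.write.mode("overwrite").parquet("alluxio://master1:19998/pipeline/stage1_output")

  // Job 2 (its own SparkContext, or even a non-Spark reader): pick the data up again
  val stage1 = sqlContext.read.parquet("alluxio://master1:19998/pipeline/stage1_output")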



> On Jun 29, 2016, at 2:52 AM, Everett Anderson  
> wrote:
> 
> Thanks! Alluxio looks quite promising, but also quite new.
> 
> What did people do before?
> 
> On Mon, Jun 27, 2016 at 12:33 PM, Gene Pang  > wrote:
> Yes, Alluxio (http://www.alluxio.org/ ) can be used 
> to store data in-memory between stages in a pipeline.
> 
> Here is more information about running Spark with Alluxio: 
> http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html 
> 
> 
> Hope that helps,
> Gene
> 
> On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu 
> > wrote:
> Alluxio off heap memory would help to share cached objects
> 
> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson  
> wrote:
> Hi,
> 
> We have a pipeline of components strung together via Airflow running on AWS. 
> Some of them are implemented in Spark, but some aren't. Generally they can 
> all talk to a JDBC/ODBC end point or read/write files from S3.
> 
> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS or 
> S3 and reading it back in, again, in every component, if it could stay cached 
> in memory in a Spark cluster. 
> 
> Our current investigation seems to lead us towards exploring if the following 
> things are possible:
> Using a Hive metastore with S3 as its backing data store to try to keep a 
> mapping from table name to files on S3 (not sure if one can cache a Hive 
> table in Spark across contexts, though)
> Using something like the spark-jobserver to keep a Spark SQLContext open 
> across Spark components so they could avoid file I/O for cached tables
> What's the best practice for handing tables between Spark programs? What 
> about between Spark and non-Spark programs?
> 
> Thanks!
> 
> - Everett
> 
> 
> 



Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Chanh Le
Hi everyone,
I added more logs for my use case:

When I cached all my data (500 million records) and ran a count, I received this:
16/06/16 10:09:25 ERROR TaskSetManager: Total size of serialized results of 27 
tasks (1876.7 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
>>> which is weird because I am only running a count.
After increasing maxResultSize to 10g,
I am still waiting a long time for the result, and then I get this error:
16/06/16 10:09:25 INFO BlockManagerInfo: Removed taskresult_94 on slave1:27743 
in memory (size: 69.5 MB, free: 6.2 GB)
org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
of serialized results of 15 tasks (1042.6 MB) is bigger than 
spark.driver.maxResultSize (1024.0 MB
)
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1863)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1876)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1889)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:883)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:882)
  at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2436)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2121)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2128)
  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2156)
  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2155)
  at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2449)
  at org.apache.spark.sql.Dataset.count(Dataset.scala:2155)
  ... 48 elided

I lost all my executors.
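
For reference, a minimal sketch of one way to raise this limit (10g is just the 
value mentioned above, and 0 removes the cap entirely at the risk of a driver 
OOM; the app name is a placeholder):

// Set the limit when building the session (Spark 2.0, Scala).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-and-count")                     // placeholder name
  .config("spark.driver.maxResultSize", "10g")    // default is 1g
  .getOrCreate()

// Or pass it on the command line when launching spark-shell / spark-submit:
//   --conf spark.driver.maxResultSize=10g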



> On Jun 15, 2016, at 8:44 PM, Chanh Le <giaosu...@gmail.com> wrote:
> 
> Hi Gene,
> I am using Alluxio 1.1.0 with the Spark 2.0 Preview version.
> I load from Alluxio, cache, and query; the second time I run the query, Spark 
> gets stuck.
> 
> 
> 
>> On Jun 15, 2016, at 8:42 PM, Gene Pang <gene.p...@gmail.com 
>> <mailto:gene.p...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> Which version of Alluxio are you using?
>> 
>> Thanks,
>> Gene
>> 
>> On Tue, Jun 14, 2016 at 3:45 AM, Chanh Le <giaosu...@gmail.com 
>> <mailto:giaosu...@gmail.com>> wrote:
>> I am testing Spark 2.0.
>> I load data from Alluxio and cache it, then query. The first query is OK 
>> because it kicks off the cache action, but when I run the query again it 
>> gets stuck.
>> I ran this on a 5-node cluster in spark-shell.
>> 
>> Has anyone else hit this issue?
>> 
>> 
>> 
>> 
>> 
> 



Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Chanh Le
Hi Gene,
I am using Alluxio 1.1.0 with the Spark 2.0 Preview version.
I load from Alluxio, cache, and query; the second time I run the query, Spark 
gets stuck.



> On Jun 15, 2016, at 8:42 PM, Gene Pang <gene.p...@gmail.com> wrote:
> 
> Hi,
> 
> Which version of Alluxio are you using?
> 
> Thanks,
> Gene
> 
> On Tue, Jun 14, 2016 at 3:45 AM, Chanh Le <giaosu...@gmail.com 
> <mailto:giaosu...@gmail.com>> wrote:
> I am testing Spark 2.0.
> I load data from Alluxio and cache it, then query. The first query is OK 
> because it kicks off the cache action, but when I run the query again it 
> gets stuck.
> I ran this on a 5-node cluster in spark-shell.
> 
> Has anyone else hit this issue?
> 
> 
> 
> 
> 



Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-14 Thread Chanh Le
I am testing Spark 2.0.
I load data from Alluxio and cache it, then query. The first query is OK 
because it kicks off the cache action, but when I run the query again it gets 
stuck.
I ran this on a 5-node cluster in spark-shell.

Has anyone else hit this issue?
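
For anyone trying to reproduce this, a minimal sketch of the workflow I am 
describing (the Alluxio path and file format here are placeholders, not the 
actual dataset; run in the 2.0 spark-shell, where the spark session is defined):

// Load from Alluxio, cache, then run the same query twice.
val df = spark.read.parquet("alluxio://alluxio-master:19998/data/events")
df.cache()
df.count()   // first query: triggers the caching and completes fine
df.count()   // second query: this is where the job hangs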



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Partition by Columns doesn't work properly

2016-06-09 Thread Chanh Le
Ok, thanks.

On Thu, Jun 9, 2016, 12:51 PM Jasleen Kaur <jasleenkaur1...@gmail.com>
wrote:

> The github repo is https://github.com/datastax/spark-cassandra-connector
>
> The talk video and slides should be uploaded soon on spark summit website
>
>
> On Wednesday, June 8, 2016, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Thanks, I'll look into it. Could you share a link to the talk?
>>
>> On Thu, Jun 9, 2016, 12:43 PM Jasleen Kaur <jasleenkaur1...@gmail.com>
>> wrote:
>>
>>> Try using the DataStax package. There was a great talk at Spark Summit
>>> about it. It will take care of the boilerplate code so you can focus on
>>> real business value.
>>>
>>> On Wednesday, June 8, 2016, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>> I tested partitioning a DataFrame by columns, and the result looks wrong
>>>> to me.
>>>> I am using Spark 1.6.1, loading data from Cassandra.
>>>> Repartitioning by two fields (date, network_id) gives 200 partitions.
>>>> Repartitioning by one field (date) also gives 200 partitions.
>>>> But my data covers 90 days, so I expected repartitioning by date to give
>>>> 90 partitions.
>>>>
>>>> val daily = sql
>>>>   .read
>>>>   .format("org.apache.spark.sql.cassandra")
>>>>   .options(Map("table" -> dailyDetailTableName, "keyspace" -> reportSpace))
>>>>   .load()
>>>>   .repartition(col("date"))
>>>>
>>>>
>>>>
>>>> No matter which columns I pass to repartition, the partition count does
>>>> not change.
>>>>
>>>> Does anyone have the same problem?
>>>>
>>>> Thanks in advance.
>>>>
>>>


Re: Spark Partition by Columns doesn't work properly

2016-06-08 Thread Chanh Le
Thanks, I'll look into it. Could you share a link to the talk?

On Thu, Jun 9, 2016, 12:43 PM Jasleen Kaur <jasleenkaur1...@gmail.com>
wrote:

> Try using the DataStax package. There was a great talk at Spark Summit
> about it. It will take care of the boilerplate code so you can focus on
> real business value.
>
> On Wednesday, June 8, 2016, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Hi everyone,
>> I tested partitioning a DataFrame by columns, and the result looks wrong to
>> me.
>> I am using Spark 1.6.1, loading data from Cassandra.
>> Repartitioning by two fields (date, network_id) gives 200 partitions.
>> Repartitioning by one field (date) also gives 200 partitions.
>> But my data covers 90 days, so I expected repartitioning by date to give 90
>> partitions.
>>
>> val daily = sql
>>   .read
>>   .format("org.apache.spark.sql.cassandra")
>>   .options(Map("table" -> dailyDetailTableName, "keyspace" -> reportSpace))
>>   .load()
>>   .repartition(col("date"))
>>
>>
>>
>> No matter which columns I pass to repartition, the partition count does not
>> change.
>>
>> Does anyone have the same problem?
>>
>> Thanks in advance.
>>
>


Spark Partition by Columns doesn't work properly

2016-06-08 Thread Chanh Le
Hi everyone,
I tested partitioning a DataFrame by columns, and the result looks wrong to me.
I am using Spark 1.6.1, loading data from Cassandra.
Repartitioning by two fields (date, network_id) gives 200 partitions.
Repartitioning by one field (date) also gives 200 partitions.
But my data covers 90 days, so I expected repartitioning by date to give 90 
partitions.
val daily = sql
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> dailyDetailTableName, "keyspace" -> reportSpace))
  .load()
  .repartition(col("date"))


No matter which columns I pass to repartition, the partition count does not 
change.

Does anyone have the same problem?

Thanks in advance.
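
For what it's worth, a minimal sketch of what seems to be going on and of two 
possible workarounds (90 below is just my expected day count, and this follows 
the Spark 1.6 DataFrame API as I understand it): repartition(col(...)) 
hash-partitions the rows into spark.sql.shuffle.partitions buckets, which 
defaults to 200, rather than creating one partition per distinct column value.

// (a) Ask for an explicit partition count alongside the column expression.
val dailyByDate = daily.repartition(90, col("date"))

// (b) Or lower the shuffle default on the SQLContext before repartitioning.
sql.setConf("spark.sql.shuffle.partitions", "90")
val dailyByDateDefault = daily.repartition(col("date"))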

[jira] [Issue Comment Deleted] (SPARK-7703) Task failure caused by block fetch failure in BlockManager.doGetRemote() when using TorrentBroadcast

2016-05-27 Thread Chanh Le (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chanh Le updated SPARK-7703:

Comment: was deleted

(was: Any update on that? 
I have the same error too.
java.io.IOException: org.apache.spark.storage.BlockFetchException: Failed to 
fetch block from 1 locations. Most recent failure cause:
https://gist.github.com/giaosudau/3f7087707dcabc53c3b3bf54b0503720)

> Task failure caused by block fetch failure in BlockManager.doGetRemote() when 
> using TorrentBroadcast
> 
>
> Key: SPARK-7703
> URL: https://issues.apache.org/jira/browse/SPARK-7703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.3.1
> Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo)
> Spark 1.3.1 Release
>Reporter: Hailong Wen
>
> I am from IBM Platform Symphony team and we are working to integration Spark 
> with our EGO to provide a fine-grained dynamic allocation Resource Manager. 
> We found a defect in current implementation of BlockManager.doGetRemote():
> {noformat}
>   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): 
> Option[Any] = {
> require(blockId != null, "BlockId is null")
> val locations = Random.shuffle(master.getLocations(blockId)) 
> <--- Issue2: locations may be out of date
> for (loc <- locations) {
>   logDebug(s"Getting remote block $blockId from $loc")
>   val data = blockTransferService.fetchBlockSync(
> loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() 
>  <--- Issue1: This statement is not in try/catch
>   if (data != null) {
> if (asBlockResult) {
>   return Some(new BlockResult(
> dataDeserialize(blockId, data),
> DataReadMethod.Network,
> data.limit()))
> } else {
>   return Some(data)
> }
>   }
>   logDebug(s"The value of block $blockId is null")
> }
> logDebug(s"Block $blockId not found")
> None
>   }
> {noformat}
> * Issue 1: Although the block fetch uses "for" to try all available 
> locations, the fetch method is not guarded by a "Try" block. When exception 
> occurs, this method will directly throw the error instead of trying other 
> block locations. The uncaught exception will cause task failure.
> * Issue 2: Constant "location" is acquired before fetching, however in a 
> dynamic allocation environment the block locations may change.
> We hit the above 2 issues in our use case, where Executors exit after all its 
> assigned tasks are done. We *occasionally* get the following error (issue 1.):
> {noformat}
> 15/05/13 10:28:35 INFO Executor: Running task 27.0 in stage 0.0 (TID 27)
> 15/05/13 10:28:35 DEBUG Executor: Task 26's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 28's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 27's epoch is 0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0 not registered locally
> 15/05/13 10:28:35 INFO TorrentBroadcast: Started reading broadcast variable 0
> 15/05/13 10:28:35 DEBUG TorrentBroadcast: Reading piece broadcast_0_piece0 of 
> broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0_piece0 not registered 
> locally
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> from BlockManagerId(c390c311-bd97-4a99-bcb9-b32fd3dede17, sparkbj01, 37599)
> 15/05/13 10:28:35 TRACE NettyBlockTransferService: Fetch blocks from 
> sparkbj01:37599 (executor id c390c311-bd97-4a99-bcb9-b32fd3dede17)
> 15/05/13 10:28:35 DEBUG TransportClientFactory: Creating new connection to 
> sparkbj01/9.111.254.195:37599
> 15/05/13 10:28:35 ERROR RetryingBlockFetcher: Exception while beginning fetch 
> of 1 outstanding blocks 
> java.io.IOException: Failed to connect to sparkbj01/9.111.254.195:37599
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>   at 
> org.apache

[jira] [Commented] (SPARK-7703) Task failure caused by block fetch failure in BlockManager.doGetRemote() when using TorrentBroadcast

2016-05-27 Thread Chanh Le (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303655#comment-15303655
 ] 

Chanh Le commented on SPARK-7703:
-

Any update on that? 
I have the same error.
java.io.IOException: org.apache.spark.storage.BlockFetchException: Failed to 
fetch block from 1 locations. Most recent failure cause:
https://gist.github.com/giaosudau/3f7087707dcabc53c3b3bf54b0503720

> Task failure caused by block fetch failure in BlockManager.doGetRemote() when 
> using TorrentBroadcast
> 
>
> Key: SPARK-7703
> URL: https://issues.apache.org/jira/browse/SPARK-7703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.3.1
> Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo)
> Spark 1.3.1 Release
>Reporter: Hailong Wen
>
> I am from IBM Platform Symphony team and we are working to integration Spark 
> with our EGO to provide a fine-grained dynamic allocation Resource Manager. 
> We found a defect in current implementation of BlockManager.doGetRemote():
> {noformat}
>   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): 
> Option[Any] = {
> require(blockId != null, "BlockId is null")
> val locations = Random.shuffle(master.getLocations(blockId)) 
> <--- Issue2: locations may be out of date
> for (loc <- locations) {
>   logDebug(s"Getting remote block $blockId from $loc")
>   val data = blockTransferService.fetchBlockSync(
> loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() 
>  <--- Issue1: This statement is not in try/catch
>   if (data != null) {
> if (asBlockResult) {
>   return Some(new BlockResult(
> dataDeserialize(blockId, data),
> DataReadMethod.Network,
> data.limit()))
> } else {
>   return Some(data)
> }
>   }
>   logDebug(s"The value of block $blockId is null")
> }
> logDebug(s"Block $blockId not found")
> None
>   }
> {noformat}
> * Issue 1: Although the block fetch uses "for" to try all available 
> locations, the fetch method is not guarded by a "Try" block. When exception 
> occurs, this method will directly throw the error instead of trying other 
> block locations. The uncaught exception will cause task failure.
> * Issue 2: Constant "location" is acquired before fetching, however in a 
> dynamic allocation environment the block locations may change.
> We hit the above 2 issues in our use case, where Executors exit after all its 
> assigned tasks are done. We *occasionally* get the following error (issue 1.):
> {noformat}
> 15/05/13 10:28:35 INFO Executor: Running task 27.0 in stage 0.0 (TID 27)
> 15/05/13 10:28:35 DEBUG Executor: Task 26's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 28's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 27's epoch is 0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0 not registered locally
> 15/05/13 10:28:35 INFO TorrentBroadcast: Started reading broadcast variable 0
> 15/05/13 10:28:35 DEBUG TorrentBroadcast: Reading piece broadcast_0_piece0 of 
> broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0_piece0 not registered 
> locally
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> from BlockManagerId(c390c311-bd97-4a99-bcb9-b32fd3dede17, sparkbj01, 37599)
> 15/05/13 10:28:35 TRACE NettyBlockTransferService: Fetch blocks from 
> sparkbj01:37599 (executor id c390c311-bd97-4a99-bcb9-b32fd3dede17)
> 15/05/13 10:28:35 DEBUG TransportClientFactory: Creating new connection to 
> sparkbj01/9.111.254.195:37599
> 15/05/13 10:28:35 ERROR RetryingBlockFetcher: Exception while beginning fetch 
> of 1 outstanding blocks 
> java.io.IOException: Failed to connect to sparkbj01/9.111.254.195:37599
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>   at 
> org.apache

[jira] [Comment Edited] (SPARK-7703) Task failure caused by block fetch failure in BlockManager.doGetRemote() when using TorrentBroadcast

2016-05-27 Thread Chanh Le (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303655#comment-15303655
 ] 

Chanh Le edited comment on SPARK-7703 at 5/27/16 6:52 AM:
--

Any update on that? 
I have the same error too.
java.io.IOException: org.apache.spark.storage.BlockFetchException: Failed to 
fetch block from 1 locations. Most recent failure cause:
https://gist.github.com/giaosudau/3f7087707dcabc53c3b3bf54b0503720


was (Author: giaosuddau):
Any update on that? 
I have the same error.
java.io.IOException: org.apache.spark.storage.BlockFetchException: Failed to 
fetch block from 1 locations. Most recent failure cause:
https://gist.github.com/giaosudau/3f7087707dcabc53c3b3bf54b0503720

> Task failure caused by block fetch failure in BlockManager.doGetRemote() when 
> using TorrentBroadcast
> 
>
> Key: SPARK-7703
> URL: https://issues.apache.org/jira/browse/SPARK-7703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.3.1
> Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo)
> Spark 1.3.1 Release
>Reporter: Hailong Wen
>
> I am from IBM Platform Symphony team and we are working to integration Spark 
> with our EGO to provide a fine-grained dynamic allocation Resource Manager. 
> We found a defect in current implementation of BlockManager.doGetRemote():
> {noformat}
>   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): 
> Option[Any] = {
> require(blockId != null, "BlockId is null")
> val locations = Random.shuffle(master.getLocations(blockId)) 
> <--- Issue2: locations may be out of date
> for (loc <- locations) {
>   logDebug(s"Getting remote block $blockId from $loc")
>   val data = blockTransferService.fetchBlockSync(
> loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() 
>  <--- Issue1: This statement is not in try/catch
>   if (data != null) {
> if (asBlockResult) {
>   return Some(new BlockResult(
> dataDeserialize(blockId, data),
> DataReadMethod.Network,
> data.limit()))
> } else {
>   return Some(data)
> }
>   }
>   logDebug(s"The value of block $blockId is null")
> }
> logDebug(s"Block $blockId not found")
> None
>   }
> {noformat}
> * Issue 1: Although the block fetch uses "for" to try all available 
> locations, the fetch method is not guarded by a "Try" block. When exception 
> occurs, this method will directly throw the error instead of trying other 
> block locations. The uncaught exception will cause task failure.
> * Issue 2: Constant "location" is acquired before fetching, however in a 
> dynamic allocation environment the block locations may change.
> We hit the above 2 issues in our use case, where Executors exit after all its 
> assigned tasks are done. We *occasionally* get the following error (issue 1.):
> {noformat}
> 15/05/13 10:28:35 INFO Executor: Running task 27.0 in stage 0.0 (TID 27)
> 15/05/13 10:28:35 DEBUG Executor: Task 26's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 28's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 27's epoch is 0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0 not registered locally
> 15/05/13 10:28:35 INFO TorrentBroadcast: Started reading broadcast variable 0
> 15/05/13 10:28:35 DEBUG TorrentBroadcast: Reading piece broadcast_0_piece0 of 
> broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0_piece0 not registered 
> locally
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> from BlockManagerId(c390c311-bd97-4a99-bcb9-b32fd3dede17, sparkbj01, 37599)
> 15/05/13 10:28:35 TRACE NettyBlockTransferService: Fetch blocks from 
> sparkbj01:37599 (executor id c390c311-bd97-4a99-bcb9-b32fd3dede17)
> 15/05/13 10:28:35 DEBUG TransportClientFactory: Creating new connection to 
> sparkbj01/9.111.254.195:37599
> 15/05/13 10:28:35 ERROR RetryingBlockFetcher: Exception while beginning fetch 
> of 1 outstanding blocks 
> java.io.IOException: Failed to connect to sparkbj01/9.111.254.195:37599
>   at

[jira] [Commented] (MESOS-4565) slave recovers and attempt to destroy executor's child containers, then begins rejecting task status updates

2016-05-23 Thread Chanh Le (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297666#comment-15297666
 ] 

Chanh Le commented on MESOS-4565:
-

Any update on this?
I am still hitting this issue.

> slave recovers and attempt to destroy executor's child containers, then 
> begins rejecting task status updates
> 
>
> Key: MESOS-4565
> URL: https://issues.apache.org/jira/browse/MESOS-4565
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.26.0
>Reporter: James DeFelice
>  Labels: mesosphere
>
> AFAICT the slave is doing this:
> 1) recovering from some kind of failure
> 2) checking the containers that it pulled from its state store
> 3) complaining about cgroup children hanging off of executor containers
> 4) rejecting task status updates related to the executor container, the first 
> of which in the logs is:
> {code}
> E0130 02:22:21.979852 12683 slave.cpp:2963] Failed to update resources for 
> container 1d965a20-849c-40d8-9446-27cb723220a9 of executor 
> 'd701ab48a0c0f13_k8sm-executor' running task 
> pod.f2dc2c43-c6f7-11e5-ad28-0ad18c5e6c7f on status update for terminal task, 
> destroying container: Container '1d965a20-849c-40d8-9446-27cb723220a9' not 
> found
> {code}
> To be fair, I don't believe that my custom executor is re-registering 
> properly with the slave prior to attempting to send these (failing) status 
> updates. But the slave doesn't complain about that .. it complains that it 
> can't find the **container**.
> slave log here:
> https://gist.github.com/jdef/265663461156b7a7ed4e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-04-27 Thread Chanh Le (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261541#comment-15261541
 ] 

Chanh Le commented on CASSANDRA-10661:
--

[~xedin] Thanks, man. You made my day.

> Integrate SASI to Cassandra
> ---
>
> Key: CASSANDRA-10661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Pavel Yaskevich
>Assignee: Pavel Yaskevich
>  Labels: sasi
> Fix For: 3.4
>
>
> We have recently released new secondary index engine 
> (https://github.com/xedin/sasi) build using SecondaryIndex API, there are 
> still couple of things to work out regarding 3.x since it's currently 
> targeted on 2.0 released. I want to make this an umbrella issue to all of the 
> things related to integration of SASI, which are also tracked in 
> [sasi_issues|https://github.com/xedin/sasi/issues], into mainline Cassandra 
> 3.x release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-04-27 Thread Chanh Le (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261528#comment-15261528
 ] 

Chanh Le commented on CASSANDRA-10661:
--

Hi, I am using Cassandra 3.5 and I have a problem creating a SASI index:
CREATE CUSTOM INDEX ON bar (fname) USING 
'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex'
WITH OPTIONS = {
'analyzer_class':
'org.apache.cassandra.db.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};

it throws: unable to find custom indexer class 
'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex
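
For anyone hitting the same error: in Cassandra 3.4+ SASI appears to live under 
org.apache.cassandra.index.sasi (the org.apache.cassandra.db.index class path 
comes from the old standalone plugin), so both the index class and the analyzer 
package change. A small Scala sketch using the DataStax Java driver to run the 
adjusted statement; the contact point and keyspace are placeholders:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")   // placeholder keyspace

// With SASI integrated into Cassandra itself, the index class is SASIIndex
// rather than the standalone SSTableAttachedSecondaryIndex.
session.execute(
  """CREATE CUSTOM INDEX ON bar (fname)
    |USING 'org.apache.cassandra.index.sasi.SASIIndex'
    |WITH OPTIONS = {
    |  'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    |  'case_sensitive': 'false'
    |}""".stripMargin)

cluster.close()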



> Integrate SASI to Cassandra
> ---
>
> Key: CASSANDRA-10661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Pavel Yaskevich
>Assignee: Pavel Yaskevich
>  Labels: sasi
> Fix For: 3.4
>
>
> We have recently released new secondary index engine 
> (https://github.com/xedin/sasi) build using SecondaryIndex API, there are 
> still couple of things to work out regarding 3.x since it's currently 
> targeted on 2.0 released. I want to make this an umbrella issue to all of the 
> things related to integration of SASI, which are also tracked in 
> [sasi_issues|https://github.com/xedin/sasi/issues], into mainline Cassandra 
> 3.x release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


How to config my django project to use tastypie and mongoengine

2012-11-25 Thread Chanh Le
In the settings.py
import mongoengine
mongoengine.connect('cooking')
AUTHENTICATION_BACKENDS = (
'mongoengine.django.auth.MongoEngineBackend',
)
SESSION_ENGINE = 'mongoengine.django.sessions'


MIDDLEWARE_CLASSES = (
'django.middleware.common.CommonMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',

# Uncomment the next line for simple clickjacking protection:
# 'django.middleware.clickjacking.XFrameOptionsMiddleware',
)

INSTALLED_APPS = (
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.sites',
'django.contrib.messages',
'django.contrib.staticfiles',
# Uncomment the next line to enable the admin:
'django.contrib.admin',
# Uncomment the next line to enable admin documentation:
# 'django.contrib.admindocs',
'tastypie',
'tastypie_mongoengine',
'django.contrib.sessions',
'api',
)

But when I run syncdb I get this error:
 django.core.exceptions.ImproperlyConfigured: settings.DATABASES is 
improperly configured. Please supply the ENGINE value.
How do I configure this correctly?

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/django-users/-/tsety0NmZMsJ.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.

# Django settings for socialcooking project.

DEBUG = True
TEMPLATE_DEBUG = DEBUG

ADMINS = (
# ('Your Name', 'your_em...@example.com'),
)

MANAGERS = ADMINS


import mongoengine
mongoengine.connect('cooking')
AUTHENTICATION_BACKENDS = (
'mongoengine.django.auth.MongoEngineBackend',
)
SESSION_ENGINE = 'mongoengine.django.sessions'



#DATABASES = {
#'default': {
#'ENGINE': 'django.db.backends.', # Add 'postgresql_psycopg2', 'mysql', 'sqlite3' or 'oracle'.
#'NAME': '', # Or path to database file if using sqlite3.
#'USER': '', # Not used with sqlite3.
#'PASSWORD': '', # Not used with sqlite3.
#'HOST': '', # Set to empty string for localhost. Not used with sqlite3.
#'PORT': '', # Set to empty string for default. Not used with sqlite3.
#}
#}

# Local time zone for this installation. Choices can be found here:
# http://en.wikipedia.org/wiki/List_of_tz_zones_by_name
# although not all choices may be available on all operating systems.
# In a Windows environment this must be set to your system time zone.
TIME_ZONE = 'America/Chicago'

# Language code for this installation. All choices can be found here:
# http://www.i18nguy.com/unicode/language-identifiers.html
LANGUAGE_CODE = 'en-us'

SITE_ID = 1

# If you set this to False, Django will make some optimizations so as not
# to load the internationalization machinery.
USE_I18N = True

# If you set this to False, Django will not format dates, numbers and
# calendars according to the current locale.
USE_L10N = True

# If you set this to False, Django will not use timezone-aware datetimes.
USE_TZ = True

# Absolute filesystem path to the directory that will hold user-uploaded files.
# Example: "/home/media/media.lawrence.com/media/"
MEDIA_ROOT = ''

# URL that handles the media served from MEDIA_ROOT. Make sure to use a
# trailing slash.
# Examples: "http://media.lawrence.com/media/;, "http://example.com/media/;
MEDIA_URL = ''

# Absolute path to the directory static files should be collected to.
# Don't put anything in this directory yourself; store your static files
# in apps' "static/" subdirectories and in STATICFILES_DIRS.
# Example: "/home/media/media.lawrence.com/static/"
STATIC_ROOT = ''

# URL prefix for static files.
# Example: "http://media.lawrence.com/static/;
STATIC_URL = '/static/'

# Additional locations of static files
STATICFILES_DIRS = (
# Put strings here, like "/home/html/static" or "C:/www/django/static".
# Always use forward slashes, even on Windows.
# Don't forget to use absolute paths, not relative paths.
)

# List of finder classes that know how to find static files in
# various locations.
STATICFILES_FINDERS = (
'django.contrib.staticfiles.finders.FileSystemFinder',
'django.contrib.staticfiles.finders.AppDirectoriesFinder',
#'django.contrib.staticfiles.finders.DefaultStorageFinder',
)

# Make this unique, and don't share it with anybody.
SECRET_KEY = '@b$#)$w^6d8ogxdih@i2gwubv5$im)!dl@^y7w50w1x3==7!p'

# List of callables that know how to import templates from various sources.
TEMPLATE_LOADERS = (
'django.template.loaders.filesystem.Loader',

Chanh Le is out of the office.

2008-02-07 Thread Chanh Le
I will be out of the office starting  02/02/2008 and will not  return until
02/19/2008. If you need immediate assistance, please contact Murali
Bharathan at 818-575-1500 or internally at 578-4304.

Please contact Murali Bharathan or Kathy Ragatz for AS400 issues.

Thanks.
CL



==

Confidentiality Notice: The information contained in and transmitted with this 
communication is strictly confidential, is intended only for the use of the 
intended recipient, and is the property of Countrywide Financial Corporation or 
its affiliates and subsidiaries.  If you are not the intended recipient, you 
are hereby notified that any use of the information contained in or transmitted 
with the communication or dissemination, distribution, or copying of this 
communication is strictly prohibited by law.  If you have received this 
communication in error, please immediately return this communication to the 
sender and delete the original message and any copy of it in your possession.

==

===
To unsubscribe: mailto [EMAIL PROTECTED] with body: signoff JSP-INTEREST.
For digest: mailto [EMAIL PROTECTED] with body: set JSP-INTEREST DIGEST.

Some relevant archives, FAQs and Forums on JSPs can be found at:

 http://java.sun.com/products/jsp
 http://archives.java.sun.com/jsp-interest.html
 http://forums.java.sun.com
 http://www.jspinsider.com


Chanh Le is out of the office.

2005-01-23 Thread Chanh Le
I will be out of the office starting 1/17/2005 and will not return until
2/7/2005.

I am on vacation and back on 02/07/2005. Please contact Paul Bambah and
Ravindra Dabbiru for AS400 issues. Thanks.

___
To unsubscribe, send email to [EMAIL PROTECTED] and include in the body
of the message signoff SERVLET-INTEREST.

Archives: http://archives.java.sun.com/archives/servlet-interest.html
Resources: http://java.sun.com/products/servlet/external-resources.html
LISTSERV Help: http://www.lsoft.com/manuals/user/user.html

