Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-05-02 Thread via GitHub


viirya commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2091997670

   I think it makes more sense to use Arrow types as a bridge between Comet and 
the Parquet reader in Iceberg.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-05-02 Thread via GitHub


parthchandra commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2091903091

   > For the Parquet part, we may need to define something like `CometDataType` 
which gets converted from the Parquet schema, and from which we can derive 
Spark catalyst data type or Iceberg data type.
   
   How about Arrow types as the canonical data types for Comet? 
`org.apache.spark.sql.util.ArrowUtils` has conversions between Arrow and Spark 
schema/types.





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-05-01 Thread via GitHub


sunchao commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2089524991

   The original purpose of the `comet-common` module was to be engine-agnostic 
so that it could be used for other use cases such as Iceberg. Unfortunately, we 
didn't have time to isolate it completely, so it is still tightly coupled with 
Spark in several ways: the Parquet -> Catalyst schema conversion, ColumnVector, 
and, later on, a bunch of shuffle-related code, all of which is closely tied to 
Spark.
   
   If necessary, we can perhaps consider splitting the module further into 
`comet-parquet`, `comet-spark-shuffle` etc. For the Parquet part, we may need 
to define something like `CometDataType` which gets converted from the Parquet 
schema, and from which we can derive Spark catalyst data type or Iceberg data 
type.
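
   As a rough illustration of that idea, here is a minimal sketch of what an 
engine-neutral type could look like. Only the name `CometDataType` comes from 
the comment above; the enum members, the `fromParquet` method, and the 
accessors are assumptions made for illustration, and the Spark and Iceberg 
types are represented by their names as plain strings so the sketch stays 
dependency-free.

```java
// Hypothetical sketch: the thread only proposes the name CometDataType,
// so every member below is an assumption made for illustration.
enum CometDataType {
    BOOLEAN("BooleanType", "boolean"),
    INT32("IntegerType", "int"),
    INT64("LongType", "long"),
    FLOAT("FloatType", "float"),
    DOUBLE("DoubleType", "double");

    private final String sparkTypeName;   // name of the Spark Catalyst type
    private final String icebergTypeName; // name of the Iceberg primitive type

    CometDataType(String sparkTypeName, String icebergTypeName) {
        this.sparkTypeName = sparkTypeName;
        this.icebergTypeName = icebergTypeName;
    }

    // Map a Parquet physical type name (case-insensitive) to the neutral type.
    // Only the five physical types listed above are covered by this sketch.
    static CometDataType fromParquet(String parquetPhysicalType) {
        try {
            return valueOf(parquetPhysicalType.toUpperCase(java.util.Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Unsupported Parquet type: " + parquetPhysicalType);
        }
    }

    String sparkTypeName() { return sparkTypeName; }

    String icebergTypeName() { return icebergTypeName; }
}

class CometDataTypeDemo {
    public static void main(String[] args) {
        // Parquet schema type -> neutral type -> engine-specific type names.
        CometDataType t = CometDataType.fromParquet("INT64");
        System.out.println(t + ": Spark " + t.sparkTypeName()
            + ", Iceberg " + t.icebergTypeName());
    }
}
```

   The point of the sketch is the direction of the dependencies: the Parquet 
side only knows the neutral type, and each engine derives its own type from 
it. A real version would also need logical types, nested types, and 
decimal/timestamp parameters, which are left out here.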





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-30 Thread via GitHub


advancedxy commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2084810855

   > @advancedxy Good suggestions. I believe this Issue is to address point 3 
above while 1 and 2 are in progress.
   
   Thanks for the clarification, that makes total sense then.





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-29 Thread via GitHub


parthchandra commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2083237844

   @advancedxy Good suggestions. I believe this Issue is to address point 3 
above while 1 and 2 are in progress.





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-29 Thread via GitHub


advancedxy commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-208203

   I'm +1 for this direction in the long term, and I can help review the 
Iceberg integration if needed.
   
   In the short term, though, I think Iceberg could integrate Comet in its 
iceberg-spark module, which doesn't require Comet's common module to be 
engine independent. So it would be great if we could make this work 
incrementally, such as:
   1. release Comet 0.1 (or any other first version) first
   2. integrate Comet in Iceberg's spark module
   3. refactor and make the Comet common module engine independent 
incrementally in the next release or over several releases
   4. integrate Comet in Iceberg's arrow/comet module and make the vectorized 
reader generally available for other engines in the iceberg repo.





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-26 Thread via GitHub


snmvaughan commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2079553631

   +1 for this direction.
   
   We can start migrating in this direction by moving a subset of `Utils.scala` 
which is specific to the mapping between Spark and Arrow.





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-25 Thread via GitHub


viirya commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2078371802

   This sounds like a good direction to go. In the short term, though, it 
might add some additional work, since it requires us to refactor the common 
and spark modules.
   
   Currently I'm still not sure about integrations with other engines. It is a 
great target, I think, although to me it seems a little too far from the 
current project status and bandwidth. 😄 
   
   I would like to focus on Spark integration at the current stage. But if 
this refactoring is necessary to move the Iceberg integration forward for now, 
I will support it.
   
   
   





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-25 Thread via GitHub


andygrove commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2078337974

   I am +1 for making comet-common Arrow-native and easier to integrate with 
other engines. Let me know how I can help.





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-25 Thread via GitHub


parthchandra commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2078218574

   Seems to me it would be a step in the right direction. The idea that 
comet-common should be independent of any engine is sound. It would be a 
necessary first step towards integration with other engines (e.g. 
Presto/Trino).
   It looks like the Parquet file reader is mostly independent of Spark. The 
ColumnReaders and CometVector itself could be made more generic using only 
Arrow types and Arrow vectors, while the adaptation to operate as a Spark 
columnar batch could be moved into comet-spark.
   This may involve a fair amount of difficult refactoring but, imho, would be 
worth it.
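
   The split described above can be sketched roughly as follows. This is a 
dependency-free illustration, not Comet's actual code: `NeutralVector`, 
`LongArrayVector`, `EngineColumn`, and `EngineColumnAdapter` are hypothetical 
names, plain arrays stand in for Arrow buffers, and `EngineColumn` stands in 
for whatever column API a given engine expects.

```java
// Engine-neutral view of a column; a stand-in for an Arrow-backed vector
// that would live in the common module. All names here are hypothetical.
interface NeutralVector {
    int numValues();
    boolean isNullAt(int rowId);
    long getLong(int rowId);
}

// Common-module side: plain arrays stand in for Arrow data and validity buffers.
final class LongArrayVector implements NeutralVector {
    private final long[] values;
    private final boolean[] nulls;

    LongArrayVector(long[] values, boolean[] nulls) {
        if (values.length != nulls.length) {
            throw new IllegalArgumentException("values and nulls differ in length");
        }
        this.values = values;
        this.nulls = nulls;
    }

    public int numValues() { return values.length; }
    public boolean isNullAt(int rowId) { return nulls[rowId]; }
    public long getLong(int rowId) { return values[rowId]; }
}

// Stand-in for the column API that one particular engine expects.
interface EngineColumn {
    int rowCount();
    Long valueAt(int rowId); // returns null for a NULL value
}

// Engine-module side: the only class that knows about both APIs.
final class EngineColumnAdapter implements EngineColumn {
    private final NeutralVector delegate;

    EngineColumnAdapter(NeutralVector delegate) {
        this.delegate = delegate;
    }

    public int rowCount() { return delegate.numValues(); }

    public Long valueAt(int rowId) {
        return delegate.isNullAt(rowId) ? null : Long.valueOf(delegate.getLong(rowId));
    }
}
```

   With this shape, the reader and vector code have no engine imports, and 
supporting another engine means adding one more adapter rather than touching 
the common module.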





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-25 Thread via GitHub


huaxingao commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2078208013

   also cc @andygrove @parthchandra @snmvaughan @kazuyukitanimura 





Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]

2024-04-25 Thread via GitHub


viirya commented on issue #329:
URL: 
https://github.com/apache/datafusion-comet/issues/329#issuecomment-2078204505

   cc @sunchao 

